33 Open-Source Crawler Tools for Scraping Data

If you want to work with big data, you need data to work with. Here are 33 open-source crawler tools to get you started.

A crawler, i.e. a web crawler, is a program that automatically fetches web page content. Crawlers are a key component of search engines, which is why much of search engine optimization is really optimization aimed at crawlers.

A web crawler is a program that automatically extracts web pages, downloading them from the World Wide Web on behalf of a search engine; it is a key component of search engines. A traditional crawler starts from the URLs of one or more seed pages, extracts the URLs found on those pages, and, as it crawls, keeps pulling new URLs off the current page and pushing them onto a queue until some stop condition is met. A focused crawler has a more involved workflow: using a page-analysis algorithm, it filters out links unrelated to its topic, keeps the useful ones, and places them in a URL queue awaiting download. It then selects the next URL to crawl from that queue according to a search strategy and repeats the process until a stop condition is reached. In addition, every page a crawler fetches is stored, analyzed, filtered, and indexed by the system so it can be queried and retrieved later; for a focused crawler, the results of this analysis may also feed back into and guide subsequent crawling.
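The traditional crawl loop described above — seed URLs, a frontier queue, link extraction, a stop condition — can be sketched in a few lines of Python. This is a concept illustration only: the fetch function is stubbed with an in-memory "web", and the link pattern is a simplification, not any particular tool's behavior.

```python
import re
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: pop a URL, fetch it, queue newly found links."""
    frontier = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)         # deduplicate URLs
    store = {}                # fetched pages, kept for later analysis/indexing
    while frontier and len(store) < max_pages:   # stop condition
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        # extract new URLs from the current page and push them onto the queue
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

# toy "web" standing in for real HTTP fetches
pages = {
    "a": '<a href="b">b</a><a href="c">c</a>',
    "b": '<a href="c">c</a>',
    "c": "leaf",
}
result = crawl(["a"], pages.get)
```

A focused crawler would differ only in the inner loop, scoring each extracted link for topical relevance before deciding whether to enqueue it.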

There are well over a hundred mature crawlers in the wild. This article surveys the better-known and more common open-source ones, grouped by implementation language. Search engines contain crawlers too, but this roundup covers only crawler software, not large, complex search engines, since most readers just want to scrape data rather than operate a search engine.


Java crawlers

1. Arachnid

Arachnid is a Java-based web spider framework. It includes a simple HTML parser that can analyze input streams containing HTML content; by subclassing Arachnid you can build a simple web spider and hook a few lines of your own code in after each page on a site is parsed. The download package ships with two example spider applications that demonstrate how to use the framework.

Highlights: a tiny crawler framework with a small built-in HTML parser

License: GPL

2. crawlzilla

crawlzilla is free software that helps you easily stand up your own search engine, so you no longer need to rely on a commercial search provider or worry about indexing your company's intranet content.

It is built around the Nutch project as its core, integrates additional related packages, and adds a purpose-developed install and management UI to make the system easier to pick up.

Beyond plain HTML, crawlzilla can also parse files found on pages, such as doc, pdf, ppt, ooo, and rss, turning your search engine into a complete content index of a site rather than just a web-page search engine.

It also has Chinese word segmentation, which makes searches more accurate.

crawlzilla's main goal is to give users a convenient, easy-to-install search platform.

License: Apache License 2
Languages: Java, JavaScript, shell
OS: Linux

Highlights: easy to install, with Chinese word segmentation

3. Ex-Crawler

Ex-Crawler is a web crawler written in Java. The project is split into two parts: a daemon and a flexible, configurable web crawler. Page data is stored in a database.

  • License: GPLv3
  • Language: Java
  • OS: cross-platform

Highlights: runs as a daemon and stores page data in a database

4. Heritrix

Heritrix is an open-source web crawler written in Java that lets you fetch the resources you want from the web. Its standout quality is excellent extensibility, which makes it easy to plug in your own crawling logic.

Heritrix has a modular design; the modules are coordinated by a controller class (CrawlController), which is the core of the whole system.

Source: https://github.com/internetarchive/heritrix3

  • License: Apache
  • Language: Java
  • OS: cross-platform

Highlights: strictly honors the exclusion directives in robots.txt files and META robots tags

5. heyDr

heyDr is a lightweight, open-source, multi-threaded vertical-search crawler framework written in Java, released under the GNU GPL v3.

You can use heyDr to build your own vertical crawler and prepare the data needed before standing up a vertical search engine.

  • License: GPLv3
  • Language: Java
  • OS: cross-platform

Highlights: lightweight, open-source, multi-threaded vertical crawler framework

6. ItSucks

ItSucks is an open-source Java web spider (web robot, crawler). It lets you define download rules through download templates and regular expressions, and it provides a Swing GUI.

Highlights: Swing GUI

7. jcrawl

jcrawl is a small, high-performance web crawler that can fetch various types of files from web pages based on user-defined patterns, such as email or qq.

  • License: Apache
  • Language: Java
  • OS: cross-platform

Highlights: lightweight and fast; can fetch various file types from web pages

8. JSpider

JSpider is a WebSpider implemented in Java. It is invoked as follows:

jspider [URL] [ConfigName]

The URL must include the protocol name, e.g. http://, or JSpider reports an error. If ConfigName is omitted, the default configuration is used.

JSpider's behavior is driven entirely by configuration files: which plugins to use, how to store results, and so on are all set under the conf\[ConfigName]\ directory. JSpider ships with very few configurations, and they are of limited use on their own, but JSpider is very easy to extend and can be used to build powerful page-scraping and data-analysis tools. Doing so requires a solid understanding of how JSpider works, after which you can develop plugins and write configuration files to match your needs.

  • License: LGPL
  • Language: Java
  • OS: cross-platform

Highlights: powerful and easy to extend

9. Leopdo

A web search engine and crawler written in Java, including full-text and categorized vertical search as well as a word-segmentation system.

  • License: Apache
  • Language: Java
  • OS: cross-platform

Highlights: full-text and categorized vertical search, plus a word-segmentation system

 

10. MetaSeeker

A complete solution for web content scraping, formatting, data integration, storage management, and search.

There are several ways to implement a web crawler. Classified by where it is deployed, a crawler can run:

Server side:

Usually a multi-threaded program that downloads many target HTML pages at once. It can be written in PHP, Java, Python (currently very popular), and so on, and can be made very fast; general-purpose search engine crawlers are usually built this way. The downsides: if the target site dislikes crawlers it may well ban your IP, a server IP is not easy to change, and the bandwidth consumed is expensive. Have a look at Beautiful Soup.

Client side:

Usually implements a topic-specific, or focused, crawler. Building a general search engine this way is hard to pull off, but a vertical search, price-comparison, or recommendation service is much easier: these crawlers do not fetch every page, only the pages you care about, and only the content on those pages that you care about, e.g. extracting yellow-pages listings, product prices, or competitors' ad placements (search for Spyfu, it is interesting). Crawlers of this kind can be deployed in large numbers and can be quite aggressive, making them hard for the target site to block.

MetaSeeker's crawler belongs to the latter category.

The MetaSeeker toolkit leverages the Mozilla platform: anything Firefox can render, it can extract.

Highlights: a toolkit for page scraping, information extraction, and data extraction; simple to operate

 

11. Playfish

Playfish is a web-scraping tool built in Java on top of several open-source Java components; it is highly customizable and extensible through XML configuration files.

The open-source jars it uses include httpclient (content fetching), dom4j (configuration-file parsing), and jericho (HTML parsing); they are bundled in the war's lib directory.

The project is still quite immature, but the basic functionality is complete. Users are expected to be familiar with XML and regular expressions. The tool can currently scrape all kinds of forums and CMS systems; articles from Discuz!, phpBB, forums, and blogs can all be scraped easily. Scraping rules are defined entirely in XML, which suits Java developers.

Usage:

  1. Download the .war package and import it into Eclipse.
  2. Create a sample database using the wcc.sql file under WebContent/sql.
  3. Edit dbConfig.txt under src/wcc.core, setting the username and password to your own MySQL credentials.
  4. Run SystemCore; output appears on the console. With no arguments it runs the default example.xml configuration; with an argument, the argument is taken as the configuration file name.

Three examples are bundled: baidu.xml scrapes Baidu Zhidao, example.xml scrapes the author's JavaEye blog, and bbs.xml scrapes the content of a Discuz! forum.

  • License: MIT
  • Language: Java
  • OS: cross-platform

Highlights: highly customizable and extensible through XML configuration files
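The config-driven idea behind tools like Playfish — an XML file names the target and the extraction rules, and generic code applies them — can be illustrated generically. The element names and sample config below are invented for illustration and are not Playfish's actual schema:

```python
import re
import xml.etree.ElementTree as ET

# hypothetical scraping config: a target URL plus named regex rules
CONFIG = """
<task url="http://example.com/forum">
  <field name="title" pattern="&lt;h1&gt;(.*?)&lt;/h1&gt;"/>
  <field name="author" pattern="by (\\w+)"/>
</task>
"""

def load_rules(xml_text):
    """Parse the config into a target URL and compiled per-field regexes."""
    root = ET.fromstring(xml_text)
    return root.get("url"), {
        f.get("name"): re.compile(f.get("pattern"))
        for f in root.findall("field")
    }

def extract(html, rules):
    """Apply each configured regex to the page; keep the first match."""
    return {name: (m.group(1) if (m := rx.search(html)) else None)
            for name, rx in rules.items()}

url, rules = load_rules(CONFIG)
record = extract("<h1>Hello</h1> posted by alice", rules)
```

Changing what gets scraped then means editing the XML, not the code, which is exactly the customizability such tools advertise.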

12. Spiderman

Spiderman is a web spider built on a microkernel + plugin architecture. Its goal is to let you capture complex target pages and parse them into the business data you need by simple means.

How do you use it?

First, pick your target site and target pages (i.e. a class of pages you want data from, such as NetEase news article pages).

Then open a target page, analyze its HTML structure, and work out the XPath of the data you want (how to obtain the XPath is covered below).

Finally, fill in the parameters in an XML configuration file and run Spiderman.

  • License: Apache
  • Language: Java
  • OS: cross-platform

Highlights: flexible and extensible; microkernel + plugin architecture; data scraping is set up through simple configuration, with no code to write
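The workflow above — inspect the page's HTML, find an XPath for the data, then configure it — can be tried out with Python's standard library, which supports a limited XPath subset. The sample page and paths here are invented for illustration:

```python
import xml.etree.ElementTree as ET

# a small well-formed snippet standing in for a real target page
page = """
<html><body>
  <div class="news">
    <h2>Headline</h2>
    <span class="date">2014-01-01</span>
  </div>
</body></html>
"""

root = ET.fromstring(page)
# ElementTree supports a subset of XPath: descendant searches (.//)
# and [@attr='value'] predicates cover many simple scraping rules
title = root.find(".//div[@class='news']/h2").text
date = root.find(".//span[@class='date']").text
```

For real pages, which are rarely well-formed XML, an HTML-tolerant parser would be needed in front of this step; the XPath expressions themselves carry over unchanged.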

13. webmagic

webmagic is a crawler framework that needs no configuration and is easy to build on; it offers a simple, flexible API, so implementing a crawler takes only a little code.

webmagic has a fully modular design covering the entire crawler lifecycle (link extraction, page download, content extraction, persistence). It supports multi-threaded and distributed crawling, automatic retries, and custom UA/cookies.

webmagic includes powerful page-extraction features: developers can conveniently extract links and content with CSS selectors, XPath, and regular expressions, and chain multiple selectors together.

Documentation: http://webmagic.io/docs/

Source: http://git.oschina.net/flashsword20/webmagic

  • License: Apache
  • Language: Java
  • OS: cross-platform

Highlights: covers the entire crawler lifecycle; extracts links and content with XPath and regular expressions

Note: this is a Chinese open-source project, contributed by Huang Yihua (黄亿华).

14. Web-Harvest

Web-Harvest is an open-source Java tool for web data extraction. It collects specified web pages and extracts useful data from them, mainly by applying technologies such as XSLT, XQuery, and regular expressions to text/xml.

It works like this: based on a predefined configuration file, it fetches the full page content with httpclient, then applies XPath, XQuery, regular expressions, and similar techniques to filter the text/xml content and select exactly the data wanted. The vertical search engines popular a few years ago (such as Kooxoo) were built on a similar principle. The key to using Web-Harvest is understanding and writing the configuration files; the rest is Java code for processing the data. You can also inject Java variables into the configuration file before a crawl begins, making the configuration dynamic.

  • License: BSD
  • Language: Java

Highlights: applies XSLT, XQuery, regular expressions, and similar technologies to text or XML, and has a visual interface

15. WebSPHINX

WebSPHINX is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes web pages. WebSPHINX has two parts: the crawler workbench and the WebSPHINX class library.

License: Apache

Language: Java

Highlights: two components, a crawler workbench and the WebSPHINX class library

16. YaCy

YaCy is a P2P-based distributed web search engine and also an HTTP caching proxy server. The project is a new approach to building a P2P web-index network: you can search your own index or the global one, crawl your own pages, or launch distributed crawls.

  • License: GPL
  • Languages: Java, Perl
  • OS: cross-platform

Highlights: P2P-based distributed web search engine

Python crawlers

17. QuickRecon

QuickRecon is a simple information-gathering tool that can find subdomain names, perform zone transfers, harvest email addresses, and discover human relationships using microformats, among other things. It is written in Python and supports Linux and Windows.

  • License: GPLv3
  • Language: Python
  • OS: Windows, Linux

Highlights: finds subdomain names, harvests email addresses, and discovers human relationships

18. PyRailgun

A very easy-to-use scraping tool: a simple, practical, and efficient Python page-scraping module that supports scraping pages rendered with JavaScript.

  • License: MIT
  • Language: Python
  • OS: cross-platform (Windows, Linux, OS X)

Highlights: concise, lightweight, efficient page-scraping framework

Note: this software is also developed by a Chinese developer.

GitHub: https://github.com/princehaku/pyrailgun#readme

19. Scrapy

Scrapy is an asynchronous processing framework built on Twisted, a crawler framework implemented in pure Python. You only need to write a few custom modules to get a working crawler that scrapes page content and images; very convenient.

  • License: BSD
  • Language: Python
  • OS: cross-platform

Source: https://github.com/scrapy/scrapy

Highlights: Twisted-based asynchronous processing framework with thorough documentation

C++ crawlers

20. hispider

HiSpider is a fast and high performance spider with high speed

Strictly speaking, it is only the skeleton of a spider system without fine-grained requirements filled in: at present it can extract URLs, deduplicate URLs, resolve DNS asynchronously, queue tasks, download in a distributed fashion across N machines, and do site-targeted downloads (configure the whitelist in hispiderd.ini).

Features and usage:

  • Developed for unix/linux systems
  • Asynchronous DNS resolution
  • URL deduplication
  • HTTP compressed transfer encodings (gzip/deflate)
  • Character-set detection with automatic conversion to UTF-8
  • Compressed document storage
  • Distributed download across multiple download nodes
  • Site-targeted downloads (configure the whitelist in hispiderd.ini)
  • Download statistics and task control (pause/resume) via http://127.0.0.1:3721/
  • Depends on the communication libraries libevbase and libsbase (install these two libraries first)

Workflow:

  • Fetch a URL from the central node (with its task id, IP, and port; DNS may need to be resolved locally)
  • Connect to the server and send the request
  • Wait for the response headers and decide whether the data is wanted (currently mainly text content)
  • Wait for the body to complete (with a Content-Length header, wait for exactly that many bytes; otherwise wait up to a large limit with a timeout)
  • When the data is complete or the timeout fires, zlib-compress it and return it to the central server; the payload may include locally resolved DNS info plus the compressed length and compressed data; on error, return just the task id and error details
  • The central server receives the data with its task id; if there is no data, it marks the task as failed; otherwise it extracts the links from the data and stores the document to a file
  • It then hands back a new task
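The download-node side of this workflow — take a task, fetch the body, zlib-compress it, and report back with the task id — can be sketched in Python. The task format and the stubbed fetcher here are stand-ins for illustration, not hispider's actual protocol:

```python
import zlib

def run_worker(get_task, fetch):
    """Process one task: download, compress, report back (a sketch)."""
    task = get_task()                      # e.g. {"id": 7, "url": ...}
    try:
        body = fetch(task["url"])          # raw page bytes
        payload = zlib.compress(body)      # compress before returning
        return {"id": task["id"], "size": len(payload), "data": payload}
    except Exception as exc:
        # on error, return only the task id and error details
        return {"id": task["id"], "error": str(exc)}

# stubbed central node and fetcher for demonstration
result = run_worker(
    lambda: {"id": 7, "url": "http://example.com/"},
    lambda url: b"<html>hello</html>" * 10,
)
```

Compressing before the round-trip is the point: the central server pays decompression cost once, while the network between nodes carries far fewer bytes.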

License: BSD

Language: C/C++

OS: Linux

Highlights: distributed download across multiple machines; site-targeted downloads

21. larbin

larbin is an open-source web crawler/spider developed independently by the young French programmer Sébastien Ailleret. Its goal is to follow page URLs for extended crawling, ultimately providing a broad data source for search engines. larbin is only a crawler: it fetches pages and leaves parsing entirely to the user, and it does not handle storage into a database or index building either. A simple larbin crawler can fetch five million pages a day.

With larbin you can easily obtain or enumerate all the links on a single site, and even mirror a site; you can also use it to build URL lists, for example retrieving the URLs from a set of pages and then fetching the linked XML or mp3 files; or you can customize larbin to serve as an information source for a search engine.

  • License: GPL
  • Language: C/C++
  • OS: Linux

Highlights: high-performance crawler that only fetches, leaving parsing to you

22. Methabot

Methabot is a speed-optimized, highly configurable crawler for the web, FTP, and local filesystems.

  • License: unknown
  • Language: C/C++
  • OS: Windows, Linux

Highlights: speed-optimized; crawls the web, FTP, and local filesystems

Source: http://www.oschina.net/code/tag/methabot

C# crawlers

23. NWebCrawler

NWebCrawler is an open-source web crawler written in C#.

Features:

  • Configurable: thread count, wait time, connection timeout, allowed MIME types and priorities, download folder.
  • Statistics: URL count, total files downloaded, total bytes downloaded, CPU usage, and available memory.
  • Preferential crawler: users can assign priorities to MIME types.
  • Robust: 10+ URL normalization rules, crawler trap avoiding rules.

License: GPLv2

Language: C#

OS: Windows

Project page: http://www.open-open.com/lib/view/home/1350117470448

Highlights: statistics and a visualized crawl process

24. Sinawler

The first crawler in China aimed at Weibo data! Originally named the "Sina Weibo crawler".

After logging in, you can pick a user as the starting point and follow that user's followees and followers through the social graph, collecting user profiles, Weibo posts, and comment data.

The data it gathers can support research and Weibo-related development, but please do not use it for commercial purposes. The application is built on the .NET 2.0 framework, requires SQL Server as the backing database, and ships with database scripts for SQL Server.

Note that, due to Sina Weibo API limits, the data collected may be incomplete (e.g. caps on the number of followers or posts retrievable).

The program's copyright remains with the author. You may freely copy, distribute, display, and perform the work, and make derivative works, but you may not use it for commercial purposes.

Version 5.x has been released! This version runs six background worker threads: robots that crawl user profiles, user relationships, user tags, post content, and comments, plus a robot that throttles the request rate. Higher performance, squeezing the most out of the crawler! Test results so far show it is adequate for personal use.

Features:

  1. Six background worker threads to maximize the crawler's performance potential.
  2. Parameters are configurable in the UI, which is flexible and convenient.
  3. Drops the app.config file in favor of its own encrypted configuration storage, protecting database credentials.
  4. Automatically adjusts the request rate to avoid hitting limits without being needlessly slow.
  5. Full crawler control: pause, resume, or stop at any time.
  6. Good user experience.

License: GPLv3

Languages: C#, .NET

OS: Windows

25. spidernet

spidernet is a multi-threaded web crawler modeled on a recursive tree. It fetches text/html resources, supports a configurable crawl depth and a maximum download size, handles gzip decoding and resources encoded in gbk (gb2312) and utf8, and stores results in a SQLite data file.

TODO: markers in the source describe unfinished features; contributions are welcome.

  • License: MIT
  • Language: C#
  • OS: Windows

Source: https://github.com/nsnail/spidernet

Highlights: multi-threaded crawler modeled on a recursive tree; supports resources encoded in GBK (gb2312) and utf8; stores data in SQLite

26. Web Crawler

Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and offers two traversal modes: maximum iterations and maximum depth. Filters can restrict which crawled links are followed; three are provided by default (ServerFilter, BeginningPathFilter, and RegularExpressionFilter), and they can be combined with AND, OR, and NOT. Listeners can be attached during parsing and before and after page loads. (Description from Open-Open.)

  • Language: Java
  • OS: cross-platform
  • License: LGPL

Highlights: multi-threaded; can fetch document sources such as PDF/DOC/EXCEL

27. Web Miner (网络矿工)

Website data collection software: the Web Miner collector (formerly Soukey).

Soukey is a website data collection tool built on the .NET platform, and the only open-source offering in its category. Being open source does not limit the functionality Soukey provides; it is even richer in features than some commercial products.

  • License: BSD
  • Languages: C#, .NET
  • OS: Windows

Highlights: feature-rich, in no way inferior to commercial software

PHP crawlers

28. OpenWebSpider

OpenWebSpider is an open-source multi-threaded web spider (robot, crawler) and search engine with many interesting features.

  • License: unknown
  • Language: PHP
  • OS: cross-platform

Highlights: open-source multi-threaded crawler with many interesting features

29. PhpDig

PhpDig is a web crawler and search engine written in PHP. It builds a vocabulary by indexing both dynamic and static pages; when a search is run, it displays result pages containing the keywords according to a ranking scheme. PhpDig includes a template system and can index PDF, Word, Excel, and PowerPoint documents. It is best suited to more specialized, deeper, personalized search engines, and is an excellent choice for building a vertical search engine for a specific domain.

Demo: http://www.phpdig.net/navigation.php?action=demo

  • License: GPL
  • Language: PHP
  • OS: cross-platform

Highlights: collects page content and can submit forms

30. ThinkUp

ThinkUp is a social-media insights engine that collects data from social networks such as Twitter and Facebook. It is an interactive analysis tool that collects the data from your personal social accounts, archives and processes it, and graphs the data for easier inspection.

  • License: GPL
  • Language: PHP
  • OS: cross-platform

Source: https://github.com/ThinkUpLLC/ThinkUp

Highlights: collects data from social networks such as Twitter and Facebook; supports interactive analysis and visualizes the results

31. Weigou (微购)

Weigou is an open-source social shopping system built on the ThinkPHP framework, and also an open-source Taobao-affiliate site program for webmasters. It integrates more than 300 product-data collection APIs from Taobao, Tmall, Taobao affiliates, and others, providing turnkey affiliate-site building for webmasters: if you can write HTML, you can build a template. It is free to download and a popular choice among affiliate webmasters.

Demo: http://tlx.wego360.com

License: GPL

Language: PHP

OS: cross-platform

Erlang crawlers

32. Ebot

Ebot is a scalable distributed web crawler written in Erlang; crawled URLs are stored in a database and can be queried via RESTful HTTP requests.

  • License: GPLv3
  • Language: Erlang
  • OS: cross-platform

Source: https://github.com/matteoredaelli/ebot

Project page: http://www.redaelli.org/matteo/blog/projects/ebot

Highlights: scalable distributed web crawler

Ruby crawlers

33. Spidr

Spidr is a Ruby web-crawling library that can crawl an entire site, multiple sites, or a single link completely to the local machine.

  • Language: Ruby
  • License: MIT

Highlights: can fully crawl one or more sites, or a single link, to the local machine

 

Compiled by 36dsj (36大数据).

Original: http://www.36dsj.com/archives/34383

 

Your site is slow? Read this for a performance increase!

Hi,
if you think your server is performing poorly and you want to improve Joomla's output speed, this article gives you some things to check and fix that can sometimes drastically enhance your server's performance. Let's make a list:

General Optimizations:

  • Hosting package
  • HTML code
  • Images
  • PHP accelerators

Joomla specific Optimizations

  • Used extensions
  • Database
  • Debugging your site
  • SEF/SEO


General Optimizations:
Hosting package
A lot of people out there buy a shared hosting package and expect to run a site with hundreds of hits per second on it. We have to make it clear here that Joomla is not static HTML. You may be able to serve hundreds, if not thousands, of users per second with static HTML pages, but Joomla needs quite some CPU power, and if you plan a site with more than about 5 hits per second, you should look directly for a dedicated server; for even more hits, you should think about a load-balancing system, several servers, and a MySQL cluster. There is an interesting thread on this one here. For a normal setup, look around in this forum; there are several threads with users' experiences, and the experiences with different providers are especially interesting.

HTML code
When you create your page, you should pay attention to two things. First, use valid code: the more bugs your code contains, the longer a browser needs to render it. Second, use fewer objects. A lot of people use too many images, too much Flash, and too much JavaScript on their site. Connection speeds have improved a lot and broadband connections exceeding 1 MBit are not as uncommon as just a few years ago, but a large share of users are still on dial-up with a modem or ISDN connection. If you don't want to exclude a big part of the web community, you should keep your site small. There are several services that can analyze your site; I will only mention the Web Page Analyzer from Websiteoptimization. For validating your HTML code and JavaScript, I can only recommend the Web Developer Extension for Firefox. You can get it here.
These things can improve your site's speed drastically and you should really look into this!

Images
As I wrote in the previous paragraph, small is beautiful. ;) This also applies to images. Take a look at your images and ask whether you really need them at that resolution. Wouldn't a picture with half the dpi look just as good? People often don't notice they are using images the size of lower Manhattan, because they set the size in the HTML code to something way smaller and their browser has the images in its cache; but everyone else first has to load a huge image that then gets reduced to the size of a stamp. That takes a long time, and they don't want to wait that long. So see to a reasonable size for your images, both in pixels and in bytes.
Another thing that often slows down performance is missing images. If you have a reference to a non-existent file, the server needs noticeable time to detect that and answer with a 404. In some cases your server is configured to redirect to the front page when it can't find the file, and then the browser receives a complete new page each time it requests a missing file. Imagine just one page with an image that is used in a dozen places in the layout and is not present. This can slow down delivery by up to half a minute!

PHP accelerators
If you can't afford a new server and you are at your performance limit, think about getting a PHP accelerator. There are again numerous options available, like Zend or APC. Just look in this forum for further information. They can work little wonders on your performance.

Joomla specific Optimizations
Used extensions
You have installed Joomla, found all those neat little extensions available on extensions.joomla.org, installed numerous of them, and think your site looks really cool; but lately you have to wait several seconds until your site pops up.
Again, this is another case of small is beautiful. Joomla needs extra time for each module or plugin it has to load. I guess you can see where this is heading: use fewer extensions and try to make the best of the few you really need.
In this regard, you should take a look at your extensions and see whether any of them uses an external data source. If your server first has to load the data from another server, process it, and then send it to you, this can take several seconds. Try to find an extension that does the same but stores the data on your server. Then if the other server has a failure or is getting DoSed, your server will not be affected, since it is only reading from its local data source.

Database
Great potential lies within the database. In general we recommend you always use the latest version of Joomla; for example, there were big improvements from 1.0.7 to 1.0.8, and even bigger improvements are coming in 1.5. (1.0.8 uses about half as many queries as 1.0.7, and 1.5 does the same trick with about a quarter of 1.0.7's query count.)
If this is not enough for you, you can also enable query caching in MySQL. This will cache a lot of queries in your server's RAM and make it fast as a bullet, but be careful: you really need A LOT of RAM. 2 GByte is not uncommon and is more like the lower starting point for this. This is nothing for shared hosting!

Debugging your site
I also recommend switching on error reporting in PHP and setting it to its highest level. You will probably have to correct some errors in the code to clear them all, but they, too, waste precious CPU time and therefore performance.
If you switch on debugging in the Joomla backend, you get a list of all queries executed for each page in the frontend. If you see a number that's too high, you should hunt those queries down and see what can be done about reducing it.

SEF/SEO
Last but not least we have SEF/SEO. This takes a lot of CPU power, and if you have problems with your server, switching it off should be your first emergency measure. Due to some errors in earlier Joomla versions, SEF sometimes produced dozens or even hundreds of useless queries.

With these things you should have your server going a lot faster than previously. If you still have problems, please post into this board and we will try to give you specific help on this one.

sql connection limit reached: 150 — you can try disabling SEF

i am running a small community site: http://www.schlachtumeuropa.de

when many people log in at the same time, the mysql connection limit, which is set to 150, is reached every time and the site goes "offline". are there any hints on how i could set up the joomla 1.0.3 page so that this does not happen anymore?

i have sitestats disabled, sitecache enabled

mysql> status
--------------
mysql  Ver 14.7 Distrib 4.1.11, for pc-linux-gnu (i386)
Server version:        4.1.11-Debian_4-log
Last edited by And_One on Fri Oct 14, 2005 6:52 pm, edited 1 time in total.
DeanMarshall
Joomla! Hero
Posts: 2352
Joined: Fri Aug 19, 2005 2:26 am
Location: Lancaster, Lancashire, United Kingdom

Re: sql connection limit reached: 150

Post by DeanMarshall » Fri Oct 14, 2005 7:05 pm

I am going to bet that you have SEF enabled.

If so you may want to check out this thread:
http://forum.joomla.org/index.php/topic,11139.0.html

Dean Marshall
Dean Marshall Consultancy - six Joomla experts - http://www.deanmarshall.co.uk/

Joomla Experts - Joomla Support http://www.deanmarshall.co.uk/joomla-se ... pport.html
And_One
Joomla! Apprentice
Posts: 13
Joined: Tue Oct 04, 2005 7:56 pm

Re: sql connection limit reached: 150

Post by And_One » Fri Oct 14, 2005 7:45 pm

SEF ?

i have joomla 1.03 installed and the htaccess file from the post ^ is included in this release. that did not solve my problem.

Re: sql connection limit reached: 150

Post by DeanMarshall » Fri Oct 14, 2005 8:59 pm

Perhaps it is a search engine crawler then.
Is your robots.txt file up to scratch?
Do you have access to logfiles or a hosting control panel that will show you 'latest visitors'?
If so, is there any discernible pattern in the usage?

If you do use the Search Engine Friendly (SEF) URLs option and the .htaccess file, then I should also add that some servers don't use the specific environment variable used in the specimen .htaccess file. I am with 1and1.co.uk linux hosting and have to amend the first line or it has no effect.
RewriteCond %{REQUEST_URI} !\.(jpg|jpeg|gif|png|css|js|pl|txt|ico)$

Try the Web page optimiser in the other thread and look for page elements very close in size to the page HTML.

Dean.

Re: sql connection limit reached: 150

Post by And_One » Fri Oct 14, 2005 11:40 pm

i have root access

SEF is off, htaccess is correct

Re: sql connection limit reached: 150

Post by DeanMarshall » Sat Oct 15, 2005 12:08 am

So what about..
DeanMarshall wrote: Perhaps it is a search engine crawler then.
Is your robots.txt file up to scratch?
Do you have access to logfiles or a hosting control panel that will show you 'latest visitors'?
If so, is there any discernible pattern in the usage?
Or perhaps you have got that many users? If you do have 150 concurrent users then you need to edit MySQL's ini file to increase the number of allowable connections. If your hosting package doesn't allow this then you may need to upgrade your hosting package or move to a larger host.

Dean

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 15, 2005 12:30 am

well, i don't think there are really 150 concurrent users ... but what i saw is that each connection (process) stays open for ~40000 ... i think that value is a bit too long. is this the session lifetime?

Re: sql connection limit reached: 150

Post by DeanMarshall » Sat Oct 15, 2005 1:04 am

Okay, Me again,

Could this be anything to do with your frames based redirect?

Does the livesite variable in your site reflect the true url rather than the schlachtumeuropa.de address?
This may be an issue. I don't know if I just happened to try while you were making changes, but I got a blank screen every time.

Your host seems to be having DNS problems. I get long delays and a domain not found when trying the 'real' address - the one in the frameset src parameter.

Dean Marshall

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 15, 2005 1:10 am

i know that there are dns problems for the moment ... i will talk to my hoster if they persist.

i don't think that the "frames" are causing much trouble with a high number of connections.

is there any way of building a connection pool with mysql and joomla? or shared connections? i don't think it is really necessary to have one connection per user, is it?

Re: sql connection limit reached: 150

Post by DeanMarshall » Sat Oct 15, 2005 1:15 am

Beyond my level of 'inexpertise' I am afraid. You need a server guy.

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 15, 2005 1:17 am

a question: most users use the log-in-with-cookie feature, so every time they come to the page a connection (process) is generated?

Re: sql connection limit reached: 150

Post by And_One » Thu Oct 27, 2005 10:06 am

i really need this solved, because under heavy load / many requests my page goes "offline" after a short time ... i would be really happy if anyone has some more suggestions for me.

would postgres or sqlrelay be a solution? anyone familiar with this?

Re: sql connection limit reached: 150

Post by DeanMarshall » Thu Oct 27, 2005 10:29 am

Your site is *very* image heavy.

Could it be that impatience on the part of users leads them to 'reload' a partially loaded page, forcing server load to scale up?
You have nearly 200KB of images:

I used this service to analyse your site.
http://www.websiteoptimization.com/services/analyze/

Estimated Page load times:
14.4K  208.82 seconds
28.8K 104.61 seconds
33.6K 89.72 seconds
56K 53.99 seconds

1  50307  CSS IMG  http://v2.sue.vs7509.vserver4free.de/te ... header.jpg
1 32049 HTML http://v2.sue.vs7509.vserver4free.de/
1 Not found CSS IMG http://v2.sue.vs7509.vserver4free.de/te ... in/top.gif
1 Not found CSS IMG http://v2.sue.vs7509.vserver4free.de/te ... /right.gif
1 Not found CSS IMG http://v2.sue.vs7509.vserver4free.de/te ... n/left.gif
1 Not found CSS IMG http://v2.sue.vs7509.vserver4free.de/te ... bottom.gif
1 77744 IMG http://v2.sue.vs7509.vserver4free.de/im ... 0kopie.jpg
2 37153 SCRIPT http://v2.sue.vs7509.vserver4free.de/in ... ib_mini.js
1 29471 IMG http://v2.sue.vs7509.vserver4free.de/im ... rs/sue.jpg
1 16769 CSS http://v2.sue.vs7509.vserver4free.de/te ... te_css.css
2 13966 CSS http://v2.sue.vs7509.vserver4free.de/co ... ations.css

I still think it could be badly behaving robots or possible a badly coded module.

A CMS like Joomla makes multiple queries per page load. If each query was a concurrent connection then it wouldn't take too much to hit 150.
>> mysql connection limit, which is set to 150

Dean.

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 29, 2005 2:28 pm

thx for your tips. i had a closer look into the css and the used modules. the page is now a lot faster with a few users but still gets stuck with many users. my last idea is to use the unixODBC driver with connection pooling; any experience with how to configure joomla to use this odbc thing?

http://www.unixodbc.org/
Last edited by And_One on Sat Oct 29, 2005 3:52 pm, edited 1 time in total.

Re: sql connection limit reached: 150

Post by DeanMarshall » Sat Oct 29, 2005 4:28 pm

Hi,

1  77744  IMG  http://v2.sue.vs7509.vserver4free.de/im ... 0kopie.jpg
1 50307 CSS IMG http://v2.sue.vs7509.vserver4free.de/te ... header.jpg
1 31666 CSS http://v2.sue.vs7509.vserver4free.de/co ... ations.css
1 29471 IMG http://v2.sue.vs7509.vserver4free.de/im ... rs/sue.jpg
1 28904 HTML http://v2.sue.vs7509.vserver4free.de/
1 Not found CSS IMG http://localhost/ak/images/camping.jpg
2 Not found CSS IMG http://localhost/ak/images/bg-left.gif
2 37153 SCRIPT http://v2.sue.vs7509.vserver4free.de/in ... ib_mini.js

Page load times:
14.4K  220.87 seconds
28.8K 110.64 seconds
33.6K 94.89 seconds
56K 57.09 seconds

I don't really see much difference in your total page load. It is not the .css file itself so much as the images it references; they are *very* large.
In particular, the first two listed above could do with optimising. And the two that reference localhost??? What is that about?

On the issue of caching database queries - I have zero knowledge or experience of such things, but I don't think this is where your problems lie. You appear to be using a free hosting provider - perhaps this is an issue?  I don't know how many users your site has, or whether search engine bots or similar could be an issue.  Also have you tried turning on the caching option within Joomla?

I mean no disrespect but seriously I wouldn't like to be visiting your site on dial-up - 1 minute for the front page to load!

Here is my advice for what it is worth:
1. Optimise your images - get their file sizes down.

2. To optimise your .css files somewhat you could try this:

place the PHP snippet above into the top of your CSS document. Then, rename your CSS file with a 'php' extension, and then refer to that file in the section of your template when linking your css file, for example:



3. You might also turn on gzip compression in your site's global configuration screen if the option is available to you.
4. Turn on the caching options in Joomla.

Good luck.

Dean.

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 29, 2005 5:57 pm

i can do all the changes, but this will not solve my main problem: sql connection limit reached 150!!

it is my own linux root server (vserver) and i have full control over it. when a lot of users connect to the site, it reaches the connection limit in a short period of time. during the day the page loads "normally", but when we have an event the page hits the mysql server's connection limit. the only thing i can see in phpMyAdmin is that every logged-in user has its own process ... so the limit of 150 is reached very quickly once more users log into the page. my main question: how can i build a connection pool with ~20-50 connections?

would odbc or sqlrelay be an option?
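[Editor's note: the connection pooling being asked about here is a general technique, which can be sketched generically. This Python sketch uses a blocking queue and a dummy connection factory; it is a concept illustration only, not Joomla- or MySQL-specific advice.]

```python
import queue

class ConnectionPool:
    """Hand out a fixed set of reusable connections instead of one per user."""
    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):          # open all connections up front
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()        # blocks when all connections are in use

    def release(self, conn):
        self._pool.put(conn)           # return the connection for reuse

# dummy "connections" standing in for real database handles
pool = ConnectionPool(factory=lambda: object(), size=3)
c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()   # reuses a pooled connection; no new one is opened
```

The server-side limit then only needs to cover the pool size, not the number of logged-in users, because excess requests wait for a free connection instead of opening new ones.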

Re: sql connection limit reached: 150

Post by DeanMarshall » Sat Oct 29, 2005 6:13 pm

In that case you don't need any of this caching stuff - just remove the limits from the database user.

Find out which user your site is running as - you entered this when you set up Joomla.
Load up phpMyAdmin and navigate to the user table

Server: localhost  Database: mysql  Table: user 

Click 'Browse' in the top bar of the right-hand frame.
Find your previously identified user in the list - it should have max connections 150 in the rightmost column.
Put a tick / check mark in the leftmost column and click the edit icon.
Find the 'max_connections' row and put a zero in the value column - this will remove any limit.
Make sure the save radio button is selected and click go.
You may need to restart your MySQL server.
For all I know you may need to restart Apache ??

That should - on a good day with a fair wind - solve your main issue.
Then you can look at the rest later.

Dean.

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 29, 2005 7:17 pm

thanks for that tip, but all my users have it set to 0.

my.cnf

set-variable=max_connections=200

^^ this limit is the problem

but when i set it to e.g. 2000, the server gets a bit overloaded, and beyond 300 connections the loading times increase rapidly
Last edited by And_One on Sat Oct 29, 2005 8:07 pm, edited 1 time in total.
kenmcd
Joomla! Champion
Posts: 5672
Joined: Thu Aug 18, 2005 2:09 am
Location: California

Re: sql connection limit reached: 150

Post by kenmcd » Sat Oct 29, 2005 11:44 pm

I have tuned a number of MySQL servers for high volume ad servers.
(phpAdsNew and Max Media Manager)

There are a few settings which may eliminate the problem rather quickly.
The MySQL defaults are really bad for any volume.

Best way to see what is happening is to use the MySQL Administrator to watch the server for a while, and then tweak the settings.

If you are interested - PM me a username and password.
If you are concerned about the security of web access, the best thing to do is to set up a new MySQL user and restrict access to only my IP address (which I will PM).

I can log in and watch what is happening, make the changes (they are only temporary until added to the config file), and then tell you what needs to be changed in the config file.

I have been wanting to tune a Joomla site.
LibreTraining
 
And_One
Joomla! Apprentice
Posts: 13
Joined: Tue Oct 04, 2005 7:56 pm

Re: sql connection limit reached: 150

Post by And_One » Sat Oct 29, 2005 11:50 pm

I made some changes a few hours ago after searching the web ....

I am not sure if this will help; I will know on Wednesday afternoon, when our next event is planned. You can contact me on ICQ (72986499) if you want to look at the MySQL host in case the problem still persists after Wednesday, ok? The next scheduled event after Wednesday is the following Saturday, so we have two days per week to watch the page while it is under heavy load. Thanks for your offer!


my.cnf now looks like this:


#
# The MySQL database server configuration file.
#
# You can copy this to one of:
# - "/etc/mysql/my.cnf" to set global options,
# - "/var/lib/mysql/my.cnf" to set server-specific options or
# - "~/.my.cnf" to set user-specific options.
#
# One can use all long options that the program supports.
# Run program with --help to get a list of available options and with
# --print-defaults to see which it would actually understand and use.
#
# For explanations see
# http://dev.mysql.com/doc/mysql/en/serve ... ables.html

# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# especially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

# Here are entries for some specific programs
# The following values assume you have at least 32M ram

# This was formerly known as [safe_mysqld]. Both versions are currently parsed.
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
open-files-limit=8192


[mysqld]
#
# * Basic Settings
#
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
language = /usr/share/mysql/english
skip-external-locking
set-variable=max_connections=500

wait_timeout = 500
connect_timeout = 10

# For compatibility to other Debian packages that still use
# libmysqlclient10 and libmysqlclient12.
old_passwords = 1
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address = 127.0.0.1
#
# * Fine Tuning
#
key_buffer = 16M
max_allowed_packet = 16M
thread_stack = 128K
#
# * Query Cache Configuration
#
query_cache_limit = 1048576
query_cache_size        = 16777216
query_cache_type        = 1
#
# * Logging and Replication
#
# Both locations get rotated by the cronjob.
# Be aware that this log type is a performance killer.
#log = /var/log/mysql.log
#log = /var/log/mysql/mysql.log
#
# Error logging goes to syslog. This is a Debian improvement :)
#
# Here you can see queries with especially long duration
#log-slow-queries = /var/log/mysql/mysql-slow.log
#
# The following can be used as easy to replay backup logs or for replication.
#server-id = 1
log-bin = /var/log/mysql/mysql-bin.log
# See /etc/mysql/debian-log-rotate.conf for the number of files kept.
max_binlog_size        = 104857600
#binlog-do-db = include_database_name
#binlog-ignore-db = include_database_name
#
# * BerkeleyDB
#
# According to a MySQL employee the use of BerkeleyDB is now discouraged
# and support for it will probably cease in the next versions.
skip-bdb
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
#
# * Security Features
#
# Read the manual, too, if you want chroot!
# chroot = /var/lib/mysql/
#
# If you want to enable SSL support (recommended) read the manual or my
# HOWTO in /usr/share/doc/mysql-server/SSL-MINI-HOWTO.txt.gz
# ssl-ca=/etc/mysql/cacert.pem
# ssl-cert=/etc/mysql/server-cert.pem
# ssl-key=/etc/mysql/server-key.pem



[mysqldump]
quick
quote-names
max_allowed_packet = 16M

[mysql]
#no-auto-rehash # faster start of mysql but no tab completion

[isamchk]
key_buffer = 16M
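After hand-editing a config like this it is easy to lose track of which values are actually in effect. As a quick sanity check, here is a minimal Python sketch (my own illustration, standard library only) that parses a my.cnf-style excerpt mirroring the [mysqld] section above; the extra step unwraps the old MySQL 4.x "set-variable=name=value" syntax used for max_connections.

```python
# Minimal sketch: parse a my.cnf-style excerpt and report the effective
# [mysqld] connection settings. The excerpt mirrors the config above.
import configparser

MY_CNF = """
[mysqld]
skip-external-locking
set-variable=max_connections=500
wait_timeout = 500
connect_timeout = 10
query_cache_size = 16777216
"""

# allow_no_value is needed for bare options like skip-external-locking
parser = configparser.ConfigParser(allow_no_value=True)
parser.read_string(MY_CNF)
mysqld = parser["mysqld"]

settings = {}
for key, value in mysqld.items():
    if key == "set-variable" and value and "=" in value:
        # Old MySQL 4.x style: set-variable=name=value
        name, _, val = value.partition("=")
        settings[name.strip()] = val.strip()
    elif value is not None:
        settings[key] = value.strip()

print(settings.get("max_connections"))  # -> 500
print(settings.get("wait_timeout"))     # -> 500
```

Note the wait_timeout = 500 here: lowering it (and enabling the query cache, as done above) usually matters more for a spiky Joomla site than raising max_connections further, because it frees idle connections sooner.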
kenmcd
Joomla! Champion
Posts: 5672
Joined: Thu Aug 18, 2005 2:09 am
Location: California
Contact:

Re: sql connection limit reached: 150

Post by kenmcd » Sun Oct 30, 2005 12:46 am

Best time to watch the server is during the heavy load to see what is happening.

The MySQL config file is the bare minimum and has not been tuned.
LibreTraining
 
And_One
Joomla! Apprentice
Posts: 13
Joined: Tue Oct 04, 2005 7:56 pm

Re: sql connection limit reached: 150

Post by And_One » Sun Oct 30, 2005 12:55 am

Maybe you can share a better my.cnf with us, based on the info from my server?

Well, my root vserver is this one: http://neu.star-hosting.de/c/cms/front_ ... 37&idcat=6