Web Crawler Study Notes

Web Crawler: A Review

1. Categories: general-purpose crawlers, focused crawlers, and distributed crawlers.
2. The web is a directed graph, so searching it amounts to traversing a directed graph. A crawler moves from one page to the next along the link structure of web pages.
3. Google has two main advantages: fewer spam results and fair search results. Both come from Google's PageRank algorithm and anchor-text weighting.
4. Crawling techniques:
   A. General-purpose crawler: collects as many links as possible from every page. Drawbacks: it is slow and consumes a lot of bandwidth.
   B. Focused crawler: crawls documents on one topic, which saves bandwidth.
   C. Distributed crawler: crawls with multiple threads.
5. Crawlers today all run concurrently, which raises problems of overlap (repeated downloads), quality, and network usage.
6. Related papers:
   - Selberg, E. and Etzioni, O. On the instability of Web search engines. In Proceedings of RIAO '00, 2000.
   - Teevan, J., E. Adar, R. Jones, and M. A. Potts. Information re-retrieval: repeat queries in Yahoo's logs. SIGIR '07, 151-158, 2007.
     These two papers mainly study how search results change over time.
   - K. S. Kim, K. Y. Kim, K. H. Lee, T. K. Kim, and W. S. Cho, "Design and Implementation of Web Crawler Based on Dynamic Web Collection Cycle", pp. 562-566, IEEE 2012.
     Dynamic web data crawling: monitoring web changes and fetching pages dynamically.
   - Junghoo Cho and Hector Garcia-Molina, "Parallel Crawlers", Proceedings of the 11th International Conference on World Wide Web (WWW '02), May 7-11, 2002, Honolulu, Hawaii, USA. ACM 1-58113-449-5/02/0005.
     Efficient parallel crawlers.
   - Alex Goh Kwang Leng, Ravi Kumar P, Ashutosh Kumar Singh and Rajendra Kumar Dash, "PyBot: An Algorithm for Web Crawling", IEEE 2011.
     Breadth-first search. Outputs the web structure as an Excel CSV file; the stored pages and the web structure are used for ranking.
   - Rajashree Shettar, Dr. Shobha G, "Web Crawler On Client Machine", Proceedings of the International MultiConference of Engineers and Computer Scientists 2008 Vol II (IMECS 2008), 19-21 March, 2008, Hong Kong.
     Asynchronous multi-threaded download module.
   - Eytan Adar, Jaime Teevan, Susan T. Dumais and Jonathan L. Elsas, "The Web Changes Everything: Understanding the Dynamics of Web Content", ACM 2009.
     A finer-grained analysis for characterizing how the web changes.
   - A. K. Sharma, J. P. Gupta and D. P. Agarwal, "PARCAHYD: An Architecture of a Parallel Crawler based on Augmented Hypertext Documents", International Journal of Advancements in Technology, pp. 270-283, October 2010.
     Implements parallelism at three levels: document, mapper, and crawl worker. Describes the algorithms of the crawler's main modules in detail.
   - Lili Yan, Zhanji Gui, Wencai Du and Qingju Guo, "An Improved PageRank Method based on Genetic Algorithm for Web Search", Procedia Engineering, pp. 2983-2987, Elsevier 2011.
     A heuristic algorithm for PageRank.

Design and Implementation of Web Crawler Based on Dynamic Web Collection Cycle

1. Characteristics of today's web: a complex, non-hierarchical structure; shorter creation and destruction cycles; no physical boundaries.
2. Given these characteristics, the data collection cycle has to be short.
3. Main content of the paper: it proposes a method for crawling dynamic web data that detects site changes sensitively and retrieves the pages of the target site dynamically. It designs an optimal collection cycle model from the update characteristics of web content, and predicts the collection cycle dynamically by computing a collection cycle score.
4. The web collection cycle is determined by three parameters: the current collection cycle, the average collection cycle, and the previous collection cycle. The cycle is therefore dynamic, which reduces network load.
5. This optimal cycle time is the key point of the paper.
6. The paper says an efficient crawler requires work on three things: 1) the strategy for searching neighboring pages; 2) the architecture of a parallel crawler; 3) web page reconstruction. See: A. K. Sharma, et al., PARCAHYD: An Architecture of a Parallel Crawler based on Augmented Hypertext Documents, International Journal of Advancements in Technology, Vol 1, No 2 (Oct. 2010).
7. The crawler of the Google search engine has five functional modules: URL server, crawler, store, indexer, and URL resolver.
8. It uses website probing (?) to find out whether a site has changed.
9. Detailed flow of the dynamic crawler:
   1) read the URL and its collection cycle from the database;
   2) crawl the page according to the collection time;
   3) compare the fetched page with the copy in the database;
   4) compute the page's collection cycle and store it in the database;
   5) if the page has not changed, lengthen the collection cycle.
   The implementation has to consider three points: 1) how to detect page changes; 2) how to improve the collection effect; 3) how to keep the copyright of the web contents.

Crawling Ajax-driven Web 2.0 Applications

1. Introduction: the article mainly uses rbNarcissus, Watir and Ruby to deal with the challenges that Ajax brings. Past articles by the same author:
   - Vulnerability Scanning Web 2.0 Client-Side Components
   - Hacking Web 2.0 Applications with Firefox
   Tools: rbNarcissus (verifies and analyzes JavaScript code without executing it) [5]; Watir (a Ruby-based automated testing tool that drives the browser from code); Ruby (an object-oriented scripting language).
   Watir stands for "Web Application Testing in Ruby" and is pronounced like "water". It is an automated functional testing tool that works at the web page level. Watir can simulate a user visiting pages, clicking links, filling in forms, and clicking buttons, and it can verify page content the way a user would. According to these notes, Watir cannot be used to test Ajax controls, does not support ActiveX testing, and does not support IE dialogs (it once did).
2. Conventional crawler engines are protocol-driven: once a connection is established, the crawler sends HTTP requests and tries to capture the responses. Resource parsing finds further resources through links, scripts, Flash components, and other data. This does not cope well with Ajax, because the target resources are all part of the JavaScript code and are embedded in the DOM, so the crawler has to understand and be able to trigger DOM-based activity.
3. An event-driven crawler is therefore needed. It has three key components: 1) JavaScript analysis and interpretation with linking to Ajax; 2) DOM event handling and dispatching; 3) dynamic DOM content extraction.
4. The event-driven approach: it needs a browser context to understand the DOM and to fire the possible events (fireevent?). Several tools and plugins can be used; the article uses Watir.
5. An ordinary crawler only gets the HTML and not the JavaScript; the XHR object is needed to get the JavaScript. (XHR injection fetches JavaScript through XMLHttpRequest. Unlike eval, it creates a script DOM element and injects the XMLHttpRequest response into that element to execute the JavaScript. In some cases eval may be slower than this mechanism. Content fetched through XHR injection must be served from the same domain as the main page.)
6. Analysis steps:
   <1> Analyze the JavaScript code. Parse the JavaScript for XHR calls to find all candidate functions. In the article's example, getQuote, loadmyarea and loadhtml call the XHR object, while getPrice calls getQuote.
   <2> Automating IE with Watir. Watir is used to drive IE automatically; other tools also work, as long as they can trigger events.

Design and Implementation of a High-Performance Distributed Web Crawler

1. The paper proposes a robust and portable distributed crawler system.
2. A good crawler has to satisfy two requirements: 1) a good crawling strategy, which decides which page to download; 2) a highly optimized system architecture that can download a large number of pages while coping with crashes and similar failures. The challenges: system design, I/O, network efficiency, robustness, and ease of use.

Crawling strategies and related work:
   1) Crawling important pages first:
      - J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 117–128, May 2000.
      - M. Najork and J. Wiener. Breadth-first search crawling yields high-quality pages. In 10th Int. World Wide Web Conference, 2001.
   2) Crawling a specific topic or type of page:
      - S. Chakrabarti, M. van den Berg, and B. Dom. Distributed hypertext resource discovery through examples. In Proc. of 25th Int. Conf. on Very Large Data Bases, pages 375–386, September 1999.
      - S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proc. of the 8th Int. World Wide Web Conference (WWW8), May 1999.
      - M. Diligenti, F. Coetzee, S. Lawrence, C. Giles, and M. Gori. Focused crawling using context graphs. In Proc. of 26th Int. Conf. on Very Large Data Bases, September 2000.
      - J. Rennie and A. McCallum. Using reinforcement learning to spider the web efficiently. In Proc. of the Int. Conf. on Machine Learning (ICML), 1999.
   3) Crawling updated pages:
      - J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proc. of 26th Int. Conf. on Very Large Data Bases, pages 117–128, September 2000.
      - J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness (SIGMOD 2000, cited above).
   4) Scheduling of crawling activity over time (?):
      - J. Talim, Z. Liu, P. Nain, and E. Coffman. Controlling robots of web search engines. In SIGMETRICS Conference, June 2001.
   Detailed architecture design:
      - A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, 1999. (Detailed design of the crawler used by the AltaVista search engine.)

Crawling strategies that need further study:
1. Traversal strategies: depth-first, breadth-first, and weighted (priority-first).
2. Crawling updated pages: the simple approaches start another crawl or just request every URL again. A good crawler has to keep the search index up to date with limited crawling bandwidth. Related work: Cho and Garcia-Molina, "Synchronizing a database to improve freshness" (SIGMOD 2000) and "The evolution of the web and implications for an incremental crawler" (VLDB 2000), both cited above.
3. Focused crawlers: some specialized search engines only crawl certain pages or certain topics (a particular language, images, mp3 files, and so on). On the heuristic side, many general search algorithms have been proposed, based on link-structure analysis (Chakrabarti, van den Berg, and Dom, WWW8 1999 and VLDB 1999, cited above) and on machine learning (Rennie and McCallum, ICML 1999; Diligenti et al., VLDB 2000, cited above).
4. Random crawl sampling:
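
As a rough illustration of the point made in the "Web Crawler: A Review" notes (the web is a directed graph, and crawling is a traversal of it), here is a minimal breadth-first sketch in Python. The link graph `LINKS`, the helper `links_of`, and the `max_pages` limit are assumptions made for this example; a real crawler would fetch each page over HTTP and extract its links instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# It stands in for "download the page and extract its links".
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def links_of(page):
    """Return the outgoing links of a page (empty if the page is unknown)."""
    return LINKS.get(page, [])


def bfs_crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first from the seed and return them in visit order."""
    frontier = deque([seed])   # URLs waiting to be visited (FIFO = breadth-first)
    seen = {seed}              # URLs already queued, so no page is visited twice
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


if __name__ == "__main__":
    print(bfs_crawl("a", links_of))
```

Popping from the right end of the deque instead (`frontier.pop()`) would turn the same loop into a depth-first traversal, and replacing the deque with a priority queue would give a weighted strategy.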
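
The dynamic crawler flow from the notes on "Design and Implementation of Web Crawler Based on Dynamic Web Collection Cycle" (read the stored cycle, fetch, compare with the stored copy, recompute the cycle, lengthen it when nothing changed) can be sketched as follows. The notes do not give the paper's scoring formula, so the update rule in `next_cycle` (a weighted mix of the current, average, and previous cycle, scaled up or down) is purely an assumption for illustration, as are the record field names and the use of a SHA-256 hash for change detection.

```python
import hashlib


def content_hash(html):
    """Fingerprint a page so two versions can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def next_cycle(current, average, previous, changed,
               weights=(0.5, 0.3, 0.2), growth=1.5, shrink=0.5,
               min_cycle=1.0, max_cycle=168.0):
    """Assumed update rule (not the paper's formula), cycles in hours.

    Mixes the three cycle parameters, then shortens the cycle if the page
    changed and lengthens it if it did not, clamped to [min_cycle, max_cycle].
    """
    w_cur, w_avg, w_prev = weights
    base = w_cur * current + w_avg * average + w_prev * previous
    factor = shrink if changed else growth
    return min(max_cycle, max(min_cycle, base * factor))


def revisit(record, fetched_html):
    """Apply one crawl visit to a stored record; return (new_record, changed)."""
    new_hash = content_hash(fetched_html)
    changed = new_hash != record["hash"]          # step 3: compare with stored copy
    cycle = next_cycle(record["cycle"], record["avg_cycle"],
                       record["prev_cycle"], changed)   # step 4: recompute cycle
    visits = record["visits"] + 1
    new_record = {
        "hash": new_hash,
        "prev_cycle": record["cycle"],
        "cycle": cycle,
        # running mean of all cycles assigned so far
        "avg_cycle": record["avg_cycle"] + (cycle - record["avg_cycle"]) / visits,
        "visits": visits,
    }
    return new_record, changed


if __name__ == "__main__":
    stored = {"hash": content_hash("<p>old</p>"), "cycle": 10.0,
              "avg_cycle": 10.0, "prev_cycle": 10.0, "visits": 1}
    print(revisit(stored, "<p>old</p>"))   # unchanged page: cycle grows
    print(revisit(stored, "<p>new</p>"))   # changed page: cycle shrinks
```

Hashing the full page is the simplest change test; pages with rotating ads or timestamps would need a more tolerant comparison, which is outside this sketch.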
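
The JavaScript analysis step from the notes on "Crawling Ajax-driven Web 2.0 Applications" boils down to a reachability question on the call graph: which functions end up calling the XHR object, directly or through another function? The sketch below hard-codes the call relations named in the notes (getQuote, loadmyarea and loadhtml call XHR; getPrice calls getQuote). The `showBanner` entry is a hypothetical function added only to show a negative case, and building the call graph from real JavaScript (the job rbNarcissus does in the article) is not covered here.

```python
# Call graph: function name -> names it calls. "XHR" marks the XMLHttpRequest object.
CALLS = {
    "getQuote": {"XHR"},
    "loadmyarea": {"XHR"},
    "loadhtml": {"XHR"},
    "getPrice": {"getQuote"},
    "showBanner": set(),      # hypothetical: no path to XHR
}


def reaches_xhr(func, calls, target="XHR"):
    """Return True if func calls the target directly or transitively."""
    stack = [func]
    seen = set()
    while stack:
        name = stack.pop()
        if name == target:
            return True
        if name in seen:
            continue          # guards against recursive call cycles
        seen.add(name)
        stack.extend(calls.get(name, ()))
    return False


def xhr_triggers(calls):
    """List the functions an event-driven crawler should try to fire."""
    return sorted(f for f in calls if reaches_xhr(f, calls))


if __name__ == "__main__":
    print(xhr_triggers(CALLS))
```

The resulting list is what a browser-driving tool would then work through, firing the DOM events bound to each of these functions and reading back the updated DOM.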
