Proceedings of the 2nd National Conference on Challenges & Opportunities in Information Technology (COIT-2008), RIMT-IET, Mandi Gobindgarh, March 29, 2008

Discussion on Web Crawlers of Search Engine

M.P.S. Bhatia*, Divya Gupta**
*Netaji Subhas Institute of Technology, University of Delhi, India
**Guru Prem Sukh Memorial College of Engineering, GGSIP University, Delhi

Abstract: With the precipitous expansion of the Web, extracting knowledge from the Web is gradually becoming important and popular. This is due to the Web's convenience and richness of information. To find Web pages, one typically uses search engines that are based on the Web crawling framework. This paper describes the basic tasks performed by a search engine and gives an overview of how Web crawlers are related to search engines.

Keywords: Distributed Crawling, Focused Crawling, Web Crawlers

I. INTRODUCTION

The WWW is a service that resides on computers connected to the Internet and allows end users to access the data stored on those computers using standard interface software. The World Wide Web is the universe of network-accessible information, an embodiment of human knowledge.

A search engine is a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet. A search engine finds information for its database by accepting listings sent in by authors who want exposure, or by getting the information from its "Web crawlers," "spiders," or "robots," programs that roam the Internet storing links to and information about each page they visit [6].

A Web crawler is a program which fetches information from the World Wide Web in an automated manner. Web crawling [32] is an important research issue. Crawlers are software components which visit portions of Web trees, according to certain strategies, and collect the retrieved objects in local repositories [7].

The rest of the paper is organized as follows: in Section 2 we explain the background details of Web crawlers. In Section 3 we discuss the types of crawlers, and in Section 4 we explain the working of a Web crawler. In Section 5 we cover two advanced techniques of Web crawlers. In Section 6 we discuss the problem of selecting more interesting pages.

II. SURVEY OF WEB CRAWLERS

Web crawlers are almost as old as the Web itself [23]. The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic. Several papers about Web crawling were presented at the first two World Wide Web conferences [29, 24, 25]. However, at the time, the Web was three to four orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's Web.

Obviously, all of the popular search engines use crawlers that must scale up to substantial portions of the Web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility.

The original Google crawler (developed at Stanford) consisted of five functional components running in different processes. A URL server process read URLs out of a file and forwarded them to multiple crawler processes. Each crawler process ran on a different machine, was single-threaded, and used asynchronous I/O to fetch data from up to 300 Web servers in parallel. The crawlers transmitted downloaded pages to a single Store Server process, which compressed the pages and stored them to disk. The pages were then read back from disk by an indexer process, which extracted links from the HTML pages and saved them to a different disk file. A URL resolver process read the link file, derelativized the URLs contained therein, and saved the absolute URLs to the disk file that was read by the URL server. Typically, three to four crawler machines were used, so the entire system required between four and eight machines.
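The URL resolver step described above, turning relative links found in a page into absolute URLs that the URL server can hand back out, can be illustrated with Python's standard library. This is only a minimal sketch of that single step under stated assumptions, not the original Google code; the example URLs are placeholders.

```python
from urllib.parse import urljoin, urldefrag

def derelativize(base_url, raw_links):
    """Turn relative links extracted from a page at base_url into
    absolute URLs, dropping fragment identifiers (#...)."""
    absolute = []
    for link in raw_links:
        url, _frag = urldefrag(urljoin(base_url, link))
        absolute.append(url)
    return absolute

# Example: links as an indexer might have extracted them from one page.
print(derelativize("http://example.com/a/page.html",
                   ["../b/other.html", "img/logo.png", "http://example.org/"]))
```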
Research on Web crawling continued at Stanford even after Google was transformed into a commercial effort. The Stanford WebBase project has implemented a high-performance distributed crawler, capable of downloading 50 to 100 documents per second [21]. Cho and others have also developed models of document update frequencies to inform the download schedule of incremental crawlers [23].

The Internet Archive also used multiple machines to crawl the Web [26, 22]. Each crawler process was assigned up to 64 sites to crawl, and no site was assigned to more than one crawler. Each single-threaded crawler process read a list of seed URLs for its assigned sites from disk into per-site queues, and then used asynchronous I/O to fetch pages from these queues in parallel. Once a page was downloaded, the crawler extracted the links contained in it. If a link referred to the site of the page it was contained in, it was added to the appropriate site queue; otherwise it was logged to disk. Periodically, a batch process merged these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.
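The routing rule described for the Internet Archive crawler, where same-site links go straight into that site's queue while cross-site links are logged for a later batch merge, can be sketched as follows. This is an illustrative Python sketch, not the Archive's actual implementation; the class name and log file name are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class SiteQueues:
    """Per-site URL queues plus a disk log of cross-site URLs, mimicking the
    routing rule described for the Internet Archive crawler."""

    def __init__(self, assigned_sites, crosssite_log="crosssite_urls.txt"):
        self.assigned = set(assigned_sites)           # hostnames this process crawls
        self.queues = defaultdict(deque)              # hostname -> queue of URLs
        self.crosssite_log = crosssite_log

    def add(self, url):
        host = urlsplit(url).hostname
        if host in self.assigned:
            self.queues[host].append(url)             # same-site: enqueue directly
        else:
            with open(self.crosssite_log, "a") as f:  # cross-site: log to disk
                f.write(url + "\n")

# A later batch job would merge the logged cross-site URLs into the seed sets
# of whichever crawler process owns each site, dropping duplicates.
sq = SiteQueues(["example.com"])
sq.add("http://example.com/page1.html")   # queued locally
sq.add("http://other.org/index.html")     # logged for the batch merge
```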
The WebFountain crawler shares several of Mercator's characteristics: it is distributed, continuous (the authors use the term "incremental"), polite, and configurable [28]. Unfortunately, as of this writing, WebFountain is in the early stages of its development, and data about its performance is not yet available.

III. BASIC TYPES OF SEARCH ENGINE

A. Crawler Based Search Engines

Crawler based search engines create their listings automatically. Computer programs ("spiders") build them, not human selection [31]. They are not organized by subject categories; a computer algorithm ranks all pages. Such search engines are huge and often retrieve a lot of information; for complex searches they allow you to search within the results of a previous search and enable you to refine the search results. These types of search engines contain the full text of the Web pages they link to, so one can find pages by matching words in the pages one wants [15].

B. Human Powered Directories

These are built by human selection, i.e., they depend on humans to create the listings. They are organized into subject categories, and subject experts do the classification of pages. Human powered directories never contain the full text of the Web pages they link to. They are smaller than most search engines [16].

C. Hybrid Search Engine

A hybrid search engine differs from a traditional text-oriented search engine such as Google, or a directory-based search engine such as Yahoo, in which each program operates by comparing a set of metadata, the primary corpus being the metadata derived from a Web crawler or a taxonomic analysis of all Internet text, and a user search query. In contrast, a hybrid search engine may use these two bodies of metadata in addition to one or more further sets of metadata that can, for example, include situational metadata derived from the client's network that models the context awareness of the client.

D. Meta Search Engine

A meta-search engine is the kind of search engine that does not have its own database of Web pages. It sends search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried. Fewer meta searchers allow you to delve into the largest, most useful search engine databases; they tend to return results from smaller and/or free search engines and miscellaneous free directories, often small and highly commercial.

IV. WORKING OF A WEB CRAWLER

Web crawlers are an essential component of search engines, and running a Web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of Web servers and various name servers, which are all beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially if one is crawling a site from multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel.

Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same. Following is the process by which Web crawlers work:
1. Download the Web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.

The Web crawler can be used for crawling through a whole site on the Internet or an intranet. You specify a start-URL and the crawler follows all links found in that HTML page. This usually leads to more links, which will be followed again, and so on. A site can thus be seen as a tree structure: the root is the start-URL, all links in that root HTML page are direct sons of the root, and subsequent links are sons of the previous sons.

A single URL server serves lists of URLs to a number of crawlers. A Web crawler starts by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. It then parses those pages for new links, and so on, recursively. Web crawler software does not actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve Web pages at a fast enough pace. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is automate the process of following links.

Web crawling can be regarded as processing items in a queue. When the crawler visits a Web page, it extracts links to other Web pages. The crawler then puts these URLs at the end of a queue, and continues crawling from a URL that it removes from the front of the queue [1].
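The three-step loop and the queue-based view above amount to a breadth-first traversal of the link graph. The sketch below is a minimal, single-threaded illustration of that idea using only Python's standard library; it is not the paper's system, and a real crawler would add politeness delays, robots.txt checks, and better error handling. The seed URL is a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                        # collect href attributes of anchor tags
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl: download, extract links, enqueue unseen URLs."""
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()                 # 1. take a URL from the front of the queue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                          # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)                     # 2. parse the page and retrieve all links
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)        # 3. repeat the process for each new link
    return seen

# crawl("http://example.com/")  # placeholder seed URL
```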
A. Resource Constraints

Crawlers consume resources: network bandwidth to download pages, memory to maintain private data structures in support of their algorithms, CPU to evaluate and select URLs, and disk storage to store the text and links of fetched pages as well as other persistent data.

B. Robot Protocol

The robots.txt file gives directives for excluding a portion of a Web site from being crawled. Analogously, a simple text file can furnish information about the freshness and popularity of published objects. This information permits a crawler to optimize its strategy for refreshing collected data as well as its policy for replacing stored objects.
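Honouring the robot protocol mentioned above is straightforward with Python's standard library, which ships a parser for robots.txt. The sketch below is a minimal illustration, not part of the paper; the user-agent string and URLs are placeholders.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url, user_agent="ExampleCrawler"):
    """Check a site's robots.txt before fetching a URL from it."""
    parts = urlsplit(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                               # download and parse the robots.txt file
    return rp.can_fetch(user_agent, url)    # True if the directives allow this URL

# if allowed_to_fetch("http://example.com/some/page.html"):
#     ...  # safe to request the page
```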
V. CRAWLING TECHNIQUES

A. Focused Crawling

A general purpose Web crawler gathers as many pages as it can from a particular set of URLs, whereas a focused crawler is designed to gather only documents on a specific topic, thus reducing the amount of network traffic and downloads. The goal of the focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not with keywords, but with exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date.

The focused crawler has three main components: a classifier, which makes relevance judgments on crawled pages to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; and a crawler with dynamically reconfigurable priority controls which is governed by the classifier and the distiller.

The most crucial evaluation of focused crawling is to measure the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered off from the crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot of its time merely eliminating irrelevant pages, and it might be better to use an ordinary crawler instead [17].
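The interplay of the relevance classifier and the harvest ratio described above can be sketched in a few lines. The classifier here is a trivial keyword-score stand-in, not the paper's classifier or distiller; it only illustrates how relevance judgments gate link expansion and how the harvest ratio would be computed.

```python
def relevance(page_text, topic_terms):
    """Toy stand-in for the classifier: fraction of topic terms present."""
    text = page_text.lower()
    return sum(term in text for term in topic_terms) / len(topic_terms)

def focused_step(page_text, links, topic_terms, frontier, stats, threshold=0.5):
    """Decide whether a crawled page is relevant; expand its links only if so."""
    stats["crawled"] += 1
    if relevance(page_text, topic_terms) >= threshold:
        stats["relevant"] += 1
        frontier.extend(links)              # expand links of relevant pages only
    # Harvest ratio: relevant pages acquired per page crawled so far.
    stats["harvest_ratio"] = stats["relevant"] / stats["crawled"]

stats, frontier = {"crawled": 0, "relevant": 0}, []
focused_step("a page about web crawlers and search engines",
             ["http://example.com/next"], ["crawler", "search"], frontier, stats)
print(stats["harvest_ratio"])
```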
B. Distributed Crawling

Indexing the Web is a challenge due to its growing and dynamic nature. As the size of the Web grows, it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time. A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly. When a single centralized crawler is used, all the fetched data passes through a single physical link. Distributing the crawling activity via multiple processes can help build a scalable, easily configurable system, which is also fault tolerant. Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion, that is, no central coordinator exists [3].
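One common way to split the load without a central coordinator, consistent with the distributed setting described above, is to assign each URL to a crawler process by hashing its hostname, so every process can decide locally which URLs it owns. This is a generic sketch, not an architecture taken from the paper; the number of processes is arbitrary.

```python
import hashlib
from urllib.parse import urlsplit

def owner_process(url, num_processes=4):
    """Map a URL to one of num_processes crawler processes by hashing its host,
    so that all pages of a given site are fetched by the same process."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_processes

# Every process applies the same rule, so no central coordinator is needed:
for u in ["http://example.com/a", "http://example.com/b", "http://other.org/"]:
    print(u, "-> crawler", owner_process(u))
```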
VI. PROBLEM OF SELECTING MORE "INTERESTING" OBJECTS

A search engine is aware of hot topics because it collects user queries. The crawling process prioritizes URLs according to an importance metric such as similarity (to a driving query), back-link count, PageRank, or their combinations and variations [8], [9]. Recently Najork et al. showed that breadth-first search collects high-quality pages first and suggested a variant of PageRank [10]. However, at the moment, search strategies are unable to select exactly the "best" paths because their knowledge is only partial. Due to the enormous amount of information available on the Internet, a total crawl is at the moment impossible, and thus pruning strategies must be applied. Focused crawling [11], [12] and intelligent crawling [13] are techniques for discovering Web pages relevant to a specific topic or set of topics [14].
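Prioritizing the frontier by an importance metric, as described above, is usually implemented with a priority queue keyed on that metric. The sketch below uses a running back-link count as the score purely for illustration; it is not the paper's method, and any of the metrics mentioned (query similarity, PageRank variants) could be plugged in instead.

```python
import heapq

class PriorityFrontier:
    """Crawl frontier that always yields the URL with the highest importance score.
    Here the score is a running back-link count; other metrics work the same way."""

    def __init__(self):
        self.backlinks = {}                  # url -> number of links seen pointing to it
        self.heap = []                       # max-heap simulated with negated scores

    def add_link(self, url):
        self.backlinks[url] = self.backlinks.get(url, 0) + 1
        heapq.heappush(self.heap, (-self.backlinks[url], url))

    def next_url(self):
        while self.heap:
            neg_score, url = heapq.heappop(self.heap)
            if -neg_score == self.backlinks.get(url):   # skip stale heap entries
                return url
        return None

f = PriorityFrontier()
for link in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    f.add_link(link)
print(f.next_url())   # http://a.example/ (two back-links) is crawled first
```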
CONCLUSION

In this paper we conclude that complete Web crawling coverage cannot be achieved, due to the vast size of the whole WWW and to resource availability. Usually some kind of threshold is set up (number of visited URLs, level in the website tree, compliance with a topic, etc.) to limit the crawling process over a selected website. This information is used in search engines to store and refresh the most relevant and most recently updated Web pages, thus improving the quality of the retrieved content while reducing stale content and missing pages.

REFERENCES

[1]. Garcia-Molina, Hector. Searching the Web. August 2001. http://oak.cs.ucla.edu/~cho/papers/cho-toit01.pdf
[2]. Grossan, B. "Search Engines: What they are, how they work, and practical suggestions for getting the most out of them," February 1997.
[3]. http://www.webreference.com
[4]. Baldi, Pierre. Modeling the Internet and the Web: Probabilistic Methods and Algorithms, 2003.
[5]. Pant, Gautam, Padmini Srinivasan and Filippo Menczer. Crawling the Web, 2003.
[6]. http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
[7]. Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-Structured Data, 2003.
[8]. http://www.google.co.in/
[9]. Marina Buzzi. "Cooperative Crawling." Proceedings of the First Latin American Web Congress (LA-WEB 2003), IEEE, 2003.
[10]. J. Cho, H. Garcia-Molina, L. Page. "Efficient Crawling through URL Ordering." WWW7, Computer Networks 30(1-7): 161-172, 1998.
[11]. A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan. "Searching the Web." ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001, pp. 2-43.
[12]. M. Najork, J. Wiener. "Breadth-first crawling yields high-quality pages." WWW2001, pp. 114-118.
[21]. Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. "WebBase: A repository of Web pages." In Proceedings of the Ninth International World Wide Web Conference, pages 277-293, May 2000.
[22]. The Internet Archive. http://www.archive.org
[24]. Oliver A. McBryan. "GENVL and WWWW: Tools for Taming the Web." In Proceedings of the First International World Wide Web Conference, pages 79-90, 1994.
[25]. Brian Pinkerton. "Finding What People Want: Experiences with the WebCrawler." In Proceedings of the Second International World Wide Web Conference, 1994.
[26]. Mike Burner. "Crawling towards Eternity: Building an archive of the World Wide Web." Web Techniques Magazine, 2(5), May 1997.
[27]. Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. "Efficient crawling through URL ordering." In Proceedings of the Seventh International World Wide Web Conference, 1998.
C. Aggarwal, F. Al-Garawi, P. Yu. "Intelligent crawling on the World Wide Web with arbitrary predicates." WWW2001, pp. 96-105.
Brin, Sergey and Page, Lawrence. "The anatomy of a large-scale hypertextual Web search engine." Computer Networks and ISDN Systems, April 1998.
S. Chakrabarti, K. Punera, M. Subramanyam. "Accelerated focused crawling through online relevance feedback." WWW 2002, pp. 148-159.
C. Chung, C. Clarke. "Topic-oriented collaborative crawling." CIKM 2002.
M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, M. Gori. "Focused Crawling Using Context Graphs." VLDB 2000, pp. 527-534.
Martijn Koster. The Web Robots Pages. http://info.webcrawler.com/mak/projects/robots/robots.html