Module 3 - Nutch

Google Cluster Computing Faculty Training Workshop
Module III: Nutch
This presentation © Michael Cafarella
Redistributed under the Creative Commons Attribution 3.0 license

Meta-details
- Built to encourage public search work
- Open-source, with pluggable modules
- Cheap to run, both machines and admins
- Goal: search more pages, with better quality, than any other engine
- Pretty good ranking
- Has done ~200M pages; more is possible
- Hadoop is a spinoff

Outline
- Nutch design: link database (WebDB), fetcher, indexer, etc.
- Hadoop support: distributed filesystem, job control

[Diagram slide: WebDB]

Moving Parts
- Acquisition cycle: WebDB, fetcher
- Index generation: indexing, link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages and links
  - Per page: URL, last download, # failures, link score, content hash, ref counting
  - Per link: source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19 ms seek time x 200M new pages/month = ~44 days of disk seeks!
- A single-disk WebDB was a huge headache

Fetcher
- The fetcher is very stupid; it is not a "crawler"
- Pre-MapReduce: divide the "to-fetch list" into k pieces, one for each fetcher machine
- URLs for one domain go to the same list; otherwise random
  - "Politeness" without inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS and robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits

WebDB/Fetcher Updates
[Diagram: fetcher edits merged into the WebDB]
1. Write down the fetcher edits
2. Sort the edits (externally, if necessary)
3. Read the streams in parallel, emitting the new database
4. Repeat for the other tables

Indexing
- Iterate through all k page sets in parallel, constructing the inverted index
- Creates a "searchable document" of: URL text, content text, incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end user will want
- Uses the Lucene text indexer

Link analysis
- A page's relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- PageRank is the most famous of many; others include HITS, OPIC, and a simple incoming-link count
- Link analysis is sexy, but its importance is generally overstated

Link analysis (2)
- Nutch performs the analysis in the WebDB, emitting a score for each known page
- At index time, the score is incorporated into the inverted index
- Extremely time-consuming; in our case, disk-consuming too (because we want to use low-memory machines)
- Fast and easy: 0.5 * log(# incoming links) (a sketch follows below)
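The "fast and easy" score is simple enough to sketch. The Python snippet below is only an illustration, not Nutch's actual code: it assumes the link table is available as (source, target) URL pairs, assumes natural log (the slide does not name a base), and the function name is made up.

```python
import math
from collections import defaultdict

def link_scores(links):
    """links: iterable of (source_url, target_url) pairs, roughly the WebDB link table.

    Returns {url: 0.5 * log(# incoming links)} for every page with at least one
    known inlink; pages with no inlinks simply don't appear (treat them as 0).
    """
    incoming = defaultdict(set)
    for src, dst in links:
        incoming[dst].add(src)          # count each linking page only once
    return {url: 0.5 * math.log(len(srcs)) for url, srcs in incoming.items()}

if __name__ == "__main__":
    demo = [("a.com/", "c.com/"), ("b.com/", "c.com/"), ("a.com/", "b.com/")]
    print(link_scores(demo))            # c.com/: 0.5*log(2) ~= 0.35, b.com/: 0.0
```

At index time the resulting number would be folded into each document's score, as the slide describes.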
Query Processing
[Diagram: the query "britney" is broadcast to five index servers (Docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M); each returns its local hits (docs 1, 29; 1.2M, 1.7M; 2.3M, 2.9M; 3.1M, 3.2M; 4.4M, 4.5M), which are merged into the final result list: 1.2M, 4.4M, 29, ...]

Administering Nutch
- Admin costs are critical
  - It's a hassle when you have 25 machines; Google has >100k, probably more
- Files: WebDB content and working files; fetchlists and fetched pages; link analysis outputs and working files; inverted indices
- Jobs: emit fetchlists, fetch, update the WebDB; run link analysis; build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it's not! Really, I swear
- Large-file maintenance: Google File System (Ghemawat, Gobioff, Leung); Nutch Distributed File System
- Job control: Map/Reduce (Dean and Ghemawat); Pig (Yahoo Research)
- Data storage: BigTable

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange: extremely large files; most files read once, from start to end; low admin cost per GB
- Equally strange design: write-once, with delete; a single file can exist across many machines; wholly automatic failure recovery

NDFS (2)
- Data is divided into blocks; blocks can be copied and replicated
- Datanodes hold and serve blocks
- The namenode holds the metainfo: filename → block list, block → datanode locations
- Datanodes report in to the namenode every few seconds

NDFS File Read
[Diagram: client, namenode, datanodes 0-5]
1. Client asks the namenode for info on a filename ("crawl.txt")
2. Namenode responds with the block list and the location(s) of each block, e.g. block-33 on datanodes 1, 4; block-95 on datanodes 0, 2; block-65 on datanodes 1, 4, 5
3. Client fetches each block, in sequence, from a datanode

NDFS Replication
[Diagram: namenode plus datanodes 0-5 with their blocks, e.g. datanode 0 (33, 95), datanode 4 (90), datanode 5 (21, 90, 104)]
- Always keep at least k copies of each block
- Imagine datanode 4 dies: block 90 is lost
- The namenode loses the heartbeat, decrements block 90's reference count, and asks datanode 5 to replicate block 90 to datanode 0
- Choosing the replication target is tricky

Map/Reduce
- Map/Reduce is a programming model from Lisp (and other places)
- Easy to distribute across nodes; nice retry/failure semantics
- map(key, val) is run on each item in the set and emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map() and emits the final output
- Many problems can be phrased this way

Map/Reduce (2)
- Task: count words in docs
- Input consists of (url, contents) pairs
- map(key=url, val=contents): for each word w in contents, emit (w, "1")
- reduce(key=word, values=uniq_counts): sum all the "1"s in the values list and emit (word, sum)
- (A runnable sketch of this job follows below)
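The word-count pseudocode maps directly onto two small functions. The sketch below is a toy, single-process stand-in (plain Python, no Hadoop): run_job plays the role of the framework, doing the group-values-by-key "shuffle" in memory before calling reduce() once per unique key. All names here are made up for illustration.

```python
import re
from collections import defaultdict

def map_fn(url, contents):
    """map(key=url, val=contents): emit (word, 1) for each word in the document."""
    for word in re.findall(r"\w+", contents.lower()):
        yield word, 1

def reduce_fn(word, counts):
    """reduce(key=word, values=counts): sum the 1s and emit (word, total)."""
    yield word, sum(counts)

def run_job(inputs, mapper, reducer):
    """Toy single-process driver: run every map() call, group the emitted values
    by key (the 'shuffle'), then run reduce() once per unique key."""
    grouped = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results

if __name__ == "__main__":
    docs = [("http://example.org/a", "to be or not to be"),
            ("http://example.org/b", "to search is to fetch")]
    print(run_job(docs, map_fn, reduce_fn))
    # [('be', 2), ('fetch', 1), ('is', 1), ('not', 1), ('or', 1), ('search', 1), ('to', 4)]
```

The grep task on the next slide fits the same shape; only the map and reduce functions change.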
Map/Reduce (3)
- Task: grep
- Input consists of (url+offset, single line) pairs
- map(key=url+offset, val=line): if the line matches the regexp, emit (line, "1")
- reduce(key=line, values=uniq_counts): don't do anything; just emit the line
- We can also do graph inversion, link analysis, WebDB updates, etc.

Map/Reduce (4)
- How is this distributed?
  - Partition the input key/value pairs into chunks and run map() tasks in parallel
  - After all map()s are complete, consolidate all emitted values for each unique emitted key
  - Now partition the space of output map keys and run reduce() in parallel
- If a map() or reduce() fails, re-execute it!

Map/Reduce Job Processing
[Diagram: JobTracker plus TaskTrackers 0-5 running a "grep" job]
1. Client submits the "grep" job, indicating code and input files
2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to tasktrackers
3. After map(), the tasktrackers exchange map output to build the reduce() keyspace
4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
5. reduce() output may go to NDFS

Nutch & Hadoop
- NDFS stores the crawl and the indexes
- MapReduce is used for indexing, parsing, WebDB construction, even fetching
  - Broke the previous 200M pages/month limit
  - Index-serving?
- Required a massive rewrite of almost every Nutch component

Conclusion
- http://www.nutch.org/ : partial documentation, source code, developer discussion board
- "Lucene in Action" by Hatcher and Gospodnetic (or you can borrow mine)
- Read the Google papers on GFS, MapReduce, and BigTable; we relied heavily on these papers
- Questions?