Google Cluster Computing Faculty Training Workshop
Module III: Nutch
This presentation © Michael Cafarella
Redistributed under the Creative Commons Attribution 3.0 license.

Meta-details
Built to encourage public search work
Open-source, w/pluggable modules
Cheap to run, both machines & admins
Goal: Search more pages, with better quality, than any other engine
Pretty good ranking
Has done ~ 200M pages, more possible
Hadoop is a spinoff

Outline
Nutch design
Link database, fetcher, indexer, etc…
Hadoop support
Distributed filesystem, job control

Moving Parts
Acquisition cycle
WebDB
Fetcher
Index generation
Indexing
Link analysis (maybe)
Serving results

WebDB
Contains info on all pages, links
For each page: URL, last download time, # failures, link score, content hash, ref counting
For each link: source hash, target URL
Must always be consistent
Designed to minimize disk seeks
19 ms seek time × 200M new pages/mo
= ~44 days of disk seeks!
A single-disk WebDB was a huge headache
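Spelled out, that seek budget (assuming one disk seek per new page) is roughly:
0.019 s/seek × 200,000,000 pages ≈ 3,800,000 s ≈ 44 days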

Fetcher
Fetcher is very stupid. Not a “crawler”
Pre-MapReduce: divide the “to-fetch list” into k pieces, one for each fetcher machine
URLs for one domain go to same list, otherwise random
“Politeness” w/o inter-fetcher protocols
Can observe robots.txt similarly
Better DNS, robots caching
Easy parallelism
Two outputs: pages, WebDB edits

WebDB/Fetcher Updates
1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read the two streams (WebDB and fetcher edits) in parallel, emitting a new database
4. Repeat for other tables
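A minimal sketch of step 3's merge, assuming the old WebDB and the sorted fetcher edits both arrive as URL-ordered lists; the types here are illustrative, not Nutch's actual on-disk format:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: merge a URL-sorted WebDB with URL-sorted fetcher edits,
// producing a new sorted database using purely sequential reads and writes.
public class WebDbMergeSketch {
    static List<Map.Entry<String, String>> merge(List<Map.Entry<String, String>> db,
                                                 List<Map.Entry<String, String>> edits) {
        List<Map.Entry<String, String>> merged = new ArrayList<>();
        int i = 0, j = 0;
        while (i < db.size() || j < edits.size()) {
            if (i == db.size())         merged.add(edits.get(j++));   // brand-new page
            else if (j == edits.size()) merged.add(db.get(i++));      // untouched record
            else {
                int cmp = db.get(i).getKey().compareTo(edits.get(j).getKey());
                if (cmp < 0)      merged.add(db.get(i++));            // untouched record
                else if (cmp > 0) merged.add(edits.get(j++));         // brand-new page
                else { merged.add(edits.get(j++)); i++; }             // edit replaces old record
            }
        }
        return merged;
    }
}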

Indexing
Iterate through all k page sets in parallel, constructing the inverted index
Creates a “searchable document” of:
URL text
Content text
Incoming anchor text
Other content types might have different document fields
E.g., email has sender/receiver
Any searchable field the end user will want
Uses the Lucene text indexer
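For flavor, a sketch of such a document against a modern Lucene API; the field names and the Lucene version are assumptions, and Nutch's real indexing plugins differ in detail:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

// Illustrative only: one searchable document per fetched page.
public class PageIndexerSketch {
    static void addPage(IndexWriter writer, String url, String content, String anchors)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));       // URL text
        doc.add(new TextField("content", content, Field.Store.NO));  // content text
        doc.add(new TextField("anchor", anchors, Field.Store.NO));   // incoming anchor text
        writer.addDocument(doc);
    }
}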

Link analysis
A page’s relevance depends on both intrinsic and extrinsic factors
Intrinsic: page title, URL, text
Extrinsic: anchor text, link graph
PageRank is the most famous of many
Others include:
HITS
OPIC
Simple incoming link count
Link analysis is sexy, but its importance is generally overstated

Link analysis (2)
Nutch performs analysis in the WebDB
Emit a score for each known page
At index time, incorporate score into inverted index
Extremely time-consuming
In our case, disk-consuming, too (because we want to use low-memory machines)
Fast and easy:
0.5 * log(# incoming links)
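To see how flat that score is (assuming a natural log), a tiny illustration:

public class LinkScoreSketch {
    // Illustrative only: the "fast and easy" score above.
    static float linkScore(long incomingLinks) {
        return 0.5f * (float) Math.log(incomingLinks);
    }
    // linkScore(1) = 0, linkScore(10) ≈ 1.15, linkScore(1000) ≈ 3.45
}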

Query Processing
[Diagram: the query “britney” is sent to five index servers, each holding one slice of the index (docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M); each returns its local hits (docs 1 and 29; 1.2M and 1.7M; 2.3M and 2.9M; 3.1M and 3.2M; 4.4M and 4.5M), which are merged into one result list: 1.2M, 4.4M, 29, …]

Administering Nutch
Admin costs are critical
It’s a hassle when you have 25 machines
Google has >100k machines, probably more
Files
WebDB content, working files
Fetchlists, fetched pages
Link analysis outputs, working files
Inverted indices
Jobs
Emit fetchlists, fetch, update WebDB
Run link analysis
Build inverted indices

Administering Nutch (2)
Admin sounds boring, but it’s not!
Really
I swear
Large-file maintenance
Google File System (Ghemawat, Gobioff, Leung)
Nutch Distributed File System
Job Control
Map/Reduce (Dean and Ghemawat)
Pig (Yahoo Research)
Data Storage (BigTable)

Nutch Distributed File System
Similar, but not identical, to GFS
Requirements are fairly strange
Extremely large files
Most files read once, from start to end
Low admin costs per GB
Equally strange design
Write-once, with delete
Single file can exist across many machines
Wholly automatic failure recovery

NDFS (2)
Data divided into blocks
Blocks can be copied, replicated
Datanodes hold and serve blocks
Namenode holds metainfo
Filename → block list
Block → datanode location(s)
Datanodes report in to the namenode every few seconds

NDFS File Read
Client asks the namenode for info about a file (“crawl.txt”)
Namenode responds with blocklist, and location(s) for each block
Client fetches each block, in sequence, from a datanode
[Diagram: for “crawl.txt”, the namenode returns block-33 on datanodes 1 and 4, block-95 on datanodes 0 and 2, and block-65 on datanodes 1, 4, and 5]
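A sketch of that read protocol; the Namenode and Datanode interfaces below are hypothetical stand-ins, not the real NDFS classes:

import java.io.ByteArrayOutputStream;
import java.util.List;

// Illustrative only: a client reads a file block by block, in block-list order.
public class NdfsReadSketch {
    interface Namenode { List<BlockLocation> getBlockList(String filename); }
    interface Datanode { byte[] readBlock(long blockId); }
    record BlockLocation(long blockId, List<Datanode> holders) {}

    static byte[] read(Namenode namenode, String filename) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        // 1. Ask the namenode for the block list and the datanodes holding each block.
        for (BlockLocation block : namenode.getBlockList(filename)) {
            // 2. Fetch each block, in sequence, from any datanode that has a copy.
            file.writeBytes(block.holders().get(0).readBlock(block.blockId()));
        }
        return file.toByteArray();
    }
}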

NDFS Replication
[Diagram: namenode plus datanodes 0-5 and the blocks each holds: datanode 0 (33, 95), datanode 1 (46, 95), datanode 2 (33, 104), datanode 3 (21, 33, 46), datanode 4 (90), datanode 5 (21, 90, 104)]
Always keep at least k copies of each block
Imagine datanode 4 dies; block 90 is lost
Namenode loses the heartbeat and decrements block 90’s reference count, then asks datanode 5 to replicate block 90 to datanode 0
Choosing the replication target is tricky
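A toy version of the namenode's bookkeeping, assuming it keeps a block → datanodes map in memory; the real target-picking policy weighs load, disk space, and topology:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: when a datanode's heartbeat stops, re-replicate its blocks.
public class ReplicationSketch {
    static final int K = 2;                                         // minimum copies per block
    final Map<Long, Set<String>> blockLocations = new HashMap<>();  // blockId -> datanode ids

    void onDatanodeDeath(String deadNode) {
        for (Map.Entry<Long, Set<String>> entry : blockLocations.entrySet()) {
            Set<String> holders = entry.getValue();
            if (holders.remove(deadNode) && !holders.isEmpty() && holders.size() < K) {
                String source = holders.iterator().next();
                String target = pickTarget(holders);                // the tricky part
                System.out.println("replicate block " + entry.getKey()
                        + " from " + source + " to " + target);
            }
        }
    }

    String pickTarget(Set<String> alreadyHolding) {
        return "some-idle-datanode";  // placeholder policy
    }
}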

Map/Reduce
Map/Reduce is a programming model from Lisp (and other places)
Easy to distribute across nodes
Nice retry/failure semantics
map(key, val) is run on each item in set
emits key/val pairs
reduce(key, vals) is run for each unique key emitted by map()
emits final output
Many problems can be phrased this way

Map/Reduce (2)
Task: count words in docs
Input consists of (url, contents) pairs
map(key=url, val=contents):
For each word w in contents, emit (w, “1”)
reduce(key=word, values=uniq_counts):
Sum all “1”s in values list
Emit result “(word, sum)”
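The same word count, written against the current Hadoop MapReduce Java API (this mirrors the standard Hadoop tutorial example, not the original Nutch-era code; the input key here is a byte offset rather than a URL):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map(key=offset, val=line): emit (w, 1) for each word w in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(key=word, values=counts): sum the 1s and emit (word, sum)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}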

Map/Reduce (3)
Task: grep
Input consists of (url+offset, single line) pairs
map(key=url+offset, val=line):
If the line matches the regexp, emit (line, “1”)
reduce(key=line, values=uniq_counts):
Don’t do anything; just emit line
We can also do graph inversion, link analysis, WebDB updates, etc.
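A matching sketch for the grep task above, in the same Hadoop API; the regexp is hard-coded for brevity, and since the reducer is just the identity only the mapper is shown:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: emit every input line that matches the pattern.
public class GrepMapperSketch extends Mapper<Object, Text, Text, NullWritable> {
    private static final Pattern PATTERN = Pattern.compile("britney");  // assumed query

    @Override
    public void map(Object key, Text line, Context context)
            throws IOException, InterruptedException {
        if (PATTERN.matcher(line.toString()).find()) {
            context.write(line, NullWritable.get());
        }
    }
}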

Map/Reduce (4)
How is this distributed?
Partition input key/value pairs into chunks, run map() tasks in parallel
After all map()s are complete, consolidate all emitted values for each unique emitted key
Now partition the space of map output keys, and run reduce() in parallel
If map() or reduce() fails, re-execute!

Map/Reduce Job Processing
[Diagram: a JobTracker coordinating TaskTrackers 0-5 on a “grep” job]
Client submits “grep” job, indicating code and input files
JobTracker breaks input file into k chunks (in this case 6). Assigns work to tasktrackers.
After map(), tasktrackers exchange map-output to build reduce() keyspace
JobTracker breaks reduce() keyspace into m chunks (in this case 6). Assigns work.
reduce() output may go to NDFS

Nutch & Hadoop
NDFS stores the crawl and indexes
MapReduce for indexing, parsing, WebDB construction, even fetching
Broke previous 200M/mo limit
Index-serving?
Required a massive rewrite of almost every Nutch component

Conclusion
http://www.nutch.org/
Partial documentation
Source code
Developer discussion board
“Lucene in Action” by Hatcher, Gospodnetic (or you can borrow mine)
Read the Google papers on GFS, MapReduce, and BigTable; we relied heavily on these papers.
Questions?