Hadoop.The.Definitive.Guide.pdf


Uploaded by: zgq110486, 2010-12-18

Summary: This document, Hadoop.The.Definitive.Guide.pdf, is relevant to the IT/computing field.

Hadoop: The Definitive Guide
Tom White
foreword by Doug Cutting

Beijing • Cambridge • Farnham • Köln • Sebastopol • Taipei • Tokyo

Copyright Tom White. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., Gravenstein Highway North, Sebastopol, CA.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Loranah Dimant
Proofreader: Nancy Kotary
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
June: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an African elephant, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

For Eliane, Emilia, and Lottie

Table of Contents

Foreword

Preface

Meet Hadoop
    Data!; Data Storage and Analysis; Comparison with Other Systems; RDBMS; Grid Computing; Volunteer Computing; A Brief History of Hadoop; The Apache Hadoop Project

MapReduce
    A Weather Dataset; Data Format; Analyzing the Data with Unix Tools; Analyzing the Data with Hadoop; Map and Reduce; Java MapReduce; Scaling Out; Data Flow; Combiner Functions; Running a Distributed MapReduce Job; Hadoop Streaming; Ruby; Python; Hadoop Pipes; Compiling and Running

The Hadoop Distributed Filesystem
    The Design of HDFS; HDFS Concepts; Blocks; Namenodes and Datanodes; The Command-Line Interface; Basic Filesystem Operations; Hadoop Filesystems; Interfaces; The Java Interface; Reading Data from a Hadoop URL; Reading Data Using the FileSystem API; Writing Data; Directories; Querying the Filesystem; Deleting Data; Data Flow; Anatomy of a File Read; Anatomy of a File Write; Coherency Model; Parallel Copying with distcp; Keeping an HDFS Cluster Balanced; Hadoop Archives; Using Hadoop Archives; Limitations

Hadoop I/O
    Data Integrity; Data Integrity in HDFS; LocalFileSystem; ChecksumFileSystem; Compression; Codecs; Compression and Input Splits; Using Compression in MapReduce; Serialization; The Writable Interface; Writable Classes; Implementing a Custom Writable; Serialization Frameworks; File-Based Data Structures; SequenceFile; MapFile

Developing a MapReduce Application
    The Configuration API; Combining Resources; Variable Expansion; Configuring the Development Environment; Managing Configuration; GenericOptionsParser, Tool, and ToolRunner; Writing a Unit Test; Mapper; Reducer; Running Locally on Test Data; Running a Job in a Local Job Runner; Testing the Driver; Running on a Cluster; Packaging; Launching a Job; The MapReduce Web UI; Retrieving the Results; Debugging a Job; Using a Remote Debugger; Tuning a Job; Profiling Tasks; MapReduce Workflows; Decomposing a Problem into MapReduce Jobs; Running Dependent Jobs

How MapReduce Works
    Anatomy of a MapReduce Job Run; Job Submission; Job Initialization; Task Assignment; Task Execution; Progress and Status Updates; Job Completion; Failures; Task Failure; Tasktracker Failure; Jobtracker Failure; Job Scheduling; The Fair Scheduler; Shuffle and Sort; The Map Side; The Reduce Side; Configuration Tuning; Task Execution; Speculative Execution; Task JVM Reuse; Skipping Bad Records; The Task Execution Environment

MapReduce Types and Formats
    MapReduce Types; The Default MapReduce Job; Input Formats; Input Splits and Records; Text Input; Binary Input; Multiple Inputs; Database Input (and Output); Output Formats; Text Output; Binary Output; Multiple Outputs; Lazy Output; Database Output

MapReduce Features
    Counters; Built-in Counters; User-Defined Java Counters; User-Defined Streaming Counters; Sorting; Preparation; Partial Sort; Total Sort; Secondary Sort; Joins; Map-Side Joins; Reduce-Side Joins; Side Data Distribution; Using the Job Configuration; Distributed Cache; MapReduce Library Classes

Setting Up a Hadoop Cluster
    Cluster Specification; Network Topology; Cluster Setup and Installation; Installing Java; Creating a Hadoop User; Installing Hadoop; Testing the Installation; SSH Configuration; Hadoop Configuration; Configuration Management; Environment Settings; Important Hadoop Daemon Properties; Hadoop Daemon Addresses and Ports; Other Hadoop Properties; Post Install; Benchmarking a Hadoop Cluster; Hadoop Benchmarks; User Jobs; Hadoop in the Cloud; Hadoop on Amazon EC2

Administering Hadoop
    HDFS; Persistent Data Structures; Safe Mode; Audit Logging; Tools; Monitoring; Logging; Metrics; Java Management Extensions; Maintenance; Routine Administration Procedures; Commissioning and Decommissioning Nodes; Upgrades

Pig
    Installing and Running Pig; Execution Types; Running Pig Programs; Grunt; Pig Latin Editors; An Example; Generating Examples; Comparison with Databases; Pig Latin; Structure; Statements; Expressions; Types; Schemas; Functions; User-Defined Functions; A Filter UDF; An Eval UDF; A Load UDF; Data Processing Operators; Loading and Storing Data; Filtering Data; Grouping and Joining Data; Sorting Data; Combining and Splitting Data; Pig in Practice; Parallelism; Parameter Substitution

HBase
    HBasics; Backdrop; Concepts; Whirlwind Tour of the Data Model; Implementation; Installation; Test Drive; Clients; Java; REST and Thrift; Example; Schemas; Loading Data; Web Queries; HBase Versus RDBMS; Successful Service; HBase; Use Case: HBase at streamy.com; Praxis; Versions; Love and Hate: HBase and HDFS; UI; Metrics; Schema Design

ZooKeeper
    Installing and Running ZooKeeper; An Example; Group Membership in ZooKeeper; Creating the Group; Joining a Group; Listing Members in a Group; Deleting a Group; The ZooKeeper Service; Data Model; Operations; Implementation; Consistency; Sessions; States; Building Applications with ZooKeeper; A Configuration Service; The Resilient ZooKeeper Application; A Lock Service; More Distributed Data Structures and Protocols; ZooKeeper in Production; Resilience and Performance; Configuration

Case Studies
    Hadoop Usage at Last.fm; Last.fm: The Social Music Revolution; Hadoop at Last.fm; Generating Charts with Hadoop; The Track Statistics Program; Summary; Hadoop and Hive at Facebook; Introduction; Hadoop at Facebook; Hypothetical Use Case Studies; Hive; Problems and Future Work; Nutch Search Engine; Background; Data Structures; Selected Examples of Hadoop Data Processing in Nutch; Summary; Log Processing at Rackspace; Requirements/The Problem; Brief History; Choosing Hadoop; Collection and Storage; MapReduce for Logs; Cascading; Fields, Tuples, and Pipes; Operations; Taps, Schemes, and Flows; Cascading in Practice; Flexibility; Hadoop and Cascading at ShareThis; Summary; TeraByte Sort on Apache Hadoop

A. Installing Apache Hadoop

B. Cloudera’s Distribution for Hadoop

C. Preparing the NCDC Weather Data

Index

Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to recreate these systems as a part of Nutch.

We managed to get Nutch limping along on machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master not only of the technology, but also of common sense and plain talk.

Doug Cutting
Shed in the Yard, California

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

    Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.*

* “The science of fun,” Alex Bellos, The Guardian, May, http:wwwguardiancouksciencemaymathsscience

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems for data storage, data analysis, and coordination are simple. If there’s a common theme, it is about raising the level of abstraction to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name, to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org. Or if you’re using an IDE, it can help using its auto-complete mechanism.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book

The rest of this book is organized as follows. Chapter provides an introduction to MapReduce. Chapter looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.

The next four chapters cover MapReduce in depth. Chapter goes through the practical steps needed to develop a MapReduce application. Chapter looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter is on advanced MapReduce topics, including sorting and joining data.

Chapters and are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.

Chapters, , and present Pig, HBase, and ZooKeeper, respectively.

Finally, Chapter is a collection of case studies contributed by members of the Apache Hadoop community.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hadoop: The Definitive Guide, by Tom White. Copyright Tom White,”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari Books Online

When you see a Safari Books Online icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech b
