
大数据.pdf (Big Data)


Uploaded by: sixiaolin067, 2011-09-21

Summary: This document, "大数据.pdf" (Big Data), is relevant to the IT/computing field. It contains Mining of Massive Datasets by Anand Rajaraman (Kosmix, Inc.) and Jeffrey D. Ullman (Stanford Univ.).

Mining of Massive Datasets

Anand Rajaraman
Kosmix, Inc.

Jeffrey D. Ullman
Stanford Univ.

Copyright © Anand Rajaraman and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CSA, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

- Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
- Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
- Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
- The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
- Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
- Algorithms for clustering very large, high-dimensional datasets.
- Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CSA, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning masters students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CSA are:

- The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
- A sophomore-level course in data structures, algorithms, and discrete math.
- A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

You can find materials from past offerings of CSA at:

http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.

Acknowledgements

We would like to thank Foto Afrati and Arun Marathe for critical readings of the draft of this manuscript. Errors were also reported by Leland Chen, Shrey Gupta, Xie Ke, Brad Penoff, Philips Kokoh Prasetyo, Mark Storus, Tim Triche Jr., and Roshan Sumbaly. The remaining errors are ours, of course.

A. R.
J. D. U.
Palo Alto, CA
June

Contents

Data Mining
  What is Data Mining?
    Statistical Modeling
    Machine Learning
    Computational Approaches to Modeling
    Summarization
    Feature Extraction
  Statistical Limits on Data Mining
    Total Information Awareness
    Bonferroni’s Principle
    An Example of Bonferroni’s Principle
    Exercises for Section
  Things Useful to Know
    Importance of Words in Documents
    Hash Functions
    Indexes
    Secondary Storage
    The Base of Natural Logarithms
    Power Laws
    Exercises for Section
  Outline of the Book
  Summary of Chapter
  References for Chapter

Large-Scale File Systems and Map-Reduce
  Distributed File Systems
    Physical Organization of Compute Nodes
    Large-Scale File-System Organization
  Map-Reduce
    The Map Tasks
    Grouping and Aggregation
    The Reduce Tasks
    Combiners
    Details of Map-Reduce Execution
    Coping With Node Failures
  Algorithms Using Map-Reduce
    Matrix-Vector Multiplication by Map-Reduce
    If the Vector v Cannot Fit in Main Memory
    Relational-Algebra Operations
    Computing Selections by Map-Reduce
    Computing Projections by Map-Reduce
    Union, Intersection, and Difference by Map-Reduce
    Computing Natural Join by Map-Reduce
    Generalizing the Join Algorithm
    Grouping and Aggregation by Map-Reduce
    Matrix Multiplication
    Matrix Multiplication with One Map-Reduce Step
    Exercises for Section
  Extensions to Map-Reduce
    Workflow Systems
    Recursive Extensions to Map-Reduce
    Pregel
    Exercises for Section
  Efficiency of Cluster-Computing Algorithms
    The Communication-Cost Model for Cluster Computing
    Elapsed Communication Cost
    Multiway Joins
    Exercises for Section
  Summary of Chapter
  References for Chapter

Finding Similar Items
  Applications of Near-Neighbor Search
    Jaccard Similarity of Sets
    Similarity of Documents
    Collaborative Filtering as a Similar-Sets Problem
    Exercises for Section
  Shingling of Documents
    k-Shingles
    Choosing the Shingle Size
    Hashing Shingles
    Shingles Built from Words
    Exercises for Section
  Similarity-Preserving Summaries of Sets
    Matrix Representation of Sets
    Minhashing
    Minhashing and Jaccard Similarity
    Minhash Signatures
    Computing Minhash Signatures
    Exercises for Section
  Locality-Sensitive Hashing for Documents
    LSH for Minhash Signatures
    Analysis of the Banding Technique
    Combining the Techniques
    Exercises for Section
  Distance Measures
    Definition of a Distance Measure
    Euclidean Distances
    Jaccard Distance
    Cosine Distance
    Edit Distance
    Hamming Distance
    Exercises for Section
  The Theory of Locality-Sensitive Functions
    Locality-Sensitive Functions
    Locality-Sensitive Families for Jaccard Distance
    Amplifying a Locality-Sensitive Family
    Exercises for Section
  LSH Families for Other Distance Measures
    LSH Families for Hamming Distance
    Random Hyperplanes and the Cosine Distance
    Sketches
    LSH Families for Euclidean Distance
    More LSH Families for Euclidean Spaces
    Exercises for Section
  Applications of Locality-Sensitive Hashing
    Entity Resolution
    An Entity-Resolution Example
    Validating Record Matches
    Matching Fingerprints
    A LSH Family for Fingerprint Matching
    Similar News Articles
    Exercises for Section
  Methods for High Degrees of Similarity
    Finding Identical Items
    Representing Sets as Strings
    Length-Based Filtering
    Prefix Indexing
    Using Position Information
    Using Position and Length in Indexes
    Exercises for Section
  Summary of Chapter
  References for Chapter

Mining Data Streams
  The Stream Data Model
    A Data-Stream-Management System
    Examples of Stream Sources
    Stream Queries
    Issues in Stream Processing
  Sampling Data in a Stream
    A Motivating Example
    Obtaining a Representative Sample
    The General Sampling Problem
    Varying the Sample Size
    Exercises for Section
  Filtering Streams
    A Motivating Example
    The Bloom Filter
    Analysis of Bloom Filtering
    Exercises for Section
  Counting Distinct Elements in a Stream
    The Count-Distinct Problem
    The Flajolet-Martin Algorithm
    Combining Estimates
    Space Requirements
    Exercises for Section
  Estimating Moments
    Definition of Moments
    The Alon-Matias-Szegedy Algorithm for Second Moments
    Why the Alon-Matias-Szegedy Algorithm Works
    Higher-Order Moments
    Dealing With Infinite Streams
    Exercises for Section
  Counting Ones in a Window
    The Cost of Exact Counts
    The Datar-Gionis-Indyk-Motwani Algorithm
    Storage Requirements for the DGIM Algorithm
    Query Answering in the DGIM Algorithm
    Maintaining the DGIM Conditions
    Reducing the Error
    Extensions to the Counting of Ones
    Exercises for Section
  Decaying Windows
    The Problem of Most-Common Elements
    Definition of the Decaying Window
    Finding the Most Popular Elements
  Summary of Chapter
  References for Chapter

Link Analysis
  PageRank
    Early Search Engines and Term Spam
    Definition of PageRank
    Structure of the Web
    Avoiding Dead Ends
    Spider Traps and Taxation
    Using PageRank in a Search Engine
    Exercises for Section
  Efficient Computation of PageRank
    Representing Transition Matrices
    PageRank Iteration Using Map-Reduce
    Use of Combiners to Consolidate the Result Vector
    Representing Blocks of the Transition Matrix
    Other Efficient Approaches to PageRank Iteration
    Exercises for Section
  Topic-Sensitive PageRank
    Motivation for Topic-Sensitive PageRank
    Biased Random Walks
    Using Topic-Sensitive PageRank
    Inferring Topics from Words
    Exercises for Section
  Link Spam
    Architecture of a Spam Farm
    Analysis of a Spam Farm
    Combating Link Spam
    TrustRank
    Spam Mass
    Exercises for Section
  Hubs and Authorities
    The Intuition Behind HITS
    Formalizing Hubbiness and Authority
    Exercises for Section
  Summary of Chapter
  References for Chapter

Frequent Itemsets
  The Market-Basket Model
    Definition of Frequent Itemsets
    Applications of Frequent Itemsets
    Association Rules
    Finding Association Rules with High Confidence
    Exercises for Section
  Market Baskets and the A-Priori Algorithm
    Representation of Market-Basket Data
    Use of Main Memory for Itemset Counting
    Monotonicity of Itemsets
    Tyranny of Counting Pairs
    The A-Priori Algorithm
    A-Priori for All Frequent Itemsets
    Exercises for Section
  Handling Larger Datasets in Main Memory
    The Algorithm of Park, Chen, and Yu
    The Multistage Algorithm
    The Multihash Algorithm
    Exercises for Section
  Limited-Pass Algorithms
    The Simple, Randomized Algorithm
    Avoiding Errors in Sampling Algorithms
    The Algorithm of Savasere, Omiecinski, and Navathe
    The SON Algorithm and Map-Reduce
    Toivonen’s Algorithm
    Why Toivonen’s Algorithm Works
    Exercises for Section
  Counting Frequent Items in a Stream
    Sampling Methods for Streams
    Frequent Itemsets in Decaying Windows
    Hybrid Methods
    Exercises for Section
  Summary of Chapter
