Mining of Massive Datasets

pais861009
2012-07-05

Summary: This document is "Mining of Massive Datasets.pdf" and is relevant to the IT/computing field.

Mining of Massive Datasets

Anand Rajaraman
Kosmix, Inc.

Jeffrey D. Ullman
Stanford Univ.

Copyright © Anand Rajaraman and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CSA, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CSA, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning masters students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CSA are:

1. The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
2. A sophomore-level course in data structures, algorithms, and discrete math.
3. A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

You can find materials from past offerings of CSA at:

http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.

Acknowledgements

We would like to thank Foto Afrati and Arun Marathe for critical readings of the draft of this manuscript. Errors were also reported by Leland Chen, Shrey Gupta, Xie Ke, Brad Penoff, Philips Kokoh Prasetyo, Mark Storus, Tim Triche Jr., and Roshan Sumbaly. The remaining errors are ours, of course.

A. R.
J. D. U.
Palo Alto, CA
June

Contents

Data Mining
    What is Data Mining?
        Statistical Modeling
        Machine Learning
        Computational Approaches to Modeling
        Summarization
        Feature Extraction
    Statistical Limits on Data Mining
        Total Information Awareness
        Bonferroni’s Principle
        An Example of Bonferroni’s Principle
        Exercises for Section
    Things Useful to Know
        Importance of Words in Documents
        Hash Functions
        Indexes
        Secondary Storage
        The Base of Natural Logarithms
        Power Laws
        Exercises for Section
    Outline of the Book
    Summary of Chapter
    References for Chapter

Large-Scale File Systems and Map-Reduce
    Distributed File Systems
        Physical Organization of Compute Nodes
        Large-Scale File-System Organization
    Map-Reduce
        The Map Tasks
        Grouping and Aggregation
        The Reduce Tasks
        Combiners
        Details of Map-Reduce Execution
        Coping With Node Failures
    Algorithms Using Map-Reduce
        Matrix-Vector Multiplication by Map-Reduce
        If the Vector v Cannot Fit in Main Memory
        Relational-Algebra Operations
        Computing Selections by Map-Reduce
        Computing Projections by Map-Reduce
        Union, Intersection, and Difference by Map-Reduce
        Computing Natural Join by Map-Reduce
        Generalizing the Join Algorithm
        Grouping and Aggregation by Map-Reduce
        Matrix Multiplication
        Matrix Multiplication with One Map-Reduce Step
        Exercises for Section
    Extensions to Map-Reduce
        Workflow Systems
        Recursive Extensions to Map-Reduce
        Pregel
        Exercises for Section
    Efficiency of Cluster-Computing Algorithms
        The Communication-Cost Model for Cluster Computing
        Elapsed Communication Cost
        Multiway Joins
        Exercises for Section
    Summary of Chapter
    References for Chapter

Finding Similar Items
    Applications of Near-Neighbor Search
        Jaccard Similarity of Sets
        Similarity of Documents
        Collaborative Filtering as a Similar-Sets Problem
        Exercises for Section
    Shingling of Documents
        k-Shingles
        Choosing the Shingle Size
        Hashing Shingles
        Shingles Built from Words
        Exercises for Section
    Similarity-Preserving Summaries of Sets
        Matrix Representation of Sets
        Minhashing
        Minhashing and Jaccard Similarity
        Minhash Signatures
        Computing Minhash Signatures
        Exercises for Section
    Locality-Sensitive Hashing for Documents
        LSH for Minhash Signatures
        Analysis of the Banding Technique
        Combining the Techniques
        Exercises for Section
    Distance Measures
        Definition of a Distance Measure
        Euclidean Distances
        Jaccard Distance
        Cosine Distance
        Edit Distance
        Hamming Distance
        Exercises for Section
    The Theory of Locality-Sensitive Functions
        Locality-Sensitive Functions
        Locality-Sensitive Families for Jaccard Distance
        Amplifying a Locality-Sensitive Family
        Exercises for Section
    LSH Families for Other Distance Measures
        LSH Families for Hamming Distance
        Random Hyperplanes and the Cosine Distance
        Sketches
        LSH Families for Euclidean Distance
        More LSH Families for Euclidean Spaces
        Exercises for Section
    Applications of Locality-Sensitive Hashing
        Entity Resolution
        An Entity-Resolution Example
        Validating Record Matches
        Matching Fingerprints
        A LSH Family for Fingerprint Matching
        Similar News Articles
        Exercises for Section
    Methods for High Degrees of Similarity
        Finding Identical Items
        Representing Sets as Strings
        Length-Based Filtering
        Prefix Indexing
        Using Position Information
        Using Position and Length in Indexes
        Exercises for Section
    Summary of Chapter
    References for Chapter

Mining Data Streams
    The Stream Data Model
        A Data-Stream-Management System
        Examples of Stream Sources
        Stream Queries
        Issues in Stream Processing
    Sampling Data in a Stream
        A Motivating Example
        Obtaining a Representative Sample
        The General Sampling Problem
        Varying the Sample Size
        Exercises for Section
    Filtering Streams
        A Motivating Example
        The Bloom Filter
        Analysis of Bloom Filtering
        Exercises for Section
    Counting Distinct Elements in a Stream
        The Count-Distinct Problem
        The Flajolet-Martin Algorithm
        Combining Estimates
        Space Requirements
        Exercises for Section
    Estimating Moments
        Definition of Moments
        The Alon-Matias-Szegedy Algorithm for Second Moments
        Why the Alon-Matias-Szegedy Algorithm Works
        Higher-Order Moments
        Dealing With Infinite Streams
        Exercises for Section
    Counting Ones in a Window
        The Cost of Exact Counts
        The Datar-Gionis-Indyk-Motwani Algorithm
        Storage Requirements for the DGIM Algorithm
        Query Answering in the DGIM Algorithm
        Maintaining the DGIM Conditions
        Reducing the Error
        Extensions to the Counting of Ones
        Exercises for Section
    Decaying Windows
        The Problem of Most-Common Elements
        Definition of the Decaying Window
        Finding the Most Popular Elements
    Summary of Chapter
    References for Chapter

Link Analysis
    PageRank
        Early Search Engines and Term Spam
        Definition of PageRank
        Structure of the Web
        Avoiding Dead Ends
        Spider Traps and Taxation
        Using PageRank in a Search Engine
        Exercises for Section
    Efficient Computation of PageRank
        Representing Transition Matrices
        PageRank Iteration Using Map-Reduce
        Use of Combiners to Consolidate the Result Vector
        Representing Blocks of the Transition Matrix
        Other Efficient Approaches to PageRank Iteration
        Exercises for Section
    Topic-Sensitive PageRank
        Motivation for Topic-Sensitive PageRank
        Biased Random Walks
        Using Topic-Sensitive PageRank
        Inferring Topics from Words
        Exercises for Section
    Link Spam
        Architecture of a Spam Farm
        Analysis of a Spam Farm
        Combating Link Spam
        TrustRank
        Spam Mass
        Exercises for Section
    Hubs and Authorities
        The Intuition Behind HITS
        Formalizing Hubbiness and Authority
        Exercises for Section
    Summary of Chapter
    References for Chapter

Frequent Itemsets
    The Market-Basket Model
        Definition of Frequent Itemsets
        Applications of Frequent Itemsets
        Association Rules
        Finding Association Rules with High Confidence
        Exercises for Section
    Market Baskets and the A-Priori Algorithm
        Representation of Market-Basket Data
        Use of Main Memory for Itemset Counting
        Monotonicity of Itemsets
        Tyranny of Counting Pairs
        The A-Priori Algorithm
        A-Priori for All Frequent Itemsets
        Exercises for Section
    Handling Larger Datasets in Main Memory
        The Algorithm of Park, Chen, and Yu
        The Multistage Algorithm
        The Multihash Algorithm
        Exercises for Section
    Limited-Pass Algorithms
        The Simple, Randomized Algorithm
        Avoiding Errors in Sampling Algorithms
        The Algorithm of Savasere, Omiecinski, and Navathe
        The SON Algorithm and Map-Reduce
        Toivonen’s Algorithm
        Why Toivonen’s Algorithm Works
        Exercises for Section
    Counting Frequent Items in a Stream
        Sampling Methods for Streams
        Frequent Itemsets in Decaying Windows
        Hybrid Methods
        Exercises for Section
    Summary of Chapter
