
Big Data.pdf (大数据.pdf)


sixiaolin067 2011-09-21 Rating: 0 Views: 0

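Summary: This document, "Big Data.pdf" (《大数据pdf》), is relevant to the IT/computing field. Its content includes Mining of Massive Datasets by Anand Rajaraman (Kosmix, Inc.) and Jeffrey D. Ullman (Stanford Univ.), among other material.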

Mining of Massive Datasets

Anand Rajaraman
Kosmix, Inc.

Jeffrey D. Ullman
Stanford Univ.

Copyright © Anand Rajaraman and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CSA, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

  • Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
  • Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
  • Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
  • The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
  • Frequent-itemset mining, including association rules, market baskets, the A-Priori Algorithm and its improvements.
  • Algorithms for clustering very large, high-dimensional datasets.
  • Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CSA, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning masters students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CSA are:

  • The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
  • A sophomore-level course in data structures, algorithms, and discrete math.
  • A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

You can find materials from past offerings of CSA at:

http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.

Acknowledgements

We would like to thank Foto Afrati and Arun Marathe for critical readings of the draft of this manuscript. Errors were also reported by Leland Chen, Shrey Gupta, Xie Ke, Brad Penoff, Philips Kokoh Prasetyo, Mark Storus, Tim Triche Jr., and Roshan Sumbaly. The remaining errors are ours, of course.

A. R.
J. D. U.
Palo Alto, CA
June

Contents

Data Mining
  What is Data Mining?
    Statistical Modeling
    Machine Learning
    Computational Approaches to Modeling
    Summarization
    Feature Extraction
  Statistical Limits on Data Mining
    Total Information Awareness
    Bonferroni’s Principle
    An Example of Bonferroni’s Principle
    Exercises for Section
  Things Useful to Know
    Importance of Words in Documents
    Hash Functions
    Indexes
    Secondary Storage
    The Base of Natural Logarithms
    Power Laws
    Exercises for Section
  Outline of the Book
  Summary of Chapter
  References for Chapter

Large-Scale File Systems and Map-Reduce
  Distributed File Systems
    Physical Organization of Compute Nodes
    Large-Scale File-System Organization
  Map-Reduce
    The Map Tasks
    Grouping and Aggregation
    The Reduce Tasks
    Combiners
    Details of Map-Reduce Execution
    Coping With Node Failures
  Algorithms Using Map-Reduce
    Matrix-Vector Multiplication by Map-Reduce
    If the Vector v Cannot Fit in Main Memory
    Relational-Algebra Operations
    Computing Selections by Map-Reduce
    Computing Projections by Map-Reduce
    Union, Intersection, and Difference by Map-Reduce
    Computing Natural Join by Map-Reduce
    Generalizing the Join Algorithm
    Grouping and Aggregation by Map-Reduce
    Matrix Multiplication
    Matrix Multiplication with One Map-Reduce Step
    Exercises for Section
  Extensions to Map-Reduce
    Workflow Systems
    Recursive Extensions to Map-Reduce
    Pregel
    Exercises for Section
  Efficiency of Cluster-Computing Algorithms
    The Communication-Cost Model for Cluster Computing
    Elapsed Communication Cost
    Multiway Joins
    Exercises for Section
  Summary of Chapter
  References for Chapter

Finding Similar Items
  Applications of Near-Neighbor Search
    Jaccard Similarity of Sets
    Similarity of Documents
    Collaborative Filtering as a Similar-Sets Problem
    Exercises for Section
  Shingling of Documents
    k-Shingles
    Choosing the Shingle Size
    Hashing Shingles
    Shingles Built from Words
    Exercises for Section
  Similarity-Preserving Summaries of Sets
    Matrix Representation of Sets
    Minhashing
    Minhashing and Jaccard Similarity
    Minhash Signatures
    Computing Minhash Signatures
    Exercises for Section
  Locality-Sensitive Hashing for Documents
    LSH for Minhash Signatures
    Analysis of the Banding Technique
    Combining the Techniques
    Exercises for Section
  Distance Measures
    Definition of a Distance Measure
    Euclidean Distances
    Jaccard Distance
    Cosine Distance
    Edit Distance
    Hamming Distance
    Exercises for Section
  The Theory of Locality-Sensitive Functions
    Locality-Sensitive Functions
    Locality-Sensitive Families for Jaccard Distance
    Amplifying a Locality-Sensitive Family
    Exercises for Section
  LSH Families for Other Distance Measures
    LSH Families for Hamming Distance
    Random Hyperplanes and the Cosine Distance
    Sketches
    LSH Families for Euclidean Distance
    More LSH Families for Euclidean Spaces
    Exercises for Section
  Applications of Locality-Sensitive Hashing
    Entity Resolution
    An Entity-Resolution Example
    Validating Record Matches
    Matching Fingerprints
    A LSH Family for Fingerprint Matching
    Similar News Articles
    Exercises for Section
  Methods for High Degrees of Similarity
    Finding Identical Items
    Representing Sets as Strings
    Length-Based Filtering
    Prefix Indexing
    Using Position Information
    Using Position and Length in Indexes
    Exercises for Section
  Summary of Chapter
  References for Chapter

Mining Data Streams
  The Stream Data Model
    A Data-Stream-Management System
    Examples of Stream Sources
    Stream Queries
    Issues in Stream Processing
  Sampling Data in a Stream
    A Motivating Example
    Obtaining a Representative Sample
    The General Sampling Problem
    Varying the Sample Size
    Exercises for Section
  Filtering Streams
    A Motivating Example
    The Bloom Filter
    Analysis of Bloom Filtering
    Exercises for Section
  Counting Distinct Elements in a Stream
    The Count-Distinct Problem
    The Flajolet-Martin Algorithm
    Combining Estimates
    Space Requirements
    Exercises for Section
  Estimating Moments
    Definition of Moments
    The Alon-Matias-Szegedy Algorithm for Second Moments
    Why the Alon-Matias-Szegedy Algorithm Works
    Higher-Order Moments
    Dealing With Infinite Streams
    Exercises for Section
  Counting Ones in a Window
    The Cost of Exact Counts
    The Datar-Gionis-Indyk-Motwani Algorithm
    Storage Requirements for the DGIM Algorithm
    Query Answering in the DGIM Algorithm
    Maintaining the DGIM Conditions
    Reducing the Error
    Extensions to the Counting of Ones
    Exercises for Section
  Decaying Windows
    The Problem of Most-Common Elements
    Definition of the Decaying Window
    Finding the Most Popular Elements
  Summary of Chapter
  References for Chapter

Link Analysis
  PageRank
    Early Search Engines and Term Spam
    Definition of PageRank
    Structure of the Web
    Avoiding Dead Ends
    Spider Traps and Taxation
    Using PageRank in a Search Engine
    Exercises for Section
  Efficient Computation of PageRank
    Representing Transition Matrices
    PageRank Iteration Using Map-Reduce
    Use of Combiners to Consolidate the Result Vector
    Representing Blocks of the Transition Matrix
    Other Efficient Approaches to PageRank Iteration
    Exercises for Section
  Topic-Sensitive PageRank
    Motivation for Topic-Sensitive PageRank
    Biased Random Walks
    Using Topic-Sensitive PageRank
    Inferring Topics from Words
    Exercises for Section
  Link Spam
    Architecture of a Spam Farm
    Analysis of a Spam Farm
    Combating Link Spam
    TrustRank
    Spam Mass
    Exercises for Section
  Hubs and Authorities
    The Intuition Behind HITS
    Formalizing Hubbiness and Authority
    Exercises for Section
  Summary of Chapter
  References for Chapter

Frequent Itemsets
  The Market-Basket Model
    Definition of Frequent Itemsets
    Applications of Frequent Itemsets
    Association Rules
    Finding Association Rules with High Confidence
    Exercises for Section
  Market Baskets and the A-Priori Algorithm
    Representation of Market-Basket Data
    Use of Main Memory for Itemset Counting
    Monotonicity of Itemsets
    Tyranny of Counting Pairs
    The A-Priori Algorithm
    A-Priori for All Frequent Itemsets
    Exercises for Section
  Handling Larger Datasets in Main Memory
    The Algorithm of Park, Chen, and Yu
    The Multistage Algorithm
    The Multihash Algorithm
    Exercises for Section
  Limited-Pass Algorithms
    The Simple, Randomized Algorithm
    Avoiding Errors in Sampling Algorithms
    The Algorithm of Savasere, Omiecinski, and Navathe
    The SON Algorithm and Map-Reduce
    Toivonen’s Algorithm
    Why Toivonen’s Algorithm Works
    Exercises for Section
  Counting Frequent Items in a Stream
    Sampling Methods for Streams
    Frequent Itemsets in Decaying Windows
    Hybrid Methods
    Exercises for Section
  Summary of Chapter

User comments (4)

  • Anonymous user 2016-10-13 09:22:55

    Good book. Too bad it is the English edition, so I will have to work through it slowly.

  • 10.44.7.248 2013-02-20 04:06:30

    Trading on a borrowed name to mislead people.

  • chang1880247 2012-11-16 02:04:40

    Not bad, thanks.

  • 10.44.7.248 2012-10-27 06:10:56

    This is not the one by Xu Zipei (徐子沛).
