Big Data (大数据.pdf)

sixiaolin067
2011-09-21

Summary: This document, "Big Data" (PDF), is suited to the IT/computing field.

Mining of Massive Datasets

Anand Rajaraman
Kosmix, Inc.

Jeffrey D. Ullman
Stanford Univ.

Copyright © Anand Rajaraman and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CSA, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

  • Distributed file systems and MapReduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
  • Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
  • Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
  • The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
  • Frequent-itemset mining, including association rules, market baskets, the A-Priori Algorithm and its improvements.
  • Algorithms for clustering very large, high-dimensional datasets.
  • Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CSA, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning masters students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CSA are:

  • The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
  • A sophomore-level course in data structures, algorithms, and discrete math.
  • A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

You can find materials from past offerings of CSA at:

http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.

Acknowledgements

We would like to thank Foto Afrati and Arun Marathe for critical readings of the draft of this manuscript. Errors were also reported by Leland Chen, Shrey Gupta, Xie Ke, Brad Penoff, Philips Kokoh Prasetyo, Mark Storus, Tim Triche Jr., and Roshan Sumbaly. The remaining errors are ours, of course.

A. R.
J. D. U.
Palo Alto, CA
June

Contents

Data Mining
  What is Data Mining
    Statistical Modeling
    Machine Learning
    Computational Approaches to Modeling
    Summarization
    Feature Extraction
  Statistical Limits on Data Mining
    Total Information Awareness
    Bonferroni’s Principle
    An Example of Bonferroni’s Principle
    Exercises for Section
  Things Useful to Know
    Importance of Words in Documents
    Hash Functions
    Indexes
    Secondary Storage
    The Base of Natural Logarithms
    Power Laws
    Exercises for Section
  Outline of the Book
  Summary of Chapter
  References for Chapter

Large-Scale File Systems and MapReduce
  Distributed File Systems
    Physical Organization of Compute Nodes
    Large-Scale File-System Organization
  MapReduce
    The Map Tasks
    Grouping and Aggregation
    The Reduce Tasks
    Combiners
    Details of MapReduce Execution
    Coping With Node Failures
  Algorithms Using MapReduce
    Matrix-Vector Multiplication by MapReduce
    If the Vector v Cannot Fit in Main Memory
    Relational-Algebra Operations
    Computing Selections by MapReduce
    Computing Projections by MapReduce
    Union, Intersection, and Difference by MapReduce
    Computing Natural Join by MapReduce
    Generalizing the Join Algorithm
    Grouping and Aggregation by MapReduce
    Matrix Multiplication
    Matrix Multiplication with One MapReduce Step
    Exercises for Section
  Extensions to MapReduce
    Workflow Systems
    Recursive Extensions to MapReduce
    Pregel
    Exercises for Section
  Efficiency of Cluster-Computing Algorithms
    The Communication-Cost Model for Cluster Computing
    Elapsed Communication Cost
    Multiway Joins
    Exercises for Section
  Summary of Chapter
  References for Chapter

Finding Similar Items
  Applications of Near-Neighbor Search
    Jaccard Similarity of Sets
    Similarity of Documents
    Collaborative Filtering as a Similar-Sets Problem
    Exercises for Section
  Shingling of Documents
    k-Shingles
    Choosing the Shingle Size
    Hashing Shingles
    Shingles Built from Words
    Exercises for Section
  Similarity-Preserving Summaries of Sets
    Matrix Representation of Sets
    Minhashing
    Minhashing and Jaccard Similarity
    Minhash Signatures
    Computing Minhash Signatures
    Exercises for Section
  Locality-Sensitive Hashing for Documents
    LSH for Minhash Signatures
    Analysis of the Banding Technique
    Combining the Techniques
    Exercises for Section
  Distance Measures
    Definition of a Distance Measure
    Euclidean Distances
    Jaccard Distance
    Cosine Distance
    Edit Distance
    Hamming Distance
    Exercises for Section
  The Theory of Locality-Sensitive Functions
    Locality-Sensitive Functions
    Locality-Sensitive Families for Jaccard Distance
    Amplifying a Locality-Sensitive Family
    Exercises for Section
  LSH Families for Other Distance Measures
    LSH Families for Hamming Distance
    Random Hyperplanes and the Cosine Distance
    Sketches
    LSH Families for Euclidean Distance
    More LSH Families for Euclidean Spaces
    Exercises for Section
  Applications of Locality-Sensitive Hashing
    Entity Resolution
    An Entity-Resolution Example
    Validating Record Matches
    Matching Fingerprints
    A LSH Family for Fingerprint Matching
    Similar News Articles
    Exercises for Section
  Methods for High Degrees of Similarity
    Finding Identical Items
    Representing Sets as Strings
    Length-Based Filtering
    Prefix Indexing
    Using Position Information
    Using Position and Length in Indexes
    Exercises for Section
  Summary of Chapter
  References for Chapter

Mining Data Streams
  The Stream Data Model
    A Data-Stream-Management System
    Examples of Stream Sources
    Stream Queries
    Issues in Stream Processing
  Sampling Data in a Stream
    A Motivating Example
    Obtaining a Representative Sample
    The General Sampling Problem
    Varying the Sample Size
    Exercises for Section
  Filtering Streams
    A Motivating Example
    The Bloom Filter
    Analysis of Bloom Filtering
    Exercises for Section
  Counting Distinct Elements in a Stream
    The Count-Distinct Problem
    The Flajolet-Martin Algorithm
    Combining Estimates
    Space Requirements
    Exercises for Section
  Estimating Moments
    Definition of Moments
    The Alon-Matias-Szegedy Algorithm for Second Moments
    Why the Alon-Matias-Szegedy Algorithm Works
    Higher-Order Moments
    Dealing With Infinite Streams
    Exercises for Section
  Counting Ones in a Window
    The Cost of Exact Counts
    The Datar-Gionis-Indyk-Motwani Algorithm
    Storage Requirements for the DGIM Algorithm
    Query Answering in the DGIM Algorithm
    Maintaining the DGIM Conditions
    Reducing the Error
    Extensions to the Counting of Ones
    Exercises for Section
  Decaying Windows
    The Problem of Most-Common Elements
    Definition of the Decaying Window
    Finding the Most Popular Elements
  Summary of Chapter
  References for Chapter

Link Analysis
  PageRank
    Early Search Engines and Term Spam
    Definition of PageRank
    Structure of the Web
    Avoiding Dead Ends
    Spider Traps and Taxation
    Using PageRank in a Search Engine
    Exercises for Section
  Efficient Computation of PageRank
    Representing Transition Matrices
    PageRank Iteration Using MapReduce
    Use of Combiners to Consolidate the Result Vector
    Representing Blocks of the Transition Matrix
    Other Efficient Approaches to PageRank Iteration
    Exercises for Section
  Topic-Sensitive PageRank
    Motivation for Topic-Sensitive PageRank
    Biased Random Walks
    Using Topic-Sensitive PageRank
    Inferring Topics from Words
    Exercises for Section
  Link Spam
    Architecture of a Spam Farm
    Analysis of a Spam Farm
    Combating Link Spam
    TrustRank
    Spam Mass
    Exercises for Section
  Hubs and Authorities
    The Intuition Behind HITS
    Formalizing Hubbiness and Authority
    Exercises for Section
  Summary of Chapter
  References for Chapter

Frequent Itemsets
  The Market-Basket Model
    Definition of Frequent Itemsets
    Applications of Frequent Itemsets
    Association Rules
    Finding Association Rules with High Confidence
    Exercises for Section
  Market Baskets and the A-Priori Algorithm
    Representation of Market-Basket Data
    Use of Main Memory for Itemset Counting
    Monotonicity of Itemsets
    Tyranny of Counting Pairs
    The A-Priori Algorithm
    A-Priori for All Frequent Itemsets
    Exercises for Section
  Handling Larger Datasets in Main Memory
    The Algorithm of Park, Chen, and Yu
    The Multistage Algorithm
    The Multihash Algorithm
    Exercises for Section
  Limited-Pass Algorithms
    The Simple, Randomized Algorithm
    Avoiding Errors in Sampling Algorithms
    The Algorithm of Savasere, Omiecinski, and Navathe
    The SON Algorithm and MapReduce
    Toivonen’s Algorithm
    Why Toivonen’s Algorithm Works
    Exercises for Section
  Counting Frequent Items in a Stream
    Sampling Methods for Streams
    Frequent Itemsets in Decaying Windows
    Hybrid Methods
    Exercises for Section
  Summary of Chapter

User reviews (4)

  • Anonymous user: Good book; a pity it is the English edition, so it takes slow, careful reading.

    2016-10-13 09:22:55

  • 10.44.7.248: Trades on a borrowed name to mislead people.

    2013-02-20 04:06:30

  • chang1880247: Not bad, thanks.

    2012-11-16 02:04:40

  • 10.44.7.248: This is not the one by 徐子沛 (Xu Zipei).

    2012-10-27 06:10:56
