大数据.pdf (Big Data)

sixiaolin067
2011-09-21

Summary: this document, "大数据pdf" (Big Data PDF), is relevant to the IT/computing field.

Mining of Massive Datasets

Anand Rajaraman
Kosmix, Inc.

Jeffrey D. Ullman
Stanford Univ.

Copyright © Anand Rajaraman and Jeffrey D. Ullman

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CSA, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

  • Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
  • Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
  • Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
  • The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
  • Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
  • Algorithms for clustering very large, high-dimensional datasets.
  • Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CSA, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning masters students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CSA are:

  • The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
  • A sophomore-level course in data structures, algorithms, and discrete math.
  • A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

You can find materials from past offerings of CSA at:

http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and in some cases, exams.

Acknowledgements

We would like to thank Foto Afrati and Arun Marathe for critical readings of the draft of this manuscript. Errors were also reported by Leland Chen, Shrey Gupta, Xie Ke, Brad Penoff, Philips Kokoh Prasetyo, Mark Storus, Tim Triche Jr., and Roshan Sumbaly. The remaining errors are ours, of course.

A. R.
J. D. U.
Palo Alto, CA
June

Contents

Data Mining
  • What is Data Mining?
      • Statistical Modeling
      • Machine Learning
      • Computational Approaches to Modeling
      • Summarization
      • Feature Extraction
  • Statistical Limits on Data Mining
      • Total Information Awareness
      • Bonferroni’s Principle
      • An Example of Bonferroni’s Principle
      • Exercises for Section
  • Things Useful to Know
      • Importance of Words in Documents
      • Hash Functions
      • Indexes
      • Secondary Storage
      • The Base of Natural Logarithms
      • Power Laws
      • Exercises for Section
  • Outline of the Book
  • Summary of Chapter
  • References for Chapter

Large-Scale File Systems and Map-Reduce
  • Distributed File Systems
      • Physical Organization of Compute Nodes
      • Large-Scale File-System Organization
  • Map-Reduce
      • The Map Tasks
      • Grouping and Aggregation
      • The Reduce Tasks
      • Combiners
      • Details of Map-Reduce Execution
      • Coping With Node Failures
  • Algorithms Using Map-Reduce
      • Matrix-Vector Multiplication by Map-Reduce
      • If the Vector v Cannot Fit in Main Memory
      • Relational-Algebra Operations
      • Computing Selections by Map-Reduce
      • Computing Projections by Map-Reduce
      • Union, Intersection, and Difference by Map-Reduce
      • Computing Natural Join by Map-Reduce
      • Generalizing the Join Algorithm
      • Grouping and Aggregation by Map-Reduce
      • Matrix Multiplication
      • Matrix Multiplication with One Map-Reduce Step
      • Exercises for Section
  • Extensions to Map-Reduce
      • Workflow Systems
      • Recursive Extensions to Map-Reduce
      • Pregel
      • Exercises for Section
  • Efficiency of Cluster-Computing Algorithms
      • The Communication-Cost Model for Cluster Computing
      • Elapsed Communication Cost
      • Multiway Joins
      • Exercises for Section
  • Summary of Chapter
  • References for Chapter

Finding Similar Items
  • Applications of Near-Neighbor Search
      • Jaccard Similarity of Sets
      • Similarity of Documents
      • Collaborative Filtering as a Similar-Sets Problem
      • Exercises for Section
  • Shingling of Documents
      • k-Shingles
      • Choosing the Shingle Size
      • Hashing Shingles
      • Shingles Built from Words
      • Exercises for Section
  • Similarity-Preserving Summaries of Sets
      • Matrix Representation of Sets
      • Minhashing
      • Minhashing and Jaccard Similarity
      • Minhash Signatures
      • Computing Minhash Signatures
      • Exercises for Section
  • Locality-Sensitive Hashing for Documents
      • LSH for Minhash Signatures
      • Analysis of the Banding Technique
      • Combining the Techniques
      • Exercises for Section
  • Distance Measures
      • Definition of a Distance Measure
      • Euclidean Distances
      • Jaccard Distance
      • Cosine Distance
      • Edit Distance
      • Hamming Distance
      • Exercises for Section
  • The Theory of Locality-Sensitive Functions
      • Locality-Sensitive Functions
      • Locality-Sensitive Families for Jaccard Distance
      • Amplifying a Locality-Sensitive Family
      • Exercises for Section
  • LSH Families for Other Distance Measures
      • LSH Families for Hamming Distance
      • Random Hyperplanes and the Cosine Distance
      • Sketches
      • LSH Families for Euclidean Distance
      • More LSH Families for Euclidean Spaces
      • Exercises for Section
  • Applications of Locality-Sensitive Hashing
      • Entity Resolution
      • An Entity-Resolution Example
      • Validating Record Matches
      • Matching Fingerprints
      • A LSH Family for Fingerprint Matching
      • Similar News Articles
      • Exercises for Section
  • Methods for High Degrees of Similarity
      • Finding Identical Items
      • Representing Sets as Strings
      • Length-Based Filtering
      • Prefix Indexing
      • Using Position Information
      • Using Position and Length in Indexes
      • Exercises for Section
  • Summary of Chapter
  • References for Chapter

Mining Data Streams
  • The Stream Data Model
      • A Data-Stream-Management System
      • Examples of Stream Sources
      • Stream Queries
      • Issues in Stream Processing
  • Sampling Data in a Stream
      • A Motivating Example
      • Obtaining a Representative Sample
      • The General Sampling Problem
      • Varying the Sample Size
      • Exercises for Section
  • Filtering Streams
      • A Motivating Example
      • The Bloom Filter
      • Analysis of Bloom Filtering
      • Exercises for Section
  • Counting Distinct Elements in a Stream
      • The Count-Distinct Problem
      • The Flajolet-Martin Algorithm
      • Combining Estimates
      • Space Requirements
      • Exercises for Section
  • Estimating Moments
      • Definition of Moments
      • The Alon-Matias-Szegedy Algorithm for Second Moments
      • Why the Alon-Matias-Szegedy Algorithm Works
      • Higher-Order Moments
      • Dealing With Infinite Streams
      • Exercises for Section
  • Counting Ones in a Window
      • The Cost of Exact Counts
      • The Datar-Gionis-Indyk-Motwani Algorithm
      • Storage Requirements for the DGIM Algorithm
      • Query Answering in the DGIM Algorithm
      • Maintaining the DGIM Conditions
      • Reducing the Error
      • Extensions to the Counting of Ones
      • Exercises for Section
  • Decaying Windows
      • The Problem of Most-Common Elements
      • Definition of the Decaying Window
      • Finding the Most Popular Elements
  • Summary of Chapter
  • References for Chapter

Link Analysis
  • PageRank
      • Early Search Engines and Term Spam
      • Definition of PageRank
      • Structure of the Web
      • Avoiding Dead Ends
      • Spider Traps and Taxation
      • Using PageRank in a Search Engine
      • Exercises for Section
  • Efficient Computation of PageRank
      • Representing Transition Matrices
      • PageRank Iteration Using Map-Reduce
      • Use of Combiners to Consolidate the Result Vector
      • Representing Blocks of the Transition Matrix
      • Other Efficient Approaches to PageRank Iteration
      • Exercises for Section
  • Topic-Sensitive PageRank
      • Motivation for Topic-Sensitive PageRank
      • Biased Random Walks
      • Using Topic-Sensitive PageRank
      • Inferring Topics from Words
      • Exercises for Section
  • Link Spam
      • Architecture of a Spam Farm
      • Analysis of a Spam Farm
      • Combating Link Spam
      • TrustRank
      • Spam Mass
      • Exercises for Section
  • Hubs and Authorities
      • The Intuition Behind HITS
      • Formalizing Hubbiness and Authority
      • Exercises for Section
  • Summary of Chapter
  • References for Chapter

Frequent Itemsets
  • The Market-Basket Model
      • Definition of Frequent Itemsets
      • Applications of Frequent Itemsets
      • Association Rules
      • Finding Association Rules with High Confidence
      • Exercises for Section
  • Market Baskets and the A-Priori Algorithm
      • Representation of Market-Basket Data
      • Use of Main Memory for Itemset Counting
      • Monotonicity of Itemsets
      • Tyranny of Counting Pairs
      • The A-Priori Algorithm
      • A-Priori for All Frequent Itemsets
      • Exercises for Section
  • Handling Larger Datasets in Main Memory
      • The Algorithm of Park, Chen, and Yu
      • The Multistage Algorithm
      • The Multihash Algorithm
      • Exercises for Section
  • Limited-Pass Algorithms
      • The Simple, Randomized Algorithm
      • Avoiding Errors in Sampling Algorithms
      • The Algorithm of Savasere, Omiecinski, and Navathe
      • The SON Algorithm and Map-Reduce
      • Toivonen’s Algorithm
      • Why Toivonen’s Algorithm Works
      • Exercises for Section
  • Counting Frequent Items in a Stream
      • Sampling Methods for Streams
      • Frequent Itemsets in Decaying Windows
      • Hybrid Methods
      • Exercises for Section
  • Summary of Chapter

User reviews (4)

  • Anonymous user: A good book, but unfortunately it is the English edition, so I will have to work through it slowly.

    2016-10-13 09:22:55

  • 10.44.7.248: A misleading title that trades on someone else's name.

    2013-02-20 04:06:30

  • chang1880247: Not bad, thanks.

    2012-11-16 02:04:40

  • 10.44.7.248: This is not the one by 徐子沛.

    2012-10-27 06:10:56
