
Apache Mahout-Canopy Clustering

Canopy Clustering

Canopy Clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters. All objects are represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance metric and two distance thresholds T1 > T2 for processing. The basic algorithm is to begin with a set of points and remove one at random. Create a Canopy containing this point and iterate through the remainder of the point set. At each point, if its distance from the first point is [...]

[...] -o -dm -t1 -t2 -t3 -t4 -cf -ow -cl -xm

Invocation using Java involves supplying the following arguments:

1. input: a file path string to a directory containing the input data set as a SequenceFile(WritableComparable, VectorWritable). The sequence file key is not used.
2. output: a file path string to an empty directory which is used for all output from the algorithm.
3. measure: the fully-qualified class name of an instance of DistanceMeasure which will be used for the clustering.
4. t1: the T1 distance threshold used for clustering.
5. t2: the T2 distance threshold used for clustering.
6. t3: the optional T1 distance threshold used by the reducer for clustering. If not specified, T1 is used by the reducer.
7. t4: the optional T2 distance threshold used by the reducer for clustering. If not specified, T2 is used by the reducer.
8. clusterFilter: the minimum size for canopies to be output by the algorithm. Affects both sequential and mapreduce execution modes, and mapper and reducer outputs.
9. runClustering: a boolean indicating, if true, that the clustering step is to be executed after clusters have been determined.
10. runSequential: a boolean indicating, if true, that the computation is to be run in memory using the reference Canopy implementation.

Note that the sequential implementation performs a single pass through the input vectors, whereas the MapReduce implementation performs two passes (once in the mapper and again in the reducer). As a result, the MapReduce implementation will typically produce fewer clusters than the sequential implementation.

After running the algorithm, the output directory will contain:

1. clusters-0: a directory containing SequenceFiles(Text, Canopy) produced by the algorithm. The Text key contains the cluster identifier of the Canopy.
2. clusteredPoints: (if runClustering is enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the canopyId. The WeightedVectorWritable value is a bean containing a double weight and a VectorWritable vector, where the weight indicates the probability that the vector is a member of the canopy. For canopy clustering, the weights are computed as 1/(1+distance), where the distance is between the cluster center and the vector using the chosen DistanceMeasure.

Examples

The following images illustrate Canopy clustering applied to a set of randomly-generated 2-d data points. The points are generated using a normal distribution centered at a mean location and with a constant standard deviation. See /examples/src/main/java/org/apache/mahout/clustering/display/README.txt for details on running similar examples.

The points are generated as follows:

∙ 500 samples m=[1.0, 1.0] sd=3.0
∙ 300 samples m=[1.0, 0.0] sd=0.5
∙ 300 samples m=[0.0, 2.0] sd=0.1

In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed.

In the second image, the resulting canopies are shown superimposed upon the sample data. Each canopy is represented by two circles, with radius T1 and radius T2.

The third image uses the same values of T1 and T2 but only superimposes canopies covering more than 10% of the population. This is a somewhat better representation of the data, but it still has lots of room for improvement. The advantage of Canopy clustering is that it is single-pass and fast enough to iterate runs using different T1, T2 parameters and display thresholds.
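
The single-pass procedure and the membership weight described above can be illustrated with a small in-memory sketch. This is not Mahout's implementation and uses none of its classes. It assumes the commonly described canopy rule (a point within T1 of the center joins the canopy, and a point within T2 is also removed from further processing), plain Euclidean distance in place of a DistanceMeasure, and the first remaining point as the center instead of a random one so that runs are repeatable. The names CanopySketch, Canopy, cluster and weight are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: hypothetical names, not Mahout's API.
final class CanopySketch {

    /** A canopy: its center plus every point that fell within T1 of that center. */
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
        }
    }

    /** Euclidean distance, standing in for a configurable distance measure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds canopies in memory; requires t1 > t2. */
    static List<Canopy> cluster(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("T1 must be greater than T2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<Canopy> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Assumption: take the first point rather than a random one.
            double[] center = remaining.remove(0);
            Canopy canopy = new Canopy(center);
            canopy.members.add(center);
            List<double[]> next = new ArrayList<>();
            for (double[] p : remaining) {
                double d = distance(center, p);
                if (d < t1) {
                    canopy.members.add(p); // loosely associated with this canopy
                }
                if (d >= t2) {
                    next.add(p);           // still a candidate for later canopies
                }
            }
            remaining = next;
            canopies.add(canopy);
        }
        return canopies;
    }

    /** Membership weight 1/(1+distance) between a canopy center and a point. */
    static double weight(double[] center, double[] point) {
        return 1.0 / (1.0 + distance(center, point));
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {0.0, 0.0},
                new double[] {0.5, 0.0},
                new double[] {2.0, 0.0},
                new double[] {10.0, 10.0});
        for (Canopy c : cluster(points, 3.0, 1.0)) {
            System.out.println(Arrays.toString(c.center) + " members=" + c.members.size());
        }
    }
}
```

Because a point between T2 and T1 of a center stays in the candidate set, it can appear in more than one canopy. Raising T2 toward T1 removes more points per pass and so tends to yield fewer canopies.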