Department of Computer Science, University of Toronto
6 King's College Rd, Toronto, M5S 3G4, Canada
http://learning.cs.toronto.edu

Copyright © Geoffrey Hinton 2010
August 2, 2010
UTML TR 2010-003

A Practical Guide to Training Restricted Boltzmann Machines
Version 1

Geoffrey Hinton
Department of Computer Science, University of Toronto

Contents

1 Introduction
2 An overview of Restricted Boltzmann Machines and Contrastive Divergence
3 How to collect statistics when using Contrastive Divergence
  3.1 Updating the hidden states
  3.2 Updating the visible states
  3.3 Collecting the statistics needed for learning
  3.4 A recipe for getting the learning signal for CD1
4 The size of a mini-batch
  4.1 A recipe for dividing the training set into mini-batches
5 Monitoring the progress of learning
  5.1 A recipe for using the reconstruction error
6 Monitoring the overfitting
  6.1 A recipe for monitoring the overfitting
7 The learning rate
  7.1 A recipe for setting the learning rates for weights and biases
8 The initial values of the weights and biases
  8.1 A recipe for setting the initial values of the weights and biases
9 Momentum
  9.1 A recipe for using momentum
10 Weight-decay
  10.1 A recipe for using weight-decay
11 Encouraging sparse hidden activities
  11.1 A recipe for sparsity
12 The number of hidden units
  12.1 A recipe for choosing the number of hidden units
13 Different types of unit
  13.1 Softmax and multinomial units
  13.2 Gaussian visible units
  13.3 Gaussian visible and hidden units
  13.4 Binomial units
  13.5 Rectified linear units
14 Varieties of contrastive divergence
15 Displaying what is happening during learning
16 Using RBMs for discrimination
  16.1 Computing the free energy of a visible vector
17 Dealing with missing values

If you make use of this technical report to train an RBM, please cite it in any resulting publication.

1 Introduction

Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data including labeled or unlabeled images (Hinton et al., 2006a), windows of mel-cepstral coefficients that represent speech (Mohamed et al., 2009), bags of words that represent documents (Salakhutdinov and Hinton, 2009), and user ratings of movies (Salakhutdinov et al., 2007). In their conditional form they can be used to model high-dimensional temporal sequences such as video or motion capture data (Taylor et al., 2006) or speech (Mohamed and Hinton, 2010). Their most important use is as learning modules that are composed to form deep belief nets (Hinton et al., 2006a).

RBMs are usually trained using the contrastive divergence learning procedure (Hinton, 2002). This requires a certain amount of practical experience to decide how to set the values of numerical meta-parameters such as the learning rate, the momentum, the weight-cost, the sparsity target, the initial values of the weights, the number of hidden units and the size of each mini-batch. There are also decisions to be made about what types of units to use, whether to update their states stochastically or deterministically, how many times to update the states of the hidden units for each training case, and whether to start each sequence of state updates at a data-vector. In addition, it is useful to know how to monitor the progress of learning and when to terminate the training.

For any particular application, the code that was used gives a complete specification of all of these decisions, but it does not explain why the decisions were made or how minor changes will affect performance. More significantly, it does not provide a novice user with any guidance about how to make good decisions for a new application. This requires some sensible heuristics and the ability to relate failures of the learning to the decisions that caused those failures.

Over the last few years, the machine learning group at the University of Toronto has acquired considerable expertise at training RBMs and this guide is an attempt to share this expertise with other machine learning researchers. We are still on a fairly steep part of the learning curve, so the guide is a living document that will be updated from time to time and the version number should always be used when referring to it.

2 An overview of Restricted Boltzmann Machines and Contrastive Divergence

Skip this section if you already know about RBMs.

Consider a training set of binary vectors which we will assume are binary images for the purposes of explanation. The training set can be modeled using a two-layer network called a "Restricted Boltzmann Machine" (Smolensky, 1986; Freund and Haussler, 1992; Hinton, 2002) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to "visible" units of the RBM because their states are observed; the feature detectors correspond to "hidden" units. A joint configuration, (v, h), of the visible and hidden units has an energy (Hopfield, 1982) given by:

    E(v,h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (1)

where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j are their biases and w_{ij} is the weight between them. The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function:

    p(v,h) = \frac{1}{Z} e^{-E(v,h)}    (2)

where the "partition function", Z, is given by summing over all possible pairs of visible and hidden vectors:

    Z = \sum_{v,h} e^{-E(v,h)}    (3)

The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors:

    p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}    (4)

The probability that the network assigns to a training image can be raised by adjusting the weights and biases to lower the energy of that image and to raise the energy of other images, especially those that have low energies and therefore make a big contribution to the partition function. The derivative of the log probability of a training vector with respect to a weight is surprisingly simple:

    \frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}    (5)

where the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

    \Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)    (6)

where \epsilon is a learning rate.

Because there are no direct connections between hidden units in an RBM, it is very easy to get an unbiased sample of ⟨v_i h_j⟩_data. Given a randomly selected training image, v, the binary state, h_j, of each hidden unit, j, is set to 1 with probability

    p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_i v_i w_{ij}\right)    (7)

where \sigma(x) is the logistic sigmoid function 1/(1 + \exp(-x)). v_i h_j is then an unbiased sample.

Because there are no direct connections between visible units in an RBM, it is also very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

    p(v_i = 1 \mid h) = \sigma\left(a_i + \sum_j h_j w_{ij}\right)    (8)

Getting an unbiased sample of ⟨v_i h_j⟩_model, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. One iteration of alternating Gibbs sampling consists of updating all of the hidden units in parallel using equation 7 followed by updating all of the visible units in parallel using equation 8.

A much faster learning procedure was proposed in Hinton (2002). This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using equation 7. Once binary states have been chosen for the hidden units, a "reconstruction" is produced by setting each v_i to 1 with a probability given by equation 8. The change in a weight is then given by

    \Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)    (9)

A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.

The learning works well even though it is only crudely approximating the gradient of the log probability of the training data (Hinton, 2002). The learning rule is much more closely approximating the gradient of another objective function called the Contrastive Divergence (Hinton, 2002) which is the difference between two Kullback-Leibler divergences, but it ignores one tricky term in this objective function so it is not even following that gradient. Indeed, Sutskever and Tieleman have shown that it is not following the gradient of any function (Sutskever and Tieleman, 2010). Nevertheless, it works well enough to achieve success in many significant applications.

RBMs typically learn better models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, which will be called the negative statistics. CD_n will be used to denote learning using n full steps of alternating Gibbs sampling.
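Equations 1-4 can be checked directly on a toy model, and doing so makes the motivation for contrastive divergence concrete: the partition function Z sums over every joint state, so exact model statistics are only computable for tiny networks. The following NumPy sketch is not from the report; all names (energy, exact_p_v, W, a, b) are illustrative.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    # Equation 1: E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -a @ v - b @ h - v @ W @ h

def exact_p_v(v, W, a, b):
    # Equations 2-4 by brute-force enumeration of all joint states.
    # Feasible only for a toy RBM: Z has 2**(n_v + n_h) terms, which is
    # exactly why an unbiased sample of <v_i h_j>_model is hard to get.
    n_v, n_h = W.shape
    visibles = [np.array(s, float) for s in itertools.product([0, 1], repeat=n_v)]
    hiddens = [np.array(s, float) for s in itertools.product([0, 1], repeat=n_h)]
    Z = sum(np.exp(-energy(sv, sh, W, a, b)) for sv in visibles for sh in hiddens)
    return sum(np.exp(-energy(v, sh, W, a, b)) for sh in hiddens) / Z

# Example: a 3-pixel, 2-feature RBM; probabilities over all 8 images sum to 1.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((3, 2))
a, b = np.zeros(3), np.zeros(2)
total = sum(exact_p_v(np.array(v, float), W, a, b)
            for v in itertools.product([0, 1], repeat=3))
print(total)  # ~1.0
```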
3 How to collect statistics when using Contrastive Divergence

To begin with, we shall assume that all of the visible and hidden units are binary. Other types of units will be discussed in section 13. We shall also assume that the purpose of the learning is to create a good generative model of the set of training vectors. When using RBMs to learn Deep Belief Nets (see the article on Deep Belief Networks at www.scholarpedia.org) that will subsequently be fine-tuned using backpropagation, the generative model is not the ultimate objective and it may be possible to save time by underfitting it, but we will ignore that here.

3.1 Updating the hidden states

Assuming that the hidden units are binary and that you are using CD_1, the hidden units should have stochastic binary states when they are being driven by a data-vector. The probability of turning on a hidden unit, j, is computed by applying the logistic function \sigma(x) = 1/(1 + \exp(-x)) to its "total input":

    p(h_j = 1) = \sigma\left(b_j + \sum_i v_i w_{ij}\right)    (10)

and the hidden unit turns on if this probability is greater than a random number uniformly distributed between 0 and 1.

It is very important to make these hidden states binary, rather than using the probabilities themselves. If the probabilities are used, each hidden unit can communicate a real value to the visible units during the reconstruction. This seriously violates the information bottleneck created by the fact that a hidden unit can convey at most one bit (on average). This information bottleneck acts as a strong regularizer.

For the last update of the hidden units, it is silly to use stochastic binary states because nothing depends on which state is chosen. So use the probability itself to avoid unnecessary sampling noise. When using CD_n, only the final update of the hidden units should use the probability.

3.2 Updating the visible states

Assuming that the visible units are binary, the correct way to update the visible states when generating a reconstruction is to stochastically pick a 1 or 0 with a probability determined by the total top-down input:

    p_i = p(v_i = 1) = \sigma\left(a_i + \sum_j h_j w_{ij}\right)    (11)

However, it is common to use the probability, p_i, instead of sampling a binary value. This is not nearly as problematic as using probabilities for the data-driven hidden states and it reduces sampling noise thus allowing faster learning. There is some evidence that it leads to slightly worse density models (Tijmen Tieleman, personal communication). This probably does not matter when using an RBM to pretrain a layer of hidden features for use in a deep belief net.

3.3 Collecting the statistics needed for learning

Assuming that the visible units are using real-valued probabilities instead of stochastic binary values, there are two sensible ways to collect the positive statistics for the connection between visible unit i and hidden unit j:

    ⟨p_i h_j⟩_data  or  ⟨p_i p_j⟩_data

where p_j is a probability and h_j is a binary state that takes value 1 with probability p_j. Using h_j is closer to the mathematical model of an RBM, but using p_j usually has less sampling noise which allows slightly faster learning.

Footnote: Using h_j always creates more noise in the positive statistics than using p_j, but it can actually create less noise in the difference of the positive and negative statistics, because the negative statistics depend on the binary decision for the state of j that is used for creating the reconstruction. The probability of j when driven by the reconstruction is highly correlated with the binary decision that was made for j when it was driven by the data. So there is nothing random about the generation of the reconstructions given the binary states of the hidden units.

3.4 A recipe for getting the learning signal for CD_1

When the hidden units are being driven by data, always use stochastic binary states. When they are being driven by reconstructions, always use probabilities without sampling.

Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions.

When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states, and make sure the weights have random initial values to break symmetry.
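Put together, the recipe amounts to a few matrix operations per batch of data. Here is a minimal NumPy sketch of a CD_1 update that follows it: stochastic binary hidden states when driven by data, probabilities when driven by the reconstruction, and probabilities (not binary states) in the collected statistics. It is not from the report; the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, W, a, b, eps, rng):
    """One CD_1 update on a batch of binary data vectors (rows of v_data),
    per the recipe in section 3.4. W: weights, a: visible biases,
    b: hidden biases, eps: learning rate -- names are ours."""
    # Data-driven hidden units: sample stochastic binary states (eq. 7/10).
    ph_data = sigmoid(b + v_data @ W)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Reconstruction: real-valued probabilities for the visible units
    # (eq. 8/11) and, since this is the final hidden update, probabilities
    # for the hidden units too, avoiding unnecessary sampling noise.
    pv_recon = sigmoid(a + h_data @ W.T)
    ph_recon = sigmoid(b + pv_recon @ W)

    # Statistics use probabilities, not binary states; dividing by the
    # batch size makes eps multiply an average per-case gradient.
    n = v_data.shape[0]
    W += eps * (v_data.T @ ph_data - pv_recon.T @ ph_recon) / n
    a += eps * (v_data - pv_recon).mean(axis=0)
    b += eps * (ph_data - ph_recon).mean(axis=0)
```

Note that the sampled binary states h_data drive the reconstruction (preserving the one-bit information bottleneck) while the probabilities ph_data enter the positive statistics, which is exactly the split the recipe prescribes.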
4 The size of a mini-batch

It is possible to update the weights after estimating the gradient on a single training case, but it is often more efficient to divide the training set into small "mini-batches" of 10 to 100 cases. This allows matrix-matrix multiplies to be used which is very advantageous on GPU boards or in Matlab.

Footnote: The word "batch" is confusing and will be avoided because when it is used to contrast with "online" it usually means the entire training set.

To avoid having to change the learning rate when the size of a mini-batch is changed, it is helpful to divide the total gradient computed on a mini-batch by the size of the mini-batch, so when talking about learning rates we will assume that they multiply the average, per-case gradient computed on a mini-batch, not the total gradient for the mini-batch.

It is a serious mistake to make the mini-batches too large when using stochastic gradient descent. Increasing the mini-batch size by a factor of N leads to a more reliable gradient estimate but it does not increase the maximum stable learning rate by a factor of N, so the net effect is that the weight updates are smaller per gradient evaluation.

4.1 A recipe for dividing the training set into mini-batches

For datasets that contain a small number of equiprobable classes, the ideal mini-batch size is often equal to the number of classes and each mini-batch should contain one example of each class to reduce the sampling error when estimating the gradient for the whole training set from a single mini-batch. For other datasets, first randomize the order of the training examples then use mini-batches of size about 10.
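A hedged sketch of this recipe in NumPy: one-example-per-class mini-batches when labels for a small number of equiprobable classes are available, otherwise a random shuffle followed by batches of about 10. The helper name and parameters are ours, not from the report.

```python
import numpy as np

def minibatches(data, labels=None, batch_size=10, rng=None):
    """Yield mini-batches per the recipe in section 4.1 (illustrative sketch).

    With labels for a small set of equiprobable classes, each mini-batch
    holds one example of every class; otherwise the examples are shuffled
    and split into batches of about `batch_size`.
    """
    rng = rng or np.random.default_rng()
    if labels is not None:
        classes = np.unique(labels)
        per_class = [rng.permutation(np.flatnonzero(labels == c)) for c in classes]
        for idxs in zip(*per_class):  # one index drawn from every class
            yield data[list(idxs)]
    else:
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            yield data[order[start:start + batch_size]]
```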
5 Monitoring the progress of learning

It is easy to compute the squared error between the data and the reconstructions, so this quantity is often printed out during learning. The reconstruction error on the entire training set should fall rapidly and consistently at the start of learning and then more slowly. Due to the noise in the gradient estimates, the reconstruction error on the individual mini-batches will fluctuate gently after the initial rapid descent. It may also oscillate gently with a period of a few mini-batches when using high momentum (see section 9).

Although it is convenient, the reconstruction error is actually a very poor measure of the progress of learning. It is not the function that CD_n learning is approximately optimizing, especially for n >> 1, and it systematically confounds two different quantities that are changing during the learning. The first is the difference between the empirical distribution of the training data and the equilibrium distribution of the RBM. The second is the mixing rate of the alternating Gibbs Markov chain. If the mixing rate is very low, the reconstruction error will be very small even when the distributions of the data and the model are very different. As the weights increase the mixing rate falls, so decreases in reconstruction error do not necessarily mean that the model is improving and, conversely, small increases do not necessarily mean the model is getting worse. Large increases, however, are a bad sign except when they are temporary and caused by changes in the learning rate, momentum, weight-cost or sparsity meta-parameters.

5.1 A recipe for using the reconstruction error

Use it but don't trust it. If you really want to know what is going on during the learning, use multiple histograms and graphic displays as described in section 15. Also consider using Annealed Importance Sampling (Salakhutdinov and Murray, 2008) to estimate the density on held-out data. If you are learning a joint density model of labelled data (see section 16), consider monitoring the discriminative performance on the training data and on a held-out validation set.

Footnote: The easy way to parallelize the learning on a cluster is to divide each mini-batch into sub-mini-batches and to use different cores to compute the gradients on each sub-mini-batch. The gradients computed by different cores must then be combined. To minimize the ratio of communication to computation, it is tempting to make the sub-mini-batches large. This usually makes the learning much less efficient, thus wiping out much of the gain achieved by using multiple cores (Vinod Nair, personal communication).
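For what it is worth, the reconstruction error itself is cheap to track. A short sketch (ours, not from the report): accumulate the squared difference between each data vector and its one-step, real-valued reconstruction, using mean-field hidden probabilities rather than samples so that the number is deterministic for a fixed model.

```python
import numpy as np

def reconstruction_error(v_data, W, a, b):
    """Summed squared error between a batch of data vectors and their
    one-step, real-valued reconstructions (illustrative names)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    ph = sigmoid(b + v_data @ W)   # hidden probabilities (eq. 7/10)
    pv = sigmoid(a + ph @ W.T)     # mean-field reconstruction (eq. 8/11)
    return ((v_data - pv) ** 2).sum()
```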
6 Monitoring the overfitting

When learning a generative model, the obvious quantity to monitor is the probability that the current model assigns to a data point. When this probability starts to decrease for held-out validation data, it is time to stop learning. Unfortunately, for large RBMs, it is very difficult to compute this probability because it requires knowledge of the partition function. Nevertheless, it is possible to directly monitor the overfitting by comparing the free energies of training data and held-out validation data. In this comparison, the partition function cancels out. The free energy of a data vector can be computed in a time that is linear in the number of hidden units (see section 16.1). If the model is not overfitting at all, the average free energy should be about the same on training and validation data. As the model starts to overfit, the average free energy of the validation data will rise relative to the average free energy of the training data, and this gap represents the amount of overfitting.
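This comparison is cheap to implement. For a binary RBM the free energy has the standard closed form F(v) = -\sum_i a_i v_i - \sum_j \log(1 + e^{x_j}) with x_j = b_j + \sum_i v_i w_{ij}, which is linear in the number of hidden units as the section notes. The sketch below (ours; names illustrative) tracks the gap between the validation and training averages, whose growth signals overfitting.

```python
import numpy as np

def free_energy(v, W, a, b):
    """Free energy of binary data vectors (rows of v):
    F(v) = -sum_i a_i v_i - sum_j log(1 + exp(b_j + sum_i v_i w_ij))."""
    x = b + v @ W
    return -(v @ a) - np.logaddexp(0.0, x).sum(axis=1)

def overfitting_gap(v_train, v_valid, W, a, b):
    """Average validation free energy minus average training free energy.
    The partition function cancels in this difference; a growing positive
    gap indicates overfitting."""
    return (free_energy(v_valid, W, a, b).mean()
            - free_energy(v_train, W, a, b).mean())
```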
