UTML TR 2010–003

A Practical Guide to Training Restricted Boltzmann Machines
Version 1

Geoffrey Hinton
Department of Computer Science, University of Toronto
King's College Rd, Toronto, Canada
http://learning.cs.toronto.edu
Copyright © Geoffrey Hinton, August 2010

If you make use of this technical report to train an RBM, please cite it in any resulting publication.

Contents

- Introduction
- An overview of Restricted Boltzmann Machines and Contrastive Divergence
- How to collect statistics when using Contrastive Divergence
  - Updating the hidden states
  - Updating the visible states
  - Collecting the statistics needed for learning
  - A recipe for getting the learning signal for CD1
- The size of a mini-batch
  - A recipe for dividing the training set into mini-batches
- Monitoring the progress of learning
  - A recipe for using the reconstruction error
- Monitoring the overfitting
  - A recipe for monitoring the overfitting
- The learning rate
  - A recipe for setting the learning rates for weights and biases
- The initial values of the weights and biases
  - A recipe for setting the initial values of the weights and biases
- Momentum
  - A recipe for using momentum
- Weight-decay
  - A recipe for using weight-decay
- Encouraging sparse hidden activities
  - A recipe for sparsity
- The number of hidden units
  - A recipe for choosing the number of hidden units
- Different types of unit
  - Softmax and multinomial units
  - Gaussian visible units
  - Gaussian visible and hidden units
  - Binomial units
  - Rectified linear units
- Varieties of contrastive divergence
- Displaying what is happening during learning
- Using RBMs for discrimination
  - Computing the free energy of a visible vector
- Dealing with missing values

Introduction

Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data including labeled or unlabeled images (Hinton et al., 2006a), windows of mel-cepstral coefficients that represent speech (Mohamed et al., 2009), bags of words that represent documents (Salakhutdinov and Hinton, 2009), and user ratings of movies (Salakhutdinov et al., 2007). In their conditional form they can be used to model high-dimensional temporal sequences such as video or motion capture data (Taylor et al., 2006) or speech (Mohamed and Hinton, 2010). Their most important use is as learning modules that are composed to form deep belief nets (Hinton et al., 2006a).

RBMs are usually trained using the contrastive divergence learning procedure (Hinton, 2002). This requires a certain amount of practical experience to decide how to set the values of numerical meta-parameters such as the learning rate, the momentum, the weight-cost, the sparsity target, the initial values of the weights, the number of hidden units and the size of each mini-batch. There are also decisions to be made about what types of units to use, whether to update their states stochastically or deterministically, how many times to update the states of the hidden units for each training case, and whether to start each sequence of state updates at a data vector. In addition, it is useful to know how to monitor the progress of learning and when to terminate the training.

For any particular application, the code that was used gives a complete specification of all of these decisions, but it does not explain why the decisions were made or how minor changes will affect performance. More significantly, it does not provide a novice user with any guidance about how to make good decisions for a new application. This requires some sensible heuristics and the ability to relate failures of the learning to the decisions that caused those failures.

Over the last few years, the machine learning group at the University of Toronto has acquired considerable expertise at training RBMs and this guide is an attempt to share this expertise with other machine learning researchers. We are still on a fairly steep part of the learning curve, so the guide is a living document that will be updated from time to time and the version number should always be used when referring to it.

An overview of Restricted Boltzmann Machines and Contrastive Divergence

Skip this section if you already know about RBMs.

Consider a training set of binary vectors which we will assume are binary images for the purposes of explanation. The training set can be modeled using a two-layer network called a "Restricted Boltzmann Machine" (Smolensky, 1986; Freund and Haussler, 1992; Hinton, 2002) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to "visible" units of the RBM because their states are observed; the feature detectors correspond to "hidden" units. A joint configuration, (v, h), of the visible and hidden units has an energy (Hopfield, 1982) given by:

    E(v, h) = − ∑_{i∈visible} a_i v_i − ∑_{j∈hidden} b_j h_j − ∑_{i,j} v_i h_j w_ij    (1)

where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j are their biases and w_ij is the weight between them. The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function:

    p(v, h) = (1/Z) e^{−E(v,h)}    (2)

where the "partition function", Z, is given by summing over all possible pairs of visible and hidden vectors:

    Z = ∑_{v,h} e^{−E(v,h)}    (3)

The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors:

    p(v) = (1/Z) ∑_h e^{−E(v,h)}    (4)

The probability that the network assigns to a training image can be raised by adjusting the weights and biases to lower the energy of that image and to raise the energy of other images, especially those that have low energies and therefore make a big contribution to the partition function. The derivative of the log probability of a training vector with respect to a weight is surprisingly simple:

    ∂ log p(v) / ∂w_ij = 〈v_i h_j〉_data − 〈v_i h_j〉_model    (5)

where the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

    ∆w_ij = ε (〈v_i h_j〉_data − 〈v_i h_j〉_model)    (6)

where ε is a learning rate.

Because there are no direct connections between hidden units in an RBM, it is very easy to get an unbiased sample of 〈v_i h_j〉_data. Given a randomly selected training image, v, the binary state, h_j, of each hidden unit, j, is set to 1 with probability

    p(h_j = 1 | v) = σ(b_j + ∑_i v_i w_ij)    (7)

where σ(x) is the logistic sigmoid function 1/(1 + exp(−x)). v_i h_j is then an unbiased sample. Because there are no direct connections between visible units in an RBM, it is also very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

    p(v_i = 1 | h) = σ(a_i + ∑_j h_j w_ij)    (8)

Getting an unbiased sample of 〈v_i h_j〉_model, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. One iteration of alternating Gibbs sampling consists of updating all of the hidden units in parallel using equation (7) followed by updating all of the visible units in parallel using equation (8).

A much faster learning procedure was proposed in Hinton (2002). This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using equation (7). Once binary states have been chosen for the hidden units, a "reconstruction" is produced by setting each v_i to 1 with a probability given by equation (8). The change in a weight is then given by

    ∆w_ij = ε (〈v_i h_j〉_data − 〈v_i h_j〉_recon)    (9)

A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.

The learning works well even though it is only crudely approximating the gradient of the log probability of the training data (Hinton, 2002). The learning rule is much more closely approximating the gradient of another objective function called the Contrastive Divergence (Hinton, 2002) which is the difference between two Kullback-Leibler divergences, but it ignores one tricky term in this objective function so it is not even following that gradient. Indeed, Sutskever and Tieleman have shown that it is not following the gradient of any function (Sutskever and Tieleman, 2010). Nevertheless, it works well enough to achieve success in many significant applications.

RBMs typically learn better models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, which will be called the negative statistics. CDn will be used to denote learning using n full steps of alternating Gibbs sampling.

How to collect statistics when using Contrastive Divergence

To begin with, we shall assume that all of the visible and hidden units are binary. Other types of units will be discussed in the sections on different types of unit. We shall also assume that the purpose of the learning is to create a good generative model of the set of training vectors. When using RBMs to learn Deep Belief Nets (see the article on Deep Belief Networks at www.scholarpedia.org) that will subsequently be fine-tuned using backpropagation, the generative model is not the ultimate objective and it may be possible to save time by underfitting it, but we will ignore that here.

Updating the hidden states

Assuming that the hidden units are binary and that you are using CD1, the hidden units should have stochastic binary states when they are being driven by a data vector. The probability of turning on a hidden unit, j, is computed by applying the logistic function σ(x) = 1/(1 + exp(−x)) to its "total input":

    p(h_j = 1) = σ(b_j + ∑_i v_i w_ij)    (10)

and the hidden unit turns on if this probability is greater than a random number uniformly distributed between 0 and 1.

It is very important to make these hidden states binary, rather than using the probabilities themselves. If the probabilities are used, each hidden unit can communicate a real value to the visible units during the reconstruction. This seriously violates the information bottleneck created by the fact that a hidden unit can convey at most one bit (on average). This information bottleneck acts as a strong regularizer.

For the last update of the hidden units, it is silly to use stochastic binary states because nothing depends on which state is chosen. So use the probability itself to avoid unnecessary sampling noise. When using CDn, only the final update of the hidden units should use the probability.

Updating the visible states

Assuming that the visible units are binary, the correct way to update the visible states when generating a reconstruction is to stochastically pick a 1 or 0 with a probability determined by the total top-down input:

    p_i = p(v_i = 1) = σ(a_i + ∑_j h_j w_ij)    (11)

However, it is common to use the probability, p_i, instead of sampling a binary value. This is not nearly as problematic as using probabilities for the data-driven hidden states and it reduces sampling noise thus allowing faster learning. There is some evidence that it leads to slightly worse density models (Tijmen Tieleman, personal communication). This probably does not matter when using an RBM to pretrain a layer of hidden features for use in a deep belief net.

Collecting the statistics needed for learning

Assuming that the visible units are using real-valued probabilities instead of stochastic binary values, there are two sensible ways to collect the positive statistics for the connection between visible unit i and hidden unit j:

    〈p_i h_j〉_data  or  〈p_i p_j〉_data

where p_j is a probability and h_j is a binary state that takes value 1 with probability p_j. Using h_j is closer to the mathematical model of an RBM, but using p_j usually has less sampling noise which allows slightly faster learning.

Footnote: Using h_j always creates more noise in the positive statistics than using p_j, but it can actually create less noise in the difference of the positive and negative statistics, because the negative statistics depend on the binary decision for the state of j that is used for creating the reconstruction. The probability of j when driven by the reconstruction is highly correlated with the binary decision that was made for j when it was driven by the data. So there is nothing random about the generation of the reconstructions given the binary states of the hidden units.

A recipe for getting the learning signal for CD1

When the hidden units are being driven by data, always use stochastic binary states. When they are being driven by reconstructions, always use probabilities without sampling. Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions. When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states, and make sure the weights have random initial values to break symmetry.

The size of a mini-batch

It is possible to update the weights after estimating the gradient on a single training case, but it is often more efficient to divide the training set into small "mini-batches" of 10 to 100 cases. This allows matrix-matrix multiplies to be used, which is very advantageous on GPU boards or in Matlab.

Footnote: The word "batch" is confusing and will be avoided, because when it is used to contrast with "online" it usually means the entire training set.

To avoid having to change the learning rate when the size of a mini-batch is changed, it is helpful to divide the total gradient computed on a mini-batch by the size of the mini-batch, so when talking about learning rates we will assume that they multiply the average, per-case gradient computed on a mini-batch, not the total gradient for the mini-batch.

It is a serious mistake to make the mini-batches too large when using stochastic gradient descent. Increasing the mini-batch size by a factor of N leads to a more reliable gradient estimate, but it does not increase the maximum stable learning rate by a factor of N, so the net effect is that the weight updates are smaller per gradient evaluation.

A recipe for dividing the training set into mini-batches

For datasets that contain a small number of equiprobable classes, the ideal mini-batch size is often equal to the number of classes and each mini-batch should contain one example of each class to reduce the sampling error when estimating the gradient for the whole training set from a single mini-batch. For other datasets, first randomize the order of the training examples, then use mini-batches of size about 10.

Footnote: The easy way to parallelize the learning on a cluster is to divide each mini-batch into sub-mini-batches and to use different cores to compute the gradients on each sub-mini-batch. The gradients computed by different cores must then be combined. To minimize the ratio of communication to computation, it is tempting to make the sub-mini-batches large. This usually makes the learning much less efficient, thus wiping out much of the gain achieved by using multiple cores (Vinod Nair, personal communication).

Monitoring the progress of learning

It is easy to compute the squared error between the data and the reconstructions, so this quantity is often printed out during learning. The reconstruction error on the entire training set should fall rapidly and consistently at the start of learning and then more slowly. Due to the noise in the gradient estimates, the reconstruction error on the individual mini-batches will fluctuate gently after the initial rapid descent. It may also oscillate gently with a period of a few mini-batches when using high momentum (see the section on momentum).

Although it is convenient, the reconstruction error is actually a very poor measure of the progress of learning. It is not the function that CDn learning is approximately optimizing, especially for n ≫ 1, and it systematically confounds two different quantities that are changing during the learning. The first is the difference between the empirical distribution of the training data and the equilibrium distribution of the RBM. The second is the mixing rate of the alternating Gibbs Markov chain. If the mixing rate is very low, the reconstruction error will be very small even when the distributions of the data and the model are very different. As the weights increase, the mixing rate falls, so decreases in reconstruction error do not necessarily mean that the model is improving and, conversely, small increases do not necessarily mean the model is getting worse. Large increases, however, are a bad sign except when they are temporary and caused by changes in the learning rate, momentum, weight-cost or sparsity meta-parameters.

A recipe for using the reconstruction error

Use it but don't trust it. If you really want to know what is going on during the learning, use multiple histograms and graphic displays as described in the section on displaying what is happening during learning. Also consider using Annealed Importance Sampling (Salakhutdinov and Murray, 2008) to estimate the density on held-out data. If you are learning a joint density model of labelled data (see the section on using RBMs for discrimination), consider monitoring the discriminative performance on the training data and on a held-out validation set.

Monitoring the overfitting

When learning a generative model, the obvious quantity to monitor is the probability that the current model assigns to a datapoint. When this probability starts to decrease for held-out validation data, it is time to stop learning. Unfortunately, for large RBMs, it is very difficult to compute this probability because it requires knowledge of the partition function. Nevertheless, it is possible to directly monitor the overfitting by comparing the free energies of training data and held-out validation data. In this comparison, the partition function cancels out. The free energy of a data vector can be computed in a time that is linear in the number of hidden units (see the section on computing the free energy of a visible vector). If the model is not overfitting at all, the average free energy should be about the same on training and validation data. As the model starts to overfit, the average free energy of the validation data will rise relative to the average free energy of the training data, and this gap represents the amount of overfitting.
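As a concrete illustration of the CD1 recipe, here is a minimal NumPy sketch of one CD1 update on a mini-batch: stochastic binary hidden states when driven by data, real-valued probabilities for the reconstruction and the final hidden update, probabilities (not binary states) in the learning statistics, and gradients averaged per case. All names (`cd1_update`, `W`, `a`, `b`) are illustrative, not from the report.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, rng):
    """One CD1 step on a mini-batch (rows of v_data are training cases).
    W: visible x hidden weights; a: visible biases; b: hidden biases."""
    # Data-driven hidden probabilities and *sampled* binary states.
    ph_data = sigmoid(v_data @ W + b)                       # p(h_j = 1 | v)
    h_data = (ph_data > rng.random(ph_data.shape)) * 1.0    # stochastic binary
    # Reconstruction: use probabilities instead of sampling binary visibles.
    pv_recon = sigmoid(h_data @ W.T + a)                    # p(v_i = 1 | h)
    # Final hidden update: probabilities, to avoid needless sampling noise.
    ph_recon = sigmoid(pv_recon @ W + b)
    n = v_data.shape[0]
    # Average per-case gradients so the learning rate is batch-size independent.
    dW = (v_data.T @ ph_data - pv_recon.T @ ph_recon) / n
    da = (v_data - pv_recon).mean(axis=0)
    db = (ph_data - ph_recon).mean(axis=0)
    recon_err = ((v_data - pv_recon) ** 2).sum() / n        # monitor, don't trust
    return dW, da, db, recon_err
```

A training step would then be `W += eps * dW` (likewise for the biases), with `eps` understood as the per-case learning rate described in the mini-batch section.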
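The recipe for dividing the training set into mini-batches can be sketched as follows; the `labels` argument and the default batch size of 10 follow the recipe, while the function name and signature are illustrative assumptions.

```python
import numpy as np

def make_minibatches(data, labels=None, rng=None, size=10):
    """Split training cases (rows of data) into mini-batches.
    With a small number of equiprobable classes, build mini-batches that
    contain one example of each class; otherwise shuffle and chunk."""
    rng = rng or np.random.default_rng()
    if labels is not None:
        classes = np.unique(labels)
        k = len(classes)
        # One shuffled index stream per class, interleaved c1,c2,...,ck,c1,...
        streams = [rng.permutation(np.flatnonzero(labels == c)) for c in classes]
        m = min(len(s) for s in streams)
        order = np.stack([s[:m] for s in streams], axis=1).reshape(-1)
        return [data[order[i:i + k]] for i in range(0, len(order), k)]
    # Generic case: randomize the order, then use batches of size about 10.
    order = rng.permutation(data.shape[0])
    return [data[order[i:i + size]] for i in range(0, data.shape[0], size)]
```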
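The free-energy comparison described in the section on monitoring the overfitting can be implemented directly. The excerpt above does not spell out the formula, but for a binary RBM the free energy has the standard closed form obtained by summing out the hidden units: F(v) = −∑_i a_i v_i − ∑_j log(1 + exp(b_j + ∑_i v_i w_ij)). A sketch, with illustrative names:

```python
import numpy as np

def free_energy(v, W, a, b):
    """Free energy of each row of v for a binary RBM, linear in the number
    of hidden units: F(v) = -sum_i a_i v_i - sum_j log(1 + exp(x_j))."""
    x = v @ W + b                                        # total input to hidden unit j
    return -(v @ a) - np.logaddexp(0.0, x).sum(axis=1)   # log(1+e^x), overflow-safe

def overfitting_gap(v_train, v_valid, W, a, b):
    """Average validation free energy minus average training free energy.
    The partition function cancels in this difference; a positive, growing
    gap means the model is starting to overfit."""
    return free_energy(v_valid, W, a, b).mean() - free_energy(v_train, W, a, b).mean()
```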
