
A Low-Power Vector Processor Using Logarithmic Arithmetic for Handheld 3D Graphics Systems


Uploaded by: rapanda, 2013-12-10


Byeong-Gyu Nam and Hoi-Jun Yoo
Dept. of EECS, Korea Advanced Institute of Science and Technology (KAIST), Guseong-dong, Yuseong-gu, Daejeon, […], Republic of Korea

Abstract: A low-power, high-performance […]-way […]-bit vector processor is developed for handheld 3D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By utilizing logarithmic arithmetic, the unit achieves single-cycle throughput for all these operations except for the matrix-vector multiplication, which has […]-cycle throughput. The processor, which features this function unit, a cascaded integer-float datapath, datapath reconfiguration, operand forwarding in the logarithmic domain, and a vertex cache, takes […] mm² in […] µm CMOS technology and achieves […] Mvertices/s for geometry transformation and […] Mvertices/s for OpenGL transformation and lighting at […] MHz with […] mW power consumption.

I. INTRODUCTION

Handheld graphics processing units (GPUs) incorporate vector processors, known as shaders, in their 3D graphics pipeline stages to provide more realistic images. In […], a vertex shader is proposed with […]-way floating-point (FLP) multipliers for fast geometry transformation, but it consumed a large silicon area and power. For power- and area-efficient design of the shaders, a multifunction unit was proposed in […]. However, it was a fixed-point unit and did not deal with the matrix-vector multiplication, which is frequently used for 3D geometry transformations.

In this paper, a […]-way […]-bit FLP vector processor is proposed for the shaders. It adopts a unified matrix, vector, and elementary function unit, which unifies all these operations in a single […]-way arithmetic unit. The unit operates on FLP data since the newly defined graphics API requires more than […]-bit FLP precision. Although it operates on FLP data, it uses logarithmic arithmetic internally to reduce the arithmetic complexity. Its instruction set includes matrix-vector multiplication (MAT), vector operations (VEC) like vector multiplication (MUL), division (DIV), divide-by-square-root (DSQ), multiply-add (MAD), and dot-product (DOT), and elementary functions (ELM) including trigonometric functions (TRGs), power (POW), and logarithm (LOG). It achieves single-cycle throughput with a maximum […]-cycle latency for all these operations except for the MAT, which has […]-cycle throughput and […]-cycle latency.

The processor has a cascaded structure of integer and FLP datapaths for efficient indexing of FLP operands. Configuration of the FLP multifunction unit is supported to extend the instruction set according to the user's requirements. Its pipeline exploits operand forwarding in the logarithmic domain to improve pipeline throughput and computation accuracy. A vertex cache is adopted to reuse previously processed results and improve the throughput of the processor.

II. ARITHMETIC UNIT

A. Number System

The proposed arithmetic unit is based on the hybrid approach of the FLP and the logarithmic number system (LNS) introduced in […], where operations are reduced into simpler ones in the LNS while addition and subtraction are performed in FLP, since LNS addition and subtraction require nonlinear term evaluations. Logarithmic and antilogarithmic converters between the FLP and the LNS are proposed for this hybrid number system (HNS).

[Fig. […]: Proposed number converters. (a) Logarithmic converter. (b) Antilogarithmic converter.]

1) Logarithmic Converter

For the […]-bit FLP input $x = 2^{e}(1+m)$, its logarithmic number is represented as $\log_2 x = e + \log_2(1+m)$. The exponent $e$ is the integer part of the logarithmic number and $\log_2(1+m)$ is the fractional part. The term $\log_2(1+m)$ is approximated by piecewise linear expressions as $\log_2(1+m) \approx a \cdot m + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The logarithmic converter divides the $(1+m)$ into finer subdivisions around the region near […], since the error increases as the input value gets closer to […]. It achieves […] maximum conversion error with […] approximation regions. Fig. […](a) shows the proposed logarithmic converter.

2) Antilogarithmic Converter

For the logarithmic number $X$, its FLP number can be represented by $2^{X} = 2^{e+f} = 2^{e} \cdot 2^{f}$. The integer part $e$ directly becomes the exponent of the FLP number and the nonlinear term $2^{f}$ is approximated by piecewise linear expressions like $2^{f} \approx a \cdot f + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The antilogarithmic converter evenly divides $f$ into […] approximation regions, since the antilogarithmic conversion error is spread evenly over the entire input region. It achieves […] maximum conversion error with […] approximation regions. Fig. […](b) shows the antilogarithmic converter.

B. Organization

The arithmetic unit is organized with […] channels and […] pipeline stages as shown in Fig. […]. The […] input […]-bit FLP operands are converted into logarithmic numbers with a […]-bit integer, […]-bit fraction, […]-bit sign, and […]-bit zero through the logarithmic converters (LOGCs) in E[…]. E[…] includes the programmable multiplier (PMUL) shown in Fig. […], which can be used as the Booth multiplier for ELM, as […] LOGCs for VEC, or as […] antilogarithmic converters (ALOGCs) for MAT by just adding […]-entry […]B LOG and […]-entry […]B ALOG lookup tables to the Booth multiplier and sharing the CSA tree and a CPA. In this way, the number of LOGCs in E[…] is reduced to […], from […] in […]. In E[…], […] adders in the logarithmic domain are provided for the VECs, and the resulting values are converted into FLP numbers through the […] ALOGCs. The programmable adder (PADD) in E[…] can be programmed into a […]-input FLP adder tree or […]-way […]-input SIMD FLP adders for target operations, as proposed in […]. E[…] provides a SIMD FLP accumulator for the final accumulation required for the MAT. It is also used as rounding logic for other operations.

[Fig. […]: Proposed arithmetic unit.]

1) Matrix-Vector Multiplication

The geometry transformation in 3D graphics can be computed by the multiplication of a […]×[…] matrix with a […]-element vector, which requires […] multiplications and […] additions. This can be converted into the HNS as in Fig. […], requiring […] LOGCs, […] adders, […] ALOGCs, and […] FLP adders. Since the coefficients of a geometry transformation matrix are fixed during the processing of a 3D object, they can be preconverted into the logarithmic domain and used as constants during the processing. Thus, the MAT only requires […] LOGCs for vector element conversion, […] adders in the logarithmic domain, […] ALOGCs, and […] FLP adders.

This can be implemented in two phases on this […]-way arithmetic unit as illustrated in Fig. […]. In this scheme, […] adders in the logarithmic domain and […] ALOGCs are required per phase, and the […] ALOGCs in the first phase are obtained from […] ALOGCs in E[…] by programming the PMUL into […] ALOGCs, together with the […] ALOGCs in E[…]. The […] CPAs in E[…] and E[…] are used for the adders in the logarithmic domain. The […] multiplication results from the ALOGCs in E[…] and the other […] from E[…] are added in E[…] by programming the PADD into a […]-way SIMD FLP adder to get the first-phase result. With the same process repeated, the accumulation with the first-phase result in E[…] completes the MAT. Thus, the MAT is implemented with […]-cycle throughput on this […]-way arithmetic unit, whereas it was implemented with […]-cycle throughput in the conventional way.

[Fig. […]: Programmable multiplier (PMUL).]

[Fig. […]: Two-phase implementation of MAT.]

2) Vector and Elementary Functions

The VECs and ELMs, including the MUL, DIV, DSQ, MAD, DOT, POW, LOG, and TRGs, are implemented based on the scheme in […]. Since the vector operations require […] LOGCs for […] operands per channel, the PMUL is programmed into […] LOGCs to make the […] LOGCs for […] channels together with the […] LOGCs in E[…]. Since the power is converted into a multiplication in the logarithmic domain, it requires a […]b × […]b multiplier, which is implemented by programming the PMUL into a single […]b × […]b BMUL. The TRGs are unified with the others using the Taylor series as proposed in […]. This power series requires a […]-way […]b × […]b multiplier in the logarithmic domain and a final summation of these terms. This can be implemented by programming the PMUL into a […]-way […]b × […]b BMUL and the PADD into a […]-input summation tree.

III. PROCESSOR ARCHITECTURE

A. Instruction Set Architecture

The processor has two types of instruction format, with […] bits and […] bits. The […]-bit instructions are used for control and integer (INT) operations, while the […]-bit instructions are used for the FLP operations. This separation of formats, rather than a single […]-bit format, results in reduced instruction memory (IMEM) size and power dissipation. In a loop execution, the register indices for FLP operands may be calculated dynamically as a function of the loop counter i as follows:

OP RDst[i […] kd], SrcA[i […] ka], SrcB[i […] kb], SrcC[i […] kc]

Thus, the INT instructions calculating the indices can be embedded in the FLP instruction as the FLP operand fields for efficient calculation of the indices. In this case, the INT computation is followed by the FLP computation within the single […]-bit instruction shown in Fig. […].

[Fig. […]: […]-bit FLP instruction format.]

B. Microarchitecture

Fig. […] shows the microarchitecture of the proposed processor. It has […]-way […]-bit FLP vector register files including a […]-byte […]-entry vertex input register (VIR), a general purpose register file (GPR), a […]KB […]-entry constant memory (CMEM), and a […]-byte […]-entry vertex output register (VOR). The FLP operands are fetched from the GPR, VIR, or CMEM, and the result is written back to the GPR or VOR, indexing the target entry in the destination register file. The FLP operands can be swizzled, negated, and converted into the absolute value by the source modifiers. The arithmetic unit proposed in Section II is used as the FLP multifunction unit in this processor. The vertex cache is composed of […] VORs.

1) Cascaded INT-FLP Datapaths

This processor has a cascaded architecture of […]-way […]-bit INT and […]-way […]-bit FLP datapaths to implement the embedded index calculation of the FLP operands without using additional cycles. For flexible index calculation, this processor includes a […]-way […]-bit SIMD integer ALU and a […]-byte […]-entry integer register file (IGPR). The […] […]-bit results from this unit index the […] source operands and the destination of an FLP operation. When a multiplication is required for the index calculation, the PMUL in the FLP arithmetic unit is programmed into a […]b × […]b integer multiply-add (IMAD) unit, since conversion error is not allowed in the index calculation.

2) Datapath Reconfiguration

The FLP unit in Section II has several MUXes to unify various operations and reveals all the control points to programmers so that they can make arbitrary operations on it by programming the […]-bit control signals. For example, the various TRGs supported in this processor can be programmed by a single configuration instruction (CFG) with some configuration data, rather than including all of the TRGs in the limited instruction space. The programmed configuration data are stored in the […]-byte […]-entry configuration register file (CFR) and accessed by the CFG. Thus, the configuration can be changed in every cycle.

[Fig. […]: Microarchitecture of the proposed processor.]

3) Logarithmic-Domain Forwarding

Operand forwarding improves the throughput of the processor pipeline. In this processor, operand forwarding is also supported in the logarithmic domain. As shown in Fig. […], in the case of consecutive FLP operations that do not require the final FLP adders, the antilogarithmic and logarithmic conversions cancel each other, and the intermediate logarithmic value of the previous FLP operation is forwarded directly into the logarithmic domain of the next FLP operation, bypassing the antilogarithmic and logarithmic converters of the two operations. This reduces the pipeline latency, and it reduces the computation error by bypassing the repeated antilogarithmic and logarithmic conversions, which are the sources of errors.

[Fig. […]: Logarithmic-domain forwarding (compared with conventional forwarding).]

4) Vertex Cache

A transformation and lighting (TnL) vertex cache is provided to reuse previously processed vertices without executing the TnL routine. The […]KB SRAM can contain […] result vertices with a […] hit rate. This leads to a single-cycle TnL for the vertices in the vertex cache and a peak geometry transformation (TFM) rate of […] Mvertices/s together with the […]-cycle MAT.

IV. IMPLEMENTATION RESULTS

A. Chip Implementation

The proposed processor is integrated into a 3D graphics chip as a vertex shader and fabricated in […] µm […]-metal CMOS technology. The chip micrograph and a summary of chip characteristics are shown in Fig. […]. The core size is […] mm² and it operates at […] MHz, consuming […] mW at […] V. Fig. […] shows the shmoo. The clock to the pipeline registers is fully gated to disable unnecessary switching under the control of each instruction. All the FLP instructions are processed with single-cycle throughput, except for the MAT with […]-cycle throughput. Table I shows the latencies and throughputs for FLP instructions.

[Fig. […]: Chip micrograph.]

[Fig. […]: Shmoo (horizontal axis: operating frequency in MHz).]

TABLE I. THE LATENCY/THROUGHPUT OF SELECTED INSTRUCTIONS

           | ADD | DIV | DSQ | MAD | DOT | MAT | POW | SIN
Throughput | […] | […] | […] | […] | […] | […] | […] | […]
Latency    | […] | […] | […] | […] | […] | […] | […] | […]

B. Comparison

The performance is compared for the full OpenGL TnL routine with modelview, normal, and perspective transformations, normalizations of the light, view, normal, and Blinn half vectors, and intensity calculations of diffuse and specular lighting for a single light source. The routine includes […] MATs, […] DSQs, […] DIVs, […] DOTs, […] MADs, […] POW, and […] ADDs. After rescheduling the code to avoid dependencies, the execution on the proposed processor requires […] cycles. Table II shows the comparison results; the peak TFM performance is also compared. Our work shows […] and […] performance improvements for TFM and TnL, respectively, while reducing power and area overhead by […] and […], respectively, relative to the latest work […].

TABLE II. THE COMPARISON RESULTS

Ref.      | TFM (Mvertices/s) | TnL (Mvertices/s) | Power (mW) | Area (mm²) | Freq. (MHz) | Process (µm)
[…]       | […]               | […]               | […]        | […]        | […]         | […]
This work | […]               | […]               | […]        | […]        | […]         | […]

V. CONCLUSION

A high-performance, power- and area-efficient vector processor is proposed for 3D graphics shaders. It adopts a […]-way […]-bit FLP multifunction unit. Using logarithmic arithmetic, the unit unifies the vector, matrix, and elementary functions into a single arithmetic unit and achieves single-cycle throughput for all the operations, except for the MAT with […]-cycle throughput. With the help of the multifunction unit together with the cascaded integer-float datapath, datapath reconfiguration, logarithmic-domain forwarding, and a […] hit-rate vertex cache, the processor achieves […] Mvertices/s for TFM and […] Mvertices/s for TnL at […] MHz. Compared with previous work, it shows […] and […] performance improvements for TFM and TnL, respectively, while reducing power and area overhead by […] and […], respectively.

REFERENCES

[1] Khronos Group, OpenGL-ES […], http://www.khronos.org
[2] C. H. Yu, et al., "A […]Mvertices/s Multi-threaded VLIW Vertex Processor for Mobile Multimedia Applications," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[3] B. G. Nam, et al., "A Low-Power Unified Arithmetic Unit for Programmable Handheld 3D Graphics Systems," in Proc. IEEE CICC, Sept. […].
[4] F. Lai, et al., "A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities," IEEE Trans. on Computer, Vol. […], No. […], August […].
[5] J. H. Sohn, et al., "A Fixed-point Multimedia Co-processor with […]Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications," in Proc. ESSCIRC, Sept. […].
[6] B. G. Nam, et al., "A […]mW 3D Graphics Processor with […]Mvertices/s Vertex Shader and […] Power Domains of Dynamic Voltage and Frequency Scaling," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[7] OpenGL ARB, http://www.opengl.org
[8] F. Arakawa, et al., "An Embedded Processor Core for Consumer Appliances with […]GFLOPS and […]Mpolygons/s," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[9] D. Kim, et al., "An SoC with […]Gtexels/s 3D Graphics Full Pipeline for Consumer Applications," IEEE JSSC, vol. […], no. […], pp. […], Jan. […].
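As an illustration of the hybrid number system described in Section II, the following Python sketch models the two converters in software and builds a few operations on top of them. It is only a behavioral sketch under stated assumptions, not the paper's hardware: the region count `REGIONS`, the uniform region boundaries, the chord-based coefficients, and all function names (`logc`, `alogc`, `hns_mul`, `hns_div`, `hns_dsq`, `hns_mat`) are hypothetical choices made for this example, and only positive operands are handled, since sign and zero are tracked separately from the logarithm.

```python
import math

REGIONS = 16  # assumed number of approximation regions (not taken from the paper)


def _segments(fn, n=REGIONS):
    """Chord coefficients (a, b) with fn(t) ~= a*t + b on n uniform regions of [0, 1)."""
    segs = []
    for i in range(n):
        t0, t1 = i / n, (i + 1) / n
        a = (fn(t1) - fn(t0)) / (t1 - t0)
        segs.append((a, fn(t0) - a * t0))
    return segs


LOG_SEGS = _segments(lambda m: math.log2(1.0 + m))   # log2(1+m) ~= a*m + b
ALOG_SEGS = _segments(lambda f: 2.0 ** f)            # 2^f       ~= a*f + b


def logc(x):
    """Logarithmic converter model: positive x = 2^e (1+m) -> e + (a*m + b)."""
    frac, exp = math.frexp(x)            # x = frac * 2^exp with 0.5 <= frac < 1
    e, m = exp - 1, 2.0 * frac - 1.0     # x = 2^e * (1 + m) with 0 <= m < 1
    a, b = LOG_SEGS[int(m * REGIONS)]
    return e + a * m + b


def alogc(X):
    """Antilogarithmic converter model: X = e + f -> 2^e * (a*f + b)."""
    e = math.floor(X)
    f = X - e
    a, b = ALOG_SEGS[min(int(f * REGIONS), REGIONS - 1)]
    return math.ldexp(a * f + b, e)


def hns_mul(x, y):
    """Multiplication becomes an addition in the logarithmic domain."""
    return alogc(logc(x) + logc(y))


def hns_div(x, y):
    """Division becomes a subtraction in the logarithmic domain."""
    return alogc(logc(x) - logc(y))


def hns_dsq(x, y):
    """Divide-by-square-root x / sqrt(y): subtract half of log2(y)."""
    return alogc(logc(x) - 0.5 * logc(y))


def hns_mat(log_rows, vec):
    """Matrix-vector product with matrix coefficients pre-converted by logc.

    Products are formed in the logarithmic domain; the final sums use
    ordinary floating-point addition, as in the hybrid scheme.
    """
    log_vec = [logc(v) for v in vec]
    return [sum(alogc(lc + lv) for lc, lv in zip(row, log_vec)) for row in log_rows]
```

With these assumed coefficients, powers of two convert exactly, while other values carry a small approximation error from the piecewise-linear segments; a real design would pick region boundaries and coefficients to meet its own error target.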
