A Low-Power Vector Processor Using Logarithmic Arithmetic for Handheld 3D Graphics Systems

Byeong-Gyu Nam and Hoi-Jun Yoo
Dept. of EECS, Korea Advanced Institute of Science and Technology (KAIST)
…, Guseong-dong, Yuseong-gu, Daejeon, …, Republic of Korea

Abstract: A low-power, high-performance …-way …-bit vector processor is developed for handheld 3D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By utilizing logarithmic arithmetic, the unit achieves single-cycle throughput for all these operations except for the matrix-vector multiplication, which has …-cycle throughput. The processor, featuring this function unit, a cascaded integer-float datapath, datapath reconfiguration, operand forwarding in the logarithmic domain, and a vertex cache, takes … mm² in … µm CMOS technology and achieves … Mvertices/s for geometry transformation and … Mvertices/s for OpenGL transformation and lighting at … MHz with … mW power consumption.

I. INTRODUCTION

Handheld graphics processing units (GPUs) incorporate vector processors, known as shaders, in their 3D graphics pipeline stages to provide more realistic images. In […], a vertex shader is proposed with …-way floating-point (FLP) multipliers for fast geometry transformation, but it consumed a large silicon area and power. For a power- and area-efficient design of the shaders, a multifunction unit was proposed in […]. However, it was a fixed-point unit and did not deal with the matrix-vector multiplication, which is frequently used for 3D geometry transformations.

In this paper, a …-way …-bit FLP vector processor is proposed for the shaders. It adopts a unified matrix, vector, and elementary function unit, which unifies all these operations in a single …-way arithmetic unit. The unit operates on FLP data since the newly defined graphics API requires more than …-bit FLP precision. Although it operates on FLP data, it uses logarithmic arithmetic internally to reduce the arithmetic complexity. Its instruction set includes matrix-vector multiplication (MAT); vector operations (VEC) such as vector multiplication (MUL), division (DIV), divide-by-square-root (DSQ), multiply-add (MAD), and dot product (DOT); and elementary functions (ELM) including trigonometric functions (TRGs), power (POW), and logarithm (LOG). It achieves single-cycle throughput with a maximum …-cycle latency for all these operations except for the MAT, which has …-cycle throughput and …-cycle latency.

The processor has a cascaded structure of integer and FLP datapaths for efficient indexing of FLP operands. Configuration of the FLP multifunction unit is supported to extend the instruction set according to the user's requirements. Its pipeline exploits operand forwarding in the logarithmic domain to improve pipeline throughput and computation accuracy. A vertex cache is adopted to reuse previously processed results and improve the throughput of the processor.

II. ARITHMETIC UNIT

A. Number System

The proposed arithmetic unit is based on the hybrid approach of the FLP and the logarithmic number system (LNS) introduced in […], where operations are reduced into simpler ones in the LNS while addition and subtraction are performed in FLP, since LNS addition and subtraction require nonlinear term evaluations. Logarithmic and antilogarithmic converters between the FLP and the LNS are proposed for this hybrid number system (HNS).

[Fig. …: Proposed number converters. (a) Logarithmic converter. (b) Antilogarithmic converter.]

1) Logarithmic Converter: For the …-bit FLP input $x = 2^{e}(1+m)$, its logarithmic number is represented as $\log_2 x = e + \log_2(1+m)$. The exponent $e$ is the integer part of the logarithmic number and $\log_2(1+m)$ is the fractional part. The $\log_2(1+m)$ term is approximated by piecewise linear expressions as $\log_2(1+m) \approx a \cdot m + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The logarithmic converter divides the $(1+m)$ range into finer subdivisions around the region near … since the error increases as the input value gets closer to …. It achieves a maximum conversion error of … with … approximation regions. Fig. …(a) shows the proposed logarithmic converter.

2) Antilogarithmic Converter: For the logarithmic number $X$, its FLP number can be represented by $2^{X} = 2^{e+f} = 2^{e} \cdot 2^{f}$. The integer part $e$ directly becomes the exponent of the FLP number, and the nonlinear term $2^{f}$ is approximated by piecewise linear expressions like $2^{f} \approx a \cdot f + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The antilogarithmic converter evenly divides $f$ into the approximation regions since the antilogarithmic conversion error is spread evenly over the entire input region. It achieves a maximum conversion error of … with … approximation regions. Fig. …(b) shows the antilogarithmic converter.

B. Organization

The arithmetic unit is organized with … channels and … pipeline stages as shown in Fig. …. The … input …-bit FLP operands are converted into logarithmic numbers with a …-bit integer, …-bit fraction, …-bit sign, and …-bit zero through the logarithmic converters (LOGCs) in the E… stage. The E… stage includes the programmable multiplier (PMUL) shown in Fig. …, which can be used as a Booth multiplier for ELM, as LOGCs for VEC, or as antilogarithmic converters (ALOGCs) for MAT, just by adding …-entry …-B LOG and …-entry …-B ALOG lookup tables to the Booth multiplier and sharing the CSA tree and a CPA. In this way, the number of LOGCs in E… is reduced to …, which were … in […]. In E…, adders in the logarithmic domain are provided for the VECs, and the resulting values are converted into FLP numbers through the ALOGCs. The programmable adder (PADD) in the E… stage can be programmed into a …-input FLP adder tree or …-way …-input SIMD FLP adders for target operations, as proposed in […]. The E… stage provides a SIMD FLP accumulator for the final accumulation required for the MAT. It is also used as rounding logic for other operations.

[Fig. …: Proposed arithmetic unit.]

1) Matrix-Vector Multiplication: The geometry transformation in 3D graphics can be computed by the multiplication of a …×… matrix with a …-element vector, which requires … multiplications and … additions. This can be converted into the HNS as in Fig. …, requiring … LOGCs, … adders, … ALOGCs, and … FLP adders. Since the coefficients of a geometry transformation matrix are fixed during the processing of a 3D object, they can be preconverted into the logarithmic domain and used as constants during the processing. Thus, the MAT only requires … LOGCs for vector element conversion, … adders in the logarithmic domain, … ALOGCs, and … FLP adders. This can be implemented in two phases on this …-way arithmetic unit as illustrated in Fig. …. In this scheme, … adders in the logarithmic domain and … ALOGCs are required per phase, and the … ALOGCs in the first phase are obtained from … ALOGCs in E… by programming the PMUL into … ALOGCs together with the … ALOGCs in E…. The CPAs in E… and E… are used for the adders in the logarithmic domain. The … multiplication results from the ALOGCs in E… and the other … from the E… are added in the E… by programming the PADD into a …-way SIMD FLP adder to get the first-phase result. With the same process repeated, the accumulation with the first-phase result in E… completes the MAT. Thus, the MAT is implemented with …-cycle throughput on this …-way arithmetic unit, whereas it was implemented with …-cycle throughput in the conventional way.

[Fig. …: Programmable multiplier (PMUL).]

[Fig. …: Two-phase implementation of MAT.]

2) Vector and Elementary Functions: The VECs and ELMs, including MUL, DIV, DSQ, MAD, DOT, POW, LOG, and TRGs, are implemented based on the scheme in […]. Since the vector operations require … LOGCs for … operands per channel, the PMUL is programmed into … LOGCs to make up the … LOGCs for … channels together with the … LOGCs in E…. Since the power is converted into a multiplication in the logarithmic domain, it requires a …b×…b multiplier, which is implemented by programming the PMUL into a single …b×…b BMUL. The TRGs are unified with the others using the Taylor series as proposed in […]. This power series requires a …-way …b×…b multiplier in the logarithmic domain and a final summation of these terms. This can be implemented by programming the PMUL into a …-way …b×…b BMUL and the PADD into a …-input summation tree.

III. PROCESSOR ARCHITECTURE

A. Instruction Set Architecture

The processor has two types of instruction format, with … bits and … bits. The …-bit instructions are used for control and integer (INT) operations, while the …-bit instructions are used for the FLP operations. This separation of formats, rather than a single …-bit format, results in reduced instruction memory (IMEM) size and power dissipation. In a loop execution, the register indices for FLP operands may be calculated dynamically as a function of the loop counter i as follows:

OPR  Dst i…kd, SrcA i…ka, SrcB i…kb, SrcC i…kc

Thus, the INT instructions calculating the indices can be embedded in the FLP instruction as the FLP operand fields for efficient calculation of the indices. In this case, the INT computation is followed by the FLP computation within a single …-bit instruction, shown in Fig. ….

[Fig. …: …-bit FLP instruction format.]

B. Microarchitecture

Fig. … shows the microarchitecture of the proposed processor. It has …-way …-bit FLP vector register files including a …-byte …-entry vertex input register (VIR), a general-purpose register file (GPR), a …-KB …-entry constant memory (CMEM), and a …-byte …-entry vertex output register (VOR). The FLP operands are fetched from the GPR, VIR, or CMEM, and the result is written back to the GPR or VOR, indexing the target entry in the destination register file. The FLP operands can be swizzled, negated, and converted into their absolute values by the source modifiers. The arithmetic unit proposed in Section II is used as the FLP multifunction unit in this processor. The vertex cache is composed of … VORs.

1) Cascaded INT-FLP Datapaths: This processor has a cascaded architecture of …-way …-bit INT and …-way …-bit FLP datapaths to implement the embedded index calculation of the FLP operands without using additional cycles. For flexible index calculation, this processor includes a …-way …-bit SIMD integer ALU and a …-byte …-entry integer register file (IGPR). The … …-bit results from this unit index the source operands and destination of an FLP operation. When a multiplication is required for the index calculation, the PMUL in the FLP arithmetic unit is programmed into a …b×…b integer multiply-add (IMAD) unit, since conversion error is not allowed in the index calculation.

2) Datapath Reconfiguration: The FLP unit in Section II has several MUXes to unify various operations and exposes all the control points to programmers so that they can compose arbitrary operations on it by programming the …-bit control signals. For example, the various TRGs supported in this processor can be programmed by a single configuration instruction (CFG) with some configuration data, rather than including all of the TRGs in the limited instruction space. The programmed configuration data are stored in the …-byte …-entry configuration register file (CFR) and accessed by the CFG. Thus, the configuration can be changed in every cycle.

[Fig. …: Microarchitecture of the proposed processor.]

3) Logarithmic-Domain Forwarding: Operand forwarding improves the throughput of the processor pipeline. In this processor, operand forwarding is also supported in the logarithmic domain. As shown in Fig. …, in the case of consecutive FLP operations that do not require the final FLP adders, the antilogarithmic and logarithmic conversions cancel each other, and the intermediate logarithmic value of the previous FLP operation is forwarded directly into the logarithmic domain of the next FLP operation, bypassing the antilogarithmic and logarithmic converters of the two operations. This reduces the pipeline latency and the computation error by bypassing the repeated antilogarithmic and logarithmic conversions, which are the sources of errors.

[Fig. …: Logarithmic-domain forwarding, compared with conventional forwarding.]

4) Vertex Cache: A transformation and lighting (TnL) vertex cache is provided to reuse previously processed vertices without executing the TnL routine. The … KB SRAM can contain … result vertices with a … hit rate. This leads to a single-cycle TnL for the vertices in the vertex cache and a peak geometry transformation (TFM) of … Mvertices/s together with the …-cycle MAT.

IV. IMPLEMENTATION RESULTS

A. Chip Implementation

The proposed processor is integrated into a 3D graphics chip as a vertex shader and fabricated in … µm …-metal CMOS technology. The chip micrograph and a summary of chip characteristics are shown in Fig. …. The core size is … mm², and it operates at … MHz, consuming … mW at … V. Fig. … shows the shmoo plot. The clock to the pipeline registers is fully gated under the control of each instruction to disable unnecessary switching. All the FLP instructions are processed with single-cycle throughput, except for the MAT with …-cycle throughput. Table I shows the latencies and throughputs of FLP instructions.

[Fig. …: Chip micrograph.]

[Fig. …: Shmoo plot (axis: operating frequency in MHz).]

[TABLE I: The latency and throughput of selected instructions. Columns: ADD, DIV, DSQ, MAD, DOT, MAT, POW, SIN. Rows: Throughput, Latency.]

B. Comparison

The performance is compared for the full OpenGL TnL routine with modelview, normal, and perspective transformations; normalizations of the light, view, normal, and Blinn half vectors; and intensity calculations of diffuse and specular lighting for a single light source. The routine includes … MATs, … DSQs, … DIVs, … DOTs, … MADs, … POW, and … ADDs. After rescheduling the code to avoid dependencies, the routine requires … execution cycles on the proposed processor. Table II shows the comparison results; the peak TFM performance is also compared. Our work shows … and … performance improvements for TFM and TnL, respectively, while reducing power and area overhead by … and …, respectively, relative to the latest work.

[TABLE II: The comparison results. Columns: Ref., Performance in Mvertices/s (TFM, TnL), Power (mW), Area (mm²), Freq. (MHz), Process (µm). The last row is this work.]

V. CONCLUSION

A high-performance, power- and area-efficient vector processor is proposed for 3D graphics shaders. It adopts a …-way …-bit FLP multifunction unit. Using logarithmic arithmetic, the unit unifies the vector, matrix, and elementary functions into a single arithmetic unit and achieves single-cycle throughput for all the operations, except for the MAT with …-cycle throughput. With the help of the multifunction unit together with the cascaded integer-float datapath, datapath reconfiguration, logarithmic-domain forwarding, and a vertex cache with a … hit rate, the processor achieves … Mvertices/s for TFM and … Mvertices/s for TnL at … MHz. Compared with previous work, it shows … and … performance improvements for TFM and TnL, respectively, while reducing power and area overhead by … and …, respectively.

REFERENCES

[1] Khronos Group, OpenGL ES …, http://www.khronos.org
[2] C.-H. Yu, et al., "A … Mvertices/s Multi-threaded VLIW Vertex Processor for Mobile Multimedia Applications," in IEEE ISSCC Dig. Tech. Papers, Feb. ….
[3] B.-G. Nam, et al., "A Low-Power Unified Arithmetic Unit for Programmable Handheld 3D Graphics Systems," in Proc. IEEE CICC, Sept. ….
[4] F. Lai, et al., "A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities," IEEE Trans. on Computer, vol. …, no. …, August ….
[5] J.-H. Sohn, et al., "A Fixed-point Multimedia Co-processor with … Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications," in Proc. ESSCIRC, Sept. ….
[6] B.-G. Nam, et al., "A … mW 3D Graphics Processor with … Mvertices/s Vertex Shader and … Power Domains of Dynamic Voltage and Frequency Scaling," in IEEE ISSCC Dig. Tech. Papers, Feb. ….
[7] OpenGL ARB, http://www.opengl.org
[8] F. Arakawa, et al., "An Embedded Processor Core for Consumer Appliances with … GFLOPS and … Mpolygons/s," in IEEE ISSCC Dig. Tech. Papers, Feb. ….
[9] D. Kim, et al., "An SoC with … Gtexels/s 3D Graphics Full Pipeline for Consumer Applications," IEEE JSSC, vol. …, no. …, pp. …, Jan. ….
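
The hybrid number system described in the arithmetic unit section can be illustrated in software. The following is a minimal Python sketch, not the authors' hardware design. Several things in it are assumptions made only for illustration: the region count `REGIONS = 16`, the uniform region split used for both converters (the paper describes finer subdivisions in part of the range for the logarithmic converter), the chord-based choice of the coefficients a and b, double-precision arithmetic in place of fixed-width fields, and every function name. It shows the piecewise linear logarithmic and antilogarithmic conversions, multiplication and division as log-domain addition and subtraction, and a matrix-vector product whose matrix coefficients are converted to the logarithmic domain once and reused.

```python
import math

REGIONS = 16  # assumed number of approximation regions (hypothetical value)


def _chord_coeffs(func, regions):
    """Return (a, b) per region so that func(t) ~ a*t + b on equal slices of [0, 1)."""
    coeffs = []
    for i in range(regions):
        lo, hi = i / regions, (i + 1) / regions
        a = (func(hi) - func(lo)) / (hi - lo)  # slope of the chord
        coeffs.append((a, func(lo) - a * lo))  # intercept through the left knot
    return coeffs


LOG_COEFFS = _chord_coeffs(lambda m: math.log2(1.0 + m), REGIONS)
ALOG_COEFFS = _chord_coeffs(lambda f: 2.0 ** f, REGIONS)


def to_log(x):
    """FLP -> (sign, is_zero, approx log2|x|), using x = 2^e * (1 + m)."""
    if x == 0.0:
        return (1.0, True, 0.0)
    frac, exp = math.frexp(abs(x))        # |x| = frac * 2^exp, frac in [0.5, 1)
    m = 2.0 * frac - 1.0                  # mantissa fraction in [0, 1)
    a, b = LOG_COEFFS[min(int(m * REGIONS), REGIONS - 1)]
    return (math.copysign(1.0, x), False, (exp - 1) + a * m + b)


def from_log(v):
    """(sign, is_zero, X) -> FLP, using 2^X = 2^e * 2^f with 2^f ~ a*f + b."""
    sign, is_zero, big_x = v
    if is_zero:
        return 0.0
    e = math.floor(big_x)                 # integer part becomes the exponent
    f = big_x - e                         # fractional part in [0, 1]
    a, b = ALOG_COEFFS[min(int(f * REGIONS), REGIONS - 1)]
    return sign * math.ldexp(a * f + b, e)


def log_mul(u, v):
    """Multiplication becomes an addition in the logarithmic domain."""
    return (u[0] * v[0], u[1] or v[1], u[2] + v[2])


def log_div(u, v):
    """Division becomes a subtraction in the logarithmic domain."""
    if v[1]:
        raise ZeroDivisionError("division by zero")
    return (u[0] * v[0], u[1], u[2] - v[2])


def mat_vec(log_matrix, vec):
    """Matrix-vector product: log-domain adds for products, ordinary adds for sums."""
    log_vec = [to_log(x) for x in vec]    # only the vector is converted per call
    return [sum(from_log(log_mul(c, lx)) for c, lx in zip(row, log_vec))
            for row in log_matrix]
```

As a usage example under the same assumptions, a transformation matrix can be converted once with `[[to_log(c) for c in row] for row in matrix]` and then passed to `mat_vec` for every vertex, which mirrors the preconversion of matrix coefficients described in the text. Chaining `log_mul` calls and applying `from_log` only at the end corresponds to keeping intermediate values in the logarithmic domain between consecutive operations.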
