
A Low-Power Vector Processor Using Logarithmic Arithmetic for Handheld 3D Graphics Systems


Byeong-Gyu Nam and Hoi-Jun Yoo
Dept. of EECS, Korea Advanced Institute of Science and Technology (KAIST), Guseong-dong, Yuseong-gu, Daejeon, […], Republic of Korea

Abstract

A low-power, high-performance […]-way […]-bit vector processor is developed for handheld 3D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By utilizing the logarithmic arithmetic, the unit achieves single-cycle throughput for all these operations except for the matrix-vector multiplication with […]-cycle throughput. The processor featured by this function unit, cascaded integer-float datapath, reconfiguration of datapath, operand forwarding in logarithmic domain, and vertex cache takes […] mm² in […] µm CMOS technology and achieves […] Mvertices/s for geometry transformation and […] Mvertices/s for OpenGL transformation and lighting at […] MHz with […] mW power consumption.

I. INTRODUCTION

The handheld graphics processing units (GPUs) incorporate vector processors, known as shaders, in their 3D graphics pipeline stages to provide more realistic images. In […], a vertex shader is proposed with […]-way floating-point (FLP) multipliers for the fast geometry transformation and it consumed a large silicon area and power. For the power- and area-efficient design of the shaders, a multifunction unit was proposed in […]. However, it was a fixed-point unit and didn't deal with the matrix-vector multiplication, frequently used for 3D geometry transformations.

In this paper, a […]-way […]-bit FLP vector processor is proposed for the shaders. It adopts a unified matrix, vector, and elementary function unit, which unifies all these operations in a single […]-way arithmetic unit. The unit operates on the FLP data since the newly defined graphics API requires more than […]-bit FLP precision. Although it operates on the FLP data, it uses the logarithmic arithmetic for the internal arithmetic to reduce the arithmetic complexity. Its instruction set includes matrix-vector multiplication (MAT), vector operations (VEC) like vector multiplication (MUL), division (DIV), divide-by-square-root (DSQ), multiply-add (MAD) and dot-product (DOT), and elementary functions (ELM) including trigonometric functions (TRGs), power (POW), and logarithm (LOG). It achieves single-cycle throughput with maximum […]-cycle latency for all these operations except for the MAT with […]-cycle throughput and […]-cycle latency.

The processor has a cascaded structure of integer and FLP datapaths for efficient indexing of FLP operands. The configuration of the FLP multifunction unit is supported to extend the instruction set according to user's requirements. Its pipeline exploits the operand forwarding in the logarithmic domain to improve pipeline throughput and the computation accuracy. A vertex cache is adopted to reuse the previously processed results and improves throughput of the processor.

II. ARITHMETIC UNIT

A. Number System

The proposed arithmetic unit is based on the hybrid approach of the FLP and the logarithmic number system (LNS) introduced in […], where operations are reduced into simpler ones in the LNS while the addition and subtraction are performed in FLP since the LNS addition and subtraction require nonlinear term evaluations. The logarithmic and antilogarithmic converters between the FLP and the LNS are proposed for this hybrid number system (HNS).

[Fig. 1. Proposed number converters: (a) logarithmic converter; (b) antilogarithmic converter.]

1) Logarithmic Converter: For the […]-bit FLP input $x = 2^{e}(1+m)$, its logarithmic number is represented as $\log_2 x = e + \log_2(1+m)$. The exponent $e$ is the integer part of the logarithmic number and $\log_2(1+m)$ is the fractional part. The $\log_2(1+m)$ is approximated by piecewise linear expressions as $\log_2(1+m) \approx a \cdot m + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The logarithmic converter divides the range of $m$ into finer subdivisions around the region near to […] since the error increases as the input value gets closer to […]. It achieves […] maximum conversion error with […] approximation regions. Fig. 1(a) shows the proposed logarithmic converter.

2) Antilogarithmic Converter: For the logarithmic number $X$, its FLP number can be represented by $2^{X} = 2^{e+f} = 2^{e} \cdot 2^{f}$. The integer part $e$ directly becomes the exponent of the FLP number and the nonlinear term $2^{f}$ is approximated by piecewise linear expressions like $2^{f} \approx a \cdot f + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The antilogarithmic converter evenly divides the $f$ into the approximation regions since the antilogarithmic conversion error is spread over the entire input region evenly. It achieves […] maximum conversion error with […] approximation regions. Fig. 1(b) shows the antilogarithmic converter.

B. Organization

The arithmetic unit is organized with […] channels and […] pipeline stages as shown in Fig. 2. The […] of input […]-bit FLP operands are converted into the logarithmic number with […]-bit integer, […]-bit fraction, […]-bit sign, and […]-bit zero through the logarithmic converters (LOGCs) in the E[…]. The E[…] includes the programmable multiplier (PMUL) as shown in Fig. 3, which can be used for the Booth multiplier for ELM, LOGCs for VEC, or antilogarithmic converters (ALOGCs) for MAT by just adding […]-entry […]B LOG and […]-entry […]B ALOG lookup tables to the Booth multiplier and sharing the CSA tree and a CPA. In this way, the number of LOGCs in E[…] is reduced to […], which were […] in […]. In E[…], […] adders in logarithmic domain are provided for the VECs and the resulting values are converted into the FLP numbers through the […] ALOGCs. The programmable adder (PADD) in the E[…] can be programmed into a […]-input FLP adder tree or […]-way […]-input SIMD FLP adders for target operations as proposed in […]. The E[…] provides a SIMD FLP accumulator for the final accumulation required for the MAT. It is also used as a rounding logic for other operations.

[Fig. 2. Proposed arithmetic unit.]

1) Matrix-Vector Multiplication: The geometry transformation in 3D graphics can be computed by the multiplication of […]×[…] matrix with […]-element vector, which requires […] multiplications and […] additions. This can be converted into the HNS as in Fig. 4 requiring […] LOGCs, […] adders, […] ALOGCs, and […] FLP adders. Since the coefficients of a geometry transformation matrix are fixed during processing of 3D object, these can be pre-converted into the logarithmic domain and used as constants during the processing. Thus, the MAT only requires […] LOGCs for vector element conversion, […] adders in logarithmic domain, […] ALOGCs, and […] FLP adders. This can be implemented in 2 phases on this […]-way arithmetic unit as illustrated in Fig. 4. In this scheme, […] adders in logarithmic domain and […] ALOGCs are required per phase and the […] ALOGCs in the first phase are obtained from […] ALOGCs in E[…] by programming the PMUL into […] ALOGCs together with the […] ALOGCs in E[…]. The CPAs in E[…] and E[…] are used for the […] adders in logarithmic domain. The […] multiplication results from the ALOGCs in E[…] and the other […] from the E[…] are added in the E[…] by programming the PADD into […]-way SIMD FLP adder to get the first phase result. With the same process repeated, the accumulation with the first phase result in E[…] completes the MAT. Thus, the MAT is implemented with […]-cycle throughput on this […]-way arithmetic unit, where it was implemented with […]-cycle throughput in conventional way.

[Fig. 3. Programmable multiplier (PMUL).]

[Fig. 4. Two-phase implementation of MAT.]

2) Vector and Elementary Functions: The VECs and ELMs including the MUL, DIV, DSQ, MAD, DOT, POW, LOG, and TRGs are implemented based on the scheme in […]. Since the vector operations require […] LOGCs for […] operands per channel, the PMUL is programmed into […] LOGCs to make the […] LOGCs for […] channels together with the […] LOGCs in E[…]. Since the power is converted into the multiplication in logarithmic domain, it requires a […]b×[…]b multiplier and it is implemented by programming the PMUL into a single […]b×[…]b BMUL. The TRGs are unified with others using the Taylor series as proposed in […]. This power series requires a […]-way […]b×[…]b multiplier in logarithmic domain and final summation of these terms. This can be implemented by programming the PMUL into […]-way […]b×[…]b BMUL and the PADD into a […]-input summation tree.

III. PROCESSOR ARCHITECTURE

A. Instruction Set Architecture

The processor has two types of instruction format with […] bits and […] bits. The […]-bit instructions are used for control and integer (INT) operations, while the […]-bit instructions are used for the FLP operations. This separation of format rather than a single […]-bit format results in reduced instruction memory (IMEM) size and power dissipation. In a loop execution, the register indices for FLP operands may be calculated dynamically as a function of the loop counter i as follows.

    OP R Dst i kd, SrcA i ka, SrcB i kb, SrcC i kc

Thus, the INT instructions calculating the indices can be embedded in the FLP instruction as the FLP operand fields for efficient calculation of the indices. In this case, INT computation is followed by the FLP computation by a single […]-bit instruction shown in Fig. 5.

[Fig. 5. […]-bit FLP instruction format.]

B. Microarchitecture

Fig. 6 shows the microarchitecture for the proposed processor. It has […]-way […]-bit FLP vector register files including […]-byte […]-entry vertex input register (VIR), general purpose register file (GPR), […] KB […]-entry constant memory (CMEM), and […]-byte […]-entry vertex output register (VOR). The FLP operands are fetched from the GPR, VIR, or CMEM and the result is written back to the GPR or VOR indexing the target entry in the destination register file. The FLP operands can be swizzled, negated, and converted into the absolute value by the source modifiers. The arithmetic unit proposed in Section II is used as the FLP multifunction unit in this processor. The vertex cache is composed of […] VORs.

1) Cascaded INT-FLP Datapaths: This processor has a cascaded architecture of […]-way […]-bit INT and […]-way […]-bit FLP datapaths to implement the embedded index calculation of the FLP operands without using additional cycles. For flexible index calculation, this processor includes […]-way […]-bit SIMD integer ALU and […]-byte […]-entry integer register file (IGPR). The […] of […]-bit results from this unit index […] source operands and […] destination of a FLP operation. When a multiplication is required for the index calculation, the PMUL in FLP arithmetic unit is programmed into a […]b×[…]b integer multiply-add (IMAD) unit since the conversion error is not allowed for the index calculation.

2) Datapath Reconfiguration: The FLP unit in Section II has several MUXes to unify various operations and reveals all the control points to the programmer so that they can make arbitrary operations on it by programming the […]-bit control signals. For example, various TRGs supported in this processor can be programmed by a single configuration instruction (CFG) with some configuration data rather than including all of the TRGs in the limited instruction space. The programmed configuration data are stored in the […]-byte […]-entry configuration register file (CFR) and accessed by the CFG. Thus, the configuration can be changed in every cycle.

[Fig. 6. Microarchitecture of proposed processor. Labelled blocks: IMEM, GPR, IGPR, source modifiers (SWZ/NEG/ABS), FLP multifunction unit, ALU, CMEM, VIR, DMEM, CFR, vertex cache (VORs).]

3) Logarithmic-Domain Forwarding: The operand forwarding improves the throughput of the processor pipeline. In this processor, the operand forwarding is also supported in the logarithmic domain. As shown in Fig. 7, in the case of consecutive FLP operations without requiring final FLP adders, the antilogarithmic and logarithmic conversions cancel each other and the intermediate logarithmic value of the previous FLP operation is forwarded directly into the logarithmic domain of the next FLP operation, bypassing the antilogarithmic and logarithmic converters of two operations. This reduces the pipeline latency and the computation error by bypassing the repeated antilogarithmic and logarithmic conversions, which are the sources of errors.

[Fig. 7. Logarithmic domain forwarding, shown against conventional forwarding.]

4) Vertex Cache: A transformation and lighting (TnL) vertex cache is provided to reuse the previously processed vertices without executing the TnL routine. The […] KB SRAM can contain […] result vertices with […] hit rate. This leads to a single-cycle TnL for the vertices in the vertex cache and peak geometry transformation (TFM) of […] Mvertices/s together with the […]-cycle MAT.

IV. IMPLEMENTATION RESULTS

A. Chip Implementation

The proposed processor is integrated into a 3D graphics chip as a vertex shader and fabricated in […] µm […]-metal CMOS technology. The chip micrograph and summary of chip characteristics are shown in Fig. 8. The core size is […] mm² and operates at […] MHz consuming […] mW at […] V. Fig. 9 shows the shmoo. The clock to pipeline registers is fully gated to disable unnecessary switching under the control of each instruction. All the FLP instructions are processed with single-cycle throughput, except for the MAT with […]-cycle throughput. Table I shows the latencies and throughputs for FLP instructions.

[Fig. 8. Chip micrograph.]

[Fig. 9. Shmoo. Axis label: Operating Frequency (MHz).]

TABLE I. THE LATENCY/THROUGHPUT OF SELECTED INSTRUCTIONS

             ADD   DIV   DSQ   MAD   DOT   MAT   POW   SIN
Throughput   […]   […]   […]   […]   […]   […]   […]   […]
Latency      […]   […]   […]   […]   […]   […]   […]   […]

B. Comparison

The performance is compared for the full OpenGL TnL routine with modelview, normal, and perspective transformations, normalizations of light, view, normal, and Blinn half vectors, and intensity calculations of diffuse and specular lightings for a single light source. The routine includes […] MATs, […] DSQs, […] DIVs, […] DOTs, […] MADs, […] POW, and […] ADDs. After rescheduling of the code to avoid the dependencies, the required execution cycle for it on the proposed processor is […] cycles. Table II shows the comparison results and the peak TFM performance is also compared. Our work shows […] and […] performance improvements for TFM and TnL, respectively, while reducing […] and […] power and area overhead, respectively, from the latest work […].

TABLE II. THE COMPARISON RESULTS

Ref.        TFM (Mvertices/s)   TnL (Mvertices/s)   Power (mW)   Area (mm²)   Freq. (MHz)   Process (µm)
[…]         […] (three entries of the prior works are N/A)
This work   […]                 […]                 […]          […]          […]           […]

V. CONCLUSION

A high-performance, power- and area-efficient vector processor is proposed for 3D graphics shaders. It adopts a […]-way […]-bit FLP multifunction unit. Using logarithmic arithmetic, the unit unifies the vector, matrix, and elementary functions into a single arithmetic unit and achieves single-cycle throughput for all the operations, except for the MAT with […]-cycle throughput. With the help of the multifunction unit together with cascaded integer-float datapath, datapath reconfiguration, logarithmic domain forwarding, and […] hit-rate vertex cache, the processor achieves […] Mvertices/s for TFM and […] Mvertices/s for TnL at […] MHz. Comparing with previous work, it shows […] and […] performance improvements for TFM and TnL, respectively, while reducing […] and […] power and area overhead, respectively.

REFERENCES

[1] Khronos Group, OpenGL ES, http://www.khronos.org
[2] C. H. Yu, et al., "A […] Mvertices/s Multithreaded VLIW Vertex Processor for Mobile Multimedia Applications," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[3] B. G. Nam, et al., "A Low-Power Unified Arithmetic Unit for Programmable Handheld 3D Graphics Systems," in Proc. IEEE CICC, Sept. […].
[4] F. Lai, et al., "A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities," IEEE Trans. on Computer, Vol. […], No. […], August […].
[5] J. H. Sohn, et al., "A Fixed-point Multimedia Co-processor with […] Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications," in Proc. ESSCIRC, Sept. […].
[6] B. G. Nam, et al., "A […] mW 3D Graphics Processor with […] Mvertices/s Vertex Shader and […] Power Domains of Dynamic Voltage and Frequency Scaling," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[7] OpenGL ARB, http://www.opengl.org
[8] F. Arakawa, et al., "An Embedded Processor Core for Consumer Appliances with […] GFLOPS and […] Mpolygons/s," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[9] D. Kim, et al., "An SoC with […] Gtexels/s 3D Graphics Full Pipeline for Consumer Applications," IEEE JSSC, vol. […], no. […], pp. […], Jan. […].
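
Illustrative sketch of the hybrid number system (Section II). The following Python model is not the authors' hardware. It only mirrors the idea described in the text: an FLP value is converted to a logarithmic number with a piecewise-linear approximation of log2(1+m), multiplication becomes an addition in the logarithmic domain, the result is converted back with a piecewise-linear approximation of 2^f, and the final accumulation of a matrix-vector product is done in FLP with matrix coefficients pre-converted to the logarithmic domain. The region count (REGIONS = 16), the equal-width regions for both converters, the secant-line coefficients, and all function names are assumptions made for this example; the paper uses finer, non-uniform regions in its logarithmic converter and its actual coefficients are not given here.

```python
import math

REGIONS = 16  # hypothetical region count, chosen only for this sketch


def _segments(fn, n):
    """Secant-line (a, b) coefficients of fn on n equal-width regions of [0, 1)."""
    segs = []
    for i in range(n):
        lo, hi = i / n, (i + 1) / n
        a = (fn(hi) - fn(lo)) / (hi - lo)
        segs.append((a, fn(lo) - a * lo))
    return segs


LOG_SEGS = _segments(lambda m: math.log2(1.0 + m), REGIONS)   # log2(1+m) ~ a*m + b
ALOG_SEGS = _segments(lambda f: 2.0 ** f, REGIONS)            # 2**f ~ a*f + b


def to_log(x):
    """FLP -> logarithmic number, returned as (sign, is_zero, log2|x|)."""
    if x == 0.0:
        return (1, True, 0.0)
    mant, exp = math.frexp(abs(x))        # |x| = mant * 2**exp, mant in [0.5, 1)
    e, m = exp - 1, mant * 2.0 - 1.0      # |x| = 2**e * (1 + m), m in [0, 1)
    a, b = LOG_SEGS[min(int(m * REGIONS), REGIONS - 1)]
    return (-1 if x < 0 else 1, False, e + a * m + b)


def from_log(sign, is_zero, big_x):
    """Logarithmic number -> FLP: 2**X = 2**e * 2**f with 2**f ~ a*f + b."""
    if is_zero:
        return 0.0
    e = math.floor(big_x)
    f = big_x - e
    a, b = ALOG_SEGS[min(int(f * REGIONS), REGIONS - 1)]
    return sign * math.ldexp(a * f + b, e)


def hns_mul(p, q):
    """Multiplication reduces to an addition of the logarithmic values."""
    return (p[0] * q[0], p[1] or q[1], p[2] + q[2])


def hns_mat_vec(log_matrix, vec):
    """Matrix-vector product: log-domain multiplies, FLP accumulation.

    log_matrix holds coefficients already converted with to_log, since they
    stay fixed while an object is processed.
    """
    log_vec = [to_log(v) for v in vec]
    return [sum(from_log(*hns_mul(c, x)) for c, x in zip(row, log_vec))
            for row in log_matrix]
```

With these assumed 16 regions, a call such as `from_log(*hns_mul(to_log(3.0), to_log(-2.5)))` lands within about one percent of -7.5; the residual error comes from the two linear approximations, which is the error source that the logarithmic-domain forwarding of Section III avoids repeating between consecutive operations.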
