
A Low-Power Vector Processor Using Logarithmic Arithmetic for Handheld 3D Graphics Systems


Uploaded by: rapanda, 2013-12-10


Byeong-Gyu Nam and Hoi-Jun Yoo
Dept. of EECS, Korea Advanced Institute of Science and Technology (KAIST), Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea

Abstract

A low-power, high-performance […]-way […]-bit vector processor is developed for handheld 3D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By utilizing logarithmic arithmetic, the unit achieves single-cycle throughput for all these operations except for the matrix-vector multiplication, which has […]-cycle throughput. The processor, which features this function unit, a cascaded integer-float datapath, datapath reconfiguration, operand forwarding in the logarithmic domain, and a vertex cache, takes […] mm² in […] µm CMOS technology and achieves […] Mvertices/s for geometry transformation and […] Mvertices/s for OpenGL transformation and lighting at […] MHz with […] mW power consumption.

I. INTRODUCTION

Handheld graphics processing units (GPUs) incorporate vector processors, known as shaders, in their 3D graphics pipeline stages to provide more realistic images. In […], a vertex shader was proposed with […]-way floating-point (FLP) multipliers for fast geometry transformation, but it consumed a large silicon area and power. For the power- and area-efficient design of shaders, a multifunction unit was proposed in […]. However, it was a fixed-point unit and did not deal with the matrix-vector multiplication, which is frequently used for 3D geometry transformations.

In this paper, a […]-way […]-bit FLP vector processor is proposed for the shaders. It adopts a unified matrix, vector, and elementary function unit, which unifies all these operations in a single […]-way arithmetic unit. The unit operates on FLP data since the newly defined graphics API requires more than […]-bit FLP precision. Although it operates on FLP data, it uses logarithmic arithmetic internally to reduce the arithmetic complexity. Its instruction set includes matrix-vector multiplication (MAT), vector operations (VEC) like vector multiplication (MUL), division (DIV), divide-by-square-root (DSQ), multiply-add (MAD) and dot product (DOT), and elementary functions (ELM) including trigonometric functions (TRGs), power (POW), and logarithm (LOG). It achieves single-cycle throughput with maximum […]-cycle latency for all these operations except for the MAT, which has […]-cycle throughput and […]-cycle latency.

The processor has a cascaded structure of integer and FLP datapaths for efficient indexing of FLP operands. Configuration of the FLP multifunction unit is supported to extend the instruction set according to the user's requirements. Its pipeline exploits operand forwarding in the logarithmic domain to improve the pipeline throughput and the computation accuracy. A vertex cache is adopted to reuse previously processed results and improve the throughput of the processor.

II. ARITHMETIC UNIT

A. Number System

The proposed arithmetic unit is based on the hybrid approach of the FLP and the logarithmic number system (LNS) introduced in […], where operations are reduced into simpler ones in the LNS while the addition and subtraction are performed in FLP, since the LNS addition and subtraction require nonlinear term evaluations. Logarithmic and antilogarithmic converters between the FLP and the LNS are proposed for this hybrid number system (HNS).

Fig. 1. Proposed number converters: (a) logarithmic converter, (b) antilogarithmic converter.

1) Logarithmic Converter

For the […]-bit FLP input $x = 2^{e}(1+m)$, its logarithmic number is represented as $\log_2 x = e + \log_2(1+m)$. The exponent $e$ is the integer part of the logarithmic number and $\log_2(1+m)$ is the fractional part. The term $\log_2(1+m)$ is approximated by piecewise linear expressions as $\log_2(1+m) \approx a \cdot m + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The logarithmic converter divides $(1+m)$ into finer subdivisions around the region near […], since the error increases as the input value gets closer to […]. It achieves […] maximum conversion error with […] approximation regions. Fig. 1(a) shows the proposed logarithmic converter.

2) Antilogarithmic Converter

For the logarithmic number $X$, its FLP number can be represented by $2^{X} = 2^{e+f} = 2^{e} \cdot 2^{f}$. The integer part $e$ directly becomes the exponent of the FLP number, and the nonlinear term $2^{f}$ is approximated by piecewise linear expressions like $2^{f} \approx a \cdot f + b$, where $a$ and $b$ are the approximation coefficients defined for each approximation region. The antilogarithmic converter evenly divides $f$ into the approximation regions, since the antilogarithmic conversion error is spread evenly over the entire input region. It achieves […] maximum conversion error with […] approximation regions. Fig. 1(b) shows the antilogarithmic converter.

B. Organization

The arithmetic unit is organized with […] channels and […] pipeline stages as shown in Fig. 2. The […] input […]-bit FLP operands are converted into logarithmic numbers with […]-bit integer, […]-bit fraction, […]-bit sign, and […]-bit zero through the logarithmic converters (LOGCs) in E[…]. E[…] includes the programmable multiplier (PMUL) shown in Fig. 3, which can be used as the […] Booth multiplier for ELM, […] LOGCs for VEC, or […] antilogarithmic converters (ALOGCs) for MAT by just adding […]-entry […] B LOG and […]-entry […] B ALOG lookup tables to the Booth multiplier and sharing the CSA tree and a CPA. In this way, the number of LOGCs in E[…] is reduced to […] from the […] in […]. In E[…], […] adders in the logarithmic domain are provided for the VECs, and the resulting values are converted into FLP numbers through the […] ALOGCs. The programmable adder (PADD) in E[…] can be programmed into a […]-input FLP adder tree or […]-way […]-input SIMD FLP adders for target operations as proposed in […]. E[…] provides a SIMD FLP accumulator for the final accumulation required for the MAT. It is also used as rounding logic for other operations.

Fig. 2. Proposed arithmetic unit.

1) Matrix-Vector Multiplication

The geometry transformation in 3D graphics can be computed by the multiplication of a […] matrix with a […]-element vector, which requires […] multiplications and […] additions. This can be converted into the HNS as in Fig. 4, requiring […] LOGCs, […] adders, […] ALOGCs, and […] FLP adders. Since the coefficients of a geometry transformation matrix are fixed during the processing of a 3D object, they can be pre-converted into the logarithmic domain and used as constants during the processing. Thus, the MAT only requires […] LOGCs for vector element conversion, […] adders in the logarithmic domain, […] ALOGCs, and […] FLP adders. This can be implemented in 2 phases on this […]-way arithmetic unit as illustrated in Fig. 4. In this scheme, […] adders in the logarithmic domain and […] ALOGCs are required per phase, and the […] ALOGCs in the first phase are obtained from […] ALOGCs in E[…] by programming the PMUL into […] ALOGCs, together with the […] ALOGCs in E[…]. The CPAs in E[…] and E[…] are used for the adders in the logarithmic domain. The multiplication results from the ALOGCs in E[…] and those from E[…] are added in E[…] by programming the PADD into a […]-way SIMD FLP adder to get the first-phase result. With the same process repeated, the accumulation with the first-phase result in E[…] completes the MAT. Thus, the MAT is implemented with […]-cycle throughput on this […]-way arithmetic unit, whereas it was implemented with […]-cycle throughput in the conventional way.

Fig. 3. Programmable multiplier (PMUL).

Fig. 4. Two-phase implementation of MAT.

2) Vector and Elementary Functions

The VECs and ELMs including the MUL, DIV, DSQ, MAD, DOT, POW, LOG, and TRGs are implemented based on the scheme in […]. Since the vector operations require […] LOGCs for […] operands per channel, the PMUL is programmed into […] LOGCs to make the […] LOGCs for […] channels together with the […] LOGCs in E[…]. Since the power is converted into a multiplication in the logarithmic domain, it requires a […]b×[…]b multiplier, which is implemented by programming the PMUL into a single […]b×[…]b BMUL. The TRGs are unified with the other operations using the Taylor series as proposed in […]. This power series requires a […]-way […]b×[…]b multiplier in the logarithmic domain and a final summation of these terms. This can be implemented by programming the PMUL into a […]-way […]b×[…]b BMUL and the PADD into a […]-input summation tree.

III. PROCESSOR ARCHITECTURE

A. Instruction Set Architecture

The processor has two types of instruction format, with […] bits and […] bits. The […]-bit instructions are used for control and integer (INT) operations, while the […]-bit instructions are used for the FLP operations. This separation of formats, rather than a single […]-bit format, results in reduced instruction memory (IMEM) size and power dissipation. In a loop execution, the register indices for FLP operands may be calculated dynamically as a function of the loop counter i as follows:

OP RDst i[…]kd, SrcA i[…]ka, SrcB i[…]kb, SrcC i[…]kc

Thus, the INT instructions calculating the indices can be embedded in the FLP instruction as the FLP operand fields for efficient calculation of the indices. In this case, the INT computation is followed by the FLP computation within the single […]-bit instruction shown in Fig. 5.

Fig. 5. […]-bit FLP instruction format.

B. Microarchitecture

Fig. 6 shows the microarchitecture of the proposed processor. It has […]-way […]-bit FLP vector register files including a […]-byte […]-entry vertex input register (VIR), a general purpose register file (GPR), a […] KB […]-entry constant memory (CMEM), and a […]-byte […]-entry vertex output register (VOR). The FLP operands are fetched from the GPR, VIR, or CMEM, and the result is written back to the GPR or VOR, indexing the target entry in the destination register file. The FLP operands can be swizzled, negated, and converted into the absolute value by the source modifiers. The arithmetic unit proposed in section II is used as the FLP multifunction unit in this processor. The vertex cache is composed of […] VORs.

1) Cascaded INT-FLP Datapaths

This processor has a cascaded architecture of […]-way […]-bit INT and […]-way […]-bit FLP datapaths to implement the embedded index calculation of the FLP operands without using additional cycles. For flexible index calculation, this processor includes a […]-way […]-bit SIMD integer ALU and a […]-byte […]-entry integer register file (IGPR). The […] […]-bit results from this unit index the […] source operands and the destination of an FLP operation. When a multiplication is required for the index calculation, the PMUL in the FLP arithmetic unit is programmed into a […]b×[…]b integer multiply-add (IMAD) unit, since conversion error is not allowed in the index calculation.

2) Datapath Reconfiguration

The FLP unit in section II has several MUXes to unify various operations and reveals all the control points to the programmer so that they can make arbitrary operations on it by programming the […]-bit control signals. For example, the various TRGs supported in this processor can be programmed by a single configuration instruction (CFG) with some configuration data, rather than including all of the TRGs in the limited instruction space. The programmed configuration data are stored in the […]-byte […]-entry configuration register file (CFR) and accessed by the CFG. Thus, the configuration can be changed in every cycle.

Fig. 6. Microarchitecture of proposed processor.

3) Logarithmic-Domain Forwarding

Operand forwarding improves the throughput of the processor pipeline. In this processor, operand forwarding is also supported in the logarithmic domain. As shown in Fig. 7, in the case of consecutive FLP operations that do not require the final FLP adders, the antilogarithmic and logarithmic conversions cancel each other, and the intermediate logarithmic value of the previous FLP operation is forwarded directly into the logarithmic domain of the next FLP operation, bypassing the antilogarithmic and logarithmic converters of the two operations. This reduces both the pipeline latency and the computation error, since the repeated antilogarithmic and logarithmic conversions are the sources of errors.

Fig. 7. Logarithmic-domain forwarding, compared with conventional forwarding.

4) Vertex Cache

A transformation and lighting (TnL) vertex cache is provided to reuse the previously processed vertices without executing the TnL routine. The […] KB SRAM can contain […] result vertices with […] hit rate. This leads to a single-cycle TnL for the vertices in the vertex cache and a peak geometry transformation (TFM) of […] Mvertices/s together with the […]-cycle MAT.

IV. IMPLEMENTATION RESULTS

A. Chip Implementation

The proposed processor is integrated into a 3D graphics chip as a vertex shader and fabricated in […] µm […]-metal CMOS technology. The chip micrograph and a summary of chip characteristics are shown in Fig. 8. The core measures […] mm² and operates at […] MHz, consuming […] mW at […] V. Fig. 9 shows the shmoo. The clock to the pipeline registers is fully gated to disable unnecessary switching under the control of each instruction. All the FLP instructions are processed with single-cycle throughput, except for the MAT with […]-cycle throughput. Table I shows the latencies and throughputs for FLP instructions.

Fig. 8. Chip micrograph.

Fig. 9. Shmoo (axis: operating frequency in MHz).

TABLE I. THE LATENCY / THROUGHPUT OF SELECTED INSTRUCTIONS

              ADD   DIV   DSQ   MAD   DOT   MAT   POW   SIN
Throughput    […]   […]   […]   […]   […]   […]   […]   […]
Latency       […]   […]   […]   […]   […]   […]   […]   […]

B. Comparison

The performance is compared for the full OpenGL TnL routine with model-view, normal, and perspective transformations, normalizations of the light, view, normal, and Blinn half vectors, and intensity calculations of diffuse and specular lighting for a single light source. The routine includes […] MATs, […] DSQs, […] DIVs, […] DOTs, […] MADs, […] POW, and […] ADDs. After rescheduling the code to avoid dependencies, the routine requires […] execution cycles on the proposed processor. Table II shows the comparison results; the peak TFM performance is also compared. Our work shows […] and […] performance improvements for TFM and TnL, respectively, while reducing power and area overhead by […] and […], respectively, from the latest work.

TABLE II. THE COMPARISON RESULTS
Columns: Ref., Performance in Mvertices/s (TFM, TnL), Power (mW), Area (mm²), Freq. (MHz), Process (µm). The final row is "This work"; three entries in the earlier rows are N/A.

V. CONCLUSION

A high-performance, power- and area-efficient vector processor is proposed for 3D graphics shaders. It adopts a […]-way […]-bit FLP multifunction unit. Using logarithmic arithmetic, the unit unifies the vector, matrix, and elementary functions into a single arithmetic unit and achieves single-cycle throughput for all the operations, except for the MAT with […]-cycle throughput. With the help of the multifunction unit together with the cascaded integer-float datapath, datapath reconfiguration, logarithmic-domain forwarding, and a vertex cache with […] hit rate, the processor achieves […] Mvertices/s for TFM and […] Mvertices/s for TnL at […] MHz. Compared with previous work, it shows […] and […] performance improvements for TFM and TnL, respectively, while reducing power and area overhead by […] and […], respectively.

REFERENCES

[1] Khronos Group, OpenGL ES […], http://www.khronos.org
[2] C. H. Yu, et al., "A […] Mvertices/s Multi-threaded VLIW Vertex Processor for Mobile Multimedia Applications," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[3] B. G. Nam, et al., "A Low-Power Unified Arithmetic Unit for Programmable Handheld 3D Graphics Systems," in Proc. IEEE CICC, Sept. […].
[4] F. Lai, et al., "A Hybrid Number System Processor with Geometric and Complex Arithmetic Capabilities," IEEE Trans. on Computer, Vol. […], No. […], August […].
[5] J. H. Sohn, et al., "A Fixed-point Multimedia Co-processor with […] Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications," in Proc. ESSCIRC, Sept. […].
[6] B. G. Nam, et al., "A […] mW 3D Graphics Processor with […] Mvertices/s Vertex Shader and […] Power Domains of Dynamic Voltage and Frequency Scaling," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[7] OpenGL ARB, http://www.opengl.org
[8] F. Arakawa, et al., "An Embedded Processor Core for Consumer Appliances with […] GFLOPS and […] Mpolygons/s," in IEEE ISSCC Dig. Tech. Papers, Feb. […].
[9] D. Kim, et al., "An SoC with […] Gtexels/s 3D Graphics Full Pipeline for Consumer Applications," IEEE JSSC, vol. […], no. […], pp. […], Jan. […].
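
As an illustration of the hybrid number system described in Section II, the following Python sketch models the logarithmic and antilogarithmic converters with piecewise linear chords and uses them for multiplication, division, and a matrix-vector product with pre-converted matrix coefficients. It is a behavioural sketch only, not the hardware design: the region count `REGIONS = 16`, the uniform subdivision in both converters, the use of double-precision floats instead of fixed-point fields, the sign value 0 standing in for the zero bit, and all function names are assumptions made for this example.

```python
import math

REGIONS = 16  # assumed number of approximation regions (not taken from the paper)


def build_chords(fn, n):
    """Return n (a, b) pairs with fn(t) ~= a*t + b on each 1/n-wide slice of [0, 1)."""
    table = []
    for i in range(n):
        t0, t1 = i / n, (i + 1) / n
        a = (fn(t1) - fn(t0)) * n          # slope of the chord over the slice
        b = fn(t0) - a * t0                # intercept so the chord meets fn at t0
        table.append((a, b))
    return table


LOG_LUT = build_chords(lambda m: math.log2(1.0 + m), REGIONS)   # log2(1+m) ~= a*m + b
ALOG_LUT = build_chords(lambda f: 2.0 ** f, REGIONS)            # 2**f     ~= a*f + b


def lookup(table, t):
    """Evaluate the piecewise linear approximation at t in [0, 1)."""
    a, b = table[min(int(t * len(table)), len(table) - 1)]
    return a * t + b


def to_log(x):
    """Logarithmic converter: x -> (sign, approx log2|x|); sign 0 flags zero."""
    if x == 0.0:
        return 0, 0.0
    mant, exp = math.frexp(abs(x))         # abs(x) == mant * 2**exp, 0.5 <= mant < 1
    m, e = 2.0 * mant - 1.0, exp - 1       # abs(x) == 2**e * (1 + m), 0 <= m < 1
    return (1 if x > 0 else -1), e + lookup(LOG_LUT, m)


def from_log(sign, X):
    """Antilogarithmic converter: (sign, X) -> approx sign * 2**X."""
    if sign == 0:
        return 0.0
    e = math.floor(X)                      # integer part becomes the exponent
    f = X - e                              # fractional part, 0 <= f < 1
    return sign * math.ldexp(lookup(ALOG_LUT, f), e)


def hns_mul(x, y):
    """Multiply by adding in the logarithmic domain."""
    (sx, lx), (sy, ly) = to_log(x), to_log(y)
    return from_log(sx * sy, lx + ly)


def hns_div(x, y):
    """Divide by subtracting in the logarithmic domain."""
    (sx, lx), (sy, ly) = to_log(x), to_log(y)
    if sy == 0:
        raise ZeroDivisionError("division by zero")
    return from_log(sx * sy, lx - ly)


def preconvert(matrix):
    """Convert fixed matrix coefficients to the logarithmic domain once."""
    return [[to_log(c) for c in row] for row in matrix]


def hns_matvec(log_matrix, vec):
    """Products via log-domain adds; the final accumulation stays in floating point."""
    log_vec = [to_log(v) for v in vec]
    return [
        sum(from_log(sc * sv, lc + lv) for (sc, lc), (sv, lv) in zip(row, log_vec))
        for row in log_matrix
    ]
```

For example, `hns_mul(3.0, 5.0)` returns a value close to 15; the small residual comes from the two piecewise linear approximations. Calling `preconvert` once per transformation matrix mirrors the observation in Section II that matrix coefficients stay fixed while an object is processed, so only the vector elements need converting per vertex.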
