APPLIED AND ENVIRONMENTAL MICROBIOLOGY, May. Copyright, American Society for Microbiology. All Rights Reserved.

Comparison of Logistic Regression and Linear Regression in Modeling Percentage Data

LIHUI ZHAO, YUHUAN CHEN,† AND DONALD W. SCHAFFNER*

Department of Food Science, Cook College, the New Jersey Agricultural Experiment Station, Rutgers, The State University of New Jersey, New Brunswick, New Jersey

Received November; accepted February

* Corresponding author. Mailing address: Department of Food Science, Cook College, the New Jersey Agricultural Experiment Station, Rutgers, The State University of New Jersey, Dudley Rd., New Brunswick, NJ. E-mail: schaffner@aesop.rutgers.edu.
† Present address: National Food Processors Association, Washington, DC.

Percentage is widely used to describe different results in food microbiology, e.g., probability of microbial growth, percent inactivated, and percent of positive samples. Four sets of percentage data, percent growth positive, germination extent, probability for one cell to grow, and maximum fraction of positive tubes, were obtained from our own experiments and the literature. These data were modeled using linear and logistic regression. Five methods were used to compare the goodness of fit of the two models: percentage of predictions closer to observations, range of the differences (predicted value minus observed value), deviation of the model, linear regression between the observed and predicted values, and bias and accuracy factors. Logistic regression was a better predictor of at least […] of the observations in all four data sets. In all cases, the deviation of logistic models was much smaller. The linear correlation between observations and logistic predictions was always stronger. Validation (accomplished using part of one data set) also demonstrated that the logistic model was more accurate in predicting new data points. Bias and accuracy factors were found to be less informative when evaluating models developed for percentage data, since neither of these indices can compare predictions at zero. Model simplification for the logistic model was demonstrated with one data set. The simplified model was as powerful in making predictions as the full linear model, and it also gave clearer insight in determining the key experimental factors.

Microbial data expressed as percentages have been modeled for many years. Percentage data may have very different biological meanings and expressions. In […], Genigeorgis et al. initiated the concept of probability for one cell to grow and produce toxin, presented as the ratio of R_G over R_I, where R_G is the number of cells initiating growth and R_I is the number of cells in the inoculum. In a time-to-turbidity model, Whiting and Oriente described the maximum probability of growth with the parameter Pmax, this value being obtained from fitting the growth curve with the logistic equation. Chea et al. modeled the extent of spore germination using the plateau value of the germination curve. The percent growth positive parameter describes the maximum proportion of wells that exhibited growth under various environmental conditions in a study using microplates inoculated with Clostridium botulinum spores.

A conventional approach applied to modeling percentage data is to use linear regression with polynomial terms. This method usually results in moderate (R² […]) to poor (R² […]) goodness of fit. Generally, the accuracy of linear models for modeling bounded variables (e.g., percentage data) is not as good as for other unbounded variables obtained in the same experiment, and the resulting linear model also predicts poorly at values close to […] and […]. An insurmountable limitation of the linear approach is that the model can predict percentages outside the probability range, i.e., values of <[…] or >[…]. Generally, all predicted negative values are forced to 0, and those >[…] are forced to […]. Even without this modification, it is not meaningful to compare these conditions. For example, […] cannot be interpreted as a higher percent germination than […].

Logistic regression has been widely used in medical research. In the field of predictive food microbiology, logistic models have been developed to describe the bacterial growth/no-growth interface. In these models, the data were presented in the 0/1 format, as in a typical binomial data set. Genigeorgis et al. first presented the concept of the probability that one cell could grow in a specific environment. Later, this probability was modeled in various systems using logistic regression combined with a linear regression of the lag period. Roberts et al. used a similar concept and the regression approach to model toxin production by C. botulinum in pasteurized pork slurry. Cole et al. modeled the probability of growth of spoilage yeast in a model fruit drink by directly relating the logit of probability with the environmental factors. In these studies, probability (a continuous number between 0 and 1) instead of a dichotomous variable (i.e., 0 or 1) was modeled. As pointed out by Ratkowsky and Ross, the response modeled by logistic regression at a given combination of limiting factors can either have a value of 0 or 1 or be a probability. Probability, generally expressed by dividing the number of successes by the total number of trials, is simply a summarization of binomial data and thus can be approximated by a logistic generalized linear model.

In this study, we compared the goodness of fit of linear regression to logistic regression for modeling percentages. We modeled data from our own research and from the literature (including publications from our group) and developed models using both the logistic and linear approaches in exactly the same manner. Five different approaches were used to compare the goodness of fit of the two models. In almost all cases, the logistic models displayed greater accuracy and resulted in less biased predictions.

MATERIALS AND METHODS

Data collection. Four different sets of percentage data were collected from previous experiments. Each set had its own unique biological meaning and was collected with a different method. Weight is the degree of emphasis a model puts on an observation. The weight for a percentage datum point is the total number of observations associated with this percentage. For example, when […] of […] tubes turn turbid, the percentage is […] and the weight for this percentage is […]. The assignment of weights was determined differently for each data set, as described below.

Data set I: data for percent growth positive were collected by Zhao et al. This data set contained the exact numbers of wells that showed growth and no growth. The total number of wells in each condition is the same, so the weight assigned for each condition is the same. Environmental factors studied were pH, sodium chloride concentration, and inoculum size in a complete […] by […] factorial design with a total of […] different conditions.

Data set II: extent of germination data were collected by Chea et al. The total number of spores studied for each condition was between […] and […]. The small difference in the total number in each condition is negligible, and equal weight for all the data points was assumed in logistic regression. Environmental factors studied were pH, sodium chloride concentration, and temperature in a complete […] by […] factorial design with a total of […] different conditions.

Data set III: Razavilar and Genigeorgis studied the probability of one cell of Listeria monocytogenes to grow, as affected by sodium chloride concentration, time, and temperature. Weights were not obtainable, so this parameter was assumed to be the same in each case.

Data set IV: Pmax was the parameter used to indicate the maximum fraction of positive tubes inoculated with C. botulinum. It was obtained by fitting the experimental data with a logistic equation. The total number of tubes varied by condition and was used as the weight in logistic regression. Four environmental factors, pH, sodium chloride concentration, temperature, and inoculum size, were studied in a total of […] different conditions. A subset, containing […] data points at […]°C, was not used to develop models; instead, these data points were used later to validate the models developed from the remaining […] points.

Modeling with linear and logistic regression. Both linear and logistic models were developed in S-Plus (MathSoft, Inc., Seattle, Wash.) for an objective comparison. The generalized linear modeling ("glm") function was used for both methods. The family specified for logistic regression is "binomial" (logit link) and for linear regression is "gaussian" (identity link). The full models generated by each approach, with the same number of terms in the same format, were used to ensure the validity of the comparison. The linear model with three predictor variables has the following general format:

$$\mathrm{Percentage} = b + C_1 X + C_2 Y + C_3 Z + C_4 X^2 + C_5 Y^2 + C_6 Z^2 + C_7 XY + C_8 XZ + C_9 YZ \quad (1)$$

where Percentage is the observed percentage, b is the intercept, X, Y, and Z are the predictor variables, and the C_i are the coefficients. The logistic model with three predictor variables has the following general format:

$$\mathrm{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = b + C_1 X + C_2 Y + C_3 Z + C_4 X^2 + C_5 Y^2 + C_6 Z^2 + C_7 XY + C_8 XZ + C_9 YZ \quad (2)$$

where P is the probability that the event would occur according to the model and the remaining symbols have the same meaning as in equation 1.

Model comparison. (i) Adjustment with predictions from linear models. Predictions from linear models can be greater than […] or less than 0. In practice, these predictions are generally forced to be […] and 0, respectively. To make the comparison of the two models fairer, predictions from linear models were forced into the range of 0 to […] in this manner. For all comparisons, the modified predictions from linear models were used, except as noted below.

(ii) Methods to compare model predictions. Out-of-range predictions from linear models were counted in Method 1. The number of predictions from logistic regression that were closer to the observed values was also calculated. For this calculation, the absolute value of the difference (predicted minus observed) was used. We excluded some observations whose linear regression predictions were out of range in the calculation of the percentage of closer predictions. This is required because logistic regression predicts strictly between 0 and […]. By forcing out-of-range linear predictions to be 0 or […], we may inappropriately make some linear predictions seem better. For example, if the observation is […], the logistic prediction is […], and the linear prediction is […], then forcing the linear prediction to be […] will cause it to be falsely judged better.

In Method 2, we compared the ranges of the differences between the predicted and the observed values. Point summaries of the differences (predicted minus observed), i.e., minimum, first quartile, median, mean, third quartile, and maximum, were obtained, and the range and interquartile range (IQR) were calculated.

$$\mathrm{Range} = \mathrm{maximum} - \mathrm{minimum} \quad (3)$$

$$\mathrm{IQR} = \text{third quartile} - \text{first quartile} \quad (4)$$

The smaller the values of the range and IQR, the closer the predictions are to the observations. The range is sensitive to outlying points whose predicted and observed values are very different, while the IQR is not affected as much.

For Method 3, the deviation of the model from observations was calculated as follows:

$$\mathrm{Deviation} = \sum (\mathrm{predicted} - \mathrm{observed})^2 \quad (5)$$

The smaller the deviation, the closer the model predictions were to the observations. Method 1 cannot detect predictions that are far from the observations. Method 2 allows for detection of these wide deviations by measuring the range of the differences between the predicted and observed values, but it is unable to indicate which model results in a greater number of predictions closer to the observed values. Method 3 takes both considerations into account.

Method 4 used graphs of the observed values (x axis) versus predicted values (y axis) from both models. A simple linear regression was fitted to the points, and the intercept, the slope, and R² were obtained. If the predictions are in perfect agreement with the observed values, the intercept should be 0, the slope should be 1, and R² should be 1. The closer the intercept is to 0, the slope is to 1, and R² is to 1, the better the general predictive power of the model. A slope of less than 1 indicates that the model underpredicts the observation.

Method 5 used bias and accuracy values as a quantitative way to measure the goodness of fit of the models. The bias factor indicates by how much, on average, a model overpredicts (bias factor > 1) or underpredicts (bias factor < 1) the observed data.

$$\text{Bias factor} = 10^{\frac{1}{n} \sum \log\left(\frac{\text{predicted value}}{\text{observed value}}\right)} \quad (6)$$

The accuracy factor indicates by how much the predictions differ from the observed data.

$$\text{Accuracy factor} = 10^{\frac{1}{n} \sum \left|\log\left(\frac{\text{predicted value}}{\text{observed value}}\right)\right|} \quad (7)$$

In both equations, n is the number of observations used in the calculation. In a perfect model, both the bias and accuracy factors are equal to 1.

Simplification of the logistic model. Data from Chea et al. were used to demonstrate the procedure for reducing the number of parameters in the logistic model and to show how better physiological insight into the experiment might be derived from the reduced model.

RESULTS

Data set I: percent growth positive. Thirty percent of the predictions from linear regression are out of the 0 to […] range (Table). Fifteen predictions from the logistic model are closer to the observed values. Seven linear predictions were inaccurately made better by forcing predictions over […] to […], and one condition was made falsely better by forcing the prediction lower than 0 to 0. The percentage better predicted by logistic regression is calculated by excluding these data points:

Percentage better by logistic = […]

The range of the differences (predicted minus observed) from logistic regression is more than […] times smaller than that from linear regression. The IQR from logistic regression is about one-third of that from linear regression. The deviation value of the logistic model is more than […] times smaller. Predictions from logistic regression are much better than those from linear regression over the entire range and especially at points closer to 0 and […] (Fig.). Three predictions by the linear model, each with an observation of […], are […], […], and […], while the logistic predictions are much better: […], […], and […]. The two […] observations were predicted by the linear model to be […] and […], while the logistic predictions were […] and […]. Another observation at the lower range, […], was predicted to be […] and […] by linear regression and logistic regression, respectively.

The fitted line for the predicted values from logistic regression was very close to a perfect fit (Table). The fitted line for the linear model predictions versus the observations was considerably worse, with a slope of about […], suggesting systematic underprediction. The bias and accuracy factors for logistic regression are slightly closer to 1 than those for linear regression (Table).

TABLE. Comparison of results for linear and logistic regressions with five different methods in four different data sets
Columns: Method no.; Parameter; Data set I (n = […]); Data set II (n = […]); Data set III (n = […]); Data set IV model (n = […]); Data set IV validation (n = […]). Each data set has a Linear and a Logistic column.
Rows: No. of predictions >[…]; No. of predictions <[…]; No. of predictions of logistic model closer to observed (NA(a) in the Linear columns); % of predictions of logistic model closer to observed (NA in the Linear columns); Minimum(b), 1st quartile, Median, Mean, 3rd quartile, Maximum, Range, IQR; Deviation(c); Intercept(d), Slope, R²; Bias; Accuracy; No. of […]s in observation.
(a) NA, not applicable.
(b) Differences of values (predicted − observed) were summarized. Range = maximum − minimum; IQR = third quartile − first quartile.
(c) Deviation of the model from observation was calculated as Σ(predicted − observed)².
(d) Linear regression was done between predicted and observed values. A perfect model would have an intercept value of 0, a slope of 1, and an R² value of 1.

FIG. Goodness of fit of linear regression and logistic regression for C. botulinum percent growth positive (Data set I) from Zhao et al.

Data set II: germination extent. Approximately […] of the linear predictions are out of range (Table). Approximately […] of the predictions from logistic regression are closer to the observed values. The range of the differences (predicted minus observed) from the logistic model is less than one-third that from the linear model, and the IQR is almost one-sixth that from the linear model. The deviation value of the logistic model is […] times smaller. The line fitted to the predicted values from the logistic model compared to observed values is very close to the perfect fitting line (Fig., Table). The fitted line for predictions from the linear model had a slope of only […], suggesting underprediction (Table). Three of seven observed values of […] had higher linear predictions, at […], […], and […], while the remaining four predictions were negative. All seven logistic predictions are very close to […], with the largest being […]. Three higher observations, […], […], and […], were predicted to be […], […], and […] by the linear model, while logistic regression produced much more accurate predictions at […], […], and […], respectively. The bias factors for the two models are almost the same, and the logistic model is slightly more accurate as judged by the accuracy factor (Table).

Data set III: probability of one cell of Listeria monocytogenes to grow. This is a very special data set since […] of […] data points are either 0 or […]. Results demonstrated that logistic regression is a much more powerful tool when modeling this type of data set. Approximately […] of the linear predicted values are out of range. All observations are predicted better by logistic regression. The range of the differences (predicted minus observed) from the logistic model is more than […]-fold smaller. The IQR and deviation value from the logist
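
The two model formats described under Materials and Methods share one second-order polynomial in three predictors; they differ only in whether that polynomial is read directly as the response or as its logit. The following is a minimal Python sketch of that difference, not the authors' S-Plus code. The function names and the coefficient values in `HYPOTHETICAL_COEFS` are illustrative assumptions and do not come from the paper, and the response is expressed as a fraction rather than a percentage.

```python
import math

def design_terms(x, y, z):
    """Intercept, linear, squared, and cross-product terms of the full model."""
    return [1.0, x, y, z, x * x, y * y, z * z, x * y, x * z, y * z]

def linear_predictor(coefs, x, y, z):
    """Sum of coefficient * term; coefs[0] plays the role of the intercept b."""
    return sum(c * t for c, t in zip(coefs, design_terms(x, y, z)))

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map a linear predictor back to a probability (numerically stable)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def predict_logistic(coefs, x, y, z):
    """Logistic-model prediction: inverse logit of the polynomial."""
    return inverse_logit(linear_predictor(coefs, x, y, z))

# Hypothetical coefficients, chosen only to illustrate the behavior.
HYPOTHETICAL_COEFS = [-2.0, 1.5, 0.5, -0.3, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0]

eta = linear_predictor(HYPOTHETICAL_COEFS, 4.0, 1.0, 1.0)
p = predict_logistic(HYPOTHETICAL_COEFS, 4.0, 1.0, 1.0)
```

Read as a linear-model response on the fraction scale, `eta` here lies well above 1 and would have to be forced back into range, whereas `p` stays inside the unit interval without any adjustment.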

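
Several of the comparison measures described in Materials and Methods (forcing linear predictions into range, the range and IQR of the differences, the deviation, and the bias and accuracy factors) are simple enough to sketch in a few lines. The Python below is an illustrative sketch under stated assumptions, not the authors' code: the deviation is read as a sum of squared differences, the upper bound of 100 in `clamp` assumes a percent scale, the quartiles use the standard library's "inclusive" interpolation (S-Plus may use a different rule), and pairs containing a zero are skipped in the bias and accuracy factors because the log ratio is undefined there.

```python
import math
import statistics

def clamp(pred, lo=0.0, hi=100.0):
    """Force an out-of-range prediction into [lo, hi] (hi=100 is an assumption)."""
    return max(lo, min(hi, pred))

def difference_summary(predicted, observed):
    """Range and IQR of the differences (predicted minus observed)."""
    diffs = [p - o for p, o in zip(predicted, observed)]
    q1, _median, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return {"range": max(diffs) - min(diffs), "iqr": q3 - q1}

def deviation(predicted, observed):
    """Sum of squared differences between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def bias_accuracy(predicted, observed):
    """Bias and accuracy factors; pairs with a non-positive value are skipped."""
    logs = [math.log10(p / o)
            for p, o in zip(predicted, observed) if p > 0 and o > 0]
    if not logs:
        raise ValueError("no pairs with positive predicted and observed values")
    n = len(logs)
    bias = 10 ** (sum(logs) / n)
    accuracy = 10 ** (sum(abs(v) for v in logs) / n)
    return bias, accuracy

# Toy values for illustration only; they are not data from the paper.
observed = [10.0, 20.0, 40.0, 80.0]
predicted = [20.0, 10.0, 40.0, 80.0]
summary = difference_summary(predicted, observed)
dev = deviation(predicted, observed)
bias, accuracy = bias_accuracy(predicted, observed)
```

In the toy data one overprediction and one underprediction cancel in the bias factor but not in the accuracy factor, which is why the two indices are reported together. The skipped zero pairs also show why these two factors say nothing about predictions at zero.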