The evolution of partial least squares models

Review

Received: 8 March 2010, Revised: 1 July 2010, Accepted: 20 September 2010, Published online in Wiley Online Library: 2010

(wileyonlinelibrary.com) DOI: 10.1002/cem.1359

The evolution of partial least squares models and related chemometric approaches in metabonomics and metabolic phenotyping

Judith M. Fonville(a), Selena E. Richards(a), Richard H. Barton(a), Claire L. Boulange(a), Timothy M. D. Ebbels(a), Jeremy K. Nicholson(a), Elaine Holmes(a) and Marc-Emmanuel Dumas(a)*

Metabonomics is a key element in systems biology, and with current analytical methods, generates vast amounts of quantitative or qualitative metabolic data. Understanding of the global function of the living organism can be achieved by integration of ‘omics’ approaches including metabonomics, genomics, transcriptomics and proteomics, increasing the complexity of the full data sets. Multivariate statistical approaches are well suited to extract the characterizing metabolic information associated with each level of dynamic process. In this review, we discuss techniques that have evolved from principal component analysis and partial least squares (PLS) methods with a focus on improved interpretation and modeling with respect to biomarker recovery and data visualization in the context of metabonomic applications. Visualization is of paramount importance to investigate complex metabolic signatures, the power and potential of which is illustrated with key papers. Recent improvements based on the removal of orthogonal variation are discussed in terms of interpretation enhancement, and are supported by relevant applications. Flexibility of PLS methods in general and of O-PLS in particular allows implementation of derivative methods such as O2-PLS, O-PLS-variance components, nonlinear methods, and batch modeling to improve analysis of complex data sets, which facilitates extraction of information related to subtle biological processes. These approaches can be used to address issues present in complex multi-factorial data sets. Thus, we highlight the key advantages and limitations of the different latent variable applications for top-down systems biology and assess the differences between the methods available. Copyright © 2010 John Wiley & Sons, Ltd.

Keywords: metabonomics; multivariate; partial least squares; O-PLS; systems biology

* Correspondence to: M.-E. Dumas, Biomolecular Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, Exhibition Road, South Kensington, London SW7 2AZ, UK. E-mail: m.dumas@imperial.ac.uk

(a) J. M. Fonville, S. E. Richards, R. H. Barton, C. L. Boulange, T. M. D. Ebbels, J. K. Nicholson, E. Holmes, M.-E. Dumas: Biomolecular Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, Exhibition Road, South Kensington, London SW7 2AZ, UK

J. Chemometrics 2010; 24: 636–649

1. INTRODUCTION

Metabonomics [1–6] was originally defined as ‘the quantitative measurement of the dynamic multi-parametric metabolic response of living systems to pathophysiological stimuli or genetic modification’ [1]. In the same era, metabolomics was defined in plant biology and microbiology as ‘the systematic study of the metabolic complement of the cell’ [2], but has since broadened to encompass a wider characterization of metabolic responses, and often the two terms are used interchangeably. The concept of metabonomics arose from pioneering work [3] on the application of nuclear magnetic resonance (NMR) spectroscopy to study the metabolic composition of biofluids, cells and tissues and was aimed at the augmentation and complementation of the information provided by measuring the genetic, and later proteomic, responses to xenobiotic exposure. Genomics, proteomics, and metabonomics [4]/metabolomics [5] now form a core part of the fundamental systems biology framework [7], and are required for an understanding of the integrated function of the living organism [8,9], with each ‘-omic’ platform generating large amounts of data.

To investigate the complex metabolic consequences of, for example, disease processes, toxic reactions or genetic manipulations, information-rich analytical approaches are required. Several methods in addition to NMR can produce metabolic signatures of biomaterials, including mass spectrometry (MS) [10], and optical spectroscopic techniques such as Raman, Fourier transform infrared (FTIR), near infrared (NIR) and ultraviolet–visible (UV–Vis) spectroscopies. Simplification of the mass spectra can be obtained by coupling MS to a separation technology such as gas chromatography (GC/MS) or high-performance liquid chromatography (HPLC/MS) [11]. NMR and MS are currently the most frequently used techniques for metabolic profiling [12], and the characteristic differences in their nature mean that they often provide complementary information for the analytical scientist. The obtained spectra of biofluids such as plasma and urine, or of tissues and tissue extracts, provide metabolic patterns corresponding to the metabolic status of the organism as a function of genetic, environmental or toxicological influence.

Ongoing analytical developments include development of cryogenic probes and increases in magnetic field strength for NMR as well as mass resolution for MS. The enhanced sensitivity and resolution gained by these developments increases the capability for biomarker detection, at the expense of increased spectral complexity. Spectral complexity was traditionally addressed by peak picking or by binning signals across the spectra in order to generate a smaller or more manageable data set.
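
The binning (bucketing) step described above can be illustrated with a minimal sketch. It is not taken from the review: the function name `bin_spectrum`, the synthetic spectrum, and the choice to sum (rather than integrate or average) intensities within each bucket are assumptions made for illustration. The 0.04 ppm bucket width follows the value quoted in the text.

```python
import numpy as np

def bin_spectrum(ppm, intensity, width=0.04):
    """Sum intensities into equal-width chemical-shift buckets.

    Hypothetical helper for illustration; returns (bucket_centers, bucket_sums).
    """
    ppm = np.asarray(ppm, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    lo = ppm.min()
    n_bins = max(int(np.ceil((ppm.max() - lo) / width)), 1)
    # Bucket index of every data point; the topmost point is clipped into the last bucket
    idx = np.minimum(((ppm - lo) / width).astype(int), n_bins - 1)
    sums = np.bincount(idx, weights=intensity, minlength=n_bins)
    centers = lo + width * (np.arange(n_bins) + 0.5)
    return centers, sums

# Synthetic example (assumed data): 32k points over 0-10 ppm, one Gaussian peak at 3.0 ppm
ppm = np.linspace(0.0, 10.0, 32768)
intensity = np.exp(-0.5 * ((ppm - 3.0) / 0.01) ** 2)
centers, sums = bin_spectrum(ppm, intensity)
print(len(ppm), "points reduced to", len(sums), "buckets")
```

Summing within buckets preserves total intensity while discarding peak shape inside each bucket, which is the trade-off between data size and resolution that the text refers to.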
For NMR spectra typically 250 buckets of 0.04 ppm resolution would be generated [13,14]. With increased computational capacity, this procedure has largely been replaced by importing full resolution spectra (e.g. 32k data points) [15]. For this reason, the analyses are performed using the full data set rather than a peak-picked selection, which increases data dimensionality. This full spectral import can increase the information recovered, for example potential biomarkers, from the statistical modeling performed on the data. Similarly, mass resolution has now increased up to 10 000 for time-of-flight analyzers and up to 100 000 for Fourier-transform MS analyzers, which ensures access to accurate mass measurements and peak lists with ~1000–5000 peaks. The analysis of these data sets is now possible with the constant improvement of computing capabilities of both software and hardware for personal workstations. Additionally, increased throughput of samples and the move toward increased cross-platform integration have further increased the complexity and size of data sets and present a considerable challenge to the analyst.

In order to accommodate the enormity of the data, chemometric methods are often employed to extract the specific metabolic information characterizing each group or dynamic process, in order to form or confirm hypotheses associated with the expression of a metabolic profile. In this review the most commonly used latent variable methods for the extraction of statistically characterized metabolic phenotypes are discussed in semi-chronological order (Figure 1 and Textbox); we especially emphasize the improvement in interpretability of spectral data made possible through the use of orthogonal partial least squares (O-PLS) [16].

Figure 1. A timeline displaying the order of development of PLS and subsequent or related approaches with exemplar applications in metabonomic analyses. The applications listed here are discussed throughout the text and relate to typical examples, not the timeline.

Textbox. Historical background of PLS methods

The origins of PCA itself lie as far back as Cauchy and his eigenanalysis (ca. 1829), but PCA was made explicit by Pearson [122], 1901, for application in biometric studies, and later used by Hotelling in the 1930s [123,124] before being employed in econometrics by Hermann Wold. PLS shares some central features of PCA in that it is a ‘variance harvesting’ approach for the analysis of large data sets. It is neither appropriate nor possible here to provide a detailed history and evolution of PCA and subsequent PLS methods, but for those wanting more historical details on PLS, the review by Geladi [125] from 1988 on the history and nature of PLS modeling is recommended, as is the textbook by Eriksson et al. [38]. In outline, the PLS method was described by Horst in 1961 [125,126], working in the area of psychometrics, although Horst also claimed that the PLS principle was already understood by Hotelling [127] as early as 1936 in his original article on canonical analysis. Subsequently, many PLS applications in the field of econometrics were pioneered by Hermann Wold [125], including the path PLS approach as well as the NIPALS algorithm, developed together with his son Svante Wold, and in fact it was the NIPALS algorithm from which the name ‘partial least squares’ was drawn [38]. Officially S. Wold places the ‘public delivery’ date of PLS as being 1977 [125] although the NIPALS procedure was published in 1969 in the context of PLS path modeling [128]. It is interesting to note that while econometrics made significant developments in various issues for large matrix analysis, the more recent concept of orthogonal variance filtration was introduced in the context of chemical analysis. Methods filtering out the variation orthogonal to the response vectors led to the development of OSC in 1998, implemented using NIPALS, and which was later integrated with the PLS algorithm, giving O-PLS in 2002. Although the principle of canonical analysis was introduced very early on by Hotelling [127], the development of alternative algorithms such as NIPALS [129], and SIMPLS [130] greatly improved computational speed, efficiency, and stability of subsequent PLS algorithms. The computational ease with which PLS can be implemented and the advent of computer numerical methods have powerfully assisted the expansion of real-world application of PLS in diverse fields that require the analysis of large, complex measurement matrices, such as biometry, process analysis, chemo-informatics, bio-informatics, psychometrics, and econometrics.

2. LATENT VARIABLE METHODS IN METABONOMICS

2.1. Univariate versus multivariate approaches

The application of the multivariate advantage to biological studies produces weighted combinations of the original variables (forming the latent variables) that delineate groupings, which are often not visible via any classical univariate analytical approach. Only multivariate methods can really take into account the relations between variables: multi-variable patterns can be significant, even if the individual variables are not. For example, classes of metabolically differentiated groups may only be evident using multivariate approaches. In an early paper on classification of brain tumors [17] it was noted that ‘...when these data are subjected to pattern recognition analyses, the possibility will be open for an objective differential diagnosis... based solely on differences in... biochemical composition’, thus emphasizing the need for unbiased approaches in biochemical studies. Although initially, spectral analysis of metabolic profiles was performed visually [18,19] it was soon recognized that pattern recognition (PR) methods had great potential for metabonomic studies [20]. The multivariate approach [21] results in an unbiased overview of the data set, as facilitated through the representation with latent variables. In such cases as NMR, latent variables may be visually presented in pseudo-spectral form as will be discussed later, a great aid to the experienced spectroscopist [15].

Two commonly used multivariate methods are principal component analysis (PCA [22], unsupervised) and partial least squares regression (PLS [23], supervised). In these models, a multivariate latent variable is constructed from the input (X matrix, e.g. spectral variables). Latent variables, as opposed to observed variables, are derived from modeling using the original variables. This latent vector maximizes total variance (in PCA), or the covariance between the X and response (Y) matrices, in PLS.

The early use of multivariate analysis in the medical precursor work [17] was followed by the application of these techniques in metabonomic research [6,13,20,24]. An exemplar employment of PR methods in a metabonomics context was the classification and interrogation of the ¹H NMR urinalysis data in a variety of experimental toxicity states in the rat [20]. PR reduced the potentially enormous spectroscopic data to a few interpretable latent variables and provided a means of classifying toxicological data.

Multivariate projection approaches are advantageous for a reliable and flexible analysis of NMR and MS data as they reduce the dimensionality of the data to improve data visualization and interpretation. This allows handling of the multicollinearity and information redundancy issues commonly observed in spectral data sets: the spectral signature of a given metabolite may contain multiple signals, for example isotopologues, fragments, and adducts in MS and multiplets in NMR. Often, the number of variables is much larger than the number of observations, which is problematic for classical linear regression methods. All of these issues are appropriately addressed through the use of latent variable methods.

2.2. Principal component analysis

Early PR methods employed in metabonomics included non-linear mapping, hierarchical cluster analysis (HCA), and PCA [13,14,20,24,25]. Based on ¹H NMR urine spectra, HCA and PCA methods were found to give reliable clustering of renal proximal tubular toxins based on similarity of mechanism of toxicity. PCA is suited as a data representation because the decomposition retains a high proportion of the variance in the original data set. Many early metabonomic studies employing PCA were toxicity studies that involved large toxin-induced metabolic perturbations, which appeared in the first few principal components, and often displayed the separation of the different study groups on the first component. Clustering of liver, kidney, and testicular toxins could, for example, be obtained through such unsupervised PR methods applied to spectral biofluid data. Moreover, direct visualization of the prominent metabolic changes is possible by plotting the loadings of the PCA model [13,14,20,24,25].

2.3. PCA-based classification

Group classification and membership prediction was initially performed using unsupervised or ‘semi-supervised’ PCA-based approaches. The generated PCA model can be used to project new data and determine membership on the basis of scores projection. The PCA-based method called soft independent modeling of class analogy (SIMCA) [26] is based on class groupings: independent PCA models are generated for each of the given classes in a data set and new samples are projected onto each independent model; new sample membership is then assigned by proximity to the sub-models.
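
The SIMCA idea of one PCA sub-model per class, with new samples assigned by proximity, can be sketched as follows. This is a simplified illustration and not the published SIMCA procedure: the function names, the simulated data, the naive degrees-of-freedom scaling, and the use of the scaled residual distance as the only proximity measure are all assumptions. Full implementations typically add statistical limits on the residual variance and on the scores.

```python
import numpy as np

def fit_class_pca(X, n_components):
    """Fit a PCA sub-model to the samples (rows) of a single class."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes of the centered class data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                  # loadings, variables x components
    resid = Xc - Xc @ P @ P.T                # variation not captured by the sub-model
    # Typical residual size of the training samples (naive scaling, an assumption)
    s0 = np.sqrt((resid ** 2).sum() / max(X.shape[0] - n_components - 1, 1))
    return {"mean": mean, "P": P, "s0": s0}

def residual_distance(model, x):
    """Distance of sample x to a class sub-model, relative to that class's own residuals."""
    xc = x - model["mean"]
    e = xc - model["P"] @ (model["P"].T @ xc)
    return np.linalg.norm(e) / model["s0"]

def simca_assign(models, x):
    """Assign x to the class whose sub-model it lies closest to."""
    d = {label: residual_distance(m, x) for label, m in models.items()}
    return min(d, key=d.get), d

# Simulated two-class data (assumed): each class lies near its own 2-D subspace
rng = np.random.default_rng(0)

def simulate(mean, axes, n):
    scores = rng.normal(size=(n, axes.shape[0]))
    return mean + scores @ axes + 0.05 * rng.normal(size=(n, axes.shape[1]))

axes_a, axes_b = rng.normal(size=(2, 20)), rng.normal(size=(2, 20))
mean_a, mean_b = np.zeros(20), np.full(20, 3.0)
models = {
    "A": fit_class_pca(simulate(mean_a, axes_a, 30), n_components=2),
    "B": fit_class_pca(simulate(mean_b, axes_b, 30), n_components=2),
}
new_sample = simulate(mean_a, axes_a, 1)[0]
label, distances = simca_assign(models, new_sample)
print(label, distances)
```

Because each class has its own loadings, the sketch also makes the interpretive drawback visible: there is no single loading vector that describes what separates the classes.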
Often, complementary information such as the unmodeled variance for a sample is incorporated into a Coomans plot, showing the proximity to the PCA scores confidence interval space (Hotelling’s T²), which provides a visual classification method for the two-class case. One of the earliest applications of the SIMCA modeling approach in metabolic profiling can be traced back to 1981, when Wold and co-workers [17] used GC/MS to study human cancer cells. Subsequently, applications in metabonomic toxicity, clinical pathologies, and molecular parasitology have been developed [27,28]. A disadvantage of SIMCA for data visualization is the presence of several different sub-models, rather than one overall model with clear-cut interpretation of the loadings. This is the main drawback in using SIMCA for metabonomic applications, as it is hard to ascertain what variables separate the classes. Alternative procedures providing this information include the use of density distributions on the latent variable [29] or the use of the discriminant analysis (DA) version of PLS (PLS-DA).

2.4. Partial least squares

Unlike PCA, PLS regression is directed by a response data set Y to derive the components from the descriptor data set X that best describe the specified Y structure, as it maximizes the covariance expressing the common structure between X and Y [23,30,31]. PLS is sometimes subdivided into regression analysis and discriminant analysis (PLS-DA). In classification or DA, samples are allocated into the appropriate discrete classes, which are represented by using so-called ‘dummy’ variables (Booleans). PLS and supervised approaches, in general, are often applied to tackle more subtle problems in metabonomics, in which large arrays of data require an approach that permits relationships buried in a background of other large and multiplexed effects to be uncovered within these data. For example in biological systems (e.g., environmental, nutritional, or epidemiological studies) it is often the case that several metabolic contributions control the compo…

… have been suggested [33], which can complicate studies of such areas as cardiovascular disease diagnosis [43,44].

… more subtle effects such as physiological challenges, the effect of probiotics and prebiotics, and nutritional status. A noteworthy application of PLS involved the introduction and validation of the pharmaco-metabonomics concept, in which the individual response to a toxic insult can be predicted prior to a drug treatment [34,35]. Such developments in the pursuit of personalized healthcare were direct consequences of careful and efficient application of PLS modeling. The main virtue of latent variable projection methods such as PLS is the transparency of the models with respect to scores (with possible sub-groupings), and weight vectors, which allow metabolic interpretation. This is more difficult with ‘black box’ methods such as support vector machines and other classifiers that need dimensionality reduction prior to nonlinear classification.

2.5. Validation

One of the main advantages of PLS is the predictive ability of the obtained models. However, without caution, any supervised modeling is prone to overfitting. Ideally, the availability of a second, independent data set would allow for validation of the calculated PLS models. Alternatively, there are a number of methods to perform cross-validation based on the data that are used for modeling. These include methods that leave either one or a fraction of the samples out (such as n-fold cross-validation), but preference should be given to the use of resampling techniques, such as Monte Carlo cross-validation [36] and bootstrapping [37]. The use of cross-validation gives the possibility to calculate model statistics such as R² (proportion of explained variance) and Q² (proportion of predicted variance), as well as the sensitivity, specificity, and the receiver operating characteristic (ROC) [38]. The use of cross-validation also results in predictions from the left-out samples and one can advantageously use these cross-validated scores plots to get more realistic information on the predictivity and quality of the model. Interestingly, the number of components is often determined based on the Q² value, but as the Q² is later used to report the…
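
The dummy-variable encoding used in PLS-DA, and the R² and Q² statistics discussed under Validation, can be illustrated together in a short sketch. It is an assumption-laden illustration and not the authors' implementation: it uses a single-response NIPALS-style PLS written in NumPy, simulated two-class data, and plain k-fold cross-validation for brevity, whereas the text expresses a preference for resampling schemes such as Monte Carlo cross-validation or bootstrapping. All function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Single-response PLS via NIPALS-style deflation; returns (x_mean, y_mean, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                      # weights: direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = E @ w                        # scores (the latent variable)
        tt = t @ t
        p = E.T @ t / tt                 # X loadings
        c = f @ t / tt                   # regression of y on the scores
        E = E - np.outer(t, p)           # deflate X
        f = f - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # coefficients in the original variable space
    return x_mean, y_mean, B

def pls1_predict(model, X):
    x_mean, y_mean, B = model
    return (X - x_mean) @ B + y_mean

def cross_validated_q2(X, y, n_components, n_folds=5, seed=0):
    """Q2 = 1 - PRESS/TSS, using predictions for samples left out of model fitting."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_cv = np.empty(len(y))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = pls1_fit(X[train_idx], y[train_idx], n_components)
        y_cv[test_idx] = pls1_predict(model, X[test_idx])
    press = np.sum((y - y_cv) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / tss, y_cv

# Simulated data (assumed): 2 classes x 20 samples, 50 variables, 10 of them discriminating
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 50))
y = np.repeat([0.0, 1.0], 20)            # dummy variable encoding class membership
X[y == 1.0, :10] += 2.0

model = pls1_fit(X, y, n_components=1)
r2 = 1.0 - np.sum((y - pls1_predict(model, X)) ** 2) / np.sum((y - y.mean()) ** 2)
q2, y_cv = cross_validated_q2(X, y, n_components=1)
accuracy = np.mean((y_cv > 0.5) == (y == 1.0))
print(f"R2 = {r2:.2f}, Q2 = {q2:.2f}, cross-validated accuracy = {accuracy:.2f}")
```

In this sketch R² is computed from the fit to all samples and Q² from left-out predictions only, so a large gap between the two is a simple warning sign of the overfitting that the Validation section cautions against.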