Review

Received: 8 March 2010, Revised: 1 July 2010, Accepted: 20 September 2010, Published online in Wiley Online Library: 2010
(wileyonlinelibrary.com) DOI: 10.1002/cem.1359

The evolution of partial least squares models
and related chemometric approaches in
metabonomics and metabolic phenotyping

Judith M. Fonville^a, Selena E. Richards^a, Richard H. Barton^a, Claire
L. Boulange^a, Timothy M. D. Ebbels^a, Jeremy K. Nicholson^a, Elaine Holmes^a
and Marc-Emmanuel Dumas^a*

Metabonomics is a key element in systems biology, and with current analytical methods, generates vast amounts of
quantitative or qualitative metabolic data. Understanding of the global function of the living organism can be
achieved by integration of ‘omics’ approaches including metabonomics, genomics, transcriptomics and proteomics,
increasing the complexity of the full data sets. Multivariate statistical approaches are well suited to extract the
characterizing metabolic information associated with each level of dynamic process. In this review, we discuss
techniques that have evolved from principal component analysis and partial least squares (PLS) methods with a focus
on improved interpretation and modeling with respect to biomarker recovery and data visualization in the context of
metabonomic applications. Visualization is of paramount importance to investigate complex metabolic signatures,
the power and potential of which is illustrated with key papers. Recent improvements based on the removal of
orthogonal variation are discussed in terms of interpretation enhancement, and are supported by relevant
applications. Flexibility of PLS methods in general and of O-PLS in particular allows implementation of derivative
methods such as O2-PLS, O-PLS-variance components, nonlinear methods, and batch modeling to improve analysis of
complex data sets, which facilitates extraction of information related to subtle biological processes. These approaches
can be used to address issues present in complex multi-factorial data sets. Thus, we highlight the key advantages and
limitations of the different latent variable applications for top-down systems biology and assess the differences
between the methods available. Copyright © 2010 John Wiley & Sons, Ltd.

Keywords: metabonomics; multivariate; partial least squares; O-PLS; systems biology

* Correspondence to: M.-E. Dumas, Biomolecular Medicine, Department of
Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander
Fleming Building, Exhibition Road, South Kensington, London SW7 2AZ,
UK.
E-mail: m.dumas@imperial.ac.uk
a J. M. Fonville, S. E. Richards, R. H. Barton, C. L. Boulange, T. M. D. Ebbels,
J. K. Nicholson, E. Holmes, M.-E. Dumas
Biomolecular Medicine, Department of Surgery and Cancer, Faculty of
Medicine, Imperial College London, Sir Alexander Fleming Building, Exhibition
Road, South Kensington, London SW7 2AZ, UK

J. Chemometrics 2010; 24: 636–649 Copyright © 2010 John Wiley & Sons, Ltd.

1. INTRODUCTION
Metabonomics [1–6] was originally defined as ‘the quantitative
measurement of the dynamic multi-parametric metabolic
response of living systems to pathophysiological stimuli or
genetic modification’ [1]. In the same era, metabolomics was
defined in plant biology and microbiology as ‘the systematic
study of the metabolic complement of the cell’ [2], but has since
broadened to encompass a wider characterization of metabolic
responses, and often the two terms are used interchangeably.
The concept of metabonomics arose from pioneering work
[3] on the application of nuclear magnetic resonance (NMR)
spectroscopy to study the metabolic composition of biofluids,
cells and tissues and was aimed at the augmentation and
complementation of the information provided by measuring the
genetic, and later proteomic, responses to xenobiotic exposure.
Genomics, proteomics, and metabonomics [4]/metabolomics
[5] now form a core part of the fundamental systems biology
framework [7], and are required for an understanding of the
integrated function of the living organism [8,9], with each ‘-omic’
platform generating large amounts of data.
To investigate the complex metabolic consequences of, for
example, disease processes, toxic reactions or genetic manipulations,
information-rich analytical approaches are required. Several
methods in addition to NMR can produce metabolic signatures of
biomaterials, including mass spectrometry (MS) [10], and optical
spectroscopic techniques such as Raman, Fourier transform
infrared (FTIR), near infrared (NIR) and ultraviolet–visible (UV–Vis)
spectroscopies. Simplification of the mass spectra can be
obtained by coupling MS to a separation technology such as
gas chromatography (GC/MS) or high-performance liquid
chromatography (HPLC/MS) [11]. NMR and MS are currently
the most frequently used techniques for metabolic profiling [12],
and the characteristic differences in their nature mean that they
often provide complementary information for the analytical
scientist. The obtained spectra of biofluids such as plasma and
urine, or of tissues and tissue extracts, provide metabolic patterns
corresponding to the metabolic status of the organism as a
function of genetic, environmental or toxicological influence.
Ongoing analytical developments include cryogenic probes and
higher magnetic field strengths for NMR, as well as improved
mass resolution for MS. The enhanced sensitivity
and resolution gained by these developments increase the
capability for biomarker detection, at the expense of increased
spectral complexity. Spectral complexity was traditionally
addressed by peak picking or by binning signals across the
spectra in order to generate a smaller or more manageable data
set. For NMR spectra, typically 250 buckets of 0.04 ppm width
would be generated [13,14]. With increased computational
capacity, this procedure has largely been replaced by importing
full-resolution spectra (e.g. 32k data points) [15]. As a result,
analyses are performed on the full data set rather than a
peak-picked selection, which increases data dimensionality. This
full spectral import can increase the information recovered, for
example potential biomarkers, from the statistical modeling per-
formed on the data. Similarly, mass resolution has now increased
up to 10 000 for time-of-flight analyzers and up to 100 000 for
Fourier-transform MS analyzers, which ensures access to accurate
mass measurements and peak lists with ~1000–5000 peaks.
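The traditional bucketing step described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited study: the function name `bin_spectrum`, the 0–10 ppm window, and the flat synthetic spectrum are assumptions chosen so that 0.04 ppm buckets give the 250 variables mentioned in the text.

```python
def bin_spectrum(ppm, intensity, width=0.04, lo=0.0, hi=10.0):
    """Sum intensities into fixed-width chemical-shift buckets (assumed 0-10 ppm window)."""
    n_bins = round((hi - lo) / width)
    buckets = [0.0] * n_bins
    for shift, value in zip(ppm, intensity):
        if lo <= shift < hi:
            # Guard against floating-point rounding at the upper edge.
            idx = min(int((shift - lo) / width), n_bins - 1)
            buckets[idx] += value
    return buckets

# Hypothetical full-resolution spectrum: 32k points evenly spaced over 0-10 ppm.
n_points = 32768
ppm = [10.0 * i / n_points for i in range(n_points)]
intensity = [1.0] * n_points  # flat synthetic signal, for illustration only

buckets = bin_spectrum(ppm, intensity)
print(len(buckets), "buckets from", n_points, "data points")
```

Bucketing trades resolution for a smaller data matrix: total intensity is preserved, but signals that fall in the same bucket can no longer be told apart, which is one reason full-resolution import is attractive when computing capacity allows.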
The analysis of these data sets is now possible with the constant
improvement of computing capabilities of both software and
hardware for personal workstations. Additionally, increased
throughput of samples and the move toward increased cross-
platform integration have further increased the complexity and size
of data sets and present a considerable challenge to the analyst.
In order to accommodate the enormity of the data, chemometric
methods are often employed to extract the specific metabolic
information characterizing each group or dynamic process, in order
to form or confirm hypotheses associated with the expression of a
metabolic profile. In this review the most commonly used latent
variable methods for the extraction of statistically characterized
metabolic phenotypes are discussed in semi-chronological order
(Figure 1 and Textbox); we especially emphasize the improvement
in interpretability of spectral data made possible through the use of
orthogonal partial least squares (O-PLS) [16].

Figure 1. A timeline displaying the order of development of PLS and subsequent
or related approaches with exemplar applications in metabonomic
analyses. The applications listed here are discussed throughout the text and
relate to typical examples, not the timeline.

2. LATENT VARIABLE METHODS IN METABONOMICS
2.1. Univariate versus multivariate approaches
The application of the multivariate advantage to biological
studies produces weighted combinations of the original variables
(forming the latent variables) that delineate groupings, which
are often not visible via any classical univariate analytical
approach. Only multivariate methods can really take into account
the relations between variables: multi-variable patterns can be
significant, even if the individual variables are not. For example,
classes of metabolically differentiated groups may only be
evident using multivariate approaches. In an early paper on
classification of brain tumors [17] it was noted that ‘. . .when these
data are subjected to pattern recognition analyses, the possibility
will be open for an objective differential diagnosis. . . based solely on
differences in. . . biochemical composition’, thus emphasizing the
need for unbiased approaches in biochemical studies. Although
initially, spectral analysis of metabolic profiles was performed
visually [18,19] it was soon recognized that pattern recognition
(PR) methods had great potential for metabonomic studies [20].
The multivariate approach [21] results in an unbiased overview of
the data set, as facilitated through the representation with latent
variables. In such cases as NMR, latent variables may be visually
presented in pseudo-spectral form as will be discussed later, a
great aid to the experienced spectroscopist [15].
Two commonly used multivariate methods are principal
component analysis (PCA [22], unsupervised) and partial least
squares regression (PLS [23], supervised). In these models, a
multivariate latent variable is constructed from the input
(X matrix, e.g. spectral variables). Latent variables, as opposed
to observed variables, are derived from modeling using the
original variables. This latent vector maximizes total variance (in
PCA), or the covariance between the X and response (Y) matrices,
in PLS.
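The difference between the two criteria can be made concrete with a small numerical sketch. It assumes a single response vector y, column-centered data, and a toy data set invented for illustration; the helper names are hypothetical. For PCA, the first loading is the unit vector p that maximizes the variance of the scores Xp (found here by power iteration on XᵀX); for PLS with one response, the first weight vector is proportional to Xᵀy, which maximizes the covariance between Xw and y.

```python
import math

def center_columns(X):
    """Subtract each column mean so that scores have zero mean."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def unit(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def project(X, w):
    """Scores t = X w: one latent-variable value per sample."""
    return [sum(x * wj for x, wj in zip(row, w)) for row in X]

def back_project(X, t):
    """X^T t: one value per original variable."""
    return [sum(row[j] * ti for row, ti in zip(X, t)) for j in range(len(X[0]))]

def pca_first_loading(X, n_iter=200):
    """Power iteration on X^T X: unit vector p maximizing var(X p)."""
    p = unit([1.0] * len(X[0]))
    for _ in range(n_iter):
        p = unit(back_project(X, project(X, p)))
    return p

def pls_first_weight(X, y):
    """Unit vector w maximizing cov(X w, y) for a single response: w is proportional to X^T y."""
    return unit(back_project(X, y))

# Toy data (assumed): variable 1 has large variance unrelated to y,
# variable 2 has small variance but tracks the response.
raw_X = [[-3.0, -0.5], [3.0, -0.4], [0.0, -0.6],
         [0.0, 0.5], [3.0, 0.6], [-3.0, 0.4]]
raw_y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]

X = center_columns(raw_X)
y = [v - sum(raw_y) / len(raw_y) for v in raw_y]

p = pca_first_loading(X)
w = pls_first_weight(X, y)
print("PCA loading:", [round(v, 3) for v in p])
print("PLS weight: ", [round(v, 3) for v in w])
```

In this toy case the PCA loading is dominated by the high-variance variable, whereas the PLS weight is dominated by the variable that covaries with y, which is the behavior the text attributes to unsupervised and supervised latent variables respectively.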
The early use of multivariate analysis in the medical precursor
work [17] was followed by the application of these techniques
in metabonomic research [6,13,20,24]. An exemplar employment
of PR methods in a metabonomics context was the classification
and interrogation of the 1H NMR urinalysis data in a variety of
experimental toxicity states in the rat [20]. PR reduced the
potentially enormous spectroscopic data to a few interpretable
latent variables and provided a means of classifying toxicological
data.

Historical background of PLS methods
The origins of PCA itself lie as far back as Cauchy and his eigenanalysis (ca. 1829), but PCA was made explicit by Pearson [122], 1901,
for application in biometric studies, and later used by Hotelling in the 1930s [123,124] before being employed in econometrics by
Hermann Wold. PLS shares some central features of PCA in that it is a ‘variance harvesting’ approach for the analysis of large data sets.
It is neither appropriate nor possible here to provide a detailed history and evolution of PCA and subsequent PLS methods, but for
those wanting more historical details on PLS, the review by Geladi [125] from 1988 on the history and nature of PLS modeling is
recommended, as is the textbook by Eriksson et al. [38]. In outline, the PLS method was described by Horst in 1961 [125,126], working
in the area of psychometrics, although Horst also claimed that the PLS principle was already understood by Hotelling [127] as early as
1936 in his original article on canonical analysis. Subsequently, many PLS applications in the field of econometrics were pioneered by
Hermann Wold [125], including the path PLS approach as well as the NIPALS algorithm, developed together with his son Svante Wold,
and in fact it was the NIPALS algorithm from which the name ‘partial least squares’ was drawn [38]. Officially S. Wold places the ‘public
delivery’ date of PLS as being 1977 [125] although the NIPALS procedure was published in 1969 in the context of PLS path modeling
[128]. It is interesting to note that while econometrics made significant developments in various issues for large matrix analysis, the
more recent concept of orthogonal variance filtration was introduced in the context of chemical analysis. Methods filtering out the
variation orthogonal to the response vectors led to the development of OSC in 1998, implemented using NIPALS, and which was later
integrated with the PLS algorithm, giving O-PLS in 2002. Although the principle of canonical analysis was introduced very early on by
Hotelling [127], the development of alternative algorithms such as NIPALS [129], and SIMPLS [130] greatly improved computational
speed, efficiency, and stability of subsequent PLS algorithms. The computational ease with which PLS can be implemented and the
advent of computer numerical methods have powerfully assisted the expansion of real-world application of PLS in diverse fields that
require the analysis of large, complex measurement matrices, such as biometry, process analysis, chemo-informatics, bio-informatics,
psychometrics, and econometrics.

Multivariate projection approaches are advantageous for a
reliable and flexible analysis of NMR and MS data as they reduce
the dimensionality of the data to improve data visualization and
interpretation. This allows handling of the multicollinearity and
information redundancy issues commonly observed in spectral
data sets: the spectral signature of a given metabolite may
contain multiple signals, for example isotopologues, fragments,
and adducts in MS and multiplets in NMR. Often, the number of
variables is much larger than the number of observations, which
is problematic for classical linear regression methods. All of these
issues are appropriately addressed through the use of latent
variable methods.
2.2. Principal component analysis
Early PR methods employed in metabonomics included non-
linear mapping, hierarchical cluster analysis (HCA), and PCA
[13,14,20,24,25]. Based on 1H NMR urine spectra, HCA and PCA
methods were found to give reliable clustering of renal proximal
tubular toxins based on similarity of mechanism of toxicity. PCA is
suited as a data representation because the decomposition
retains a high proportion of the variance in the original data set.
Many early metabonomic studies employing PCA were
toxicity studies that involved large toxin-induced metabolic
perturbations, which appeared in the first few principal
components, and often displayed the separation of the different
study groups on the first component. Clustering of liver, kidney,
and testicular toxins could, for example, be obtained through
such unsupervised PR methods applied to spectral biofluid data.
Moreover, direct visualization of the prominent metabolic
changes is possible by plotting the loadings of the PCA model
[13,14,20,24,25].
2.3. PCA-based classification
Group classification and membership prediction was initially
performed using unsupervised or ‘semi-supervised’ PCA-based
approaches. The generated PCA model can be used to project
new data and determine membership on the basis of scores
projection. The PCA-based method called soft independent
modeling of class analogy (SIMCA) [26] is based on class
groupings: independent PCA models are generated for each of
the given classes in a data set and new samples are projected
onto each independent model; new sample membership is then
assigned by proximity to the sub-models. Often, complementary
information such as the unmodeled variance for a sample is
incorporated into a Coomans plot, showing the proximity to
the PCA scores confidence interval space (Hotelling’s T2) that
provides a visual classification method for the two-class case.
One of the earliest applications of the SIMCA modeling approach
in metabolic profiling can be traced back to 1981, when Wold
and co-workers [17] used GC/MS to study human cancer cells.
Subsequently, applications in metabonomic toxicity, clinical
pathologies, and molecular parasitology have been developed
[27,28]. A disadvantage of SIMCA for data visualization is the
presence of several different sub-models, rather than one overall
model with clear-cut interpretation of the loadings. This is the
main drawback in using SIMCA for metabonomic applications, as
it is hard to ascertain what variables separate the classes.
Alternative procedures providing this information include the use
of density distributions on the latent variable [29] or the use of
the discriminant analysis (DA) version of PLS (PLS-DA).
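The SIMCA idea of one independent PCA model per class can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: each class model has a single component, membership is assigned to the class with the smallest residual distance, and the statistical thresholds (and the option of assigning a sample to no class) used in full SIMCA are omitted. The data and function names are hypothetical.

```python
import math

def fit_class_model(X, n_iter=200):
    """One-component PCA model of a single class: (mean vector, unit loading)."""
    n, m = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - mean[j] for j in range(m)] for row in X]
    p = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):  # power iteration on Xc^T Xc
        t = [sum(a * b for a, b in zip(row, p)) for row in Xc]
        p = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in p))
        p = [a / norm for a in p]
    return mean, p

def residual_distance(x, model):
    """Distance from x to the class model, i.e. the part of x the model leaves unexplained."""
    mean, p = model
    xc = [a - b for a, b in zip(x, mean)]
    t = sum(a * b for a, b in zip(xc, p))  # score of the projected sample
    return math.sqrt(sum((a - t * b) ** 2 for a, b in zip(xc, p)))

def assign(x, models):
    """Assign x to the class whose sub-model it lies closest to."""
    return min(models, key=lambda label: residual_distance(x, models[label]))

# Hypothetical three-variable data for two classes.
class_a = [[1.0, 0.1, 0.0], [2.0, -0.1, 0.1], [3.0, 0.0, -0.1], [4.0, 0.1, 0.0]]
class_b = [[0.0, 1.0, 5.0], [0.1, 2.0, 5.0], [-0.1, 3.0, 5.1], [0.0, 4.0, 4.9]]
models = {"A": fit_class_model(class_a), "B": fit_class_model(class_b)}

for sample in ([2.5, 0.0, 0.1], [0.0, 2.5, 5.0]):
    distances = {k: round(residual_distance(sample, m), 2) for k, m in models.items()}
    print(sample, "->", assign(sample, models), distances)
```

The pair of distances computed for each sample is the kind of information displayed in a two-class distance plot. The sketch also shows the drawback noted above: each class has its own loading vector, so there is no single set of loadings that describes what separates the classes.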
have been suggested [33], which can complicate studies of
such areas as cardiovascular disease diagnosis [43,44].
more subtle effects such as physiological challenges, the effect of
probiotics and prebiotics, and nutritional status. A noteworthy
application of PLS involved the introduction and validation of
the pharmaco-metabonomics concept, in which the individual
response to a toxic insult can be predicted prior to a drug
treatment [34,35]. Such developments in the pursuit of
personalized healthcare were direct consequences of careful
and efficient application of PLS modeling. The main virtue of
latent variable projection methods such as PLS is the
transparency of the models with respect to scores (with possible
sub-groupings), and weight vectors, which allow metabolic
interpretation. This is more difficult with ‘black box’ methods such
as support vector machines and other classifiers that need
dimensionality reduction prior to nonlinear classification.
2.5. Validation
One of the main advantages of PLS is the predictive ability of
the obtained models. However, without caution, any supervised
modeling is prone to overfitting. Ideally, the availability of a
second, independent data set would allow for validation of
the calculated PLS models. Alternatively, there are a number of
methods to perform cross-validation based on the data that
are used for modeling. These include methods that leave
either one or a fraction of the samples out (such as n-fold
cross-validation), but preference should be given to the use of
resampling techniques, such as Monte Carlo cross-validation
[36] and bootstrapping [37]. The use of cross-validation
gives the possibility to calculate model statistics such as R2
(proportion of explained variance) and Q2 (proportion of
predicted variance), as well as the sensitivity, specificity, and
the receiver operating characteristic (ROC) [38]. The use of
cross-validation also results in predictions from the left-out
samples and one can advantageously use these cross-validated
scores plots to get more realistic information on the predictivity
and quality of the model.
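As a rough illustration of these statistics, the sketch below computes R2 from the fit to all samples and Q2 from Monte Carlo cross-validation, in which random subsets are repeatedly left out and predicted. The assumptions are loud ones: a one-component PLS model with a single response, synthetic data, and Q2 defined as 1 − PRESS/TSS with the total sum of squares taken about the training-set mean. None of the function names or settings come from the cited works.

```python
import math
import random

def fit_pls1(X, y):
    """One-component PLS regression for a single response (simplifying assumption)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]  # X^T y
    norm = math.sqrt(sum(a * a for a in w))
    w = [a / norm for a in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]           # scores
    c = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)    # y regressed on t
    return x_mean, y_mean, w, c

def predict(model, x):
    x_mean, y_mean, w, c = model
    t = sum((a - b) * wj for a, b, wj in zip(x, x_mean, w))
    return y_mean + c * t

def r2(X, y):
    """Proportion of explained variance when fitting all samples."""
    model = fit_pls1(X, y)
    y_mean = sum(y) / len(y)
    rss = sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(X, y))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - rss / tss

def monte_carlo_q2(X, y, n_splits=50, test_fraction=0.3, seed=0):
    """Proportion of predicted variance over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, int(round(test_fraction * n)))
    press = tss = 0.0
    for _ in range(n_splits):
        test = set(rng.sample(range(n), n_test))
        train = [i for i in range(n) if i not in test]
        model = fit_pls1([X[i] for i in train], [y[i] for i in train])
        y_train_mean = sum(y[i] for i in train) / len(train)
        for i in test:
            press += (y[i] - predict(model, X[i])) ** 2
            tss += (y[i] - y_train_mean) ** 2
    return 1.0 - press / tss

# Synthetic data: one informative variable plus four noise variables.
rng = random.Random(1)
n_samples = 30
X, y = [], []
for i in range(n_samples):
    signal = -3.0 + 6.0 * i / (n_samples - 1)
    X.append([signal] + [rng.uniform(-0.5, 0.5) for _ in range(4)])
    y.append(2.0 * signal + rng.uniform(-0.1, 0.1))

r2_value = r2(X, y)
q2_value = monte_carlo_q2(X, y)
print("R2 =", round(r2_value, 3), " Q2 =", round(q2_value, 3))
```

Because Q2 is computed only from samples that were not used to fit the model, a large gap between R2 and Q2 is a common warning sign of overfitting. An independent test set, where available, remains the stronger check.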
Interestingly, the number of components is often determined
based on the Q2 value, but as the Q2 is later used to report the
2.4. Partial least squares
Unlike PCA, PLS regression is directed by a response data set Y
to derive the components from the descriptor data set X that best
describe the specified Y structure, as it maximizes the covariance
expressing the common structure between X and Y [23,30,31].
PLS is sometimes subdivided into regression analysis and
discriminant analysis (PLS-DA). In classification or DA, samples
are allocated into the appropriate discrete classes, which are
represented by using so-called ‘dummy’ variables (Booleans).
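The dummy-variable coding can be illustrated with a short sketch. It assumes a two-class problem, a one-component PLS model fitted to a single 0/1 column of the dummy matrix, and a decision threshold of 0.5 on the predicted value; the data, class names, and function names are hypothetical and chosen only for illustration.

```python
import math

def dummy_matrix(labels):
    """Encode class labels as 0/1 'dummy' columns, one column per class."""
    classes = sorted(set(labels))
    return classes, [[1.0 if lab == c else 0.0 for c in classes] for lab in labels]

def fit_pls1(X, y):
    """One-component PLS fit of a single dummy column (simplifying assumption)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]  # X^T y
    norm = math.sqrt(sum(a * a for a in w))
    w = [a / norm for a in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    c = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return x_mean, y_mean, w, c

def classify(model, x, classes, threshold=0.5):
    """Predict the dummy value and threshold it (the 0.5 cut-off is an assumption)."""
    x_mean, y_mean, w, c = model
    t = sum((a - b) * wj for a, b, wj in zip(x, x_mean, w))
    return classes[1] if y_mean + c * t > threshold else classes[0]

# Hypothetical data: three spectral variables, two samples per class.
X = [[1.0, 0.2, 0.5], [1.1, 0.1, 0.4], [0.2, 1.0, 0.5], [0.1, 1.2, 0.6]]
labels = ["control", "control", "treated", "treated"]

classes, Y = dummy_matrix(labels)
y_treated = [row[classes.index("treated")] for row in Y]
model = fit_pls1(X, y_treated)

print(classes, Y)
print(classify(model, [0.15, 1.1, 0.5], classes))
print(classify(model, [1.05, 0.15, 0.45], classes))
```

For more than two classes, each dummy column would be modeled and a sample assigned to the class with the largest predicted value. The weight vector `w` remains inspectable, which is the interpretability advantage over separate per-class models.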
PLS and supervised approaches, in general, are often applied
to tackle more subtle problems in metabonomics, in which large
arrays of data require an approach that permits relationships
buried in a background of other large and multiplexed effects
to be uncovered within these data. For example in biological
systems (e.g., environmental, nutritional, or epidemiological
studies) it is often the case that several metabolic contributions
control the compo