[09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 502 502–508
BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 4 2011, pages 502–508doi:10.1093/bioinformatics/btq696
Gene expression Advance Access publication December 17, 2010
Using non-uniform read distribution models to improve isoform
expression inference in RNA-Seq
Zhengpeng Wu†, Xi Wang† and Xuegong Zhang∗
MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua
University, Beijing 100084, China
Associate Editor: Ivo Hofacker
ABSTRACT
Motivation: RNA-Seq technology based on next-generation
sequencing provides the unprecedented ability of studying
transcriptomes at high resolution and accuracy, and the potential
of measuring expression of multiple isoforms from the same gene at
high precision. Solved by maximum likelihood estimation, isoform
expression can be inferred in RNA-Seq using statistical models
based on the assumption that sequenced reads are distributed
uniformly along transcripts. Modification of the model is needed
when considering situations where RNA-Seq data do not follow
uniform distribution.
Results: We proposed two curves, the global bias curve (GBC)
and the local bias curves (LBCs), to describe the non-uniformity
of read distributions for all genes in a transcriptome and for each
gene, respectively. Incorporating the bias curves into the uniform
read distribution (URD) model, we introduced non-URD (N-URD)
models to infer isoform expression levels. On a series of systematic
simulation studies, the proposed models outperform the original
model in recovering major isoforms and the expression ratio of
alternative isoforms. We also applied the new model to real RNA-
Seq datasets and found that its inferences on expression ratios of
alternative isoforms are more reasonable. The experiments indicate
that incorporating N-URD information can improve the accuracy in
modeling and inferring isoform expression in RNA-Seq.
Contact: zhangxg@tsinghua.edu.cn
Supplementary information: Supplementary data are available at
Bioinformatics online.
Received on April 6, 2010; revised on November 25, 2010; accepted
on December 11, 2010
1 INTRODUCTION
Applying next-generation high-throughput sequencing technologies
to transcriptomes (RNA-Seq) makes it possible to precisely measure
the abundances of transcripts. This technology has been widely
investigated on the study of gene expressions (Marioni et al., 2008;
Nagalakshmi et al., 2008), splice variants (Pan et al., 2008; Wang
et al., 2008; Wilhelm et al., 2008), novel transcripts discovery
(Mortazavi et al., 2008), RNA sequence polymorphism (Cloonan
et al., 2008) and chimeric transcripts (Zhang et al., 2010) in recent
∗To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.
literature. The ability of RNA-Seq to count sequenced reads along
transcripts gives digital measurement of transcription levels, and is
powerful for identifying and measuring the amount of transcribed
isoforms of alternative splicing genes (Wang et al., 2008).
Changes of isoforms’ expression levels in many genes are
of functional importance in particular biological processes. For
example, switching between major and minor isoforms of genes
plays a crucial role in mouse muscle myogenesis (Trapnell et al.,
2010); isoform expression changes have been shown to be highly
related to the development of many complex diseases, such as
Lewy bodies (Beyer et al., 2008; Humbert et al., 2007) and
progressive supranuclear palsy (Chambers et al., 1999); studies also
obtained the evidence that the expression ratio of two alternative
isoforms from human progesterone receptor (PR) gene could result
in different outcomes of breast cancer treatment (Cork et al.,
2008). Therefore, correctly estimating isoform expression levels
is becoming increasingly important for understanding complicated
biological mechanisms.
In RNA-Seq, under the assumption that the sequenced reads
are sampled independently and uniformly on all transcripts in the
sample, it is straightforward to model the distribution of read counts
of exons as a Poisson distribution (Mortazavi et al., 2008). Using
a uniform read distribution model (URD model for short), Jiang
and Wong (2009) proposed a maximum likelihood estimate (MLE)
method to infer isoform expression levels. However, some studies
show that there are substantial biases in high-throughput sequencing
data (Dohm et al., 2008), which may cause non-uniformity of read
distributions in RNA-Seq data (Mortazavi et al., 2008). Recent
literatures also pointed out that hexamer priming and local sequence
content can affect the sequencing preference (Hansen et al., 2010;
Li,J. et al., 2010). In our investigations, we observed the distributions
of RNA-Seq reads in many datasets are not uniform. Some datasets
have a strong bias toward the 3′ ends of transcripts, which means that
the 3′-ends usually have more reads sequenced. This phenomenon
may be caused by some pre-sequencing procedures such as the
poly(A)-selection step in sample preparation (see Section 4 for
details).
In such situations, the accuracy of isoform expression inference
based on the uniformity assumption will deteriorate. Thus, the
motivation of this study is to develop a method that takes the
non-uniformity of read distribution into consideration to improve
isoform expression inference. Recently, several other groups also
realized the importance of this problem and incorporated non-URD
(N-URD) information into uniform distribution models in different
ways (Howard and Heber, 2010; Li,B. et al., 2010). In our study, we
502 © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
at Shandong U
niversity on June 8, 2013
http://bioinform
atics.oxfordjournals.org/
D
ow
nloaded from
[09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 503 502–508
Non-uniform read distribution models in RNA-Seq
introduced two types of curves to depict the tendency of sequencing
biases, namely the global bias curve (GBC) and the local bias curves
(LBCs). The GBC applies to all genes in a dataset and the LBCs
are specific to particular genes. The curves are estimated from the
read distribution across the measured transcripts. We incorporated
the GBC and LBCs with the URD model and constructed N-URD
models for isoform expression inference. A series of simulation
experiments were conducted to study the performance of the
proposed method and to compare with state-of-the-art methods.
Results show that a significant gain can be achieved in the accuracy
of isoform expression inference. We applied the N-URD models on
several real RNA-Seq datasets, which also illustrated the advantage
of the proposed method.
2 METHODS
2.1 Notations
Assume gene G has n exons of lengths (l1,...,ln ). It has m isoforms with
expression levels (θ1,...,θm ) in an experiment. From RNA-Seq data, we
have a set of reads of counts (x1,...,xn ) on this gene, where xj is the number
of reads mapped onto the j-th exon of gene G. The task is to infer the
expression levels (θ1,...,θm ) from observed read counts and the known gene
structure (the exon composition of each isoform). We use an indicator matrix
(aij)m×n to represent the gene structure, where aij =1 or aij =0 indicates that
the j-th exon is included or excluded in the i-th isoform, respectively. If an
exon appears in two (or more) isoforms as different lengths (e.g. the case
of alternative 5′ end or 3′ end exons), we split the exon into several parts
and treat each part as an exon. Similarly, when we consider reads mapped
to exon junctions (junction reads), we can use the flanking exon segments
of a junction to form a ‘pseudo-exon’ and count it as an extra exon in the
inference. In the experiments, we did on data of ultra-short reads, we did
not consider junction reads for simplicity as they compose only a very small
part in the data.
2.2 Bias curves
We propose two types of bias curves to characterize the distribution of
reads along a transcript. The GBC represents the general tendency of read
distribution for the whole transcriptome, and the LBCs depict gene-specific
read distributions. The curves reflect the relative read distribution bias from
the 5′ end to the 3′ end of a gene.
We use genes with only single isoforms to estimate the GBC. This is
because the read distribution of the gene with multiple isoforms is affected
by the gene structure and may not reflect the general tendency of read
distribution along the gene. We extracted all single-isoform genes according
to the NCBI Reference Sequence (RefSeq) gene annotation (downloaded
from http://genome.ucsc.edu). Because of the uncertainty in lowly expressed
genes, we filter out genes with too few (<100) reads. In order to avoid the
impact of local fluctuation of read distribution, we divide each gene into a
small number (e.g. 10) of bins and get a bar chart or histogram of the read
counts across the bins. The histograms are then normalized to mean 1, and
then averaged among the number of studied single-isoform genes. This gives
us the GBC, which reflects general sequencing efficiency at different relative
locations of a gene in the dataset. Usually, we illustrate the GBC in the form
of curves instead of bar charts. The procedure for calculating the GBC is
illustrated in Figure 1. The GBC depicts the global read distribution bias
which may be caused by the experimental protocol. It is consistent between
technical replicates in the data we studied (Section 3).
It is also observed that read distribution of a specific gene can have its
own pattern that deviates from uniform distribution and also from the GBC.
This kind of bias may be due to local sequence features such as GC-content
variation along transcripts and simple sequence repeats in the genome, which
may affect the hexamer priming efficiency (Hansen et al., 2010) and read
Fig. 1. The flowchart of the calculation of the GBC. The main steps include:
(1) collect single-isoform genes according to known gene annotation; (2)
filter out genes which have too few mapped reads (read-count <100); (3)
calculate the 10-bin histograms of read positions for each gene and normalize
the mean of each curve to be 1; (4) finally, the average of these bias curves
is the GBC.
mappability (Rozowsky et al., 2009). We introduced the LBCs to describe
gene-specific non-uniform distributions.
Read distribution of a gene with multiple isoforms is the mixture of
read distributions of all its expressed isoforms. So a LBC could only be
an approximate description of trend of read distribution along the gene
region. We define for each gene a step function which is constant on each
exon. Suppose the expression level of the i-th isoform is θi, we assign the
value of the step function on the j-th exon as xj/ (lj
∑m
i=1θiaij ), j=1,2,...,n,
which describes the read count normalized by the exon length and isoform
occurrences weighted by their expression levels. The reads on an exon also
may not distribute uniformly. This can be considered by splitting an exon
into multiple ‘sub-exons’ when there are sufficient reads for the exon. In
our experiments, we take an exon as the basic unit for estimating the LBCs.
The LBC of a gene is got by normalizing the step function to be of mean
1. At the beginning of expression inference, the isoform expression levels
are unknown, so we assume expression levels of all isoforms of a gene
are identical (1,1,...,1)1×m in the calculation of the LBC. After inferring
isoform expression with this initial LBC, we can further update the LBC
using the inferred isoform expression levels θi, and sequentially iterate this
procedure. The details of this iteration procedure will be illustrated in details
in Section 2.4. Figure 2 gives the conceptual flowchart for calculating the
LBCs.
We use the bias curves to compensate for read-count variations caused by
such biases in inferring isoform expression. For this purpose, we partition
the bias curves into several segments corresponding to exons of the gene.
The mean value of the curves on each segment is calculated as the weighting
factor for the corresponding exon. Figure 3 illustrates this procedure. In the
following sections, we will show the strategy to incorporate such weights
into the statistical framework of isoform expression inference.
2.3 The URD model
In the URD model used by Jiang and Wong (2009), each observation xj
is assumed to be a random variable following a Poisson distribution with
parameter λj . For example, λj for the j-th exon is λj = ljw∑mi=1 aijθi, where
w is the total number of mapped reads in the RNA-Seq data and aij is the
503
at Shandong U
niversity on June 8, 2013
http://bioinform
atics.oxfordjournals.org/
D
ow
nloaded from
[09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 504 502–508
Z.Wu et al.
Fig. 2. The flowchart of the calculation of the LBC. Given the gene structure
and RNA-Seq data, assuming the isoform expression levels are known, LBC
describes the exon read counts normalized by total weighted-length of the
exons appearing in all isoforms.
Fig. 3. Illustration of the usage of bias curve. (a) A gene is divided into 10
bins from its 5′ to 3′ end, and the read counts on each bin are summarized in
a histogram. (b) The histogram is normalized to have mean value of 1 and is
represented in curve form. (c) To utilize the bias curve in isoform expression
inference, we first partition the curve into a number of parts, which equals
to the number of exons in the studied isoform. The three parts (separated
by red dotted lines) shown here is for a 3-exon isoform with length ratio
25:40:35. (d) The average value of the curve in each partition is calculated
(red dashed lines). The values are used as weights for the corresponding
exons in expression inference.
elements of the gene structure indictor matrix. The corresponding likelihood
function is defined as
L(�|xj)=
e−λj λxjj
xj!
.
Further assuming the independence of xj’s, for a gene G, the joint log-
likelihood function can be written as
log
(
L
(
�|x1,x2,...,xn
))= n∑
j=1
log
(
e−λj λxjj
xj!
)
.
Inserting the λj = ljw∑mi=1 aijθi for each exon, we have
log
(
L
(
�|x1,x2,...,xn
))=−w n∑
j=1
m∑
i=1
ljaijθi
+
n∑
j=1
xj log
(
ljw
m∑
i=1
aijθi
)
−
n∑
j=1
log
(
xj!
)
.
Based on above log-likelihood function, maximum likelihood estimation can
find the optimal parameters which give the inference of isoform expression
levels. Further, taking derivatives for each θi, we get
∂ log
(
L
(
�|x1,x2,...,xn
))
∂θi
=−w
n∑
j=1
ljaij +
n∑
j=1
xjaij∑m
i=1 aijθi
Due to the convexity of above optimization problem, the gradient descending
method can be used to find the solution (Jiang and Wong, 2009).
2.4 The N-URD models
Upon the above URD model, we propose N-URD models by substituting the
indicator matrix (aij) with a weighted indicator matrix (bij) whose elements
are non-negative real numbers calculated from the bias curves. We can
rewrite the log-likelihood function as
log
(
L
(
�|x1,x2,...,xn
))=−w n∑
j=1
m∑
i=1
ljbijθi
+
n∑
j=1
xj log
(
ljw
m∑
i=1
bijθi
)
−
n∑
j=1
log
(
xj!
)
.
Rather than the 0–1 indicator matrix (aij), the weighted indicator matrix (bij)
not only represents the gene structure information, but also gives weights to
the non-zero elements according to the bias tendency of corresponding exons.
It is notable that changing the indicator matrix to the weighted indicator
matrix will not change the convexity of the optimization problem. Different
choice of the weighted matrix will result in different N-URD models.
Given a bias curve, say, the GBC, we assign weights to each exon in the
studied isoform following the strategy in Figure 3. We repeat this procedure
for every isoform in the gene, and obtain the weighted indicator matrix (Gij),
whose non-zero element represents the relative weight of j-th exon in the
expression of i-th isoform. Here, (Gij) indicates the weighted matrix is got
from the GBC. We call the N-URD model with (bij)= (Gij) as the GN-URD
model.
Similarly, we can get the weighted indicator matrix (Lij) starting with a
LBC. Letting (bij)= (Lij) results in the LN-URD model. To make use of the
combination of the GBC and LBC, we can let (bij)= (Gij)α+(Lij)(1−α)
and get the MN-URD model. In this study, we chose α=0.5 to illustrate
the property of the MN-URD model. Besides these three models, we further
study two additional models which use iterative calculation of the LBC in the
MN-URD model. We denote them by 1-M and 5-M which mean the number
of iterations is 1 and 5, respectively. The iterative procedure is:
(1) Set (θˆ1,...,θˆm)= (1,1,...,1) and STEP = 0;
(2) Calculate the LBC using xj/
(
lj
∑m
i=1 θˆiaij
)
, and then get
(
bij
)=(
Gij
)
α+(Lij)(1−α);
(3) Conduct expression inference based on (bij) using MN-URD, and get
the estimated expression values (θˆ1,...,θˆm);
(4) STEP = STEP + 1, if STEP exceed the maximum number, exit;
otherwise go to Step 2.
2.5 Simulation design
As there is no benchmark data with known isoform expressions, we designed
a series of simulation experiments to study the performance of the proposed
504
at Shandong U
niversity on June 8, 2013
http://bioinform
atics.oxfordjournals.org/
D
ow
nloaded from
[09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 505 502–508
Non-uniform read distribution models in RNA-Seq
method. In each experiment setting (with fixed numbers of isoforms and
exons), a number of simulated genes were generated in a random manner.
The indicator matrices that specify gene structure were created first. We set
the proportion of 1’s in the elements of indicator matrices to be consistent
with the RefSeq gene annotation (∼80%). The lengths of exons of the
simulated genes were randomly sampled from the exon lengths in the RefSeq
gene annotation. In order to simulate transcriptome data with similar read
distribution properties as real data, for each simulated isoform, we randomly
sampled the expression level (RPKM) and the corresponding LBC from the
RPKMs and their corresponding LBCs of single-isoform genes in real RNA-
Seq data. For each isoform, we assigned weights of exons according to the
corresponding LBC (similar to Fig. 3c and d). The reads count for each exon
of the gene is determined through sampling from Poisson models.
We studied genes with numbers of isoforms varying from 2 to 5 and
numbers of exons from 6 to 13. For each setting, we generated 1000
different gene structures randomly. We applied the original URD model
and the proposed N-URD models to the simulated data and compared their
performances.
2.6 RNA-Seq datasets
The proposed N-URD model was applied to two transcriptome RNA-Seq
datasets. We also used these data to provide the information (single-isoform
gene expression levels and LBCs) for generating our simulation data. The
dataset by Marioni et al. (2008) is composed of about 120 millions reads from
the human liver and kidney tissues. The dataset by Pan et al. (2008) consists
of transcriptome data from diverse normal human tissues each of which have
17–32 millions reads. We downloaded both of them from the Sequence Read
Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra). The read lengths of
the two datasets are 36 and 32 bp, respectively. For brevity, we refer to these
two datasets as Marioni data and Pan data, respectively. We used TopHat
(Trapnell et al., 2009) to map reads to the human genome assembly UCSC
hg18 (NCBI build36) with up to two mismatches allowed. Reads mapped
to multiple genomic loci were excluded. We observed that RNA-Seq data
of different tissues from the same laboratory share large similarity on the
global pattern of read distribution, so we only took the data of one tissue in
each dataset as an example for the presentation and demonstration.
3 RESULTS
3.1 Bias curve
本文档为【Bioinformatics-2011-Wu-502-8】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。