首页 Bioinformatics-2011-Wu-502-8

Bioinformatics-2011-Wu-502-8

举报
开通vip

Bioinformatics-2011-Wu-502-8 [09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 502 502–508 BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 4 2011, pages 502–508doi:10.1093/bioinformatics/btq696 Gene expression Advance Access publication December 17, 2010 Using non-uniform read distribution m...

Bioinformatics-2011-Wu-502-8
[09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 502 502–508 BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 4 2011, pages 502–508doi:10.1093/bioinformatics/btq696 Gene expression Advance Access publication December 17, 2010 Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq Zhengpeng Wu†, Xi Wang† and Xuegong Zhang∗ MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China Associate Editor: Ivo Hofacker ABSTRACT Motivation: RNA-Seq technology based on next-generation sequencing provides the unprecedented ability of studying transcriptomes at high resolution and accuracy, and the potential of measuring expression of multiple isoforms from the same gene at high precision. Solved by maximum likelihood estimation, isoform expression can be inferred in RNA-Seq using statistical models based on the assumption that sequenced reads are distributed uniformly along transcripts. Modification of the model is needed when considering situations where RNA-Seq data do not follow uniform distribution. Results: We proposed two curves, the global bias curve (GBC) and the local bias curves (LBCs), to describe the non-uniformity of read distributions for all genes in a transcriptome and for each gene, respectively. Incorporating the bias curves into the uniform read distribution (URD) model, we introduced non-URD (N-URD) models to infer isoform expression levels. On a series of systematic simulation studies, the proposed models outperform the original model in recovering major isoforms and the expression ratio of alternative isoforms. We also applied the new model to real RNA- Seq datasets and found that its inferences on expression ratios of alternative isoforms are more reasonable. The experiments indicate that incorporating N-URD information can improve the accuracy in modeling and inferring isoform expression in RNA-Seq. Contact: zhangxg@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. Received on April 6, 2010; revised on November 25, 2010; accepted on December 11, 2010 1 INTRODUCTION Applying next-generation high-throughput sequencing technologies to transcriptomes (RNA-Seq) makes it possible to precisely measure the abundances of transcripts. This technology has been widely investigated on the study of gene expressions (Marioni et al., 2008; Nagalakshmi et al., 2008), splice variants (Pan et al., 2008; Wang et al., 2008; Wilhelm et al., 2008), novel transcripts discovery (Mortazavi et al., 2008), RNA sequence polymorphism (Cloonan et al., 2008) and chimeric transcripts (Zhang et al., 2010) in recent ∗To whom correspondence should be addressed. †The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. literature. The ability of RNA-Seq to count sequenced reads along transcripts gives digital measurement of transcription levels, and is powerful for identifying and measuring the amount of transcribed isoforms of alternative splicing genes (Wang et al., 2008). Changes of isoforms’ expression levels in many genes are of functional importance in particular biological processes. For example, switching between major and minor isoforms of genes plays a crucial role in mouse muscle myogenesis (Trapnell et al., 2010); isoform expression changes have been shown to be highly related to the development of many complex diseases, such as Lewy bodies (Beyer et al., 2008; Humbert et al., 2007) and progressive supranuclear palsy (Chambers et al., 1999); studies also obtained the evidence that the expression ratio of two alternative isoforms from human progesterone receptor (PR) gene could result in different outcomes of breast cancer treatment (Cork et al., 2008). Therefore, correctly estimating isoform expression levels is becoming increasingly important for understanding complicated biological mechanisms. In RNA-Seq, under the assumption that the sequenced reads are sampled independently and uniformly on all transcripts in the sample, it is straightforward to model the distribution of read counts of exons as a Poisson distribution (Mortazavi et al., 2008). Using a uniform read distribution model (URD model for short), Jiang and Wong (2009) proposed a maximum likelihood estimate (MLE) method to infer isoform expression levels. However, some studies show that there are substantial biases in high-throughput sequencing data (Dohm et al., 2008), which may cause non-uniformity of read distributions in RNA-Seq data (Mortazavi et al., 2008). Recent literatures also pointed out that hexamer priming and local sequence content can affect the sequencing preference (Hansen et al., 2010; Li,J. et al., 2010). In our investigations, we observed the distributions of RNA-Seq reads in many datasets are not uniform. Some datasets have a strong bias toward the 3′ ends of transcripts, which means that the 3′-ends usually have more reads sequenced. This phenomenon may be caused by some pre-sequencing procedures such as the poly(A)-selection step in sample preparation (see Section 4 for details). In such situations, the accuracy of isoform expression inference based on the uniformity assumption will deteriorate. Thus, the motivation of this study is to develop a method that takes the non-uniformity of read distribution into consideration to improve isoform expression inference. Recently, several other groups also realized the importance of this problem and incorporated non-URD (N-URD) information into uniform distribution models in different ways (Howard and Heber, 2010; Li,B. et al., 2010). In our study, we 502 © The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com at Shandong U niversity on June 8, 2013 http://bioinform atics.oxfordjournals.org/ D ow nloaded from [09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 503 502–508 Non-uniform read distribution models in RNA-Seq introduced two types of curves to depict the tendency of sequencing biases, namely the global bias curve (GBC) and the local bias curves (LBCs). The GBC applies to all genes in a dataset and the LBCs are specific to particular genes. The curves are estimated from the read distribution across the measured transcripts. We incorporated the GBC and LBCs with the URD model and constructed N-URD models for isoform expression inference. A series of simulation experiments were conducted to study the performance of the proposed method and to compare with state-of-the-art methods. Results show that a significant gain can be achieved in the accuracy of isoform expression inference. We applied the N-URD models on several real RNA-Seq datasets, which also illustrated the advantage of the proposed method. 2 METHODS 2.1 Notations Assume gene G has n exons of lengths (l1,...,ln ). It has m isoforms with expression levels (θ1,...,θm ) in an experiment. From RNA-Seq data, we have a set of reads of counts (x1,...,xn ) on this gene, where xj is the number of reads mapped onto the j-th exon of gene G. The task is to infer the expression levels (θ1,...,θm ) from observed read counts and the known gene structure (the exon composition of each isoform). We use an indicator matrix (aij)m×n to represent the gene structure, where aij =1 or aij =0 indicates that the j-th exon is included or excluded in the i-th isoform, respectively. If an exon appears in two (or more) isoforms as different lengths (e.g. the case of alternative 5′ end or 3′ end exons), we split the exon into several parts and treat each part as an exon. Similarly, when we consider reads mapped to exon junctions (junction reads), we can use the flanking exon segments of a junction to form a ‘pseudo-exon’ and count it as an extra exon in the inference. In the experiments, we did on data of ultra-short reads, we did not consider junction reads for simplicity as they compose only a very small part in the data. 2.2 Bias curves We propose two types of bias curves to characterize the distribution of reads along a transcript. The GBC represents the general tendency of read distribution for the whole transcriptome, and the LBCs depict gene-specific read distributions. The curves reflect the relative read distribution bias from the 5′ end to the 3′ end of a gene. We use genes with only single isoforms to estimate the GBC. This is because the read distribution of the gene with multiple isoforms is affected by the gene structure and may not reflect the general tendency of read distribution along the gene. We extracted all single-isoform genes according to the NCBI Reference Sequence (RefSeq) gene annotation (downloaded from http://genome.ucsc.edu). Because of the uncertainty in lowly expressed genes, we filter out genes with too few (<100) reads. In order to avoid the impact of local fluctuation of read distribution, we divide each gene into a small number (e.g. 10) of bins and get a bar chart or histogram of the read counts across the bins. The histograms are then normalized to mean 1, and then averaged among the number of studied single-isoform genes. This gives us the GBC, which reflects general sequencing efficiency at different relative locations of a gene in the dataset. Usually, we illustrate the GBC in the form of curves instead of bar charts. The procedure for calculating the GBC is illustrated in Figure 1. The GBC depicts the global read distribution bias which may be caused by the experimental protocol. It is consistent between technical replicates in the data we studied (Section 3). It is also observed that read distribution of a specific gene can have its own pattern that deviates from uniform distribution and also from the GBC. This kind of bias may be due to local sequence features such as GC-content variation along transcripts and simple sequence repeats in the genome, which may affect the hexamer priming efficiency (Hansen et al., 2010) and read Fig. 1. The flowchart of the calculation of the GBC. The main steps include: (1) collect single-isoform genes according to known gene annotation; (2) filter out genes which have too few mapped reads (read-count <100); (3) calculate the 10-bin histograms of read positions for each gene and normalize the mean of each curve to be 1; (4) finally, the average of these bias curves is the GBC. mappability (Rozowsky et al., 2009). We introduced the LBCs to describe gene-specific non-uniform distributions. Read distribution of a gene with multiple isoforms is the mixture of read distributions of all its expressed isoforms. So a LBC could only be an approximate description of trend of read distribution along the gene region. We define for each gene a step function which is constant on each exon. Suppose the expression level of the i-th isoform is θi, we assign the value of the step function on the j-th exon as xj/ (lj ∑m i=1θiaij ), j=1,2,...,n, which describes the read count normalized by the exon length and isoform occurrences weighted by their expression levels. The reads on an exon also may not distribute uniformly. This can be considered by splitting an exon into multiple ‘sub-exons’ when there are sufficient reads for the exon. In our experiments, we take an exon as the basic unit for estimating the LBCs. The LBC of a gene is got by normalizing the step function to be of mean 1. At the beginning of expression inference, the isoform expression levels are unknown, so we assume expression levels of all isoforms of a gene are identical (1,1,...,1)1×m in the calculation of the LBC. After inferring isoform expression with this initial LBC, we can further update the LBC using the inferred isoform expression levels θi, and sequentially iterate this procedure. The details of this iteration procedure will be illustrated in details in Section 2.4. Figure 2 gives the conceptual flowchart for calculating the LBCs. We use the bias curves to compensate for read-count variations caused by such biases in inferring isoform expression. For this purpose, we partition the bias curves into several segments corresponding to exons of the gene. The mean value of the curves on each segment is calculated as the weighting factor for the corresponding exon. Figure 3 illustrates this procedure. In the following sections, we will show the strategy to incorporate such weights into the statistical framework of isoform expression inference. 2.3 The URD model In the URD model used by Jiang and Wong (2009), each observation xj is assumed to be a random variable following a Poisson distribution with parameter λj . For example, λj for the j-th exon is λj = ljw∑mi=1 aijθi, where w is the total number of mapped reads in the RNA-Seq data and aij is the 503 at Shandong U niversity on June 8, 2013 http://bioinform atics.oxfordjournals.org/ D ow nloaded from [09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 504 502–508 Z.Wu et al. Fig. 2. The flowchart of the calculation of the LBC. Given the gene structure and RNA-Seq data, assuming the isoform expression levels are known, LBC describes the exon read counts normalized by total weighted-length of the exons appearing in all isoforms. Fig. 3. Illustration of the usage of bias curve. (a) A gene is divided into 10 bins from its 5′ to 3′ end, and the read counts on each bin are summarized in a histogram. (b) The histogram is normalized to have mean value of 1 and is represented in curve form. (c) To utilize the bias curve in isoform expression inference, we first partition the curve into a number of parts, which equals to the number of exons in the studied isoform. The three parts (separated by red dotted lines) shown here is for a 3-exon isoform with length ratio 25:40:35. (d) The average value of the curve in each partition is calculated (red dashed lines). The values are used as weights for the corresponding exons in expression inference. elements of the gene structure indictor matrix. The corresponding likelihood function is defined as L(�|xj)= e−λj λxjj xj! . Further assuming the independence of xj’s, for a gene G, the joint log- likelihood function can be written as log ( L ( �|x1,x2,...,xn ))= n∑ j=1 log ( e−λj λxjj xj! ) . Inserting the λj = ljw∑mi=1 aijθi for each exon, we have log ( L ( �|x1,x2,...,xn ))=−w n∑ j=1 m∑ i=1 ljaijθi + n∑ j=1 xj log ( ljw m∑ i=1 aijθi ) − n∑ j=1 log ( xj! ) . Based on above log-likelihood function, maximum likelihood estimation can find the optimal parameters which give the inference of isoform expression levels. Further, taking derivatives for each θi, we get ∂ log ( L ( �|x1,x2,...,xn )) ∂θi =−w n∑ j=1 ljaij + n∑ j=1 xjaij∑m i=1 aijθi Due to the convexity of above optimization problem, the gradient descending method can be used to find the solution (Jiang and Wong, 2009). 2.4 The N-URD models Upon the above URD model, we propose N-URD models by substituting the indicator matrix (aij) with a weighted indicator matrix (bij) whose elements are non-negative real numbers calculated from the bias curves. We can rewrite the log-likelihood function as log ( L ( �|x1,x2,...,xn ))=−w n∑ j=1 m∑ i=1 ljbijθi + n∑ j=1 xj log ( ljw m∑ i=1 bijθi ) − n∑ j=1 log ( xj! ) . Rather than the 0–1 indicator matrix (aij), the weighted indicator matrix (bij) not only represents the gene structure information, but also gives weights to the non-zero elements according to the bias tendency of corresponding exons. It is notable that changing the indicator matrix to the weighted indicator matrix will not change the convexity of the optimization problem. Different choice of the weighted matrix will result in different N-URD models. Given a bias curve, say, the GBC, we assign weights to each exon in the studied isoform following the strategy in Figure 3. We repeat this procedure for every isoform in the gene, and obtain the weighted indicator matrix (Gij), whose non-zero element represents the relative weight of j-th exon in the expression of i-th isoform. Here, (Gij) indicates the weighted matrix is got from the GBC. We call the N-URD model with (bij)= (Gij) as the GN-URD model. Similarly, we can get the weighted indicator matrix (Lij) starting with a LBC. Letting (bij)= (Lij) results in the LN-URD model. To make use of the combination of the GBC and LBC, we can let (bij)= (Gij)α+(Lij)(1−α) and get the MN-URD model. In this study, we chose α=0.5 to illustrate the property of the MN-URD model. Besides these three models, we further study two additional models which use iterative calculation of the LBC in the MN-URD model. We denote them by 1-M and 5-M which mean the number of iterations is 1 and 5, respectively. The iterative procedure is: (1) Set (θˆ1,...,θˆm)= (1,1,...,1) and STEP = 0; (2) Calculate the LBC using xj/ ( lj ∑m i=1 θˆiaij ) , and then get ( bij )=( Gij ) α+(Lij)(1−α); (3) Conduct expression inference based on (bij) using MN-URD, and get the estimated expression values (θˆ1,...,θˆm); (4) STEP = STEP + 1, if STEP exceed the maximum number, exit; otherwise go to Step 2. 2.5 Simulation design As there is no benchmark data with known isoform expressions, we designed a series of simulation experiments to study the performance of the proposed 504 at Shandong U niversity on June 8, 2013 http://bioinform atics.oxfordjournals.org/ D ow nloaded from [09:59 2/2/2011 Bioinformatics-btq696.tex] Page: 505 502–508 Non-uniform read distribution models in RNA-Seq method. In each experiment setting (with fixed numbers of isoforms and exons), a number of simulated genes were generated in a random manner. The indicator matrices that specify gene structure were created first. We set the proportion of 1’s in the elements of indicator matrices to be consistent with the RefSeq gene annotation (∼80%). The lengths of exons of the simulated genes were randomly sampled from the exon lengths in the RefSeq gene annotation. In order to simulate transcriptome data with similar read distribution properties as real data, for each simulated isoform, we randomly sampled the expression level (RPKM) and the corresponding LBC from the RPKMs and their corresponding LBCs of single-isoform genes in real RNA- Seq data. For each isoform, we assigned weights of exons according to the corresponding LBC (similar to Fig. 3c and d). The reads count for each exon of the gene is determined through sampling from Poisson models. We studied genes with numbers of isoforms varying from 2 to 5 and numbers of exons from 6 to 13. For each setting, we generated 1000 different gene structures randomly. We applied the original URD model and the proposed N-URD models to the simulated data and compared their performances. 2.6 RNA-Seq datasets The proposed N-URD model was applied to two transcriptome RNA-Seq datasets. We also used these data to provide the information (single-isoform gene expression levels and LBCs) for generating our simulation data. The dataset by Marioni et al. (2008) is composed of about 120 millions reads from the human liver and kidney tissues. The dataset by Pan et al. (2008) consists of transcriptome data from diverse normal human tissues each of which have 17–32 millions reads. We downloaded both of them from the Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra). The read lengths of the two datasets are 36 and 32 bp, respectively. For brevity, we refer to these two datasets as Marioni data and Pan data, respectively. We used TopHat (Trapnell et al., 2009) to map reads to the human genome assembly UCSC hg18 (NCBI build36) with up to two mismatches allowed. Reads mapped to multiple genomic loci were excluded. We observed that RNA-Seq data of different tissues from the same laboratory share large similarity on the global pattern of read distribution, so we only took the data of one tissue in each dataset as an example for the presentation and demonstration. 3 RESULTS 3.1 Bias curve
本文档为【Bioinformatics-2011-Wu-502-8】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑, 图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。
下载需要: 免费 已有0 人下载
最新资料
资料动态
专题动态
is_183238
暂无简介~
格式:pdf
大小:598KB
软件:PDF阅读器
页数:0
分类:
上传时间:2013-06-09
浏览量:4