Transcriptom-Assemble

Identifying the full set of transcripts (including large and small RNAs, novel transcripts from unannotated genes, splicing isoforms and gene-fusion transcripts) serves as the foundation for a comprehensive study of the transcriptome. For a long time, our knowledge of the transcriptome was largely derived from gene predictions and limited EST evidence and has therefore been partial and biased. Recently, however, whole-transcriptome sequencing using next-generation sequencing (NGS) technologies, or RNA sequencing (RNA-seq), has started to reveal the complex landscape and dynamics of the transcriptome from yeast to human at an unprecedented level of sensitivity and accuracy [1–4]. Compared with traditional low-throughput EST sequencing by Sanger technology, which only detects the more abundant transcripts, the enormous sequencing depth (100–1,000 reads per base pair of a transcript) of a typical RNA-seq experiment offers a near-complete snapshot of a transcriptome, including the rare transcripts that have regulatory roles. In contrast to alternative high-throughput technologies, such as microarrays, RNA-seq achieves base-pair-level resolution and a much higher dynamic range of expression levels, and it is also capable of de novo annotation [1,2].

Despite these advantages, sequence reads obtained from the common NGS platforms, including Illumina, SOLiD and 454, are often very short (35–500 bp) [5]. As a result, it is necessary to reconstruct the full-length transcripts by transcriptome assembly, except in the case of small classes of RNA, such as microRNAs, PIWI-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs) and small interfering RNAs (siRNAs), which are shorter than the sequencing length and do not require assembly.

Reconstructing a comprehensive transcriptome from short reads has many informatics challenges. Similar to short-read genome assembly, transcriptome assembly involves piecing together short, low-quality reads.
Typical NGS data sets are very large (several gigabases to terabases), which requires computing systems to have large memories and/or many cores to run parallel algorithms. Several short-read assemblers have been developed to tackle these challenges [6–9], including Velvet [6], ABySS [7] and ALLPATHS [8]. Although these tools have achieved reasonable success in the assembly of genomes [9,10], they cannot directly be applied to transcriptome assembly, mainly because of three considerations. First, whereas DNA sequencing depth is expected to be the same across a genome, the sequencing depth of transcripts can vary by several orders of magnitude. Many short-read genome assemblers use sequencing depth to distinguish repetitive regions of the genome, a feature that would mark abundant transcripts as repetitive. Sequencing depth is also used by assemblers to calculate an optimal set of parameters for genome assembly, which would probably result in only a small set of transcripts being favoured in the transcriptome assembly. Second, unlike genomic sequencing, in which both strands are sequenced, RNA-seq experiments can be strand-specific. Transcriptome assemblers will need to take advantage of strand information to resolve overlapping sense and antisense transcripts [11–14]. Finally, transcriptome assembly is challenging, because transcript variants from the same gene can share exons and are difficult to resolve unambiguously. Given the complexity of most transcriptomes and the above challenges, exclusively reconstructing all of the transcripts and their variants from short reads has been difficult.

Next-generation transcriptome assembly
Jeffrey A. Martin and Zhong Wang
Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, 2800 Mitchell Drive, MS100 Walnut Creek, California 94598, USA. e-mails: jamartin@lbl.gov; zhongwang@lbl.gov
doi:10.1038/nrg3068. Published online 7 September 2011.
STUDY DESIGNS. REVIEWS. NATURE REVIEWS | GENETICS, VOLUME 12 | OCTOBER 2011 | 671. © 2011 Macmillan Publishers Limited. All rights reserved.

Abstract | Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches (reference-based, de novo and combined strategies) along with some perspectives on transcriptome assembly in the near future.

RNA sequencing (RNA-seq). An experimental protocol that uses next-generation sequencing technologies to sequence the RNA molecules within a biological sample in an effort to determine the primary sequence and relative abundance of each RNA.
Sequencing depth. The average number of reads representing a given nucleotide in the reconstructed sequence. A 10× sequencing depth means that each nucleotide of the transcript was sequenced, on average, ten times.
Paired-end protocol. A library construction and sequencing strategy in which both ends of a DNA fragment are sequenced to produce pairs of reads (mate pairs).
Contigs. An abbreviation for contiguous sequences that is used to indicate a contiguous piece of DNA assembled from shorter overlapping sequence reads.

In the past 3 years, improvements in data quality and the rapid evolution of assembly algorithms have made it possible to address the above challenges.
In this Review, we show how these exciting advances have led to a wealth of assembled transcriptomes from short reads [15–25], and we provide practical guidelines for implementing a transcriptome assembly experiment. After describing the experimental and informatics considerations that need to be made before assembly, we discuss three assembly strategies: assembly based on a reference genome, de novo assembly and a combined approach that merges the two strategies. We focus on the strengths and weaknesses of the three strategies in the context of both gene-dense transcriptomes and large transcriptomes with pervasive alternative splicing. Finally, we give some perspectives on the future of transcriptome assembly in light of the rapid evolution of sequencing technologies and high-performance computing.

Considerations prior to assembly

To ensure a high-quality transcriptome assembly, particular care should be taken in designing the RNA-seq experiment. The steps of a typical transcriptome assembly experiment are shown in FIG. 1. In the data generation phase (FIG. 1a), total RNAs or mRNAs are fragmented and converted into a library of cDNAs containing sequencing adaptors. The cDNA library is then sequenced by next-generation sequencers to produce millions to billions of short reads from one end or both ends of the cDNA fragments. In the data analysis phase (FIG. 1b), these short reads are pre-processed to remove sequencing errors and other artefacts. The reads are subsequently assembled to reconstruct the original RNAs and to assess their abundance (‘expression counting’). Expression counting is not trivial for transcriptomes with extensive alternative splicing [26]: transcripts often share some exons, causing uncertainty as to which transcript each read belongs to. The accuracy and precision of gene expression counting are influenced by cDNA library construction methods, sequencing technologies and data pre-treatment techniques [27].
Similarly, these factors can influence the quality of assembled transcriptomes, as discussed below.

Library construction. To increase the number of assembled transcripts, especially the less abundant ones, ribosomal RNA (rRNA) and abundant transcripts are removed during the first steps of library construction. Poly(A) selection is very effective at enriching mRNAs in eukaryotes, but this selection approach will miss non-coding RNAs (ncRNAs) and mRNAs that lack a poly(A) tail. In order to retain RNAs without a poly(A) tail in the assembled transcriptome, rRNA contamination can instead be removed by hybridization-based depletion methods [28,29]. These methods increase the opportunity for the detection and assembly of rare transcripts by reducing the representation of rRNAs and other highly abundant transcripts [30], which often constitute most of the reads in RNA-seq data sets. Note that these depletion methods may bias the quantification of highly abundant transcripts, and so if quantification is a goal of the study, then the sequencing of non-depleted libraries will be required.

Another consideration to make during library construction, provided the starting RNA quantities are not limiting, is whether to eliminate the PCR amplification step from the protocol. PCR amplification results in a low sequencing coverage for transcripts or regions within a transcript that have a high GC content [31]. This can, in turn, cause gaps in the assembled transcripts and can cause other transcripts to be missing from the assembly altogether. Amplification-free protocols have been developed to overcome this problem [31,32]. The latest single-molecule sequencing technologies from Helicos and Pacific Biosciences do not require PCR amplification before sequencing [33]. The Helicos system can even directly sequence RNAs without cDNA library construction [1,34], which should greatly reduce biases in sequencing coverage.
However, these single-molecule technologies suffer from high error rates. Overall, sequencing coverage of the transcriptome from amplification-free protocols is more even and contiguous across transcripts, making it easier for assemblers to construct full-length transcripts across GC-rich regions of the transcriptome.

Last, the use of strand-specific RNA-seq protocols [27] aids in the assembly and quantification of overlapping transcripts that are derived from opposite strands of the genome. This consideration is especially important for gene-dense genomes, such as those of bacteria, archaea and lower eukaryotes, but it is also important for detecting antisense transcription, which is common in higher eukaryotes.

Sequencing. The major factors to consider before sequencing a sample are: the choice of sequencing platform, the sequencing read length and whether to use a paired-end protocol. All of the current NGS technologies have successfully been used to assemble transcriptomes [35–37], and they differ mostly in their throughput and cost. The choice of sequencing technology largely depends on the technology to which a user has access and the budget constraints for sequencing. In general, the assembly of large and complex transcriptomes (plants and mammals) requires extensive sequencing and is frequently done on Illumina or SOLiD platforms. The 454 technology offers longer reads, and it can be used alone for small transcriptomes. Illumina, SOLiD and 454 technologies can also be combined in a ‘hybrid assembly’ strategy: short reads that are sequenced at a greater depth are assembled into contigs, and long reads are subsequently used to scaffold the contigs and resolve variants [38,39].

For transcriptome assembly, longer reads are generally preferred, as they greatly reduce the complexity of the assembly.
It is worth noting, however, that the problem posed by short reads can be alleviated by using a paired-end protocol, in which 75–150 bp are sequenced from both ends of short DNA fragments (100–250 bp), and the overlapping reads are computationally joined together to form a longer read [40]. Paired reads from long inserts (500–1,000 bp) offer long-range exon connectivity, in a similar way to reads obtained using 454 technology. For this reason, some assemblers, such as ALLPATHS, require at least two libraries with different insert sizes [8]. The combination of short and long insert libraries should be helpful in capturing transcripts of various sizes while also helping to resolve alternatively spliced isoforms.

[Figure 1 panel labels. a, Data generation: mRNA or total RNA; remove contaminant DNA; fragment RNA; reverse transcribe into cDNA; ligate sequence adaptors; select a range of sizes; sequence cDNA ends. Choices shown: remove rRNA? select mRNA? strand-specific RNA-seq? PCR amplification? b, Data analysis: raw reads; remove artefacts; correct errors (optional); assemble into transcripts; post-process transcripts; align reads to transcripts to quantify expression.]

Figure 1 | The data generation and analysis steps of a typical RNA-seq experiment. a | Data generation. To generate an RNA sequencing (RNA-seq) data set, RNA (light blue) is first extracted (stage 1), DNA contamination is removed using DNase (stage 2), and the remaining RNA is broken up into short fragments (stage 3). The RNA fragments are then reverse transcribed into cDNA (yellow, stage 4), sequencing adaptors (blue) are ligated (stage 5), and fragment size selection is undertaken (stage 6). Finally, the ends of the cDNAs are sequenced using next-generation sequencing technologies to produce many short reads (red, stage 7). If both ends of the cDNAs are sequenced, then paired-end reads are generated, as shown here by dashed lines between the pairs. b | Data analysis. After sequencing, reads are pre-processed by removing low-quality reads and artefacts, such as adaptor sequences (blue), contaminant DNA (green) and PCR duplicates (stages 1 and 2). Next, the sequence errors (red crosses) are optionally removed (stage 3) to improve the read quality (see main text for details). The pre-processed reads are then assembled into transcripts (orange, stage 4) and polished by post-assembly processes to remove assembly errors (blue crosses). The transcripts are then post-processed (stage 5), and the expression level of each transcript is then estimated by counting the number of reads that align to each transcript (stage 6). rRNA, ribosomal RNA.

Low-complexity reads. Short DNA sequences composed of stretches of homopolymer nucleotides or simple sequence repeats.
Quality scores. An integer representing the probability that a given base in a nucleic acid sequence is correct.
k-mer frequency. The number of times that each k-mer (that is, a short oligonucleotide of length k) appears in a set of DNA sequences.
Splice-aware aligner. A program that is designed to align cDNA reads to a genome.
Traversing. A method for systematically visiting all nodes in a mathematical graph.
Seed-and-extend aligners. An alignment strategy that first builds a hash table containing the location of each k-mer (seed) within the reference genome. These algorithms then extend these seeds in both directions to find the best alignment (or alignments) for each read.
Burrows–Wheeler transform (BWT). This reorders the characters within a sequence, which allows for better data compression. Many short-read aligners implement this transform in order to use less memory when aligning reads to a genome.
Parallel computing. A computer programming model for distributing data processing across multiple processors, so that multiple tasks can be carried out simultaneously.

Data pre-processing. Removing artefacts from RNA-seq data sets before assembly improves the read quality, which, in turn, improves the accuracy and computational efficiency of the assembly. This step is straightforward and can be executed using several tools [41–44]. In general, three types of artefacts should be removed from raw RNA-seq data: sequencing adaptors [43,44], which originate from failed or short DNA insertions during library preparation; low-complexity reads [43]; and near-identical reads that are derived from PCR amplification [15]. Adaptor and low-complexity sequences can lead to misassemblies. PCR duplicates are more common in long-insert libraries, and their presence can skew mate-pair statistics that are used by many assemblers for scaffolding. When their identities are known, rRNA and other RNA contaminants should also be removed to improve assembly speed.

Sequencing errors in NGS reads can be removed or corrected by analysing the quality score and/or the k-mer frequency. For most NGS data sets, low quality scores indicate possible sequencing errors. Sequencing errors can also be empirically inferred by looking at the frequencies of each k-mer in the data set. As the same RNA molecule is sequenced many times, k-mers without errors in them will occur multiple times. By contrast, k-mers that occur in the data set at very low frequencies are probably sequencing errors or are from transcripts with a low abundance. Reads containing these errors can be removed, trimmed or corrected to improve the assembly quality and to decrease the amount of random access memory (RAM) required [10,15,42].
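The k-mer frequency idea described above can be illustrated with a small sketch. It is not taken from any of the cited tools: the function names (`count_kmers`, `flag_suspect_reads`), the k-mer length and the `min_count` threshold are all assumptions chosen for a toy data set, and a real pipeline would also need to handle reverse complements and much larger inputs.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def flag_suspect_reads(reads, k=5, min_count=2):
    """Return reads that contain at least one k-mer seen fewer than min_count times.

    Hypothetical helper: such reads may carry a sequencing error, or may
    simply come from a low-abundance transcript.
    """
    counts = count_kmers(reads, k)
    suspects = []
    for read in reads:
        kmers = (read[i:i + k] for i in range(len(read) - k + 1))
        if any(counts[kmer] < min_count for kmer in kmers):
            suspects.append(read)
    return suspects


# Toy data: the same fragment sequenced four times, once with an error.
reads = [
    "ACGTACGGTCA",
    "ACGTACGGTCA",
    "ACGTACGGTCA",
    "ACGTACCGTCA",  # one substitution (G -> C) relative to the others
]
suspects = flag_suspect_reads(reads, k=5, min_count=2)
print(suspects)  # ['ACGTACCGTCA']
```

In this toy case the error-free k-mers each occur at least three times, whereas the k-mers overlapping the substituted base occur only once, so only the fourth read is flagged. Whether a flagged read is then removed, trimmed or corrected is a separate choice that this sketch does not make.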
However, k-mer-based error removal carries a side effect, in that reads derived from rare transcripts may also be removed. This should not be a large problem, as the shallow sequencing depth for these transcripts would not be sufficient to assemble them, even if these reads were retained.

Transcriptome assembly strategies

Depending on whether a reference genome assembly is available, current transcriptome assembly strategies generally fall into one of three categories: a reference-based strategy, a de novo strategy or a combined strategy that merges the two (FIGS 2–4). In the following sections, we discuss each of these three strategies in detail, including their pros and cons for the assembly of simple and complex transcriptomes.

Reference-based strategy

When a reference genome for the target transcriptome is available, the transcriptome assembly can be built upon it. In general, this strategy, which is known as ‘reference-based’ or ‘ab initio’ assembly, involves three steps. First, RNA-seq reads are aligned to a reference genome using a splice-aware aligner, such as BLAT [45], TopHat [46], SpliceMap [47], MapSplice [48] or GSNAP [49] (TABLE 1; FIG. 2a). Second, overlapping reads from each locus are clustered to build a graph representing all possible isoforms (FIG. 2b). The final step involves traversing the graph to resolve individual isoforms (FIG. 2c,d). Examples of methods that use the reference-based strategy include Cufflinks [20], Scripture [16] and others [17,50] (TABLE 2).

Splice-aware aligners generally fall into two classes: seed-and-extend aligners and Burrows–Wheeler transform (BWT) aligners, each of which has clear trade-offs. The seed-and-extend algorithms, such as BLAT and GSNAP, start by quickly finding a ‘seed’ (a substring of the read) that exactly matches the genome and then locally extending the match using Smith–Waterman alignment algorithms.
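The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration and not how BLAT or GSNAP are implemented: the extension step here is an ungapped mismatch count rather than Smith–Waterman alignment, there is no handling of introns, and the names `build_seed_index` and `seed_and_extend`, the seed length and the mismatch limit are all assumptions.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash table mapping each k-mer to the positions where it occurs in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=1):
    """Return (reference_start, mismatches) for the best ungapped placement, or None.

    Each k-mer of the read is used as a seed; every exact seed hit proposes a
    placement of the whole read, which is then checked base by base.
    """
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the read would begin on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "TTGACCGATAGCTAGGCTAACGT"
index = build_seed_index(reference, k=4)

# This read matches reference positions 6-15 except for one substituted base.
hit = seed_and_extend("GATAGCTTGG", reference, index, k=4)
print(hit)  # (6, 1)
```

The hash-table lookup is what makes seeding fast; the cost is paid in the extension step, which in real aligners is a local alignment that tolerates gaps as well as mismatches.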
BWT aligners are optimized to align reads with few errors in them and are therefore generally faster than seed-and-extend aligners. Each aligner differs in its implementation for aligning reads across introns. In general, seed-and-extend aligners shift the gaps in the local alignment to match known splice sites, whereas BWT aligners, such as TopHat, create a database of all possible combinations of splicing junctions within a locus and then align to this database the reads that failed to align to the genome.

After the reads are aligned to the genome, two methods are typically used