Identifying the full set of transcripts — including large
and small RNAs, novel transcripts from unannotated
genes, splicing isoforms and gene-fusion transcripts —
serves as the foundation for a comprehensive study of
the transcriptome. For a long time, our knowledge
of the transcriptome was largely derived from gene
predictions and limited EST evidence and has there-
fore been partial and biased. Recently, however, whole-
transcriptome sequencing using next-generation
sequencing (NGS) technologies, or RNA sequencing
(RNA-seq), has started to reveal the complex landscape
and dynamics of the transcriptome from yeast to human
at an unprecedented level of sensitivity and accuracy1–4.
Compared with traditional low-throughput EST
sequencing by Sanger technology, which only detects
the more abundant transcripts, the enormous sequencing
depth (100–1,000 reads per base pair of a transcript) of
a typical RNA-seq experiment offers a near-complete
snapshot of a transcriptome, including the rare tran-
scripts that have regulatory roles. In contrast to alterna-
tive high-throughput technologies, such as microarrays,
RNA-seq achieves base-pair-level resolution and a
much higher dynamic range of expression levels, and
it is also capable of de novo annotation1,2. Despite these
advantages, sequence reads obtained from the common
NGS platforms, including Illumina, SOLiD and 454, are
often very short (35–500 bp)5. As a result, it is neces-
sary to reconstruct the full-length transcripts by tran-
scriptome assembly, except in the case of small classes
of RNA, such as microRNAs, PIWI-interacting RNAs
(piRNAs), small nucleolar RNAs (snoRNAs) and small interfering
RNAs (siRNAs), which are shorter than the sequencing read
length and do not require assembly.
Reconstructing a comprehensive transcriptome from
short reads has many informatics challenges. Similar
to short-read genome assembly, transcriptome assembly
involves piecing together short, low-quality reads. Typical
NGS data sets are very large (several gigabases to tera-
bases), which requires computing systems to have large
memories and/or many cores to run parallel algorithms.
Several short-read assemblers have been developed to
tackle these challenges6–9, including Velvet6, ABySS7 and
ALLPATHS8. Although these tools have achieved reason-
able success in the assembly of genomes9,10, they cannot
directly be applied to transcriptome assembly, mainly
because of three considerations. First, whereas DNA
sequencing depth is expected to be the same across a
genome, the sequencing depth of transcripts can vary by
several orders of magnitude. Many short-read genome
assemblers use sequencing depth to distinguish repetitive
regions of the genome, a feature that would mark abun-
dant transcripts as repetitive. Sequencing depth is also
used by assemblers to calculate an optimal set of parame-
ters for genome assembly, which would probably result in
only a small set of transcripts being favoured in the tran-
scriptome assembly. Second, unlike genomic sequencing,
in which both strands are sequenced, RNA-seq experi-
ments can be strand-specific. Transcriptome assemblers
will need to take advantage of strand information to
resolve overlapping sense and antisense transcripts11–14.
Finally, transcriptome assembly is challenging, because
transcript variants from the same gene can share exons
and are difficult to resolve unambiguously. Given the
complexity of most transcriptomes and the above challenges,
reconstructing all of the transcripts and their variants
from short reads alone has been difficult.
Lawrence Berkeley National
Laboratory, DOE Joint
Genome Institute, 2800
Mitchell Drive, MS100
Walnut Creek, California
94598, USA.
e‑mails: jamartin@lbl.gov;
zhongwang@lbl.gov
doi:10.1038/nrg3068
Published online
7 September 2011
RNA sequencing
(RNA-seq). An experimental
protocol that uses next-
generation sequencing
technologies to sequence
the RNA molecules within a
biological sample in an effort
to determine the primary
sequence and relative
abundance of each RNA.
Sequencing depth
The average number of reads
representing a given nucleotide
in the reconstructed sequence.
A 10× sequence depth means
that each nucleotide of the
transcript was sequenced,
on average, ten times.
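As a worked illustration of this definition, the following minimal sketch computes the average depth for a single transcript. The function name `mean_depth` and the example numbers are hypothetical, and the calculation assumes that every read lies entirely within the transcript.

```python
def mean_depth(read_lengths, transcript_length):
    """Average number of reads covering each nucleotide of a transcript.

    Assumes every read falls entirely within the transcript, so the
    total number of sequenced bases is simply the sum of read lengths.
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    return sum(read_lengths) / transcript_length


# Hypothetical example: 200 reads of 100 bp from a 2,000-bp transcript.
depth = mean_depth([100] * 200, 2000)
print(f"{depth:.1f}x")  # prints "10.0x"
```

A depth of 10× here corresponds to the glossary example: each nucleotide is sequenced ten times on average, although the actual coverage at any one position may be higher or lower.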
Next-generation transcriptome assembly
Jeffrey A. Martin and Zhong Wang
Abstract | Transcriptomics studies often rely on partial reference transcriptomes that fail to
capture the full catalogue of transcripts and their variations. Recent advances in sequencing
technologies and assembly algorithms have facilitated the reconstruction of the entire
transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome.
However, transcriptome assembly from billions of RNA-seq reads, which are often very short,
poses a significant informatics challenge. This Review summarizes the recent developments
in transcriptome assembly approaches — reference-based, de novo and combined strategies
— along with some perspectives on transcriptome assembly in the near future.
STUDY DESIGNS
REVIEWS
NATURE REVIEWS | GENETICS VOLUME 12 | OCTOBER 2011 | 671
© 2011 Macmillan Publishers Limited. All rights reserved
Paired-end protocol
A library construction and
sequencing strategy in which
both ends of a DNA fragment
are sequenced to produce
pairs of reads (mate pairs).
Contigs
An abbreviation for contiguous
sequences that is used to
indicate a contiguous piece of
DNA assembled from shorter
overlapping sequence reads.
In the past 3 years, improvements in data quality
and the rapid evolution of assembly algorithms have
made it possible to address the above challenges. In
this Review, we show how these exciting advances have
led to a wealth of assembled transcriptomes from short
reads15–25, and we provide practical guidelines for imple-
menting a transcriptome assembly experiment. After
describing the experimental and informatics considera-
tions that need to be made before assembly, we discuss
three assembly strategies: assembly based on a reference
genome, de novo assembly and a combined approach
that merges the two strategies. We focus on the strengths
and weaknesses of the three strategies in the context of
both gene-dense transcriptomes and large transcrip-
tomes with pervasive alternative splicing. Finally, we
give some perspectives on the future of transcriptome
assembly in light of the rapid evolution of sequencing
technologies and high-performance computing.
Considerations prior to assembly
To ensure a high-quality transcriptome assembly, par-
ticular care should be taken in designing the RNA-seq
experiment. The steps of a typical transcriptome assem-
bly experiment are shown in FIG. 1. In the data generation
phase (FIG. 1a), total RNAs or mRNAs are fragmented and
converted into a library of cDNAs containing sequencing
adaptors. The cDNA library is then sequenced by next-
generation sequencers to produce millions to billions
of short reads from one end or both ends of the cDNA
fragments. In the data analysis phase (FIG. 1b), these short
reads are pre-processed to remove sequencing errors and
other artefacts. The reads are subsequently assembled to
reconstruct the original RNAs and to assess their abun-
dance (‘expression counting’). Expression counting is
not trivial for transcriptomes with extensive alternative
splicing26: transcripts often share some exons, causing
uncertainty as to which transcript each read belongs to.
The accuracy and precision of gene expression counting
are influenced by cDNA library construction methods,
sequencing technologies and data pre-treatment tech-
niques27. Similarly, these factors can influence the quality
of assembled transcriptomes, as discussed below.
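The ambiguity in expression counting can be made concrete with a toy example. The sketch below assumes a hypothetical gene with two isoforms, `T1` (exons A, B and C) and `T2` (exons A and C), and assumes that each read has already been assigned to a single exon. It only separates uniquely assignable reads from ambiguous ones and is not a quantification method.

```python
from collections import Counter

# Hypothetical gene model: T2 skips exon B.
ISOFORMS = {"T1": {"A", "B", "C"}, "T2": {"A", "C"}}


def classify_reads(read_exons, isoforms=ISOFORMS):
    """Count reads that are unique to one isoform, ambiguous or unassigned."""
    counts = Counter()
    for exon in read_exons:
        compatible = [name for name, exons in isoforms.items() if exon in exons]
        if len(compatible) == 1:
            counts[compatible[0]] += 1      # read identifies a single isoform
        elif compatible:
            counts["ambiguous"] += 1        # exon is shared between isoforms
        else:
            counts["unassigned"] += 1       # exon is not in the gene model
    return counts


counts = classify_reads(["A", "A", "B", "C", "B", "A"])
print(dict(counts))
```

In this example only the two reads from exon B can be attributed to a specific isoform; the four reads from the shared exons A and C are compatible with both, which is the uncertainty described above.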
Library construction. To increase the number of assem-
bled transcripts, especially the less abundant ones, riboso-
mal RNA (rRNA) and abundant transcripts are removed
during the first steps of library construction. Poly(A)
selection is very effective at enriching mRNAs in eukary-
otes, but this selection approach will miss non-coding
RNAs (ncRNAs) and mRNAs that lack a poly(A) tail. In
order to retain RNAs without a poly(A) tail in the assem-
bled transcriptome, rRNA contamination can instead be
removed by hybridization-based depletion methods28,29.
These methods increase the opportunity for the detec-
tion and assembly of rare transcripts by reducing the
representation of rRNAs and other highly abundant tran-
scripts30, which often constitute most of the reads in RNA-
seq data sets. Note that these depletion methods may
bias the quantification of highly abundant transcripts,
and so if quantification is a goal of the study, then the
sequencing of non-depleted libraries will be required.
Another consideration to make during library con-
struction, provided the starting RNA quantities are not
limiting, is whether to eliminate the PCR amplifica-
tion step from the protocol. PCR amplification results
in a low sequencing coverage for transcripts or regions
within a transcript that have a high GC content31. This
can, in turn, cause gaps in the assembled transcripts
and can cause other transcripts to be missing from the
assembly altogether. Amplification-free protocols have
been developed to overcome this problem31,32. The latest
single-molecule sequencing technologies from Helicos
and Pacific Biosciences do not require PCR amplifica-
tion before sequencing33. The Helicos system can even
directly sequence RNAs without cDNA library construc-
tion1,34, which should greatly reduce biases in sequencing
coverage. However, these single-molecule technologies
suffer from high error rates. Overall, sequencing cover-
age of the transcriptome from amplification-free pro-
tocols is more even and contiguous across transcripts,
making it easier for assemblers to construct full-length
transcripts across GC-rich regions of the transcriptome.
Last, the use of strand-specific RNA-seq protocols27
aids in the assembly and quantification of overlapping
transcripts that are derived from opposite strands of the
genome. This consideration is especially important for
gene-dense genomes, such as those of bacteria, archaea
and lower eukaryotes, but it is also important for detect-
ing antisense transcription, which is common in higher
eukaryotes.
Sequencing. The major factors to consider before
sequencing a sample are: the choice of sequencing plat-
form, the sequencing read length and whether to use
a paired-end protocol. All of the current NGS technolo-
gies have successfully been used to assemble transcrip-
tomes35–37, and they differ mostly in their throughput
and cost.
The choice of sequencing technology largely depends
on the technology to which a user has access and
the budget constraints for sequencing. In general, the
assembly of large and complex transcriptomes (plants
and mammals) requires extensive sequencing and is fre-
quently done on Illumina or SOLiD platforms. The 454
technology offers longer reads, and it can be used alone for
small transcriptomes. Illumina, SOLiD and 454 technol-
ogy can also be combined in a ‘hybrid assembly’ strategy:
short reads that are sequenced at a greater depth are
assembled into contigs, and long reads are subsequently
used to scaffold the contigs and resolve variants38,39.
For transcriptome assembly, longer reads are gener-
ally preferred, as they greatly reduce the complexity of
the assembly. It is worth noting, however, that the prob-
lem posed by short reads can be alleviated by using a
paired-end protocol, in which 75–150 bp are sequenced
from both ends of short DNA fragments (100–250 bp),
and the overlapping reads are computationally joined
together to form a longer read40. Paired reads from
long inserts (500–1,000 bp) offer long-range exon con-
nectivity, in a similar way to reads obtained using 454
technology. For this reason, some assemblers, such as
ALLPATHS, require at least two libraries with different
[Figure 1 panel labels. a | Data generation: mRNA or total RNA; remove contaminant DNA; fragment RNA; reverse transcribe into cDNA; ligate sequence adaptors; select a range of sizes; sequence cDNA ends. b | Data analysis: raw reads; remove artefacts; correct errors (optional); assemble into transcripts; post-process transcripts; align reads to transcripts to quantify expression. Decision points: Remove rRNA? Select mRNA? Strand-specific RNA-seq? PCR amplification?]
Figure 1 | The data generation and analysis steps of a typical RNA-seq experiment. a | Data generation. To generate
an RNA sequencing (RNA-seq) data set, RNA (light blue) is first extracted (stage 1), DNA contamination is removed
using DNase (stage 2), and the remaining RNA is broken up into short fragments (stage 3). The RNA fragments are then
reverse transcribed into cDNA (yellow, stage 4), sequencing adaptors (blue) are ligated (stage 5), and fragment size
selection is undertaken (stage 6). Finally, the ends of the cDNAs are sequenced using next-generation sequencing
technologies to produce many short reads (red, stage 7). If both ends of the cDNAs are sequenced, then paired-end
reads are generated, as shown here by dashed lines between the pairs. b | Data analysis. After sequencing, reads are
pre-processed by removing low-quality reads and artefacts, such as adaptor sequences (blue), contaminant DNA
(green) and PCR duplicates (stages 1 and 2). Next, the sequence errors (red crosses) are optionally removed (stage 3)
to improve the read quality (see main text for details). The pre-processed reads are then assembled into transcripts
(orange, stage 4), which are polished by post-assembly processing to remove assembly errors (blue crosses, stage 5).
Finally, the expression level of each transcript is estimated by counting the number of
reads that align to it (stage 6). rRNA, ribosomal RNA.
Low-complexity reads
Short DNA sequences
composed of stretches of
homopolymer nucleotides
or simple sequence repeats.
Quality scores
An integer representing the
probability that a given base
in a nucleic acid sequence
is correct.
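Quality scores are commonly reported on the Phred scale, in which a score Q corresponds to an error probability of 10^(−Q/10). The sketch below converts scores to error probabilities and trims low-quality bases from the 3′ end of a read. The function names are illustrative, and the ASCII offset of 33 and the threshold of Q20 are assumptions that should be checked against the platform and pipeline in use.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Convert an ASCII-encoded quality string to integer scores.

    The offset of 33 is an assumption; some older data sets use 64.
    """
    return [ord(ch) - offset for ch in qual_string]


def trim_low_quality_tail(seq, qual_string, min_q=20, offset=33):
    """Remove bases from the 3' end until one reaches the min_q threshold."""
    scores = decode_quality(qual_string, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end]


print(decode_quality("I5#"))                      # [40, 20, 2]
print(trim_low_quality_tail("ACGTAC", "IIII##"))  # ACGT
```

A score of 20 therefore corresponds to an expected error rate of 1 in 100 base calls, and a score of 40 to 1 in 10,000.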
k-mer frequency
The number of times that
each k-mer (that is, a short
oligonucleotide of length k)
appears in a set of DNA
sequences.
Splice-aware aligner
A program that is designed to
align cDNA reads to a genome.
Traversing
A method for systematically
visiting all nodes in a
mathematical graph.
Seed-and-extend aligners
An alignment strategy that first
builds a hash table containing
the location of each k-mer
(seed) within the reference
genome. These algorithms then
extend these seeds in both
directions to find the best
alignment (or alignments) for
each read.
Burrows–Wheeler
transform
(BWT). This reorders the
characters within a sequence,
which allows for better data
compression. Many short-read
aligners implement this
transform in order to use
less memory when aligning
reads to a genome.
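The reordering performed by the transform can be illustrated with a naive implementation based on sorted rotations. This is a teaching sketch only: it assumes a `$` terminator that does not occur in the input, and it is far too slow and memory-hungry for a genome. Production aligners build the transform from suffix arrays and query it through a compressed index.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, quadratic memory)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))          # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

Identical characters tend to cluster in the output (the run of `n` and the run of `a` above), which is the property that makes the transformed sequence more compressible, and the transform is fully reversible.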
Parallel computing
A computer programming
model for distributing
data processing across
multiple processors, so
that multiple tasks can be
carried out simultaneously.
insert sizes8. The combination of short and long insert
libraries should be helpful in capturing transcripts of
various sizes while also helping to resolve alternatively
spliced isoforms.
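The computational joining of overlapping paired-end reads mentioned above can be sketched as follows. This is a simplified illustration that requires an exact overlap and ignores quality scores; the function names and the `min_overlap` default are assumptions, and dedicated read-merging tools tolerate mismatches and weigh base qualities.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Join a read pair whose 3' ends overlap exactly; return None otherwise.

    read2 is sequenced from the opposite strand, so it is reverse
    complemented before the overlap search. The longest exact overlap
    of at least min_overlap bases is used.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


# Hypothetical 36-bp fragment sequenced with 25-bp reads from both ends.
fragment = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATTACA"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
print(merge_pair(read1, read2) == fragment)  # True
```

Pairs from long-insert libraries do not overlap and would return `None` here; such pairs contribute connectivity information rather than a single longer read.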
Data pre-processing. Removing artefacts from RNA-
seq data sets before assembly improves the read quality,
which, in turn, improves the accuracy and computa-
tional efficiency of the assembly. This step is straight-
forward and can be executed using several tools41–44. In
general, three types of artefacts should be removed from
raw RNA-seq data: sequencing adaptors43,44, which origi-
nate from failed or short DNA insertions during library
preparation; low-complexity reads43; and near-identical
reads that are derived from PCR amplification15. Adaptor
and low-complexity sequences can lead to misassem-
blies. PCR duplicates are more common in long-insert
libraries, and their presence can skew mate-pair sta-
tistics that are used by many assemblers for scaffold-
ing. When their identities are known, rRNA and other
RNA contaminants should also be removed to improve
assembly speed.
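The three artefact classes described above can be handled by a filter along the following lines. This is a minimal sketch under several assumptions: adaptors are found by exact string matching, low complexity is approximated by the fraction of the most frequent base (which catches homopolymer-rich reads but not all simple repeats), and only exact duplicates are removed rather than near-identical reads. The adaptor sequence, thresholds and function names are placeholders for illustration.

```python
from collections import Counter


def trim_adaptor(read, adaptor):
    """Cut the read at the first exact occurrence of the adaptor."""
    idx = read.find(adaptor)
    return read if idx == -1 else read[:idx]


def is_low_complexity(read, max_fraction=0.8):
    """Flag reads dominated by a single nucleotide."""
    if not read:
        return True
    most_common_count = Counter(read).most_common(1)[0][1]
    return most_common_count / len(read) > max_fraction


def preprocess(reads, adaptor, min_length=20):
    """Trim adaptors, then drop short, low-complexity and duplicate reads."""
    seen = set()
    kept = []
    for read in reads:
        read = trim_adaptor(read, adaptor)
        if len(read) < min_length or is_low_complexity(read):
            continue
        if read in seen:          # exact duplicate of an earlier read
            continue
        seen.add(read)
        kept.append(read)
    return kept


adaptor = "AGATCGGAAGAGC"         # placeholder adaptor sequence
insert = "ACGTACGGTCAATGCCTAGGATCCA"
raw = [insert + adaptor, insert, "A" * 30, "ACGT" + adaptor]
print(preprocess(raw, adaptor))   # ['ACGTACGGTCAATGCCTAGGATCCA']
```

For paired-end data, duplicate detection would normally consider both mates together, which this single-read sketch does not attempt.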
Sequencing errors in NGS reads can be removed or
corrected by analysing the quality score and/or the k-mer
frequency. For most NGS data sets, low quality scores
indicate possible sequencing errors. Sequencing errors
can also be empirically inferred by looking at the fre-
quencies of each k-mer in the data set. As the same RNA
molecule is sequenced many times, k-mers without
errors in them will occur multiple times. By contrast,
k-mers that occur in the data set at very low frequencies
are probably sequencing errors or are from transcripts
with a low abundance. Reads containing these errors
can be removed, trimmed or corrected to improve the
assembly quality and to decrease the amount of random
access memory (RAM) required10,15,42. However, k-mer-
based error removal carries a side effect, in that reads
derived from rare transcripts may also be removed. This
should not be a large problem, as the shallow sequenc-
ing depth for these transcripts would not be sufficient to
assemble them, even if these reads were retained.
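The k-mer frequency approach described above can be sketched as follows. The code counts every k-mer in the data set and flags reads that contain at least one k-mer seen fewer than `min_count` times. The values of `k` and `min_count` are illustrative assumptions, and the sketch only flags reads; it does not attempt to correct them.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def flag_suspect_reads(reads, k=5, min_count=2):
    """Return one boolean per read: True if it contains a rare k-mer."""
    counts = count_kmers(reads, k)
    return [
        any(counts[read[i:i + k]] < min_count for i in range(len(read) - k + 1))
        for read in reads
    ]


# Five error-free copies of a read plus one copy with a single substitution.
reads = ["ACGTACGTTG"] * 5 + ["ACGTAGGTTG"]
print(flag_suspect_reads(reads))  # [False, False, False, False, False, True]
```

As noted in the text, a read from a genuinely rare transcript would be flagged in exactly the same way as the erroneous read in this example, because frequency alone cannot distinguish the two cases.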
Transcriptome assembly strategies
Depending on whether a reference genome assembly is
available, current transcriptome assembly strategies gen-
erally fall into one of three categories: a reference-based
strategy, a de novo strategy or a combined strategy that
merges the two (FIGS 2–4). In the following sections, we
discuss each of these three strategies in detail, includ-
ing their pros and cons for the assembly of simple and
complex transcriptomes.
Reference-based strategy
When a reference genome for the target transcriptome
is available, the transcriptome assembly can be built
upon it. In general, this strategy — which is known
as ‘reference-based’ or ‘ab initio’ assembly — involves
three steps. First, RNA-seq reads are aligned to a reference
genome using a splice-aware aligner, such as BLAT45,
TopHat46, SpliceMap47, MapSplice48 or GSNAP49 (TABLE 1;
FIG. 2a). Second, overlapping reads from each locus are
clustered to build a graph representing all possible iso-
forms (FIG. 2b). The final step involves traversing the graph
to resolve individual isoforms (FIG. 2c,d). Examples of
methods that use the reference-based strategy include
Cufflinks20, Scripture16 and others17,50 (TABLE 2).
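The graph-traversal step can be illustrated with a toy splice graph in which nodes are exons and edges are observed exon-to-exon connections. The sketch below simply enumerates every path from a first exon to a last exon. It is a naive illustration that assumes an acyclic graph and hypothetical exon names; it is not the traversal algorithm of any particular assembler, which may apply further criteria to choose among paths.

```python
def enumerate_isoforms(graph, sources, sinks):
    """List every path from a source exon to a sink exon.

    graph maps each exon to the exons it is connected to downstream.
    The graph is assumed to be acyclic.
    """
    paths = []

    def walk(node, path):
        path = path + [node]
        if node in sinks:
            paths.append(path)
        for downstream in graph.get(node, []):
            walk(downstream, path)

    for source in sources:
        walk(source, [])
    return paths


# Hypothetical locus in which exon E2 is optionally skipped.
splice_graph = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": ["E4"], "E4": []}
for isoform in enumerate_isoforms(splice_graph, sources=["E1"], sinks={"E4"}):
    print("-".join(isoform))
# E1-E2-E3-E4
# E1-E3-E4
```

The number of paths can grow rapidly for loci with many independent splicing events, which is one reason why resolving individual isoforms from the graph is not trivial.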
Splice-aware aligners generally fall into two classes:
seed-and-extend aligners and Burrows–Wheeler transform
(BWT) aligners, each of which has clear trade-offs. The
seed-and-extend algorithms, such as BLAT and GSNAP,
start by quickly finding a ‘seed’ — a substring of the read
— that exactly matches the genome and then locally
extending the match using Smith–Waterman alignment
algorithms. BWT aligners are optimized to align reads
with few errors in them and are therefore generally faster
than seed-and-extend aligners. Each aligner differs in its
implementation for aligning reads across introns. In gen-
eral, seed-and-extend aligners shift the gaps in the local
alignment to match known splice sites, whereas BWT
aligners, such as TopHat, create a database of all possible
combinations of splicing junctions within a locus and
then align to this database the reads that failed to align
to the genome.
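The seed-and-extend idea can be sketched in a few lines. The code below builds a hash table of k-mer positions in a reference, looks up each k-mer of a read as a seed, and extends each hit across the full read. Two simplifications are assumed: the extension is ungapped and merely counts mismatches, in place of a Smith–Waterman alignment, and introns are not handled at all. Function names and parameter defaults are illustrative.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped placement, or None."""
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset              # implied start of the whole read
            end = start + len(read)
            if start < 0 or end > len(reference):
                continue
            mismatches = sum(a != b for a, b in zip(read, reference[start:end]))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "TTGACCAGTACGGATCCATGCAAGT"
index = build_seed_index(reference, k=4)
read = reference[5:8] + "A" + reference[9:17]   # one substitution at read position 3
print(seed_and_extend(read, reference, index, k=4))  # (5, 1)
```

Because a seed must match exactly, a read is found only if at least one of its k-mers is free of errors, which illustrates one of the trade-offs of this class of aligner: shorter seeds tolerate more errors but produce more candidate positions to extend.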
After the reads are aligned to the genome, two meth-
ods are typically used