Readings and resources for Sequence Assembly Algorithms

 

 

NOTE: To get any of the journal papers mentioned here, point your browser to this directory with no file name.

 

Introduction

 

I have not been able to find any single source, such as a textbook or review article that provides a high-level overview of sequence assembly. That is probably because the state of the art is changing so fast that (a) everyone who knows about it is too busy developing methods to teach, and (b) anything that did get written up would be obsolete within a few years. So I have pulled the material for the lectures from several different research papers, none of which can easily be read in its entirety. Below, I will list the major topics and where I got information on each of them. Where possible, I will direct you to certain parts of the papers.

 

Formalization of reconstruction problem, relation to shortest common superstring, and relation of layouts to layout graphs.

 

This material was drawn from (Kececioglu and Myers 1995) and (Myers 1995). These are both fairly formal, theoretical papers, and they present slightly different treatments. The relevant part of K&M is pages 1-6 (ending before section 2.1) and pages 16-18 (ending before section 4.2). It is also interesting to read their experiments in Section 6 (this could give you some inspiration for possible final projects). The remainder of section 2 is mainly of theoretical interest and is not used in practice. In addition, the experiments show that it has problems. Likewise, the remainder of Section 4 is a complicated algorithm for finding the layout graph with greatest total weight that, according to the experiments in Section 6, doesn’t do any better in practice than the simple greedy algorithm. Section 3 is about resolving read orientation. It is interesting, and not too, too difficult, but this approach is superseded in (Myers 1995) by a more complex representation of the overlap and layout edges that rolls orientation into layout.

 

In discussing the relationship between layouts and layout graphs, M&K use the term dovetail chain branching to refer to the subsets of the overlap graph that correspond to connected components of the layout graph, or contigs. (Myers 1995) uses the term dovetail path framework. Both treatments discuss the handling of reads that are completely contained within other reads during layout, something that we have not yet discussed in class (but we will). Fortunately, it is pretty simple.

 

The treatments in (Kececioglu and Myers) and (Myers 1995) are slightly different in the way they handle ambiguity in read orientation. (Kececioglu and Myers) treats it in a separate stage, before layout. Specifically, they put separate nodes in the overlap graph for each read and its reverse complement. They then have an orientation stage in which they decide which version of each read to use. (Myers 1995), which represents a later stage of Myers’s thinking, uses a single node for each read, but incorporates more types of edges to represent the relative orientations at which two reads overlap. E.g., if the overlap involves either the suffixes or prefixes of both reads then one of them has to be reverse complemented; if it involves the prefix of one and the suffix of the other, they are in the same orientation. This is encoded in the edge types. Although the treatment in (Myers 1995) is more modern, the one in (Kececioglu and Myers) is a little simpler and more like the presentation in class, where I ignored the orientation problem.

 

Detecting overlaps

 

My presentation of the k-mer word method of detecting overlaps was based mostly on the “alignment module” of (Batzoglou et al. 2002). This is presented briefly in the introduction, under “Outline of Assembly Algorithm,” but to really follow it you need to read the section “The Alignment Module” under “Details of the Algorithm”. I presented what they call Phase 1 (detection of k-mer hits) and Phase 2 (ungapped extension) but not their Phase 3 and 4, which are necessary for constructing gapped alignments. Also, since they will construct gapped alignments later, they have heuristics for terminating the ungapped alignments before reaching the ends of the reads. Since I assumed we would not construct gapped alignments, I said that the ungapped extensions should go all the way to the end of the reads.

 

A different, but related approach is presented in (Mullikin and Ning 2003). Here, they do not even construct ungapped alignments – they use multiple k-mer hits and stop with a large cluster of reads that are likely to be connected by a chain of overlaps. Nonetheless, I found this discussion of k-mer hits a useful supplement to that of (Batzoglou et al.).

 

(Mullikin and Ning 2003) was also the source for the discussion of the theoretical and actual frequency distributions of k-mer hits.

 

Greedy algorithm for layout

 

None of these papers actually describe the greedy algorithm for finding the maximum weight layout graph, since it is essentially the same as algorithms that were described earlier for the shortest common superstring problem. The introduction to (Kececioglu and Myers) has a good, brief review of prior work and a number of citations on shortest common superstring (including one by Jon Turner in this department!).

 

Detecting cycles during layout

 

This is pretty straightforward, but I don’t know of any source for it in the literature.

 

The problems posed by repeats and how to deal with them

 

Figure 1 of (Myers 1995), described in the first couple of paragraphs of Section 2, provides a good illustration of why repeats can cause assembly errors. A great deal has been written about how to deal with repeats during assembly to avoid these errors. Essentially, they all boil down to doing the initial assembly only on the reads that have at least some non-repetitive component. The all-repeat reads may be ignored or assembled into their own contigs, which may collapse many copies of the repeat.  The gaps left by the centers of long repeats are filled in later. This is done partly by ignoring high-frequency k-mer hits, and partly by recognizing patterns in the overlap graph that are characteristic of what Batzaglou et al. calls “repeat boundaries”. This means the edges of repeat-contigs, where the non-repetitive portion resumes, and edges of non-repeat contigs, where the repetitive portion starts. These are separated out, so the initial stage does not try to decide how to connect up the repetitive parts to the non-repetitive parts. Some of these heuristics are described graphically in (Batzoglou et al.). A similar approach is given a more formal treatment in Section 4 of  (Myers 1995) (but the figures and examples are worth reading through even if you don’t put in the effort to follow the formal details).

 

Incorporation of mate-pair distance constraints

 

I don’t have a good direct source on this yet, but I believe there is probably one in the literature. One way to handle this is to use the Bellman-Ford algorithm for checking sets of difference constraints, as described in (Cormen et al.). Once you understand how this algorithm works, though, I think the same principles can be used more efficiently in the special case of layout.

 

In general, these constraints tell you when a layout is probably wrong, because it is inconsistent with the insert lengths between paired-end reads. They can either be used during layout (as in the double-barreled assembler for Lab 1) or the layout can be done first, and then contigs can be broken up if necessary to avoid constraint violation. The latter approach is taken by (Mullikin and Ning 2003). They can also be used pro-actively, to try to find initial contigs that are very confident because sets of overlaps are supported by mate-pairing and insert-sizes. This approach is taken by (Batzoglou et al.), who referred to these mate-pair supported overlaps as “paired pairs”.

 

Other sources

 

Sequence assembly programs are usually designed for either assembly of shotgun reads from a fairly small sequence, such as a BAC (< 300 Kb) or bacterial genome (~1-3 Mb), or for large sequences, such as an animal or plant genome. The latter are called whole-genome shotgun (WGS) assemblers. The most famous and most widely used BAC assembler is a program called PHRAP, by Phil Green (the author of PHRED). Infuriatingly, no paper has ever been published on PHRAP. The only way to learn about it is to look at the web site (www.phrap.org) or read the source code, which is free for academic use. Another BAC assembler, which has been described in several papers, is CAP (and CAP2 and CAP3). See (Huang and Madan 1999).

 

For WGS assemblers, two of the most recent papers are (Batzoglou et al.), which is updated in (Jaffe et al. 2003), and (Mullikin and Ning). The latter actually uses PHRAP as a subroutine, preprocessing and post processing its assemblies of read-clusters. There are also a number of more recent and more applied papers by Gene Myers where he describes either practical algorithms or assemblies of specific genomes. See (Huson et al. 2001; Myers et al. 2000; Weber and Myers 1997).

 

Finally, there are a bunch of papers debating whether whole-genome shotgun sequencing and assembly for large genomes is a good idea. Most of this literature is now moot as everyone agrees it is a good idea. In this case, Myers won the day, though the specific algorithm he developed for the Celera assembly of the human genome has now been superseded.

 

Batzoglou, S., D.B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J.P. Mesirov, and E.S. Lander. 2002. ARACHNE: a whole-genome shotgun assembler. Genome Res 12: 177-189.

Cormen, T.H., C.E. Leiserson, and R.L. Rivest. 1990. Introduction to algorithms. MIT Press ; McGraw-Hill, Cambridge, Mass.

Huang, X. and A. Madan. 1999. CAP3: A DNA sequence assembly program. Genome Res 9: 868-877.

Huson, D.H., K. Reinert, S.A. Kravitz, K.A. Remington, A.L. Delcher, I.M. Dew, M. Flanigan, A.L. Halpern, Z. Lai, C.M. Mobarry, G.G. Sutton, and E.W. Myers. 2001. Design of a compartmentalized shotgun assembler for the human genome. Bioinformatics 17 Suppl 1: S132-139.

Jaffe, D.B., J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh, J.P. Mesirov, M.C. Zody, and E.S. Lander. 2003. Whole-genome sequence assembly for Mammalian genomes: arachne 2. Genome Res 13: 91-96.

Kececioglu, J.D. and E.W. Myers. 1995. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13: 7-51.

Mullikin, J.C. and Z. Ning. 2003. The phusion assembler. Genome Res 13: 81-90.

Myers, E.W. 1995. Toward simplifying and accurately formulating fragment assembly. J Comput Biol 2: 275-290.

Myers, E.W., G.G. Sutton, A.L. Delcher, I.M. Dew, D.P. Fasulo, M.J. Flanigan, S.A. Kravitz, C.M. Mobarry, K.H. Reinert, K.A. Remington, E.L. Anson, R.A. Bolanos, H.H. Chou, C.M. Jordan, A.L. Halpern, S. Lonardi, E.M. Beasley, R.C. Brandon, L. Chen, P.J. Dunn, Z. Lai, Y. Liang, D.R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G.M. Rubin, M.D. Adams, and J.C. Venter. 2000. A whole-genome assembly of Drosophila. Science 287: 2196-2204.

Weber, J.L. and E.W. Myers. 1997. Human whole-genome shotgun sequencing. Genome Res 7: 401-409.