Genome assembly is a bit of a puzzle

By Alan Tracey and Stephen Doyle, Parasite Genomics, Wellcome Trust Sanger Institute

One of the core aims of the BUG project is to complete our work on the Haemonchus contortus genome and to generate a high quality reference genome assembly for Teladorsagia circumcincta. However, generating a high quality genome assembly is a non-trivial matter. In an ideal world, we would simply be able to “read” each base from one end of a chromosome to the other, telomere to telomere. Although recent advances in DNA sequencing technologies have transformed our ability to analyse genomes, it is still extremely difficult to assemble medium to large genomes. This is because we are limited to breaking the genome into a very large number of very small fragments that are sequenced individually in high-throughput DNA sequencers (an approach called “shotgun sequencing”), leaving us with the challenge of assembling these short sequences back together so that they perfectly reflect the true genome sequence. This process is very much like doing a very difficult jigsaw puzzle that may consist of as few as thousands, but typically millions or billions, of pieces, many of which are very similar to each other.
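To make the jigsaw analogy concrete, here is a minimal Python sketch of shotgun sequencing and reassembly on a toy scale. It is an illustration only, not the software we use: the 16-base “genome”, the function names (shred, overlap, greedy_assemble) and the greedy strategy of always merging the pair of reads with the largest overlap are all assumptions chosen for simplicity. Real assemblers also have to cope with sequencing errors, reads from both DNA strands, and repeats.

```python
def shred(genome: str, read_len: int, step: int) -> list[str]:
    """Cut the genome into overlapping reads (a tidy stand-in for shotgun sequencing)."""
    return [genome[i:i + read_len] for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap until none overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join; the remaining pieces stay as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


genome = "ACGTTGCATCAGGCTA"            # hypothetical toy genome with no repeats
reads = shred(genome, read_len=8, step=4)
reads.reverse()                         # reads do not arrive in genome order
print(greedy_assemble(reads))           # prints ['ACGTTGCATCAGGCTA']
```

With only three error-free reads and no repeated sequence, the greedy approach recovers the original sequence. The difficulties described in the rest of this post are exactly the cases where such a simple strategy breaks down.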

The analogy of a jigsaw puzzle can be used to highlight some of the difficulties faced during the assembly of eukaryotic genomes of organisms such as H. contortus and T. circumcincta. These challenges include: (i) the presence of long stretches of repeated nucleotides (imagine a puzzle of a vast expanse of cloudless blue sky), (ii) biases in sample preparation and sequencing, which mean that some regions of the genome are poorly represented or missing (lost puzzle pieces result in permanent holes; however, we often don’t know they are missing, or where the holes are, until the rest of the puzzle is close to completion), (iii) the presence of other organisms or tissues in the sample, which leads to sequence contamination (somehow pieces from another jigsaw puzzle have found their way into our jigsaw puzzle box), and (iv) genetic differences, whereby the same position in the genome can be represented in subtly different ways (multiple variations of a puzzle piece, for example two in a diploid organism, that can fit in the same location). Given that these genomic “jigsaws” consist of so many pieces, we rely heavily on high performance computing and assembly algorithms designed to help put these pieces together; however, neither is perfect, and pieces are often assembled incorrectly, for example by being joined in the wrong order.
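The “cloudless blue sky” problem in point (i) can be shown with a small Python sketch. The sequences, the repeat and the function name kmer_counts are all made up for illustration. The sketch builds two different toy genomes that contain the same repeat three times, with the unique segments between the copies swapped. If every read is shorter than the repeat, both genomes produce exactly the same collection of reads, so no algorithm could tell which arrangement is the true one.

```python
from collections import Counter

REPEAT = "GATTACAG"  # hypothetical 8-base repeat, present three times in each genome

# Two different arrangements of the same unique segments (TT and GG) between the repeats.
genome_1 = "CC" + REPEAT + "TT" + REPEAT + "GG" + REPEAT + "AA"
genome_2 = "CC" + REPEAT + "GG" + REPEAT + "TT" + REPEAT + "AA"


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every error-free read of length k that could be drawn from the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


# Reads shorter than the repeat: the two genomes are indistinguishable.
short_reads_identical = kmer_counts(genome_1, 6) == kmer_counts(genome_2, 6)

# Reads long enough to span a whole repeat plus its flanks: the ambiguity disappears.
long_reads_identical = kmer_counts(genome_1, 12) == kmer_counts(genome_2, 12)

print(genome_1 != genome_2)      # prints True
print(short_reads_identical)     # prints True
print(long_reads_identical)      # prints False
```

In this toy case the 6-base reads never reach from one unique segment across a repeat to the next, whereas the 12-base reads do. Reads that span an entire repeat and its flanking sequence are what resolve the order.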

Fortunately, we have access to, and are using, some of the latest sequencing technologies available that can make genome assembly easier. “Third generation”, single molecule sequencing technologies, such as the Pacific Biosciences RSII sequencing machine, produce much longer reads (thousands of nucleotides long) than the dominant previous generation sequencing technology (from the company Illumina; typically 100-250 bp long). These longer “PacBio” reads often enable us to assemble difficult regions such as repeats, which we may not previously have been able to do in a short-read only assembly. Having bigger, more unique jigsaw puzzle pieces allows us to make more confident joins, resulting in an easier assembly process. To complement these new long read sequences, we have also been using optical mapping approaches, which give us an independent long-range view of our assembly. Optical mapping employs restriction endonucleases that cut the template DNA at specific places in the genome; by comparing the pattern of these cut sites in DNA molecules that are hundreds to thousands of kilobases long with the pattern predicted from our sequence assembly, we can generate very long, contiguous DNA sequences (Figure 1). Quality optical maps can be very difficult to produce due to the challenges associated with extracting and purifying the high molecular weight DNA that is essential for this application. However, the challenges of making an optical map are worth overcoming; having an optical map is very much like being able to see the picture on the jigsaw box as we try to assemble all of the pieces, and this gives us invaluable large-scale context for our assembly and ensures that long-range problems, such as incorrectly joined chromosomes, do not happen.


Figure 1. Example of a H. contortus optical contig (top) aligned against a sequence-derived contig (bottom). The vertical black lines in both sequences represent the actual (top) and in silico (bottom) predicted restriction endonuclease cut sites that are used to orientate the two sequences.
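The in silico prediction of cut sites mentioned in the Figure 1 caption can be sketched in a few lines of Python. This is a simplified illustration under several assumptions rather than a description of the optical mapping software itself. The recognition sequence GAATTC is used purely as an example, since the enzyme is not specified here. Cut positions are reported at the start of the recognition sequence rather than at the exact cleavage point. The 10% sizing tolerance, the toy contig and the function names are all hypothetical.

```python
from typing import Optional


def cut_sites(sequence: str, motif: str) -> list[int]:
    """Return the start position of every occurrence of the recognition motif."""
    sites = []
    pos = sequence.find(motif)
    while pos != -1:
        sites.append(pos)
        pos = sequence.find(motif, pos + 1)
    return sites


def fragment_lengths(sequence_length: int, sites: list[int]) -> list[int]:
    """Convert cut positions into the ordered list of fragment lengths between them."""
    bounds = [0] + sites + [sequence_length]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def patterns_agree(predicted: list[int], observed: list[int], tolerance: float = 0.1) -> bool:
    """True if both patterns have the same number of fragments with similar sizes."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= tolerance * o for p, o in zip(predicted, observed))


def orientation(predicted: list[int], observed: list[int], tolerance: float = 0.1) -> Optional[str]:
    """Decide whether the contig matches the optical map forwards, reversed, or not at all."""
    if patterns_agree(predicted, observed, tolerance):
        return "forward"
    if patterns_agree(predicted[::-1], observed, tolerance):
        return "reverse"
    return None


# Hypothetical sequence-derived contig containing two recognition sites.
contig = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
predicted = fragment_lengths(len(contig), cut_sites(contig, "GAATTC"))

print(predicted)                              # prints [4, 14, 8]
print(orientation(predicted, [8, 15, 4]))     # prints reverse
```

Here the made-up optical map pattern [8, 15, 4] only matches the predicted fragments when the contig is flipped, which is the kind of evidence used to orientate a sequence against the map. Real optical map alignment also has to allow for missed and false cut sites, which this sketch ignores.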

At the Wellcome Trust Sanger Institute, we synthesise several types of sequence and mapping evidence (currently Illumina short read, PacBio and optical mapping) and use many years of genome assembly experience to improve genome assemblies beyond what is currently possible by solely automated means. This manual curation process employs a tool called “GAP5”, which allows us to view, edit and reassemble regions of the genome in the final stages of assembly. This approach has been used to assemble the H. contortus genome into chromosomal-scale pieces, which are essentially complete from telomere to telomere (Figure 2). Small improvements are now being made to account for some of the variation seen in this highly polymorphic genome prior to its release as a publicly available resource for researchers working on Haemonchus and related species.


Figure 2. A “Circos” plot showing syntenic (highly similar) regions shared between H. contortus (purple bars) and Caenorhabditis elegans (numbered, coloured bars). Each coloured line linking a C. elegans chromosome to H. contortus sequence represents a highly similar, translated region shared between the two genomes. This plot demonstrates that much of the genome is shared, and that it exists in a similar orientation between the two organisms.

As we near completion of the H. contortus genome assembly, our next challenge is to assemble the T. circumcincta genome. We are currently optimising the PacBio and optical mapping conditions to maximise the sequence lengths of the raw data. We believe this will be crucial in making sense of this genome, which appears to be considerably larger, and potentially more complex, than that of H. contortus.

