Post assembly steps - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

Post assembly steps

Please find the presentations given at the SIB Bioinformatics of long read sequencing workshop 2017 here. The introduction below roughly follows that given in the presentation.

So, you have spent lots of money sequencing the genome of your favourite beastie, you have a nice contig N50, and now you want to do something with the genome. But wait, your genome is still in 10,000 chunks. So how do you examine genome-wide patterns of Fst, dN/dS, introgression, etc.? You need to know where these chunks (contigs) lie relative to each other in the genome.

There are a few ways of addressing this issue, and they share a common principle: you generate linkage information that connects distant regions of a chromosome, and integrate it into an existing sequence assembly to connect adjacent contigs/scaffolds. These "long-range scaffolding" approaches can be used for short-read or long-read assemblies, but their efficacy relies heavily on the quality (contiguity, low number of gaps) of the initial sequence assembly, so long-read assemblies respond particularly well to them.

Long range scaffolding approaches

Here we briefly introduce some long-range scaffolding approaches that can be used once a good sequence assembly has been produced.

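The first, Hi-C, relies on the way that DNA is physically positioned in the nucleus. Briefly, sequencing libraries are prepared so that regions of chromatin which are physically close in the nucleus end up joined in the same DNA fragment, which is then sequenced using paired-end sequencing. The two reads in each pair can be megabases apart in the genome but, as they most likely come from the same chromosome, they can be used to order and orient the contigs/scaffolds of an existing sequence assembly. Because contacts become rarer as the distance along the chromosome increases, contigs that share many read pairs are likely to lie close to each other.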
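To make the idea concrete, here is a minimal sketch of the first step a Hi-C scaffolder might take: counting how many read pairs connect each pair of contigs. It assumes the reads have already been mapped and reduced to a list of (contig of read 1, contig of read 2) tuples; the function name `count_contig_links` and the toy contig names are made up for illustration and do not come from any particular tool.

```python
from collections import Counter


def count_contig_links(read_pairs):
    """Count Hi-C read pairs joining each unordered pair of distinct contigs.

    read_pairs: iterable of (contig_of_read1, contig_of_read2) tuples
    (a hypothetical, already-parsed form of the mapped reads).
    """
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            # Both reads map to the same contig: says nothing about
            # how different contigs join, so skip it.
            continue
        # Sort the names so (ctg1, ctg2) and (ctg2, ctg1) share one key.
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


# Toy example with invented contig names.
pairs = [
    ("ctg1", "ctg2"),
    ("ctg2", "ctg1"),
    ("ctg1", "ctg1"),
    ("ctg2", "ctg3"),
    ("ctg1", "ctg2"),
]
links = count_contig_links(pairs)
for (a, b), n in links.most_common():
    print(a, b, n)
```

In this toy data ctg1 and ctg2 share more links than ctg2 and ctg3, so they would be the stronger candidates for neighbours. Real tools also normalise for contig length and use where on each contig the reads fall to work out orientation.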

Another approach, optical mapping, is essentially a reboot of a classical molecular biology tool, the restriction map. Optical mapping works by attaching a fluorescent probe at restriction sites (i.e. specific short sequence motifs) throughout the genome. Using some pretty well-developed technology, the DNA is then linearised, passed through flow channels and visualised. This produces a barcode of restriction sites, the pattern of which is highly specific to that particular region of the genome. Due to the nature of the technology, the raw reads can be in the region of 300 kb long and, when assembled together, can produce physical maps on the scale of megabases. In parallel to this process, the existing sequence assembly is digested in silico, using the sequence motif corresponding to the restriction enzyme used for the optical mapping. This again produces a barcode for each sequence contig/scaffold, which can then be matched to the physical maps to give a hybrid assembly. This approach is cheap, very high throughput and can yield hybrid assemblies many times more contiguous than the sequence assembly alone.
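The in silico digestion step is simple enough to sketch in a few lines. The example below is only an illustration under stated assumptions: it uses the EcoRI recognition motif GAATTC as a stand-in for whichever enzyme was used for the optical map, searches one strand only (fine for a palindromic motif like this one, but a non-palindromic motif would also need the reverse complement), and the function names are invented here.

```python
def in_silico_digest(sequence, motif):
    """Return the 0-based start positions of every occurrence of motif."""
    sequence = sequence.upper()
    motif = motif.upper()
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        # Step forward by one base so overlapping sites are also found.
        start = sequence.find(motif, start + 1)
    return sites


def site_spacings(sites):
    """Distances between consecutive sites: the contig's 'barcode'."""
    return [b - a for a, b in zip(sites, sites[1:])]


# Toy contig built from pieces so the site positions are easy to check.
motif = "GAATTC"  # EcoRI motif, used here only as an example
contig = "AA" + motif + "TTTT" + motif + "AAAAAA" + motif + "GG"

sites = in_silico_digest(contig, motif)
spacings = site_spacings(sites)
print(sites)     # site positions along the contig
print(spacings)  # gaps between sites, to compare with the optical map
```

It is the list of spacings, rather than the absolute positions, that gets compared against the optical map. On real data the sites are kilobases apart and the matching has to tolerate measurement error in the optical map, as well as missing or extra sites.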

Finally, another classical approach which can be used for long-range scaffolding is linkage mapping. Thanks to methods like RADseq, it is now reasonably straightforward to produce linkage maps containing many thousands of markers (given that the biology of your organism allows it). These markers can be aligned to an existing sequence assembly and used to order and orient the fragments. Recently, some nice pieces of software have been published to do exactly this, so, to show the power of such an approach, we will run through an example together.
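As a rough illustration of what "order and orient" means here, the sketch below places contigs along a linkage group using their markers. It is not the method of any published tool: it assumes each marker is given as a hypothetical (contig, position in bp, linkage group, position in cM) tuple, orders contigs by their mean cM position, and calls the orientation from whether cM increases or decreases with bp position along the contig.

```python
from collections import defaultdict


def anchor_contigs(markers):
    """Order and orient contigs within each linkage group.

    markers: iterable of (contig, bp, linkage_group, cM) tuples.
    Returns {linkage_group: [(contig, orientation), ...]} where the
    orientation is "+", "-" or "?" (not enough information).
    """
    by_contig = defaultdict(list)
    for contig, bp, group, cm in markers:
        by_contig[(group, contig)].append((bp, cm))

    placed = defaultdict(list)
    for (group, contig), points in by_contig.items():
        mean_bp = sum(bp for bp, _ in points) / len(points)
        mean_cm = sum(cm for _, cm in points) / len(points)
        # Sign of the bp/cM covariance: positive means the contig runs
        # in the same direction as the map, negative means it is flipped.
        cov = sum((bp - mean_bp) * (cm - mean_cm) for bp, cm in points)
        orientation = "+" if cov > 0 else "-" if cov < 0 else "?"
        placed[group].append((mean_cm, contig, orientation))

    # Sort contigs along each linkage group by mean map position.
    return {
        group: [(contig, orient) for _, contig, orient in sorted(entries)]
        for group, entries in placed.items()
    }


# Toy markers with invented names and positions.
markers = [
    ("ctgB", 1000, "LG1", 12.0),
    ("ctgB", 9000, "LG1", 15.0),
    ("ctgA", 500, "LG1", 4.0),
    ("ctgA", 7000, "LG1", 1.0),
    ("ctgC", 2000, "LG1", 20.0),
]
layout = anchor_contigs(markers)
print(layout)
```

Here ctgA is placed first and flipped, ctgB follows in the forward direction, and ctgC can be placed but not oriented because it carries only one marker. A contig whose markers fall in two different linkage groups would appear in both, which in practice is a hint that the contig may be misassembled.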

Next

Go to TUTORIAL: Anchoring contigs/scaffolds using linkage maps.

Go back to the Table of contents.