Anchoring contigs scaffolds using linkage maps - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

Before we start . . .

We need to install a few package dependencies. This might take 5-10 minutes so we can start it during the intro. Run the below command

sudo apt-get install less texlive-full

Intro

Next generation sequencing and methods like RADseq or simply whole genome resequencing have made it readily possible to produce high density linkage maps for non-model organisms. Indeed many have been created, and it is often the case that a linkage map exists prior to the production of a reference genome. Although they are often produced with other aims in mind, they have great utility for genome assembly. Knowing the order of interspersed markers throughout the genome allows us to order and orient the contigs or scaffolds of a genome assembly to produce chromosome scale super-scaffolds.

Tutorial details

In this tutorial we will run through a short example using the yellow catfish Tachysurus fulvidraco genome assembly. Detailed methods can be found in the ALLMAPS paper (Tang et al. 2015), but briefly, Tang et al. 2015 sequenced the genome of the yellow catfish and then produced a linkage map with which to scaffold the assembly.

Sequence assembly Stats:

  • Technology used = Illumina HiSeq 2000
  • 9224 scaffolds (N50 = 1Mb), 718Mb in length
  • The 161 scaffolds which were longer than 1Mb were used for linkage map scaffolding.

Linkage maps:

  • 4 maps were made using a backcross model:
    • 1 Male map made with R/QTL (BCMale) - 2442 unique markers
    • 1 Male map made with JOINMAP (JMMale) - 2434 unique markers
    • 1 Female map made with R/QTL (BCFemale) - 2507 unique markers
    • 1 Female map made with JOINMAP (JMFemale) - 2495 unique markers

The (almost 10,000) sequences in these linkage maps were aligned to the scaffolds from the sequence assembly (using BWA) in a step prior to this tutorial. In this tutorial, we will just focus on the interesting bit, integrating the linkage maps and the genome assembly using ALLMAPS. In the interests of time, we will do this only for a subset of the data, using only maps "JMMale" and "JMFemale" and just for chromosome 23 (of 26). If you would like to run a larger example in your own time, you can find the whole dataset here.

Hands On

1. Make the working directory and move into it.

mkdir ALLMAPS
cd ALLMAPS

2. Now we will download some data to work with

wget https://www.dropbox.com/s/b497bi2b594m6fp/ALLMAPS-testdata.zip
unzip ALLMAPS-testdata.zip
cd ALLMAPS-testdata
ls

In this folder we have several important files (you can take a look inside using less if you want):

  • scaffolds.fasta - The file containing the sequence assembly scaffolds
  • JMMale.csv - the alignment locations of the markers in the male linkage map "JMMale"
  • JMFemale.csv - the alignment locations of the markers in the male linkage map "JMFemale"

3. Merge the two linkage maps together:

python -m jcvi.assembly.allmaps merge JMMale.csv JMFemale.csv -o JM-2.bed

This will produce the file "JM-2.bed" containing the merged alignment information for the linkage map markers and the "weights.txt" file which tells the program (in the next step) how to partition the genome into the correct number of chromosomes. If the number of linkage groups between maps differs we would have to reorder this so that the one with the correct number of linkage groups lies at the top. But we don't need to do anything for this example as both maps contain only 1 linkage group.

4. Run the scaffold ordering algorithm:

python -m jcvi.assembly.allmaps path JM-2.bed scaffolds.fasta

This should give you 5 extra files:

  • JM-2.fasta - the ordered chromosome sequences.
  • JM-2.agp - a genome structure format containing the order and orientations.
  • JM-2.chain - useful for more downstream analyses.
  • chr23.pdf - a pdf output for the analyses.
  • JM-2.summary.txt - a summary of the stats before and after merging.

Take a look at JM-2.summary.txt using less. You can see that stats are given per linkage map (top table) and then for the consensus (bottom table).

We can take a look at the Chr23.pdf (see below).

You will see that there are multiple ordered and oriented scaffolds on this linkage group, and importantly there are several regions where there is a disagreement between the linkage maps. This shows the utility of multiple maps, even if they are maps with made with the same data, programs may differ in their ability to circumvent some types of error. In the Tang et al. (2015) the show much better success when they use two programs to make maps, rather than just one.

Although we won't go into this here, ALLMAPS also contains modules which can estimate the length of the gaps of between sequence scaffolds, and also importantly, to potentially split chimeric contigs that result from errors in the sequence assembly - particularly in highly repetitive genomes. For more details on this and ALLMAPS in general check out the ALLMAPS github page.

Finally, there is another program that has recently been published doing a very similar thing, called Chromonomer, from Julian Catchen's lab. Notably, Chromonomer puts more emphasis on the correction of either the linkage map or the sequence assembly during the integration process, thus may give more reliable results, however the two are yet to be formally tested to my knowledge.

⚠️ **GitHub.com Fallback** ⚠️