02 Genome assembly - saltpinna/Genome_analysis_project GitHub Wiki

Canu was used to perform the genome assembly from the long PacBio reads. The script used for this can be found under code/scripts and is called Canu_script.sh. The prefix of the result files was set to "genome_assembly_canu", and the output directory and an approximate genome size were specified. The sequencing technology was set to PacBio and, lastly, the path to the directory containing the raw read files was given.
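As a rough illustration of the settings described above, the sketch below builds a Canu command line in Python without running it. Only the prefix comes from this page. The output directory, genome size and read path are hypothetical placeholders, and the actual Canu_script.sh may be written differently. The `-pacbio-raw` option is the form used by Canu 1.x; newer releases use `-pacbio` instead, so check the installed version.

```python
def build_canu_command(prefix, out_dir, genome_size, read_paths):
    """Return the argument list for a Canu run on raw PacBio reads."""
    return [
        "canu",
        "-p", prefix,                 # prefix of the result files
        "-d", out_dir,                # output directory
        f"genomeSize={genome_size}",  # approximate genome size
        "-pacbio-raw", *read_paths,   # sequencing technology + raw read files
    ]


cmd = build_canu_command(
    prefix="genome_assembly_canu",        # taken from this page
    out_dir="canu_out",                   # hypothetical placeholder
    genome_size="3m",                     # hypothetical placeholder
    read_paths=["raw_reads/*.fastq.gz"],  # hypothetical placeholder
)
print(" ".join(cmd))
```

The printed string could be pasted into a shell script or a job submission file; building it as a list first makes each setting easy to check.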

The assembly resulted in 9 contigs, 0 bubbles and 36 unassembled sequences. The longest contig was 2775620 nt, which corresponds quite well to the chromosome size reported in the article (2.765 million nt, a difference of about 0.4%) and suggests that this contig is the chromosome of the bacterium. The shorter contigs are likely to correspond to plasmids.
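Contig counts and lengths like those above can be checked directly from the assembly FASTA. The following is a minimal sketch that works on FASTA text held in a string. The tiny example records are made up for illustration, and the function names are my own rather than part of Canu.

```python
def contig_lengths(fasta_text):
    """Map each FASTA record name to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # record name up to the first space
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def relative_difference(observed, expected):
    """Absolute difference as a fraction of the expected value."""
    return abs(observed - expected) / expected


# Made-up two-record example, not the real assembly.
example = ">tig1 len=8\nACGTACGT\n>tig2\nACG\nT\n"
lengths = contig_lengths(example)
print(len(lengths), "contigs, longest:", max(lengths.values()))

# Longest contig from this page versus the chromosome size from the article.
print(f"{relative_difference(2_775_620, 2_765_000):.2%}")
```

For a real run, the text of the contigs FASTA file written by Canu would be read from disk and passed to `contig_lengths`.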

Questions

What information can you get from the plots and reports given by the assembler (if you get any)?

The report file gives information on the distribution of read lengths, k-mers in raw and corrected reads as well as information on contigs and read overlaps.

What intermediate steps generate informative output about the assembly?

One intermediate step is the correction of the reads, in which the reads are replaced with less noisy consensus sequences computed from overlapping reads (1). This generates valuable information about how many reads were corrected and how many reads were rescued and therefore do not contribute to the correction of the reads making up the longest contig.

How many contigs do you expect? How many do you obtain?

In the article, the authors found one chromosome and six plasmids, corresponding to seven contigs. The Canu assembly produced 9 contigs. This indicates that the assembler did not reconstruct every replicon as a single intact contig and instead split at least one of them into separate contigs. This could be related to the 36 unassembled sequences reported by the assembler.

What is the difference between a ‘contig’ and a ‘unitig’?

A unitig is an assembled sequence without competing choices, indicating that it is most likely a correct assembly. A contig consists of multiple unitigs that together still form a contiguous sequence, but with slightly lower confidence (2).

What is the difference between a ‘contig’ and a ‘scaffold’?

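A scaffold consists of multiple contigs that have been ordered and oriented so that they span a longer interval. A contig is a contiguous sequence without gaps, whereas a scaffold has lower confidence and may contain gaps of unknown sequence between its contigs.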
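Under the common convention that gaps between the contigs of a scaffold are written as runs of N, a scaffold can be split back into its contigs. The snippet below is a small sketch of that idea using a made-up scaffold string; it is not part of the Canu workflow, which outputs contigs rather than scaffolds.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold into contigs at runs of N (gaps of unknown sequence)."""
    return [piece for piece in re.split(r"N+", scaffold.upper()) if piece]


# Made-up scaffold: three contigs joined by two gaps.
scaffold = "ACGTACNNNNGGCCTTNNTTAGC"
print(split_scaffold(scaffold))
```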

What are k-mers? What k-mer(s) should you use? What are the problems and benefits of choosing a small k-mer? And a big k-mer?

K-mers are contiguous sequences of k bases that are used to find overlaps between sequence fragments in order to assemble longer sequences such as unitigs and contigs. The k-mer size used in the different assembly steps influences the result: a small k-mer might not be specific enough and can produce overlaps between reads that do not actually belong together. A k-mer that is too long might instead be too specific, so that true overlaps between reads, unitigs or contigs are missed. Long k-mers handle repeat regions better, though, since a long k-mer is more likely than a short one to span a whole repeat.
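The trade-off can be illustrated with a toy example. The reads below are made up: read_a and read_b truly overlap by five bases (TTGCA), while read_c is unrelated. This sketch only compares exact k-mers on one strand; real assemblers also handle reverse complements and sequencing errors.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Return the k-mers that occur in both sequences."""
    return kmers(a, k) & kmers(b, k)


read_a = "ACGTTGCA"
read_b = "TTGCAGGA"  # true overlap with read_a: TTGCA
read_c = "CCACGGAT"  # unrelated read

for k in (3, 5, 6):
    print(k, shared_kmers(read_a, read_b, k), shared_kmers(read_a, read_c, k))
```

With k = 3 the unrelated read_c shares a k-mer with read_a (a spurious hit), with k = 5 only the true overlap is found, and with k = 6 the k-mer is longer than the overlap, so the true overlap is missed.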

Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

Read correction means that the reads are replaced with less noisy consensus sequences based on overlapping reads, in order to increase the accuracy of the bases (1).
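The idea of a consensus can be sketched with a simple column-wise majority vote. This is a deliberately simplified toy that assumes the reads are already aligned, have equal length and contain no insertions or deletions; it is not Canu's actual correction algorithm, and the reads are made up.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Return the most common base at each position of pre-aligned reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # majority base in this column
        for column in zip(*aligned_reads)
    )


# Each noisy read has at most one error, at a different position.
reads = ["ACGTACGT", "ACGTTCGT", "ACGTACGT", "ACCTACGT"]
print(column_consensus(reads))
```

Because the errors fall at different positions in different reads, the majority base in every column recovers the underlying sequence.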

How different do different assemblers perform for the same data?

I have only performed assembly with Canu for the PacBio data, so I cannot compare the performance of different assemblers on the same data. I can, however, imagine that different assemblers use different algorithms and may therefore perform differently even when given the same input data.

Can you see any other letter apart from AGTC in your assembly? If so, what are those?

I do not find any other letters in my assembly, but I can imagine it would be possible to see the letter N for bases that couldn't be identified during sequencing.
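A check like this can be done by counting every character in the sequence lines that is not A, C, G or T. The sketch below works on FASTA text in a string; the example records are made up and the function name is my own.

```python
from collections import Counter


def non_acgt_counts(fasta_text):
    """Count characters other than A, C, G, T in the sequence lines."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip header lines
        counts.update(ch for ch in line.strip().upper() if ch not in "ACGT")
    return dict(counts)


# Made-up example containing N and two IUPAC ambiguity codes (R, Y).
example_fasta = ">c1\nACGTNNAC\n>c2\nacgtRy\n"
print(non_acgt_counts(example_fasta))
```

An empty dictionary means the assembly contains only A, C, G and T.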

References:

  1. https://canu.readthedocs.io/en/latest/index.html
  2. https://www.nbic.nl/uploads/media/PacBio_Hybrid_Assembly_Practical.pdf, Netherlands Bioinformatics Centre (2013)