Genome Assembly - MaryamDost/GenomeAnalysis GitHub Wiki

Genome Assembly - Canu

The Leptospirillum ferriphilum genome was sequenced using PacBio. The raw long reads from the sequencing run were provided for analysis, with the aim of reproducing the results presented in the paper.

Canu is a good choice for assembling PacBio sequences because, as stated in its documentation, it operates in three phases: correction, trimming and assembly. The correction phase improves the accuracy of the bases in the reads, and the trimming phase trims the reads to remove suspicious regions such as remaining SMRTbell adapters. These two steps improve the quality of the reads. The assembly phase orders the reads into contigs, generates consensus sequences and creates graphs of alternate paths. Since this project is about assembling a prokaryotic genome from PacBio data, Canu is a relevant choice. The script for how to run Canu is found here.
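As a rough illustration of how such a run might be launched, the sketch below builds a Canu command line in Python without executing it. It is not the script linked above. The prefix, output directory, read file name and the `genomeSize` value of 2.6m are assumptions made for this example. `-pacbio-raw` is the Canu 1.x option for uncorrected PacBio reads; newer releases may expect `-pacbio` instead, so check the documentation of the installed version.

```python
import shlex


def build_canu_command(prefix, out_dir, genome_size, reads):
    """Return the argument list for a single Canu run on raw PacBio reads."""
    return [
        "canu",
        "-p", prefix,                 # prefix for the output file names
        "-d", out_dir,                # directory that receives all Canu output
        f"genomeSize={genome_size}",  # rough estimate of the genome size
        "-pacbio-raw", *reads,        # uncorrected PacBio reads (Canu 1.x option name)
    ]


# All names and values below are hypothetical placeholders.
cmd = build_canu_command("lferr", "canu_out", "2.6m", ["pacbio_reads.fastq.gz"])
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where Canu is installed.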

Result

The Canu-assembly produced three different contigs:

Contig	Length (bp)	GC content (%)
tig00000001	2575927	54.2
tig00000002	46463	48.2
tig00000008	2751	1.2
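Values like those in the table can be recomputed from the assembly FASTA file. The following is a minimal sketch, assuming a plain FASTA input; the function names and the toy records are invented for the example and are not taken from the project's scripts.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_percent(seq):
    """Percentage of G and C among the A, C, G and T bases of seq."""
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Toy input standing in for the assembly FASTA (not real data).
toy = [">tigA len=8", "GGCC", "ATAT", ">tigB", "AAAA"]
for name, seq in read_fasta(toy).items():
    print(name, len(seq), round(gc_percent(seq), 1))
```

For a real file, the lines can be supplied with `open("assembly.fasta")`, where the file name is again a placeholder.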

Discussion

tig00000001 has almost the same size as the reference genome (2,569,357 bp) and very likely contains the genome. tig00000002 is somewhat longer than the phage reference genome (41,141 bp) but most probably corresponds to it. tig00000008 is very short and also has a very low GC content (indicating instability), so it is probably a result of sequencing errors or contamination.
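The size comparison can be made explicit with a short calculation. The sketch below uses only the lengths given in the table and in this section; the function name is made up for the example. It puts tig00000001 at roughly 0.26% above the chromosome reference and tig00000002 at roughly 12.9% above the phage reference.

```python
def percent_difference(observed, reference):
    """Signed size difference of observed relative to reference, in percent."""
    return 100.0 * (observed - reference) / reference


# Lengths in bp, taken from the result table and the reference sizes quoted above.
chromosome = percent_difference(2575927, 2569357)
phage = percent_difference(46463, 41141)

print(f"tig00000001 vs chromosome reference: {chromosome:+.2f}%")
print(f"tig00000002 vs phage reference: {phage:+.2f}%")
```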

Lab-manual question

What information can you get from the plots and reports given by the assembler (if you get any)?

I got a file, WGS_assembly.report, where I can see, among other things, the number of bases, the coverage, the number of reads, the number of overlaps and the contig lengths. It contains several histograms, for example a histogram of read lengths, a histogram for the unitigging process and a histogram of k-mers in the raw and corrected reads. It also has a summary of overlaps and corrected data, some trimming information and a summary of contig lengths.

What intermediate steps generate informative output about the assembly?

Correction, trimming and unitigging.

How many contigs do you expect? How many do you obtain?

I expected at least two, because that is the number of contigs obtained in the article. I obtained three: the large one corresponds to the bacterial chromosome, and the other two are much shorter. One of the short contigs is probably from a phage, and the smallest is probably an artifact, as noted in the Discussion.

What is the difference between a ‘contig’ and a ‘unitig’?

A contig is a contiguous sequence generated from a series of overlapping sequence reads that together form an assembled region of a genome. A contig can also be described as a contiguous sequence of ordered unitigs.

A unitig is a special kind of contig: a high-confidence contig. A unitig can only be assembled in one way, in contrast to a contig, which can be assembled in more than one way. A unitig is a set of overlapping sequences that together form a single sequence and that do not overlap fragments with conflicting sequences. Contigs may contain unitigs.

What is the difference between a ‘contig’ and a ‘scaffold’?

A scaffold consists of contigs and the gaps between them. As mentioned in the previous answer, a contig is a contiguous sequence generated from a series of overlapping sequence reads and therefore contains no gaps.
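To make the distinction concrete, here is a minimal sketch of how a scaffold could be written out as a single sequence, assuming the order of the contigs and the gap sizes are already known. The function name, the toy contigs and the gap lengths are invented for the example; gaps are filled with N, the usual symbol for an unknown base.

```python
def build_scaffold(contigs, gap_lengths):
    """Join ordered contigs, filling each gap with a run of N characters."""
    if len(gap_lengths) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_lengths, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Three toy contigs separated by gaps of 3 and 5 unknown bases.
print(build_scaffold(["ACGT", "GGCA", "TTAG"], [3, 5]))
```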

What are the k-mers? What k-mer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big k-mer?

A k-mer is a substring of length k taken from a sequence. Which k-mer size to use depends on the sequence complexity and the read length. Any k-mer should be evenly represented across the length of the sequence.
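The definition can be illustrated with a few lines of Python. This is a minimal sketch with a made-up function name and a toy sequence; a sequence of length L contains L - k + 1 k-mers, counted with repeats.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# A toy sequence of length 7 contains 7 - 3 + 1 = 5 3-mers.
counts = count_kmers("ATGATGA", 3)
print(counts)
```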

The benefit of a small k-mer is that you find many overlaps; the problem is that it may also give false overlaps. The benefit of a large k-mer is that the overlaps found are more likely to be true, because it is very unlikely that two long sequences match by chance. The problem is that you will miss short overlaps.
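The trade-off can be seen on a toy example. In the sketch below, all reads are invented: `read_b` truly overlaps `read_a` by 8 bases, `read_d` truly overlaps it by only 5 bases, and `read_c` is unrelated. Shared k-mers are used as a crude stand-in for overlap detection, which is a simplification of what a real assembler does.

```python
def kmer_set(seq, k):
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Return the k-mers that occur in both a and b."""
    return kmer_set(a, k) & kmer_set(b, k)


read_a = "CCTGAGATTACAG"
read_b = "GATTACAGTTCCA"  # starts with the last 8 bases of read_a
read_c = "AAGGCCTTTGGC"   # unrelated, shares "CCT" with read_a by chance
read_d = "TACAGCCGTT"     # starts with the last 5 bases of read_a

for k in (3, 8):
    print(f"k={k}")
    for name, read in (("read_b", read_b), ("read_c", read_c), ("read_d", read_d)):
        print(" ", name, sorted(shared_kmers(read_a, read, k)))
```

With k = 3 the unrelated `read_c` shares a k-mer with `read_a` (a false hit), while with k = 8 only `read_b` is found and the short true overlap with `read_d` is missed.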

Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

After the sequence has been trimmed and the remaining adapters removed, it usually undergoes the read-correction step. This means that the program identifies sequencing errors and either fixes them, by replacing them with the consensus sequence, or removes them.
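The idea of fixing errors with a consensus can be sketched as a majority vote per position. This is a heavily simplified illustration and not how Canu implements correction: it assumes the reads are already aligned, have equal length and contain only substitution errors. The function names and toy reads are invented for the example.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per column of equal-length, already aligned reads."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("need at least one read, all of the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def corrections(read, cons):
    """List (position, old_base, new_base) where read differs from cons."""
    return [(i, old, new) for i, (old, new) in enumerate(zip(read, cons)) if old != new]


# Four toy reads covering the same region; two of them carry one error each.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC"]
cons = consensus(reads)
print(cons)
for read in reads:
    print(read, corrections(read, cons))
```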

How different do different assemblers perform for the same data?

The difference in performance between assemblers depends on the distinctive properties of the genome. There are many such properties; take, for example, a genome that contains multiple copies of some region, i.e. repeats. Repeats hugely influence the assembly result, because an assembler that cannot distinguish these regions will probably produce misassemblies. In this case we need long-read sequencing technology, so that the reads are long enough to also include the unique sequences flanking the repeats.
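The effect of repeats on read placement can be shown with a toy genome. In this sketch, the genome, the repeat and both reads are invented, and exact string matching stands in for read alignment.

```python
def find_all(genome, read):
    """Return every start position at which read occurs exactly in genome."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


repeat = "GATTACA"
# Toy genome with two copies of the repeat separated by unique sequence.
genome = "CCGT" + repeat + "TTGGC" + repeat + "AAGCT"

short_read = "ATTAC"                 # lies entirely inside the repeat
long_read = "CGT" + repeat + "TTG"   # spans the repeat plus unique flanks

print("short read:", find_all(genome, short_read))
print("long read: ", find_all(genome, long_read))
```

The short read matches both copies of the repeat and cannot be placed unambiguously, while the long read matches only one position because it includes the flanking unique sequence.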

Can you see any other letter apart from AGTC in your assembly? If so, what are those?

No, I cannot see any letter other than A, G, T and C. However, FASTA files can sometimes contain other letters such as N, which indicates that the position could be any of the nucleotides A, T, C or G.
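One way to check this is to count every character in the sequence lines that is not A, C, G or T. The following is a minimal sketch with a made-up function name and toy input; in the toy data, N stands for any base and R is the IUPAC code for A or G.

```python
from collections import Counter


def non_acgt_counts(fasta_lines):
    """Count characters other than A, C, G and T in the sequence lines."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        counts.update(ch for ch in line.upper() if ch not in "ACGT")
    return counts


# Toy input standing in for an assembly FASTA (not real data).
toy = [">contig1", "ACGTNNAC", "GGRTA", ">contig2", "ACGT"]
print(dict(non_acgt_counts(toy)))
```

An empty result means the assembly contains only A, C, G and T.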