Assembly Evaluation - MaryamDost/GenomeAnalysis GitHub Wiki

Assembly Evaluation

The quality of the assembly had to be assessed to determine whether it met the quality requirements. The assembled genome was compared with the reference genome by aligning the two sequences and evaluating the result. The following tools were used:

QUAST (Quality Assessment Tool for Genome Assemblies) evaluates and compares genome assemblies using a range of metrics. The detailed script for running QUAST is in the Code directory.

MUMmer was used to find the maximal exact matches between the sequences. Mummerplot, a tool that generates a dot plot representation of the comparison, was used to visualize the result produced by MUMmer. The detailed script for running MUMmer and producing the mummerplot is in the Code directory.

Result

Table 2. Selected quality statistics from the report obtained with QUAST.

Statistic	WGS_assembly.contigs
Largest alignment	146367
# contigs	3
Largest contig	2575927
Total length	2625141
Reference length	2610531
GC (%)	54.06
Reference GC (%)	54.14
N50	2575927
NG50	2575927
Genome fraction (%)	97.567
Duplication ratio	1.011
Largest alignment	2452183
Total aligned length	2575885
# misassemblies	2
NA75	2452183
NGA75	2452183
LA75	1
LGA75	1
# fully unaligned contigs	2

More statistics are found in the report, genomeAssemblyQC_Qustreport.pdf, obtained with QUAST.

Figure ?

Discussion

According to the NCBI genome list, the GC content of Leptospirillum is around 50 %, which is consistent with our GC value of 54.06 % (reference: 54.14 %). The genome fraction is high (97.567 %) and the duplication ratio is close to 1 (1.011), which indicates that the contigs cover almost the entire reference genome with very little redundant overlap. The similarities are discussed further under Lab manual questions.
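As an illustration of the GC comparison above, the following is a minimal Python sketch of how a GC percentage can be computed from a sequence. The function name `gc_percent` and the toy sequence are hypothetical, and ignoring ambiguous bases such as N is an assumption of this sketch, not a description of how QUAST computes the value.

```python
def gc_percent(sequence: str) -> float:
    """Return the GC content of a DNA sequence as a percentage.

    Only A, C, G and T are counted; ambiguous bases such as N are
    ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Toy sequence for illustration only, not taken from the assembly.
example = "ATGCGCNNTA"
print(f"GC content: {gc_percent(example):.2f} %")
```

Applied to the assembled contigs and to the reference, a function like this should give values close to the two GC rows in Table 2.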

A dot plot of all the MUMs between two sequences can reveal their macroscopic similarity. The mummerplot shows that the genome was assembled as the complementary DNA strand and, because the genome is circular, that the assembly starts at a different position than the reference. A more similar plot could be obtained by using the same DNA strand as the reference genome (i.e. reverse-complementing the assembly) and setting the assembly to start at the same position. However, this is not necessary, as the desired result has already been obtained.
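The reorientation described above can be sketched in Python. This is only an illustration on a toy sequence, not the procedure used in this project: the helper names (`reverse_complement`, `rotate`, `reorient`) and the idea of locating the start with a short anchor taken from the beginning of the reference are assumptions, and the sketch assumes the assembly is known to be on the opposite strand.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def rotate(seq: str, start: int) -> str:
    """Rotate a circular sequence so that it begins at index `start`."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def reorient(assembly: str, reference: str, anchor_len: int = 4) -> str:
    """Flip the assembly to the reference strand and rotate it so that it
    starts with the first `anchor_len` bases of the reference."""
    flipped = reverse_complement(assembly)
    anchor = reference[:anchor_len]
    # Search the doubled sequence so an anchor spanning the junction
    # of the circular genome is still found.
    position = (flipped + flipped).find(anchor)
    if position == -1:
        raise ValueError("anchor not found in assembly")
    return rotate(flipped, position)


# Toy circular "genome": the assembly is the reference on the opposite
# strand, starting at a different position.
reference = "ATGGCCTTAGC"
assembly = reverse_complement(rotate(reference, 4))
print(reorient(assembly, reference) == reference)
```

On a real genome the anchor would need to be long enough to be unique, and the two sequences would not be identical, so this only conveys the idea.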

Based on the discussion above and the lab manual questions below, we can conclude that the quality of the obtained genome assembly is good enough for further analysis.

Lab manual questions

What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)?

N50, N90: when the contigs are sorted from longest to shortest, Nx is the length of the shortest contig in the set of longest contigs that together cover x percent of the bases of the assembly. The numbers 50 and 90 are this percentage. These measures help to evaluate the contiguity of the assembly. Assessing the overall quality of the assembly, though, depends on many different factors. In our assembly the largest contig alone covers more than 50 percent of the genome (N50 equals the largest contig, 2575927 bp), which is a further indication that this contig contains the bacterial genome. However, we must also compare the total size, the number of ORFs and other properties such as GC content with the reference genome, or with genomes from similar species if no reference genome is available. Another thing to check is whether the assembly and the read data are consistent, e.g. deep coverage, consistently mapped read pairs and few mismatches.
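The definition above can be written as a short Python sketch. The function name `nx` and the toy contig lengths are hypothetical; the optional `total` argument is an assumption of this sketch that lets the same function compute NG-style values against a reference length.

```python
def nx(lengths, fraction=0.5, total=None):
    """Return the Nx value for a list of contig lengths.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches `fraction` of `total`. If `total` is None,
    the sum of the contig lengths is used (N50, N90, ...); passing a
    reference length instead gives the NG variant (NG50, ...).
    Returns None if the contigs never reach the target.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    if total is None:
        total = sum(lengths)
    target = fraction * total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


# Toy contig lengths for illustration only.
toy = [80, 70, 50, 40, 30, 20]
print("N50:", nx(toy, 0.5))
print("N90:", nx(toy, 0.9))

# Values from Table 2: the largest contig alone exceeds half of the
# total length, so it determines N50 whatever the other contigs are.
print("N50 (Table 2):", nx([2575927], 0.5, total=2625141))
print("NG50 (Table 2):", nx([2575927], 0.5, total=2610531))
```

Because a single contig dominates this assembly, N50 and NG50 both equal the largest contig, which is why these measures say little on their own here.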

How does your assembly compare with the reference assembly? What can have caused the differences?

The assembly compares very well to the reference genome from the article. The total length is almost the same, the contig containing the genome is of nearly the same length, and the GC content is nearly identical. The small differences between these values could be due to the use of different methods. The mummerplot shows that the exact matches between the sequences span the genome. The difference is that my assembly represents the complementary DNA strand.
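To put numbers on "almost the same", the Table 2 values can be compared directly. The following Python sketch uses only figures from Table 2; the helper name `percent_difference` is hypothetical.

```python
def percent_difference(value: float, reference: float) -> float:
    """Return the difference of `value` from `reference`, in percent
    of the reference."""
    return 100.0 * (value - reference) / reference


# Values taken from Table 2.
total_length = 2625141
reference_length = 2610531
gc = 54.06
reference_gc = 54.14

length_diff = percent_difference(total_length, reference_length)
gc_diff = gc - reference_gc  # difference in percentage points

print(f"Total length differs from the reference by {length_diff:.2f} %")
print(f"GC content differs by {gc_diff:.2f} percentage points")
```

Both differences are well below one percent, which supports the statement that the assembly and the reference are very similar in size and composition.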

Why do you think your assembly is better/worse than the public one?

The assembled sequence and its main contig are longer than those of the reference, which is better, but other than that I do not see any remarkable differences between the assemblies. Prokaryotic genomes are comparatively easy to assemble because they are small and circular. A similar result could therefore be produced even with different tools and techniques.

References

Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, Volume 29, Issue 8, 15 April 2013. https://academic.oup.com/bioinformatics/article/29/8/1072/228832