02 Polishing & Assessment - Linafina100/GenomeAnalysis GitHub Wiki

Assembly Polishing and Assessment

1. Polishing with Pilon

Draft assemblies generated from long-read sequencing technologies such as Nanopore often contain a relatively high rate of sequence errors, which is consistent with the read quality seen in the FastQC report (see 01 Pre-processing). The goal of this step was to correct these errors and improve the overall accuracy of Chromosome 3 before annotation.

Polishing was performed with Pilon. The more accurate short Illumina reads were mapped back to the Flye draft assembly, and Pilon used these alignments to identify and correct sequence errors and local inconsistencies in the draft. The exact script used (5_pilon_polishing_chr3.sh) can be found in the repository's code directory.
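The general shape of such a polishing round can be sketched as follows. This is only an illustration and not the contents of 5_pilon_polishing_chr3.sh: the file names, the thread count, the use of BWA-MEM as the aligner, and the assumption that a `pilon` launcher is on the PATH are all hypothetical. The helper only builds and prints the commands; it does not run them.

```python
import shlex


def build_polishing_commands(draft, reads_1, reads_2, out_prefix, threads=4):
    """Return the shell commands for one round of Illumina-based Pilon polishing.

    All paths are placeholders chosen for illustration.
    """
    q = shlex.quote
    bam = f"{out_prefix}.sorted.bam"
    return [
        # Index the draft assembly so that BWA can align reads against it.
        f"bwa index {q(draft)}",
        # Align the paired-end reads and write a coordinate-sorted BAM file.
        f"bwa mem -t {threads} {q(draft)} {q(reads_1)} {q(reads_2)} "
        f"| samtools sort -@ {threads} -o {q(bam)} -",
        # Pilon expects the BAM file to be indexed.
        f"samtools index {q(bam)}",
        # Correct the draft from the alignments; --changes writes a list of edits.
        f"pilon --genome {q(draft)} --frags {q(bam)} "
        f"--output {q(out_prefix)} --changes --threads {threads}",
    ]


# Hypothetical input names, not the files used in this project.
commands = build_polishing_commands(
    "flye_chr3.fasta", "illumina_R1.fastq.gz", "illumina_R2.fastq.gz", "chr3_pilon"
)
for command in commands:
    print(command)
```

With these placeholder names, Pilon would write the corrected sequence to chr3_pilon.fasta and the list of edits to chr3_pilon.changes.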

2. Assembly Evaluation

2.1 Comparing Pre- and Post-Polishing Assemblies with QUAST

To determine whether the polishing step improved the assembly, we compared the draft and polished FASTA files using QUAST. The resulting statistics can be seen in Table 1 below.

Table 1. Summary of QUAST statistics, comparing the initial Flye assembly with the polished assembly.

The total length increased by roughly 20 kb, suggesting that Pilon restored bases that were missing from the draft (erroneous deletions). The L50 remained stable, meaning that polishing neither broke the contigs apart nor extended them. Because we assembled only chromosome 3, these numbers are hard to compare with the reference assembly, which covers the whole genome. The GC content of 41.96% is consistent with the 41.86% GC content of the whole-genome analysis.
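To make the metrics discussed above concrete, here is a minimal sketch of how total length, N50, L50 and GC content can be computed from a set of contigs. It is not QUAST's implementation. The contigs are made-up toy sequences rather than project data, and GC content is taken here as G plus C divided by total length.

```python
def assembly_stats(contigs):
    """Compute total length, N50, L50 and GC% for a list of contig sequences."""
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)

    # Walk from the longest contig down until half of the total length is covered.
    running = 0
    n50 = l50 = 0
    for rank, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, rank  # N50 is a length, L50 is a contig count
            break

    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    gc_percent = 100 * gc_bases / total if total else 0.0
    return {"total_length": total, "n50": n50, "l50": l50, "gc_percent": gc_percent}


# Toy contigs of length 8, 6, 4 and 2 (not real data).
toy_contigs = ["ACGTACGT", "GGCCAT", "ATAT", "GC"]
stats = assembly_stats(toy_contigs)
print(stats)
```

In this toy case the two longest contigs (8 + 6 = 14 bases) are needed to cover half of the 20 bases, so L50 is 2 and N50 is 6. A polishing step that only edits bases inside contigs changes the total length slightly but leaves L50 unchanged, which matches the pattern described above.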

Figure 1. Cumulative length of the initial assembly (draft) vs the polished assembly.

The cumulative contig length plot in Figure 1 shows that the polished assembly closely follows the same distribution as the draft assembly, indicating that Pilon did not introduce major structural changes. The slight increase in total assembly length suggests that small errors such as deletions were corrected. Overall, polishing improved sequence accuracy while preserving the original contiguity of the assembly.
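A cumulative length curve of this kind is built by sorting the contigs from longest to shortest and summing their lengths step by step. The sketch below illustrates this with invented contig lengths; it is not the code QUAST uses to draw its plot.

```python
from itertools import accumulate


def cumulative_lengths(contig_lengths):
    """Return the running total of contig lengths, longest contig first."""
    return list(accumulate(sorted(contig_lengths, reverse=True)))


# Invented lengths: the "polished" set has one extra base in its longest contig.
draft_curve = cumulative_lengths([4, 10, 6])
polished_curve = cumulative_lengths([4, 11, 6])

# The gap between the curves at each contig index.
gaps = [p - d for d, p in zip(draft_curve, polished_curve)]
print(draft_curve, polished_curve, gaps)
```

When polishing only adds or removes a few bases per contig, the two curves have the same shape and end at slightly different totals, which is the behaviour described for Figure 1.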

3. Gene Completeness Assessment with BUSCO

To assess the biological quality and gene completeness of our polished Chromosome 3, BUSCO was run on the assembly using the embryophyta_odb10 dataset (which contains 1,614 highly conserved plant genes). Results can be seen below.

Table 2. BUSCO results of the polished assembly.

BUSCO Category	Count	Percentage
Complete BUSCOs (C)	152	9.4%
-- Complete and single-copy (S)	145	9.0%
-- Complete and duplicated (D)	7	0.4%
Fragmented BUSCOs (F)	7	0.4%
Missing BUSCOs (M)	1455	90.1%
Total BUSCO groups searched (n)	1614	100%

Summary Notation: C:9.4%[S:9.0%,D:0.4%],F:0.4%,M:90.1%,n:1614,E:3.9%
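The counts in Table 2 and the summary notation above can be cross-checked against each other. The following sketch parses the one-line notation and recomputes the percentages from the counts. The function name and the regular expressions are illustrative choices and not part of BUSCO itself.

```python
import re


def parse_busco_summary(line):
    """Parse a BUSCO one-line summary into percentages plus the group count n."""
    # Upper-case categories (C, S, D, F, M, E) are followed by a percentage.
    fields = {key: float(value)
              for key, value in re.findall(r"([A-Z]):(\d+(?:\.\d+)?)%", line)}
    # The number of searched BUSCO groups is a plain integer after "n:".
    fields["n"] = int(re.search(r"n:(\d+)", line).group(1))
    return fields


summary = parse_busco_summary("C:9.4%[S:9.0%,D:0.4%],F:0.4%,M:90.1%,n:1614,E:3.9%")

# Counts taken from Table 2.
counts = {"C": 152, "S": 145, "D": 7, "F": 7, "M": 1455}
recomputed = {key: round(100 * value / summary["n"], 1) for key, value in counts.items()}

print(summary)
print(recomputed)
```

The recomputed values (9.4, 9.0, 0.4, 0.4 and 90.1) agree with the reported percentages. The counts are also internally consistent: S + D = C, and C + F + M equals the 1,614 groups searched.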

The results above show that only 9.4% of the BUSCO genes in the embryophyta_odb10 dataset were found complete in the assembly. The 90.1% Missing (M) score is the expected result for this specific analysis. The embryophyta dataset checks for universal core genes that are spread across an entire plant genome, and because we are only analyzing Chromosome 3, the majority of those 1,614 core genes are presumably located on other chromosomes. Therefore, the high proportion of missing BUSCOs should not be interpreted as poor assembly quality, but as a result of analyzing a single chromosome instead of the whole genome. Finding 152 complete core genes indicates that a subset of conserved plant genes is present on the assembled chromosome. The low fraction of duplicated BUSCOs (0.4%) further suggests that the assembly is not affected by major redundancy. Overall, these results indicate that the assembly is structurally reasonable within the scope of chromosome 3, although BUSCO is not an ideal metric for assessing completeness in partial genome assemblies.