01 Assembly - Linafina100/GenomeAnalysis GitHub Wiki

1. Genome Assembly (Flye) - Nanopore

Chromosome 3 was assembled from long-read Nanopore sequencing data using Flye, a de novo genome assembler for long reads that can be applied to any genome. Flye tolerates noisy long reads and still produces contiguous assemblies, which makes it well suited to Nanopore data. The raw Nanopore reads were provided in FASTQ format and could therefore be used directly as input without additional trimming. The assembly was run with the --nano-raw option, which is intended for raw Nanopore reads. The analysis was run on the UPPMAX cluster using 16 CPU cores and 64 GB of memory, with a maximum runtime of 5 hours.
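As a rough illustration of the run described above, the Flye call can be put together in Python before it is submitted to the cluster. This is only a sketch: the file name `reads.fastq`, the output directory `flye_out`, and the helper `build_flye_command` are hypothetical placeholders and not part of Flye or of this project. The options `--nano-raw`, `--out-dir` and `--threads` are Flye command-line options.

```python
import shlex


def build_flye_command(reads, out_dir, threads=16):
    """Return the Flye command line as a list of arguments.

    `reads` and `out_dir` are placeholder paths chosen for this example.
    """
    return [
        "flye",
        "--nano-raw", str(reads),    # raw (uncorrected) Nanopore reads
        "--out-dir", str(out_dir),   # directory for all Flye outputs
        "--threads", str(threads),   # matches the 16 cores requested
    ]


if __name__ == "__main__":
    cmd = build_flye_command("reads.fastq", "flye_out")
    # Print the command so it can be pasted into a job script.
    print(shlex.join(cmd))
    # To actually run the assembly (requires Flye to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

The memory and runtime limits (64 GB, 5 hours) are not part of this command; they would typically be requested from the cluster's job scheduler in the job script.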

Flye performs several internal steps. During the read alignment stage, Flye aligns the reads to each other, producing alignment files that give an indication of coverage and of the consistency of the data. In the initial assembly step (00-assembly), draft contigs are generated. The consensus step (10-consensus) aligns the reads back to these contigs and computes a consensus sequence along with alignment statistics, providing information about error correction and sequence support. The repeat resolution step (20-repeat) generates assembly graphs (.gfa and .gv files), which show how contigs are connected and highlight unresolved repeats or ambiguities in the genome structure. The contigger step (30-contigger) outputs the final contigs and their statistics, which are the most important indicators of assembly quality. Finally, the internal polishing step (40-polishing) improves base accuracy and reports coverage. Together, these intermediate outputs allow evaluation of assembly progression, identification of problematic regions such as repeats, and assessment of both structural continuity and sequence accuracy before any further polishing.
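Because each step writes to its own subdirectory, a quick way to see how far a run has progressed is to check which of these directories exist and contain files. The sketch below assumes the stage directory names listed above; the function name `completed_stages` and the placeholder file used in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path

# Stage directory names as listed in the text above.
FLYE_STAGES = [
    "00-assembly",
    "10-consensus",
    "20-repeat",
    "30-contigger",
    "40-polishing",
]


def completed_stages(out_dir):
    """Return the stage directories that exist and are non-empty."""
    root = Path(out_dir)
    done = []
    for stage in FLYE_STAGES:
        stage_dir = root / stage
        if stage_dir.is_dir() and any(stage_dir.iterdir()):
            done.append(stage)
    return done


if __name__ == "__main__":
    # Demo on a temporary directory that mimics a partially finished run.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "00-assembly").mkdir()
        (Path(tmp) / "00-assembly" / "placeholder.txt").write_text("x")
        (Path(tmp) / "10-consensus").mkdir()  # created but still empty
        print(completed_stages(tmp))
```

A non-empty directory only shows that a step has started writing output, not that it finished successfully, so the Flye log should still be checked.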

1.1 Initial Assembly Statistics

The outputs from Flye provide important information about the quality and structure of the genome assembly. Flye writes a single statistics file with information such as contig lengths, coverage, and total assembly size, which helps assess the completeness and continuity of the assembly (see Table 1). The assembly graph (.gfa file) shows how contigs are connected and can reveal unresolved repeats or structural ambiguities (see Figure 1). Additionally, the log files report the parameters used, the read coverage, and the progress of each assembly step, which is useful for troubleshooting and reproducibility.

Assembly Graph from Flye

Figure 1. Assembly graph visualizing how contigs are connected, revealing structural ambiguities and repeats.

Table 1. Results from initial Flye assembly.

seq_name	length	cov.	circ.	repeat	mult.	alt_group	graph_path
contig_9	2922085	588	N	N	1	*	10,11,-10,9,*
contig_92	2296776	610	N	N	1	*	89,92
contig_113	1854480	574	N	N	1	*	*,113
contig_26	1724286	590	N	N	1	*	126,129,-27,26,89
contig_38	1265633	598	N	N	1	*	135,-136,135,38,73,-75,-73
contig_128	1082798	590	N	N	1	*	128
contig_2	1045829	594	N	N	1	*	94,95,-94,2
contig_74	619105	598	N	N	1	*	74,78
contig_3	490403	568	N	N	1	*	3,124,-6,124,-6,124
contig_125	423241	504	N	N	1	*	125,126
contig_7	402449	578	N	N	1	*	7,*
contig_132	387661	624	N	N	1	*	132
contig_14	351124	546	N	N	1	*	14
contig_62	304909	572	N	N	1	*	62,64
contig_58	255401	590	N	N	1	*	58
contig_104	229580	582	N	N	1	*	104,107,-108,107,-108,107
contig_4	115341	570	N	N	1	*	4
contig_37	104708	589	Y	N	1	*	37
contig_133	98649	628	N	N	1	*	-98,99,-138,-98,99,-138,-98,133
contig_55	97053	297	N	N	1	*	126,129,55,*
contig_137	89469	624	N	N	1	*	137
contig_32	87042	275	N	N	1	*	*,32,-34
contig_79	64772	164	N	N	1	*	*,79,*
contig_88	59236	295	N	N	1	*	88
contig_16	54334	390	N	N	1	*	16
contig_100	51504	314	N	N	1	*	100,138,-99,98,138,-99,98,138
contig_126	49605	452	N	Y
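
Summary values such as the number of contigs, total length, N50 and length-weighted mean coverage can be derived from a table like Table 1. The sketch below is one possible way to do this, assuming a tab-separated file with the column names shown in Table 1. The sample data is limited to the first three rows of the table, so the numbers it produces do not describe the full assembly. The function names are illustrative.

```python
import csv
import io

# First three rows of Table 1, used only as sample input.
SAMPLE = (
    "seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_9\t2922085\t588\tN\tN\t1\t*\t10,11,-10,9,*\n"
    "contig_92\t2296776\t610\tN\tN\t1\t*\t89,92\n"
    "contig_113\t1854480\t574\tN\tN\t1\t*\t*,113\n"
)


def read_contigs(handle):
    """Yield (name, length, coverage) from a tab-separated contig table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # Accept the name column with or without a leading '#'.
        name = row.get("seq_name") or row.get("#seq_name")
        yield name, int(row["length"]), int(row["cov."])


def n50(lengths):
    """Length of the contig at which half of the total length is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(contigs):
    """Return contig count, total length, N50 and length-weighted coverage."""
    contigs = list(contigs)
    lengths = [length for _, length, _ in contigs]
    total = sum(lengths)
    weighted = sum(l * c for _, l, c in contigs) / total if total else 0.0
    return {
        "n_contigs": len(contigs),
        "total_length": total,
        "n50": n50(lengths),
        "mean_cov": weighted,
    }


if __name__ == "__main__":
    print(summarize(read_contigs(io.StringIO(SAMPLE))))
```

Weighting the coverage by contig length keeps many short, low-coverage contigs from dominating the average.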

1.3 Discussion

The initial Flye assembly resolved the Nanopore reads into 60 contigs. The average coverage of these primary contigs is high, approximately 400x, which indicates high confidence in the overlap graph. Figure 1 suggests that the fragmentation of the main chromosome could be a result of small, repetitive nodes that Flye was unable to bridge. There are several long continuous paths representing the confidently assembled chromosomal sequences. However, these main paths often converge into complex, looping structures. These tangles represent repetitive elements where the assembler could not resolve a single clear path. Furthermore, the graph shows a distinct baseline of small, disconnected nodes, representing orphaned, low-coverage fragments that failed to anchor to the main assembly backbone. Nevertheless, the number of contigs is reasonably good for a plant genome, given that the genome of the moss Niphotrichum japonicum contains 34.74% repeat sequences.
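The disconnected nodes and loops seen in the graph can also be counted directly from a GFA file instead of being judged by eye. The following sketch reads GFA segment (`S`) and link (`L`) lines and reports segments without any link as well as segments linked to themselves. The sample graph and the segment names in it are invented for the example and are not taken from this assembly.

```python
# Invented three-segment graph in GFA format (tab-separated fields).
SAMPLE_GFA = (
    "H\tVN:Z:1.0\n"
    "S\tedge_1\tACGT\n"
    "S\tedge_2\tGGCC\n"
    "S\tedge_3\tTTAA\n"
    "L\tedge_1\t+\tedge_2\t+\t0M\n"
    "L\tedge_2\t+\tedge_2\t+\t0M\n"
)


def scan_gfa(lines):
    """Report isolated segments and self-loops in a GFA graph."""
    segments = set()
    linked = set()
    self_loops = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            segments.add(fields[1])
        elif fields[0] == "L" and len(fields) >= 4:
            source, target = fields[1], fields[3]
            linked.update((source, target))
            if source == target:
                self_loops.add(source)
    return {
        "n_segments": len(segments),
        "isolated": sorted(segments - linked),
        "self_loops": sorted(self_loops),
    }


if __name__ == "__main__":
    print(scan_gfa(SAMPLE_GFA.splitlines()))
```

Self-loops are only the simplest kind of tangle. Larger repeat structures involve several segments and would need a proper graph traversal to detect.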

Had a different assembler been used, the result would not have been exactly the same, because different assemblers are based on different algorithms and mathematical models. For instance, short-read assemblers often use de Bruijn graph approaches, which can lead to fragmented assemblies in repetitive regions. In contrast, long-read assemblers such as Flye use overlap-based methods that better resolve repeats and produce longer contigs. Even assemblers built on the same mathematical foundation can produce different results because of differences in heuristics, such as error filtering, repeat resolution, and coverage thresholds. As a result, assemblies may vary in contiguity, N50, and accuracy: some methods produce smaller, more accurate contigs, while others produce fewer, larger contigs but are more prone to errors.

De Bruijn graph approaches break reads into smaller units called k-mers and reconstruct sequences based on the overlaps between them. A k-mer is a short DNA substring of length k extracted from a read. For instance, the sequence ATCGCT with k = 3 would be divided into ATC, TCG, CGC, and GCT. The choice of k-mer size strongly affects assembly results. Small k-mers are more tolerant of sequencing errors and work well with low coverage, but they create complex and ambiguous graphs, often leading to fragmented assemblies. In contrast, large k-mers produce cleaner assemblies with longer contigs, but they are sensitive to sequencing errors and require high coverage to avoid gaps and assembly failure.
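The k-mer example above, and the effect of the choice of k, can be reproduced with a few lines of Python. This is a toy sketch for a single error-free sequence, not how a real assembler is implemented, and the function names are chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(seq, k):
    """Build a de Bruijn graph: each k-mer links its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers(seq, k):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    print(kmers("ATCGCT", 3))      # the four 3-mers from the text
    print(de_bruijn("ATCGCT", 3))  # every node has a single successor
    print(de_bruijn("ATCGCT", 2))  # node "C" has two successors
```

With k = 3 every node has exactly one successor, so the sequence can be read off unambiguously. With k = 2 the node C is followed by both G and T, which is a small instance of the ambiguity that short k-mers introduce.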

Flye avoids relying on exact k-mer matches by first approximating sequence overlaps. It can therefore handle noisier reads, and it corrects errors after a draft graph has been constructed. Other long-read assemblers, such as Canu, apply a pre-correction step before assembly. This step aligns reads to each other and generates a consensus sequence, in which true bases are reinforced and sequencing errors are removed. As a result, the corrected reads have higher accuracy, which improves the reliability and contiguity of the final assembly. Flye does not perform this step because it is designed to tolerate high error rates directly, making the process faster and less computationally demanding, while still achieving accurate results through post-assembly error correction.
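The idea behind consensus-based correction can be shown with a simple majority vote over reads that are already aligned. This is a deliberately simplified sketch and not the algorithm used by Canu or Flye: it assumes that all reads have the same length and are aligned column by column, with no insertions or deletions. The reads in the example are invented.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Return the most common base in each column of the aligned reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same aligned length")
    consensus = []
    for column in zip(*aligned_reads):
        # The base supported by most reads is taken as the true base.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    # Two of the four invented reads carry one sequencing error each.
    reads = ["ATCGCT", "ATCGCT", "ATGGCT", "ATCGAT"]
    print(column_consensus(reads))
```

Because each error occurs in only one read, the correct base wins the vote in every column. This is also why consensus correction depends on sufficient coverage.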