00 Pre‐Processing - Linafina100/GenomeAnalysis GitHub Wiki

The first phase of this project focused on establishing a reproducible and well-structured directory organization to manage the computational demands of genome assembly. Raw data were accessed through symbolic links, which preserved the integrity of the original files and kept the project within the 32 GB home directory limit on UPPMAX. Informative and consistent naming conventions were applied to all generated files, while original filenames were retained for raw data to ensure traceability.
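As a rough illustration of the symbolic-link approach, the sketch below links every file in a shared raw-data directory into a project directory while keeping the original filenames. The function name `link_raw_data`, the directory layout and the file name are hypothetical and are not taken from the project; the demonstration runs on throwaway temporary directories.

```python
import tempfile
from pathlib import Path


def link_raw_data(raw_dir: Path, project_dir: Path) -> list:
    """Create one symlink per raw file, keeping the original filename."""
    project_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sorted(raw_dir.iterdir()):
        if not src.is_file():
            continue
        link = project_dir / src.name  # same name as the raw file, for traceability
        if not link.is_symlink() and not link.exists():
            link.symlink_to(src.resolve())  # the link points at the original; nothing is copied
        links.append(link)
    return links


# Demonstration on temporary directories (all paths are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "shared_raw"
    raw.mkdir()
    (raw / "reads_R1.fastq.gz").write_bytes(b"placeholder")
    links = link_raw_data(raw, Path(tmp) / "project" / "data" / "raw")
    demo_ok = links[0].is_symlink() and links[0].read_bytes() == b"placeholder"
    print(links[0].name, "->", links[0].resolve())
```

Because only links live inside the project directory, the large files do not count toward the home directory quota, and the raw data cannot be overwritten by accident through a copy.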

Version control was implemented using Git, with a .gitignore file to separate source code and analysis outputs from large genomic datasets. This approach ensures a clean repository and aligns with best practices for managing bioinformatics workflows.

1 Initial Quality Control

Quality assessment of the raw sequencing data was performed using FastQC, run on 2 CPU cores. Both the chromosome 3 dataset and the whole-genome dataset (used in later analyses) were assessed. The FastQC reports for the chromosome 3 Illumina reads indicated high overall sequence quality (Figure 1). The per base sequence quality (Figure 1a and 1b) remains within the "green" zone (Phred score >28) for the majority of the read length, though a slight decay is visible at the 3' ends of the R2 reads, as is typical for Illumina technology. Figure 1c and 1d show no evidence of adapter contamination, indicating that a negligible proportion of DNA fragments are shorter than 150 bp, the read length reported in the study. This suggests that only minimal preprocessing is required prior to assembly. High-quality Illumina data matter for the subsequent analysis because sequencing errors, adapter contamination, and low-quality bases can otherwise lead to incorrect read mapping, biased coverage, or errors in genome assembly and gene annotation.
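A FastQC run of this kind could be assembled as in the sketch below. The input file names and the output directory are hypothetical placeholders, not the names used in the project; `--threads` and `--outdir` are standard FastQC options, and the thread count of 2 matches the 2 CPU cores mentioned above.

```python
import shlex


def fastqc_command(fastq_files, out_dir, threads=2):
    """Build the argument list for a FastQC run without executing it."""
    return [
        "fastqc",
        "--threads", str(threads),  # number of files FastQC processes in parallel
        "--outdir", str(out_dir),   # reports are written here; the directory must already exist
        *map(str, fastq_files),
    ]


# Hypothetical input names, for illustration only.
cmd = fastqc_command(["chr3_R1.fastq.gz", "chr3_R2.fastq.gz"], "fastqc_raw")
print(shlex.join(cmd))
```

On a system where FastQC is installed, the list can be passed to `subprocess.run(cmd, check=True)` to execute the analysis.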

Figure 1. Initial FastQC results for the Illumina reads. Panels a and b show per base sequence quality for the forward (R1) and reverse (R2) reads; panels c and d show adapter contamination for R1 and R2.

The FastQC reports for the Nanopore reads indicated lower and more variable sequence quality than the Illumina data. As shown in Figure 2a, median Phred scores are predominantly below 20 and vary considerably across read positions, which caused a "FAIL" for the per base sequence quality module. Several other modules were also flagged "FAIL", including GC content and sequence duplication. These deviations are expected for long-read Nanopore sequencing, which has higher error rates and more variable read lengths than short-read technologies. Since FastQC is optimized for short-read data, these flags do not necessarily indicate poor data quality. The reads are still fit for genome assembly, as long-read assemblers such as Flye are designed to handle these characteristics.
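To put the Phred thresholds used in this section into perspective, a Phred score Q corresponds to a per-base error probability of 10^(-Q/10). The short sketch below applies this standard definition to the scores mentioned in the text (28 for the Illumina "green" zone and 20 for the Nanopore medians); the function name is chosen here for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


for q in (20, 28):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

A score of 20 therefore means roughly one error per 100 bases, while a score of 28 means roughly one error per 630 bases, which is why median scores below 20 trigger a failure in a tool tuned for short reads.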

The sequence length distribution (Figure 2b) shows a wide range of read lengths, from 40 to 142,299 bp, with a large proportion of shorter reads and a long tail extending to reads above 100 kb. This broad distribution is characteristic of long-read sequencing and is advantageous for genome assembly, as longer reads can span repetitive regions and improve contiguity.

Despite the lower base-level accuracy, the Nanopore data are suitable for assembly given their read lengths and coverage. No additional trimming was performed, as the dataset was already provided in a clean format and long-read assembly tools are designed to handle the inherent error profile of Nanopore sequencing.



Figure 2. FastQC results for the Nanopore reads. Panel a shows the per base sequence quality and panel b shows the sequence length distribution.

2 Trimming

Illumina reads for chromosome 3 were preprocessed with Trimmomatic (v0.39) to remove low-quality bases from the read ends. Because the FastQC reports showed high read quality, conservative parameters were used. A sliding window of 4:15 was applied, meaning that a read was trimmed once the average Phred quality score within a 4-base window dropped below 15. Leading and trailing bases with quality scores below 3 were removed, so that only very low-quality bases at the read ends were discarded. Additionally, reads shorter than 36 bp after trimming were discarded. These relatively tolerant thresholds were chosen to preserve as much data as possible while still removing potentially error-prone bases, balancing accuracy against data retention.
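The effect of these four thresholds on a single read can be sketched as follows. This is a simplified illustration of the logic described above, not Trimmomatic's actual implementation, which differs in details such as exactly where the sliding-window cut is placed. The order in which the steps are applied is also an assumption, and the quality values in the demonstration are invented.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Return the (start, end) slice of a read to keep, or None if the read is dropped.

    quals is a list of per-base Phred scores, ordered from the 5' to the 3' end.
    """
    start, end = 0, len(quals)
    # LEADING: strip bases below the threshold from the 5' end.
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING: strip bases below the threshold from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose mean quality falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # MINLEN: discard reads that have become too short.
    if end - start < min_len:
        return None
    return start, end


# Invented example: two poor leading bases, 40 good bases, then a low-quality tail.
example = [2, 2] + [35] * 40 + [10] * 4 + [2]
kept = trim_read(example)
print(kept)
```

With thresholds this tolerant, only reads with a clearly degraded stretch lose bases, which is consistent with the aim of preserving as much data as possible.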

An initial trimming run was interrupted by a lost SSH connection, and no log file was generated. To obtain trimming statistics, the trimming step was repeated with output logging. The Trimmomatic statistics in Table 1 show that 20,979,851 read pairs were processed, of which 98.85% survived with both reads intact. A small proportion survived as singletons (0.70% forward only and 0.42% reverse only), while only 0.03% (6,766) of read pairs were dropped entirely. Because so few pairs were lost, the trimming is unlikely to bias downstream analyses. Dropping these reads improves overall data quality by eliminating sequences that could otherwise introduce errors during read mapping, genome assembly and annotation.

Table 1. Trimmomatic output statistics.

Category	Read pairs	Percentage	Description
Both Surviving	20,737,992	98.85%	Both R1 and R2 passed quality/length filters.
Forward Only	147,044	0.70%	Only the R1 read passed; R2 was dropped.
Reverse Only	88,049	0.42%	Only the R2 read passed; R1 was dropped.
Dropped	6,766	0.03%	Both reads in the pair failed quality/length filters.

These results confirm that the input data were of high quality and required only minimal trimming, consistent with the initial FastQC analysis.
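As a quick consistency check, the percentages in Table 1 can be recomputed from the reported counts. The sketch below uses only the four counts from the table; the dictionary keys are names chosen here for illustration.

```python
# Read-pair counts as reported in Table 1.
counts = {
    "both_surviving": 20_737_992,
    "forward_only": 147_044,
    "reverse_only": 88_049,
    "dropped": 6_766,
}

total = sum(counts.values())  # should equal the number of input read pairs
percentages = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(f"Input read pairs: {total:,}")
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

The four categories add up to the 20,979,851 input read pairs, and the recomputed percentages agree with the values in the table.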

3 Post-trimming FastQC Quality Control

A second quality assessment was performed using FastQC on the trimmed paired-end reads. This step was conducted to verify that sequence quality remained high after trimming and that the reads were ready for genome assembly.

The post-trimming FastQC analysis (Figure 3a-d) confirms that the sequencing data remained of high quality after preprocessing. Both forward and reverse reads show consistently high per base quality scores across the read length, with only a slight decline toward the ends, which is expected for Illumina data. No adapter contamination was detected, indicating that trimming was appropriately conservative and did not introduce artifacts. Overall, the results confirm that the dataset was already of high quality and required only minimal trimming, and that the processed reads are well suited for downstream genome assembly.

Figure 3. FastQC results after trimming. Panels a and b show per base sequence quality for the forward (R1) and reverse (R2) reads; panels c and d show adapter contamination for R1 and R2.