Project Plan - linnasp/GenomeAnalysis_VT2026_Lab GitHub Wiki

Aim

The aim of the project is to reconstruct part of the genome (from chromosome 3) of the moss Niphotrichum japonicum, as well as to study genes expressed during heat stress. This is done by following a pipeline similar to the one described by Zhou et al. (2023).

The project will include preprocessing, genome assembly, assembly evaluation, annotation, and differential gene expression analysis using RNA-seq data.

Data

The data that will be used during the project focuses on chromosome 3 of N. japonicum, and has been given to us.

The data set includes:

Genomic data from chromosome 3
Transcriptomic data
Whole genome

The genomic data consists of sequencing reads generated using different sequencing technologies.

Long-reads (Nanopore):
- Clean and require no preprocessing
- FASTQ format
Short-reads (Illumina):
- Require preprocessing
- FASTQ format
- Paired-end sequencing data where R1 and R2 correspond to forward and reverse reads
Hi-C data (for chromosome construction):
- FASTQ format
- Paired-end sequencing data where R1 and R2 correspond to forward and reverse reads

The differential expression analysis will use transcriptomic data which has been obtained from RNA sequencing at different times, during varying levels of heat stress.

Pipeline

Preprocessing

The first step is preprocessing of the reads, including trimming and quality control. Quality control will be done before and after trimming, using the software FastQC. Trimming will be done using Trimmomatic.

FastQC

Quality control
Input: Raw Illumina sequencing reads (FASTQ files)
Output: Quality report
Running time: ~ 10 min

Trimmomatic

Trims sequences e.g. by removing adaptors and low quality bases
Input: Raw Illumina sequencing reads (FASTQ files)
Output: Four FASTQ files (paired and unpaired reads for R1 and R2)
Running time: ~ 1 h per file
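
In paired-end mode, Trimmomatic writes two files each for R1 and R2: one with reads whose mate also survived trimming and one with reads whose mate was dropped. The following Python sketch shows how such a command line could be assembled. The executable name `trimmomatic`, the input file names, the adapter file, and all trimming thresholds are illustrative assumptions, not settings decided for this project.

```python
from pathlib import Path


def trimmomatic_pe_command(r1, r2, out_dir, adapters="TruSeq3-PE.fa", threads=2):
    """Build a Trimmomatic paired-end command as an argument list.

    Every path and threshold here is a placeholder (assumption); real values
    should be chosen after inspecting the FastQC reports.
    """
    out_dir = Path(out_dir)
    # Hypothetical naming scheme: everything before "_R1" is the sample name.
    sample = Path(r1).name.split("_R1")[0]
    outputs = [
        out_dir / f"{sample}_R1_paired.fastq.gz",
        out_dir / f"{sample}_R1_unpaired.fastq.gz",
        out_dir / f"{sample}_R2_paired.fastq.gz",
        out_dir / f"{sample}_R2_unpaired.fastq.gz",
    ]
    cmd = ["trimmomatic", "PE", "-threads", str(threads), str(r1), str(r2)]
    cmd += [str(path) for path in outputs]
    cmd += [
        f"ILLUMINACLIP:{adapters}:2:30:10",  # remove adapter sequences
        "LEADING:3",                         # trim low-quality leading bases
        "TRAILING:3",                        # trim low-quality trailing bases
        "SLIDINGWINDOW:4:15",                # cut when window quality drops
        "MINLEN:36",                         # discard reads that become too short
    ]
    return cmd, outputs


cmd, outputs = trimmomatic_pe_command("chr3_R1.fastq.gz", "chr3_R2.fastq.gz", "trimmed")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` inside a batch job. Only the two paired output files would normally be carried forward to mapping.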

DNA assembly

The DNA assembly will be performed using the Nanopore long-reads and then polished with the Illumina short-reads. The long-reads are assembled using the software Flye, and the resulting assembly is polished with Pilon. To use the short-reads for polishing, they are first mapped to the Flye assembly using BWA.

Flye

Assembler
Input: Nanopore long-reads (FASTQ files)
Output: FASTA file containing the assembled contigs + Final repeat graph + Extra information (.txt file)
Running time: ~ 48 h

BWA

Short-read mapping
Input: Illumina short-reads from Trimmomatic + FASTA file of genome assembly from Flye
Output: Sequence Alignment Map (.sam file)

SAMtools

Convert SAM-file to BAM-file
Input: .sam file from BWA
Output: .bam file

Pilon

Genome polishing and error correction
Input: FASTA file of genome assembly from Flye + BAM-file from SAMtools
Output: FASTA file containing the improved representation of the genome (pilon.fasta)
Running time: ~ 24 h
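
The order of the assembly and polishing steps can be sketched as a list of commands. This is a minimal Python sketch, not the final job scripts: the file names, thread count, and the executable names `flye`, `bwa`, `samtools`, and `pilon` are assumptions about how the tools are installed. Note that Pilon expects a coordinate-sorted and indexed BAM file, so the SAMtools step includes sorting and indexing.

```python
def assembly_commands(nanopore_fq, r1, r2, work_dir="results/02_assembly", threads=4):
    """Return the assembly and polishing commands in execution order.

    All paths are placeholders (assumptions) that follow the planned
    results folder layout.
    """
    flye_dir = f"{work_dir}/01_Flye"
    assembly = f"{flye_dir}/assembly.fasta"  # Flye's main output file
    sam = f"{work_dir}/02_BWA/short_reads.sam"
    bam = f"{work_dir}/03_SAMtools/short_reads.sorted.bam"
    return [
        # 1. Assemble the Nanopore long-reads.
        ["flye", "--nano-raw", nanopore_fq, "--out-dir", flye_dir,
         "--threads", str(threads)],
        # 2. Index the assembly and map the trimmed short-reads to it.
        ["bwa", "index", assembly],
        ["bwa", "mem", "-t", str(threads), "-o", sam, assembly, r1, r2],
        # 3. Convert to a sorted BAM file and index it.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["samtools", "index", bam],
        # 4. Polish the assembly; "--output pilon" gives pilon.fasta.
        ["pilon", "--genome", assembly, "--frags", bam,
         "--output", "pilon", "--outdir", f"{work_dir}/04_Pilon"],
    ]


for command in assembly_commands("nanopore.fastq.gz", "R1_paired.fastq.gz", "R2_paired.fastq.gz"):
    print(" ".join(command))
```

Each command depends on the output of the one before it, which is why the long-running Flye job has to finish before any of the polishing steps can start.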

Assembly evaluation

The assembly is then evaluated using Quast and BUSCO.

Quast

Evaluate genome assembly using assembly statistics
Input: Genome assembly (Flye output [FASTA file]) + Polished genome assembly (Pilon output [FASTA file])
Output: HTML version of the report ( + other files)
Running time: < 15 min
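
One of the assembly statistics reported by Quast is the N50. As a worked example of what that number means, the sketch below computes it from a list of contig lengths. The lengths are toy values chosen for illustration, not results from this project.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    The N50 is the contig length L such that contigs of length >= L
    together contain at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy example: total length 1000, half is 500.
# 400 + 300 = 700 >= 500, so the N50 is 300.
print(n50([100, 200, 300, 400]))  # 300
```

A higher N50 after polishing or scaffolding indicates a more contiguous assembly, but it says nothing about correctness, which is why BUSCO is used as a complement.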

BUSCO

Evaluate genome assembly
Input: Polished genome assembly (Pilon output [FASTA file]) + Full genome of N. japonicum (FASTQ file)
Output: "Quantitative assessment of the completeness in terms of expected gene content of the genome assembly" (BUSCO, n.d.)
Running time: ~ 30 min
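
BUSCO summarises its result in a one-line notation listing the percentages of complete (C), single-copy (S), duplicated (D), fragmented (F), and missing (M) genes, followed by the number of genes searched (n). The Python sketch below parses such a line so that runs can be compared. It assumes the BUSCO v5 one-line format; the example line is made up for illustration and is not a result from this project.

```python
import re

# Matches the one-line summary, e.g. C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023
BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(line):
    """Return the BUSCO percentages and gene count from a summary line."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["total"] = int(result["total"])  # n is a count, not a percentage
    return result


# Made-up summary line, used only to show the format.
summary = parse_busco_summary("C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023")
print(summary["complete"], summary["total"])
```

Because only part of the genome is assembled here, a low completeness score would be expected and should be interpreted relative to the fraction of the genome covered.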

Annotation

Using the assembled genome, the genetic elements are annotated using BRAKER3, which requires a softmasked genome as input. Afterwards, eggNOGmapper is used for functional annotation.

RepeatMasker

Softmasking for annotation
Input: FASTA file containing the improved representation of the genome (pilon.fasta)
Output: Softmasked FASTA file
Running time: ~ 8 h
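
In a softmasked FASTA file, repeats are kept in the sequence but written in lowercase (RepeatMasker does this with its `-xsmall` option). A quick way to check that masking worked is to count the lowercase bases. The sketch below does this for FASTA lines; the example sequence is a toy input, not project data.

```python
def softmasked_fraction(fasta_lines):
    """Return the fraction of bases that are lowercase (softmasked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy input: 8 of the 20 bases are lowercase.
example = [">contig_1", "ACGTacgtACGT", "ttttACGT"]
print(softmasked_fraction(example))  # 0.4
```

For a real file, the function can be called on an open file handle, for example `softmasked_fraction(open("pilon.fasta.masked"))`, where the file name is an assumption about how the masked output is named.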

BRAKER3

Gene prediction and structural annotation
Input: Softmasked FASTA file (output of RepeatMasker) + RNA-seq alignments (output from STAR or Hisat2)
Output: Folder of annotation files
Running time: ~ 32 h

eggNOGmapper

Functional annotation of predicted genes
Input: Predicted protein sequences (BRAKER3 output)
Output: Annotation files
Running time: ~ 18 h

Differential Expression Analysis

The RNA-seq analysis will be conducted by first mapping the reads using STAR and then counting the reads that map to genomic features using featureCounts. DESeq2 will then be used to conduct the differential expression analysis. Hisat2 can be used instead of STAR.

STAR

Mapping
Input: Trimmed RNA-seq reads (FASTQ files) + Polished genome assembly (Pilon output [FASTA file])
Output: BAM file (mapping and alignment information)
Running time: ~ 24 h

Hisat2

Mapping
Input: Trimmed RNA-seq reads (FASTQ files) + Polished genome assembly (Pilon output [FASTA file])
Output: BAM file (mapping and alignment information)
Running time: ~ 12 h
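
Both mappers work in two stages: an index is built from the genome, and the reads are then aligned against it. The Python sketch below assembles the two commands for each tool. File names, index locations, and thread counts are assumptions for illustration. HISAT2 writes SAM output, so a SAMtools sort step would be needed to obtain a BAM file, while STAR can write a sorted BAM file directly.

```python
def star_commands(genome_fa, r1, r2, index_dir, prefix, threads=4):
    """Return the STAR index and alignment commands (paths are placeholders)."""
    index = ["STAR", "--runMode", "genomeGenerate", "--genomeDir", index_dir,
             "--genomeFastaFiles", genome_fa, "--runThreadN", str(threads)]
    align = ["STAR", "--genomeDir", index_dir, "--readFilesIn", r1, r2,
             "--outSAMtype", "BAM", "SortedByCoordinate",
             "--outFileNamePrefix", prefix, "--runThreadN", str(threads)]
    return [index, align]


def hisat2_commands(genome_fa, r1, r2, index_prefix, sam_out, threads=4):
    """Return the HISAT2 index and alignment commands (paths are placeholders)."""
    index = ["hisat2-build", genome_fa, index_prefix]
    align = ["hisat2", "-p", str(threads), "-x", index_prefix,
             "-1", r1, "-2", r2, "-S", sam_out]
    return [index, align]


# Hypothetical sample name and file names, uncompressed FASTQ input.
for command in star_commands("pilon.fasta", "heat_R1.fastq", "heat_R2.fastq",
                             "star_index", "heat_"):
    print(" ".join(command))
for command in hisat2_commands("pilon.fasta", "heat_R1.fastq", "heat_R2.fastq",
                               "hisat2_index/chr3", "heat.sam"):
    print(" ".join(command))
```

The index only has to be built once per genome; the alignment command is then repeated for every RNA-seq sample.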

featureCounts

Quantification of gene expression
Input: BAM file (STAR/Hisat2 output) + Gene annotation file (output from BRAKER3 or eggNOGmapper)
Output: Count table containing the number of reads mapped to each gene
Running time: ~ 12 h
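
The count table from featureCounts is a tab-separated text file: a comment line starting with `#`, a header line, six annotation columns (Geneid, Chr, Start, End, Strand, Length), and one count column per BAM file. The sketch below reads such a table into a plain dictionary of counts per gene and sample, which is the shape the differential expression step needs. The gene names, sample names, and counts are toy values, not project data.

```python
import csv
import io

# Annotation columns that precede the per-sample count columns.
ANNOTATION_COLUMNS = ("Chr", "Start", "End", "Strand", "Length")


def read_featurecounts(handle):
    """Return {gene_id: {sample: count}} from a featureCounts table."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    counts = {}
    for row in reader:
        gene = row.pop("Geneid")
        for column in ANNOTATION_COLUMNS:
            row.pop(column, None)
        counts[gene] = {sample: int(value) for sample, value in row.items()}
    return counts


# Toy table in the featureCounts layout (not real output).
example = io.StringIO(
    "# Program:featureCounts (toy example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcontrol.bam\theat.bam\n"
    "g1\tchr3\t100\t900\t+\t801\t12\t40\n"
    "g2\tchr3\t1500\t2100\t-\t601\t7\t0\n"
)
counts = read_featurecounts(example)
print(counts["g1"])  # {'control.bam': 12, 'heat.bam': 40}
```

For a real file, the same function can be called on an open file handle instead of the `io.StringIO` object.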

DESeq2

Differential expression analysis (R package)
Input: Counts table (output from featureCounts)
Output: Information for biological analysis
Running time: ~ a few minutes

Pipeline (graphic representation)

Below is a graphic representation of the pipeline above.

Extra analysis

If there is time, an assembly and annotation of the chloroplast genome will be conducted using the software GetOrganelle.

GetOrganelle

Input: Illumina paired-end sequences (FASTQ files)
Output: Chloroplast assembly (FASTA files) + Organelle-related assembly graph

Time management plan

The estimated times are based on the running times of the individual software tools, with some added buffer time.

Date	Analysis	Description	Software	Time	Deadline
2026-04-15	Preprocessing	Quality control	FastQC	20 min
2026-04-15		Read trimming	Trimmomatic	2 h
2026-04-15		Quality control	FastQC	20 min
2026-04-15	DNA assembly	Assemble long-reads	Flye	54 h	2026-04-16
2026-04-18		Map short-reads	BWA	2 h	2026-04-16
2026-04-18		Convert SAM to BAM	SAMtools	1 h	2026-04-16
2026-04-18		Genome polishing	Pilon	30 h	2026-04-16
2026-04-21	Assembly evaluation	Assembly quality assessment	Quast	20 min	2026-04-21
2026-04-21		Evaluation	BUSCO	1 h	2026-04-21
2026-04-21	Annotation	TE family identification	RepeatModeler	10 h	2026-04-21
2026-04-21		Repeat masking	RepeatMasker	10 h	2026-04-28
2026-04-24	RNA preprocessing	Quality control	FastQC	20 min
2026-04-24		Read trimming	Trimmomatic	2 h
2026-04-24		Quality control	FastQC	20 min
2026-04-26	RNA-seq mapping	Read alignment	STAR/HISAT2	1 h	2026-05-05
2026-04-28	Annotation	Gene prediction	BRAKER3	48 h	2026-05-05
2026-04-30		Functional annotation	eggNOGmapper	24 h	2026-05-05
2026-05-01	Differential Expression	Read counting	featureCounts	12 h	2026-05-05
2026-05-02		Differential expression analysis	DESeq2	1 h	2026-05-05
2026-05-05	Extra analysis	Chloroplast analysis	GetOrganelle		2026-05-08
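
A schedule like the one above can be sanity-checked by adding each estimated running time to its start date and comparing the result with the deadline. The Python sketch below does this for a single row. It assumes that the job starts at 00:00 on the start date, that the deadline is the end of the deadline day, and that there is no queueing time on the cluster; all three are simplifications.

```python
from datetime import date, datetime, time, timedelta


def finish_time(start, hours):
    """Return when a job ends if it starts at 00:00 on the start date."""
    return datetime.combine(start, time.min) + timedelta(hours=hours)


def meets_deadline(start, hours, deadline):
    """Return True if the job ends before the deadline day is over."""
    deadline_end = datetime.combine(deadline, time.max)
    return finish_time(start, hours) <= deadline_end


# BRAKER3 row from the table: start 2026-04-28, 48 h, deadline 2026-05-05.
start = date.fromisoformat("2026-04-28")
deadline = date.fromisoformat("2026-05-05")
print(finish_time(start, 48).isoformat())   # 2026-04-30T00:00:00
print(meets_deadline(start, 48, deadline))  # True
```

Running the same check for every row makes it easy to spot steps whose start date, running time, and deadline do not fit together.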

Final presentations will be held on 2026-05-22. The weeks between 2026-05-05 and 2026-05-22 will therefore be used to prepare and practise the presentation and, if time allows, to do the extra analysis.

DNA assembly and annotation require more time than the other steps, so it is crucial that I start them on time.

Project Organization

Dates will follow the ISO 8601 standard and be written as YYYY-MM-DD. The data can be found after logging in to UPPMAX, at the following path: /crex/proj/uppmax2026-1-61/Genome_Analysis/2_Zhou_2023/reads/. Large files will be stored on UPPMAX, and GitHub will be used for documentation and version control.

The project will use the following directory structure.

├── data/
│   ├── meta_data/
│   ├── raw_data/
├── results/
│   ├── 01_preprocessing/
│   │   ├── 01_FastQC/
│   │   ├── 02_Trimmomatic/
│   │   ├── 03_FastQC_post_trim/
│   ├── 02_assembly/
│   │   ├── 01_Flye/
│   │   ├── 02_BWA/
│   │   ├── 03_SAMtools/
│   │   ├── 04_Pilon/
│   ├── 03_evaluation/
│   │   ├── 01_Quast/
│   │   ├── 02_BUSCO/
│   ├── 04_masking/
│   │   ├── 01_RepeatMasker/
│   ├── 05_annotation/
│   │   ├── 01_BRAKER3/
│   │   ├── 02_eggNOGmapper/
│   ├── 06_DEA/
│   │   ├── 01_STAR/
│   │   ├── 01_Hisat2/
│   │   ├── 02_featureCounts/
│   │   ├── 03_DESeq2/
├── code/
│   ├── 01_preprocessing/
│   │   ├── 01_FastQC/
│   │   ├── 02_Trimmomatic/
│   │   ├── 03_FastQC_post_trim/
│   ├── 02_assembly/
│   │   ├── 01_Flye/
│   │   ├── 02_BWA/
│   │   ├── 03_SAMtools/
│   │   ├── 04_Pilon/
│   ├── 03_evaluation/
│   │   ├── 01_Quast/
│   │   ├── 02_BUSCO/
│   ├── 04_masking/
│   │   ├── 01_RepeatMasker/
│   ├── 05_annotation/
│   │   ├── 01_BRAKER3/
│   │   ├── 02_eggNOGmapper/
│   ├── 06_DEA/
│   │   ├── 01_STAR/
│   │   ├── 02_featureCounts/
│   │   ├── 03_DESeq2/
└── logs/

A README.md file at the top level will describe the pipeline. A README file will also be present in each subfolder.
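
The folder layout can be created in one go rather than by hand. The Python sketch below builds the same step and tool folders under both results/ and code/, and places a stub README.md in each. It is one possible way to set this up, and it leaves out the optional 01_Hisat2 folder; the README content is a placeholder.

```python
import tempfile
from pathlib import Path

# Analysis steps and their tool folders, as listed in the planned structure.
STRUCTURE = {
    "01_preprocessing": ["01_FastQC", "02_Trimmomatic", "03_FastQC_post_trim"],
    "02_assembly": ["01_Flye", "02_BWA", "03_SAMtools", "04_Pilon"],
    "03_evaluation": ["01_Quast", "02_BUSCO"],
    "04_masking": ["01_RepeatMasker"],
    "05_annotation": ["01_BRAKER3", "02_eggNOGmapper"],
    "06_DEA": ["01_STAR", "02_featureCounts", "03_DESeq2"],
}


def create_layout(root):
    """Create the project folders under root and return their paths."""
    root = Path(root)
    folders = [root / "data" / "meta_data", root / "data" / "raw_data", root / "logs"]
    for parent in ("results", "code"):
        for step, tools in STRUCTURE.items():
            folders.extend(root / parent / step / tool for tool in tools)
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():
            readme.write_text(f"# {folder.name}\n")  # placeholder content
    return folders


if __name__ == "__main__":
    # Demonstration in a temporary directory, so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        print(len(create_layout(tmp)), "folders created")
```

Calling `create_layout(".")` from the repository root would create the folders in place. Existing folders and README files are left untouched, so the function is safe to run more than once.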

AI statement

Generative AI was used to create the directory structure for organising the data, as well as to understand the inputs and outputs of the software used in the differential expression analysis.

Bibliography

BUSCO (n.d.). User guide BUSCO v5.7.1. [online] Available at: https://busco.ezlab.org/busco_userguide.html#interpreting-the-results.

Zhou X, Peng T, Zeng Y, Cai Y, Zuo Q, Zhang L, Dong S and Liu Y (2023) Chromosome-level genome assembly of Niphotrichum japonicum provides new insights into heat stress responses in mosses. Front. Plant Sci. 14:1271357. doi: 10.3389/fpls.2023.1271357