Project Plan - linnasp/GenomeAnalysis_VT2026_Lab GitHub Wiki

Aim

The aim of the project is to reconstruct part of the genome (chromosome 3) of the moss Niphotrichum japonicum and to study genes expressed during heat stress. This is done by following a pipeline similar to the one described by Zhou et al. (2023).

The project will include preprocessing, assembly, assembly evaluation, annotation, and differential gene expression analysis using RNA-Seq data.

Data

The data used in the project focus on chromosome 3 of N. japonicum and have been provided to us.

The data set includes:

  • Genomic data from chromosome 3
  • Transcriptomic data
  • Whole genome

The genomic data consists of sequencing reads generated using different sequencing technologies.

  • Long-reads (Nanopore):
    • Clean and require no preprocessing
    • FASTQ format
  • Short-reads (Illumina):
    • Require preprocessing
    • FASTQ format
    • Paired-end sequencing data where R1 and R2 correspond to forward and reverse reads
  • Hi-C data (for chromosome construction):
    • FASTQ format
    • Paired-end sequencing data where R1 and R2 correspond to forward and reverse reads

The differential expression analysis will use transcriptomic data which has been obtained from RNA sequencing at different times, during varying levels of heat stress.

Pipeline

Preprocessing

The first step is preprocessing of the reads, including trimming and quality control. Quality control will be done before and after trimming, using the software FastQC. Trimming will be done using Trimmomatic.

FastQC

  • Quality control
  • Input: Raw Illumina sequencing reads (FASTQ files)
  • Output: Quality report
  • Running time: ~ 10 min

Trimmomatic

  • Trims sequences, e.g. by removing adapters and low-quality bases
  • Input: Raw Illumina sequencing reads (FASTQ files)
  • Output: Four FASTQ files (paired and unpaired reads for R1 and R2)
  • Running time: ~ 1 h per file
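
The preprocessing steps above could be scripted roughly as follows. This is a minimal sketch rather than the project's actual script: the output file names, the adapter file `adapters.fa`, the thread count and the trimming thresholds are placeholders, and the `fastqc` and `trimmomatic` commands are assumed to be available on the PATH (for example after loading the corresponding modules). The functions only build the command lines, so they can be inspected before anything is run.

```python
from pathlib import Path


def fastqc_cmd(fastq_files, out_dir, threads=2):
    """Build a FastQC command that writes its reports to out_dir."""
    return ["fastqc", "-t", str(threads), "-o", str(out_dir), *map(str, fastq_files)]


def trimmomatic_pe_cmd(r1, r2, out_dir, adapters="adapters.fa", threads=2):
    """Build a Trimmomatic paired-end command; return it with its four output paths."""
    out_dir = Path(out_dir)
    # Paired-end mode writes paired and unpaired reads for both R1 and R2.
    outputs = [
        out_dir / "R1_paired.fastq.gz",
        out_dir / "R1_unpaired.fastq.gz",
        out_dir / "R2_paired.fastq.gz",
        out_dir / "R2_unpaired.fastq.gz",
    ]
    cmd = [
        "trimmomatic", "PE", "-threads", str(threads),
        str(r1), str(r2), *map(str, outputs),
        # Placeholder trimming steps; thresholds should be chosen from the FastQC report.
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]
    return cmd, outputs
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Running `fastqc_cmd` once on the raw reads and once on the paired outputs gives the before and after quality reports.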

DNA assembly

The DNA assembly will then be performed using the Nanopore long-reads and polished with the Illumina short-reads. The long-reads are assembled using the software Flye, and the resulting assembly is polished with Pilon. In order to use the short-reads for polishing, they are first mapped to the Flye assembly using BWA.

Flye

  • Assembler
  • Input: Nanopore long-reads (FASTQ files)
  • Output: FASTA file containing the assembled contigs + Final repeat graph + Extra information (.txt file)
  • Running time: ~ 48 h

BWA

  • Short-read mapping
  • Input: Illumina short-reads from Trimmomatic + FASTA file of genome assembly from Flye
  • Output: Sequence Alignment Map (.sam file)

SAMtools

  • Convert the SAM file to a sorted and indexed BAM file (Pilon requires sorted, indexed BAM input)
  • Input: .sam file from BWA
  • Output: .bam file

Pilon

  • Genome polishing and error correction
  • Input: FASTA file of genome assembly from Flye + BAM-file from SAMtools
  • Output: FASTA file containing the improved representation of the genome (pilon.fasta)
  • Running time: ~ 24 h
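
The assembly and polishing chain described above (Flye, BWA, SAMtools, Pilon) can be sketched as an ordered list of commands. This is an illustration under assumptions, not the final job script: the intermediate file names and the thread count are placeholders, the directory names are taken from the project structure in this plan, and a `pilon` wrapper command is assumed to exist (on some systems Pilon is started with `java -jar pilon.jar` instead).

```python
def assembly_polish_cmds(nanopore, r1, r2, out_dir="results/02_assembly", threads=4):
    """Return the Flye -> BWA -> SAMtools -> Pilon commands in run order."""
    flye_dir = f"{out_dir}/01_Flye"
    assembly = f"{flye_dir}/assembly.fasta"  # Flye's default assembly file name
    sam = f"{out_dir}/02_BWA/short_reads.sam"                # placeholder name
    bam = f"{out_dir}/03_SAMtools/short_reads.sorted.bam"    # placeholder name
    return [
        # 1. Assemble the Nanopore long-reads.
        ["flye", "--nano-raw", nanopore, "--out-dir", flye_dir, "--threads", str(threads)],
        # 2. Index the assembly and map the trimmed short-reads to it.
        ["bwa", "index", assembly],
        ["bwa", "mem", "-t", str(threads), "-o", sam, assembly, r1, r2],
        # 3. Convert to a coordinate-sorted BAM file and index it.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["samtools", "index", bam],
        # 4. Polish the assembly; with --output pilon the result is pilon.fasta.
        ["pilon", "--genome", assembly, "--frags", bam,
         "--output", "pilon", "--outdir", f"{out_dir}/04_Pilon"],
    ]
```

Keeping the commands in one ordered list makes the dependency between the steps explicit: each step consumes a file produced by the step before it.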

Assembly evaluation

The assembly is then evaluated using Quast and BUSCO.

Quast

  • Evaluate genome assembly using assembly statistics
  • Input: Genome assembly (Flye output [FASTA file]) + Polished genome assembly (Pilon output [FASTA file])
  • Output: HTML version of the report ( + other files)
  • Running time: < 15 min
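
One of the assembly statistics reported by Quast is the N50. To make clear what that number means, the sketch below computes it directly from the contig lengths of a FASTA file. It is only an illustration of the definition, and the function names are hypothetical; the real evaluation will use the Quast report.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new contig
        elif line and lengths:
            lengths[-1] += len(line)   # sequence lines extend the current contig
    return lengths


def n50(lengths):
    """Return the length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")
```

For example, contigs of 80, 70, 50, 40, 30 and 20 bases total 290 bases; the two longest contigs together cover 150 bases, which is more than half, so the N50 is 70.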

BUSCO

  • Evaluate genome assembly
  • Input: Polished genome assembly (Pilon output [FASTA file]) + a BUSCO lineage dataset; the full genome of N. japonicum can be assessed in the same way for comparison
  • Output: "Quantitative assessment of the completeness in terms of expected gene content of the genome assembly" (BUSCO, n.d.)
  • Running time: ~ 30 min

Annotation

Using the assembled genome, the different genetic elements are annotated using BRAKER3. BRAKER3 requires a softmasked genome, so repeats are masked first. Afterwards, eggNOGmapper is used for functional annotation.

RepeatMasker

  • Softmasking for annotation
  • Input: FASTA file containing the improved representation of the genome (pilon.fasta)
  • Output: Softmasked FASTA file
  • Running time: ~ 8 h

BRAKER3

  • Gene prediction and structural annotation
  • Input: Softmasked FASTA file (output of RepeatMasker) + RNA-seq alignments (output from STAR or Hisat2)
  • Output: Directory of annotation files
  • Running time: ~ 32 h

eggNOGmapper

  • Functional annotation of predicted genes
  • Input: Predicted protein sequences (BRAKER3 output)
  • Output: Annotation files
  • Running time: ~ 18 h

Differential Expression Analysis

The RNA-seq analysis will be conducted by first mapping the reads using STAR, followed by counting the reads that map to genomic features using featureCounts. Then, DESeq2 will be used to conduct the differential expression analysis. Instead of STAR, Hisat2 can be used.

STAR

  • Mapping
  • Input: Trimmed RNA-seq reads (FASTQ files) + Polished genome assembly (Pilon output [FASTA file])
  • Output: BAM file (mapping and alignment information)
  • Running time: ~ 24 h

Hisat2

  • Mapping
  • Input: Trimmed RNA-seq reads (FASTQ files) + Polished genome assembly (Pilon output [FASTA file])
  • Output: BAM file (mapping and alignment information)
  • Running time: ~ 12 h
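
Since either STAR or Hisat2 may be used for the mapping, the choice can be kept in one place. The sketch below builds the index and alignment commands for one paired-end sample with either aligner. The index names, the output naming and the thread count are placeholders, and `--readFilesCommand zcat` assumes that the trimmed reads are gzip-compressed.

```python
def rnaseq_mapping_cmds(genome, r1, r2, sample, aligner="STAR", threads=4):
    """Return index-building and alignment commands for one paired-end sample."""
    if aligner == "STAR":
        index = "star_index"  # placeholder index directory
        return [
            ["STAR", "--runMode", "genomeGenerate", "--genomeDir", index,
             "--genomeFastaFiles", genome, "--runThreadN", str(threads)],
            ["STAR", "--genomeDir", index, "--readFilesIn", r1, r2,
             "--readFilesCommand", "zcat",  # assumes gzip-compressed FASTQ input
             "--outSAMtype", "BAM", "SortedByCoordinate",
             "--outFileNamePrefix", f"{sample}_", "--runThreadN", str(threads)],
        ]
    if aligner == "HISAT2":
        index = "hisat2_index"  # placeholder index base name
        return [
            ["hisat2-build", genome, index],
            ["hisat2", "-p", str(threads), "-x", index, "-1", r1, "-2", r2,
             "-S", f"{sample}.sam"],
        ]
    raise ValueError(f"unknown aligner: {aligner}")
```

Note that with these options STAR writes a sorted BAM file directly, whereas Hisat2 writes a SAM file that still has to be sorted and converted to BAM with SAMtools before read counting.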

featureCounts

  • Quantification of gene expression
  • Input: BAM file (STAR/Hisat2 output) + Gene annotation file (output from BRAKER3 or eggNOGmapper)
  • Output: Count table containing the number of reads mapped to each gene
  • Running time: ~ 12 h
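
The featureCounts output is a tab-separated table: a comment line starting with `#`, a header row, and then one row per gene, where the first six columns (Geneid, Chr, Start, End, Strand, Length) describe the gene and each remaining column holds the counts for one BAM file. As a sketch of how that table could be reduced to the plain count matrix needed downstream, the function below parses it with the standard library. The gene IDs, sample names and counts in the example are made up.

```python
import csv
import io


def read_featurecounts(handle):
    """Parse featureCounts output into (sample_names, {gene_id: [counts]})."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[6:]  # columns 1-6 are Geneid, Chr, Start, End, Strand, Length
    counts = {row[0]: [int(value) for value in row[6:]] for row in rows if row}
    return samples, counts


# Made-up example in the featureCounts layout.
example = io.StringIO(
    "# Program:featureCounts (example comment line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcontrol.bam\theat.bam\n"
    "g1\tchr3\t1\t900\t+\t900\t12\t48\n"
    "g2\tchr3\t1500\t2100\t-\t601\t0\t7\n"
)
samples, counts = read_featurecounts(example)
print(samples)       # ['control.bam', 'heat.bam']
print(counts["g1"])  # [12, 48]
```

In the real analysis the table would be read from the featureCounts output file, for example with `open(path)` as the handle.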

DESeq2

  • Differential expression analysis (R package)
  • Input: Counts table (output from featureCounts)
  • Output: Information for biological analysis
  • Running time: a few minutes

Pipeline (graphic representation)

Below is a graphic representation of the pipeline above.

Extra analysis

If there is time, an assembly and annotation of the chloroplast genome will be conducted using the software GetOrganelle.

GetOrganelle

  • Input: Illumina paired-end reads (FASTQ files)
  • Output: Chloroplast assembly (FASTA files) + Organelle-related assembly graph

Time management plan

The estimated times are based on the running times of the individual software tools, with some added buffer time.

| Date | Analysis | Description | Software | Time | Deadline |
|---|---|---|---|---|---|
| 2026-04-15 | Preprocessing | Quality control | FastQC | 20 min | |
| 2026-04-15 | | Read trimming | Trimmomatic | 2 h | |
| 2026-04-15 | | Quality control | FastQC | 20 min | |
| 2026-04-15 | DNA assembly | Assemble long-reads | Flye | 54 h | 2026-04-16 |
| 2026-04-18 | | Map short-reads | BWA | 2 h | 2026-04-16 |
| 2026-04-18 | | Convert SAM to BAM | SAMtools | 1 h | 2026-04-16 |
| 2026-04-18 | | Genome polishing | Pilon | 30 h | 2026-04-16 |
| 2026-04-21 | Assembly evaluation | Assembly quality assessment | Quast | 20 min | 2026-04-21 |
| 2026-04-21 | | Evaluation | BUSCO | 1 h | 2026-04-21 |
| 2026-04-21 | Annotation | TE family identification | RepeatModeler | 10 h | 2026-04-21 |
| 2026-04-21 | | Repeat masking | RepeatMasker | 10 h | 2026-04-28 |
| 2026-04-24 | RNA preprocessing | Quality control | FastQC | 20 min | |
| 2026-04-24 | | Read trimming | Trimmomatic | 2 h | |
| 2026-04-24 | | Quality control | FastQC | 20 min | |
| 2026-04-26 | RNA-seq mapping | Read alignment | STAR/HISAT2 | 1 h | 2026-05-05 |
| 2026-04-28 | Annotation | Gene prediction | BRAKER3 | 48 h | 2026-05-05 |
| 2026-04-30 | | Functional annotation | eggNOGmapper | 24 h | 2026-05-05 |
| 2026-05-01 | Differential Expression | Read counting | featureCounts | 12 h | 2026-05-05 |
| 2026-05-02 | | Differential expression analysis | DESeq2 | 1 h | 2026-05-05 |
| 2026-05-05 | Extra analysis | Chloroplast analysis | GetOrganelle | | 2026-05-08 |

Final presentations will be held on 2026-05-22. The weeks between 2026-05-05 and 2026-05-22 will therefore be used to prepare and practise the presentation and, possibly, to carry out extra analysis.

DNA assembly and Annotation require more time than other steps and it is therefore crucial that I can start those in time.

Project Organization

Dates will follow the ISO 8601 standard and be written as YYYY-MM-DD. The data can be found after logging in to UPPMAX at the following path: /crex/proj/uppmax2026-1-61/Genome_Analysis/2_Zhou_2023/reads/. Large files will be stored on UPPMAX. GitHub will be used for documentation and version control.

The project will follow the following structure.

├── data/
│   ├── meta_data/
│   ├── raw_data/
├── results/
│   ├── 01_preprocessing/
│   │   ├── 01_FastQC/
│   │   ├── 02_Trimmomatic/
│   │   ├── 03_FastQC_post_trim/
│   ├── 02_assembly/
│   │   ├── 01_Flye/
│   │   ├── 02_BWA/
│   │   ├── 03_SAMtools/
│   │   ├── 04_Pilon/
│   ├── 03_evaluation/
│   │   ├── 01_Quast/
│   │   ├── 02_BUSCO/
│   ├── 04_masking/
│   │   ├── 01_RepeatMasker/
│   ├── 05_annotation/
│   │   ├── 01_BRAKER3/
│   │   ├── 02_eggNOGmapper/
│   ├── 06_DEA/
│   │   ├── 01_STAR/
│   │   ├── 01_Hisat2/
│   │   ├── 02_featureCounts/
│   │   ├── 03_DESeq2/
├── code/
│   ├── 01_preprocessing/
│   │   ├── 01_FastQC/
│   │   ├── 02_Trimmomatic/
│   │   ├── 03_FastQC_post_trim/
│   ├── 02_assembly/
│   │   ├── 01_Flye/
│   │   ├── 02_BWA/
│   │   ├── 03_SAMtools/
│   │   ├── 04_Pilon/
│   ├── 03_evaluation/
│   │   ├── 01_Quast/
│   │   ├── 02_BUSCO/
│   ├── 04_masking/
│   │   ├── 01_RepeatMasker/
│   ├── 05_annotation/
│   │   ├── 01_BRAKER3/
│   │   ├── 02_eggNOGmapper/
│   ├── 06_DEA/
│   │   ├── 01_STAR/
│   │   ├── 02_featureCounts/
│   │   ├── 03_DESeq2/
└── logs/
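
The structure above could be created in one go with a short script, which avoids typos when the same subfolders are repeated under results/ and code/. This is a sketch: the `create_layout` name and the `LAYOUT` dictionary are only illustrative, and the folder names are copied from the tree above (including 01_Hisat2, which is listed under results/ only).

```python
from pathlib import Path

# Analysis steps and their tool subfolders, shared by results/ and code/.
LAYOUT = {
    "01_preprocessing": ["01_FastQC", "02_Trimmomatic", "03_FastQC_post_trim"],
    "02_assembly": ["01_Flye", "02_BWA", "03_SAMtools", "04_Pilon"],
    "03_evaluation": ["01_Quast", "02_BUSCO"],
    "04_masking": ["01_RepeatMasker"],
    "05_annotation": ["01_BRAKER3", "02_eggNOGmapper"],
    "06_DEA": ["01_STAR", "02_featureCounts", "03_DESeq2"],
}


def create_layout(root):
    """Create the project folders under root and return the created paths."""
    root = Path(root)
    paths = [root / "data" / "meta_data", root / "data" / "raw_data", root / "logs"]
    for top in ("results", "code"):
        for step, tools in LAYOUT.items():
            paths += [root / top / step / tool for tool in tools]
    # The alternative aligner has a results folder only, as in the tree above.
    paths.append(root / "results" / "06_DEA" / "01_Hisat2")
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
    return paths
```

Calling `create_layout(".")` from the project root creates any missing folders and leaves existing ones untouched.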

A README.md file will describe the pipeline. A README file will also be present in each subfolder.

AI statement

Generative AI was used to create the structure for sorting the data, as well as to understand the different inputs and outputs of the software used in the Differential Expression Analysis.

Bibliography

BUSCO. busco.ezlab.org. (n.d.). User guide BUSCO v5.7.1. [online] Available at: https://busco.ezlab.org/busco_userguide.html#interpreting-the-results.

Zhou X, Peng T, Zeng Y, Cai Y, Zuo Q, Zhang L, Dong S and Liu Y (2023) Chromosome-level genome assembly of Niphotrichum japonicum provides new insights into heat stress responses in mosses. Front. Plant Sci. 14:1271357. doi: 10.3389/fpls.2023.1271357