Project Plan - linnasp/GenomeAnalysis_VT2026_Lab GitHub Wiki
Aim
The aim of the project is to reconstruct part of the genome (chromosome 3) of the moss Niphotrichum japonicum and to study genes expressed during heat stress. This is done by following a pipeline similar to the one described by Zhou et al. (2023).
The project will include preprocessing, assembly, assembly evaluation, annotation, and differential gene expression analysis using RNA-Seq data.
Data
The data used in the project focus on chromosome 3 of N. japonicum and have been provided to us.
The data set includes:
- Genomic data from chromosome 3
- Transcriptomic data
- Whole genome
The genomic data consists of sequencing reads generated using different sequencing technologies.
- Long-reads (Nanopore):
  - Clean and require no preprocessing
  - FASTQ format
- Short-reads (Illumina):
  - Require preprocessing
  - FASTQ format
  - Paired-end sequencing data where R1 and R2 correspond to forward and reverse reads
- Hi-C data (for chromosome construction):
  - FASTQ format
  - Paired-end sequencing data where R1 and R2 correspond to forward and reverse reads
The differential expression analysis will use transcriptomic data which has been obtained from RNA sequencing at different times, during varying levels of heat stress.
Pipeline
Preprocessing
The first step is preprocessing of the reads, including trimming and quality control. Quality control will be done before and after trimming, using the software FastQC. Trimming will be done using Trimmomatic.
FastQC
- Quality control
- Input: Raw Illumina sequencing reads (FASTQ files)
- Output: Quality report
- Running time: ~ 10 min
Trimmomatic
- Trims sequences, e.g. by removing adaptors and low-quality bases
- Input: Raw Illumina sequencing reads (FASTQ files)
- Output: Four FASTQ files (paired and unpaired reads for R1 and for R2)
- Running time: ~ 1 h per file
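The preprocessing step above can be sketched as a small Python helper that only builds the command lines, so they can be inspected before being submitted as a job. This is a minimal sketch under stated assumptions: the executable name `trimmomatic` (a wrapper script, some installations use `java -jar` instead), the output file names, and the trimming step values are illustrative and are not settings chosen in this plan.

```python
from pathlib import Path


def fastqc_cmd(fastq_files, out_dir, threads=2):
    """Build a FastQC command that writes its reports to out_dir."""
    return ["fastqc", "-t", str(threads), "-o", str(out_dir), *map(str, fastq_files)]


def trimmomatic_pe_cmd(r1, r2, out_dir, adapters, threads=2):
    """Build a Trimmomatic paired-end command and list its four output files."""
    out_dir = Path(out_dir)
    # Paired-end mode writes paired and unpaired reads for R1 and for R2.
    # The file names below are hypothetical.
    outputs = [
        out_dir / "R1_paired.fastq.gz",
        out_dir / "R1_unpaired.fastq.gz",
        out_dir / "R2_paired.fastq.gz",
        out_dir / "R2_unpaired.fastq.gz",
    ]
    cmd = [
        "trimmomatic", "PE", "-threads", str(threads),
        str(r1), str(r2), *map(str, outputs),
        # Illustrative trimming steps (assumption, not tuned for this data).
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]
    return cmd, outputs
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` inside a job script. Running `fastqc_cmd` once on the raw reads and once on the paired outputs gives the before and after quality reports.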
DNA assembly
The DNA assembly will then be performed using Nanopore long-reads and polished with Illumina short-reads. The long-reads are assembled using the software Flye, and the resulting assembly is then polished with Pilon. In order to use the short-reads for polishing, they are first mapped to the Flye assembly using BWA.
Flye
- Assembler
- Input: Nanopore long-reads (FASTQ files)
- Output: FASTA file containing the assembled contigs + Final repeat graph + Extra information (.txt file)
- Running time: ~ 48 h
BWA
- Short-read mapping
- Input: Illumina short-reads from Trimmomatic + FASTA file of genome assembly from Flye
- Output: Sequence Alignment Map (.sam file)
SAMtools
- Convert the SAM file to a sorted, indexed BAM file (as required by Pilon)
- Input: .sam file from BWA
- Output: Sorted .bam file and its index
Pilon
- Genome polishing and error correction
- Input: FASTA file of genome assembly from Flye + BAM-file from SAMtools
- Output: FASTA file containing the improved representation of the genome (pilon.fasta)
- Running time: ~ 24 h
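The order of the assembly and polishing steps above can be sketched as a function that returns the commands in the order they would run. This is a minimal sketch under stated assumptions: the file names, the use of Flye's `--nano-raw` read type, and the `pilon` wrapper script are illustrative choices, not settings confirmed by this plan.

```python
from pathlib import Path


def assembly_commands(nanopore_fastq, r1, r2, work_dir, threads=4):
    """Return the ordered commands for assembly, mapping and polishing."""
    work = Path(work_dir)
    flye_dir = work / "01_Flye"
    draft = flye_dir / "assembly.fasta"  # contig file written by Flye
    sam = work / "02_BWA" / "illumina.sam"  # hypothetical file name
    bam = work / "03_SAMtools" / "illumina.sorted.bam"  # hypothetical file name
    pilon_dir = work / "04_Pilon"
    t = str(threads)
    return [
        # 1. Assemble the long reads.
        ["flye", "--nano-raw", str(nanopore_fastq), "--out-dir", str(flye_dir), "--threads", t],
        # 2. Index the draft assembly and map the trimmed short reads to it.
        ["bwa", "index", str(draft)],
        ["bwa", "mem", "-t", t, "-o", str(sam), str(draft), str(r1), str(r2)],
        # 3. Sort and index the alignments so Pilon can read them.
        ["samtools", "sort", "-@", t, "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
        # 4. Polish the draft; with --output pilon the result is pilon.fasta.
        ["pilon", "--genome", str(draft), "--frags", str(bam),
         "--output", "pilon", "--outdir", str(pilon_dir)],
    ]
```

Keeping the commands as data makes it easy to print them into a job script or log them before the long-running steps start.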
Assembly evaluation
The assembly is then evaluated using Quast and BUSCO.
Quast
- Evaluate genome assembly using assembly statistics
- Input: Genome assembly (Flye output [FASTA file]) + Polished genome assembly (Pilon output [FASTA file])
- Output: HTML report (+ other files)
- Running time: < 15 min
BUSCO
- Evaluate genome assembly
- Input: Polished genome assembly (Pilon output [FASTA file]) + Full genome of N. japonicum (FASTA file)
- Output: "Quantitative assessment of the completeness in terms of expected gene content of the genome assembly" (BUSCO, n.d.)
- Running time: ~ 30 min
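One of the assembly statistics reported by Quast is N50: the contig length such that contigs of that length or longer cover at least half of the assembly. The sketch below computes it directly from a FASTA file, which can serve as a quick sanity check on the Flye and Pilon outputs. The function names are illustrative and the code is not part of Quast.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths, current = [], None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the running sum reaches half the total.
        if running * 2 >= total:
            return length
    return 0
```

Comparing `n50(read_fasta_lengths(...))` for the draft and the polished assembly should give values close to each other, since polishing corrects bases rather than joining contigs.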
Annotation
Using the assembled genome, the different genetic elements are annotated using BRAKER3. BRAKER3 requires a soft-masked genome as input, so repeats are first soft-masked with RepeatMasker. Afterwards, eggNOGmapper is used for functional annotation.
RepeatMasker
- Softmasking for annotation
- Input: FASTA file containing the improved representation of the genome (pilon.fasta)
- Output: Softmasked FASTA file
- Running time: ~ 8 h
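In a soft-masked FASTA file, repeats are written in lowercase while the rest of the sequence stays uppercase. A quick way to check that the masking step produced sensible output is to measure the fraction of lowercase bases. The function below is a minimal sketch with an illustrative name and is not part of RepeatMasker.

```python
def softmasked_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lowercase (masked)."""
    masked = total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0
```

It accepts any iterable of lines, so an open file handle can be passed directly, for example `softmasked_fraction(open(path))` with `path` pointing at the masked FASTA file.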
BRAKER3
- Gene prediction and structural annotation
- Input: Softmasked FASTA file (output of RepeatMasker) + RNA-seq alignments (output from STAR/Hisat2)
- Output: Folder of annotation files
- Running time: ~ 32 h
eggNOGmapper
- Functional annotation of predicted genes
- Input: Predicted protein sequences (BRAKER3 output)
- Output: Annotation files
- Running time: ~ 18 h
Differential Expression Analysis
The RNA-seq analysis will be conducted by first mapping the reads to the assembly using STAR, followed by counting the reads that map to genomic features using featureCounts. DESeq2 will then be used to conduct the differential expression analysis. Hisat2 can be used instead of STAR.
STAR
- Mapping
- Input: Trimmed RNA-seq reads (FASTQ files) + Polished genome assembly (Pilon output [FASTA file])
- Output: BAM file (mapping and alignment information)
- Running time: ~ 24 h
Hisat2
- Mapping
- Input: Trimmed RNA-seq reads (FASTQ files) + Polished genome assembly (Pilon output [FASTA file])
- Output: BAM file (mapping and alignment information)
- Running time: ~ 12 h
featureCounts
- Quantification of gene expression
- Input: BAM file (STAR/Hisat2 output) + Gene annotation file (output from BRAKER3 or eggNOGmapper)
- Output: Count table containing the number of reads mapped to each gene
- Running time: ~ 12 h
DESeq2
- Differential expression analysis (R package)
- Input: Counts table (output from featureCounts)
- Output: Results table (e.g. log2 fold changes and adjusted p-values per gene) for biological analysis
- Running time: a few minutes
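The hand-over between featureCounts and DESeq2 is a plain count matrix with genes as rows and samples as columns. featureCounts writes a tab-separated table that starts with a comment line, followed by a header with six annotation columns (Geneid, Chr, Start, End, Strand, Length) and then one count column per BAM file. The sketch below extracts just the counts from that table. The function name is illustrative, and the actual analysis will be done in R with DESeq2.

```python
import csv
import io  # used in the usage example below

ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_featurecounts(handle):
    """Parse featureCounts output into (sample_names, {gene_id: [counts]})."""
    # Skip the leading comment line(s) that record the program and command.
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    first_count = len(ANNOTATION_COLUMNS)
    samples = header[first_count:]
    counts = {row[0]: [int(value) for value in row[first_count:]] for row in rows if row}
    return samples, counts


# Usage with a made-up two-sample table (gene and file names are hypothetical).
example = io.StringIO(
    "# comment line written by featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcontrol.bam\theat.bam\n"
    "g1\tchr3\t1\t100\t+\t100\t5\t7\n"
)
samples, counts = read_featurecounts(example)
```

DESeq2 expects raw integer counts, so the values are kept as integers and no normalisation is applied at this stage.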
Pipeline (graphic representation)
Below is a graphic representation of the pipeline above.
Extra analysis
If there is time, an assembly and annotation of the chloroplast genome will be conducted using the software GetOrganelle.
GetOrganelle
- Input: Illumina paired-end reads (FASTQ files)
- Output: Chloroplast assembly (FASTA files) + organelle-related assembly graph
Time management plan
The estimated times are based on the running times of the individual software tools, with some buffer time added.
| Date | Analysis | Description | Software | Time | Deadline |
|---|---|---|---|---|---|
| 2026-04-15 | Preprocessing | Quality control | FastQC | 20 min | |
| 2026-04-15 | Preprocessing | Read trimming | Trimmomatic | 2 h | |
| 2026-04-15 | Preprocessing | Quality control | FastQC | 20 min | |
| 2026-04-15 | DNA assembly | Assemble long-reads | Flye | 54 h | 2026-04-16 |
| 2026-04-18 | DNA assembly | Map short-reads | BWA | 2 h | 2026-04-16 |
| 2026-04-18 | DNA assembly | Convert SAM to BAM | SAMtools | 1 h | 2026-04-16 |
| 2026-04-18 | DNA assembly | Genome polishing | Pilon | 30 h | 2026-04-16 |
| 2026-04-21 | Assembly evaluation | Assembly quality assessment | Quast | 20 min | 2026-04-21 |
| 2026-04-21 | Assembly evaluation | Evaluation | BUSCO | 1 h | 2026-04-21 |
| 2026-04-21 | Annotation | TE family identification | RepeatModeler | 10 h | 2026-04-21 |
| 2026-04-21 | Annotation | Repeat masking | RepeatMasker | 10 h | 2026-04-28 |
| 2026-04-24 | RNA preprocessing | Quality control | FastQC | 20 min | |
| 2026-04-24 | RNA preprocessing | Read trimming | Trimmomatic | 2 h | |
| 2026-04-24 | RNA preprocessing | Quality control | FastQC | 20 min | |
| 2026-04-26 | RNA-seq mapping | Read alignment | STAR/HISAT2 | 1 h | 2026-05-05 |
| 2026-04-28 | Annotation | Gene prediction | BRAKER3 | 48 h | 2026-05-05 |
| 2026-04-30 | Annotation | Functional annotation | eggNOGmapper | 24 h | 2026-05-05 |
| 2026-05-01 | Differential Expression | Read counting | featureCounts | 12 h | 2026-05-05 |
| 2026-05-02 | Differential Expression | Differential expression analysis | DESeq2 | 1 h | 2026-05-05 |
| 2026-05-05 | Extra analysis | Chloroplast analysis | GetOrganelle | | 2026-05-08 |
Final presentations will be held 2026-05-22. The weeks between 2026-05-05 and 2026-05-22 will therefore be used to prepare and practise the presentation and, if time allows, to do the extra analysis.
DNA assembly and annotation require more time than the other steps, so it is crucial that I can start them on time.
Project Organization
Dates will follow the ISO 8601 standard and be written as YYYY-MM-DD. The data can be found after logging in to UPPMAX at the following path: /crex/proj/uppmax2026-1-61/Genome_Analysis/2_Zhou_2023/reads/. Large files will be stored on UPPMAX. GitHub will be used for documentation and version control.
The project will use the following directory structure.
├── data/
│ ├── meta_data/
│ ├── raw_data/
├── results/
│ ├── 01_preprocessing/
│ │ ├── 01_FastQC/
│ │ ├── 02_Trimmomatic/
│ │ ├── 03_FastQC_post_trim/
│ ├── 02_assembly/
│ │ ├── 01_Flye/
│ │ ├── 02_BWA/
│ │ ├── 03_SAMtools/
│ │ ├── 04_Pilon/
│ ├── 03_evaluation/
│ │ ├── 01_Quast/
│ │ ├── 02_BUSCO/
│ ├── 04_masking/
│ │ ├── 01_RepeatMasker/
│ ├── 05_annotation/
│ │ ├── 01_BRAKER3/
│ │ ├── 02_eggNOGmapper/
│ ├── 06_DEA/
│ │ ├── 01_STAR/
│ │ ├── 01_Hisat2/
│ │ ├── 02_featureCounts/
│ │ ├── 03_DESeq2/
├── code/
│ ├── 01_preprocessing/
│ │ ├── 01_FastQC/
│ │ ├── 02_Trimmomatic/
│ │ ├── 03_FastQC_post_trim/
│ ├── 02_assembly/
│ │ ├── 01_Flye/
│ │ ├── 02_BWA/
│ │ ├── 03_SAMtools/
│ │ ├── 04_Pilon/
│ ├── 03_evaluation/
│ │ ├── 01_Quast/
│ │ ├── 02_BUSCO/
│ ├── 04_masking/
│ │ ├── 01_RepeatMasker/
│ ├── 05_annotation/
│ │ ├── 01_BRAKER3/
│ │ ├── 02_eggNOGmapper/
│ ├── 06_DEA/
│ │ ├── 01_STAR/
│ │ ├── 02_featureCounts/
│ │ ├── 03_DESeq2/
└── logs/
A README.md file will describe the pipeline. Each subfolder will also contain a README file.
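The directory structure can be created in one go with a short script, which avoids typos when the same sub-folders are repeated under results/ and code/. This is a minimal sketch: the folder names are copied from the tree above, while the function names and the choice to drop an empty README.md placeholder into every folder are assumptions.

```python
from pathlib import Path

# Tool sub-folders per pipeline step, shared by results/ and code/.
STEPS = {
    "01_preprocessing": ["01_FastQC", "02_Trimmomatic", "03_FastQC_post_trim"],
    "02_assembly": ["01_Flye", "02_BWA", "03_SAMtools", "04_Pilon"],
    "03_evaluation": ["01_Quast", "02_BUSCO"],
    "04_masking": ["01_RepeatMasker"],
    "05_annotation": ["01_BRAKER3", "02_eggNOGmapper"],
    "06_DEA": ["01_STAR", "02_featureCounts", "03_DESeq2"],
}
# Sub-folders that appear only under results/ in the tree.
RESULTS_ONLY = {"06_DEA": ["01_Hisat2"]}


def layout_paths():
    """Return every leaf directory of the planned layout as a relative path."""
    paths = ["data/meta_data", "data/raw_data", "logs"]
    for parent in ("results", "code"):
        for step, tools in STEPS.items():
            extra = RESULTS_ONLY.get(step, []) if parent == "results" else []
            for tool in tools + extra:
                paths.append(f"{parent}/{step}/{tool}")
    return paths


def create_layout(root):
    """Create the directories under root, each with an empty README.md placeholder."""
    for rel in layout_paths():
        folder = Path(root) / rel
        folder.mkdir(parents=True, exist_ok=True)
        # touch() leaves an existing README.md untouched.
        (folder / "README.md").touch()
```

Calling `create_layout(".")` from the repository root builds the whole tree, and running it again is harmless because existing folders and files are left as they are.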
AI statement
Generative AI was used to create the directory structure for organising the data and to understand the inputs and outputs of the software used in the differential expression analysis.
Bibliography
BUSCO. busco.ezlab.org. (n.d.). User guide BUSCO v5.7.1. [online] Available at: https://busco.ezlab.org/busco_userguide.html#interpreting-the-results.
Zhou X, Peng T, Zeng Y, Cai Y, Zuo Q, Zhang L, Dong S and Liu Y (2023) Chromosome-level genome assembly of Niphotrichum japonicum provides new insights into heat stress responses in mosses. Front. Plant Sci. 14:1271357. doi: 10.3389/fpls.2023.1271357