Project Plan - Linafina100/GenomeAnalysis GitHub Wiki

This is the project plan for the analysis of paper II.

1. Project Aim and Research Questions

The aim of this project is to produce a chromosome-level genome assembly of the moss Niphotrichum japonicum in order to study the genetic basis of stress resilience, primarily heat resistance. This can provide valuable insights into the evolution of resilient plants and the mechanisms behind stress tolerance.

Research Questions:

Primary Question: Which genes and gene expression patterns allow N. japonicum to survive environmental heat stress?

The following extra analyses are integrated into the workflow:

Scaffold Assembly with Hi-C Data: Hi-C Illumina data from the whole-genome dataset will be used together with the tool YaHS to identify contigs in close physical proximity. This enables the organization of assembled contigs into chromosome-level pseudomolecules.
Chloroplast Genome Assembly and Annotation: The chloroplast genome will be assembled using the tool GetOrganelle. This analysis requires the use of whole-genome sequencing reads, as organellar DNA is not guaranteed to be present in the chromosome 3 subset. The assembled chloroplast genome will then be annotated to study a potential role in heat stress response.

2. Project Overview

Data Type	Biological Source	Purpose	Sequencing Technology	Read Type & Notes
DNA (WGS)	N. japonicum moss	Chromosome 3 assembly	Nanopore	Long reads, higher error rate, useful for contig assembly
DNA (WGS)	N. japonicum moss	Assembly polishing	Illumina	Short paired-end reads, high accuracy
RNA-seq	N. japonicum moss	Heat stress (gene expression analysis on chr3)	Illumina	Short reads mapped to chromosome 3 assembly
Hi-C	N. japonicum moss	Chromatin interaction, scaffolding (extra analysis)	Illumina	Paired-end reads, captures 3D genome structure
DNA (WGS)	N. japonicum moss	Chloroplast genome assembly (extra analysis)	Illumina	Whole-genome reads required to capture organellar DNA

The primary analysis in this project is restricted to chromosome 3, so only chromosome-specific datasets are used for assembly, polishing, and expression analysis. Whole-genome datasets are used only for the extra analyses, namely chloroplast genome assembly and Hi-C scaffolding, where broader genomic context is required.

3. Project Workflow Overview

The table below outlines the core pipeline and the tools used at each step. Pre-processing comes first: the Illumina reads go through quality control with FastQC and are then trimmed with Trimmomatic, which removes adapters and low-quality bases. A second FastQC run after trimming verifies that the reads are ready for downstream use.

The initial assembly of chromosome 3 will be produced from the chromosome-specific Nanopore reads with either Flye or Canu, while the whole-genome datasets are reserved for complementary analyses such as chloroplast genome assembly and Hi-C-based scaffolding. Flye is faster but may require more computational resources than Canu. The draft assembly will then be polished with Pilon using the Illumina short reads, and its quality will be assessed with BUSCO and QUAST before annotation.

As a pre-annotation step, repetitive elements will be identified and masked with RepeatMasker, which prevents repetitive regions from being incorrectly predicted as genes. BRAKER3 will then be used for structural annotation and eggNOG-mapper for functional annotation of the masked assembly.

Finally, for differential expression analysis, the RNA-seq reads will be mapped to the polished assembly with STAR, a splice-aware aligner suited to a eukaryotic species with exon-intron gene structure. Mapped reads will be counted per gene with featureCounts, and changes in gene expression will be identified with DESeq2, allowing comparison across heat stress conditions.

Step	Task	Software / Tools
1. Pre-processing	Quality control and trimming of adapters/low-quality bases.	FastQC & Trimmomatic
2. Assembly	De novo genome assembly using long Nanopore reads.	Flye or Canu
3. Polishing	Improving draft assembly accuracy using Illumina short reads.	Pilon
4. Assessment	Evaluating assembly completeness and quality metrics.	BUSCO, QUAST
5. Masking & Annotation	Masking repeats before doing structural gene prediction and functional biological assignment.	RepeatMasker, BRAKER3 & eggNOG-mapper
6. Expression	Differential expression analysis across heat stress conditions.	STAR, featureCounts, & DESeq2
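
The assembly, polishing, and assessment steps in the table can be sketched as a small Python helper that only builds the command lines and does not run anything. This is a minimal sketch under stated assumptions: the file names, the numbered output folders, the location of `pilon.jar`, and the thread count are hypothetical placeholders, Flye is picked arbitrarily over Canu, and the flags should be checked against each tool's help output for the installed version.

```python
import shlex
from pathlib import Path


def build_core_commands(nanopore_reads, illumina_bam, out_dir, threads=4):
    """Return (step, argv) pairs for assembly, polishing and assessment.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    out = Path(out_dir)
    # Flye writes its final contigs to assembly.fasta inside --out-dir.
    draft = out / "02_assembly" / "assembly.fasta"
    # Pilon writes <--output>.fasta inside --outdir.
    polished = out / "03_polishing" / "chr3_pilon.fasta"
    return [
        ("assembly", [
            "flye", "--nano-raw", str(nanopore_reads),
            "--out-dir", str(draft.parent), "--threads", str(threads),
        ]),
        ("polishing", [
            # illumina_bam: short reads already aligned to the draft and sorted/indexed.
            "java", "-jar", "pilon.jar", "--genome", str(draft),
            "--frags", str(illumina_bam),
            "--output", polished.stem, "--outdir", str(polished.parent),
        ]),
        ("assessment", [
            "quast.py", str(polished), "-o", str(out / "04_assessment" / "quast"),
        ]),
    ]


if __name__ == "__main__":
    for step, argv in build_core_commands("chr3_nanopore.fastq.gz", "chr3_illumina.bam", "analyses"):
        print(f"# {step}")
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes it easy to print them into a job script for review first, and to pass them to `subprocess.run` later without shell quoting problems.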

4. Additional Analyses

The following extra analyses are integrated into the workflow:

Scaffold Assembly with Hi-C Data: Hi-C Illumina data and the tool YaHS will be used to identify contigs in close physical proximity. This allows the contigs to be placed into pseudomolecules representative of the actual chromosomes.
Assembly and Annotation of the Chloroplast Genome: The chloroplast genome will be assembled from the whole-genome raw reads using GetOrganelle and then annotated. This will allow me to determine whether specific chloroplast genes are differentially expressed during heat stress.
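
The two extra analyses above could be prepared along the following lines. This is only a sketch: the input file names and output prefixes are hypothetical, the Hi-C BAM is assumed to contain Hi-C reads already aligned to the contigs, and `embplant_pt` (GetOrganelle's land-plant plastome target) is assumed to be the appropriate choice for a moss. The commands are built but not executed.

```python
def build_extra_commands(contigs_fa, hic_bam, wgs_r1, wgs_r2, threads=4):
    """Return argv lists for Hi-C scaffolding and chloroplast assembly.

    All file names and output prefixes are hypothetical placeholders.
    """
    return {
        # YaHS reads the FASTA index (.fai) of the contigs, so index them first.
        "index_contigs": ["samtools", "faidx", contigs_fa],
        # -o sets the prefix of the YaHS output files.
        "hic_scaffolding": ["yahs", "-o", "chr3_yahs", contigs_fa, hic_bam],
        # Whole-genome paired-end reads are used so that organellar reads are present.
        "chloroplast_assembly": [
            "get_organelle_from_reads.py",
            "-1", wgs_r1, "-2", wgs_r2,
            "-o", "chloroplast_getorganelle",
            "-F", "embplant_pt",
            "-t", str(threads),
        ],
    }


if __name__ == "__main__":
    commands = build_extra_commands(
        "chr3_polished.fasta", "hic_to_contigs.bam",
        "wgs_R1.fastq.gz", "wgs_R2.fastq.gz",
    )
    for name, argv in commands.items():
        print(name, "->", " ".join(argv))
```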

5. Data Management and Organization

Storage: I will monitor my disk usage against the 32 GB UPPMAX home directory limit.
Large Files: I will use symbolic links (ln -s) to access raw data instead of copying files. Large data files will be compressed.
Structure: My working directory will be organized into separate analyses/, code/, and data/ folders with numerical prefixes. Smaller data files, such as final results, figures and text will be included in my repository while larger files will be ignored.
Metadata: I will maintain a structured CSV table to track sample variables and SRA identifiers.
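
The linking and metadata points above can be illustrated with a short Python sketch. It assumes a POSIX file system; the function names, the `01_raw` folder, and the metadata columns (`sample`, `condition`, `sra_id`) are hypothetical and would be replaced by the project's real layout and sample sheet.

```python
import csv
import os
from pathlib import Path


def link_raw_data(raw_files, data_dir):
    """Create symbolic links in data_dir that point at the raw files (no copying)."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in map(Path, raw_files):
        link = data_dir / src.name
        if not link.is_symlink() and not link.exists():
            os.symlink(src.resolve(), link)  # same effect as: ln -s <src> <link>
        links.append(link)
    return links


def write_metadata(rows, csv_path, fields=("sample", "condition", "sra_id")):
    """Write one row per sample to a CSV file with a fixed header."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fields))
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `link_raw_data(["/path/to/raw/reads.fastq.gz"], "data/01_raw")` leaves the raw file where it is and adds only a link to the working directory, so nothing large counts against the home directory quota.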

6. Timeframe and Bottlenecks

I have accounted for the following long-running tasks to meet the May 22nd deadline:

Analysis	Software	Estimated Running Time
Long Read Assembly	Canu	~17 hours (4 cores)
Assembly Polishing	Pilon	~12 hours (2 cores)
Annotation	BRAKER3	~3–4 hours (16 cores)
RNA Mapping	STAR	~12 hours for 6 samples
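
As a rough budget check, the estimates in the table can be summed into wall-clock hours and core-hours. This sketch takes the table values at face value and assumes the jobs run one after another; STAR is listed without a core count, so it is counted only towards wall-clock time.

```python
# Running-time estimates copied from the table above: (tool, hours_low, hours_high, cores).
ESTIMATES = [
    ("Canu", 17, 17, 4),
    ("Pilon", 12, 12, 2),
    ("BRAKER3", 3, 4, 16),
    ("STAR", 12, 12, None),  # no core count given in the table
]


def summarize(estimates):
    """Return wall-clock and core-hour ranges, assuming sequential jobs."""
    wall_low = sum(low for _, low, _, _ in estimates)
    wall_high = sum(high for _, _, high, _ in estimates)
    core_low = sum(low * cores for _, low, _, cores in estimates if cores is not None)
    core_high = sum(high * cores for _, _, high, cores in estimates if cores is not None)
    return {"wall_hours": (wall_low, wall_high), "core_hours": (core_low, core_high)}


summary = summarize(ESTIMATES)
print(summary)
```

Under these assumptions the long-running jobs add up to about 44 to 45 hours of wall-clock time and 140 to 156 core-hours, excluding STAR from the core-hour figure.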

Key Checkpoints:

April 15: Genome assembly completed.
April 25: Structural and functional annotation completed.
May 15: Differential expression analysis finalized.
May 21: Wiki and extra analyses fully documented.