Project Plan - Linafina100/GenomeAnalysis GitHub Wiki
This is the project plan for the analysis of paper II.
1. Project Aim and Research Questions:
The aim of this project is to produce a chromosome-level genome assembly of the moss Niphotrichum japonicum in order to study the genetic basis of stress resilience, primarily heat resistance. This can provide valuable insights into the evolution of resilient plants and the mechanisms behind stress tolerance.
Research Questions:
Primary Question: Which genes and gene expression patterns allow N. japonicum to survive environmental heat stress?
The following extra analyses are integrated into the workflow:
- Scaffold Assembly with Hi-C Data: Hi-C Illumina data from the whole-genome dataset will be used together with the tool YaHS to identify contigs in close physical proximity. This enables the organization of assembled contigs into chromosome-level pseudomolecules.
- Chloroplast Genome Assembly and Annotation: The chloroplast genome will be assembled using the tool GetOrganelle. This analysis requires the use of whole-genome sequencing reads, as organellar DNA is not guaranteed to be present in the chromosome 3 subset. The assembled chloroplast genome will then be annotated to study a potential role in heat stress response.
2. Project Overview
| Data Type | Biological Source | Purpose | Sequencing Technology | Read Type & Notes |
|---|---|---|---|---|
| DNA (WGS) | N. japonicum moss | Chromosome 3 assembly | Nanopore | Long reads, higher error rate, useful for contig assembly |
| DNA (WGS) | N. japonicum moss | Assembly polishing | Illumina | Short paired-end reads, high accuracy |
| RNA-seq | N. japonicum moss | Heat stress (gene expression analysis on chr3) | Illumina | Short reads mapped to chromosome 3 assembly |
| Hi-C | N. japonicum moss | Chromatin interaction, scaffolding (extra analysis) | Illumina | Paired-end reads, captures 3D genome structure |
| DNA (WGS) | N. japonicum moss | Chloroplast genome assembly (extra analysis) | Illumina | Whole-genome reads required to capture organellar DNA |
The primary analysis in this project is restricted to chromosome 3, and therefore only chromosome-specific datasets are used for assembly, polishing, and expression analysis. Whole-genome datasets are used only for the additional analyses (chloroplast genome assembly and Hi-C scaffolding), where broader genomic context is required.
3. Project Workflow Overview
The table below outlines the core pipeline and the tools used at each step.
Pre-processing comes first: the Illumina reads will be checked with FastQC and then trimmed with Trimmomatic, which removes adapter sequences and low-quality bases. A second FastQC run after trimming will verify that the reads are ready for downstream use.
The initial assembly of chromosome 3 will be produced from the chromosome-specific Nanopore reads with either Flye or Canu, while the whole-genome datasets are reserved for the complementary analyses (chloroplast genome assembly and Hi-C-based scaffolding). Flye is faster but will require more computational power than Canu. The draft assembly will then be polished with Pilon using the Illumina short reads, and its quality will be assessed with BUSCO and QUAST before annotation begins.
As a pre-annotation step, repetitive elements will be identified and masked with RepeatModeler and RepeatMasker, which prevents repetitive regions from being incorrectly predicted as genes. BRAKER3 will then be used for structural annotation and eggNOG-mapper for functional annotation of the masked assembly.
Finally, differential expression analysis will be performed by mapping the RNA-seq reads to the polished assembly with STAR, a splice-aware aligner, which is needed because the genes of this eukaryotic species contain introns. Mapped reads will be counted per gene with featureCounts, and changes in gene expression will be identified with DESeq2. This allows expression to be compared across the heat stress conditions.
| Step | Task | Software / Tools |
|---|---|---|
| 1. Pre-processing | Quality control and trimming of adapters/low-quality bases. | FastQC & Trimmomatic |
| 2. Assembly | De novo genome assembly using long Nanopore reads. | Flye or Canu |
| 3. Polishing | Improving draft assembly accuracy using Illumina short reads. | Pilon |
| 4. Assessment | Evaluating assembly completeness and quality metrics. | BUSCO, QUAST |
| 5. Masking & Annotation | Masking repeats, followed by structural gene prediction and functional annotation. | RepeatModeler, RepeatMasker, BRAKER3 & eggNOG-mapper |
| 6. Expression | Differential expression analysis across heat stress conditions. | STAR, featureCounts, & DESeq2 |
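
The assessment step relies on QUAST, which reports contiguity metrics such as N50. QUAST computes this value itself, so the following is only a minimal sketch of the definition to help interpret its report. The function name and the example contig lengths are illustrative assumptions, not project data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths in base pairs, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```

For the example lengths the total is 1500 bp, and the two longest contigs (500 + 400) already cover more than half of it, so the N50 is 400.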
4. Additional Analyses
The following extra analyses are integrated into the workflow:
- Scaffold Assembly with Hi-C Data: Hi-C Illumina data and the tool YaHS will be used to identify contigs in close physical proximity. This allows the placement of contigs into pseudomolecules representative of the actual chromosomes.
- Assembly and Annotation of the Chloroplast Genome: The chloroplast genome will be assembled from the whole-genome raw data using GetOrganelle and then annotated. This will allow me to identify whether specific chloroplast genes are differentially expressed during heat stress.
5. Data Management and Organization
- Storage: I will monitor my disk usage against the 32 GB limit of the UPPMAX home directory.
- Large Files: I will use symbolic links (`ln -s`) to access raw data instead of copying files. Large data files will be compressed.
- Structure: My working directory will be organized into separate `analyses/`, `code/`, and `data/` folders with numerical prefixes. Smaller data files, such as final results, figures and text, will be included in my repository while larger files will be ignored.
- Metadata: I will maintain a structured CSV table to track sample variables and SRA identifiers.
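
The folder layout, the symbolic links, and the metadata table described above could be set up with a short script. This is a minimal sketch rather than the project's actual setup: the subfolder names (`01_preprocessing`, `00_raw`), the function names, and the metadata columns are all illustrative assumptions.

```python
import csv
import os
import tempfile
from pathlib import Path


def set_up_project(root, raw_files, subdirs):
    """Create analyses/, code/ and data/, then symlink raw files into data/00_raw."""
    root = Path(root)
    for top in ("analyses", "code", "data"):
        (root / top).mkdir(parents=True, exist_ok=True)
    # Numerically prefixed analysis folders (names are assumptions).
    for name in subdirs:
        (root / "analyses" / name).mkdir(exist_ok=True)

    raw_dir = root / "data" / "00_raw"  # hypothetical folder name
    raw_dir.mkdir(exist_ok=True)
    links = []
    for src in raw_files:
        src = Path(src).resolve()
        link = raw_dir / src.name
        if not link.is_symlink() and not link.exists():
            os.symlink(src, link)  # same effect as `ln -s src link`
        links.append(link)
    return links


def write_metadata(path, rows):
    """Write a sample sheet; the column names are assumed, not prescribed."""
    fields = ["sample", "data_type", "condition", "sra_id"]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so no real data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        fake_reads = Path(tmp) / "reads.fastq.gz"  # placeholder file
        fake_reads.write_bytes(b"")
        project = Path(tmp) / "project"
        for link in set_up_project(project, [fake_reads], ["01_preprocessing"]):
            print(link, "->", os.readlink(link))
        write_metadata(project / "data" / "metadata.csv", [])
```

Because only links are stored, the raw files take up no additional space in the home directory.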
6. Timeframe and Bottlenecks
I have accounted for the following long-running tasks to meet the May 22nd deadline:
| Analysis | Software | Estimated Running Time |
|---|---|---|
| Long Read Assembly | Canu | ~17 hours (4 cores) |
| Assembly Polishing | Pilon | ~12 hours (2 cores) |
| Annotation | BRAKER3 | ~3–4 hours (16 cores) |
| RNA Mapping | STAR | ~12 hours for 6 samples |
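
To compare these estimates against a compute allocation, the running times can be converted to core-hours (hours multiplied by cores). The sketch below makes two assumptions: it uses the upper bound of 4 hours for BRAKER3, and it leaves out STAR because the table gives no core count for it.

```python
# (software, estimated hours, cores) taken from the table above.
# BRAKER3 uses the upper bound of its 3-4 hour range (an assumption).
ESTIMATES = [
    ("Canu", 17, 4),
    ("Pilon", 12, 2),
    ("BRAKER3", 4, 16),
]


def core_hours(estimates):
    """Return a mapping of software name to hours * cores."""
    return {name: hours * cores for name, hours, cores in estimates}


budget = core_hours(ESTIMATES)
total = sum(budget.values())
for name, value in budget.items():
    print(f"{name}: {value} core-hours")
print(f"Total: {total} core-hours")
```

With these figures, Canu accounts for 68 core-hours, Pilon for 24, and BRAKER3 for up to 64, giving 156 core-hours before RNA mapping is added.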
Key Checkpoints:
- April 15: Genome assembly completed.
- April 25: Structural and functional annotation completed.
- May 15: Differential Expression analysis finalized.
- May 21: Wiki and extra analyses fully documented.