Project plan - Mathilda-the-G/Genome-analysis-project GitHub Wiki

Aim of the Project

The aim of the project is to perform DNA assembly, structural and functional annotation, differential expression analysis, and evaluation of chromosome 3 of the species N. japonicum, which is about 16 Mbp long. Once these steps are complete, the project will answer which genes and proteins are expressed on chromosome 3 of N. japonicum, and which proteins are under- or overexpressed.

Workflow

Pre-processing

Both the DNA and the RNA reads will be pre-processed before being used in further steps: they are run through FastQC, then Trimmomatic, and then FastQC again. This assesses the quality of the raw reads, trims them, and assesses the quality again after trimming.
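
The three pre-processing steps can be sketched as a small Python helper that builds the command lines for one paired-end sample. This is only a sketch under assumptions: the file names and output folders are hypothetical, the `trimmomatic` wrapper command is assumed to be on the PATH (some installations use `java -jar trimmomatic.jar` instead), and the trimming settings are the ones from the example in the Trimmomatic manual, not values chosen for this project.

```python
from pathlib import Path


def preprocessing_commands(r1, r2, out_dir, adapters, threads=2):
    """Build the FastQC -> Trimmomatic -> FastQC commands for one paired-end sample."""
    out = Path(out_dir)
    trim_dir = out / "trimmomatic"
    # Hypothetical output names for the reads that keep / lose their mate.
    paired = [str(trim_dir / f"R{i}.paired.fastq.gz") for i in (1, 2)]
    unpaired = [str(trim_dir / f"R{i}.unpaired.fastq.gz") for i in (1, 2)]
    return [
        # 1. Quality report on the raw reads.
        ["fastqc", "-t", str(threads), "-o", str(out / "fastqc_raw"), r1, r2],
        # 2. Adapter and quality trimming (settings from the Trimmomatic manual example).
        ["trimmomatic", "PE", "-threads", str(threads), r1, r2,
         paired[0], unpaired[0], paired[1], unpaired[1],
         f"ILLUMINACLIP:{adapters}:2:30:10", "LEADING:3", "TRAILING:3",
         "SLIDINGWINDOW:4:15", "MINLEN:36"],
        # 3. Quality report on the trimmed reads that are still paired.
        ["fastqc", "-t", str(threads), "-o", str(out / "fastqc_trimmed"), *paired],
    ]


if __name__ == "__main__":
    # Hypothetical input names, for illustration only.
    for cmd in preprocessing_commands("DNA_1.fastq.gz", "DNA_2.fastq.gz",
                                      "analysis/preprocessing", "adapters.fa"):
        print(" ".join(cmd))
```

The printed lines can be pasted into a SLURM script. FastQC expects its output folder to exist already, so the folders would need to be created first.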

DNA Assembly

To assemble the genome, Flye will be run on the Nanopore long reads for ~48 h. After assembly, the pre-processed Illumina short reads must be mapped to the Flye assembly with BWA, which is expected to take ~1 h. The mapped reads are then used to polish the assembly with Pilon, which is expected to run for ~24 h. After polishing, the assembly will be masked with RepeatMasker, taking ~8 h. The masked and polished assembly will then be evaluated using QUAST and BUSCO, taking ~15 min each. After the evaluation, the assembly will be structurally annotated using BRAKER3 (~24 h) and functionally annotated using EggNOGmapper (~18 h). BRAKER3 uses RNA-seq data to generate the gene structure annotation, so RNA mapping will be performed before the structural annotation.
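
The assembly and polishing steps could be chained roughly as in the sketch below, which only builds the command lines in the order described above. Several things here are assumptions and not part of the plan: the file names are hypothetical, `--nano-raw` assumes regular (not high-accuracy) Nanopore reads, `samtools` is added because Pilon needs a sorted and indexed BAM file, and the `pilon` wrapper command is assumed to be available (otherwise `java -jar pilon.jar`).

```python
from pathlib import Path


def assembly_commands(nanopore, r1, r2, work_dir, threads=4):
    """Build the Flye -> BWA -> samtools -> Pilon commands in execution order."""
    work = Path(work_dir)
    # Flye writes its final contigs to assembly.fasta inside --out-dir.
    assembly = str(work / "flye" / "assembly.fasta")
    sam = str(work / "bwa" / "short_reads.sam")          # hypothetical name
    bam = str(work / "bwa" / "short_reads.sorted.bam")   # hypothetical name
    return [
        # Long-read assembly.
        ["flye", "--nano-raw", nanopore, "--out-dir", str(work / "flye"),
         "--threads", str(threads)],
        # Index the assembly, then map the trimmed Illumina reads to it.
        ["bwa", "index", assembly],
        ["bwa", "mem", "-t", str(threads), "-o", sam, assembly, r1, r2],
        # Pilon needs a coordinate-sorted, indexed BAM file.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["samtools", "index", bam],
        # Polish the assembly with the mapped short reads.
        ["pilon", "--genome", assembly, "--frags", bam,
         "--outdir", str(work / "pilon"), "--output", "polished"],
    ]


if __name__ == "__main__":
    # Hypothetical input names, for illustration only.
    for cmd in assembly_commands("nanopore.fastq.gz", "R1.paired.fastq.gz",
                                 "R2.paired.fastq.gz", "analysis/assembly"):
        print(" ".join(cmd))
```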

Differential Expression Analysis

After the RNA reads are pre-processed, they will be mapped with STAR, which will take ~30 min. The mapped reads will be used for the structural annotation, and the output will later be used for read counting. Read counting will use featureCounts and is expected to run for ~12 h. The resulting counts will then be used for differential expression analysis with DESeq2, taking ~1 h.
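
Between read counting and DESeq2, the count table has to be reduced to one row per gene and one column per sample. As a minimal sketch, assuming the standard featureCounts layout (a `#` comment line, then a header whose first six columns are Geneid, Chr, Start, End, Strand and Length), the table could be read like this. The genes, sample names and counts in the example are made up for illustration.

```python
import csv
import io


def read_featurecounts(handle):
    """Return (sample_names, {gene_id: [count per sample]}) from featureCounts output."""
    # Skip the leading comment line that records the program call.
    lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(lines, delimiter="\t")
    header = next(rows)
    samples = header[6:]  # everything after the six annotation columns
    counts = {row[0]: [int(value) for value in row[6:]] for row in rows if row}
    return samples, counts


# Hypothetical two-gene, two-sample table, for illustration only.
EXAMPLE = (
    "# Program:featureCounts; Command: ...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcontrol.bam\ttreated.bam\n"
    "g1\tchr3\t100\t900\t+\t801\t12\t30\n"
    "g2\tchr3\t1500\t2100\t-\t601\t0\t7\n"
)

samples, counts = read_featurecounts(io.StringIO(EXAMPLE))
print(samples)
print(counts)
```

DESeq2 expects raw, unnormalized integer counts, which is why the values are kept as integers here.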

Bottlenecks

The primary bottleneck in this project will be software with long running times. Examples are Flye and Pilon, which require ~48 h and ~24 h respectively, and whose results are needed for later steps. While these tools are running, steps that do not depend on them will be carried out, such as pre-processing the Illumina short reads and the RNA reads, as well as writing SLURM scripts for the next steps.
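
To see how much the long-running steps dominate, the estimated runtimes can be combined with the step order into an earliest-finish calculation. The sketch below is only an estimate under assumptions: the dependencies follow the order described in the workflow, STAR is assumed to map against the masked assembly, read counting is assumed to need both the STAR and BRAKER3 output, and pre-processing is left out because no runtime is given for it. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Estimated runtimes in hours, taken from the workflow description.
HOURS = {
    "flye": 48, "bwa": 1, "pilon": 24, "repeatmasker": 8,
    "quast": 0.25, "busco": 0.25, "star": 0.5,
    "braker3": 24, "eggnogmapper": 18, "featurecounts": 12, "deseq2": 1,
}

# step -> steps that must finish first (partly assumed, see the text above).
DEPENDS_ON = {
    "flye": set(),
    "bwa": {"flye"},
    "pilon": {"bwa"},
    "repeatmasker": {"pilon"},
    "quast": {"repeatmasker"},
    "busco": {"repeatmasker"},
    "star": {"repeatmasker"},
    "braker3": {"quast", "busco", "star"},
    "eggnogmapper": {"braker3"},
    "featurecounts": {"star", "braker3"},
    "deseq2": {"featurecounts"},
}


def earliest_finish(hours, depends_on):
    """Earliest finish time of each step if independent steps run in parallel."""
    finish = {}
    for step in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on[step]), default=0.0)
        finish[step] = start + hours[step]
    return finish


finish = earliest_finish(HOURS, DEPENDS_ON)
for step, hour in sorted(finish.items(), key=lambda item: item[1]):
    print(f"{step:14s} done after {hour:6.2f} h")
```

With these numbers the last step finishes after about 123.5 h of pure runtime, and Flye and Pilon alone account for 72 h of that, so queue time and reruns of those two steps matter the most.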

Data Management Plan

The data to be handled come in various file types, such as .fasta and .fastq files. Many of these files will be large, and some will be compressed automatically because of their size. To ease file handling, all large files will be kept compressed where possible, since most of the software can decompress them easily.
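
Custom scripts can work on the compressed files directly as well. The sketch below counts the reads in a FASTQ file without decompressing it to disk first. It assumes the usual four-line FASTQ records, and the two demo reads are made up for illustration.

```python
import gzip
import tempfile
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a FASTQ file, plain or gzip-compressed (4 lines per read)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Write a tiny, hypothetical compressed FASTQ file and count its reads.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq.gz"
    with gzip.open(demo, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    n_reads = count_fastq_reads(demo)

print(n_reads)
```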

File structure

Data will be organized according to the following file structure:

Genome-analysis-project/
│
├── code/                  # All SLURM scripts and other code
│
├── analysis/
│
│   ├── preprocessing/ 
│   │   ├── fastqc/
│   │   │   ├── DNA_1/
│   │   │   ├── DNA_2/
│   │   │   ├── RNA_1/
│   │   │   ├── RNA_2/
│   │   │   └── log_files/
│   │   │
│   │   └── trimmomatic/
│   │       ├── DNA_1/
│   │       ├── RNA_1/
│   │       └── log_files/
│   │
│   ├── assembly/
│   │   ├── flye/
│   │   │   └── log_files/
│   │   ├── bwa/
│   │   │   └── log_files/
│   │   ├── pilon/
│   │   │   └── log_files/
│   │   ├── busco/
│   │   │   └── log_files/
│   │   ├── quast/
│   │   │   └── log_files/
│   │   ├── repeatmasker/
│   │   │   └── log_files/
│   │   ├── braker3/
│   │   │   └── log_files/
│   │   └── eggnogmapper/
│   │       └── log_files/
│   │
│   └── de_analysis/
│       ├── star/
│       │   └── log_files/
│       ├── readcounts/
│       └── deseq2/
│           └── log_files/
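
The folder structure above could be created in one go with a short script, so that every SLURM job finds its output and log folders in place. This is a sketch only. The folder names are copied from the tree, while the function name and the temporary root folder used in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

ASSEMBLY_TOOLS = ["flye", "bwa", "pilon", "busco", "quast",
                  "repeatmasker", "braker3", "eggnogmapper"]

# Leaf folders of the tree; parent folders are created automatically.
LAYOUT = (
    ["code"]
    + [f"analysis/preprocessing/fastqc/{name}"
       for name in ("DNA_1", "DNA_2", "RNA_1", "RNA_2", "log_files")]
    + [f"analysis/preprocessing/trimmomatic/{name}"
       for name in ("DNA_1", "RNA_1", "log_files")]
    + [f"analysis/assembly/{tool}/log_files" for tool in ASSEMBLY_TOOLS]
    + ["analysis/de_analysis/star/log_files",
       "analysis/de_analysis/readcounts",
       "analysis/de_analysis/deseq2/log_files"]
)


def create_layout(root):
    """Create the project folders under root; existing folders are left alone."""
    root = Path(root)
    for relative in LAYOUT:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return root


# Demo in a temporary folder so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    project = create_layout(Path(tmp) / "Genome-analysis-project")
    n_dirs = sum(1 for path in project.rglob("*") if path.is_dir())

print(n_dirs)
```

Because `exist_ok=True` is set, the script can be rerun safely when a new tool folder is added to the list.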

Extra Analysis

The extra analysis will be Hi-C scaffolding of the assembly using YaHS, which is expected to take ~1 h. This step is performed after all of the basic analyses and uses the improved assembly as input.

Time Plan

| Completion date | Task | Software | Internal Deadline | Official Deadline |
|---|---|---|---|---|
| 10-04-2026 | Project Plan | GitHub | 10-04-2026 | 10-04-2026 |
| 10-04-2026 | Pre-processing DNA | FastQC | 10-04-2026 | 15-04-2026 |
| 10-04-13 | Pre-processing DNA | Trimmomatic | 11-04-2026 | 15-04-2026 |
| 14-04-2026 | Pre-processing DNA | FastQC | 12-04-2026 | 15-04-2026 |
| 13-04-2026 | Genome Assembly | Flye | 13-04-2026 | 15-04-2026 |
| 16-04-2026 | Mapping short reads to assembly | BWA | 16-04-2026 | 21-04-2026 |
| 17-04-2026 | Genome Polishing | Pilon | 17-04-2026 | 21-04-2026 |
| 20-04-2026 | Genome Evaluation | BUSCO | 19-04-2026 | 24-04-2026 |
| 20-04-2026 | Assembly Evaluation | QUAST | 20-04-2026 | 24-04-2026 |
| 21-04-2026 | Genome database | RepeatModeler | 21-04-2026 | 24-04-2026 |
| 21-04-2026 | Genome Masking | RepeatMasker | 22-04-2026 | 24-04-2026 |
| 24-04-2026 | Pre-processing RNA | FastQC | 23-04-2026 | 28-04-2026 |
| 24-04-2026 | Pre-processing RNA | Trimmomatic | 23-04-2026 | 28-04-2026 |
| 24-04-2026 | Pre-processing RNA | FastQC | 23-04-2026 | 28-04-2026 |
| 24-04-2026 | RNA mapping | STAR | 24-04-2026 | 16-05-2026 |
| | Genome Annotation | BRAKER3 | 25-04-2026 | 13-05-2026 |
| | Genome Annotation | EggNOGmapper | 26-04-2026 | 13-05-2026 |
| | Read Counting | 02-05-2016 | 27-04-2026 | 13-05-2026 |
| | DE analysis | DESeq2 | 04-05-2026 | 19-05-2026 |
| | HiC-scaffolding (Extra analysis) | YaHS | 06-05-2026 | 22-05-2026 |
| | Wiki | GitHub | 15-05-2026 | 22-05-2026 |
| | Project Presentation | | 23-05-2026 | 26-05-2026 |