Project plan - Mathilda-the-G/Genome-analysis-project GitHub Wiki
Aim of the Project
The aim of the project is to perform DNA assembly, structural and functional annotation, differential expression analysis, and evaluation for chromosome 3 of the species N. japonicum, which is about 16 Mbp long. Once these steps are complete, the project will answer which genes and proteins are expressed on chromosome 3 of N. japonicum and which proteins are under- or overexpressed.
Workflow
Pre-processing
Both the DNA and the RNA reads will be pre-processed before they are used in further steps: they are run through FastQC, trimmed with Trimmomatic, and then run through FastQC again. This assesses the quality of the raw reads, trims them, and checks the quality again after trimming.
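The three pre-processing steps can be sketched as command lines assembled in Python. This is a minimal sketch rather than the project's actual scripts: the file names, directory names, adapter file, thread count, and trimming settings (`ILLUMINACLIP:...:2:30:10`, `SLIDINGWINDOW:4:15`, `MINLEN:36`) are placeholder assumptions, and it assumes paired-end reads and that `fastqc` and a `trimmomatic` wrapper script are available on the `PATH`. The commands are only built, not executed.

```python
from pathlib import Path


def fastqc_cmd(reads, out_dir, threads=2):
    """FastQC call: -o is the report directory, -t the number of threads."""
    return ["fastqc", "-o", str(out_dir), "-t", str(threads), *[str(r) for r in reads]]


def trimmomatic_pe_cmd(r1, r2, out_dir, adapters, threads=2):
    """Paired-end Trimmomatic call; returns (command, paired output files)."""
    out_dir = Path(out_dir)
    r1_p, r1_u, r2_p, r2_u = (
        str(out_dir / f"{name}.fastq.gz")
        for name in ("R1_paired", "R1_unpaired", "R2_paired", "R2_unpaired")
    )
    cmd = [
        "trimmomatic", "PE", "-threads", str(threads),
        str(r1), str(r2), r1_p, r1_u, r2_p, r2_u,
        f"ILLUMINACLIP:{adapters}:2:30:10",  # placeholder adapter-clipping settings
        "SLIDINGWINDOW:4:15",                # placeholder sliding-window quality cut
        "MINLEN:36",                         # placeholder minimum read length
    ]
    return cmd, [r1_p, r2_p]


def preprocessing_commands(r1, r2, adapters, qc_raw_dir, trim_dir, qc_trimmed_dir):
    """FastQC on the raw reads, Trimmomatic, then FastQC on the trimmed reads."""
    trim_cmd, trimmed = trimmomatic_pe_cmd(r1, r2, trim_dir, adapters)
    return [
        fastqc_cmd([r1, r2], qc_raw_dir),
        trim_cmd,
        fastqc_cmd(trimmed, qc_trimmed_dir),
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` or joined into a line of a SLURM script.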
DNA Assembly
To assemble the genome, Flye will be run on the Nanopore long reads for ~48 h. After the genome is assembled, the pre-processed Illumina short reads will be mapped to the Nanopore assembly with BWA, which is expected to take ~1 h; this alignment is the input Pilon needs. The assembly will then be polished with Pilon, which is expected to run for ~24 h. After polishing, the assembly will be masked with RepeatMasker, taking ~8 h. The masked and polished assembly will then be evaluated with QUAST and BUSCO, taking ~15 min each. After the evaluation, the assembly will be structurally annotated with BRAKER3 (~24 h) and functionally annotated with eggNOG-mapper (~18 h). BRAKER3 uses RNA-seq data to generate the gene structure annotation, so RNA mapping will be performed before the structural annotation.
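The assembly and polishing chain described above can be sketched as a list of shell command strings. This is a sketch under stated assumptions, not the project's scripts: it assumes Flye is given the Nanopore long reads (Flye is a long-read assembler) with the `--nano-raw` read type, that `samtools` is used to sort and index the BWA alignment (Pilon expects a sorted, indexed BAM; samtools is not named in the plan), and that a `pilon` wrapper script is on the `PATH`. All file and directory names are hypothetical.

```python
import shlex


def assembly_polishing_commands(nanopore_reads, illumina_r1, illumina_r2, out_dir, threads=4):
    """Build (not run) the Flye -> BWA -> Pilon commands as shell strings."""
    q = shlex.quote
    flye_dir = f"{out_dir}/flye"
    assembly = f"{flye_dir}/assembly.fasta"  # Flye writes assembly.fasta into its output directory
    bam = f"{out_dir}/bwa/illumina_to_assembly.bam"  # hypothetical file name
    return [
        # Long-read assembly (read type --nano-raw is an assumption)
        f"flye --nano-raw {q(nanopore_reads)} --out-dir {q(flye_dir)} --threads {threads}",
        # Index the draft assembly, then map the trimmed Illumina reads to it
        f"bwa index {q(assembly)}",
        f"bwa mem -t {threads} {q(assembly)} {q(illumina_r1)} {q(illumina_r2)}"
        f" | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Polish the draft assembly with the short-read alignment
        f"pilon --genome {q(assembly)} --frags {q(bam)}"
        f" --output polished --outdir {q(out_dir + '/pilon')}",
    ]
```

The strings are meant to be pasted into a SLURM script or run through a shell, since the mapping step uses a pipe.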
Differential Expression Analysis
After the RNA reads are pre-processed, they will be mapped with STAR, which will take ~30 min. The mapped reads will be used for the structural annotation, and the output will later be used for read counting. Read counting will use featureCounts and is expected to run for ~12 h. The resulting counts will then be used for differential expression analysis with DESeq2, taking ~1 h.
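DESeq2 works on a matrix of raw integer counts with one row per gene and one column per sample, which is what the read-counting step produces. The sketch below shows how a featureCounts table could be reduced to that matrix in Python before handing it to R. It assumes the standard featureCounts layout (a leading `#` comment line, six annotation columns `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, then one column per BAM file); the function name is hypothetical.

```python
import csv

ANNOTATION_COLUMNS = 6  # Geneid, Chr, Start, End, Strand, Length


def read_featurecounts(handle):
    """Return (sample names, {gene id: {sample: count}}) from a featureCounts table."""
    # Skip the '#' comment line that records the program version and command
    rows = csv.reader((line for line in handle if not line.startswith("#")), delimiter="\t")
    header = next(rows)
    samples = header[ANNOTATION_COLUMNS:]
    counts = {}
    for row in rows:
        if not row:
            continue  # ignore blank lines
        values = row[ANNOTATION_COLUMNS:]
        counts[row[0]] = {sample: int(value) for sample, value in zip(samples, values)}
    return samples, counts
```

A typical call would be `with open("counts.txt") as fh: samples, counts = read_featurecounts(fh)`, where `counts.txt` stands for whatever output name is chosen for featureCounts.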
Bottlenecks
The primary bottleneck in this project will be software with long running times. Examples are Flye (~48 h) and Pilon (~24 h), whose results are required by later steps. While these tools are running, steps that do not depend on them will be carried out, such as pre-processing the Illumina short reads and the RNA reads, and writing the SLURM scripts for the next steps.
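To see how much the long-running steps dominate, the estimated runtimes from the workflow can be placed on a dependency graph and the earliest possible finish time of each step computed. This is a rough sketch: the runtimes are the estimates stated in this plan, while the dependency edges are an interpretation of the workflow (in particular, STAR mapping against the masked assembly is an assumption). Pre-processing is left out because no runtime is given for it.

```python
from graphlib import TopologicalSorter

# Estimated runtimes in hours, taken from the workflow description
HOURS = {
    "Flye": 48, "BWA": 1, "Pilon": 24, "RepeatMasker": 8,
    "QUAST": 0.25, "BUSCO": 0.25, "STAR": 0.5,
    "BRAKER3": 24, "eggNOG-mapper": 18, "featureCounts": 12, "DESeq2": 1,
}

# step -> steps that must finish first (an interpretation of the plan)
DEPENDS_ON = {
    "Flye": [],
    "BWA": ["Flye"],
    "Pilon": ["BWA"],
    "RepeatMasker": ["Pilon"],
    "QUAST": ["RepeatMasker"],
    "BUSCO": ["RepeatMasker"],
    "STAR": ["RepeatMasker"],           # assumption: RNA is mapped to the masked assembly
    "BRAKER3": ["RepeatMasker", "STAR"],
    "eggNOG-mapper": ["BRAKER3"],
    "featureCounts": ["STAR", "BRAKER3"],
    "DESeq2": ["featureCounts"],
}


def earliest_finish(hours, depends_on):
    """Earliest finish time of each step if independent steps run in parallel."""
    finish = {}
    for step in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on[step]), default=0.0)
        finish[step] = start + hours[step]
    return finish


if __name__ == "__main__":
    finish = earliest_finish(HOURS, DEPENDS_ON)
    print(f"serial total:  {sum(HOURS.values())} h")
    print(f"critical path: {max(finish.values())} h")
```

Under these assumptions the serial total is 137 h and the critical path is 123.5 h, of which Flye and Pilon alone account for 72 h, so little time can be saved by parallelizing and the waiting time is best spent on independent work.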
Data Management Plan
The data handled in this project come in various file types, such as .fasta and .fastq files. Many of these files will be large, and some will be compressed automatically because of their size. To ease file handling, all large files will be kept compressed where possible, since most software can decompress them easily.
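As an illustration of why keeping files compressed is rarely a problem, the sketch below counts the reads in a FASTQ file directly from its gzip-compressed form using only the Python standard library. It assumes the usual four-lines-per-record FASTQ layout; the function name and the demo file are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, gzip-compressed or plain (4 lines per record)."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


if __name__ == "__main__":
    # Tiny made-up example file, written compressed and read back without unzipping
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.fastq.gz"
        with gzip.open(demo, "wt") as fh:
            fh.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
        print(count_fastq_reads(demo))
```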
File structure
Data will be organized according to this file structure:
Genome-analysis-project/
│
├── code/                      # All SLURM scripts and other code
│
├── analysis/
│   │
│   ├── preprocessing/
│   │   ├── fastqc/
│   │   │   ├── DNA_1/
│   │   │   ├── DNA_2/
│   │   │   ├── RNA_1/
│   │   │   ├── RNA_2/
│   │   │   └── log_files/
│   │   │
│   │   └── trimmomatic/
│   │       ├── DNA_1/
│   │       ├── RNA_1/
│   │       └── log_files/
│   │
│   ├── assembly/
│   │   ├── flye/
│   │   │   └── log_files/
│   │   ├── bwa/
│   │   │   └── log_files/
│   │   ├── pilon/
│   │   │   └── log_files/
│   │   ├── busco/
│   │   │   └── log_files/
│   │   ├── quast/
│   │   │   └── log_files/
│   │   ├── repeatmasker/
│   │   │   └── log_files/
│   │   ├── braker3/
│   │   │   └── log_files/
│   │   └── eggnogmapper/
│   │       └── log_files/
│   │
│   └── de_analysis/
│       ├── star/
│       │   └── log_files/
│       ├── readcounts/
│       └── deseq2/
│           └── log_files/
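The layout above could be created in one go with a short Python script, which keeps the structure reproducible on a new machine or cluster account. This is a minimal sketch: the directory names are copied from the tree, while the function name and the choice to list only the leaf directories are illustrative.

```python
import tempfile
from pathlib import Path

# Leaf directories of the planned layout; parent directories are created implicitly
LEAF_DIRS = (
    ["code"]
    + [f"analysis/preprocessing/fastqc/{d}"
       for d in ("DNA_1", "DNA_2", "RNA_1", "RNA_2", "log_files")]
    + [f"analysis/preprocessing/trimmomatic/{d}"
       for d in ("DNA_1", "RNA_1", "log_files")]
    + [f"analysis/assembly/{tool}/log_files"
       for tool in ("flye", "bwa", "pilon", "busco", "quast",
                    "repeatmasker", "braker3", "eggnogmapper")]
    + ["analysis/de_analysis/star/log_files",
       "analysis/de_analysis/readcounts",
       "analysis/de_analysis/deseq2/log_files"]
)


def create_layout(root):
    """Create every directory of the layout under root; safe to run repeatedly."""
    created = []
    for relative in LEAF_DIRS:
        path = Path(root) / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory; in practice root would be the project folder
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_layout(Path(tmp) / "Genome-analysis-project"):
            print(path)
```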
Extra Analysis
The extra analysis will be Hi-C scaffolding of the assembly with YaHS, which is expected to take ~1 h. This step will be performed after all of the basic analyses and will use the improved assembly as input.
Time Plan
| Completion date | Task | Software | Internal Deadline | Official Deadline |
|---|---|---|---|---|
| 10-04-2026 | Project Plan | GitHub | 10-04-2026 | 10-04-2026 |
| 10-04-2026 | Pre-Processing DNA | FastQC | 10-04-2026 | 15-04-2026 |
| 10-04-13 | Pre-processing DNA | Trimmomatic | 11-04-2026 | 15-04-2026 |
| 14-04-2026 | Pre-processing DNA | FastQC | 12-04-2026 | 15-04-2026 |
| 13-04-2026 | Genome Assembly | Flye | 13-04-2026 | 15-04-2026 |
| 16-04-2026 | Mapping short reads to assembly | BWA | 16-04-2026 | 21-04-2026 |
| 17-04-2026 | Genome Polishing | Pilon | 17-04-2026 | 21-04-2026 |
| | Genome Masking | RepeatMasker | 19-04-2026 | 24-04-2026 |
| 20-04-2026 | Genome Evaluation | BUSCO | 19-04-2026 | 24-04-2026 |
| 20-04-2026 | Assembly Evaluation | QUAST | 20-04-2026 | 24-04-2026 |
| | Genome Annotation | eggNOG-mapper | 22-04-2026 | 13-05-2026 |
| | Pre-processing RNA | FastQC | 23-04-2026 | 28-04-2026 |
| | Pre-processing RNA | Trimmomatic | 24-04-2026 | 28-04-2026 |
| | Pre-processing RNA | FastQC | 25-04-2026 | 28-04-2026 |
| | Genome Annotation | BRAKER3 | 28-04-2026 | 13-05-2026 |
| | RNA mapping | STAR | 28-04-2026 | 11-05-2026 |
| | Read Counting | 02-05-2016 | 24-04-2026 | 13-05-2026 |
| | DE analysis | DESeq2 | 04-05-2026 | 19-05-2026 |
| | Hi-C scaffolding | YaHS | 06-05-2026 | 22-05-2026 (Extra analysis) |
| | Wiki | GitHub | 15-05-2026 | 22-05-2026 |
| | Project Presentation | | 23-05-2026 | 26-05-2026 |