01 Project plan - saltpinna/Genome_analysis_project GitHub Wiki
Project goals
This project aims to examine how the bacterium Enterococcus faecium survives in human serum and which genetic factors are involved. A further goal is to learn more about the virulence of the bacterium and which genes are up- or downregulated in serum.
Analyses and workflow
The main steps of the analysis are presented in the flowchart below:
- Read quality control and genome assembly: The long reads from the PacBio sequencing are assembled
Software: Canu
Input data: PacBio FASTQ files
Output data: Assembly
Estimated time: 4.5 h
- Assembly quality evaluation
Software: Quast
Input data: Assembly
Output data: Summary report with quality score for the genome assembly
Estimated time: 15 min
- Structural and functional annotation: We get information about the structure and function of different parts of the genome by annotation
Software: Prokka
Input data: Assembly
Output data: Structural and functional information
Estimated time: 5 min
- Synteny comparison: The assembled genome is compared to a closely related reference genome to evaluate the quality of the assembly
Software: Nucleotide BLAST
Input data: Nucleotide sequence of assembly & reference genome
Output data: Alignment score
Estimated time: 15 min
- Differential Expression Analysis:
A: Mapping of RNA sequence data to the assembled genome
Software: BWA
Input data: Assembly & RNA sequence data (paired end reads)
Output data: Alignment file
Estimated time: 30 min
B: Quantification of gene expression by counting the reads mapped to each gene
Software: HTSeq
Input data: Alignment file + RNA files
Output data: Counts of reads mapped to each gene in the genome
Estimated time: 2-7 h
C: Comparison of fold change
Software: DESeq2
Input data: Counts of reads mapped to each gene
Output data: Differential expression results for up- and downregulated genes in the genome
Estimated time: Variable
D: Plotting the differential expression results
Software: DESeq2
Input data: Differential expression results for up- and downregulated genes in the genome
Output data: Plot
Estimated time: Variable
- Short reads preprocessing: trimming + quality check
A: Quality check (before and after trimming)
Software: FastQC
Input data: Illumina reads (before and after trimming)
Output data: Quality score
Estimated time: 15 min
B: Trimming
Software: Trimmomatic
Input data: Illumina reads
Output data: Trimmed Illumina reads
Estimated time: 50 min
If time allows, additional analyses will be performed:
- Genome assembly with Illumina and Nanopore reads using SPAdes
- Additional assembly evaluation using MUMmerplot & BCFtools
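The main steps listed above can be sketched as command lines. The Python snippet below only builds and prints the commands; it does not run any tool. All file names, the `efaecium` prefix and the `genomeSize=3m` estimate are assumptions made for illustration, and the options shown are a minimal guess at a reasonable invocation rather than the settings actually used in this project. The DESeq2 steps are left out because they run in R.

```python
import shlex

# Hypothetical file names; adjust to the project's real paths.
ASSEMBLY = "canu_out/efaecium.contigs.fasta"  # Canu writes <prefix>.contigs.fasta
ANNOTATION = "prokka_out/efaecium.gff"        # Prokka writes <prefix>.gff
REFERENCE = "reference.fasta"                 # closely related reference genome


def build_commands(pacbio_reads, rna_r1, rna_r2):
    """Return (step, argv, stdout_file) tuples in execution order.

    stdout_file is None when the tool writes its own output files.
    """
    return [
        # Canu 2.x spelling; Canu 1.x used -pacbio-raw instead of -pacbio.
        # genomeSize=3m is an assumed rough estimate.
        ("assembly",
         ["canu", "-p", "efaecium", "-d", "canu_out",
          "genomeSize=3m", "-pacbio", pacbio_reads], None),
        ("assembly evaluation",
         ["quast.py", ASSEMBLY, "-o", "quast_out"], None),
        ("annotation",
         ["prokka", "--outdir", "prokka_out", "--prefix", "efaecium",
          ASSEMBLY], None),
        ("synteny",
         ["blastn", "-query", ASSEMBLY, "-subject", REFERENCE,
          "-outfmt", "6", "-out", "synteny.tsv"], None),
        ("index",
         ["bwa", "index", ASSEMBLY], None),
        # bwa mem and htseq-count print their results to standard output.
        ("mapping",
         ["bwa", "mem", ASSEMBLY, rna_r1, rna_r2], "rna.sam"),
        ("counting",
         ["htseq-count", "-f", "sam", "-r", "name", "-t", "CDS", "-i", "ID",
          "rna.sam", ANNOTATION], "counts.txt"),
    ]


def as_shell(argv, stdout_file=None):
    """Render one command as a shell line, with redirection if needed."""
    line = shlex.join(argv)
    return f"{line} > {shlex.quote(stdout_file)}" if stdout_file else line


if __name__ == "__main__":
    steps = build_commands("pacbio.fastq.gz", "rna_1.fastq.gz", "rna_2.fastq.gz")
    for step, argv, out in steps:
        print(f"# {step}\n{as_shell(argv, out)}")
```

Keeping the commands as data makes it easy to review the whole chain of inputs and outputs before submitting anything to a cluster. In practice each step would probably live in its own batch script with the resources it needs.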
Time Frame
Checkpoints:
19/4: Genome assembly & genome annotation finished
27/4: Comparative genomics finished (Synteny)
11/5: Differential expression analysis finished (RNA mapping + fold change comparison + plots)
All analyses should be finished running by 11/5, so that there is enough time to analyze the results and prepare for the final presentation.
Data management and storage
In this project, I will handle DNA and RNA sequence data which might need a lot of storage.
The data will be organized separately from the scripts and analysis results. The project will follow a file structure similar to the image below:
Here, Data contains all data files, Scripts contains all script files, Results contains all result files from the analyses, Intermediate contains output files from the different analysis steps, and Scratch contains temporary files that can be safely deleted or lost.
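As a minimal sketch, the five directories could be created with a few lines of Python. The flat top-level layout is an assumption, since the exact nesting depends on the figure, and `create_layout` is a hypothetical helper name.

```python
from pathlib import Path

# Directory names taken from the plan; placing them all at the top level
# is an assumption.
PROJECT_DIRS = ["Data", "Scripts", "Results", "Intermediate", "Scratch"]


def create_layout(root):
    """Create the project directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to run again
        created.append(path)
    return created


if __name__ == "__main__":
    for path in create_layout("Genome_analysis_project"):
        print(path)
```

Because `exist_ok=True` is set, running the script again leaves existing directories and their contents untouched.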