01 Project plan - saltpinna/Genome_analysis_project GitHub Wiki

Project goals

This project aims to examine how the bacterium E. faecium survives in human serum and which genetic factors are involved. A second goal is to learn more about the virulence of the bacterium and which genes are up- or downregulated during survival in serum.

Analyses and workflow

The main steps of the analysis are presented in the flowchart below:

Read quality control and genome assembly: The long reads from the PacBio sequencing are assembled

Software: Canu

Input data: PacBio FASTQ files

Output data: Assembly

Estimated time: 4.5 h
Assembly quality evaluation

Software: QUAST

Input data: Assembly

Output data: Summary report with quality metrics for the genome assembly

Estimated time: 15 min
Structural and functional annotation: Annotation provides information about the structure and function of different parts of the genome

Software: Prokka

Input data: Assembly

Output data: Structural and functional information

Estimated time: 5 min
Synteny comparison: The assembled genome is compared to a closely related reference genome to evaluate the quality of the assembly

Software: Nucleotide BLAST

Input data: Nucleotide sequence of assembly & reference genome

Output data: Alignment score

Estimated time: 15 min
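
As a rough illustration, steps 1 to 4 could be driven from a small Python script that builds one command line per tool. This is only a sketch, not the project's actual scripts: the file names, the `efaecium` prefix, the output folders, and the 3 Mb genome size estimate are placeholder assumptions, and the flags should be checked against the installed versions of Canu, QUAST, Prokka, and BLAST+.

```python
from pathlib import Path


def assembly_commands(reads, reference, out_dir="Results", prefix="efaecium",
                      genome_size="3m"):
    """Build the command lines for steps 1-4 without running them.

    Every path, the prefix, and the genome size are placeholder assumptions.
    """
    out = Path(out_dir)
    # Canu writes its contigs to <dir>/<prefix>.contigs.fasta.
    contigs = str(out / "canu" / f"{prefix}.contigs.fasta")
    return {
        # Step 1: assemble the PacBio long reads.
        # Canu 1.x spells the read option -pacbio-raw instead of -pacbio.
        "canu": ["canu", "-p", prefix, "-d", str(out / "canu"),
                 f"genomeSize={genome_size}", "-pacbio", str(reads)],
        # Step 2: evaluate the assembly, here against a reference genome.
        "quast": ["quast.py", contigs, "-r", str(reference),
                  "-o", str(out / "quast")],
        # Step 3: structural and functional annotation.
        "prokka": ["prokka", "--outdir", str(out / "prokka"),
                   "--prefix", prefix, contigs],
        # Step 4: align assembly and reference; -outfmt 6 gives a tab-separated table.
        "blastn": ["blastn", "-query", contigs, "-subject", str(reference),
                   "-outfmt", "6", "-out", str(out / "synteny_blast.tsv")],
    }


for tool, cmd in assembly_commands("pacbio_reads.fastq.gz", "reference.fasta").items():
    print(tool, "->", " ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Keeping the commands as data makes it easy to log exactly what was run.
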
Differential Expression Analysis:

A: Mapping of RNA sequence data to the assembled genome

Software: BWA

Input data: Assembly & RNA sequence data (paired-end reads)

Output data: Alignment file

Estimated time: 30 min
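
A minimal sketch of how the mapping step might be scripted is shown here. It assumes `bwa` is on the PATH; the function names, file names, and thread count are hypothetical and only serve the illustration.

```python
import subprocess


def bwa_commands(assembly, reads_1, reads_2, threads=2):
    """Return the index and alignment commands for one paired-end sample."""
    index_cmd = ["bwa", "index", assembly]
    # bwa mem writes SAM to standard output, so the caller has to redirect it.
    mem_cmd = ["bwa", "mem", "-t", str(threads), assembly, reads_1, reads_2]
    return index_cmd, mem_cmd


def run_mapping(assembly, reads_1, reads_2, sam_path, threads=2):
    """Index the assembly, map the read pair, and save the SAM output."""
    index_cmd, mem_cmd = bwa_commands(assembly, reads_1, reads_2, threads)
    subprocess.run(index_cmd, check=True)
    with open(sam_path, "w") as sam:
        subprocess.run(mem_cmd, stdout=sam, check=True)
```

The index only needs to be built once per assembly, so for several RNA samples the `bwa index` call can be moved out of the per-sample function.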

B: Measuring gene expression by counting the reads mapped to each gene

Software: HTSeq

Input data: Alignment file + annotation file (GFF)

Output data: Counts of reads mapped to each gene in the genome

Estimated time: 2-7 h
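
The count table from this step is plain text, so it is easy to load for a quick sanity check. The sketch below assumes the usual `htseq-count` output: one tab-separated `feature<TAB>count` line per gene, followed by summary lines whose names start with `__` (for example `__no_feature`). The function name and the gene IDs in the example are made up.

```python
def read_htseq_counts(lines):
    """Split htseq-count output into per-gene counts and summary counters."""
    counts = {}
    summary = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        name, value = line.split("\t")
        if name.startswith("__"):
            summary[name] = int(value)  # e.g. __no_feature, __ambiguous
        else:
            counts[name] = int(value)
    return counts, summary


example = ["gene_0001\t120\n", "gene_0002\t0\n", "__no_feature\t15\n"]
counts, summary = read_htseq_counts(example)
print(counts)
print(summary)
```

With a real file, the function can be called as `read_htseq_counts(open(path))`. A large `__no_feature` value relative to the gene counts is worth checking before moving on to the differential expression step.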

C: Comparison of fold change

Software: DESeq2

Input data: Counts of reads mapped to each gene

Output data: Differential expression of up- and downregulated genes in the genome

Estimated time: Variable
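
To make the quantity being compared concrete, the sketch below computes a naive log2 fold change between two count tables. This is not what DESeq2 does internally: DESeq2 is an R package with its own normalization and statistical testing. The code only illustrates the idea of a fold change, and the `control`/`treated` sample names, the counts-per-million scaling, and the pseudocount are assumptions made for the example.

```python
import math


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return gene -> log2(treated / control) after scaling to counts per million.

    Both arguments map gene IDs to raw read counts and must contain
    at least one read in total.
    """
    total_control = sum(control.values())
    total_treated = sum(treated.values())
    result = {}
    for gene, count in control.items():
        cpm_control = count / total_control * 1e6
        cpm_treated = treated.get(gene, 0) / total_treated * 1e6
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = math.log2((cpm_treated + pseudocount) /
                                 (cpm_control + pseudocount))
    return result


control = {"gene_a": 100, "gene_b": 100}
treated = {"gene_a": 300, "gene_b": 100}
for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```

A positive value means the gene has relatively more reads in the second sample (upregulated), a negative value fewer (downregulated).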

D: Plotting the differential expression results

Software: DESeq2

Input data: Differential expression of up- and downregulated genes in the genome

Output data: Plot

Estimated time: Variable
Short reads preprocessing: trimming + quality check

A: Quality check (before and after trimming)

Software: FastQC

Input data: Illumina reads (before and after trimming)

Output data: Quality report

Estimated time: 15 min

B: Trimming

Software: Trimmomatic

Input data: Illumina reads

Output data: Trimmed Illumina reads

Estimated time: 50 min
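
The preprocessing order (quality check, trimming, quality check again) could be sketched as follows. The `trimmomatic` wrapper command is an assumption (it is what a Conda install typically provides; otherwise the tool is started with `java -jar` on the Trimmomatic jar file). The trimming settings are common example values rather than choices made for this project, and all file names are placeholders.

```python
def preprocessing_commands(reads_1, reads_2, adapters, out_dir="Intermediate"):
    """Build FastQC and Trimmomatic commands for one paired-end sample."""
    paired_1 = f"{out_dir}/trimmed_1P.fastq.gz"
    unpaired_1 = f"{out_dir}/trimmed_1U.fastq.gz"
    paired_2 = f"{out_dir}/trimmed_2P.fastq.gz"
    unpaired_2 = f"{out_dir}/trimmed_2U.fastq.gz"

    # Quality check before trimming (the output directory must already exist).
    fastqc_raw = ["fastqc", "-o", out_dir, reads_1, reads_2]
    # Paired-end trimming: two inputs, then paired/unpaired outputs per mate,
    # then the trimming steps in the order they should be applied.
    trim = ["trimmomatic", "PE", reads_1, reads_2,
            paired_1, unpaired_1, paired_2, unpaired_2,
            f"ILLUMINACLIP:{adapters}:2:30:10",
            "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36"]
    # Quality check after trimming, on the reads that kept their mate.
    fastqc_trimmed = ["fastqc", "-o", out_dir, paired_1, paired_2]
    return [fastqc_raw, trim, fastqc_trimmed]


for cmd in preprocessing_commands("reads_1.fastq.gz", "reads_2.fastq.gz", "adapters.fa"):
    print(" ".join(cmd))
```

Comparing the two FastQC reports shows whether trimming actually improved the reads.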

If time allows, additional analyses will be performed:

Genome assembly with Illumina and Nanopore reads in SPAdes
Additional assembly evaluation using MUMmerplot & BCFtools

Time Frame

Checkpoints:

19/4: Genome assembly & genome annotation finished

27/4: Comparative genomics finished (Synteny)

11/5: Differential expression analysis finished (RNA mapping + fold change comparison + plots)

All analyses should be finished running by 11/5, so that there is enough time to analyze the results and prepare for the final presentation.

Data management and storage

In this project, I will handle DNA and RNA sequence data, which may require a lot of storage space.

The data will be organized separately from the scripts and analysis results. The project will follow a file structure similar to the image below:

Here, Data contains all data files, Scripts contains all script files, Results contains all result files from the analyses, Intermediate contains output files from the different analysis steps, and Scratch contains temporary files that can be safely deleted or lost.
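
The folder layout can be created in one go with a few lines of Python. This is a sketch that assumes the five folders sit side by side directly under the project root; the actual nesting may differ.

```python
import tempfile
from pathlib import Path

# Folder names taken from the description above.
FOLDERS = ("Data", "Scripts", "Results", "Intermediate", "Scratch")


def create_layout(root):
    """Create the project folders under root and return their paths."""
    root = Path(root)
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_layout(tmp):
        print(folder.name)
```

In the real project, `create_layout` would be called once with the path to the project directory.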