01 Project plan - saltpinna/Genome_analysis_project GitHub Wiki

Project goals

This project aims to examine how the bacterium E. faecium survives in human serum and which genetic factors are involved. A second goal is to learn more about the virulence of the bacterium and which genes are up- or down-regulated during survival in serum.

Analyses and workflow

The main steps of the analysis are presented in the flowchart below:

[Image: flowchart of the analysis workflow]

  1. Read quality control and genome assembly: The long reads from the PacBio sequencing are assembled

    Software: Canu

    Input data: PacBio FASTQ files

    Output data: Assembly

    Estimated time: 4.5 h

  2. Assembly quality evaluation

    Software: QUAST

    Input data: Assembly

    Output data: Summary report with quality metrics for the genome assembly

    Estimated time: 15 min

  3. Structural and functional annotation: We get information about the structure and function of different parts of the genome by annotation

    Software: Prokka

    Input data: Assembly

    Output data: Structural and functional information

    Estimated time: 5 min

  4. Synteny comparison: The assembled genome is compared to a closely related reference genome to evaluate the quality of the assembly

    Software: Nucleotide BLAST

    Input data: Nucleotide sequence of assembly & reference genome

    Output data: Alignment score

    Estimated time: 15 min

  5. Differential Expression Analysis:

    A: Mapping of RNA sequence data to the assembled genome

    Software: BWA

    Input data: Assembly & RNA sequence data (paired end reads)

    Output data: Alignment file

    Estimated time: 30 min

    B: Quantification of gene expression by counting the reads mapped to each gene

    Software: HTSeq

    Input data: Alignment file + annotation file (GFF)

    Output data: Counts of reads mapped to each gene

    Estimated time: 2-7 h

    C: Comparison of fold change

    Software: DESeq2

    Input data: Counts of reads mapped to each gene

    Output data: Differential expression results for up- and down-regulated genes

    Estimated time: Variable

    D: Plotting the differential expression results

    Software: DESeq2

    Input data: Differential expression results for up- and down-regulated genes

    Output data: Plot

    Estimated time: Variable

  6. Short reads preprocessing: trimming + quality check

    A: Quality check (before and after trimming)

    Software: FastQC

    Input data: Illumina reads (before and after trimming)

    Output data: Quality score

    Estimated time: 15 min

    B: Trimming

    Software: Trimmomatic

    Input data: Illumina reads

    Output data: Trimmed Illumina reads

    Estimated time: 50 min
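To make the fold change comparison in step 5C more concrete, the sketch below shows the basic arithmetic behind a log2 fold change. It is only an illustration: DESeq2 additionally normalizes for library size, models dispersion and tests for significance, none of which is done here. The gene names, the read counts, the "control" condition and the pseudocount of 1 are all hypothetical and not taken from the project data.

```python
import math


def log2_fold_change(control_count, serum_count, pseudocount=1.0):
    """Return log2(serum / control); the pseudocount avoids division by zero."""
    return math.log2((serum_count + pseudocount) / (control_count + pseudocount))


# Hypothetical read counts per gene: (control condition, human serum).
counts = {
    "gene_a": (99, 399),
    "gene_b": (199, 49),
    "gene_c": (150, 150),
}

for gene, (control, serum) in counts.items():
    lfc = log2_fold_change(control, serum)
    if lfc > 0:
        direction = "up"
    elif lfc < 0:
        direction = "down"
    else:
        direction = "unchanged"
    print(f"{gene}\t{lfc:+.2f}\t{direction}")
```

A positive value means the gene has more reads in serum than in the control, a negative value means fewer. A value of 2 corresponds to a fourfold increase.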

If time allows, additional analyses will be performed:

  • Genome assembly with Illumina and Nanopore reads in SPAdes
  • Additional assembly evaluation using MUMmerplot & BCFtools

Time Frame

Checkpoints:

19/4: Genome assembly & genome annotation finished

27/4: Comparative genomics finished (Synteny)

11/5: Differential expression analysis finished (RNA mapping + fold change comparison + plots)

All analyses should be finished running by 11/5, so that there is enough time to analyze the results and prepare for the final presentation.
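As a rough check that the runs fit within these checkpoints, the per-step estimates from the workflow list can be added up. The sketch below assumes that every step runs exactly once and one after another, and it leaves out the DESeq2 steps because their estimate is "Variable". The step labels are shorthand chosen for this example.

```python
# Estimated runtimes in minutes as (low, high), taken from the workflow list.
# The DESeq2 steps are omitted because their estimate is "Variable".
ESTIMATES_MIN = {
    "Canu assembly": (270, 270),      # 4.5 h
    "QUAST evaluation": (15, 15),
    "Prokka annotation": (5, 5),
    "BLAST synteny": (15, 15),
    "BWA mapping": (30, 30),
    "HTSeq counting": (120, 420),     # 2-7 h
    "FastQC": (15, 15),
    "Trimmomatic": (50, 50),
}


def total_runtime(estimates):
    """Sum the low and high estimates separately and return (low, high)."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


def as_hours(minutes):
    """Format a number of minutes as 'H h MM min'."""
    return f"{minutes // 60} h {minutes % 60:02d} min"


low, high = total_runtime(ESTIMATES_MIN)
print(f"Estimated compute time: {as_hours(low)} to {as_hours(high)}")
# prints "Estimated compute time: 8 h 40 min to 13 h 40 min"
```

Queue waiting time on a shared cluster and any reruns are not included, so the real wall-clock time is likely to be longer.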

Data management and storage

In this project, I will handle DNA and RNA sequence data, which may require a lot of storage.

The data will be organized separately from the scripts and analysis results. The project will follow a file structure similar to the image below:

[Image: project directory structure]

Here, Data contains all data files, Scripts contains all script files, Results contains all result files from the analyses, Intermediate contains output files from the different analysis steps, and Scratch contains temporary files that can be safely deleted or lost.
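A minimal sketch of how such a layout could be created is shown below. The five directory names come from the description above. Placing them side by side under a single project root is an assumption, and the function name `create_project_layout` is made up for this example.

```python
import tempfile
from pathlib import Path

# Directory names from the project plan. Putting them directly under one
# project root is an assumption; the actual nesting may differ.
PROJECT_DIRS = ["Data", "Scripts", "Results", "Intermediate", "Scratch"]


def create_project_layout(root):
    """Create the project directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to run more than once
        created.append(path)
    return created


# Demonstration in a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_project_layout(tmp):
        print(path.name)
```

For the real project, the function would be called once with the path to the project root instead of a temporary directory.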