Project Plan - MaryamDost/GenomeAnalysis GitHub Wiki

Project Plan

Background

Leptospirillum ferriphilum is a Gram negative, chemolithoautotrophic, and acidophilic bacteria i.e. bacteria that is found in acidic, metal-rich environments. They are obligate aerobe and gain energy only by oxidizing iron, generating ferric iron (Fe3+) using ferrous iron (Fe2+) as an electron donor. (Christel et al. 2018). Due to their ability to catalyze mineral dissolution, L. ferriphilums are used in biomining (Christel et al. 2018). Biomining is an industrial process where living organisms extract minerals from solid material. L. ferriphilum is important in gold recovery and has been identified in the bioleaching pile for the recovery of chalcopyrite coppers. In spite its importance much is unknown about this organism, thus mapping the hole genome will provide a better understanding of its physiological processes (Christel et al. 2018) and thereby improve the efficiency of biomining.

Aim

This project is about to reproduce some of the analyses in the paper by Christel et al., reanalyzing their data and reevaluating their biological conclusions. The aim of the project is to have a deep understanding of some bioinformatic methods relevant in this project and, to get familiar with bioinformatic tools and methods that are commonly used when analyzing sequencing data. One of the purposes is to become aware of the continuous development of these methods and the impact of the updates.

Methods

To different analyses will be performed. The first is to use PacBio and obtain the whole genome sequence. This will be done through de novo assembly of the genome from long reads. The fully assembled genome will then be annotated to investigate its synteny with a closely-related species. The second is transcriptomics and differential gene expression analysis using paired reads obtained by RNA-seq.

Table 1: Analysis 1, Genome Assembly (Data from WGS):

Analysis Software Running time
Genome assembly Canu ~ 11,5 h (2 cores)
Assembly evaluation Quast < 15 min (1 core)
Assembly evaluation MUMmerplot < 5 min (1 core)
Annotation Prokka < 5 min (2 cores)
Annotation eggNOGmapper ~ 1 h (HMM algorithm)
Synteny comparison blastn

Table 2: Analysis 2, Transcriptomics and Differential Gene Expression (Data from RNA-Seq):

Analysis Software Running time
Quality control FastQC
Trimming Trimmomatic ~ 15min per file, 5 files (2 Cores)
Quality control FastQC
Aligner BWA ~ 5 h (2 cores)
RNA-seq reads counting Htseq ~ 8 h
Differential Expression Deseq2 (Rlibrary)

Workflow

Following data analyses will be performed:

  • Genome assemble of PacBio reads.
  • Assembly quality assessment
  • Structural and functional annotation.
  • Synteny comparison with a closely related genome
  • Reads preprocessing: trimming + quality check (before and after)
  • Mapping and counting RNA-seq reads, and analyzing differential expression.

Figure 1 Workflow for analysis 1 (Genome Assembly) and analysis 2 (Differential Gene Expression)

Project organization

Data and code are separated. Folders or file names starts with a number, since by default they will be shown in alphanumerical order. It is easier to know in which order they were created when they are numerically organized. Data files, especially big data files, os compressed. I will be working both on my local computer and on UPPMAX depending on which program I will be running.

Figure 2 Data structure of the repository.

Timeplan

The timeframe of the project is 24/3-20/5 and the checkpoints of the different methods are noted in the table below. Table 3: Analyses checkpoints

Day Deadline
26/3 Seminar
6/4 Project planning
15/4 Genome Assembly + Genome annotation
28/4 Comparative genomics
4/5 RNA mapping

References

Christel S, Herold M, Bellenberg S, El Hajjami M, Buetti-Dinh A, Pivkin IV, Sand W, Wilmes P, Poetsch A, Dopson M. 2018. Multi-omics Reveals the Lifestyle of the Acidophilic, Mineral-Oxidizing Model Species Leptospirillum ferriphilumT. Applied and Environmental Microbiology, doi 10.1128/AEM.02091-17.