Project plan - sellwe/Genome-Analysis GitHub Wiki
Aim
Enterococcus faecium (E. faecium) is a bacterium that lives in the human gut. In hospital patients with weakened immune systems, it can cross into the bloodstream and cause blood infections. Several strains of E. faecium have also accumulated resistance to multiple antibiotics, making infections hard to treat. I am going to work with genetic data from E. faecium strain E745, which is resistant to ampicillin and vancomycin.
My goal is to identify genes that contribute to the growth of E. faecium in human serum. I will do this using the data from the study "RNA-seq and Tn-seq reveal fitness determinants of vancomycin-resistant Enterococcus faecium during growth in human serum" by Zhang et al. (2017). I will compare gene expression levels in RNA-seq data from E. faecium E745 grown in human serum versus Brain Heart Infusion (BHI), a nutrient-rich medium. Identifying which genes are important for growth in serum also points to potential targets for the development of new antibiotics.
Analyses and timeframe
The tables below show the order of the analyses, the software used, the run time and the soft “deadlines”. To answer which genes are differentially expressed, and which are important for growth, I have to:
● Have a genome. I will first pre-process the short DNA reads (Illumina) and then make two genome assemblies: PacBio and NanoPore+Illumina. I will evaluate both assemblies with QUAST and then annotate the genes in both. After that I will do comparative genomics by aligning and comparing the two assemblies using ACT or MUMmerplot, followed by a synteny comparison with a closely related species, again with either ACT or MUMmerplot. This will tell me which of the two methods produced the better genome, which will be used as the reference genome going forward.
● Compare the gene expression. I will process and trim the reads from the RNA-seq data. I will then map these to my reference genome using BWA, and sort and index the alignments using SAMtools so that the reads can be counted. Finally, I will compare expression between serum and BHI with R scripts.
The genome assembly deadline is flexible, especially since I’m doing the extra steps here as well. I will try to be done with the assembly and the annotations before 25/4. If I can manage it, I will try to finish read counting, differential expression analysis and functional enrichment analysis by 14/5. I aim to be done with all software runs at least one week before the presentations.
For batch jobs, the Uppmax code uppmax2025-3-3 is all that is needed.
DNA preprocessing
Deadline 9/4.
Uppmax code: uppmax2025-3-3_2
FastQC is run only on the Illumina (short) reads.
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
FastQC | Short reads Quality Control | *.fastq | HTML/QC reports | 10 min | - |
Trimmomatic | Short reads preprocessing. Trim adapters. (DNA trimming) | *.fastq | *.trimmed.fastq | 50 min/file | 1 |
Genome Assembly
Deadline 16/4.
Uppmax code: uppmax2025-3-3_2
Canu also trims and does quality assessment.
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
Canu | Long read assembler. For PacBio. | *.pacbio.fastq | canu_assembly.fasta | 4.5 h | 1 |
SPAdes EXTRA STEP | Short + long reads. NanoPore + Illumina | *.fastq | scaffolds.fasta | 2 h | 1 |
QUAST | Quality control for both methods. Assembly evaluation. | *.fasta | quast_report.html | 15 min | 1 |
MUMmerplot EXTRA STEP | Quality control. Align with reference and visualize. | *.fasta | PNG/PDF plots | 5 min | 1 |
Gene Annotation
Deadline: 25/4.
Uppmax code: uppmax2025-3-3_5
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
Prokka | Annotation. Assign functions to genes. Seminar? | *.fasta | gff3/fasta files | 5 min | 1 |
Comparative Genomics
Deadline: 28/4.
Uppmax code: uppmax2025-3-3_6
I will compare my assembled genomes either to each other or to a reference genome. This can also be done before Gene Annotation.
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
ACT | Align and visualize. Synteny | *.fasta | PNG/PDF plots | 5 min | 1 |
RNA Trimming
Deadline: 2/5
Uppmax code: uppmax2025-3-3_7
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
FastQC | Reads Quality Control | *.fastq | HTML/QC reports | 10 min | - |
Trimmomatic | RNA trimming | *.fastq | *.trimmed.fastq | 50 min/file | 1 |
RNA Mapping
Deadline 9/5.
Uppmax code: uppmax2025-3-3_8
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
BWA | Map RNA-seq reads to my reference genome. Paired-end reads | *.fastq + reference | *.bam | 30 min | 1 |
BWA | Map RNA-seq reads to my reference genome. Single reads | *.fastq + reference | *.bam | 15 min | 1 |
Post mapping analysis
Deadline: 9/5
Uppmax code: uppmax2025-3-3_8
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
SAMtools | Indexing and sorting | *.bam | sorted.bam + .bai | 5 min | 1 |
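
The mapping and post-mapping steps in the two tables above can be sketched as a small command builder. This is a minimal sketch under assumptions, not the project's actual script: the function name and all file names (`reference.fasta`, `serum_1.fastq`, and so on) are hypothetical, `bwa mem` is assumed as the BWA algorithm (the plan does not say which one is used), and the commands are only assembled as strings, never executed.

```python
import shlex


def mapping_commands(reference, reads, prefix, threads=1):
    """Build shell commands that map reads with BWA and sort/index with SAMtools.

    reads holds one FASTQ file (single reads) or two (paired-end).
    All file names are placeholders; nothing is executed here.
    """
    if len(reads) not in (1, 2):
        raise ValueError("expected one (single) or two (paired-end) FASTQ files")

    sorted_bam = f"{prefix}.sorted.bam"
    bwa_mem = ["bwa", "mem", "-t", str(threads), reference, *reads]
    # "-" tells samtools sort to read the alignments from standard input.
    sort = ["samtools", "sort", "-o", sorted_bam, "-"]

    return [
        shlex.join(["bwa", "index", reference]),        # index the reference once
        f"{shlex.join(bwa_mem)} | {shlex.join(sort)}",  # map, then sort by coordinate
        shlex.join(["samtools", "index", sorted_bam]),  # writes <sorted_bam>.bai
    ]


if __name__ == "__main__":
    # Hypothetical paired-end sample.
    for cmd in mapping_commands("reference.fasta",
                                ["serum_1.fastq", "serum_2.fastq"],
                                "serum"):
        print(cmd)
```

Piping `bwa mem` straight into `samtools sort` avoids writing an intermediate SAM file, which is one possible design choice; writing the SAM file first and sorting it afterwards would work just as well.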
Read Counting
Deadline: 14/5.
Uppmax code: uppmax2025-3-3_9
Software | Purpose | Input | Output | Run Time | Cores |
---|---|---|---|---|---|
HTSeq | Count mapped reads per gene. Paired-end reads | *.bam + .gff | counts.txt | 2-7 h | 1 |
HTSeq | Count mapped reads per gene. Single reads | *.bam + .gff | counts.txt | 10 min | 1 |
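
Since each sample gets its own count file, the files have to be combined into one table before the differential expression step. The sketch below shows one way this could be done, assuming the usual htseq-count output of tab-separated `feature<TAB>count` lines followed by summary rows whose names start with `__` (such as `__no_feature`). The function names, sample names and counts are made up for illustration.

```python
import csv
import io


def read_htseq_counts(handle):
    """Parse one htseq-count output into {feature: count}.

    Summary rows appended by htseq-count (names starting with '__')
    and empty or malformed lines are skipped.
    """
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2 or row[0].startswith("__"):
            continue
        counts[row[0]] = int(row[1])
    return counts


def merge_counts(per_sample):
    """Combine {sample: {feature: count}} into a header and one row per feature."""
    samples = sorted(per_sample)
    features = sorted(set().union(*(per_sample[s] for s in samples)))
    header = ["feature"] + samples
    # A feature missing from a sample's file is treated as a zero count.
    rows = [[f] + [per_sample[s].get(f, 0) for s in samples] for f in features]
    return header, rows


if __name__ == "__main__":
    # Made-up counts for two hypothetical samples, not real data.
    serum = io.StringIO("geneA\t10\ngeneB\t0\n__no_feature\t5\n")
    bhi = io.StringIO("geneA\t3\ngeneB\t7\n__no_feature\t2\n")
    header, rows = merge_counts({
        "serum_1": read_htseq_counts(serum),
        "bhi_1": read_htseq_counts(bhi),
    })
    print("\t".join(header))
    for row in rows:
        print("\t".join(map(str, row)))
```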
Differential Expression Analysis
Deadline: 20/5.
Uppmax code: uppmax2025-3-3_11
Software | Purpose | Input | Output |
---|---|---|---|
DESeq2 (R) | Find significant expression diffs. Write script | counts.txt | DE_results.csv |
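
To make concrete what an "expression difference" between serum and BHI means, here is a naive sanity check on a count table. It is explicitly not DESeq2 and not a substitute for it: DESeq2 is an R package with its own normalization and statistical model, whereas this sketch only scales counts to counts per million and takes a log2 ratio with a pseudocount. The function names and numbers are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale the raw counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: value * 1_000_000 / total for gene, value in counts.items()}


def log2_fold_change(serum, bhi, pseudocount=1.0):
    """Naive per-gene log2(serum / BHI) on CPM values.

    Positive values mean higher expression in serum. The pseudocount
    avoids division by zero for genes without reads in one condition.
    """
    serum_cpm = counts_per_million(serum)
    bhi_cpm = counts_per_million(bhi)
    return {
        gene: math.log2((serum_cpm[gene] + pseudocount)
                        / (bhi_cpm[gene] + pseudocount))
        for gene in serum_cpm
        if gene in bhi_cpm
    }


if __name__ == "__main__":
    # Made-up counts, not real data.
    changes = log2_fold_change({"geneA": 300, "geneB": 100},
                               {"geneA": 100, "geneB": 300})
    for gene, value in sorted(changes.items()):
        print(f"{gene}\t{value:.2f}")
```

A check like this gives no p-values and ignores replicates, so it can only flag obvious problems in the count table before the actual DESeq2 script is run.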
Wiki
Deadline: 23/5.
Uppmax code: uppmax2025-3-3_12
Presentation
28/5
Data
I will be working with both genomics data and transcriptomics data. For the genome assembly I will be working with both short-read and long-read data: Illumina short reads, and PacBio and NanoPore long reads.
SRA Links for BioProject (Select 362318) - SRA - NCBI
The metadata for the transcriptomic data is saved in an Excel sheet for now. The data on Uppmax is accessible from the command line at /proj/uppmax2025-3-3/Genome_Analysis/1_Zhang_2017, which contains both the genomics data and the transcriptomics data. I will keep the large raw data in the remote repository. In my local repository I will keep 3 main folders:
● Analyses: Subfolders for each main step (preprocessing, genome_assembly, structural_annotation and so forth)
● Code: .sh files
● Data: Subfolders metadata, raw_data and trimmed_data
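
The folder layout above could be created with a short helper like the following. This is only a sketch: the function name is hypothetical, and since the plan lists just three of the Analyses subfolders ("and so forth"), only those three are created here.

```python
import tempfile
from pathlib import Path

# Folder names taken from the plan; "Code" holds the .sh files.
LAYOUT = {
    "Analyses": ["preprocessing", "genome_assembly", "structural_annotation"],
    "Code": [],
    "Data": ["metadata", "raw_data", "trimmed_data"],
}


def create_layout(root):
    """Create the main folders and their subfolders under root; return the paths."""
    root = Path(root)
    created = []
    for main, subfolders in LAYOUT.items():
        for path in [root / main] + [root / main / sub for sub in subfolders]:
            path.mkdir(parents=True, exist_ok=True)  # safe to run again
            created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrated in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_layout(tmp):
            print(path.relative_to(tmp))
```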