Project plan - sellwe/Genome-Analysis GitHub Wiki

Aim

Enterococcus faecium (E. faecium) is a bacterium found in the human gut. In hospital patients with weakened immune systems, it can cross into the bloodstream and cause blood infections. Several strains of E. faecium have also acquired resistance to multiple antibiotics, making infections hard to treat. We are going to work with genetic data from E. faecium strain E745, which is resistant to ampicillin and vancomycin.

Our goal is to identify genes that contribute to the growth of E. faecium in human serum. I will do this using the data from the study "RNA-seq and Tn-seq reveal fitness determinants of vancomycin-resistant Enterococcus faecium during growth in human serum" by Zhang et al. (2017). I will compare gene expression levels from RNA-seq data of E. faecium E745 grown in human serum versus Brain Heart Infusion (BHI), a nutrient-rich medium. Identifying which genes are important for growth in serum also points to potential targets for antibiotic development.

Analyses and timeframe

The tables below show the order of the analyses, the software used, the expected run time and the soft “deadlines”. To answer which genes are differentially expressed and which are important for growth, I have to:

● Have a genome. I will first do DNA pre-processing of the short reads (Illumina), followed by two genome assemblies: PacBio and NanoPore+Illumina. I will evaluate both assemblies with Quast and then annotate the genes in both. Then I will do comparative genomics by aligning and comparing the two genomes using Act or MUMmerplot, followed by a synteny comparison with a closely related species, also with either Act or MUMmerplot. This will tell me which of the two methods produced the better genome, which will be used as the reference genome going forward.

● Compare the gene expression. I will process and trim the reads from the RNA-seq data. I will then map them to my reference genome using BWA, and sort and index the alignments with SAMtools so that the reads can be counted. Finally, I will compare expression between serum and BHI with R scripts.

The genome assembly deadline is flexible, especially since I’m doing the extra steps here as well. I will try to be done with it and the annotations before 25/4. If I can manage, I will try to be done with Read Count, Differential Expression Analysis and Functional Enrichment Analysis by 14/5. I will aim to be done with all software runs at least one week before the presentations.

For batch jobs, the Uppmax project code uppmax2025-3-3 is all that is needed.
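As a rough illustration of how the project code ends up in a batch job, the sketch below builds the text of a SLURM script in Python. Only the project code comes from this plan. The partition name `core`, the job name, the time limit and the placeholder command are assumptions that should be replaced with whatever the cluster and the step at hand require.

```python
def sbatch_script(job_name, command, project="uppmax2025-3-3",
                  cores=1, time_limit="01:00:00"):
    """Return the text of a SLURM batch script for one pipeline step."""
    lines = [
        "#!/bin/bash -l",
        f"#SBATCH -A {project}",     # project (account) the job is charged to
        "#SBATCH -p core",           # partition name: an assumption
        f"#SBATCH -n {cores}",       # number of cores
        f"#SBATCH -t {time_limit}",  # wall-clock limit, hh:mm:ss
        f"#SBATCH -J {job_name}",    # job name shown in the queue
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the name and the command are placeholders.
script = sbatch_script("fastqc_dna", "echo 'replace with the real command'",
                       time_limit="00:30:00")
print(script)
```

Writing the returned text to a `.sh` file in the Code folder and submitting it with `sbatch` would be one way to use it.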

DNA preprocessing

Deadline: 9/4.

Uppmax code: uppmax2025-3-3_2

FastQC is run only on the Illumina short reads.

Software	Purpose	Input	Output	Run Time	Cores
FastQC	Short reads Quality Control	*.fastq	HTML/QC reports	10 min	-
Trimmomatic	Short reads preprocessing. Trim adapters. (DNA trimming)	*.fastq	*.trimmed.fastq	50 min/file	1
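The trimming step could look roughly like the sketch below, which only assembles a paired-end Trimmomatic argument list and does not run anything. The executable name `trimmomatic`, the file names and the adapter file are assumptions. The trimming steps and thresholds are the generic example values from the Trimmomatic manual, not settings chosen for this data.

```python
def trimmomatic_pe_args(r1, r2, prefix, adapters, threads=1):
    """Build an argument list for a paired-end Trimmomatic run."""
    return [
        "trimmomatic", "PE", "-threads", str(threads), "-phred33",
        r1, r2,
        # paired and unpaired outputs for the forward and reverse reads
        f"{prefix}_1P.trimmed.fastq", f"{prefix}_1U.trimmed.fastq",
        f"{prefix}_2P.trimmed.fastq", f"{prefix}_2U.trimmed.fastq",
        f"ILLUMINACLIP:{adapters}:2:30:10",  # adapter clipping
        "LEADING:3", "TRAILING:3",           # trim low-quality ends
        "SLIDINGWINDOW:4:15",                # sliding-window quality trim
        "MINLEN:36",                         # drop reads that become too short
    ]


# Hypothetical file names, for illustration only.
args = trimmomatic_pe_args("sample_1.fastq", "sample_2.fastq",
                           "sample", "adapters.fa")
print(" ".join(args))
```

The list can be passed to `subprocess.run(args, check=True)` inside a batch job once the real paths are filled in.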

Genome Assembly

Deadline: 16/4.

Uppmax code: uppmax2025-3-3_2

Canu also trims and does quality assessment.

Software	Purpose	Input	Output	Run Time	Cores
Canu	Long read assembler. For PacBio.	*.pacbio.fastq	canu_assembly.fasta	4.5 h	1
Spades EXTRA STEP	Short + long reads. NanoPore + Illumina	*.fastq	scaffolds.fasta	2 h	1
Quast	Quality control for both methods. Assembly evaluation.	*.fasta	quast_report.html	15 min	1
MUMmerplot EXTRA STEP	Quality control. Align with reference and visualize.	*.fasta	PNG/PDF plots	5 min	1

Gene Annotation

Deadline: 25/4.

Uppmax code: uppmax2025-3-3_5

Software	Purpose	Input	Output	Run Time	Cores
Prokka	Annotation. Assign functions to genes. Seminar?	*.fasta	gff3/fasta files	5 min	1
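A quick sanity check on the two annotations is to count how many features of each type the GFF3 files contain. The sketch below assumes standard GFF3 with nine tab-separated columns and an optional `##FASTA` section at the end of the file. The contig names, source label and IDs in the example are made up.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count GFF3 features by type (third column)."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break                      # everything after this is sequence
        if not line or line.startswith("#"):
            continue                   # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Hypothetical miniature annotation, for illustration only.
example_gff = "\n".join([
    "##gff-version 3",
    "\t".join(["contig_1", "example_source", "CDS", "1", "300", ".", "+", "0", "ID=gene_0001"]),
    "\t".join(["contig_1", "example_source", "CDS", "400", "900", ".", "-", "0", "ID=gene_0002"]),
    "\t".join(["contig_1", "example_source", "tRNA", "1000", "1075", ".", "+", ".", "ID=gene_0003"]),
    "##FASTA",
    ">contig_1",
    "ACGT",
])
feature_counts = count_feature_types(example_gff)
print(dict(feature_counts))
```

Large differences in the number of CDS features between the PacBio and the NanoPore+Illumina annotation would be worth a closer look before choosing the reference.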

Comparative Genomics

Deadline: 28/4.

Uppmax code: uppmax2025-3-3_6

I will compare my assembled genomes either to each other or to a reference genome. This can also be done before gene annotation.

Software	Purpose	Input	Output	Run Time	Cores
Act	Align and visualize. Synteny	*.fasta	PNG/PDF plots	5 min	1

RNA Trimming

Deadline: 2/5

Uppmax code: uppmax2025-3-3_7

Software	Purpose	Input	Output	Run Time	Cores
FastQC	Reads Quality Control	*.fastq	HTML/QC reports	10 min	-
Trimmomatic	RNA trimming	*.fastq	*.trimmed.fastq	50 min/file	1

RNA Mapping

Deadline: 9/5.

Uppmax code: uppmax2025-3-3_8

Software	Purpose	Input	Output	Run Time	Cores
BWA	Map RNA-seq to my ref genome. Paired-end reads	*.fastq + reference	*.bam	30 min	1
BWA	Map RNA-seq to my ref genome. Single reads	*.fastq + reference	*.bam	15 min	1

Post mapping analysis

Deadline: 9/5

Uppmax code: uppmax2025-3-3_8

Software	Purpose	Input	Output	Run Time	Cores
SAMtools	Indexing and sorting	*.bam	sorted.bam + .bai	5 min	1
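Mapping, sorting and indexing can be chained so that no unsorted intermediate file is written. The sketch below only builds the command strings for one sample and does not run them. The file names and sample name are hypothetical, and passing one read file (single reads) or two (paired-end) is the only difference between the two cases in the tables above.

```python
import shlex


def mapping_commands(reference, reads, sample, threads=1):
    """Return shell commands that index the reference, map the reads with
    BWA-MEM, and sort and index the resulting alignments with SAMtools."""
    ref = shlex.quote(reference)
    read_args = " ".join(shlex.quote(r) for r in reads)
    bam = shlex.quote(f"{sample}.sorted.bam")
    return [
        f"bwa index {ref}",
        # bwa mem writes SAM to stdout; samtools sort reads it from stdin ("-")
        f"bwa mem -t {threads} {ref} {read_args} | "
        f"samtools sort -@ {threads} -o {bam} -",
        f"samtools index {bam}",
    ]


# Hypothetical paired-end sample.
commands = mapping_commands("reference.fasta",
                            ["serum_1_R1.trimmed.fastq", "serum_1_R2.trimmed.fastq"],
                            "serum_1")
for command in commands:
    print(command)
```

The reference only needs to be indexed once, so in practice the first command would be run separately and the other two repeated per sample.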

Read Counting

Deadline: 14/5.

Uppmax code: uppmax2025-3-3_9

Software	Purpose	Input	Output	Run Time	Cores
HTSeq	Count reads to the mapping. Paired-end reads	*.bam + .gff	counts.txt	2-7 h	1
HTSeq	Count reads to the mapping. Single reads	*.bam + .gff	counts.txt	10 min	1
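Before the counts go into the differential expression step, the summary rows that htseq-count appends (names starting with `__`, such as `__no_feature`) have to be separated from the per-gene rows. The sketch below does this for the two-column, tab-separated output. The gene IDs and numbers are invented for the example.

```python
def parse_htseq_counts(text):
    """Split htseq-count output into per-gene counts and summary counters."""
    genes, summary = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.rsplit("\t", 1)
        # htseq-count prefixes its summary rows with a double underscore
        target = summary if name.startswith("__") else genes
        target[name] = int(value)
    return genes, summary


# Invented example content of a counts.txt file.
example_counts = "gene_0001\t120\ngene_0002\t0\n__no_feature\t15\n__ambiguous\t3\n"
gene_counts, summary_counts = parse_htseq_counts(example_counts)
print(gene_counts)
print(summary_counts)
```

A large `__no_feature` share relative to the gene counts can hint that the annotation and the mapping do not match, for example because of different contig names.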

Differential Expression Analysis

Deadline: 20/5.

Uppmax code: uppmax2025-3-3_11

Software	Purpose	Input	Output
DESeq2 (R)	Find significant expression diffs. Write script	counts.txt	DE_results.csv
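The actual analysis will be an R script using DESeq2. To make clear what the normalization in that step is for, the sketch below illustrates median-of-ratios size factors and a simple log2 fold change in Python. It is an explanation of the idea under simplifying assumptions (no dispersion estimation, no statistical test, an arbitrary pseudocount of 1), not a substitute for DESeq2, and the counts are made-up toy numbers.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps gene -> list of raw counts, one per sample. Only genes with
    non-zero counts in every sample are used, so at least one such gene must exist.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if all(c > 0 for c in row):
            log_geo_mean = sum(math.log(c) for c in row) / n_samples
            for i, c in enumerate(row):
                log_ratios[i].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def log2_fold_change(row, factors, group_a, group_b, pseudocount=1.0):
    """log2(mean of group_b / mean of group_a) on size-factor-normalized counts."""
    norm = [c / f for c, f in zip(row, factors)]
    mean_a = sum(norm[i] for i in group_a) / len(group_a)
    mean_b = sum(norm[i] for i in group_b) / len(group_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


# Toy data: sample 0 plays the role of BHI, sample 1 of serum. Sample 1 was
# "sequenced" twice as deeply, so every gene has doubled raw counts.
toy = {"gene_a": [10, 20], "gene_b": [100, 200], "gene_c": [5, 10]}
factors = size_factors(toy)
lfc = log2_fold_change(toy["gene_a"], factors, group_a=[0], group_b=[1])
print(factors, lfc)
```

In the toy data the doubled counts come only from sequencing depth, so after normalization the fold change is zero. That is the reason raw counts should not be compared directly between serum and BHI.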

Wiki

Deadline: 23/5.

Uppmax code: uppmax2025-3-3_12

Presentation

28/5

Data

I will be working with both genomics data and transcriptomics data. For the genome assembly I will be working with both short-read and long-read data. I will use Illumina short-read data, and both PacBio and NanoPore long-read data.

SRA Links for BioProject (Select 362318) - SRA - NCBI

The metadata for the transcriptomic data is saved in an Excel sheet for now. The data on Uppmax are accessible from the command line at /proj/uppmax2025-3-3/Genome_Analysis/1_Zhang_2017, which contains both the genomics data and the transcriptomics data. I will keep the large raw data in the remote repository. In my local repository I will keep 3 main folders:

● Analyses: Subfolders for each main step. (preprocessing, genome_assembly, structural_annotation and so forth)

● Code: .sh files

● Data: Subfolders metadata, raw_data and trimmed_data
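The folder layout above can be set up in one go. The sketch below uses only the folder names listed in this plan. The "and so forth" subfolders of Analyses are not filled in and would need to be added to the dictionary by hand.

```python
from pathlib import Path

# Folder names taken from the list above; further Analyses subfolders
# can be appended as the project moves on.
LAYOUT = {
    "Analyses": ["preprocessing", "genome_assembly", "structural_annotation"],
    "Code": [],
    "Data": ["metadata", "raw_data", "trimmed_data"],
}


def planned_dirs():
    """Return all folders of the layout as relative paths."""
    paths = []
    for top, subfolders in LAYOUT.items():
        paths.append(top)
        paths.extend(f"{top}/{sub}" for sub in subfolders)
    return paths


def create_layout(root):
    """Create the folders under root; existing folders are left untouched."""
    for relative in planned_dirs():
        Path(root, relative).mkdir(parents=True, exist_ok=True)


print(planned_dirs())
```

Calling `create_layout(".")` from the top of the local repository would create any folders that are still missing.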