Project plan - sellwe/Genome-Analysis GitHub Wiki

Aim

Enterococcus faecium (E. faecium) is a bacterium that lives in the human gut. In hospital patients with weakened immune systems, it can cross into the bloodstream and cause blood infections. Several strains of E. faecium have also accumulated resistance to multiple antibiotics, making infections hard to treat. We are going to work with genetic data from E. faecium strain E745, which is resistant to ampicillin and vancomycin.

Our goal is to identify genes that contribute to the growth of E. faecium in human serum. I will do this using the data from the study "RNA-seq and Tn-seq reveal fitness determinants of vancomycin-resistant Enterococcus faecium during growth in human serum" by Zhang et al. (2017). I will compare gene expression levels in RNA-seq data from E. faecium E745 grown in human serum versus Brain Heart Infusion (BHI), a nutrient-rich medium. Identifying which genes are important for growth in serum also points to potential targets for antibiotic development.

Analyses and timeframe

The tables show the order of the analyses, the software used, the run times and the soft “deadlines”. To answer which genes are differentially expressed and which are important for growth, I have to:

● Have a genome. I will first do DNA pre-processing of the short reads (Illumina), followed by two genome assemblies: PacBio and NanoPore+Illumina. I will evaluate both assemblies with Quast and then annotate the genes in both. Then I will do comparative genomics by aligning and comparing the two genomes using Act or MUMmerplot, followed by a synteny comparison with a neighbouring species, again with either Act or MUMmerplot. This will tell me which of the two methods produced the better genome, which will be used as the reference genome going forward.

● Compare the gene expression. I will process and trim the reads from the RNA-seq data. I will then map these to my reference genome using BWA, and sort and index the alignments using SAMtools to be able to do the read counting. Finally, I will compare the differential expression between serum and BHI with R scripts.

The genome assembly deadline is flexible, especially since I’m doing the extra steps here as well. I will try to finish the assembly and the annotation before 25/4. If I can manage, I will try to finish Read Counting, Differential Expression Analysis and Functional Enrichment Analysis by 14/5. I aim to finish all software runs at least one week before the presentations.

For batch jobs, the Uppmax code uppmax2025-3-3 is all that is needed.

DNA preprocessing

Deadline 9/4.

Uppmax code: uppmax2025-3-3_2

FastQC is run only on the Illumina short reads.

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| FastQC | Short reads Quality Control | *.fastq | HTML/QC reports | 10 min | - |
| Trimmomatic | Short reads preprocessing. Trim adapters. (DNA trimming) | *.fastq | *.trimmed.fastq | 50 min/file | 1 |
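The trimming step can be scripted so that the same settings are applied to every file. The Python sketch below only builds a Trimmomatic paired-end command line and does not run it. The jar path, file names, adapter file and trimming thresholds are assumptions for illustration (the thresholds are commonly quoted example values), not settings decided in this plan.

```python
from pathlib import Path


def trimmomatic_pe_command(r1, r2, out_dir, jar="trimmomatic.jar",
                           adapters="TruSeq3-PE.fa", threads=1):
    """Build (but do not run) a Trimmomatic paired-end command.

    All paths and thresholds here are illustrative assumptions.
    Input names are assumed to end in ".fastq".
    """
    out = Path(out_dir)
    stem1, stem2 = Path(r1).stem, Path(r2).stem
    return [
        "java", "-jar", jar, "PE",
        "-threads", str(threads),
        str(r1), str(r2),
        str(out / f"{stem1}.trimmed.fastq"),    # forward reads, still paired
        str(out / f"{stem1}.unpaired.fastq"),   # forward reads, mate dropped
        str(out / f"{stem2}.trimmed.fastq"),    # reverse reads, still paired
        str(out / f"{stem2}.unpaired.fastq"),   # reverse reads, mate dropped
        f"ILLUMINACLIP:{adapters}:2:30:10",     # adapter clipping
        "LEADING:3", "TRAILING:3",              # trim low-quality ends
        "SLIDINGWINDOW:4:15", "MINLEN:36",      # window trimming, length filter
    ]


cmd = trimmomatic_pe_command("sample_1.fastq", "sample_2.fastq", "trimmed_data")
print(" ".join(cmd))
```

Returning the command as a list makes it easy to inspect in a dry run before passing it to a batch script.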

Genome Assembly

Deadline 16/4.

Uppmax code: uppmax2025-3-3_2

Canu also trims and does quality assessment.

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| Canu | Long read assembler. For PacBio. | *.pacbio.fastq | canu_assembly.fasta | 4.5 h | 1 |
| Spades | EXTRA STEP. Short + long reads. NanoPore + Illumina | *.fastq | scaffolds.fasta | 2 h | 1 |
| Quast | Quality control for both methods. Assembly evaluation. | *.fasta | quast_report.html | 15 min | 1 |
| MUMmerplot | EXTRA STEP. Quality control. Align with ref. and visualize. | *.fasta | PNG/PDF plots | 5 min | 1 |
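One of the metrics Quast reports for an assembly is N50: the contig length such that contigs of that length or longer hold at least half of the assembled bases. As a sanity check when reading the two reports, it can be sketched in a few lines of Python. The contig lengths below are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(n50([100, 200, 300, 400, 500]))  # 400
```

A higher N50 for the same total length means the assembly is less fragmented, which is one of the things to weigh when choosing between the two assemblies.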

Gene Annotation

Deadline: 25/4.

Uppmax code: uppmax2025-3-3_5

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| Prokka | Annotation. Assign functions to genes. Seminar? | *.fasta | gff3/fasta files | 5 min | 1 |

Comparative Genomics

Deadline: 28/4.

Uppmax code: uppmax2025-3-3_6

Either compare my assembled genomes to each other, or to a reference genome. Can also be done before Gene Annotation.

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| Act | Align and visualize. Synteny | *.fasta | PNG/PDF plots | 5 min | 1 |

RNA Trimming

Deadline: 2/5

Uppmax code: uppmax2025-3-3_7

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| FastQC | Reads Quality Control | *.fastq | HTML/QC reports | 10 min | - |
| Trimmomatic | RNA trimming | *.fastq | *.trimmed.fastq | 50 min/file | 1 |

RNA Mapping

Deadline 9/5.

Uppmax code: uppmax2025-3-3_8

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| BWA | Map RNA-seq to my ref genome. Paired-end reads | *.fastq + reference | *.bam | 30 min | 1 |
| BWA | Map RNA-seq to my ref genome. Single reads | *.fastq + reference | *.bam | 15 min | 1 |

Post mapping analysis

Deadline: 9/5

Uppmax code: uppmax2025-3-3_8

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| SAMtools | Indexing and sorting | *.bam | sorted.bam + .bai | 5 min | 1 |
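Mapping and post-mapping chain into each other: the reference is indexed once, each sample is aligned with BWA, and the alignments are sorted and indexed with SAMtools. The Python sketch below builds the shell commands for one sample without running them. The function name and file names are hypothetical, and the thread count simply mirrors the single core in the tables.

```python
import shlex


def mapping_commands(reference, reads, sample, threads=1):
    """Return shell commands that map one sample and sort/index the result.

    `reads` holds one FASTQ path (single reads) or two (paired-end reads).
    """
    if len(reads) not in (1, 2):
        raise ValueError("expected one or two FASTQ files")
    sorted_bam = f"{sample}.sorted.bam"
    align = shlex.join(["bwa", "mem", "-t", str(threads), reference, *reads])
    # "-" tells samtools sort to read the alignments from standard input.
    sort = shlex.join(["samtools", "sort", "-o", sorted_bam, "-"])
    return [
        shlex.join(["bwa", "index", reference]),   # needed once per reference
        f"{align} | {sort}",                       # align, then sort to BAM
        shlex.join(["samtools", "index", sorted_bam]),
    ]


for command in mapping_commands("reference.fasta",
                                ["serum_1.fastq", "serum_2.fastq"], "serum"):
    print(command)
```

Piping the aligner straight into the sorter avoids writing an unsorted intermediate file for every sample.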

Read Counting

Deadline: 14/5.

Uppmax code: uppmax2025-3-3_9

| Software | Purpose | Input | Output | Run Time | Cores |
|---|---|---|---|---|---|
| HTSeq | Count reads to the mapping. Paired-end reads | *.bam + .gff | counts.txt | 2-7 h | 1 |
| HTSeq | Count reads to the mapping. Single reads | *.bam + .gff | counts.txt | 10 min | 1 |
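The .gff input here comes from the Prokka annotation. Prokka's GFF3 output appends the genome sequence after a `##FASTA` line, and the read counter may need the annotation-only part. The sketch below assumes that is the case and keeps everything before that line. The function name is hypothetical and the example records are made up.

```python
def strip_fasta_section(gff_lines):
    """Yield GFF lines up to, but not including, the ##FASTA directive."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        yield line


example = [
    "##gff-version 3\n",
    "contig_1\tProdigal\tCDS\t1\t9\t.\t+\t0\tID=gene_1\n",
    "##FASTA\n",
    ">contig_1\n",
    "ATGAAATAA\n",
]
print("".join(strip_fasta_section(example)), end="")
```

Because the function works on any iterable of lines, it can be fed an open file handle and its output written straight to a new annotation-only file.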

Differential Expression Analysis

Deadline: 20/5.

Uppmax code: uppmax2025-3-3_11

| Software | Purpose | Input | Output |
|---|---|---|---|
| DESeq2 (R) | Find significant expression diffs. Write script | counts.txt | DE_results.csv |
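DESeq2 can work from a single count matrix with genes as rows and samples as columns, whereas htseq-count writes one two-column file per sample, ending in summary rows whose names start with `__` (for example `__no_feature`). The Python sketch below shows one way to merge the per-sample counts into such a matrix before the R script takes over. The sample names and counts are made up, and the function name is hypothetical.

```python
def merge_counts(per_sample):
    """Merge htseq-count style lines into a header and gene-by-sample rows.

    `per_sample` maps a sample name to an iterable of "gene<TAB>count" lines.
    """
    samples = sorted(per_sample)
    table = {}
    for sample in samples:
        for line in per_sample[sample]:
            line = line.strip()
            if not line:
                continue
            gene, count = line.split("\t")
            if gene.startswith("__"):      # skip htseq-count summary rows
                continue
            table.setdefault(gene, {})[sample] = int(count)
    header = ["gene"] + samples
    rows = [[gene] + [table[gene].get(s, 0) for s in samples]
            for gene in sorted(table)]
    return header, rows


header, rows = merge_counts({
    "serum_1": ["gene_1\t40", "gene_2\t3", "__no_feature\t7"],
    "BHI_1": ["gene_1\t10", "gene_2\t0", "__no_feature\t5"],
})
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

The merge keeps raw integer counts, since normalisation is part of what DESeq2 does itself.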

Wiki

Deadline: 23/5.

Uppmax code: uppmax2025-3-3_12

Presentation

28/5

Data

I will be working with both genomics data and transcriptomics data. For the genome assembly I will be working with both short-read and long-read data: Illumina short reads, and both PacBio and NanoPore long reads.

SRA Links for BioProject (Select 362318) - SRA - NCBI

The metadata for the transcriptomic data is saved in an Excel sheet for now. On Uppmax, the data is accessible from the command line at /proj/uppmax2025-3-3/Genome_Analysis/1_Zhang_2017, which contains both the genomics data and the transcriptomics data. I will keep the large raw data in the remote repository. In my local repository I will keep 3 main folders:

● Analyses: Subfolders for each main step. (preprocessing, genome_assembly, structural_annotation and so forth)

● Code: .sh files

● Data: Subfolders metadata, raw_data and trimmed_data
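The folder layout above can be created in one go. The Python sketch below is a minimal example that lists only the analysis subfolders named in this plan, so further steps would need to be added to `LAYOUT`. The function name is hypothetical, and the demo writes into a temporary directory rather than a real repository.

```python
import tempfile
from pathlib import Path

# Top-level folders and their subfolders, as listed in the plan.
LAYOUT = {
    "Analyses": ["preprocessing", "genome_assembly", "structural_annotation"],
    "Code": [],
    "Data": ["metadata", "raw_data", "trimmed_data"],
}


def create_layout(root):
    """Create the project folders under `root` and return their relative paths."""
    root = Path(root)
    created = []
    for top, subfolders in LAYOUT.items():
        for folder in [root / top] + [root / top / sub for sub in subfolders]:
            folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(folder.relative_to(root).as_posix())
    return created


with tempfile.TemporaryDirectory() as tmp:
    for name in create_layout(tmp):
        print(name)
```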