Genome Analysis Project Plan

Paper: Metabolic Roles of Uncultivated Bacterioplankton Lineages in the Northern Gulf of Mexico “Dead Zone”
Thrash, J. Cameron et al. (2017)

Purpose of Study

This study relies on data and methods from the 2017 paper listed above. Files and code are located in the 'master' branch.

Characterizing the factors that contribute to the seasonal hypoxic "dead zone" in the northern Gulf of Mexico (nGOM) is integral to proposing ways to manage it. These dead zones form when bacterioplankton and other small organisms consume dissolved oxygen (DO) while metabolizing the large amounts of dead algae left after an algal bloom. It is therefore important to characterize these (often unculturable) microorganisms and the metabolic pathways through which they deplete DO. To do this, I will perform a metagenomic analysis of the microorganisms from two of the six collection sites used in the 2017 paper: I will assemble and bin genomes to identify distinct species, functionally annotate the sequences to identify genes linked to metabolism, and use transcriptome analyses to predict metabolic pathways.

Project Goals

Assemble metagenome
Phylogenetic placement of bins
Functional annotation
Expression analysis

Project Timeline

Projected Date	Analysis	Prep Time	Run Time	Total Time	Completed?
2023-03-31	Pre-Processing	1h	1h	2h	Yes
2023-04-07	Assembly + QC	3h	7h	10h	Yes
2023-04-14	Binning + QC	2h	2.5h	4.5h	Yes
2023-04-14	Annotation	1.5h	1h	2.5h	Yes
2023-04-21	Mapping	2h	6h	8h	Yes
2023-04-28	Phylogeny	2h	6h	8h	Yes
2023-05-05	Expression Analysis	2h	1h	3h	Yes
2023-05-12	Extra Analysis (TBD)	?h	?h	?h	No
2023-05-15	Extra Analysis (TBD)	?h	?h	?h	No

Bolded: predicted bottleneck

Data Overview

Collection site background: Precise location of sites, DO levels, and nutrient content will be included when I access site data.

Accession No. (DNA, RNA)	Site	Lat (N)	Lon (E)	DO (umol/kg)	Chla (ug/L)*	PO4 (uM)	Si(OH)4 (uM)	NO3 (uM)	NO2 (uM)	NH4 (uM)
SAMN05791315, SAMN05791321	D1, oxic	28.9833	-90.8333	131.71	0.93	0.58	9.09	3.97	0.84	5.76
SAMN05791317, SAMN05791323	D3, hypoxic	28.7167	-90.8333	12.79	0.66	1.9	30.46	13.88	0.86	2.39

*Chlorophyll A concentration is used as a measure of phytoplankton in the water

Sample collection: Saltwater samples were collected from two sites (one oxic, one hypoxic) in the northern Gulf of Mexico. DNA and RNA were extracted from these samples and purified. DNA quantity was determined, and RNA quality was measured. RNA with an RNA Integrity Number (16S/23S rRNA) below 8 was discarded, and ribosomal RNA was removed, leaving only mRNA. This mRNA was then reverse transcribed to cDNA prior to sequencing.
Illumina HiSeq 2000 Whole-Genome Sequencing: Genomic DNA and RNA-seq cDNA were sequenced separately, producing 100 bp paired-end reads from 180 bp fragments.
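
With 100 bp reads from each end of a 180 bp fragment, the two mates of a pair should overlap in the middle of the fragment. The arithmetic can be sketched as follows, assuming every fragment is exactly 180 bp (a real library has a spread of fragment sizes) and is at least as long as one read. The function name is only illustrative.

```python
def expected_overlap(read_length: int, fragment_length: int) -> int:
    """Bases covered by both mates of a pair.

    Positive: the mates overlap by this many bases.
    Negative: this many bases in the middle of the fragment are not sequenced.
    Assumes fragment_length >= read_length.
    """
    return 2 * read_length - fragment_length


# Values taken from the sequencing description above.
print(expected_overlap(read_length=100, fragment_length=180))  # 20
```

An overlap of about 20 bp means each pair covers its fragment end to end, which is worth keeping in mind when interpreting insert-size statistics later.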

Main Steps of Study & Software

Reads quality check (performed before and after trimming)

Purpose: To assess quality of DNA and RNA reads, and identify low-quality reads for trimming
Program: FastQC documentation, help
Est. Time: ~15 min
Input Data: DNA and RNA sequence files (Illumina Fastq)
Output Data: HTML file and .zip file tutorial1, tutorial2
Parameters:
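
To make the quality check concrete, here is a minimal sketch of the kind of per-position summary that a quality report is built on. It is not FastQC's own code. It assumes a plain four-line-per-record fastq and Phred+33 quality encoding, and the two reads in the example are made up.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def per_position_mean(records):
    """Mean Phred score at each read position across all records."""
    columns = {}
    for _, _, qual in records:
        for i, q in enumerate(phred_scores(qual)):
            columns.setdefault(i, []).append(q)
    return [mean(columns[i]) for i in sorted(columns)]


# Two made-up reads: the second one drops in quality towards its 3' end.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5#\n")
print(per_position_mean(read_fastq(example)))
```

A falling mean towards the end of the reads is the pattern that motivates the trimming step.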

Trimming

Purpose: To remove adapter sequences and low quality reads
Program: Trimmomatic manual
Est. Time: ?
Input Data: Illumina fastq files, including gzip (.gz) and bzip2 (.bz2) compressed files
Output Data: Paired (fwd) fastq, Paired (rev) fastq, Unpaired (fwd) fastq, Unpaired (rev) fastq
Parameters:
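
One common trimming strategy is a sliding window: scan the read from the 5' end and cut it where the average quality inside the window first drops below a threshold. The sketch below is a simplified illustration of that idea, not Trimmomatic's implementation. The window size (4) and threshold (15) are assumed example values, not parameters chosen in this plan.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=15):
    """Cut a read at the start of the first window whose mean quality < min_mean.

    seq   : read sequence (str)
    quals : list of integer Phred scores, same length as seq
    Reads shorter than the window are returned unchanged.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(max(len(quals) - window + 1, 0)):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


# Made-up read whose quality decays towards the 3' end.
trimmed_seq, trimmed_quals = sliding_window_trim(
    "ACGTACGTAC", [38, 38, 38, 38, 30, 20, 10, 5, 2, 2]
)
print(trimmed_seq, trimmed_quals)
```

Reads that become very short after trimming are usually dropped, which is why the trimmer produces both paired and unpaired output files.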

Metagenome assembly

Purpose: To assemble reads into contigs
Program: Megahit info, manual
Est. Time: ~6h (2 cores)
Input Data: Two paired-end fastq files (compressed .gz ok)
Output Data: Assembled contigs in fasta file
Parameters: Use “--kmin-1pass” to reduce memory usage; without it, the job crashes on UPPMAX
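
A minimal sketch of how the assembly call could be put together from Python. The file and directory names are hypothetical placeholders, and the helper function is only illustrative. The options used are MEGAHIT's -1/-2 (paired-end inputs), -o (output directory), -t (threads) and --kmin-1pass.

```python
import shlex


def megahit_command(fwd, rev, out_dir, threads=2, low_memory=True):
    """Build the MEGAHIT argument list for one paired-end sample."""
    cmd = ["megahit", "-1", fwd, "-2", rev, "-o", out_dir, "-t", str(threads)]
    if low_memory:
        # Lower-memory mode noted in the parameters above.
        cmd.append("--kmin-1pass")
    return cmd


# Hypothetical file names for the trimmed, paired reads of one site.
cmd = megahit_command("D1_forward_paired.fastq.gz", "D1_reverse_paired.fastq.gz", "megahit_D1")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` inside a batch job. Building it as a list rather than a single string avoids quoting problems with file names.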

Quality check of assembly

Purpose: To identify misassemblies, quantify metrics (no. uncalled bases, mismatches, etc.)
Program: Quast manual
Est. Time: ~45 min (2 cores)
Input Data: Assembled contigs in fasta format (compressed file ok)
Output Data: ~10 report files (including report plots pdf, summary txt, k-mer contig and reads-specific info)
Parameters:
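
One of the standard contiguity metrics in an assembly report is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. A short sketch of the calculation, using made-up contig lengths:

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Made-up contig lengths: total 300 bp, so half is 150 bp.
print(n50([100, 80, 60, 40, 20]))  # 80
```

N50 alone says nothing about correctness, so it is best read together with the misassembly and mismatch metrics in the same report.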

Binning

Purpose: To identify separate genomes
Program: Metabat wiki, tutorial
Est. Time: ~30 min (2 cores)
Input Data: Assembly file in fasta format
Output Data: Discovered bins saved in fasta format
Parameters:
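
MetaBAT groups contigs partly by their tetranucleotide (4-mer) composition, since contigs from the same genome tend to share a similar signature. The sketch below only illustrates what such a signature is. It is not MetaBAT's implementation, it does not merge reverse-complement 4-mers, and it ignores coverage information.

```python
from collections import Counter


def tetranucleotide_frequencies(seq):
    """Relative frequency of each 4-mer in a sequence, skipping 4-mers with non-ACGT bases."""
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= valid
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Made-up 8 bp sequence: 5 overlapping 4-mers, 'ACGT' occurs twice.
print(tetranucleotide_frequencies("ACGTACGT"))
```

Contigs with similar frequency profiles are candidates for the same bin, which is why very short contigs (too few 4-mers for a stable profile) are hard to bin.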

Quality check of binning

Purpose: To assess quality of bins, as well as genome quality by comparing to taxonomy/ref genome data
Program: CheckM (binning) manual
Est. Time: ~2h (2 cores)
Input Data: Binned contigs (one fasta file per bin; file names must end in .fna)
Output Data: Various: an output report and downloadable QC files (example)
Parameters:
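
Bin quality is usually summarized as completeness and contamination, both estimated from single-copy marker genes. The sketch below is a deliberately simplified version of that idea: it treats every marker independently, whereas CheckM groups markers into collocated sets and picks lineage-specific markers. The marker names and counts are made up.

```python
def completeness_contamination(marker_counts):
    """Simplified completeness and contamination (both in percent).

    marker_counts maps each expected single-copy marker to the number
    of copies found in the bin.
    """
    n = len(marker_counts)
    if n == 0:
        raise ValueError("no markers given")
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    extra = sum(copies - 1 for copies in marker_counts.values() if copies > 1)
    return 100 * present / n, 100 * extra / n


# Made-up bin: one marker missing, one marker present twice.
print(completeness_contamination({"m1": 1, "m2": 1, "m3": 2, "m4": 0}))  # (75.0, 25.0)
```

A missing marker lowers completeness, while a duplicated marker suggests that contigs from more than one genome ended up in the same bin.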

Functional annotation

Purpose: "PROKKA automates the process of locating open reading frames (ORFs) and RNA regions on contigs, translating ORFs to protein sequences, searching for protein homologs and producing standard output files." source
Program: Prokka manual
Est. Time: ~1h (2 cores)
Input Data: Assembly data in fasta format
Output Data: Tab-delimited file of genome annotations (GFF); Genbank file (GBK)
Parameters:
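
A quick sanity check on an annotation is to count how many features of each type it contains. This is a minimal sketch, assuming a standard nine-column tab-delimited GFF3 file that may end with a ##FASTA section. The records in the example are made up.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count GFF features by type (column 3), stopping at the ##FASTA section."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more annotation records
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Made-up annotation with two coding sequences and one tRNA.
example = io.StringIO(
    "##gff-version 3\n"
    "contig_1\ttoy\tCDS\t1\t300\t.\t+\t0\tID=gene_1;product=hypothetical protein\n"
    "contig_1\ttoy\tCDS\t400\t900\t.\t-\t0\tID=gene_2;product=hypothetical protein\n"
    "contig_2\ttoy\ttRNA\t10\t85\t.\t+\t.\tID=gene_3;product=tRNA-Ala\n"
    "##FASTA\n"
    ">contig_1\n"
    "ACGT\n"
)
print(dict(count_feature_types(example)))
```

The same loop can be extended to read the attribute column (column 9) when looking for specific metabolic genes by product name.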

RNA Mapping

Purpose: Mapping reads to reference genomes
Program: BWA manual, tutorial
Est. Time: ~6h (2 cores)
Input Data: Reference sequences in fasta format (indexed with BWA before mapping); paired-end reads in fastq format
Output Data: SAM (Sequence Alignment/Map) file (more info)
Parameters:
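
After mapping, a first check is how many reads actually aligned. In a SAM file this is encoded in the FLAG field (column 2): bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary alignments. A minimal sketch with made-up alignment records:

```python
import io


def count_mapped(handle):
    """Return (mapped, unmapped) read counts from a SAM stream.

    Secondary (0x100) and supplementary (0x800) records are skipped so
    that each read is counted once.
    """
    mapped = unmapped = 0
    for line in handle:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up SAM: one properly mapped pair (flags 99/147), one unmapped pair (77/141).
example = io.StringIO(
    "@SQ\tSN:contig_1\tLN:1000\n"
    "r1\t99\tcontig_1\t10\t60\t4M\t=\t50\t44\tACGT\tIIII\n"
    "r1\t147\tcontig_1\t50\t60\t4M\t=\t10\t-44\tACGT\tIIII\n"
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
print(count_mapped(example))  # (2, 2)
```

A low mapped fraction for the RNA reads would suggest that much of the expressed community is missing from the assembled bins.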

Phylogenetic placement of bins

Purpose: Reconstructing phylogenies of prokaryotes, "PhyloPhlAn can assign both genomes and metagenome-assembled genomes (MAGs) to species-level genome bins (SGBs)."
Program: PhyloPhlan reference/manual
Est. Time: ~6h (2 cores)
Input Data: Assembly fasta files
Output Data: Treefiles(?)
Parameters:

Analysis of expression of different bins

Purpose: Counting and comparing mapped reads (Differential expression cannot be performed because the data will not have enough statistical power)
Program: HTseq manual
Est. Time: ?
Input Data: Alignment file of mapped reads (SAM/BAM) and feature annotation file (GFF/GTF) (tutorial)
Output Data: txt file
Parameters:
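
Because the two sites will have different sequencing depths, raw counts should be scaled before they are compared. The sketch below reads a two-column count table of the kind htseq-count writes (feature name, tab, count, with summary rows whose names start with "__") and converts it to counts per million. Counts per million is one simple normalization chosen here for illustration, not a method fixed by this plan, and the gene names and numbers are made up.

```python
import io


def read_htseq_counts(handle):
    """Split an htseq-count table into gene counts and '__' summary rows."""
    counts, summary = {}, {}
    for line in handle:
        if not line.strip():
            continue
        name, value = line.rstrip("\n").split("\t")
        target = summary if name.startswith("__") else counts
        target[name] = int(value)
    return counts, summary


def counts_per_million(counts):
    """Scale gene counts so they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any feature")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Made-up count table for one sample.
example = io.StringIO("gene_a\t30\ngene_b\t10\n__no_feature\t5\n__ambiguous\t0\n")
gene_counts, summary_rows = read_htseq_counts(example)
print(counts_per_million(gene_counts))
print(summary_rows)
```

With one sample per site, the scaled values can be compared side by side, but as noted above there is not enough replication for a statistical test of differential expression.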

A Project Plan - cecilia-andersson/Genome_Analysis_Project GitHub Wiki
