A Project Plan - cecilia-andersson/Genome_Analysis_Project GitHub Wiki

Genome Analysis Project Plan

Paper: Metabolic Roles of Uncultivated Bacterioplankton Lineages in the Northern Gulf of Mexico “Dead Zone”
Thrash, J. Cameron et al. (2017)

Purpose of Study

This study relies on data and methods from the 2017 paper listed above. Files and code are located in the 'master' branch.

Understanding the factors that contribute to the seasonally occurring hypoxic "dead zone" in the northern Gulf of Mexico (nGOM) is integral to proposing solutions to manage it. These seasonal "dead zones" occur when bacterioplankton and other small organisms consume dissolved oxygen (DO) while metabolizing the large amounts of dead algae that follow an algal bloom. It is therefore important to characterize these (often unculturable) microorganisms and the metabolic pathways that contribute to this depletion of DO. To do this, I will perform a metagenomic analysis of the microorganisms from two of the six collection sites used in the 2017 paper: I will assemble and bin genomes to identify distinct species, functionally annotate the sequences to identify genes linked to metabolism, and use transcriptome analyses to predict metabolic pathways.

Project Goals

  • Assemble metagenome
  • Phylogenetic placement of bins
  • Functional annotation
  • Expression analysis

Project Timeline

Projected Date | Analysis | Prep Time | Run Time | Total Time | Completed?
2023-03-31 | Pre-Processing | 1h | 1h | 2h | Yes
2023-04-07 | Assembly + QC | 3h | 7h | 10h | Yes
2023-04-14 | Binning + QC | 2h | 2.5h | 4.5h | Yes
2023-04-14 | Annotation | 1.5h | 1h | 2.5h | Yes
2023-04-21 | Mapping | 2h | 6h | 8h | Yes
2023-04-28 | Phylogeny | 2h | 6h | 8h | Yes
2023-05-05 | Expression Analysis | 2h | 1h | 3h | Yes
2023-05-12 | Extra Analysis (TBD) | ?h | ?h | ?h | No
2023-05-15 | Extra Analysis (TBD) | ?h | ?h | ?h | No

Bolded: predicted bottleneck

Data Overview

  • Collection site background: Precise location of sites, DO levels, and nutrient content will be included when I access site data.
  • Accession No. (DNA, RNA) | Site | Lat (N) | Lon (E) | DO (umol/kg) | Chla (ug/L)* | PO4 (uM) | Si(OH)4 (uM) | NO3 (uM) | NO2 (uM) | NH4 (uM)
    SAMN05791315, SAMN05791321 | D1, oxic | 28.9833 | -90.8333 | 131.71 | 0.93 | 0.58 | 9.09 | 3.97 | 0.84 | 5.76
    SAMN05791317, SAMN05791323 | D3, hypoxic | 28.7167 | -90.8333 | 12.79 | 0.66 | 1.9 | 30.46 | 13.88 | 0.86 | 2.39

    *Chlorophyll a concentration is used as a proxy for the amount of phytoplankton in the water

  • Sample collection: Saltwater samples were collected from two sites (one oxic, one hypoxic) in the northern Gulf of Mexico. DNA and RNA were extracted from these samples and purified. DNA quantity was determined, and RNA quality was measured. RNA with an RNA Integrity Number (16S/23S rRNA) below 8 was discarded, and ribosomal RNA was removed, leaving only mRNA. This mRNA was then reverse transcribed to cDNA prior to sequencing.
  • Illumina HiSeq 2000 Whole-Genome Sequencing: Genomic DNA and RNA-seq cDNA were sequenced separately, producing 100 bp paired-end reads from 180 bp fragments.
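Because each 180 bp fragment is read 100 bp from both ends, the two mates should overlap in the middle of the fragment. The arithmetic can be sketched as follows; the function name is hypothetical and the calculation assumes every fragment has exactly the stated size.

```python
def mate_overlap(read_len: int, fragment_len: int) -> int:
    """Number of bases covered by both mates of one paired-end fragment."""
    # Two reads of read_len cover 2 * read_len bases in total; whatever exceeds
    # the fragment length is covered twice. The overlap can never be negative
    # (mates that do not meet) or larger than the fragment itself.
    return max(0, min(2 * read_len - fragment_len, fragment_len))


print(mate_overlap(100, 180))  # 20
```

An overlap of about 20 bp is worth keeping in mind when interpreting coverage in the middle of fragments.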

Main Steps of Study & Software

Reads quality check (performed before and after trimming)

Purpose: To assess quality of DNA and RNA reads, and identify low-quality reads for trimming
Program: FastQC documentation, help
Est. Time: ~15 min
Input Data: DNA and RNA sequence files (Illumina Fastq)
Output Data: HTML file and .zip file tutorial1, tutorial2
Parameters:
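A minimal sketch of how the FastQC call could be assembled from Python, assuming the `fastqc` executable is on the PATH. The file and directory names are placeholders, not the project's actual paths.

```python
import shlex


def fastqc_command(fastq_files, out_dir, threads=2):
    """Build a FastQC command line.

    -o sets the output directory (FastQC expects it to exist already),
    -t sets the number of files processed in parallel.
    """
    return ["fastqc", "-o", str(out_dir), "-t", str(threads), *map(str, fastq_files)]


# Placeholder file names for one paired-end sample.
cmd = fastqc_command(["reads_1.fastq.gz", "reads_2.fastq.gz"], "fastqc_raw")
print(shlex.join(cmd))
# fastqc -o fastqc_raw -t 2 reads_1.fastq.gz reads_2.fastq.gz

# To run it for real: subprocess.run(cmd, check=True)
```

Running the same function on the trimmed files with a different output directory gives the "after trimming" reports.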

Trimming

Purpose: To remove adapter sequences and low-quality bases, and to drop reads that become too short after trimming
Program: Trimmomatic manual
Est. Time: ?
Input Data: Illumina fastq files, including gzip (.gz) and bzip2 (.bz2) compressed files
Output Data: Paired (fwd) fastq, Paired (rev) fastq, Unpaired (fwd) fastq, Unpaired (rev) fastq
Parameters:
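A sketch of a paired-end Trimmomatic call, with several assumptions: a `trimmomatic` wrapper script is on the PATH (otherwise the call starts with `java -jar` and the path to the jar), the trimming steps are the example values from the Trimmomatic manual rather than settings chosen for this project, and all file names are placeholders. Note that on the command line the outputs go in the order forward paired, forward unpaired, reverse paired, reverse unpaired.

```python
def trimmomatic_pe_command(fwd, rev, prefix, adapters, threads=2):
    """Build a Trimmomatic paired-end command line."""
    # Output order required by Trimmomatic PE: 1P, 1U, 2P, 2U.
    outputs = [
        f"{prefix}_1P.fastq.gz",  # forward, mate survived
        f"{prefix}_1U.fastq.gz",  # forward, mate dropped
        f"{prefix}_2P.fastq.gz",  # reverse, mate survived
        f"{prefix}_2U.fastq.gz",  # reverse, mate dropped
    ]
    # Example steps from the manual (assumed values, to be tuned after FastQC).
    steps = [
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3",
        "TRAILING:3",
        "SLIDINGWINDOW:4:15",
        "MINLEN:36",
    ]
    return ["trimmomatic", "PE", "-threads", str(threads), "-phred33",
            fwd, rev, *outputs, *steps]


cmd = trimmomatic_pe_command("reads_1.fastq.gz", "reads_2.fastq.gz",
                             "trim", "TruSeq3-PE.fa")
print(" ".join(cmd))
```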

Metagenome assembly

Purpose: To assemble reads into contigs
Program: Megahit info, manual
Est. Time: ~6h (2 cores)
Input Data: Two paired-end fastq files (compressed .gz ok)
Output Data: Assembled contigs in fasta file
Parameters: Use “--kmin-1pass” to reduce memory usage; without it the job crashes on UPPMAX
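The call could look roughly like the sketch below, assuming `megahit` is on the PATH. The input and output names are placeholders; `--kmin-1pass` is the memory-saving option mentioned above.

```python
def megahit_command(fwd, rev, out_dir, threads=2, low_memory=True):
    """Build a MEGAHIT command line for one paired-end library."""
    # -1/-2: forward and reverse reads, -o: output directory, -t: CPU threads.
    cmd = ["megahit", "-1", fwd, "-2", rev, "-o", out_dir, "-t", str(threads)]
    if low_memory:
        # Build the smallest k-mer graph in one pass to reduce memory usage.
        cmd.append("--kmin-1pass")
    return cmd


cmd = megahit_command("trim_1P.fastq.gz", "trim_2P.fastq.gz", "megahit_out")
print(" ".join(cmd))
# megahit -1 trim_1P.fastq.gz -2 trim_2P.fastq.gz -o megahit_out -t 2 --kmin-1pass
```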

Quality check of assembly

Purpose: To identify misassemblies, quantify metrics (no. uncalled bases, mismatches, etc.)
Program: Quast manual
Est. Time: ~45 min (2 cores)
Input Data: Assembled contigs in fasta format (compressed file ok)
Output Data: ~10 report files (including report plots pdf, summary txt, k-mer contig and reads-specific info)
Parameters:
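To make the reported metrics concrete, here is a small self-contained sketch that computes a few of them by hand: contig count, total length, uncalled bases (N), and N50. It is an illustration of the definitions, not a replacement for QUAST, and the function names are hypothetical.

```python
def n50(lengths):
    """Length of the contig at which the running sum of lengths, taken from the
    longest contig downwards, first reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def contig_stats(fasta_text):
    """Basic assembly metrics from FASTA-formatted text."""
    lengths = []
    uncalled = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)          # start a new contig
        elif lengths:
            lengths[-1] += len(line)   # sequence may span several lines
            uncalled += line.upper().count("N")
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "uncalled_bases": uncalled,
        "n50": n50(lengths),
    }


example = ">c1\nACGTNN\nACGT\n>c2\nACG\n"
print(contig_stats(example))
# {'contigs': 2, 'total_length': 13, 'uncalled_bases': 2, 'n50': 10}
```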

Binning

Purpose: To identify separate genomes
Program: Metabat wiki, tutorial
Est. Time: ~30 min (2 cores)
Input Data: Assembly file in fasta format
Output Data: Discovered bins saved in fasta format
Parameters:
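A sketch of the binning call, assuming the `metabat2` executable is on the PATH. The optional depth file is an assumption about how the run might be set up: MetaBAT can use per-contig coverage (for example produced by its companion tool `jgi_summarize_bam_contig_depths` from sorted BAM files) in addition to sequence composition. All file names are placeholders.

```python
def metabat2_command(assembly, out_prefix, depth_file=None, threads=2):
    """Build a MetaBAT 2 command line.

    -i: assembled contigs (FASTA), -o: prefix for the bin files,
    -t: threads, -a: optional per-contig depth table.
    """
    cmd = ["metabat2", "-i", assembly, "-o", out_prefix, "-t", str(threads)]
    if depth_file is not None:
        cmd += ["-a", depth_file]
    return cmd


cmd = metabat2_command("megahit_out/final.contigs.fa", "bins/bin")
print(" ".join(cmd))
# metabat2 -i megahit_out/final.contigs.fa -o bins/bin -t 2
```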

Quality check of binning

Purpose: To assess the quality of bins by estimating completeness and contamination against lineage-specific marker gene sets derived from reference genomes
Program: CheckM (binning) manual
Est. Time: ~2h (2 cores)
Input Data: Bins in fasta format (CheckM looks for the .fna extension by default!)
Output Data: Various: output report and downloadable QC files example
Parameters:
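A sketch of a CheckM `lineage_wf` call, assuming `checkm` is on the PATH. Directory names are placeholders. Instead of renaming bin files to end in .fna, the `-x` option can tell CheckM which extension to look for, which is convenient if the binner writes .fa files.

```python
def checkm_command(bin_dir, out_dir, extension="fna", threads=2, reduced_tree=False):
    """Build a CheckM lineage workflow command line."""
    # -x: file extension of the bins, -t: threads.
    cmd = ["checkm", "lineage_wf", "-x", extension, "-t", str(threads)]
    if reduced_tree:
        # Smaller reference tree, intended for machines with limited memory.
        cmd.append("--reduced_tree")
    return cmd + [bin_dir, out_dir]


cmd = checkm_command("bins", "checkm_out", extension="fa", reduced_tree=True)
print(" ".join(cmd))
# checkm lineage_wf -x fa -t 2 --reduced_tree bins checkm_out
```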

Functional annotation

Purpose: "PROKKA automates the process of locating open reading frames (ORFs) and RNA regions on contigs, translating ORFs to protein sequences, searching for protein homologs and producing standard output files." source
Program: Prokka manual
Est. Time: ~1h (2 cores)
Input Data: Assembly data in fasta format
Output Data: Tab-delimited file of genome annotations (GFF); Genbank file (GBK)
Parameters:
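A sketch of how Prokka might be run once per bin, assuming `prokka` is on the PATH. The bin file names and the output layout are placeholders; `--metagenome` is Prokka's option for highly fragmented input and is included here as an assumption about what suits these bins.

```python
from pathlib import Path


def prokka_command(fasta, out_root, cpus=2, metagenome=True):
    """Build a Prokka command line for one bin; outputs are named after the bin."""
    name = Path(fasta).stem  # e.g. "bin.1" for "bins/bin.1.fa"
    cmd = ["prokka", "--outdir", str(Path(out_root) / name),
           "--prefix", name, "--cpus", str(cpus)]
    if metagenome:
        cmd.append("--metagenome")
    return cmd + [str(fasta)]


# One command per bin; each gets its own output directory.
commands = [prokka_command(f, "prokka_out") for f in ["bins/bin.1.fa", "bins/bin.2.fa"]]
for cmd in commands:
    print(" ".join(cmd))
```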

RNA Mapping

Purpose: Mapping reads to reference genomes
Program: BWA manual, tutorial
Est. Time: ~6h (2 cores)
Input Data: Reference sequences in fasta format (indexed with bwa index); paired-end reads in fastq format
Output Data: SAM file (sequence alignment map) more info
Parameters:
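Mapping with BWA is a two-step process: the reference is indexed once, then reads are aligned with `bwa mem`, which writes SAM to standard output. A minimal sketch, assuming `bwa` is on the PATH and using placeholder file names:

```python
def bwa_commands(reference, fwd, rev, threads=2):
    """Return the (index, align) command lines for paired-end mapping with BWA-MEM."""
    index_cmd = ["bwa", "index", reference]
    mem_cmd = ["bwa", "mem", "-t", str(threads), reference, fwd, rev]
    return index_cmd, mem_cmd


index_cmd, mem_cmd = bwa_commands("bins/bin.1.fa", "rna_1P.fastq.gz", "rna_2P.fastq.gz")
print(" ".join(index_cmd))
print(" ".join(mem_cmd))

# To run for real, redirect the alignment output to a SAM file:
# import subprocess
# subprocess.run(index_cmd, check=True)
# with open("bin.1.sam", "w") as sam:
#     subprocess.run(mem_cmd, check=True, stdout=sam)
```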

Phylogenetic placement of bins

Purpose: Reconstructing phylogenies of prokaryotes, "PhyloPhlAn can assign both genomes and metagenome-assembled genomes (MAGs) to species-level genome bins (SGBs)."
Program: PhyloPhlAn reference/manual
Est. Time: ~6h (2 cores)
Input Data: Assembly fasta files
Output Data: Treefiles(?)
Parameters:

Analysis of expression of different bins

Purpose: Counting and comparing mapped reads (Differential expression cannot be performed because the data will not have enough statistical power)
Program: HTSeq manual
Est. Time: ?
Input Data: Alignment file (SAM/BAM) and annotation file (GTF/GFF) tutorial
Output Data: txt file
Parameters:
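The count files from htseq-count are tab-delimited (feature, count) and end with summary rows whose names start with `__` (for example `__no_feature`). A sketch of one way to make counts from the oxic and hypoxic samples comparable is to scale them to counts per million (CPM); CPM is an assumption here, chosen as a simple normalization, not a method fixed by this plan. The feature names in the example are made up.

```python
def parse_htseq_counts(text):
    """Parse htseq-count output into {feature: count}, skipping '__' summary rows."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):
            continue  # e.g. __no_feature, __ambiguous
        counts[feature] = int(value)
    return counts


def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: count * 1_000_000 / total for feature, count in counts.items()}


# Made-up example data for two hypothetical features.
example = "geneA\t30\ngeneB\t70\n__no_feature\t900\n"
cpm = counts_per_million(parse_htseq_counts(example))
print(cpm)
# {'geneA': 300000.0, 'geneB': 700000.0}
```

Scaling by library size only corrects for sequencing depth; it does not provide the statistical testing that a differential expression analysis would.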
