A Project Plan - cecilia-andersson/Genome_Analysis_Project GitHub Wiki
Paper: Metabolic Roles of Uncultivated Bacterioplankton Lineages in the Northern Gulf of Mexico “Dead Zone”
Thrash, J. Cameron et al. (2017)
This study relies on data and methods from the 2017 paper listed above. Files and code are located in the 'master' branch.
Studying the factors that contribute to the seasonally occurring hypoxic "dead zone" in the northern Gulf of Mexico (nGOM) is integral to proposing solutions to manage it. These seasonal "dead zones" occur when bacterioplankton and other small organisms consume dissolved oxygen (DO) while metabolizing the large amounts of dead algae left after an algal bloom. It is therefore important to characterize these (often unculturable) microorganisms, as well as the metabolic pathways through which they deplete DO. To do this, I will perform a metagenomic analysis of the microorganisms from two of the six collection sites used in the 2017 paper: I will assemble and bin genomes to identify distinct species, functionally annotate the sequences to identify genes linked to metabolism, and use transcriptome analyses to predict metabolic pathways.
- Assemble metagenome
- Phylogenetic placement of bins
- Functional annotation
- Expression analysis
| Projected Date | Analysis | Prep Time | Run Time | Total Time | Completed? |
| --- | --- | --- | --- | --- | --- |
| 2023-03-31 | Pre-Processing | 1h | 1h | 2h | Yes |
| 2023-04-07 | Assembly + QC | 3h | 7h | 10h | Yes |
| 2023-04-14 | Binning + QC | 2h | 2.5h | 4.5h | Yes |
| 2023-04-14 | Annotation | 1.5h | 1h | 2.5h | Yes |
| 2023-04-21 | Mapping | 2h | 6h | 8h | Yes |
| 2023-04-28 | Phylogeny | 2h | 6h | 8h | Yes |
| 2023-05-05 | Expression Analysis | 2h | 1h | 3h | Yes |
| 2023-05-12 | Extra Analysis (TBD) | ?h | ?h | ?h | No |
| 2023-05-15 | Extra Analysis (TBD) | ?h | ?h | ?h | No |
Bolded: predicted bottleneck
- Collection site background: Precise location of sites, DO levels, and nutrient content will be included when I access site data.
- Sample collection: Saltwater samples were collected from two sites (one oxic, one hypoxic) in the northern Gulf of Mexico. DNA and RNA were extracted from these samples and purified. DNA quantity was determined, and RNA quality was measured. RNA with an RNA Integrity Number (16S/23S rRNA) below 8 was discarded, and ribosomal RNA was removed, leaving only mRNA. This mRNA was then reverse transcribed to cDNA prior to sequencing.
- Illumina HiSeq 2000 Whole-Genome Sequencing: Genomic DNA and RNA-seq cDNA were sequenced separately, producing 100 bp paired-end reads from 180 bp fragments.
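
With 100 bp paired-end reads from 180 bp fragments, the two mates of a pair should overlap in the middle of the fragment. The sketch below works through that arithmetic. It assumes that both mates are full length and that 180 bp is the whole fragment without adapters; the function name is only illustrative.

```python
def read_pair_overlap(read_length, fragment_length):
    """Number of bases covered by both mates of a read pair.

    Assumes both mates are read_length long and fragment_length is the
    full fragment without adapters. A negative result means there is a
    gap between the mates instead of an overlap.
    """
    # The overlap can never be longer than the fragment itself.
    return min(fragment_length, 2 * read_length - fragment_length)


print(read_pair_overlap(100, 180))  # 20
```

An expected overlap of about 20 bp is worth keeping in mind when reading the trimming and mapping results.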
| Accession No. (DNA, RNA) | Site | Lat (N) | Lon (E) | DO (umol/kg) | Chla (ug/L)* | PO4 (uM) | Si(OH)4 (uM) | NO3 (uM) | NO2 (uM) | NH4 (uM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAMN05791315, SAMN05791321 | D1, oxic | 28.9833 | -90.8333 | 131.71 | 0.93 | 0.58 | 9.09 | 3.97 | 0.84 | 5.76 |
| SAMN05791317, SAMN05791323 | D3, hypoxic | 28.7167 | -90.8333 | 12.79 | 0.66 | 1.9 | 30.46 | 13.88 | 0.86 | 2.39 |
*Chlorophyll a concentration is used as a proxy for phytoplankton abundance in the water
Purpose: To assess quality of DNA and RNA reads, and identify low-quality reads for trimming
Program: FastQC documentation, help
Est. Time: ~15 min
Input Data: DNA and RNA sequence files (Illumina Fastq)
Output Data: HTML file and .zip file tutorial1, tutorial2
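
To make "low-quality reads" concrete, here is a minimal sketch that decodes FASTQ quality strings and lists reads with a low mean Phred score. It is not how FastQC builds its report. The Phred+33 encoding, the cutoff of 20, and the function names are all assumptions made for illustration.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def low_quality_reads(fastq_text, min_mean=20.0):
    """Return the IDs of reads whose mean quality is below min_mean."""
    lines = fastq_text.strip().splitlines()
    flagged = []
    # FASTQ records span four lines: @id, sequence, '+', qualities.
    for i in range(0, len(lines), 4):
        read_id, quality = lines[i][1:], lines[i + 3]
        if mean_phred(quality) < min_mean:
            flagged.append(read_id)
    return flagged


# Two made-up reads: 'I' decodes to Q40, '#' decodes to Q2.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n"
print(low_quality_reads(example))  # ['read2']
```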
Purpose: To remove adapter sequences and low quality reads
Program: Trimmomatic manual
Est. Time: ?
Input Data: Illumina fastq files, including gzip (.gz) and bzip2 (.bz2) compressed files
Output Data: Paired (fwd) fastq, Paired (rev) fastq, Unpaired (fwd) fastq, Unpaired (rev) fastq
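
The four output files have to be given to Trimmomatic in a fixed order in paired-end mode: forward paired, forward unpaired, reverse paired, reverse unpaired. The sketch below builds that argument list. The file naming scheme and the default trimming steps (taken from the style of the manual's examples) are assumptions, not settings chosen for this project. The launcher is left out because it depends on the installation (for example `java -jar` plus the path to the jar).

```python
def trimmomatic_pe_args(fwd, rev, prefix,
                        steps=("SLIDINGWINDOW:4:15", "MINLEN:36")):
    """Arguments for Trimmomatic paired-end mode, without the launcher.

    The output names derived from prefix are a hypothetical convention;
    the steps are illustrative defaults, not tuned values.
    """
    outputs = [
        f"{prefix}_1P.fastq.gz",  # forward reads whose mate survived
        f"{prefix}_1U.fastq.gz",  # forward reads whose mate was dropped
        f"{prefix}_2P.fastq.gz",  # reverse reads whose mate survived
        f"{prefix}_2U.fastq.gz",  # reverse reads whose mate was dropped
    ]
    return ["PE", fwd, rev, *outputs, *steps]


print(" ".join(trimmomatic_pe_args("dna_1.fastq.gz", "dna_2.fastq.gz", "dna_trimmed")))
```

Only the two paired outputs go on to assembly as a pair; the unpaired files hold reads that lost their mate during trimming.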
Purpose: To assemble reads into contigs
Program: Megahit info, manual
Est. Time: ~6h (2 cores)
Input Data: Two paired-end fastq files (compressed .gz ok)
Output Data: Assembled contigs in fasta file
Parameters: Use "--kmin-1pass" to reduce memory usage; otherwise the run crashes on UPPMAX
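
A minimal sketch of how the Megahit call could be put together with the memory-saving flag. The file and directory names are placeholders, and the thread count of 2 simply mirrors the time estimate above.

```python
import shlex


def megahit_command(fwd, rev, out_dir, threads=2, low_memory=True):
    """Build a Megahit command line for one paired-end library."""
    cmd = ["megahit", "-1", fwd, "-2", rev, "-o", out_dir, "-t", str(threads)]
    if low_memory:
        # Lower-memory mode noted above as necessary on UPPMAX.
        cmd.append("--kmin-1pass")
    return cmd


# Placeholder file names, for illustration only.
print(shlex.join(megahit_command("fwd.fastq.gz", "rev.fastq.gz", "assembly_out")))
# megahit -1 fwd.fastq.gz -2 rev.fastq.gz -o assembly_out -t 2 --kmin-1pass
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with file names.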
Purpose: To identify misassemblies and quantify assembly metrics (number of uncalled bases, mismatches, etc.)
Program: Quast manual
Est. Time: ~45 min (2 cores)
Input Data: Assembled contigs in fasta format (compressed file ok)
Output Data: ~10 report files (including report plots pdf, summary txt, k-mer contig and reads-specific info)
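
One of the summary metrics in the Quast report is N50. As a reminder of what that number means, here is a short sketch of the usual definition: the length of the contig at which the running total, counted from the longest contig down, reaches half of the assembly length. The contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold half the assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Made-up lengths: total 280, so half is 140; 100 + 80 passes it at 80.
print(n50([100, 80, 50, 30, 20]))  # 80
```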
Purpose: To group contigs into bins that represent separate genomes
Program: Metabat wiki, tutorial
Est. Time: ~30 min (2 cores)
Input Data: Assembly file in fasta format
Output Data: Discovered bins saved in fasta format
Purpose: To assess quality of bins, as well as genome quality by comparing to taxonomy/ref genome data
Program: CheckM (binning) manual
Est. Time: ~2h (2 cores)
Input Data: Bins in fasta format (CheckM looks for the .fna extension by default)
Output Data: Variety: output report and downloadable QC files example
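
CheckM reports a completeness and a contamination estimate for each bin, which can be used to decide which bins to carry forward. The sketch below shows one possible filter. The cutoffs (at least 50% complete, under 10% contaminated) and the bin values are assumptions for illustration, not results or decisions from this project.

```python
def passing_bins(bins, min_completeness=50.0, max_contamination=10.0):
    """Names of bins that meet both quality cutoffs.

    bins maps a bin name to (completeness %, contamination %).
    The default cutoffs are illustrative assumptions.
    """
    return [
        name
        for name, (completeness, contamination) in bins.items()
        if completeness >= min_completeness and contamination < max_contamination
    ]


# Hypothetical values, only to show the filter in action.
example_bins = {
    "bin_1": (95.0, 2.0),
    "bin_2": (40.0, 1.0),   # too incomplete
    "bin_3": (80.0, 15.0),  # too contaminated
}
print(passing_bins(example_bins))  # ['bin_1']
```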
Purpose: "PROKKA automates the process of locating open reading frames (ORFs) and RNA regions on contigs, translating ORFs to protein sequences, searching for protein homologs and producing standard output files." source
Program: Prokka manual
Est. Time: ~1h (2 cores)
Input Data: Assembly data in fasta format
Output Data: Tab-delimited file of genome annotations (GFF); Genbank file (GBK)
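
A quick way to sanity-check the GFF output is to count how many features of each type were annotated. The sketch below assumes the standard nine-column GFF3 layout and stops at the `##FASTA` marker, after which Prokka's GFF holds the sequences. The example lines are made up.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count features per type (column 3) in GFF3 text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


# Made-up annotation lines, for illustration only.
example = (
    "##gff-version 3\n"
    "contig_1\tsrc\tCDS\t10\t400\t.\t+\t0\tID=gene_1\n"
    "contig_1\tsrc\ttRNA\t500\t575\t.\t-\t.\tID=gene_2\n"
    "contig_2\tsrc\tCDS\t1\t900\t.\t+\t0\tID=gene_3\n"
    "##FASTA\n>contig_1\nACGT\n"
)
print(dict(count_feature_types(example)))  # {'CDS': 2, 'tRNA': 1}
```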
Purpose: Mapping reads to reference genomes
Program: BWA manual, tutorial
Est. Time: ~6h (2 cores)
Input Data: Reference sequences in fasta format (indexed with bwa index) and paired-end reads in fastq format
Output Data: SAM file (sequence alignment map) more info
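
The second column of each SAM record is a bitwise flag, and bit 0x4 marks an unmapped read. The sketch below counts mapped and unmapped reads from SAM text, skipping secondary (0x100) and supplementary (0x800) records so that each read is counted once. The example records are made up.

```python
def mapping_summary(sam_text):
    """Return (mapped, unmapped) read counts from SAM text."""
    mapped = unmapped = 0
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # secondary or supplementary alignment
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up records: one mapped pair (flags 99/147), one unmapped pair (77/141).
example = (
    "@SQ\tSN:contig_1\tLN:1000\n"
    "r1\t99\tcontig_1\t1\t60\t4M\t=\t50\t53\tACGT\tIIII\n"
    "r1\t147\tcontig_1\t50\t60\t4M\t=\t1\t-53\tACGT\tIIII\n"
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
print(mapping_summary(example))  # (2, 2)
```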
Purpose: Reconstructing phylogenies of prokaryotes, "PhyloPhlAn can assign both genomes and metagenome-assembled genomes (MAGs) to species-level genome bins (SGBs)."
Program: PhyloPhlAn reference/manual
Est. Time: ~6h (2 cores)
Input Data: Assembly fasta files
Output Data: Treefiles(?)