Basic Genomics Workflow - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki
We use the "Minoche" filtering method from illumina-utils (https://github.com/merenlab/illumina-utils), followed by trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic) to make sure there are no illumina adaptor sequences on the raw reads (ICBR does NOT remove adaptors). To begin, create a config file for each of your samples.
There is a python script in the repository for our resource page. You can download it here. Place the file in the folder where all of your R1 and R2 reads are. Make sure that those are the only files in the folder. Then run the script.
python make_genome_assembly_files.py _your_email_here
It will generate config files for all your pairs, and the 3 batch files (1_qualityControl.txt, 2_assembly.txt, 3_quast.txt) to use.
This is an example config file. Change the "project_name" to your sample name and replace email and full path as appropriate for your data. When sequencing on the NextSeq, you will get 4 R1 reads and 4 R2 reads for each sample, simply list them all in the config file, separated by commas. The output of quality-filtered reads will be just a single R1 and a single R2 for the sample. Repeat for each metagenome library.
[general]
project_name = MM17-34
researcher_email = [email protected]
input_directory = /blue/juliemeyer/share/March2021_genome_assembly_party/
output_directory = /blue/juliemeyer/share/March2021_genome_assembly_party/
[files]
pair_1 = MM17-34_S10_L001_R1_001.fastq.gz
pair_2 = MM17-34_S10_L001_R2_001.fastq.gz
Sample script for minoche quality filtering and trimmomatic to remove illumina adaptors and nextera transposases
#!/bin/bash
#SBATCH --job-name=QC_%j
#SBATCH --output=QC_%j.log
#SBATCH --error=QC_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
module load fastqc
fastqc *.fastq.gz
### create config files for each library
module load illumina-utils/2.11
iu-filter-quality-minoche config_iu_MM17-34.txt
#remove illumina adaptors
module load trimmomatic
trimmomatic PE -phred33 -summary trim_stats.txt MM17-34-QUALITY_PASSED_R1.fastq.gz MM17-34-QUALITY_PASSED_R2.fastq.gz MM17-34_trimmed_paired_R1.fastq.gz MM17-34_trimmed_unpaired_R1.fastq.gz MM17-34_trimmed_paired_R2.fastq.gz MM17-34_trimmed_unpaired_R2.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-PE.fa:2:30:10 ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
fastqc *trimmed_paired_R*.fastq.gz
#!/bin/bash
#SBATCH --job-name=assemble_%j
#SBATCH --output=assemble_%j.log
#SBATCH --error=assemble_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b
module load spades
spades.py --isolate -1 MM17-34_trimmed_paired_R1.fastq.gz -2 MM17-34_trimmed_paired_R2.fastq.gz -o MM17_34_assembly/
The default for this spades assembly is to keep all contigs, no matter how small. However, typically we do not have a use for contigs < 1,000 bp. I believe NCBI will make you remove them before they can be uploaded. You can use anvio to remove the small contigs. An example of this script is below. Anvio can be run on hipergator or locally.
module load anvio/7.1
anvi-script-reformat-fasta Pseudoalteromonas_BZA4_wshort.fa -o Pseudoalteromonas_BZA4.fa -l 1000
The two primary ways I like to assess genome quality are with 1) quast and 2) MiGA. At the end of your assembly script, you will run quast, which will produce a pdf summary of the assembly statistics. This will let you know how well the assembly performed. A good assembly will result in a small number of very long contigs (contiguous sequences). A poor assembly will result in a large number of small contigs. Quast will allow you to assess this, providing total assembled length, length of longest contig, number of contigs, etc. To assess how well you have captured the genome content (based on single-copy genes) and to show taxonomic identity based on genome content, you will use MiGA Online. For MiGA, create a free account and upload your fasta file of the assembly under the "NCBI Prok" box. Wait awhile for it to run and check on your results later.
After assembly, run quast to generate a report that gives you information about about the quality of your assembly.
module load quast
quast.py -o quast_MM17-34 MM17_34_assembly/contigs.fasta
- original sequencing reads
- final quality-filtered reads used for assembly
- fastqc of original raw sequencing reads
- fastqc of final quality-filtered reads used for assembly
- quast report (pdf)
- fasta of assembled genome (rename it with strain name, for example, "MM17-34_contigs.fasta" instead of "contigs.fasta")
- MiGA Online results: Genus (or lowest taxonomy with high certainty), % Completeness, % Contamination, Quality rating (High, Excellent)
- annotation of genes - prokka (hipergator) - will give you annotated .fna, annotated .faa, total genome length, total coding sequences (CDS), etc and/or submit to IMG/MER (see below) that will give similar products (but it takes time to go through their system).
module load prokka
prokka --outdir samplename_prokka/ --prefix samplename samplename.fasta
- pangenome analysis (only for genomes in the same genus) - roary (input = directory of .gff files from prokka)
module load roary
roary -e --mafft -i 90 -r *.gff
module load fasttree
FastTree -nt -gtr core_gene_alignment.aln > core_gene_tree.newick
visualize the output of roary with phandango
-
Pangenomes: calculate Average Nucleotide Identity (ANI) pairwise or in batches
-
Pangenomes: calculate digital DNA-DNA Hybridization (dDDH)
-
Find antibiotics & secondary metabolites: antiSMASH analysis - web-based or on hipergator
-
Find carbohydrate active enzymes (CAZyme annotation) - dbCA metaserver
-
Find Virulence Factors for genera related to human pathogens (Rickettsia, Vibrio, Clostridia): VFDB (Virulence Factor Database)
-
submit original sequencing reads to NCBI's Sequence Read Archive
-
submit genome to GenBank
-
optional: submit genome to IMG/MER for comparative tools, looking at gene neighborhoods, multiple annotations (COG, KEGG, pfam, enzyme #s, etc), map gene presence on KEGG pathways.