Steps for Genome Assembly

Quality Control
Assembly
Assess Genome Quality
Generate Scripts and Files

Step 1 is always Quality Control

We use the "Minoche" filtering method from illumina-utils (https://github.com/merenlab/illumina-utils), followed by trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic) to make sure there are no illumina adaptor sequences on the raw reads (ICBR does NOT remove adaptors). To begin, create a config file for each of your samples.

Generate config files and scripts with python

There is a python script in the repository for our resource page. You can download it here. Place the file in the folder where all of your R1 and R2 reads are. Make sure that those are the only files in the folder. Then run the script.

python make_genome_assembly_files.py _your_email_here

It will generate config files for all your pairs, and the 3 batch files (1_qualityControl.txt, 2_assembly.txt, 3_quast.txt) to use.

This is an example config file. Change the "project_name" to your sample name and replace email and full path as appropriate for your data. When sequencing on the NextSeq, you will get 4 R1 reads and 4 R2 reads for each sample, simply list them all in the config file, separated by commas. The output of quality-filtered reads will be just a single R1 and a single R2 for the sample. Repeat for each metagenome library.

[general]
project_name = MM17-34
researcher_email = [email protected]
input_directory = /blue/juliemeyer/share/March2021_genome_assembly_party/
output_directory = /blue/juliemeyer/share/March2021_genome_assembly_party/

[files]
pair_1 = MM17-34_S10_L001_R1_001.fastq.gz
pair_2 = MM17-34_S10_L001_R2_001.fastq.gz

Sample script for minoche quality filtering and trimmomatic to remove illumina adaptors and nextera transposases

#!/bin/bash
#SBATCH --job-name=QC_%j
#SBATCH --output=QC_%j.log
#SBATCH --error=QC_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b

module load fastqc
fastqc *.fastq.gz

### create config files for each library

module load illumina-utils/2.11
iu-filter-quality-minoche config_iu_MM17-34.txt

#remove illumina adaptors
module load trimmomatic
trimmomatic PE -phred33 -summary trim_stats.txt MM17-34-QUALITY_PASSED_R1.fastq.gz MM17-34-QUALITY_PASSED_R2.fastq.gz MM17-34_trimmed_paired_R1.fastq.gz MM17-34_trimmed_unpaired_R1.fastq.gz MM17-34_trimmed_paired_R2.fastq.gz MM17-34_trimmed_unpaired_R2.fastq.gz ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/TruSeq3-PE.fa:2:30:10 ILLUMINACLIP:${HPC_TRIMMOMATIC_ADAPTER}/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

fastqc *trimmed_paired_R*.fastq.gz

Step 2 Assembly

#!/bin/bash
#SBATCH --job-name=assemble_%j
#SBATCH --output=assemble_%j.log
#SBATCH --error=assemble_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
#SBATCH --mem-per-cpu=100gb
#SBATCH --time=4-00:00:00
#SBATCH --account=juliemeyer
#SBATCH --qos=juliemeyer-b

module load spades
spades.py --isolate -1 MM17-34_trimmed_paired_R1.fastq.gz -2 MM17-34_trimmed_paired_R2.fastq.gz -o MM17_34_assembly/

The default for this spades assembly is to keep all contigs, no matter how small. However, typically we do not have a use for contigs < 1,000 bp. I believe NCBI will make you remove them before they can be uploaded. You can use anvio to remove the small contigs. An example of this script is below. Anvio can be run on hipergator or locally.

module load anvio/7.1
anvi-script-reformat-fasta Pseudoalteromonas_BZA4_wshort.fa -o Pseudoalteromonas_BZA4.fa -l 1000

Step 3 Assess Genome Quality

The two primary ways I like to assess genome quality are with 1) quast and 2) MiGA. At the end of your assembly script, you will run quast, which will produce a pdf summary of the assembly statistics. This will let you know how well the assembly performed. A good assembly will result in a small number of very long contigs (contiguous sequences). A poor assembly will result in a large number of small contigs. Quast will allow you to assess this, providing total assembled length, length of longest contig, number of contigs, etc. To assess how well you have captured the genome content (based on single-copy genes) and to show taxonomic identity based on genome content, you will use MiGA Online. For MiGA, create a free account and upload your fasta file of the assembly under the "NCBI Prok" box. Wait awhile for it to run and check on your results later.

After assembly, run quast to generate a report that gives you information about about the quality of your assembly.

module load quast
quast.py -o quast_MM17-34  MM17_34_assembly/contigs.fasta

Files to keep for publication and posterity:

original sequencing reads
final quality-filtered reads used for assembly
fastqc of original raw sequencing reads
fastqc of final quality-filtered reads used for assembly
quast report (pdf)
fasta of assembled genome (rename it with strain name, for example, "MM17-34_contigs.fasta" instead of "contigs.fasta")
MiGA Online results: Genus (or lowest taxonomy with high certainty), % Completeness, % Contamination, Quality rating (High, Excellent)

Next Steps with an Assembled Genome

annotation of genes - prokka (hipergator) - will give you annotated .fna, annotated .faa, total genome length, total coding sequences (CDS), etc and/or submit to IMG/MER (see below) that will give similar products (but it takes time to go through their system).

module load prokka
prokka --outdir samplename_prokka/ --prefix samplename samplename.fasta

pangenome analysis (only for genomes in the same genus) - roary (input = directory of .gff files from prokka)

module load roary
roary -e --mafft -i 90 -r *.gff

module load fasttree
FastTree -nt -gtr core_gene_alignment.aln > core_gene_tree.newick

visualize the output of roary with phandango

Pangenomes: calculate Average Nucleotide Identity (ANI) pairwise or in batches
Pangenomes: calculate digital DNA-DNA Hybridization (dDDH)
Find antibiotics & secondary metabolites: antiSMASH analysis - web-based or on hipergator
Find carbohydrate active enzymes (CAZyme annotation) - dbCA metaserver
Find Virulence Factors for genera related to human pathogens (Rickettsia, Vibrio, Clostridia): VFDB (Virulence Factor Database)
anvi'o pangenome tutorial for visualization
submit original sequencing reads to NCBI's Sequence Read Archive
submit genome to GenBank
optional: submit genome to IMG/MER for comparative tools, looking at gene neighborhoods, multiple annotations (COG, KEGG, pfam, enzyme #s, etc), map gene presence on KEGG pathways.

Basic Genomics Workflow - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki

Steps for Genome Assembly

Step 1 is always Quality Control

Generate config files and scripts with python

Sample script for minoche quality filtering and trimmomatic to remove illumina adaptors and nextera transposases

Step 2 Assembly

Step 3 Assess Genome Quality

Files to keep for publication and posterity:

Next Steps with an Assembled Genome

⚠️ GitHub.com Fallback ⚠️

Basic Genomics Workflow - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki

Steps for Genome Assembly

Step 1 is always Quality Control

Generate config files and scripts with python

Sample script for minoche quality filtering and trimmomatic to remove illumina adaptors and nextera transposases

Step 2 Assembly

Step 3 Assess Genome Quality

Files to keep for publication and posterity:

Next Steps with an Assembled Genome

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️