09 Short Read Assembly - NU-CPGME/quest_genomics_2025 GitHub Wiki

February 2025

Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])


In this section we are going to go through a de novo genome assembly pipeline one step at a time to understand the process.

Before we start:

Make a directory with your NetID (or whatever you want) in our workshop directory:

cd /projects/b1042/OzerLab/workshop/playground

mkdir -p <netid>/assembly

cd <netid>/assembly

Now activate the assembly conda environment:

conda activate /projects/p30002/condaenvs/short_assembly_env

Read files for working on assemblies can be found here: /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads

Section 1 - Checking read counts

perl /projects/p30002/Scripts/fastq_stats.pl -p /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS > GAS.read_stats.untrimmed.txt

-p: Gives the path and the prefix for the reads (i.e. GAS_1.fastq.gz and GAS_2.fastq.gz)

Section 2 - Downsampling

For de_novo assemblies very deep read coverages can result in increased analysis time with either little to no improvement in assembly quality, and sometime worse quality as a result of compounding random errors. It is sometime beneficial to reduce read coverage to 60 - 100x.

For this example we'll downsample to 40x for faster assembly.

perl /projects/p30002/Scripts/downsample_paired.pl \
  -p GAS_ds \
  -g 1790000 \
  -f 40 \
  -o \
  /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_1.fastq.gz \
  /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_2.fastq.gz 

-p Output file prefix

-g Estimated genome size, in bp

-f Fold coverage to outpt

-o Guarantees output, even if the input read set is less than the value given to -f

Section 3 - Species estimation using Kraken

Sometimes what we think we're sequencing isn't what we actually get. This can be due to sample contamination or mix-up or misidentification. To get a quick assessment of the sequence species, we'll use Kraken.

export KRAKEN_DEFAULT_DB="/projects/p30002/Applications/kraken-1.0/minikraken_20171019_8GB"
kraken \
    --preload --threads 1 --fastq-input --gzip-compressed \
    --paired GAS_ds_1.fastq.gz GAS_ds_2.fastq.gz \
    | kraken-report - > GAS.kraken_report.txt

This command loads the kraken database into memory, then compares the reads against the database to produce a report of the percentage of reads assigned to each level

perl /projects/p30002/Scripts/kraken-topspecies.pl GAS.kraken_report.txt GAS > GAS.kraken_topspecies.txt

This small script distills the kraken report down to the top species-level assignment and percentage of reads assigned to the species.

Section 4 - Read trimming

Before we assemble, we'll do some very light read trimming. This step removes any Illumina adapter sequences that may have made it into the reads due to read-through of short library fragments. These adapter sequences can sometimes get incorporated into your assembly and cause misassemblies. Better to take the time to remove them prior to assembly.

We are going to use fastp to perform the trimming step. Fastp can perform a number of functions including adapter removal as well remove low-quality sequences from the reads.

Commands

fastp \
    --in1 GAS_ds_1.fastq.gz --in2 GAS_ds_2.fastq.gz \
    --out1 GAS_trimmed_paired_1.fastq.gz --unpaired1 GAS_trimmed_unpaired_1.fastq.gz \
    --out2 GAS_trimmed_paired_2.fastq.gz --unpaired2 GAS_trimmed_unpaired_2.fastq.gz \
    -h GAS_fastp.html -j GAS_fastp.json \
    -w 1

Settings Used

Setting Descripton
--in1 / --in2 Input read files
--out1 / --out2 Paired output read files
--unpaired1 / --unpaired2 Singleton read files (only one of the paired reads passed filters)
-h Filtering and trimming report, html format
-j Filtering and trimming reoprt, json format
-w Number of parallel threads to use

See fastp manual for more detail on settings and other options.

Outputs

Files Description
GAS_trimmed_paired_1.fastq.gz & _2.fastq.gz Paired reads remaining after trimming
GAS_trimmed_unpaired_1.fastq.gz & _2.fastq.gz Unpaired reads remaining after trimming
GAS_fastp.html Filtering and trimming report. Can view with Firefox or other web browser

Section 5 - Checking post-processing read counts

We only care about the paired read outputs.

perl /projects/p30002/Scripts/fastq_stats.pl -p GAS_trimmed_paired > GAS.read_stats.trimmed.txt

Section 6 - Assemble with SPAdes

Now we'll generate a de novo whole genome assembly from our trimmed reads. For this we'll use the assembler SPAdes (Github site).

Warning

This step takes too many resources and is too slow to do in the login node. We'll submit a batch script for this instead.

nano spades.sh

Copy and paste the below code into the file you just opened on Quest using nano

spades.sh

#!/bin/bash
#SBATCH --job-name="spades"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o spades.out
#SBATCH -e spades.err

eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/short_assembly_env

spades.py \
    -o GAS_spades \
    -1 GAS_trimmed_paired_1.fastq.gz \
    -2 GAS_trimmed_paired_2.fastq.gz \
    --isolate \
    -t 12 --cov-cutoff auto \
    | tee GAS_spades_output.txt

Close and save the document, then submit the job request: sbatch spades.sh

Outputs

All of the output files can be found in the GAS_assembly folder. There are a lot of files in there, so we're just going to pick a few of the most relevant to describe.

Files Description
contigs.fasta Assembly contig sequences.
scaffolds.fasta Scaffold sequnences. Scaffolds consist of contigs joined by N's at sites with paired reads support.
spades.log Log file. Useful for troubleshooting poor or failed assemblies.
assembly_graph.fastg Graph of assembly showing contig connections. Can be useful for assembly quality assessment.

Section 6 - Remove sequence contamination

Here we'll use the NCBI foreign contammination screen fcsadaptor to remove any non-pathogen or sequencing adapter contamination. Running the reads through fastp above should have done most of this, but there's sometimes a bit remaining.

This program is in a singularity container, so we'll activate the singluarity module on Quest.

module load singularity

mkdir -p GAS_fcsadaptor

/projects/p30002/Applications/fcsadaptor/run_fcsadaptor.sh \
    --fasta-input GAS_spades/contigs.fasta \
    --output-dir GAS_fcsadaptor \
    --prok \
    --container-engine singularity \
    --image /projects/p30002/Applications/fcsadaptor/fcs-adaptor.sif

Then we'll run a script that takes any identified contamination and removes it from the sequences.

perl /projects/p30002/Scripts/ncbi_fcsadaptor_contamination_filter.pl \
    -f GAS_spades/contigs.fasta \
    -c GAS_fcsadaptor/fcs_adaptor_report.txt \
    > GAS_fcsadaptor/contigs_fixed.fasta

Section 7 - Filter the final assembly

This step aligns the reads back to the assembly to identify any sequences with low coverage that are more likely to be contamination and removes any sequences smaller than 200 bp.

It also looks for any reads that appear to be phiX, the internal sequencing control used in Illumina runs. This is far less of a problem than it used to be with updates to Illumina instrument software, but it's easy enough to leave it in.

perl /projects/p30002/Scripts/assembly_qc_filter.pl \
    -c GAS_fcsadaptor/contigs_fixed.fasta \
    -1 GAS_trimmed_paired_1.fastq.gz \
    -2 GAS_trimmed_paired_2.fastq.gz \
    -m 5 -l 200 -t 1 \
    -p /projects/p30002/Scripts/phiX.fasta \
    -x 98 -o GAS > GAS.qc_filter_stats.txt

Section 8 - Assembly quality assessment

Here we'll use CheckM to estimate assembled genome completeness and contamination. Again this is a slow one so we'll submit it as a Slurm job.

nano checkm.sh

checkm.sh

#!/bin/bash
#SBATCH --job-name="checkm"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:30:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o checkm.out
#SBATCH -e checkm.err

eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/short_assembly_env

echo -e "GAS\tGAS.filtered_sequences.fasta" > GAS.checkm_source.txt
export CHECKM_DATA_PATH=/projects/p30002/checkm_database

checkm lineage_wf \
  -t 12 \
  --tab_table \
  -f GAS.checkm_results.txt \
  GAS.checkm_source.txt \
  GAS_checkm
sbatch checkm.sh

Results will be summarized in the GAS.checkm_results.txt file. The ones to pay attention to are the Completeness, Contamination, and Strain heterogeneity values.

From the CheckM wiki:

  • completeness: estimated completeness of genome as determined from the presence/absence of marker genes and the expected collocalization of these genes.
  • contamination: estimated contamination of genome as determined by the presence of multi-copy marker genes and the expected collocalization of these genes.
  • strain heterogeneity: estimated strain heterogeneity as determined from the number of multi-copy marker pairs which exceed a specified amino acid identity threshold (default = 90%). High strain heterogeneity suggests the majority of reported contamination is from one or more closely related organisms (i.e. potentially the same species), while low strain heterogeneity suggests the majority of contamination is from more phylogenetically diverse sources.

Batch script

To make everyone's life (and especially mine) easier, this workflow has been combined into a Perl script that generates a Slurm array file based on your inputs. All you need is all your read files in a directory and a file listing all the read prefixes.

Make a new folder and add the example sequence list to it.

mkdir assembly_batch

cd assembly_batch

# Copy the list of file prefixes
cp /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/batch/list.txt .

Now run the batch assembly script. We are selecting a 60X downsample (-d). Don't forget to add your email address with the -e flag. Make sure there are single quotes ('') around it.

perl /projects/p30002/batch_scripts/assembly_batch.pl \
  -o . \
  -r /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/batch \
  -s list.txt \
  -d 60 \
  -e '<your email address>'

Getting assembly statistics

I have provided a script to produce a summary of statistics from assemblies generated using the assembly_batch.pl script. The script can be run anytime during the assembly process and will fill in as data is being generated.

The script must be called from within the output directory given to the -o flag in assembly_batch.pl as well as the file listing the prefixes of the assemblies:

perl /projects/p30002/batch_scripts/assembly_batch_stats.pl list.txt

Here are the output columns:

Column header Description
id Sequence prefix
untrim_bases Number of bases before trimming
untrim_reads Number of reads before trimming
trim_bases Number of bases after downsampling / trimmming
trim_reads Number of reads after downsampling / trimming
contigs_num Number of assembled contigs
contigs_size Total size of contigs, in bp
contigs_n50 N50 of assembled contigs
contigs_gc GC% of assembled contigs
avg_cov Average fold-coverage based on trimmmed bases
cm_lineage CheckM lineage assignment
cm_complete CheckM completeness %
cm_contam CheckM contamination %
cm_heterog CheckM heterogeneity %
kraken_species Kraken species assignment


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

⚠️ **GitHub.com Fallback** ⚠️