09 Short Read Assembly - NU-CPGME/quest_genomics_2025 GitHub Wiki
February 2025
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])
In this section we are going to go through a de novo genome assembly pipeline one step at a time to understand the process.
Before we start:
Make a directory with your NetID (or whatever you want) in our workshop directory:
cd /projects/b1042/OzerLab/workshop/playground
mkdir -p <netid>/assembly
cd <netid>/assembly
Now activate the assembly conda environment:
conda activate /projects/p30002/condaenvs/short_assembly_env
Read files for working on assemblies can be found here: /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads
perl /projects/p30002/Scripts/fastq_stats.pl -p /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS > GAS.read_stats.untrimmed.txt
-p
: Gives the path and the prefix for the reads (i.e. GAS_1.fastq.gz
and GAS_2.fastq.gz
)
For de_novo assemblies very deep read coverages can result in increased analysis time with either little to no improvement in assembly quality, and sometime worse quality as a result of compounding random errors. It is sometime beneficial to reduce read coverage to 60 - 100x.
For this example we'll downsample to 40x for faster assembly.
perl /projects/p30002/Scripts/downsample_paired.pl \
-p GAS_ds \
-g 1790000 \
-f 40 \
-o \
/projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_1.fastq.gz \
/projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/GAS_2.fastq.gz
-p
Output file prefix
-g
Estimated genome size, in bp
-f
Fold coverage to outpt
-o
Guarantees output, even if the input read set is less than the value given to -f
Sometimes what we think we're sequencing isn't what we actually get. This can be due to sample contamination or mix-up or misidentification. To get a quick assessment of the sequence species, we'll use Kraken.
export KRAKEN_DEFAULT_DB="/projects/p30002/Applications/kraken-1.0/minikraken_20171019_8GB"
kraken \
--preload --threads 1 --fastq-input --gzip-compressed \
--paired GAS_ds_1.fastq.gz GAS_ds_2.fastq.gz \
| kraken-report - > GAS.kraken_report.txt
This command loads the kraken database into memory, then compares the reads against the database to produce a report of the percentage of reads assigned to each level
perl /projects/p30002/Scripts/kraken-topspecies.pl GAS.kraken_report.txt GAS > GAS.kraken_topspecies.txt
This small script distills the kraken report down to the top species-level assignment and percentage of reads assigned to the species.
Before we assemble, we'll do some very light read trimming. This step removes any Illumina adapter sequences that may have made it into the reads due to read-through of short library fragments. These adapter sequences can sometimes get incorporated into your assembly and cause misassemblies. Better to take the time to remove them prior to assembly.
We are going to use fastp to perform the trimming step. Fastp can perform a number of functions including adapter removal as well remove low-quality sequences from the reads.
Commands
fastp \
--in1 GAS_ds_1.fastq.gz --in2 GAS_ds_2.fastq.gz \
--out1 GAS_trimmed_paired_1.fastq.gz --unpaired1 GAS_trimmed_unpaired_1.fastq.gz \
--out2 GAS_trimmed_paired_2.fastq.gz --unpaired2 GAS_trimmed_unpaired_2.fastq.gz \
-h GAS_fastp.html -j GAS_fastp.json \
-w 1
Settings Used
Setting | Descripton |
---|---|
--in1 / --in2 |
Input read files |
--out1 / --out2 |
Paired output read files |
--unpaired1 / --unpaired2 |
Singleton read files (only one of the paired reads passed filters) |
-h |
Filtering and trimming report, html format |
-j |
Filtering and trimming reoprt, json format |
-w |
Number of parallel threads to use |
See fastp manual for more detail on settings and other options.
Outputs
Files | Description |
---|---|
GAS_trimmed_paired_1.fastq.gz & _2.fastq.gz
|
Paired reads remaining after trimming |
GAS_trimmed_unpaired_1.fastq.gz & _2.fastq.gz
|
Unpaired reads remaining after trimming |
GAS_fastp.html |
Filtering and trimming report. Can view with Firefox or other web browser |
We only care about the paired read outputs.
perl /projects/p30002/Scripts/fastq_stats.pl -p GAS_trimmed_paired > GAS.read_stats.trimmed.txt
Now we'll generate a de novo whole genome assembly from our trimmed reads. For this we'll use the assembler SPAdes (Github site).
Warning
This step takes too many resources and is too slow to do in the login node. We'll submit a batch script for this instead.
nano spades.sh
Copy and paste the below code into the file you just opened on Quest using nano
spades.sh
#!/bin/bash
#SBATCH --job-name="spades"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o spades.out
#SBATCH -e spades.err
eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/short_assembly_env
spades.py \
-o GAS_spades \
-1 GAS_trimmed_paired_1.fastq.gz \
-2 GAS_trimmed_paired_2.fastq.gz \
--isolate \
-t 12 --cov-cutoff auto \
| tee GAS_spades_output.txt
Close and save the document, then submit the job request: sbatch spades.sh
Outputs
All of the output files can be found in the GAS_assembly
folder. There are a lot of files in there, so we're just going to pick a few of the most relevant to describe.
Files | Description |
---|---|
contigs.fasta |
Assembly contig sequences. |
scaffolds.fasta |
Scaffold sequnences. Scaffolds consist of contigs joined by N's at sites with paired reads support. |
spades.log |
Log file. Useful for troubleshooting poor or failed assemblies. |
assembly_graph.fastg |
Graph of assembly showing contig connections. Can be useful for assembly quality assessment. |
Here we'll use the NCBI foreign contammination screen fcsadaptor
to remove any non-pathogen or sequencing adapter contamination. Running the reads through fastp
above should have done most of this, but there's sometimes a bit remaining.
This program is in a singularity container, so we'll activate the singluarity module on Quest.
module load singularity
mkdir -p GAS_fcsadaptor
/projects/p30002/Applications/fcsadaptor/run_fcsadaptor.sh \
--fasta-input GAS_spades/contigs.fasta \
--output-dir GAS_fcsadaptor \
--prok \
--container-engine singularity \
--image /projects/p30002/Applications/fcsadaptor/fcs-adaptor.sif
Then we'll run a script that takes any identified contamination and removes it from the sequences.
perl /projects/p30002/Scripts/ncbi_fcsadaptor_contamination_filter.pl \
-f GAS_spades/contigs.fasta \
-c GAS_fcsadaptor/fcs_adaptor_report.txt \
> GAS_fcsadaptor/contigs_fixed.fasta
This step aligns the reads back to the assembly to identify any sequences with low coverage that are more likely to be contamination and removes any sequences smaller than 200 bp.
It also looks for any reads that appear to be phiX, the internal sequencing control used in Illumina runs. This is far less of a problem than it used to be with updates to Illumina instrument software, but it's easy enough to leave it in.
perl /projects/p30002/Scripts/assembly_qc_filter.pl \
-c GAS_fcsadaptor/contigs_fixed.fasta \
-1 GAS_trimmed_paired_1.fastq.gz \
-2 GAS_trimmed_paired_2.fastq.gz \
-m 5 -l 200 -t 1 \
-p /projects/p30002/Scripts/phiX.fasta \
-x 98 -o GAS > GAS.qc_filter_stats.txt
Here we'll use CheckM to estimate assembled genome completeness and contamination. Again this is a slow one so we'll submit it as a Slurm job.
nano checkm.sh
checkm.sh
#!/bin/bash
#SBATCH --job-name="checkm"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:30:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o checkm.out
#SBATCH -e checkm.err
eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/short_assembly_env
echo -e "GAS\tGAS.filtered_sequences.fasta" > GAS.checkm_source.txt
export CHECKM_DATA_PATH=/projects/p30002/checkm_database
checkm lineage_wf \
-t 12 \
--tab_table \
-f GAS.checkm_results.txt \
GAS.checkm_source.txt \
GAS_checkm
sbatch checkm.sh
Results will be summarized in the GAS.checkm_results.txt
file. The ones to pay attention to are the Completeness, Contamination, and Strain heterogeneity values.
From the CheckM wiki:
- completeness: estimated completeness of genome as determined from the presence/absence of marker genes and the expected collocalization of these genes.
- contamination: estimated contamination of genome as determined by the presence of multi-copy marker genes and the expected collocalization of these genes.
- strain heterogeneity: estimated strain heterogeneity as determined from the number of multi-copy marker pairs which exceed a specified amino acid identity threshold (default = 90%). High strain heterogeneity suggests the majority of reported contamination is from one or more closely related organisms (i.e. potentially the same species), while low strain heterogeneity suggests the majority of contamination is from more phylogenetically diverse sources.
To make everyone's life (and especially mine) easier, this workflow has been combined into a Perl script that generates a Slurm array file based on your inputs. All you need is all your read files in a directory and a file listing all the read prefixes.
Make a new folder and add the example sequence list to it.
mkdir assembly_batch
cd assembly_batch
# Copy the list of file prefixes
cp /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/batch/list.txt .
Now run the batch assembly script. We are selecting a 60X downsample (-d
). Don't forget to add your email address with the -e
flag. Make sure there are single quotes ('') around it.
perl /projects/p30002/batch_scripts/assembly_batch.pl \
-o . \
-r /projects/b1042/OzerLab/workshop/bacteria_assembly/workshop_reads/batch \
-s list.txt \
-d 60 \
-e '<your email address>'
I have provided a script to produce a summary of statistics from assemblies generated using the assembly_batch.pl
script. The script can be run anytime during the assembly process and will fill in as data is being generated.
The script must be called from within the output directory given to the -o
flag in assembly_batch.pl
as well as the file listing the prefixes of the assemblies:
perl /projects/p30002/batch_scripts/assembly_batch_stats.pl list.txt
Here are the output columns:
Column header | Description |
---|---|
id | Sequence prefix |
untrim_bases | Number of bases before trimming |
untrim_reads | Number of reads before trimming |
trim_bases | Number of bases after downsampling / trimmming |
trim_reads | Number of reads after downsampling / trimming |
contigs_num | Number of assembled contigs |
contigs_size | Total size of contigs, in bp |
contigs_n50 | N50 of assembled contigs |
contigs_gc | GC% of assembled contigs |
avg_cov | Average fold-coverage based on trimmmed bases |
cm_lineage | CheckM lineage assignment |
cm_complete | CheckM completeness % |
cm_contam | CheckM contamination % |
cm_heterog | CheckM heterogeneity % |
kraken_species | Kraken species assignment |
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.