Incorporating taxonomic information into sylph with sylph-tax
[!NOTE] This manual uses sylph-tax, which replaces the old sylph-utils program for taxonomy integration. The old manual for sylph-utils is available here.
Sylph's TSV outputs contain no taxonomic information. However, the sylph-tax program can convert sylph's output into a taxonomic profile (with taxonomic annotations).
How to generate taxonomic profiles using sylph-tax
See the sylph-tax repository for more information. For a quick start:
conda install -c bioconda sylph-tax
# download taxonomies
sylph-tax download --download-to /any/location
# profiling with GTDB-r220
sylph profile gtdb-r220-c200-dbv1.syldb ... -o sylph_results/out.tsv
# incorporate GTDB-r220 taxonomy into sylph's results
sylph-tax taxprof sylph_results/*.tsv -t GTDB_r220
ls *.sylphmpa
.sylphmpa taxonomic profiling output format
*.sylphmpa files look like this:
#SampleID /home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz Taxonomies_used:['GTDB_r220']
clade_name relative_abundance sequence_abundance ANI (if strain-level) Coverage (if strain-level)
d__Bacteria 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria 100.00010000000003 100.00019999999996 NA NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales 35.6384 36.0603 NA NA
....
[!TIP] This is a valid TSV file, but rows prefixed with # are comments. You can read .sylphmpa files with pandas in Python like pd.read_csv('output.sylphmpa', sep='\t', comment='#').
There are five important columns:
- clade_name: a string like d__Bacteria|p__Actinomycetota|c__Acidimicrobiia|o__Acidimicrobiales|f__Ilumatobacteraceae that describes the clade. t__STRAIN represents the exact genome identifier.
- relative_abundance: the taxonomic relative abundance of the clade.
- sequence_abundance: the sequence abundance of the clade, i.e. the % of reads assigned.
- ANI: NA except at the strain level (t__STRAIN), where it is sylph's ANI estimate.
- Coverage: the Eff_cov or True_cov column of sylph's output.
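The column layout above can be read without pandas as well. The following is a minimal sketch using only the Python standard library; the helper names `read_sylphmpa` and `rows_at_rank` are hypothetical (not part of sylph-tax), and the sample text assumes the columns are tab-separated as in the example output shown earlier.

```python
import csv

# Sample rows copied from the example output above (assumed tab-separated).
SAMPLE = (
    "#SampleID\t/home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz\t"
    "Taxonomies_used:['GTDB_r220']\n"
    "clade_name\trelative_abundance\tsequence_abundance\t"
    "ANI (if strain-level)\tCoverage (if strain-level)\n"
    "d__Bacteria\t100.00010000000003\t100.00019999999996\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota\t100.00010000000003\t100.00019999999996\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria\t"
    "100.00010000000003\t100.00019999999996\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales\t"
    "35.6384\t36.0603\tNA\tNA\n"
)


def read_sylphmpa(text):
    """Parse .sylphmpa text into a list of dict rows, skipping '#' comment lines."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def rows_at_rank(rows, prefix):
    """Keep rows whose deepest clade level starts with a rank prefix such as 'p__'."""
    return [r for r in rows if r["clade_name"].split("|")[-1].startswith(prefix)]


rows = read_sylphmpa(SAMPLE)
for row in rows_at_rank(rows, "o__"):
    print(row["clade_name"], float(row["relative_abundance"]))
```

Filtering on the last `|`-separated element works because each row's clade_name ends at the rank that row summarizes; pass `"t__"` to keep only strain-level rows, where the ANI and Coverage columns are populated.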
[!TIP] Viral-host information is available for IMG/VR 4.1. The -a option adds a new column in the .sylphmpa files associating viral genomes to their hosts. For example:
r__Duplodnaviria|k__Heunggongvirae|p__Uroviricota|c__Caudoviricetes|||||t__IMGVR_UViG_2503982007_000001 ... d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis
where IMGVR_UViG_2503982007_000001's host is Staphylococcus epidermidis.
Creating custom taxonomies
If you're working with custom sylph databases, you can easily create your own taxonomy metadata file. You can look at our pre-built taxonomy files (https://zenodo.org/records/14320496) for examples.
A taxonomic metadata file is simply a two-column TSV file:
- Column 1: the name of your genome's FASTA file, e.g. my_mag.fa
- Column 2: a semicolon-delimited taxonomy string, e.g. d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus
Note: do not add the t__STRAIN level to the taxonomy string.
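A metadata file in this two-column layout can be written by hand or generated with a few lines of Python. The sketch below is one possible approach; `write_taxonomy_tsv` is a hypothetical helper (not part of sylph-tax), and the check for a `t__` level simply enforces the note above.

```python
import io


def write_taxonomy_tsv(entries, handle):
    """Write (fasta_name, taxonomy_string) pairs as a two-column TSV."""
    for fasta_name, taxonomy in entries:
        # Per the note above, the taxonomy string should not carry a t__ level.
        if any(level.startswith("t__") for level in taxonomy.split(";")):
            raise ValueError(f"remove the t__ level from the entry for {fasta_name}")
        handle.write(f"{fasta_name}\t{taxonomy}\n")


# File name and taxonomy string taken from the example above.
entries = [
    (
        "my_mag.fa",
        "d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;"
        "f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus",
    ),
]

buffer = io.StringIO()
write_taxonomy_tsv(entries, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead of an in-memory buffer, pass an open file handle such as `open("taxonomy.tsv", "w")`.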
Custom taxonomy example usage case
You obtained two new MAGs, genome1.fa and genome2.fa, and you ran GTDB-Tk to get their taxonomic annotations. You want to profile against both the new MAGs and the GTDB database.
- Create a file called taxonomy.tsv as follows:
genome1.fa	d__Archaea;(...);s__My new species name
genome2.fa	d__Bacteria;(...);g__My genus name;s__My species name2
- Use taxonomy.tsv as an argument to sylph-tax taxprof.
## profile against gtdb_r220 and your new MAGs
sylph profile gtdb_r220.syldb my_custom_mags.syldb ... -o gtdb+mags_output.tsv
## use your new taxonomy.tsv file and GTDB_r220
sylph-tax taxprof gtdb+mags_output.tsv -t GTDB_r220 taxonomy.tsv
[!WARNING] For GenBank/RefSeq genomes, filenames have to be dealt with carefully.
If _genomic or _ASM is in your genome file name, use the part before _genomic or _ASM.
So for GCF_002863645.1_ASM286364v1_genomic.fna.gz, use GCF_002863645.1 in column 1.
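The filename rule in this warning can be applied programmatically when building column 1 for many genomes. The following is a minimal sketch of that rule only; `taxonomy_key` is a hypothetical helper name, and it assumes a bare file name (no directory component) is passed in.

```python
def taxonomy_key(filename):
    """Return the column-1 name for a genome file.

    If '_genomic' or '_ASM' occurs in the name, keep only the part before
    the first occurrence; otherwise return the name unchanged.
    """
    positions = [filename.find(marker) for marker in ("_genomic", "_ASM")]
    positions = [p for p in positions if p != -1]
    return filename[: min(positions)] if positions else filename


print(taxonomy_key("GCF_002863645.1_ASM286364v1_genomic.fna.gz"))
print(taxonomy_key("my_mag.fa"))
```

Names without either marker, such as custom MAG files, pass through untouched, so the same function can be mapped over a mixed list of GenBank/RefSeq and custom genomes.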