Incorporating taxonomic information into sylph with sylph-tax

[!NOTE] This manual uses sylph-tax, which replaces the old sylph-utils program for taxonomy integration. The old manual for sylph-utils is available here.

Sylph's TSV output contains no taxonomic information. The sylph-tax program converts that output into a taxonomic profile with taxonomic annotations.

How to generate taxonomic profiles using sylph-tax

See the sylph-tax repository for more information. For a quick start:

conda install -c bioconda sylph-tax

# download taxonomies
sylph-tax download --download-to /any/location

# profiling with GTDB-r220
sylph profile gtdb-r220-c200-dbv1.syldb ... -o sylph_results/out.tsv

# incorporate GTDB-r220 taxonomy into sylph's results
sylph-tax taxprof sylph_results/*.tsv -t GTDB_r220 

ls *.sylphmpa

`.sylphmpa` taxonomic profiling output format

*.sylphmpa files look like this:

#SampleID       /home/jshaw/projects/temp/amr/short_reads/SRR14739086_1.fastq.gz        Taxonomies_used:['GTDB_r220']
clade_name      relative_abundance      sequence_abundance      ANI (if strain-level)    Coverage (if strain-level)
d__Bacteria     100.00010000000003      100.00019999999996      NA      NA
d__Bacteria|p__Pseudomonadota   100.00010000000003      100.00019999999996      NA      NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria    100.00010000000003      100.00019999999996      NA      NA
d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales        35.6384 36.0603 NA      NA
....

[!TIP] This is a valid TSV file, but rows prefixed with # are comments. You can read .sylphmpa files with pandas in python like pd.read_csv('output.sylphmpa',sep='\t', comment='#').

There are five important columns:

clade_name: A string like d__Bacteria|p__Actinomycetota|c__Acidimicrobiia|o__Acidimicrobiales|f__Ilumatobacteraceae that describes the clade. t__STRAIN represents the exact genome identifier.
relative_abundance: the taxonomic relative abundance of the clade
sequence_abundance: the sequence abundance of the clade, i.e. the % of reads assigned
ANI: NA except at the strain level (t__STRAIN), where it is sylph's ANI estimate.
Coverage: NA except at the strain level, where it is the Eff_cov or True_cov column of sylph's output.
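
The column layout described above can be explored without extra dependencies. The following is a minimal sketch using only Python's standard library. The helper names (`read_sylphmpa`, `deepest_rank`, `rows_at_rank`) and the shortened sample path are illustrative and not part of sylph-tax. It assumes each clade_name level is a rank prefix followed by two underscores, as in the sample output.

```python
import csv
import io

# Sample rows copied from the example output; the sample path is a placeholder.
EXAMPLE = (
    "#SampleID\tsample_1.fastq.gz\tTaxonomies_used:['GTDB_r220']\n"
    "clade_name\trelative_abundance\tsequence_abundance\t"
    "ANI (if strain-level)\tCoverage (if strain-level)\n"
    "d__Bacteria\t100.00010000000003\t100.00019999999996\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota\t100.00010000000003\t100.00019999999996\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria\t"
    "100.00010000000003\t100.00019999999996\tNA\tNA\n"
    "d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales\t"
    "35.6384\t36.0603\tNA\tNA\n"
)


def read_sylphmpa(handle):
    """Return the rows of a .sylphmpa file as dicts keyed by column name."""
    # Lines starting with '#' are comments; the first remaining line is the header.
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


def deepest_rank(clade_name):
    """Return the rank prefix of the last non-empty level, e.g. 'o' or 't'."""
    levels = [level for level in clade_name.split("|") if level]
    return levels[-1].split("__", 1)[0]


def rows_at_rank(rows, rank):
    """Keep only the rows whose clade ends at the given rank prefix."""
    return [row for row in rows if deepest_rank(row["clade_name"]) == rank]


rows = read_sylphmpa(io.StringIO(EXAMPLE))
for row in rows_at_rank(rows, "o"):
    print(row["clade_name"], float(row["relative_abundance"]))
```

To read a real file, pass an open file handle (for example `open('output.sylphmpa')`) instead of the `io.StringIO` wrapper. Empty levels, which appear in some viral lineages, are skipped when the deepest rank is determined.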

[!TIP] Viral-host information is available for IMG/VR 4.1. The -a option adds a new column to the .sylphmpa files that associates viral genomes with their hosts. For example: r__Duplodnaviria|k__Heunggongvirae|p__Uroviricota|c__Caudoviricetes|||||t__IMGVR_UViG_2503982007_000001 ... d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis

where the host of IMGVR_UViG_2503982007_000001 is Staphylococcus epidermidis.

Creating custom taxonomies

If you're working with custom sylph databases, you can easily create your own taxonomy metadata file. You can look at our pre-built taxonomy files (https://zenodo.org/records/14320496) for examples.

A taxonomic metadata file is simply a two-column TSV file:

Column 1: the name of your genome's FASTA file:
- my_mag.fa
Column 2: a semicolon-delimited taxonomy string.
- d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus

Note: do not add a t__STRAIN level to the taxonomy string.
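
As a rough illustration of the two-column format, the sketch below writes such a file from a Python dictionary. The function name `write_taxonomy` is hypothetical, and the check for a `t__` level is only one way to honour the note about strain-level entries; sylph-tax itself may validate its input differently.

```python
import io


def write_taxonomy(genome_to_lineage, handle):
    """Write one 'FASTA name <tab> taxonomy string' line per genome."""
    for fasta_name, lineage in genome_to_lineage.items():
        # The taxonomy string must stop at species level: reject any t__ entry.
        if any(level.startswith("t__") for level in lineage.split(";")):
            raise ValueError(f"{fasta_name}: remove the t__ level from {lineage!r}")
        handle.write(f"{fasta_name}\t{lineage}\n")


# Example entry taken from the column descriptions.
lineages = {
    "my_mag.fa": (
        "d__Archaea;p__Methanobacteriota_B;c__Thermococci;o__Thermococcales;"
        "f__Thermococcaceae;g__Thermococcus_A;s__Thermococcus_A alcaliphilus"
    ),
}

buffer = io.StringIO()
write_taxonomy(lineages, buffer)
print(buffer.getvalue(), end="")
```

To produce an actual file, replace the `io.StringIO` buffer with something like `open('taxonomy.tsv', 'w')`.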

Custom taxonomy example usage case

You obtained two new MAGs, genome1.fa and genome2.fa, and ran GTDB-Tk to get their taxonomic annotations. You want to profile against both the new MAGs and the GTDB database.

Create a tab-separated file called taxonomy.tsv as follows:

genome1.fa	d__Archaea;(...);s__My new species name
genome2.fa	d__Bacteria;(...);g__My genus name;s__My species name2

Then pass taxonomy.tsv to the -t option of sylph-tax taxprof, alongside GTDB_r220.

## profile against gtdb_r220 and your new MAGs
sylph profile gtdb_r220.syldb my_custom_mags.syldb ... -o gtdb+mags_output.tsv

## use your new taxonomy.tsv file and GTDB_r220
sylph-tax taxprof gtdb+mags_output.tsv -t GTDB_r220 taxonomy.tsv

[!WARNING] For GenBank/RefSeq genomes, file names need special handling.

If your genome file name contains _genomic or _ASM, use only the part before _genomic or _ASM.

So for GCF_002863645.1_ASM286364v1_genomic.fna.gz, use GCF_002863645.1 in column 1.
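
The renaming rule above can be sketched as a small helper. The function name `taxonomy_key` is hypothetical, and the sketch assumes the rule is exactly "cut at the first occurrence of _genomic or _ASM"; any other file name is passed through unchanged.

```python
def taxonomy_key(filename):
    """Return the name to put in column 1 of the taxonomy file."""
    markers = ("_genomic", "_ASM")
    # Positions of every marker that occurs in the file name.
    cut_points = [filename.find(m) for m in markers if m in filename]
    # Cut at the earliest marker; keep the name as-is if none is present.
    return filename[:min(cut_points)] if cut_points else filename


print(taxonomy_key("GCF_002863645.1_ASM286364v1_genomic.fna.gz"))  # GCF_002863645.1
print(taxonomy_key("my_mag.fa"))  # my_mag.fa
```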

Creating taxonomy metadata from RefSeq?

See this discussion thread.