Preparing a custom database - seqan/slimm GitHub Wiki

You might want to download a custom set of reference genomes and use it for taxonomic profiling with SLIMM. For that, you need a corresponding SLIMM database file, which can be obtained via the slimm_build program.

CASE 1: you have your own set of reference genomes as a FASTA file.

Let's assume you have a multi-FASTA file, custom_refs.fna, that contains your set of reference genomes.

  1. Download the nodes.dmp and names.dmp taxonomy files from NCBI and extract them into a taxdump directory
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
    mkdir -p taxdump
    tar -xzvf taxdump.tar.gz -C taxdump
  2. Download the accession2taxid files from NCBI
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/{dead_nucl,nucl_wgs,nucl_gb}.accession2taxid.gz
    gunzip {dead_nucl,nucl_wgs,nucl_gb}.accession2taxid.gz
  3. Use slimm_build to build your SLIMM database
    ./bin/slimm_build -v -b 10000000 -nm taxdump/names.dmp -nd taxdump/nodes.dmp -o slimm_db_custom.sldb custom_refs.fna *.accession2taxid
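
Before running the build, it can be useful to check that every sequence in custom_refs.fna has a row in the accession2taxid tables. The sketch below is not part of SLIMM: missing_accessions is a hypothetical helper. It assumes that each FASTA header starts with an accession.version identifier (for example `>NC_000913.3 ...`) and that the tables are uncompressed, tab-separated NCBI files with accession.version in the second column.

```shell
#!/bin/sh
# Print FASTA sequence IDs that have no row in the given accession2taxid tables.
# Usage: missing_accessions custom_refs.fna nucl_gb.accession2taxid [more tables...]
missing_accessions() {
    fasta=$1
    shift
    # Read the (small) FASTA first and remember its IDs, then stream the
    # (large) tables and drop every ID that is found in column 2.
    awk -F'\t' -v fasta="$fasta" '
        FILENAME == fasta {
            if (/^>/) {
                id = substr($0, 2)          # strip the leading ">"
                sub(/[ \t].*/, "", id)      # keep only the first token
                want[id] = 1
            }
            next
        }
        ($2 in want) { delete want[$2] }
        END { for (id in want) print id }   # whatever is left was never matched
    ' "$fasta" "$@" | sort
}
```

Calling `missing_accessions custom_refs.fna *.accession2taxid` prints one unmatched ID per line, and prints nothing when every sequence was found. Only the FASTA IDs are held in memory, so the size of the tables should not matter much.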

CASE 2: you only have a SAM/BAM file and you don't know the reference genomes

You can create a dummy representative FASTA file for the reference genomes that were used to produce your SAM/BAM file. For example, if you have a file named SRR_0921301.bam, you can use the command below to get a toy reference FASTA file in which every reference name from the header is paired with a placeholder sequence.

    samtools view -H SRR_0921301.bam | awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print ">" substr($i, 4) "\nACGT" }' > SRR_0921301_references.fna

Afterwards, you can follow the steps in CASE 1 to build your SLIMM database from SRR_0921301_references.fna.
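
To see what the header-to-FASTA conversion produces without needing a BAM file, here is a minimal sketch that parses `@SQ` lines by their tab-separated fields. The function name sam_header_to_fasta and the reference names ref_one and ref_two are made up for illustration; the `printf` line stands in for the output of `samtools view -H`.

```shell
#!/bin/sh
# Read a SAM header on stdin and write one placeholder FASTA record per @SQ line.
sam_header_to_fasta() {
    awk -F'\t' '
        $1 == "@SQ" {
            # The SN: tag holds the reference sequence name.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/)
                    print ">" substr($i, 4) "\nACGT"   # dummy sequence
        }
    '
}

# Hypothetical two-reference header, used here instead of a real BAM file.
printf '@HD\tVN:1.6\n@SQ\tSN:ref_one\tLN:1000\n@SQ\tSN:ref_two\tLN:2000\n' |
    sam_header_to_fasta
```

With a real file, the same function can be fed directly: `samtools view -H SRR_0921301.bam | sam_header_to_fasta > SRR_0921301_references.fna`. Matching only lines whose first field is `@SQ` keeps other header records, such as `@PG` command lines that happen to contain `SN:`, out of the result.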