Preparing a custom database (OLD) - seqan/slimm GitHub Wiki

(THIS PAGE IS UNDER CONSTRUCTION)

You may want to download a custom set of reference genomes and use it for taxonomic profiling with SLIMM. This can be done with the preprocessing sub-module of SLIMM. The steps and examples below produce your preferred reference set together with a corresponding SLIMM database. Note that the scripts only prepare a multi-FASTA file containing the reference genomes of your choice; depending on the read mapper you use later, you may still need to index that file.

Clone the SLIMM git repository:

    git clone https://github.com/seqan/slimm.git slimm-src

Go to the preprocessing subdirectory:

    cd slimm-src/preprocessing

Run python download_refs.py --help to see the available options for downloading your preferred set of references.

Below are some examples of how to use download_refs.py:

e.g. 1 To download and prepare an up-to-date database of Archaea and Bacteria with a single genome per species (-s option):

    python download_refs.py -s -wd /Users/dadi/workspace/microbial_references/ -g 'AB'
    python merge_files.py /Users/dadi/workspace/microbial_references/

e.g. 2 To include another genome, e.g. the human genome (taxonomic id 9606), in addition to the Archaea and Bacteria genomes:

    python download_refs.py -s -wd /Users/dadi/workspace/microbial_references/ -g 'AB' -t "9606"
    python merge_files.py /Users/dadi/workspace/microbial_references/

e.g. 3 To download and prepare a database for a specific set of taxonomic ids (170187,1660,176280,272943,869816,210007):

    python download_refs.py -s -wd /Users/dadi/workspace/microbial_references/ -g '' -t 170187,1660,176280,272943,869816,210007
    python merge_files.py /Users/dadi/workspace/microbial_references/
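The three examples differ only in the group string and the list of taxonomic ids, so the two commands can be assembled programmatically. The following is a minimal sketch, not part of SLIMM: the helper names `build_commands` and `run_all` are hypothetical, and it uses only the options that appear in the examples above (-s, -wd, -g, -t). Check `python download_refs.py --help` for the authoritative list.

```python
import subprocess
import sys


def build_commands(work_dir, groups="AB", tax_ids=(), one_per_species=True):
    """Return the argument lists for the download step and the merge step.

    Hypothetical helper: it only mirrors the flags shown in the examples.
    """
    download = [sys.executable, "download_refs.py"]
    if one_per_species:
        download.append("-s")  # a single genome per species
    download += ["-wd", str(work_dir), "-g", groups]
    if tax_ids:
        # Taxonomic ids are passed as one comma-separated value.
        download += ["-t", ",".join(str(t) for t in tax_ids)]
    merge = [sys.executable, "merge_files.py", str(work_dir)]
    return [download, merge]


def run_all(commands, script_dir="."):
    """Run the commands in order from script_dir, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=script_dir, check=True)


# Example (assumes the current directory is slimm-src/preprocessing):
# run_all(build_commands("/path/to/microbial_references/", "AB", [9606]))
```

Passing the arguments as a list avoids shell quoting issues, including the empty group string used in example 3.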

After running the Python scripts, you will find several files and folders in your working directory. The two most important ones are:

[AB|AB_CUSTOM|CUSTOM]_refseq_11082017_combined.fna -> a multi-FASTA file containing all the references downloaded for the query
slimmDB_11082017 -> the corresponding SLIMM database containing reduced mapping files (names.dmp, nodes.dmp)

The infix 11082017, in both cases, refers to the date of download.
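Because the output names follow a fixed pattern, a short script can locate the combined FASTA file and its matching database. This sketch assumes the naming scheme is exactly as shown above; the function names are hypothetical. The page does not say whether the infix is day-month or month-day, so it is kept as an opaque string rather than parsed into a date.

```python
import re
from pathlib import Path

# Pattern taken from the file name shown above; the 8-digit infix is the download date.
COMBINED_RE = re.compile(r"^(AB|AB_CUSTOM|CUSTOM)_refseq_(\d{8})_combined\.fna$")


def parse_combined_name(name):
    """Return (prefix, infix) for a combined FASTA file name, or None if it does not match."""
    match = COMBINED_RE.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


def find_outputs(work_dir):
    """List (fasta_path, database_path) pairs found directly under work_dir."""
    pairs = []
    for path in sorted(Path(work_dir).iterdir()):
        parsed = parse_combined_name(path.name)
        if parsed is not None:
            _, infix = parsed
            pairs.append((path, path.parent / f"slimmDB_{infix}"))
    return pairs
```

`find_outputs` only constructs the expected database path from the infix; it does not check that the database exists.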

The next step is to index the combined multi-FASTA file (e.g. AB_CUSTOM_refseq_11082017_combined.fna) with the read mapper of your choice.
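Before spending time on indexing, it can be worth checking that the combined file contains the references you expect. The sketch below is not part of SLIMM; it counts FASTA records (header lines starting with ">") and sequence characters, and `summarize_fasta` is a hypothetical helper name.

```python
def summarize_fasta(lines):
    """Return (number_of_records, number_of_sequence_characters) for FASTA lines.

    `lines` can be an open file handle or any iterable of strings.
    """
    records = 0
    residues = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records += 1  # each header line starts a new reference sequence
        else:
            residues += len(line)
    return records, residues


# Example usage on the combined file (the path is illustrative):
# with open("AB_CUSTOM_refseq_11082017_combined.fna") as handle:
#     print(summarize_fasta(handle))
```

The file is read line by line, so memory use stays small even for large reference sets.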