Preparing a custom database (OLD) - seqan/slimm GitHub Wiki
(THIS PAGE IS UNDER CONSTRUCTION)
You may want to download a custom set of reference genomes and use it for taxonomic profiling with SLIMM. This can be done with the preprocessing sub-module of SLIMM. The steps and examples below produce your preferred reference set together with a corresponding SLIMM database. Note that these steps only prepare a multi-FASTA file containing the reference genomes of your choice; depending on the read mapper you use later, you may also need to index that file.
- Clone the SLIMM git repository:
git clone https://github.com/seqan/slimm.git slimm-src
- Go to the preprocessing subdirectory:
cd slimm-src/preprocessing
- Run
python download_refs.py --help
to see different options for downloading your preferred set of references.
Below are some examples of how to use download_refs.py:
- e.g. 1 To download and prepare an up-to-date database of Archaea and Bacteria, with a single genome per species (-s option):
python download_refs.py -s -wd /Users/dadi/workspace/microbial_references/ -g 'AB'
python merge_files.py /Users/dadi/workspace/microbial_references/
- e.g. 2 To include an additional genome (e.g. human, taxonomic ID 9606) on top of the Archaea and Bacteria genomes:
python download_refs.py -s -wd /Users/dadi/workspace/microbial_references/ -g 'AB' -t "9606"
python merge_files.py /Users/dadi/workspace/microbial_references/
- e.g. 3 To download and prepare a database for a specific set of taxonomic ids (170187,1660,176280,272943,869816,210007):
python download_refs.py -s -wd /Users/dadi/workspace/microbial_references/ -g '' -t 170187,1660,176280,272943,869816,210007
python merge_files.py /Users/dadi/workspace/microbial_references/
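When the list of taxonomic IDs is long or produced by another tool, it can be convenient to assemble the download command programmatically. The following is a minimal sketch and not part of SLIMM: `build_download_command` is a hypothetical helper, the working directory is a placeholder, and only the options shown in the examples above (-s, -wd, -g, -t) are used. It builds and prints the command without running it.

```python
import shlex


def build_download_command(work_dir, groups="", taxids=(), one_per_species=True):
    """Return the argument list for download_refs.py (hypothetical helper).

    Only the options that appear in the examples above are used.
    """
    cmd = ["python", "download_refs.py"]
    if one_per_species:
        cmd.append("-s")  # a single genome per species
    cmd += ["-wd", work_dir, "-g", groups]
    if taxids:
        # download_refs.py is shown taking a comma-separated list after -t
        cmd += ["-t", ",".join(str(t) for t in taxids)]
    return cmd


# Placeholder working directory; replace with your own path.
cmd = build_download_command(
    "/path/to/microbial_references/",
    taxids=[170187, 1660, 176280, 272943, 869816, 210007],
)
# shlex.join quotes arguments (including the empty -g value) for the shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the preprocessing directory, followed by the merge_files.py step as in the examples.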
After running the Python scripts, you will find several files and folders in your target working directory. The two most important ones are:
[AB|AB_CUSTOM|CUSTOM]_refseq_11082017_combined.fna
-> a multi-FASTA file containing all the references downloaded according to the query
slimmDB_11082017
-> a corresponding SLIMM database containing reduced mapping files (names.dmp, nodes.dmp)
The infix 11082017, in both cases, refers to the date of download.
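To pair a combined FASTA file with its database in a script, the prefix and date infix can be read from the file name. This is a rough sketch under the assumption that file names follow exactly the pattern shown above; `parse_combined_name` is a hypothetical helper, not a SLIMM function. The digit order of the date is not stated here, so the infix is kept as a plain string.

```python
import re

# Pattern inferred from the file names shown above:
# <AB|AB_CUSTOM|CUSTOM>_refseq_<8-digit infix>_combined.fna
COMBINED_RE = re.compile(r"^(AB_CUSTOM|AB|CUSTOM)_refseq_(\d{8})_combined\.fna$")


def parse_combined_name(filename):
    """Return (prefix, infix) for a combined FASTA name, or None if no match."""
    match = COMBINED_RE.match(filename)
    if match is None:
        return None
    return match.group(1), match.group(2)


prefix, infix = parse_combined_name("AB_CUSTOM_refseq_11082017_combined.fna")
# The matching database name shares the same infix.
database_name = "slimmDB_" + infix
print(prefix, infix, database_name)
```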
The next step is to index the combined multi-FASTA file (e.g. AB_CUSTOM_refseq_11082017_combined.fna) according to the read mapper of your choice.
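Before indexing, a quick sanity check of the merged file can catch an empty download or repeated sequence IDs. The sketch below is an illustration only and not part of SLIMM; `summarize_fasta` is a hypothetical helper that counts header lines (those starting with ">") and reports IDs that occur more than once. It accepts any iterable of lines, so an open file handle works as well as the in-memory example used here.

```python
def summarize_fasta(lines):
    """Count FASTA records and collect sequence IDs that appear more than once."""
    count = 0
    seen = set()
    duplicates = set()
    for line in lines:
        if line.startswith(">"):
            count += 1
            fields = line[1:].split()
            # The ID is the first whitespace-separated token of the header.
            seq_id = fields[0] if fields else ""
            if seq_id in seen:
                duplicates.add(seq_id)
            seen.add(seq_id)
    return count, sorted(duplicates)


# Small in-memory example; with a real file use:
#   with open(path_to_combined_fna) as handle:
#       summarize_fasta(handle)
example = [">seq1 first genome", "ACGT", ">seq2 second genome", "GGCC", ">seq1 repeated", "TTAA"]
count, duplicates = summarize_fasta(example)
print(count, duplicates)
```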