Contamination in skimming data (CONSULT) - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

3. Contamination Filtering

There are two ways to remove contamination:

inclusion filters when you know what you are looking for and you have a reference genome.
- Here, you can use any tool, like BLAST, bowtie, etc.
exclusion filters when you do not know what to exactly look for but you know what you do not want (e.g., bacteria)
- We typically use one of two tools:
  - CONSULT
  - Kraken-II
- Both are computationally intensive.

Exclusion filters

We have bacterial/archaeal libraries available for both CONSULT and Kraken.

GTDB is the most comprehensive library.
- Consult DB is here
- Kraken-II DB is here.
All links to all of our reference libraries are available on our raw data github for CONSULT.

Query

To query sequence reads against reference database we ran

mkdir consult
cd consult/

# Let's use as query a bunch of Drosophila genome skims (already on cluster) 
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/Drosophila/ .

# I have already copied the reference dataset; let's link to it
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb/ .

consult_search -i all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb -c 1 -t 2 -q Drosophila/ 2>&1 |tee consult.log &

This runs for 5-10 minutes. While it runs, we can monitor it a bit:

top -u smirarab
tail -f consult.log
watch -n 10 wc -l ucseq_* Drosophila/*

After it finishes, now inspect the results.

less consult.log

wc -l ucseq_* Drosophila/*

Here are the arguments to CONSULT:

-i - the name of the reference database
-c - the lowest number of k-mers required to mark sequencing read as classified. For instance, if at least one k-mer match is enough to classify a read, "c" should be set to 1. If at least two k-mer matches are required to call read a match, "c" should be set to 2.
-t - number of threads
-q - the name of the folder where queries are located

Note:

CONSULT is a bit less of a professional-looking tool at the moment.
We will improve it.

Instruction for kraken:

We suggest using the default value for alpha option which is 0. This recommendation is based on our empirical findings from a previous paper.

To query kraken DB we use:

kraken2 --use-names --threads 24 --report REPORT_FILE_NAME --db DATABASE_NAME --confidence alpha --classified-out CLASSIFIED_FASTQ_FILE --unclassified-out UNCLASSIFIED_FASTQ_FILE QUERY_FASTQ_FILE > KRAKEN_OUTPUT_FILE