2021_Advanced use of k mers for analyzing skimming data - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Intro

This tutorial walks you through more advanced topics related to skimming.

Slides: Slides from the talk are here
Videos: on YouTube

Tools:

Main tools:

Other tools we will use:

Papers:

MISA:
- M. Balaban, and S. Mirarab. “Phylogenetic Double Placement of Mixed Samples.” Bioinformatics Vol. 36, no. Supplement_1 (2020): pp. i335–43. doi:10.1093/bioinformatics/btaa489.
RESPECT:
- Sarmashghi, S., Balaban, M., Rachtman, E., Touri, B., Mirarab, S., & Bafna, V. (2021). Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. BioRxiv, 2021.01.28.428636. doi:10.1101/2021.01.28.428636
Contamination
- E. Rachtman, M. Balaban, V. Bafna, and S. Mirarab. “The Impact of Contaminants on the Accuracy of Genome Skimming and the Effectiveness of Exclusion Read Filters.” Molecular Ecology Resources Vol. 20, no. 3 (2020): pp. 1755-0998.13135. doi:10.1111/1755-0998.13135.
- Rachtman, E., Bafna, V., & Mirarab, S. (2021). CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics, 3(3), 10.1101/2021.03.18.436035. doi:10.1093/nargab/lqab071

Compute node,

Before you start, create an interactive session on a node.

## Make an interactive session
srun --ntasks=1 --mem-per-cpu=30G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i

1. Using MISA for mixed genome skim analyses

Installation

We installed MISA on day one. Please refer back to those instructions. We will simply re-activate our installed tools.

## Activate Conda
conda activate /cluster/projects/nn9458k/oh_know/.conda/skmer

## Load modules

### Load FastME
module load FastME/2.1.6.2-GCC-10.2.0

export PATH=$PATH:/cluster/projects/nn9458k/oh_know/teachers//bin

Prepare the query and distances.

We will now place the mixed individual (a known hybrid called Saccharomyces pastorianus) onto the tree using a double-placement tool MISA.

cd $USERWORK
cd skmer-tutorial
mkdir mix-query
cp genomes/Saccharomyces_pastorianus/GCA_001515485.2_Saccharomyces_pastorianus_Weihenstephan_34_70_chromosomes_assembly_1.0_genomic.fna mix-query/Saccharomyces_pastorianus.fna

These are the real constituents of Saccharomyces pastorianus.

cat genomes/Saccharomyces_pastorianus/things.txt

Recall that yesterday, we used -a to add Saccharomyces cerevisiae to the reference set. Let us first infer a backbone tree that includes Saccharomyces cerevisiae.

# Update the distance matrix to include the added species Saccharomyces cerevisiae
skmer distance -t library/

# Build the full tree with included
tsv_to_phymat.sh ref-dist-mat.txt  ref-dist-mat-full.phy
fastme -i ref-dist-mat-full.phy -o full.tre

Start by computing distances from the mixed query to the references.

# Run Skmer
skmer query -t mix-query/Saccharomyces_pastorianus.fna library/
# Convert output to .tsv file
convert_to_tsv.sh dist-saccharomyces_pastorianus.txt > dist-saccharomyces_pastorianus.tsv

Ignoring mixtures:

Now, place the sample onto the tree, ignoring that it is a mixture.

run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_pastorianus.tsv -o pastorianus-single.jplace
guppy tog pastorianus-single.jplace
nw_display pastorianus-single.tog.tre

Placement of mixed samples with both constituents present

Let's jump to MISA runs.

# Run MISA for phylogenetic double placemet
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t full.tre -o mixed-output-present.jplace


# Check the output versus correct mixture:
guppy tog mixed-output-present.jplace
nw_display full.tre
nw_display mixed-output-present.tog.tre

Placement of mixed samples with one constituent missing

Now, let's try the double-placement when one of the constituents is missing from the backbone.

nw_display backbone-fastme.tre

# Run MISA for phylogenetic double placemet
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t backbone-fastme.tre -o mixed-output.jplace


# Check the output versus correct mixture:
guppy tog mixed-output.jplace
cat genomes/Saccharomyces_pastorianus/things.txt
nw_display backbone-fastme.tre
nw_display mixed-output.tog.tre

You will see the following beautiful result. As you can see, MISA correctly identified the two parent species of Saccharomyces pastorianus.

Top: the full reference tree before removing Saccharomyces cerevisiae. The Two blue branches are known constituents of Saccharomyces pastorianus.
Bottom: Results of placement of Saccharomyces pastorianus on the tree after removing Saccharomyces cerevisiae.

Placement of mixed samples with both constituents missing

nw_prune backbone-fastme.tre Saccharomyces_eubayanus > backbone-noconst.tre
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t backbone-noconst.tre -o mixed-output-noconst.jplace
guppy tog mixed-output-noconst.jplace
nw_display mixed-output-noconst.tog.tre

2. Repeat Spectra and better coverage estimates

We will now use RESPECT to compute the repeat structure of the genomes.

Installation

module load Jellyfish/2.3.0-GCC-9.3.0
# Note, you may need to swap GCC. Go down the rabbit hole

module load seqtk/1.3-GCC-9.3.0

python3.9 -m pip install gurobipy
# after you install gurobipy package using pip as described above, you need to set up the license; they have free academic license; see https://www.gurobi.com/documentation/9.1/quickstart_mac/creating_a_new_academic_li.html#subsection:createacademiclicense

git clone https://github.com/shahab-sarmashghi/RESPECT.git
cd RESPECT
python3.9 setup.py install
cd ..

Running RESPECT

Prepare input

Make sure you are under the skmer-tutorial directory.

mkdir respect 
cd respect
mkdir inputs
cd inputs
## link to both skims and genomes here
ln -s ../../skims/nonhybrids/Saccharomyces_*fq ../../skmer-tutorial/nonhybrids/*fna .
cd ..

Now run the method:

### There is some ugliness having to do with licenses gurobi. We cannot just run respect. I will discuss during the tutorial. 
### /cluster/projects/nn9458k/oh
_know/teachers/smirarab/respect/gurobi912/linux64/bin/grbgetkey
### export GRB_LICENSE_FILE=`pwd`/gurobi.lic
respect -d inputs/ -N 100

This will take a minute or two.

Look at the summary of results and note the high level of accuracy comparing to genomes

### Examine genome length estimates
column -t estimated-parameters.txt|sort -k1

### And examine repeat spectra
cat estimated-spectra.txt |sort -k1|column -t
cd ..

3. Contamination Filtering

There are two ways to remove contamination:

inclusion filters when you know what you are looking for and you have a reference genome.
- Here, you can use any tool, like BLAST, bowtie, etc.
exclusion filters when you do not know what to exactly look for but you know what you do not want (e.g., bacteria)
- We typically use one of two tools:
  - CONSULT
  - Kraken-II
- Both are computationally intensive.

Exclusion filters

We have bacterial/archaeal libraries available for both CONSULT and Kraken.

GTDB is the most comprehensive library.
- Consult DB is here
- Kraken-II DB is here.
All links to all of our reference libraries are available on our raw data github for CONSULT.

Instruction for CONSULT

Instruction for CONSULT (you can go straight to query):

Installation

CONSULT is available under /cluster/projects/nn9458k/oh_know/teachers/bin/consult_search.

To install and compile search command on your machine:

git clone https://github.com/noraracht/CONSULT.git
cd CONSULT/
g++ main_search.cpp -std=c++11 -fopenmp -O3 -o consult_search

But CONSULT requires ~120GB of memory. We will need to stop our interactive session and start a new one.

srun --ntasks=1 --cpus-per-task=8  --mem-per-cpu=16G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i

Query

To query sequence reads against reference database we ran

mkdir consult
cd consult/

# Let's use as query a bunch of Drosophila genome skims (already on cluster) 
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/Drosophila/ .

# I have already copied the reference dataset; let's link to it
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb/ .

consult_search -i all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb -c 1 -t 2 -q Drosophila/ 2>&1 |tee consult.log &

This runs for 5-10 minutes. While it runs, we can monitor it a bit:

top -u smirarab
tail -f consult.log
watch -n 10 wc -l ucseq_* Drosophila/*

After it finishes, now inspect the results.

less consult.log

wc -l ucseq_* Drosophila/*

Here are the arguments to CONSULT:

-i - the name of the reference database
-c - the lowest number of k-mers required to mark sequencing read as classified. For instance, if at least one k-mer match is enough to classify a read, "c" should be set to 1. If at least two k-mer matches are required to call read a match, "c" should be set to 2.
-t - number of threads
-q - the name of the folder where queries are located

Note:

CONSULT is a bit less of a professional-looking tool at the moment.
We will improve it.

Instruction for kraken:

We suggest using the default value for alpha option which is 0. This recommendation is based on our empirical findings from a previous paper.

To query kraken DB we use:

kraken2 --use-names --threads 24 --report REPORT_FILE_NAME --db DATABASE_NAME --confidence alpha --classified-out CLASSIFIED_FASTQ_FILE --unclassified-out UNCLASSIFIED_FASTQ_FILE QUERY_FASTQ_FILE > KRAKEN_OUTPUT_FILE