2021_Advanced use of k mers for analyzing skimming data - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Intro

This tutorial walks you through more advanced topics related to skimming.

  • Slides: Slides from the talk are here
  • Videos: on YouTube

Tools:

Main tools:

Other tools we will use:

Papers:

  • MISA:
    • M. Balaban, and S. Mirarab. “Phylogenetic Double Placement of Mixed Samples.” Bioinformatics Vol. 36, no. Supplement_1 (2020): pp. i335–43. doi:10.1093/bioinformatics/btaa489.
  • RESPECT:
    • Sarmashghi, S., Balaban, M., Rachtman, E., Touri, B., Mirarab, S., & Bafna, V. (2021). Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. BioRxiv, 2021.01.28.428636. doi:10.1101/2021.01.28.428636
  • Contamination
    • E. Rachtman, M. Balaban, V. Bafna, and S. Mirarab. “The Impact of Contaminants on the Accuracy of Genome Skimming and the Effectiveness of Exclusion Read Filters.” Molecular Ecology Resources Vol. 20, no. 3 (2020): pp. 1755-0998.13135. doi:10.1111/1755-0998.13135.
    • Rachtman, E., Bafna, V., & Mirarab, S. (2021). CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics, 3(3), 10.1101/2021.03.18.436035. doi:10.1093/nargab/lqab071

Compute node,

Before you start, create an interactive session on a node.

## Make an interactive session
srun --ntasks=1 --mem-per-cpu=30G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i

1. Using MISA for mixed genome skim analyses

Installation

We installed MISA on day one. Please refer back to those instructions. We will simply re-activate our installed tools.

## Activate Conda
conda activate /cluster/projects/nn9458k/oh_know/.conda/skmer

## Load modules

### Load FastME
module load FastME/2.1.6.2-GCC-10.2.0

export PATH=$PATH:/cluster/projects/nn9458k/oh_know/teachers//bin

Prepare the query and distances.

We will now place the mixed individual (a known hybrid called Saccharomyces pastorianus) onto the tree using a double-placement tool MISA.

cd $USERWORK
cd skmer-tutorial
mkdir mix-query
cp genomes/Saccharomyces_pastorianus/GCA_001515485.2_Saccharomyces_pastorianus_Weihenstephan_34_70_chromosomes_assembly_1.0_genomic.fna mix-query/Saccharomyces_pastorianus.fna

These are the real constituents of Saccharomyces pastorianus.

cat genomes/Saccharomyces_pastorianus/things.txt

Recall that yesterday, we used -a to add Saccharomyces cerevisiae to the reference set. Let us first infer a backbone tree that includes Saccharomyces cerevisiae.

# Update the distance matrix to include the added species Saccharomyces cerevisiae
skmer distance -t library/

# Build the full tree with included
tsv_to_phymat.sh ref-dist-mat.txt  ref-dist-mat-full.phy
fastme -i ref-dist-mat-full.phy -o full.tre

Start by computing distances from the mixed query to the references.

# Run Skmer
skmer query -t mix-query/Saccharomyces_pastorianus.fna library/
# Convert output to .tsv file
convert_to_tsv.sh dist-saccharomyces_pastorianus.txt > dist-saccharomyces_pastorianus.tsv

Ignoring mixtures:

Now, place the sample onto the tree, ignoring that it is a mixture.

run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_pastorianus.tsv -o pastorianus-single.jplace
guppy tog pastorianus-single.jplace
nw_display pastorianus-single.tog.tre

Placement of mixed samples with both constituents present

Let's jump to MISA runs.

# Run MISA for phylogenetic double placemet
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t full.tre -o mixed-output-present.jplace


# Check the output versus correct mixture:
guppy tog mixed-output-present.jplace
nw_display full.tre
nw_display mixed-output-present.tog.tre

Placement of mixed samples with one constituent missing

Now, let's try the double-placement when one of the constituents is missing from the backbone.

nw_display backbone-fastme.tre

# Run MISA for phylogenetic double placemet
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t backbone-fastme.tre -o mixed-output.jplace


# Check the output versus correct mixture:
guppy tog mixed-output.jplace
cat genomes/Saccharomyces_pastorianus/things.txt
nw_display backbone-fastme.tre
nw_display mixed-output.tog.tre

You will see the following beautiful result. As you can see, MISA correctly identified the two parent species of Saccharomyces pastorianus.

  • Top: the full reference tree before removing Saccharomyces cerevisiae. The Two blue branches are known constituents of Saccharomyces pastorianus.
  • Bottom: Results of placement of Saccharomyces pastorianus on the tree after removing Saccharomyces cerevisiae.

Placement of mixed samples with both constituents missing

nw_prune backbone-fastme.tre Saccharomyces_eubayanus > backbone-noconst.tre
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t backbone-noconst.tre -o mixed-output-noconst.jplace
guppy tog mixed-output-noconst.jplace
nw_display mixed-output-noconst.tog.tre

2. Repeat Spectra and better coverage estimates

We will now use RESPECT to compute the repeat structure of the genomes.

Installation

module load Jellyfish/2.3.0-GCC-9.3.0
# Note, you may need to swap GCC. Go down the rabbit hole

module load seqtk/1.3-GCC-9.3.0

python3.9 -m pip install gurobipy
# after you install gurobipy package using pip as described above, you need to set up the license; they have free academic license; see https://www.gurobi.com/documentation/9.1/quickstart_mac/creating_a_new_academic_li.html#subsection:createacademiclicense

git clone https://github.com/shahab-sarmashghi/RESPECT.git
cd RESPECT
python3.9 setup.py install
cd ..

Running RESPECT

Prepare input

Make sure you are under the skmer-tutorial directory.

mkdir respect 
cd respect
mkdir inputs
cd inputs
## link to both skims and genomes here
ln -s ../../skims/nonhybrids/Saccharomyces_*fq ../../skmer-tutorial/nonhybrids/*fna .
cd ..

Now run the method:

### There is some ugliness having to do with licenses gurobi. We cannot just run respect. I will discuss during the tutorial. 
### /cluster/projects/nn9458k/oh
_know/teachers/smirarab/respect/gurobi912/linux64/bin/grbgetkey
### export GRB_LICENSE_FILE=`pwd`/gurobi.lic
respect -d inputs/ -N 100

This will take a minute or two.

Look at the summary of results and note the high level of accuracy comparing to genomes

### Examine genome length estimates
column -t estimated-parameters.txt|sort -k1

### And examine repeat spectra
cat estimated-spectra.txt |sort -k1|column -t
cd ..

3. Contamination Filtering

There are two ways to remove contamination:

  • inclusion filters when you know what you are looking for and you have a reference genome.
    • Here, you can use any tool, like BLAST, bowtie, etc.
  • exclusion filters when you do not know what to exactly look for but you know what you do not want (e.g., bacteria)
    • We typically use one of two tools:
    • Both are computationally intensive.

Exclusion filters

We have bacterial/archaeal libraries available for both CONSULT and Kraken.

  • GTDB is the most comprehensive library.
    • Consult DB is here
    • Kraken-II DB is here.
  • All links to all of our reference libraries are available on our raw data github for CONSULT.

Instruction for CONSULT

Instruction for CONSULT (you can go straight to query):

Installation

CONSULT is available under /cluster/projects/nn9458k/oh_know/teachers/bin/consult_search.

To install and compile search command on your machine:

git clone https://github.com/noraracht/CONSULT.git
cd CONSULT/
g++ main_search.cpp -std=c++11 -fopenmp -O3 -o consult_search

But CONSULT requires ~120GB of memory. We will need to stop our interactive session and start a new one.

srun --ntasks=1 --cpus-per-task=8  --mem-per-cpu=16G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i

Query

To query sequence reads against reference database we ran

mkdir consult
cd consult/

# Let's use as query a bunch of Drosophila genome skims (already on cluster) 
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/Drosophila/ .

# I have already copied the reference dataset; let's link to it
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb/ .

consult_search -i all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb -c 1 -t 2 -q Drosophila/ 2>&1 |tee consult.log &

This runs for 5-10 minutes. While it runs, we can monitor it a bit:

top -u smirarab
tail -f consult.log
watch -n 10 wc -l ucseq_* Drosophila/*

After it finishes, now inspect the results.

less consult.log

wc -l ucseq_* Drosophila/*

Here are the arguments to CONSULT:

  • -i - the name of the reference database
  • -c - the lowest number of k-mers required to mark sequencing read as classified. For instance, if at least one k-mer match is enough to classify a read, "c" should be set to 1. If at least two k-mer matches are required to call read a match, "c" should be set to 2.
  • -t - number of threads
  • -q - the name of the folder where queries are located

Note:

  • CONSULT is a bit less of a professional-looking tool at the moment.
  • We will improve it.

Instruction for kraken:

We suggest using the default value for alpha option which is 0. This recommendation is based on our empirical findings from a previous paper.

To query kraken DB we use:

kraken2 --use-names --threads 24 --report REPORT_FILE_NAME --db DATABASE_NAME --confidence alpha --classified-out CLASSIFIED_FASTQ_FILE --unclassified-out UNCLASSIFIED_FASTQ_FILE QUERY_FASTQ_FILE > KRAKEN_OUTPUT_FILE