2021_Advanced use of k mers for analyzing skimming data - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki
Intro
This tutorial walks you through more advanced topics related to skimming.
Tools:
Main tools:
Other tools we will use:
Papers:
- MISA:
- M. Balaban, and S. Mirarab. “Phylogenetic Double Placement of Mixed Samples.” Bioinformatics Vol. 36, no. Supplement_1 (2020): pp. i335–43. doi:10.1093/bioinformatics/btaa489.
- RESPECT:
- Sarmashghi, S., Balaban, M., Rachtman, E., Touri, B., Mirarab, S., & Bafna, V. (2021). Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. BioRxiv, 2021.01.28.428636. doi:10.1101/2021.01.28.428636
- Contamination
- E. Rachtman, M. Balaban, V. Bafna, and S. Mirarab. “The Impact of Contaminants on the Accuracy of Genome Skimming and the Effectiveness of Exclusion Read Filters.” Molecular Ecology Resources Vol. 20, no. 3 (2020): pp. 1755-0998.13135. doi:10.1111/1755-0998.13135.
- Rachtman, E., Bafna, V., & Mirarab, S. (2021). CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics, 3(3), lqab071. doi:10.1093/nargab/lqab071
Compute node
Before you start, create an interactive session on a node.
## Make an interactive session
srun --ntasks=1 --mem-per-cpu=30G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
1. Using MISA for mixed genome skim analyses
Installation
We installed MISA on day one. Please refer back to those instructions. We will simply re-activate our installed tools.
## Activate Conda
conda activate /cluster/projects/nn9458k/oh_know/.conda/skmer
## Load modules
### Load FastME
module load FastME/2.1.6.2-GCC-10.2.0
export PATH=$PATH:/cluster/projects/nn9458k/oh_know/teachers/bin
Prepare the query and distances.
We will now place the mixed individual (a known hybrid called Saccharomyces pastorianus) onto the tree using the double-placement tool MISA.
cd $USERWORK
cd skmer-tutorial
mkdir mix-query
cp genomes/Saccharomyces_pastorianus/GCA_001515485.2_Saccharomyces_pastorianus_Weihenstephan_34_70_chromosomes_assembly_1.0_genomic.fna mix-query/Saccharomyces_pastorianus.fna
These are the real constituents of Saccharomyces pastorianus.
cat genomes/Saccharomyces_pastorianus/things.txt
Recall that yesterday we used the -a option to add Saccharomyces cerevisiae to the reference set.
Let us first infer a backbone tree that includes Saccharomyces cerevisiae.
# Update the distance matrix to include the added species Saccharomyces cerevisiae
skmer distance -t library/
# Build the full tree with Saccharomyces cerevisiae included
tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat-full.phy
fastme -i ref-dist-mat-full.phy -o full.tre
Start by computing distances from the mixed query to the references.
# Run Skmer
skmer query -t mix-query/Saccharomyces_pastorianus.fna library/
# Convert output to .tsv file
convert_to_tsv.sh dist-saccharomyces_pastorianus.txt > dist-saccharomyces_pastorianus.tsv
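Before placing the query, it can help to check which references are closest to it. The following is a minimal sketch, assuming the Skmer output (dist-saccharomyces_pastorianus.txt) has two tab-separated columns, a reference name followed by a distance, possibly preceded by a header line. Check the actual layout of your file first. The function name closest_refs is our own and is not part of Skmer.

```shell
# closest_refs FILE [N]: print the N rows with the smallest distance (default 5).
# Assumption: FILE has two tab-separated columns (reference name, distance).
closest_refs() {
    # Keep only rows whose second field looks numeric (this skips a header),
    # then sort on that field; -g also handles values such as 1e-05.
    awk -F'\t' '$2 ~ /^[0-9.eE+-]+$/' "$1" \
        | sort -t "$(printf '\t')" -k2,2g \
        | head -n "${2:-5}"
}

# Example (run from the skmer-tutorial directory):
# closest_refs dist-saccharomyces_pastorianus.txt 3
```

For a hybrid such as Saccharomyces pastorianus, you would expect its constituents (when present in the library) to appear near the top of this list.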
Ignoring mixtures:
Now, place the sample onto the tree, ignoring that it is a mixture.
run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_pastorianus.tsv -o pastorianus-single.jplace
guppy tog pastorianus-single.jplace
nw_display pastorianus-single.tog.tre
Placement of mixed samples with both constituents present
Let's jump to MISA runs.
# Run MISA for phylogenetic double placement
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t full.tre -o mixed-output-present.jplace
# Check the output versus correct mixture:
guppy tog mixed-output-present.jplace
nw_display full.tre
nw_display mixed-output-present.tog.tre
Placement of mixed samples with one constituent missing
Now, let's try the double-placement when one of the constituents is missing from the backbone.
nw_display backbone-fastme.tre
# Run MISA for phylogenetic double placement
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t backbone-fastme.tre -o mixed-output.jplace
# Check the output versus correct mixture:
guppy tog mixed-output.jplace
cat genomes/Saccharomyces_pastorianus/things.txt
nw_display backbone-fastme.tre
nw_display mixed-output.tog.tre
You will see the following beautiful result. As you can see, MISA correctly identified the two parent species of Saccharomyces pastorianus.
- Top: the full reference tree before removing Saccharomyces cerevisiae. The two blue branches are the known constituents of Saccharomyces pastorianus.
- Bottom: Results of placement of Saccharomyces pastorianus on the tree after removing Saccharomyces cerevisiae.
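To confirm which taxa are absent from the backbone relative to the full tree, you can compare their leaf sets. This is a rough sketch that parses Newick with standard text tools. It assumes plain trees with branch lengths and no internal node labels or support values. The function names are our own; for anything less regular, a dedicated tool such as nw_labels from Newick Utilities is a safer choice.

```shell
# newick_leaves FILE: print one leaf label per line, sorted.
# Assumption: no internal node labels or support values in the tree
# (they would be listed as if they were leaves).
newick_leaves() {
    tr '(),;' '\n\n\n\n' < "$1" | sed 's/:.*//' | grep -v '^[[:space:]]*$' | sort -u
}

# missing_from_backbone FULL BACKBONE: labels present in FULL but not in BACKBONE.
missing_from_backbone() {
    full_tmp=$(mktemp) && back_tmp=$(mktemp) || return 1
    newick_leaves "$1" > "$full_tmp"
    newick_leaves "$2" > "$back_tmp"
    comm -23 "$full_tmp" "$back_tmp"
    rm -f "$full_tmp" "$back_tmp"
}

# Example:
# missing_from_backbone full.tre backbone-fastme.tre
```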
Placement of mixed samples with both constituents missing
nw_prune backbone-fastme.tre Saccharomyces_eubayanus > backbone-noconst.tre
run_misa.py -d dist-saccharomyces_pastorianus.tsv -t backbone-noconst.tre -o mixed-output-noconst.jplace
guppy tog mixed-output-noconst.jplace
nw_display mixed-output-noconst.tog.tre
2. Repeat Spectra and better coverage estimates
We will now use RESPECT to compute the repeat structure of the genomes.
Installation
module load Jellyfish/2.3.0-GCC-9.3.0
# Note: these modules are built with GCC 9.3.0, while FastME above was built with GCC 10.2.0, so you may need to swap the loaded GCC toolchain
module load seqtk/1.3-GCC-9.3.0
python3.9 -m pip install gurobipy
# After installing the gurobipy package with pip as described above, you need to set up the license. Gurobi offers a free academic license; see https://www.gurobi.com/documentation/9.1/quickstart_mac/creating_a_new_academic_li.html#subsection:createacademiclicense
git clone https://github.com/shahab-sarmashghi/RESPECT.git
cd RESPECT
python3.9 setup.py install
cd ..
Running RESPECT
Prepare input
Make sure you are under the skmer-tutorial directory.
mkdir respect
cd respect
mkdir inputs
cd inputs
## link to both skims and genomes here
ln -s ../../skims/nonhybrids/Saccharomyces_*fq ../../skmer-tutorial/nonhybrids/*fna .
cd ..
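Because ln -s happily creates links to paths that do not exist (including a literal pattern when a glob matches nothing), it is worth checking the inputs directory before running RESPECT. A small sketch; the function name broken_links is our own.

```shell
# broken_links DIR: print symbolic links under DIR whose targets do not exist.
broken_links() {
    # "test -e" follows the link, so it fails for dangling links.
    find "$1" -type l ! -exec test -e {} \; -print
}

# Example (run from the respect directory); no output means all links resolve:
# broken_links inputs/
```

If anything is listed, remove those links and adjust the relative paths in the ln -s command to match where your skims and genomes actually live.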
Now run the method:
### There is some ugliness having to do with Gurobi licenses: we cannot just run respect. I will discuss this during the tutorial.
### /cluster/projects/nn9458k/oh_know/teachers/smirarab/respect/gurobi912/linux64/bin/grbgetkey
### export GRB_LICENSE_FILE=`pwd`/gurobi.lic
respect -d inputs/ -N 100
This will take a minute or two.
Look at the summary of the results and note how closely the estimates from the skims match those from the assembled genomes.
### Examine genome length estimates
column -t estimated-parameters.txt|sort -k1
### And examine repeat spectra
cat estimated-spectra.txt |sort -k1|column -t
cd ..
3. Contamination Filtering
There are two ways to remove contamination:
- inclusion filters when you know what you are looking for and you have a reference genome.
- Here, you can use any tool, like BLAST, bowtie, etc.
- exclusion filters when you do not know exactly what to look for but you know what you do not want (e.g., bacteria)
Exclusion filters
We have bacterial/archaeal libraries available for both CONSULT and Kraken.
- GTDB is the most comprehensive library.
- Links to all of our reference libraries are available on our raw data GitHub for CONSULT.
Instructions for CONSULT (you can go straight to Query):
Installation
CONSULT is available under /cluster/projects/nn9458k/oh_know/teachers/bin/consult_search.
To install and compile the search command on your own machine:
git clone https://github.com/noraracht/CONSULT.git
cd CONSULT/
g++ main_search.cpp -std=c++11 -fopenmp -O3 -o consult_search
Note that CONSULT requires ~120GB of memory, so we need to stop our interactive session and start a new one with more memory.
srun --ntasks=1 --cpus-per-task=8 --mem-per-cpu=16G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
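With --mem-per-cpu, the memory reserved for a single task is the number of CPUs per task times the memory per CPU, so it is easy to check that a request covers what CONSULT needs. A tiny sketch of that arithmetic; the helper name total_mem_gb is our own.

```shell
# total_mem_gb CPUS MEM_PER_CPU_GB: total memory (in GB) reserved for one task.
total_mem_gb() {
    echo $(( $1 * $2 ))
}

# The request above: 8 CPUs x 16G each.
# total_mem_gb 8 16
# The first interactive session of this tutorial: 1 CPU x 30G.
# total_mem_gb 1 30
```

The new session therefore reserves 128G, which is enough for the ~120GB that CONSULT requires, whereas the earlier 30G session is not.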
Query
To query sequence reads against the reference database, run:
mkdir consult
cd consult/
# Let's use as query a bunch of Drosophila genome skims (already on cluster)
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/Drosophila/ .
# I have already copied the reference dataset; let's link to it
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb/ .
consult_search -i all_nbrhood_kmers_k32_p3l2clmn7_K15-map2-171_gtdb -c 1 -t 2 -q Drosophila/ 2>&1 |tee consult.log &
This runs for 5-10 minutes. While it runs, we can monitor it a bit:
top -u $USER
tail -f consult.log
watch -n 10 wc -l ucseq_* Drosophila/*
After it finishes, inspect the results.
less consult.log
wc -l ucseq_* Drosophila/*
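Raw line counts are easier to interpret as read counts and percentages. The sketch below assumes that both the query files and the ucseq_ files are four-line-per-record FASTQ, and that each ucseq_ file holds the reads CONSULT left unclassified. Both function names and the file names in the example are hypothetical.

```shell
# fastq_reads FILE: number of reads, assuming four lines per FASTQ record.
fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# percent_kept ORIGINAL FILTERED: percentage of reads remaining after filtering.
percent_kept() {
    awk -v a="$(fastq_reads "$1")" -v b="$(fastq_reads "$2")" \
        'BEGIN { if (a == 0) print "NA"; else printf "%.1f\n", 100 * b / a }'
}

# Example with hypothetical file names:
# percent_kept Drosophila/SAMPLE.fq ucseq_SAMPLE.fq
```

A value close to 100 means few reads matched the bacterial/archaeal reference library; a much lower value points to substantial contamination in that skim.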
Here are the arguments to CONSULT:
- -i: the name of the reference database
- -c: the lowest number of k-mer matches required to mark a sequencing read as classified. For instance, if at least one k-mer match is enough to classify a read, "c" should be set to 1. If at least two k-mer matches are required to call a read a match, "c" should be set to 2.
- -t: the number of threads
- -q: the name of the folder where the queries are located
Note:
- CONSULT is a bit less of a professional-looking tool at the moment.
- We will improve it.
Instructions for Kraken:
We suggest using the default value of the alpha option (passed to kraken2 as --confidence), which is 0. This recommendation is based on our empirical findings from a previous paper.
To query a Kraken database, we use:
kraken2 --use-names --threads 24 --report REPORT_FILE_NAME --db DATABASE_NAME --confidence alpha --classified-out CLASSIFIED_FASTQ_FILE --unclassified-out UNCLASSIFIED_FASTQ_FILE QUERY_FASTQ_FILE > KRAKEN_OUTPUT_FILE
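To make the placeholders in the command above concrete, here is a small dry-run helper that fills them in for one sample and prints the resulting command without running it. It uses only the flags shown above. The function name kraken_cmd and the output naming scheme are our own choices, and paths containing spaces are not handled.

```shell
# kraken_cmd DB QUERY.fq [ALPHA] [THREADS]: print the kraken2 command for one sample.
# ALPHA defaults to 0 (the suggested value) and THREADS to 24 (as in the command above).
kraken_cmd() {
    db="$1"; query="$2"; alpha="${3:-0}"; threads="${4:-24}"
    name=$(basename "$query")
    name="${name%.*}"          # strip the file extension
    echo "kraken2 --use-names --threads $threads --report $name.report" \
         "--db $db --confidence $alpha" \
         "--classified-out $name.classified.fq --unclassified-out $name.unclassified.fq" \
         "$query > $name.kraken.out"
}

# Example with placeholder names; review the printed command before running it:
# kraken_cmd DATABASE_NAME Drosophila/SAMPLE.fq
```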