2021_Introduction to k mers for analyzing skimming data - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki
This tutorial walks you through the use of Skmer for computing distances between genome skims, FastME for phylogeny reconstruction using Skmer distances, and APPLES for phylogenetic placement.
Main tools:
Other tools we will use:
- S. Sarmashghi, K. Bohmann, M. T. P Gilbert, V. Bafna, and S. Mirarab. โSkmer: Assembly-Free and Alignment-Free Sample Identification Using Genome Skims.โ Genome Biology Vol. 20, no. 1 (2019): pp. 34. doi:10.1186/s13059-019-1632-4.
- M. Balaban, S. Sarmashghi, and S. Mirarab. โAPPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments.โ Edited by David Posada. Systematic Biology Vol. 69, no. 3 (2020): pp. 566โ78. doi:10.1093/sysbio/syz063.
- K. Bohmann, S. Mirarab, V. Bafna, and M. T. P. Gilbert. โBeyond DNA Barcoding: The Unrealized Potential of Genome Skim Data in Sample Identification.โ Molecular Ecology, (2020), pp. mec.15507. doi:10.1111/mec.15507.
The following steps show how to install the tools.
- If you are unable to follow some of the steps, a lot of the outputs (all the smaller files) are provided here as a zip file you can download: Skmer+APPLES+tutorial-smallfiles.zip
On the saga.sigma2.no cluster, the tools are already installed.
To activate, run:
## Make an interactive session
srun --ntasks=1 --mem-per-cpu=30G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
## Activate Conda
conda activate /cluster/projects/nn9458k/oh_know/.conda/skmer
## Load modules
### Load FastME
module load FastME/2.1.6.2-GCC-10.2.0
### Load FastTree
module swap GCCcore/10.2.0 GCCcore/8.3.0
module load FastTree/2.1.11-GCCcore-8.3.0
### Best if /cluster/projects/nn9458k/oh_know/teachers//bin is in your path;
export PATH=$PATH:/cluster/projects/nn9458k/oh_know/teachers//bin
# Guppy:
### is available here:/cluster/projects/nn9458k/oh_know/teachers//bin/guppy
ls /cluster/projects/nn9458k/oh_know/teachers/bin/tsv_to_phymat.sh
# (Optional) ART
### is available here:
### /cluster/projects/nn9458k/oh_know/teachers//bin/art_bin_MountRainier/art_illumina
## Test to make sure everything is installed
skmer -h
run_apples.py -h
run_misa.py -h
fastme -h
FastTree -h
/cluster/projects/nn9458k/oh_know/teachers//bin/guppy -h
# or even better:
guppy -h
/cluster/projects/nn9458k/oh_know/teachers//bin/art_bin_MountRainier/art_illumina -h
- Instructions shown below work on Linux. Some changes are needed for MAC and Windows. See comments.
conda create --name skmer
conda activate skmer
conda config --add channels bioconda
### Instal Skmer
conda install skmer
skmer -h
### Instal APPLES
python -m pip install -U apples
run_apples.py -h
python -m pip list |grep apples
### If you have versions older than 1.3.0, you may need to updating using:
python -m pip install --upgrade apples
### Install MISA
python -m pip install misa
run_misa.py -h
### Install FastME (to get backbone trees)
wget http://www.atgc-montpellier.fr/download/sources/fastme/fastme-2.1.5.tar.gz
tar xvfz fastme-2.1.5.tar.gz
chmod +x fastme-2.1.5/binaries/fastme-2.1.5-inux ## Change "inux" at the end if using other platforms (osx or windows).
./fastme-2.1.5/binaries/fastme-2.1.5-osx -h
### install fasttree
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm
### Note: for linux, you can find binary files on the FastTree website
./FastTree -h
### Guppy for working with .jplace files
wget https://github.com/smirarab/sepp/raw/master/tools/bundled/Linux/guppy-64
mv guppy-64 guppy
# wget https://github.com/smirarab/sepp/raw/master/tools/bundled/Darwin/guppy
chmod +x guppy
./guppy --version
### Other scripts:
wget https://raw.githubusercontent.com/balabanmetin/LSEdiag/master/tsv_to_phymat.sh
wget https://raw.githubusercontent.com/balabanmetin/misa/master/scripts/convert_to_tsv.sh
chmod +x tsv_to_phymat.sh
chmod +x convert_to_tsv.sh
### (Optional) If you want to simulate skimming, Install ART
# The link below is for MAC. For other platforms, see https://www.niehs.nih.gov/research/resources/software/biostatistics/art/
# For MAC, it's https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier2016.06.05macos64.tgz
wget https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier2016.06.05linux64.tgz
tar xvfz artbinmountrainier2016.06.05macos64.tgz
art_bin_MountRainier/art_illumina -h
### newick utilities
conda install -c bioconda newick_utils
conda install -c bioconda/label/cf201901 newick_utils
- We have pre-downloaded the datasets on the cluster. Just make a link:
### Move to userwork
cd $USERWORK
### make a directory
mkdir skmer-tutorial
cd skmer-tutorial
### Link the data
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/genomes .
- If you are on another machine, here is how you would get the data:
### Obtain yeast genomes as test case
wget https://github.com/balabanmetin/yeast-genomes/raw/master/yeast-genomes.tar.bz2
tar xvfj yeast-genomes.tar.bz2
du -sm genomes/*/*
head genomes/Saccharomyces_kudriavzevii/GCA_900682665.1_SKCA111_genomic.fna
We will start by computing pairwise distances between genomes
# First, prepare a directory with all the input files
mkdir nonhybrids
for x in `cat genomes/nonhybrids.txt`; do
cp genomes/$x/*fna nonhybrids/$x.fna;
done # We only choose non-hybrid species
# Then, set aside the genome S. cerevisiae as query
mkdir nonhybrids-query
mv nonhybrids/Saccharomyces_cerevisiae.fna nonhybrids-query/
# Look at the directories
ls nonhybrids*
# Now, run Skmer on this input directory.
skmer reference -t nonhybrids/
# This command will sketch each input file and will also compare all pairs to compute their distance
# On my machine it takes 4-5 minutes
# `-t` Tells is to transform distances using JC69 transformation
Let us examine results:
ls library/
CONFIG Saccharomyces_arboricola Saccharomyces_eubayanus Saccharomyces_jurei Saccharomyces_kudriavzevii Saccharomyces_mikatae Saccharomyces_paradoxus Saccharomyces_uvarum
cd library
cat CONFIG
ls *
### See a .dat file
cat Saccharomyces_paradoxus/Saccharomyces_paradoxus.dat
### Get all genome lengths
grep genome_length */*dat
### How big are sketches?
du -sm */*msh
cd ..
### Look at distances:
cat ref-dist-mat.txt
We will now estimate a FastME tree using our distance.
tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy
### Look at results:
cat ref-dist-mat.phy
# Infer backbone from scratch
fastme -i ref-dist-mat.phy -o backbone-fastme.tre
ls
cat backbone-fastme.tre
nw_display backbone-fastme.tre
Note: Here, we inferred a new tree. If you already have a backbone tree, you need to (re)compute branch lengths on it with the right unit. Here is an example.
# Make a tree
echo "((((Saccharomyces_paradoxus,(Saccharomyces_jurei,Saccharomyces_mikatae)),Saccharomyces_kudriavzevii),Saccharomyces_arboricola),(Saccharomyces_eubayanus,Saccharomyces_uvarum));" > backbone.tre
# Recompute branch lengths:
fastme -i ref-dist-mat.phy -o backbone.tre -u backbone-fastme.tre
nw_display backbone-fastme-2.tre
We now place a query (Saccharomyces cerevisiae) onto the backbone tree
First, compute distances from query to the references.
# Ask skmer to compute distances from query to references
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna library/
# Convert to right format using script above
convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv
cat dist-saccharomyces_cerevisiae.tsv
Note that the query is not added to the reference set.
Use -a
if you want the query to be added back to the library
ls library/
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna -a library/
ls library
Now that we have distances, let's add the query to the tree using APPLES.
run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace
cat cerevisiae.jplace
The output is in the .jplace format. Examine it a bit.
You can always convert the .jplace to newick. We will use guppy, but there are more recent tools.
# Turn output to newick
guppy tog cerevisiae.jplace
cat cerevisiae.tog.tre
nw_display cerevisiae.tog.tre
Now, let's move on to working with genome skims instead of assembled genomes.
We start with simulating some genome skimming. In real life, you would have these.
If you want, you can skip this step and use the data available at /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/
.
# To copy existing files,
cp -r /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/ .
# To use ART to create genome skims from your genomes at 2X coverage
mkdir skims skims/nonhybrids skims/nonhybrids-query/
ls nonhybrids*/*fna|xargs -n1 -I@ /cluster/projects/nn9458k/oh_know/teachers/bin/art_bin_MountRainier/art_illumina -i @ -l 100 -f 2 -o skims/@
cd skims
ls */
Reference set
# Build library of skims
skmer reference -t nonhybrids/
ls library
cat library/Saccharomyces_paradoxus.fna/Saccharomyces_paradoxus.fna.dat
grep "" library/*/*.fna.dat
Compare with query:
# Compute distance of query to backbone sequence
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna.fq -a library/
sed -i -e "s/.fna//g" dist-saccharomyces_cerevisiae.fna.txt # Remove extra .fna from species names.
# Convert file format
convert_to_tsv.sh dist-saccharomyces_cerevisiae.fna.txt > dist-saccharomyces_cerevisiae.tsv
cat dist-saccharomyces_cerevisiae.tsv
Let us compare assembly-based and assembly-free results
### Start with genome lengths
paste <(grep genome_length library/*/*.dat ) <(grep genome_length ../library/*/*dat)
### And move on to distances
paste dist-saccharomyces_cerevisiae.fna.txt ../dist-saccharomyces_cerevisiae.txt
OK, we can now move on to phylogenetic placement
# Run APPLES
run_apples.py -t ../backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace
# Turn output to newick
guppy tog cerevisiae.jplace
nw_display cerevisiae.tog.tre
# Compare to the genome-based run
nw_display ../cerevisiae.tog.tre
cd ..
-
For preprocessing we use BBtools tool (https://jgi.doe.gov/data-and-tools/bbtools/).
-
We stick to 5 steps:
-
To decontaminate .fastq files we used
bbduk.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out1=FASTQ_READ1 out2=FASTQ_READ2 ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 tpe tbo
-
To deduplicate reads we used
dedupe.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=DEDUP_OUTPUT_FASTQ_FILE
-
To reformat deduplicated output files we used
reformat.sh in=DEDUP_OUTPUT_FASTQ_FILE out1=FASTQ_READ1 out2=FASTQ_READ2
-
To merge paired-end reads .fastq we used
bbmerge.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=merged.fastq outu1=read1_unmerged.fastq outu2=read2_unmerged.fastq
-
Subsequently we concatenate merged and unmerged outputs
cat merged.fastq read1_unmerged.fastq read2_unmerged.fastq > OUTPUT_FILE
The output of deduplication is interleaved fastq so it needs to be reformatted into two separate Read1 and Read2 fastqs. Next, we usually merge Read1 and Read2 outputs. We would suggest preserving outu files (unmerged part of fastqs) and cat them with the merged fastq. Otherwise, coverage might be too low for Skmer to handle.
APPLES can also be used with aligned sequences. Below is an example.
On some simulated data.
# Get example input data
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/ref.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/query.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/backbone.nwk
sed -i -e "s/>L/>QU/g" query.fa # Change query names to become easy to distinguish
# Recompute branch lengths on the backbone tree
FastTree -nosupport -nt -nome -noml -log tree.log -intree backbone.nwk < ref.fa > minimum_evo_backbone.nwk
run_apples.py -s ref.fa -q query.fa -t minimum_evo_backbone.nwk -o aligned.jplace
# Turn output to newick
guppy tog aligned.jplace
guppy tog --xml aligned.jplace
More advanced options on real data
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/backbone_ml.nwk
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/query_nothird.fa
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/ref_nothird.fa
# Reestimate branch lengths!
FastTree -nt -nosupport -nome -noml -log true_me.fasttree.log -intree backbone_ml.nwk < ref_nothird.fa > true_me.fasttree
# Run APPLES
run_apples.py -t true_me.fasttree -s ref_nothird.fa -q query_nothird.fa -m FM -c MLSE -f 0.2 -b 25 -o apples.jplace
guppy tog --xml apples.jplace