This tutorial walks you through the use of Skmer for computing distances between genome skims, FastME for phylogeny reconstruction using Skmer distances, and APPLES for phylogenetic placement.

Slides: Slides from the talk are here
Viedo: on Youtube

Tools:

Main tools:

Skmer
APPLES

Other tools we will use:

Papers:

S. Sarmashghi, K. Bohmann, M. T. P Gilbert, V. Bafna, and S. Mirarab. “Skmer: Assembly-Free and Alignment-Free Sample Identification Using Genome Skims.” Genome Biology Vol. 20, no. 1 (2019): pp. 34. doi:10.1186/s13059-019-1632-4.
M. Balaban, S. Sarmashghi, and S. Mirarab. “APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments.” Edited by David Posada. Systematic Biology Vol. 69, no. 3 (2020): pp. 566–78. doi:10.1093/sysbio/syz063.
K. Bohmann, S. Mirarab, V. Bafna, and M. T. P. Gilbert. “Beyond DNA Barcoding: The Unrealized Potential of Genome Skim Data in Sample Identification.” Molecular Ecology, (2020), pp. mec.15507. doi:10.1111/mec.15507.

Installations

The following steps show how to install the tools.

If you are unable to follow some of the steps, a lot of the outputs (all the smaller files) are provided here as a zip file you can download: Skmer+APPLES+tutorial-smallfiles.zip

On the cluster (pre-installed)

On the saga.sigma2.no cluster, the tools are already installed.

To activate, run:

## Make an interactive session
srun --ntasks=1 --mem-per-cpu=30G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i

## Activate Conda
conda activate /cluster/projects/nn9458k/oh_know/.conda/skmer

## Load modules

### Load FastME
module load FastME/2.1.6.2-GCC-10.2.0

### Load FastTree
module swap GCCcore/10.2.0 GCCcore/8.3.0
module load FastTree/2.1.11-GCCcore-8.3.0

### Best if /cluster/projects/nn9458k/oh_know/teachers//bin is in your path;
export PATH=$PATH:/cluster/projects/nn9458k/oh_know/teachers//bin

# Guppy:
### is available here:/cluster/projects/nn9458k/oh_know/teachers//bin/guppy

ls /cluster/projects/nn9458k/oh_know/teachers/bin/tsv_to_phymat.sh

# (Optional) ART
### is available here:
### /cluster/projects/nn9458k/oh_know/teachers//bin/art_bin_MountRainier/art_illumina

## Test to make sure everything is installed
skmer -h
run_apples.py -h
run_misa.py -h
fastme -h
FastTree -h

/cluster/projects/nn9458k/oh_know/teachers//bin/guppy -h
# or even better:
guppy -h

/cluster/projects/nn9458k/oh_know/teachers//bin/art_bin_MountRainier/art_illumina -h

Install on a new machine or locally

Instructions shown below work on Linux. Some changes are needed for MAC and Windows. See comments.

conda create --name skmer
conda activate skmer

conda config --add channels bioconda

### Instal Skmer
conda install skmer
skmer -h

### Instal APPLES
python -m pip install -U apples
run_apples.py -h
python -m pip  list |grep apples
### If you have versions older than 1.3.0, you may need to updating using:
python -m pip install --upgrade apples

### Install MISA
python -m pip install misa
run_misa.py -h


### Install FastME (to get backbone trees)
wget http://www.atgc-montpellier.fr/download/sources/fastme/fastme-2.1.5.tar.gz
tar xvfz fastme-2.1.5.tar.gz
chmod +x fastme-2.1.5/binaries/fastme-2.1.5-inux ## Change "inux" at the end if using other platforms (osx or windows).
./fastme-2.1.5/binaries/fastme-2.1.5-osx -h

### install fasttree
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm
### Note: for linux, you can find binary files on the FastTree website
./FastTree -h

### Guppy for working with .jplace files
wget https://github.com/smirarab/sepp/raw/master/tools/bundled/Linux/guppy-64
mv guppy-64 guppy
# wget https://github.com/smirarab/sepp/raw/master/tools/bundled/Darwin/guppy

chmod +x guppy
./guppy --version

### Other scripts:
wget https://raw.githubusercontent.com/balabanmetin/LSEdiag/master/tsv_to_phymat.sh
wget https://raw.githubusercontent.com/balabanmetin/misa/master/scripts/convert_to_tsv.sh
chmod +x tsv_to_phymat.sh
chmod +x convert_to_tsv.sh

### (Optional) If you want to simulate skimming, Install ART
# The link below is for MAC. For other platforms, see https://www.niehs.nih.gov/research/resources/software/biostatistics/art/
    # For MAC, it's https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier2016.06.05macos64.tgz
wget https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier2016.06.05linux64.tgz 
tar xvfz artbinmountrainier2016.06.05macos64.tgz
art_bin_MountRainier/art_illumina -h

### newick utilities
conda install -c bioconda newick_utils
conda install -c bioconda/label/cf201901 newick_utils

Obtain large datasets

We have pre-downloaded the datasets on the cluster. Just make a link:

### Move to userwork
cd $USERWORK

### make a directory
mkdir skmer-tutorial
cd skmer-tutorial
### Link the data
ln -s /cluster/projects/nn9458k/oh_know/teachers/smirarab/genomes .

If you are on another machine, here is how you would get the data:

### Obtain yeast genomes as test case
wget https://github.com/balabanmetin/yeast-genomes/raw/master/yeast-genomes.tar.bz2
tar xvfj yeast-genomes.tar.bz2

du -sm genomes/*/*

head genomes/Saccharomyces_kudriavzevii/GCA_900682665.1_SKCA111_genomic.fna

Alignment-free analyses: full genomes

We will start by computing pairwise distances between genomes

Run Skmer on example genome data

# First, prepare a directory with all the input files
mkdir nonhybrids
for x in `cat genomes/nonhybrids.txt`; do  
	cp genomes/$x/*fna  nonhybrids/$x.fna; 
done # We only choose non-hybrid species
# Then, set aside the genome S. cerevisiae as query
mkdir nonhybrids-query 
mv nonhybrids/Saccharomyces_cerevisiae.fna nonhybrids-query/

# Look at the directories
ls nonhybrids*

# Now, run Skmer on this input directory.
skmer reference -t nonhybrids/
# This command will sketch each input file and will also compare all pairs to compute their distance
# On my machine it takes 4-5 minutes
# `-t` Tells is to transform distances using JC69 transformation

Let us examine results:

ls library/
CONFIG  Saccharomyces_arboricola  Saccharomyces_eubayanus  Saccharomyces_jurei  Saccharomyces_kudriavzevii  Saccharomyces_mikatae  Saccharomyces_paradoxus  Saccharomyces_uvarum

cd library
cat CONFIG
ls *

### See a .dat file
cat Saccharomyces_paradoxus/Saccharomyces_paradoxus.dat

### Get all genome lengths
grep genome_length */*dat

### How big are sketches?
du -sm */*msh

cd ..

### Look at distances:
cat ref-dist-mat.txt

Infer backbone tree with branch lengths using FastME

We will now estimate a FastME tree using our distance.

tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy

### Look at results:
cat ref-dist-mat.phy

# Infer backbone from scratch
fastme  -i ref-dist-mat.phy  -o backbone-fastme.tre

ls
cat backbone-fastme.tre
nw_display backbone-fastme.tre

Note: Here, we inferred a new tree. If you already have a backbone tree, you need to (re)compute branch lengths on it with the right unit. Here is an example.

# Make a tree
echo "((((Saccharomyces_paradoxus,(Saccharomyces_jurei,Saccharomyces_mikatae)),Saccharomyces_kudriavzevii),Saccharomyces_arboricola),(Saccharomyces_eubayanus,Saccharomyces_uvarum));" > backbone.tre

# Recompute branch lengths:
fastme  -i ref-dist-mat.phy  -o backbone.tre -u backbone-fastme.tre
nw_display backbone-fastme-2.tre

Place new sample on the backbone tree

We now place a query (Saccharomyces cerevisiae) onto the backbone tree

First, compute distances from query to the references.

# Ask skmer to compute distances from query to references
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna library/ 

# Convert to right format using script above
convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv

cat dist-saccharomyces_cerevisiae.tsv

Note that the query is not added to the reference set. Use -a if you want the query to be added back to the library

ls library/
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna -a library/
ls library

Now that we have distances, let's add the query to the tree using APPLES.

run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace

cat cerevisiae.jplace

The output is in the .jplace format. Examine it a bit.

You can always convert the .jplace to newick. We will use guppy, but there are more recent tools.

# Turn output to newick
guppy tog cerevisiae.jplace

cat cerevisiae.tog.tre
nw_display cerevisiae.tog.tre

Alignment-free analyses: genome skims

Now, let's move on to working with genome skims instead of assembled genomes.

Run ART to simulate genome skims

We start with simulating some genome skimming. In real life, you would have these. If you want, you can skip this step and use the data available at /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/.

# To copy existing files, 
cp -r /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/ .

# To use ART to create genome skims from your genomes at 2X coverage
mkdir skims skims/nonhybrids skims/nonhybrids-query/
ls nonhybrids*/*fna|xargs -n1 -I@ /cluster/projects/nn9458k/oh_know/teachers/bin/art_bin_MountRainier/art_illumina  -i @  -l 100 -f 2 -o skims/@
cd skims
ls */

Repeat analyses using genome skims

Reference set

# Build library of skims
skmer reference -t nonhybrids/

ls library
cat library/Saccharomyces_paradoxus.fna/Saccharomyces_paradoxus.fna.dat

grep "" library/*/*.fna.dat

Compare with query:

# Compute distance of query to backbone sequence
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna.fq -a library/ 
sed -i -e "s/.fna//g" dist-saccharomyces_cerevisiae.fna.txt # Remove extra .fna from species names.
# Convert file format
convert_to_tsv.sh dist-saccharomyces_cerevisiae.fna.txt > dist-saccharomyces_cerevisiae.tsv
cat dist-saccharomyces_cerevisiae.tsv

Let us compare assembly-based and assembly-free results

### Start with genome lengths
paste <(grep genome_length library/*/*.dat ) <(grep genome_length ../library/*/*dat)

### And move on to distances
paste dist-saccharomyces_cerevisiae.fna.txt ../dist-saccharomyces_cerevisiae.txt

OK, we can now move on to phylogenetic placement

# Run APPLES
run_apples.py -t ../backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace

# Turn output to newick
guppy tog cerevisiae.jplace
nw_display cerevisiae.tog.tre

# Compare to the genome-based run
nw_display ../cerevisiae.tog.tre

cd ..

Some other points.

Read cleanup

For preprocessing we use BBtools tool (https://jgi.doe.gov/data-and-tools/bbtools/).
We stick to 5 steps:

To decontaminate .fastq files we used

bbduk.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out1=FASTQ_READ1
out2=FASTQ_READ2 ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 tpe tbo

To deduplicate reads we used

dedupe.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=DEDUP_OUTPUT_FASTQ_FILE

To reformat deduplicated output files we used

reformat.sh in=DEDUP_OUTPUT_FASTQ_FILE out1=FASTQ_READ1 out2=FASTQ_READ2

To merge paired-end reads .fastq we used

bbmerge.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=merged.fastq  outu1=read1_unmerged.fastq outu2=read2_unmerged.fastq

Subsequently we concatenate merged and unmerged outputs

cat merged.fastq read1_unmerged.fastq read2_unmerged.fastq > OUTPUT_FILE

The output of deduplication is interleaved fastq so it needs to be reformatted into two separate Read1 and Read2 fastqs. Next, we usually merge Read1 and Read2 outputs. We would suggest preserving outu files (unmerged part of fastqs) and cat them with the merged fastq. Otherwise, coverage might be too low for Skmer to handle.

Using APPLES on aligned sequences

APPLES can also be used with aligned sequences. Below is an example.

On some simulated data.

# Get example input data
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/ref.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/query.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/backbone.nwk
sed -i -e "s/>L/>QU/g"  query.fa # Change query names to become easy to distinguish

# Recompute branch lengths on the backbone tree 
FastTree  -nosupport -nt -nome -noml -log tree.log -intree backbone.nwk < ref.fa > minimum_evo_backbone.nwk

run_apples.py -s ref.fa -q query.fa -t minimum_evo_backbone.nwk -o aligned.jplace

# Turn output to newick
guppy tog aligned.jplace
guppy tog --xml aligned.jplace

More advanced options on real data

wget https://raw.githubusercontent.com/balabanmetin/hslU/master/backbone_ml.nwk
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/query_nothird.fa
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/ref_nothird.fa

# Reestimate branch lengths!
FastTree -nt -nosupport -nome -noml -log true_me.fasttree.log -intree backbone_ml.nwk < ref_nothird.fa > true_me.fasttree

# Run APPLES
run_apples.py -t true_me.fasttree -s ref_nothird.fa -q query_nothird.fa -m FM -c MLSE -f 0.2 -b 25 -o apples.jplace
guppy tog  --xml  apples.jplace

2021_Introduction to k mers for analyzing skimming data - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Tools:

Papers:

Installations

On the cluster (pre-installed)

Install on a new machine or locally

Obtain large datasets

Alignment-free analyses: full genomes

Run Skmer on example genome data

Infer backbone tree with branch lengths using FastME

Place new sample on the backbone tree

Alignment-free analyses: genome skims

Run ART to simulate genome skims

Repeat analyses using genome skims

Some other points.

Read cleanup

Using APPLES on aligned sequences

⚠️ GitHub.com Fallback ⚠️

2021_Introduction to k mers for analyzing skimming data - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Tools:

Papers:

Installations

On the cluster (pre-installed)

Install on a new machine or locally

Obtain large datasets

Alignment-free analyses: full genomes

Run Skmer on example genome data

Infer backbone tree with branch lengths using FastME

Place new sample on the backbone tree

Alignment-free analyses: genome skims

Run ART to simulate genome skims

Repeat analyses using genome skims

Some other points.

Read cleanup

Using APPLES on aligned sequences

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️