Phylogenetic placement of samples (Skmer) & estimating genomic distance (APPLES) - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Alignment-free analyses: full genomes

We will start by computing pairwise distances between genomes

Run Skmer on example genome data

# First, prepare a directory with all the input files
mkdir nonhybrids
for x in `cat genomes/nonhybrids.txt`; do  
	cp genomes/$x/*fna  nonhybrids/$x.fna; 
done # We only choose non-hybrid species
# Then, set aside the genome S. cerevisiae as query
mkdir nonhybrids-query 
mv nonhybrids/Saccharomyces_cerevisiae.fna nonhybrids-query/

# Look at the directories
ls nonhybrids*

# Now, run Skmer on this input directory.
skmer reference -t nonhybrids/
# This command will sketch each input file and will also compare all pairs to compute their distance
# On my machine it takes 4-5 minutes
# `-t` Tells is to transform distances using JC69 transformation

Let us examine results:

ls library/
CONFIG  Saccharomyces_arboricola  Saccharomyces_eubayanus  Saccharomyces_jurei  Saccharomyces_kudriavzevii  Saccharomyces_mikatae  Saccharomyces_paradoxus  Saccharomyces_uvarum

cd library
cat CONFIG
ls *

### See a .dat file
cat Saccharomyces_paradoxus/Saccharomyces_paradoxus.dat

### Get all genome lengths
grep genome_length */*dat

### How big are sketches?
du -sm */*msh

cd ..

### Look at distances:
cat ref-dist-mat.txt

Infer backbone tree with branch lengths using FastME

We will now estimate a FastME tree using our distance.

tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy

### Look at results:
cat ref-dist-mat.phy

# Infer backbone from scratch
fastme  -i ref-dist-mat.phy  -o backbone-fastme.tre

ls
cat backbone-fastme.tre
nw_display backbone-fastme.tre

Note: Here, we inferred a new tree. If you already have a backbone tree, you need to (re)compute branch lengths on it with the right unit. Here is an example.

# Make a tree
echo "((((Saccharomyces_paradoxus,(Saccharomyces_jurei,Saccharomyces_mikatae)),Saccharomyces_kudriavzevii),Saccharomyces_arboricola),(Saccharomyces_eubayanus,Saccharomyces_uvarum));" > backbone.tre

# Recompute branch lengths:
fastme  -i ref-dist-mat.phy  -o backbone.tre -u backbone-fastme.tre
nw_display backbone-fastme-2.tre

Place new sample on the backbone tree

We now place a query (Saccharomyces cerevisiae) onto the backbone tree

First, compute distances from query to the references.

# Ask skmer to compute distances from query to references
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna library/ 

# Convert to right format using script above
convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv

cat dist-saccharomyces_cerevisiae.tsv

Note that the query is not added to the reference set. Use -a if you want the query to be added back to the library

ls library/
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna -a library/
ls library

Now that we have distances, let's add the query to the tree using APPLES.

run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace

cat cerevisiae.jplace

The output is in the .jplace format. Examine it a bit.

You can always convert the .jplace to newick. We will use guppy, but there are more recent tools.

# Turn output to newick
guppy tog cerevisiae.jplace

cat cerevisiae.tog.tre
nw_display cerevisiae.tog.tre

Alignment-free analyses: genome skims

Now, let's move on to working with genome skims instead of assembled genomes.

Run ART to simulate genome skims

We start with simulating some genome skimming. In real life, you would have these. If you want, you can skip this step and use the data available at /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/.

# To copy existing files, 
cp -r /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/ .

# To use ART to create genome skims from your genomes at 2X coverage
mkdir skims skims/nonhybrids skims/nonhybrids-query/
ls nonhybrids*/*fna|xargs -n1 -I@ /cluster/projects/nn9458k/oh_know/teachers/bin/art_bin_MountRainier/art_illumina  -i @  -l 100 -f 2 -o skims/@
cd skims
ls */

Repeat analyses using genome skims

Reference set

# Build library of skims
skmer reference -t nonhybrids/

ls library
cat library/Saccharomyces_paradoxus.fna/Saccharomyces_paradoxus.fna.dat

grep "" library/*/*.fna.dat

Compare with query:

# Compute distance of query to backbone sequence
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna.fq -a library/ 
sed -i -e "s/.fna//g" dist-saccharomyces_cerevisiae.fna.txt # Remove extra .fna from species names.
# Convert file format
convert_to_tsv.sh dist-saccharomyces_cerevisiae.fna.txt > dist-saccharomyces_cerevisiae.tsv
cat dist-saccharomyces_cerevisiae.tsv

Let us compare assembly-based and assembly-free results

### Start with genome lengths
paste <(grep genome_length library/*/*.dat ) <(grep genome_length ../library/*/*dat)

### And move on to distances
paste dist-saccharomyces_cerevisiae.fna.txt ../dist-saccharomyces_cerevisiae.txt

OK, we can now move on to phylogenetic placement

# Run APPLES
run_apples.py -t ../backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace

# Turn output to newick
guppy tog cerevisiae.jplace
nw_display cerevisiae.tog.tre

# Compare to the genome-based run
nw_display ../cerevisiae.tog.tre

cd ..

Some other points.

Read cleanup

  1. To decontaminate .fastq files we used

    bbduk.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out1=FASTQ_READ1
    out2=FASTQ_READ2 ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 tpe tbo
  2. To deduplicate reads we used

    dedupe.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=DEDUP_OUTPUT_FASTQ_FILE
  3. To reformat deduplicated output files we used

    reformat.sh in=DEDUP_OUTPUT_FASTQ_FILE out1=FASTQ_READ1 out2=FASTQ_READ2
  4. To merge paired-end reads .fastq we used

    bbmerge.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=merged.fastq  outu1=read1_unmerged.fastq outu2=read2_unmerged.fastq
  5. Subsequently we concatenate merged and unmerged outputs

    cat merged.fastq read1_unmerged.fastq read2_unmerged.fastq > OUTPUT_FILE

The output of deduplication is interleaved fastq so it needs to be reformatted into two separate Read1 and Read2 fastqs. Next, we usually merge Read1 and Read2 outputs. We would suggest preserving outu files (unmerged part of fastqs) and cat them with the merged fastq. Otherwise, coverage might be too low for Skmer to handle.

Using APPLES on aligned sequences

APPLES can also be used with aligned sequences. Below is an example.

On some simulated data.

# Get example input data
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/ref.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/query.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/backbone.nwk
sed -i -e "s/>L/>QU/g"  query.fa # Change query names to become easy to distinguish

# Recompute branch lengths on the backbone tree 
FastTree  -nosupport -nt -nome -noml -log tree.log -intree backbone.nwk < ref.fa > minimum_evo_backbone.nwk

run_apples.py -s ref.fa -q query.fa -t minimum_evo_backbone.nwk -o aligned.jplace

# Turn output to newick
guppy tog aligned.jplace
guppy tog --xml aligned.jplace

More advanced options on real data

wget https://raw.githubusercontent.com/balabanmetin/hslU/master/backbone_ml.nwk
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/query_nothird.fa
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/ref_nothird.fa

# Reestimate branch lengths!
FastTree -nt -nosupport -nome -noml -log true_me.fasttree.log -intree backbone_ml.nwk < ref_nothird.fa > true_me.fasttree

# Run APPLES
run_apples.py -t true_me.fasttree -s ref_nothird.fa -q query_nothird.fa -m FM -c MLSE -f 0.2 -b 25 -o apples.jplace
guppy tog  --xml  apples.jplace
⚠️ **GitHub.com Fallback** ⚠️