Phylogenetic placement of samples (Skmer) & estimating genomic distance (APPLES) - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki
We will start by computing pairwise distances between genomes
# First, prepare a directory with all the input files
mkdir nonhybrids
for x in `cat genomes/nonhybrids.txt`; do
cp genomes/$x/*fna nonhybrids/$x.fna;
done # We only choose non-hybrid species
# Then, set aside the genome S. cerevisiae as query
mkdir nonhybrids-query
mv nonhybrids/Saccharomyces_cerevisiae.fna nonhybrids-query/
# Look at the directories
ls nonhybrids*
# Now, run Skmer on this input directory.
skmer reference -t nonhybrids/
# This command will sketch each input file and will also compare all pairs to compute their distance
# On my machine it takes 4-5 minutes
# `-t` Tells is to transform distances using JC69 transformation
Let us examine results:
ls library/
CONFIG Saccharomyces_arboricola Saccharomyces_eubayanus Saccharomyces_jurei Saccharomyces_kudriavzevii Saccharomyces_mikatae Saccharomyces_paradoxus Saccharomyces_uvarum
cd library
cat CONFIG
ls *
### See a .dat file
cat Saccharomyces_paradoxus/Saccharomyces_paradoxus.dat
### Get all genome lengths
grep genome_length */*dat
### How big are sketches?
du -sm */*msh
cd ..
### Look at distances:
cat ref-dist-mat.txt
We will now estimate a FastME tree using our distance.
tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/tsv_to_phymat.sh ref-dist-mat.txt ref-dist-mat.phy
### Look at results:
cat ref-dist-mat.phy
# Infer backbone from scratch
fastme -i ref-dist-mat.phy -o backbone-fastme.tre
ls
cat backbone-fastme.tre
nw_display backbone-fastme.tre
Note: Here, we inferred a new tree. If you already have a backbone tree, you need to (re)compute branch lengths on it with the right unit. Here is an example.
# Make a tree
echo "((((Saccharomyces_paradoxus,(Saccharomyces_jurei,Saccharomyces_mikatae)),Saccharomyces_kudriavzevii),Saccharomyces_arboricola),(Saccharomyces_eubayanus,Saccharomyces_uvarum));" > backbone.tre
# Recompute branch lengths:
fastme -i ref-dist-mat.phy -o backbone.tre -u backbone-fastme.tre
nw_display backbone-fastme-2.tre
We now place a query (Saccharomyces cerevisiae) onto the backbone tree
First, compute distances from query to the references.
# Ask skmer to compute distances from query to references
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna library/
# Convert to right format using script above
convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv
# If this doesn't work, try:
# bash /cluster/projects/nn9458k/oh_know/teachers/bin/convert_to_tsv.sh dist-saccharomyces_cerevisiae.txt > dist-saccharomyces_cerevisiae.tsv
cat dist-saccharomyces_cerevisiae.tsv
Note that the query is not added to the reference set.
Use -a
if you want the query to be added back to the library
ls library/
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna -a library/
ls library
Now that we have distances, let's add the query to the tree using APPLES.
run_apples.py -t backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace
cat cerevisiae.jplace
The output is in the .jplace format. Examine it a bit.
You can always convert the .jplace to newick. We will use guppy, but there are more recent tools.
# Turn output to newick
guppy tog cerevisiae.jplace
cat cerevisiae.tog.tre
nw_display cerevisiae.tog.tre
Now, let's move on to working with genome skims instead of assembled genomes.
We start with simulating some genome skimming. In real life, you would have these.
If you want, you can skip this step and use the data available at /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/
.
# To copy existing files,
cp -r /cluster/projects/nn9458k/oh_know/teachers/smirarab/skims/ .
# To use ART to create genome skims from your genomes at 2X coverage
mkdir skims skims/nonhybrids skims/nonhybrids-query/
ls nonhybrids*/*fna|xargs -n1 -I@ /cluster/projects/nn9458k/oh_know/teachers/bin/art_bin_MountRainier/art_illumina -i @ -l 100 -f 2 -o skims/@
cd skims
ls */
Reference set
# Build library of skims
skmer reference -t nonhybrids/
ls library
cat library/Saccharomyces_paradoxus.fna/Saccharomyces_paradoxus.fna.dat
grep "" library/*/*.fna.dat
Compare with query:
# Compute distance of query to backbone sequence
skmer query -t nonhybrids-query/Saccharomyces_cerevisiae.fna.fq -a library/
sed -i -e "s/.fna//g" dist-saccharomyces_cerevisiae.fna.txt # Remove extra .fna from species names.
# Convert file format
convert_to_tsv.sh dist-saccharomyces_cerevisiae.fna.txt > dist-saccharomyces_cerevisiae.tsv
cat dist-saccharomyces_cerevisiae.tsv
Let us compare assembly-based and assembly-free results
### Start with genome lengths
paste <(grep genome_length library/*/*.dat ) <(grep genome_length ../library/*/*dat)
### And move on to distances
paste dist-saccharomyces_cerevisiae.fna.txt ../dist-saccharomyces_cerevisiae.txt
OK, we can now move on to phylogenetic placement
# Run APPLES
run_apples.py -t ../backbone-fastme.tre -d dist-saccharomyces_cerevisiae.tsv -o cerevisiae.jplace
# Turn output to newick
guppy tog cerevisiae.jplace
nw_display cerevisiae.tog.tre
# Compare to the genome-based run
nw_display ../cerevisiae.tog.tre
cd ..
-
For preprocessing we use BBtools tool (https://jgi.doe.gov/data-and-tools/bbtools/).
-
We stick to 5 steps:
-
To decontaminate .fastq files we used
bbduk.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out1=FASTQ_READ1 out2=FASTQ_READ2 ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 tpe tbo
-
To deduplicate reads we used
dedupe.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=DEDUP_OUTPUT_FASTQ_FILE
-
To reformat deduplicated output files we used
reformat.sh in=DEDUP_OUTPUT_FASTQ_FILE out1=FASTQ_READ1 out2=FASTQ_READ2
-
To merge paired-end reads .fastq we used
bbmerge.sh in1=FASTQ_READ1 in2=FASTQ_READ2 out=merged.fastq outu1=read1_unmerged.fastq outu2=read2_unmerged.fastq
-
Subsequently we concatenate merged and unmerged outputs
cat merged.fastq read1_unmerged.fastq read2_unmerged.fastq > OUTPUT_FILE
The output of deduplication is interleaved fastq so it needs to be reformatted into two separate Read1 and Read2 fastqs. Next, we usually merge Read1 and Read2 outputs. We would suggest preserving outu files (unmerged part of fastqs) and cat them with the merged fastq. Otherwise, coverage might be too low for Skmer to handle.
APPLES can also be used with aligned sequences. Below is an example.
On some simulated data.
# Get example input data
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/ref.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/query.fa
wget https://raw.githubusercontent.com/balabanmetin/apples/master/data/backbone.nwk
sed -i -e "s/>L/>QU/g" query.fa # Change query names to become easy to distinguish
# Recompute branch lengths on the backbone tree
FastTree -nosupport -nt -nome -noml -log tree.log -intree backbone.nwk < ref.fa > minimum_evo_backbone.nwk
run_apples.py -s ref.fa -q query.fa -t minimum_evo_backbone.nwk -o aligned.jplace
# Turn output to newick
guppy tog aligned.jplace
guppy tog --xml aligned.jplace
More advanced options on real data
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/backbone_ml.nwk
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/query_nothird.fa
wget https://raw.githubusercontent.com/balabanmetin/hslU/master/ref_nothird.fa
# Reestimate branch lengths!
FastTree -nt -nosupport -nome -noml -log true_me.fasttree.log -intree backbone_ml.nwk < ref_nothird.fa > true_me.fasttree
# Run APPLES
run_apples.py -t true_me.fasttree -s ref_nothird.fa -q query_nothird.fa -m FM -c MLSE -f 0.2 -b 25 -o apples.jplace
guppy tog --xml apples.jplace