1. Phylogenetic classification, genotyping and data processing

MLST, e-serotype, phylogroup and genotype were determined using ARIBA and the results subsequently processed using ARIBAlord. A Jupyter notebook can be found here; it has been modified for viewing on this page.

Australian Avian Pathogenic E. coli - Collection 1

This is a Jupyter notebook documenting the workflow for the analysis of a collection of E. coli provided by the University of Melbourne. The collection consists of E. coli that originate from poultry operations around Australia.

It covers the genotyping and phylogrouping of the samples: detection of genes associated with virulence, antimicrobial resistance, plasmid carriage, phylogroup, e-serotype and multi-locus sequence type using ARIBA, followed by processing of the ARIBA output with ARIBAlord.

Subsequent notebooks will cover genome assembly and the generation of phylogenetic trees using Snippy and PhyloSift.

This workflow assumes you have Miniconda3 installed (note the 3! Let's leave Python 2 in the past...). If you do not, open a new terminal window, enter the following commands and follow any installation prompts, then switch back to this workbook and continue as normal. If you come across any issues you can find installation instructions for Miniconda here.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh
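
Once the installer finishes, conda should be available on your PATH (you may need to open a fresh terminal first). A quick check:

#Confirm conda is installed and on your PATH
conda --version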

Next we will create two environments: one for ARIBA and one for ARIBAlord. We will then activate the ARIBA environment before starting the workflow.

conda create -y -n ariba -c bioconda ariba
conda create -y -n aribalord git pandas regex
source activate ariba
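
A quick sanity check that the install worked - ariba version prints the version of ARIBA and its dependencies:

#Print ARIBA and dependency version information to confirm the install
ariba version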

Sourcing ARIBA databases

We now download our ARIBA databases from the Centre for Genomic Epidemiology. Note that we also create an info file for each database - it is important to record which database versions were used for our analysis.

mkdir ~/ARIBA_databases
cd ~/ARIBA_databases

#get databases for use in ARIBA
ariba getref plasmidfinder ~/ARIBA_databases/plasmidfinder > ~/ARIBA_databases/plasmidfinder_version.info
ariba getref resfinder ~/ARIBA_databases/resfinder > ~/ARIBA_databases/resfinder_version.info
ariba getref virulencefinder ~/ARIBA_databases/virulencefinder > ~/ARIBA_databases/virulencefinder_version.info
ariba pubmlstget "Escherichia coli#1" ~/ARIBA_databases/get_mlst
wget https://raw.githubusercontent.com/maxlcummins/E_coli_customDB/master/EC_customDB.fasta -P ~/ARIBA_databases
wget https://raw.githubusercontent.com/maxlcummins/E_coli_customDB/master/E_coli_phylogroup.fasta -P ~/ARIBA_databases
wget https://raw.githubusercontent.com/katholt/srst2/master/data/EcOH.fasta -P ~/ARIBA_databases

#Making version info files and moving them to the same directory
echo "MLST database accessed on:" > ~/ARIBA_databases/MLST_version.info; date >> ~/ARIBA_databases/MLST_version.info
echo "EC_customDB database accessed on:" > ~/ARIBA_databases/EC_customDB.info; date >> ~/ARIBA_databases/EC_customDB.info
echo "E_coli_phylogroup database accessed on:" > ~/ARIBA_databases/E_coli_phylogroup.info; date >> ~/ARIBA_databases/E_coli_phylogroup.info
echo "EcOH database accessed on:" > ~/ARIBA_databases/EcOH.info; date >> ~/ARIBA_databases/EcOH.info
mkdir ~/ARIBA_databases/database_info; mv ~/ARIBA_databases/*.info ~/ARIBA_databases/database_info
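
To check back on these versions later, the info files can simply be printed:

#Review the recorded database versions and access dates
cat ~/ARIBA_databases/database_info/*.info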

Prepareref

ARIBA's prepareref step prepares sequence databases for use by ariba run. Note that the default behaviour of this step is to remove sequences that are not coding sequences - this is why the non-coding databases below are run with --all_coding no - and we make a '.warnings' file for each database as a record of any removed sequences.

ariba prepareref -f ~/ARIBA_databases/plasmidfinder.fa -m ~/ARIBA_databases/plasmidfinder.tsv ~/ARIBA_databases/plasmidfinder.prepareref > ~/ARIBA_databases/plasmidfinder.warnings
ariba prepareref -f ~/ARIBA_databases/resfinder.fa -m ~/ARIBA_databases/resfinder.tsv ~/ARIBA_databases/resfinder.prepareref > ~/ARIBA_databases/resfinder.warnings
ariba prepareref -f ~/ARIBA_databases/virulencefinder.fa -m ~/ARIBA_databases/virulencefinder.tsv ~/ARIBA_databases/virulencefinder.prepareref > ~/ARIBA_databases/virulencefinder.warnings
# No prepareref step is required for the MLST database - ariba pubmlstget already produced a prepared reference (get_mlst/ref_db)
ariba prepareref --all_coding no -f ~/ARIBA_databases/EC_customDB.fasta ~/ARIBA_databases/EC_customDB.prepareref > ~/ARIBA_databases/EC_customDB.warnings
ariba prepareref --all_coding no -f ~/ARIBA_databases/E_coli_phylogroup.fasta ~/ARIBA_databases/E_coli_phylogroup.prepareref > ~/ARIBA_databases/E_coli_phylogroup.warnings
ariba prepareref --all_coding no -f ~/ARIBA_databases/EcOH.fasta ~/ARIBA_databases/EcOH.prepareref > ~/ARIBA_databases/EcOH.warnings
mkdir ~/ARIBA_databases/database_warnings; mv ~/ARIBA_databases/*.warnings ~/ARIBA_databases/database_warnings
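
It is worth skimming the warnings files to see which sequences were dropped; as a rough gauge, count the lines in each:

#Count lines per warnings file as a rough indicator of how many sequences were flagged
wc -l ~/ARIBA_databases/database_warnings/*.warnings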

Submitting ARIBA jobs to the cluster

Jobs were submitted with qsub to a high-performance computing (HPC) cluster at UTS. The qsub script can be accessed on GitHub as below. If you would like to use it on a different cluster this script will require some editing.

Syntax for this qsub job submission is as follows:

qsub -v READ_PATH=\*path_to_reads\*,REF=\*prepareref_database\*,OUT=\*output_suffix\*,STU_NO=\*student_number\* ~/qsub/ariba_scratch.qsub 

Currently, a student number is required. Providing this will trigger alerts from the HPC, via student email, when a job has started, has been aborted, or has completed. Changing "$STU_NO" on line 9 of ariba_scratch.qsub to any valid email address will also work.

#Before we submit the jobs we first remove any old copy of ariba_scratch.qsub, then download the latest version from GitHub with 'wget'
rm -f ~/qsub/ariba_scratch.qsub
wget https://raw.githubusercontent.com/maxlcummins/Bioinformatics/master/ariba_scratch.qsub -P ~/qsub

#Submit ariba run via qsub to the computing cluster
cd ~
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/plasmidfinder.prepareref,OUT=plasmidfinder,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/resfinder.prepareref,OUT=resfinder,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/virulencefinder.prepareref,OUT=virulencefinder,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/get_mlst/ref_db,OUT=MLST,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/EC_customDB.prepareref,OUT=EC_customDB,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/E_coli_phylogroup.prepareref,OUT=E_coli_phylogroup,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
qsub -v READ_PATH=BigChook_Reads/reads,REF=ARIBA_databases/EcOH.prepareref,OUT=EcOH,STU_NO=11025234 ~/qsub/ariba_scratch.qsub
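
While the jobs run, their status can be monitored from the head node. A sketch assuming a PBS-style scheduler (as the use of qsub suggests):

#List your queued and running jobs (PBS-style scheduler assumed)
qstat -u $USER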

Summarising output from ARIBA runs

Next we need to summarise the output of each run. For more detailed information about this step, check out the ARIBAlord GitHub repo.

Say we have 100 read-pairs, corresponding to 100 samples, all beginning with 'AVC'. Once the job submissions from the step above have completed (you should be notified by email), each read-pair will have an output folder for every database we used to screen the samples.
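
Before summarising, it is worth confirming that every run completed. The sketch below counts the plasmidfinder output folders, which should match the number of read-pairs (100 in this example); repeat for the other databases as needed:

#Sanity check: one *.out folder per read-pair is expected for each database
ls -d ~/BigChook_Reads/reads/*plasmidfinder.out | wc -l

With all runs accounted for, summarise each database with ariba summary: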

ariba summary --no_tree --cluster_cols assembled,ref_seq ~/BigChook_Reads/reads/BigChook3_plasmidfinder ~/BigChook_Reads/reads/*plasmidfinder.out/report.tsv
ariba summary --no_tree --cluster_cols assembled,ref_seq ~/BigChook_Reads/reads/BigChook3_resfinder ~/BigChook_Reads/reads/*resfinder.out/report.tsv
ariba summary --no_tree --cluster_cols assembled,ref_seq ~/BigChook_Reads/reads/BigChook3_virulencefinder ~/BigChook_Reads/reads/*virulencefinder.out/report.tsv
#MLST output must be summarised differently - see the pre-processing step below
ariba summary --no_tree --cluster_cols assembled,ref_seq ~/BigChook_Reads/reads/BigChook3_EC_customDB ~/BigChook_Reads/reads/*EC_customDB.out/report.tsv
ariba summary --no_tree --cluster_cols assembled,ref_seq ~/BigChook_Reads/reads/BigChook3_E_coli_phylogroup ~/BigChook_Reads/reads/*E_coli_phylogroup.out/report.tsv
ariba summary --no_tree --cluster_cols assembled,ref_seq ~/BigChook_Reads/reads/BigChook3_EcOH ~/BigChook_Reads/reads/*EcOH.out/report.tsv


#MLST pre-processing: append each report's file path as an extra column so samples can be identified after concatenation
for f in ~/BigChook_Reads/reads/*MLST.out/mlst_report.tsv; do paste "$f" <(yes "$f" | head -n $(cat "$f" | wc -l)) > "$f.new"; done
cat ~/BigChook_Reads/reads/*MLST.out/*.new > ~/BigChook_Reads/reads/BigChook3_MLST.tsv
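
The paste/yes one-liner is a little cryptic: yes repeats the file path, head trims it to the report's line count, and paste tacks the path onto every row. An equivalent, arguably clearer sketch using awk:

#Append the source file path (awk's FILENAME) as a final column of each report
for f in ~/BigChook_Reads/reads/*MLST.out/mlst_report.tsv; do
    awk -v OFS='\t' '{print $0, FILENAME}' "$f" > "$f.new"
done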

mkdir ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries
cp ~/BigChook_Reads/reads/*.csv ~/BigChook_Reads/reads/BigChook3_MLST.tsv ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries
rm ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries/*.phandango*

ARIBAlord - combining ARIBA outputs

Here we switch to the aribalord conda environment we created at the start of the workflow for running... ARIBAlord...

Its dependencies (git, pandas and regex) were installed when the environment was created, so we deactivate the ariba environment, activate aribalord and clone the GitHub repository to the home directory. We then run ARIBAlord with python from within the cloned repo, feeding in two arguments:

  1. The path to the ARIBA_summaries (contains the outputs of our ariba summary and MLST processing steps)
  2. An output prefix (in this case I have chosen BigChook3)
source deactivate
source activate aribalord

cd ~
git clone https://github.com/maxlcummins/ARIBAlord.git

Cleaning sample names in the ARIBA output files

#To clean sample names in the non-MLST ARIBA outputs
perl -pi -e 's/_R1.*tsv//g' ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries/BigChook3_*.csv
perl -pi -e 's/.*AVC/AVC/g' ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries/BigChook3_*.csv

#To clean sample names in the MLST ARIBA output
perl -pi -e 's/_R1.*tsv//g' ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries/BigChook3_MLST.tsv
perl -pi -e 's/\/shared.*reads\///g' ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries/BigChook3_MLST.tsv
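
To preview what a substitution will do before editing files in place, drop the -i flag so perl writes the transformed lines to stdout instead:

#Dry run: print the transformed lines without modifying the file
perl -pe 's/_R1.*tsv//g' ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries/BigChook3_MLST.tsv | head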

Running ARIBAlord to combine tables

python ~/ARIBAlord/ARIBAlord ~/BigChook_Reads/reads/BigChook3_ARIBA_summaries ~/BigChook_Reads/reads/BigChook3
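
ARIBAlord writes its combined tables using the output prefix supplied as the second argument (BigChook3 here); listing files under that prefix shows what was produced:

#List the files ARIBAlord wrote under the chosen output prefix
ls ~/BigChook_Reads/reads/BigChook3*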