Methodology - Silantoi33/Genomic-Analysis-of-Carbapenem-Resistant-Pseudomonas-aeruginosa GitHub Wiki
Step 1: Downloading Sequence Data
We retrieved the sequences of all our strains from NCBI GenBank using the provided accession numbers. The sequences were already assembled, and we were unable to locate the raw reads for further analysis. Data retrieval was run from the Ubuntu terminal.
Setup
sudo apt install python3-pip
pip install ncbi-acc-download
ncbi-acc-download Version=0.2.8
Bash script
#!/bin/bash
#Authors: Silantoi & Kevin
#Date: 27/11/2023
#Modified Date: 07/12/2023
#Script name: fasta.acc.sh
#Usage: bash fasta.acc.sh
#Loop through each accession number in fasta.acc.list.txt
for file in $(cat /home/sequser/Downloads/GACRPA-Project/fasta.acc.list.txt)
do
#Download the FASTA file for this accession
ncbi-acc-download --format fasta "$file"
done
#Download the remaining FASTA files from the URLs listed in links.txt
for a in $(cat /home/sequser/Downloads/GACRPA-Project/links.txt)
do
wget "$a"
done
#Decompress the .gz files once all downloads have finished
gunzip *.gz
Output
/home/sequser/Downloads/GACRPA-Project/fasta_files
AE004091.2.fa
CP014999.1.fa
CP015001.1.fa
CP015003.1.fa
CP021380.2.fa
LVXB01.1.fsa_nt
PHSS01.1.fsa_nt
PHST01.1.fsa_nt
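A quick way to confirm that every accession produced a file is sketched below. The function name missing_accessions is hypothetical (it is not part of the original scripts), and the sketch assumes each downloaded file name starts with its accession, as in the listing above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original scripts.
# Prints every accession from the list file that has no file starting
# with that accession in the download directory.
missing_accessions() {
    local list_file=$1 fasta_dir=$2 acc f found
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines in the accession list
        [ -n "$acc" ] || continue
        found=no
        for f in "$fasta_dir/$acc"*; do
            # An unmatched glob stays literal, so test that the file exists
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || printf '%s\n' "$acc"
    done < "$list_file"
}

# Example call, using the paths from the script above:
# missing_accessions /home/sequser/Downloads/GACRPA-Project/fasta.acc.list.txt \
#     /home/sequser/Downloads/GACRPA-Project/fasta_files
```

An empty result means every listed accession has at least one matching file.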
Step 2: Running strain typing (MLST)
Bash script
#!/bin/bash
#Authors: Silantoi & Kevin
#Date: 27/11/2023
#Modified Date: 07/12/2023
#Script name:gacrpa_mlst.sh
#Usage: bash gacrpa_mlst.sh
#Description: For loop bash script to identify the multilocus sequence type (MLST) of each genome
#Make "conda activate" usable inside a non-interactive script, then activate the mlst environment
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate mlst
#Run mlst on every assembly and keep columns 1-3 (file, scheme, sequence type)
for a in *.fa *.fsa_nt
do
mlst "$a" | cut -f 1,2,3 >> /home/sequser/Downloads/GACRPA-Project/gacrpa_results/gacrpa_mlst.tsv
done
Output
AE004091.2.fa paeruginosa -
CP014999.1.fa paeruginosa 277
CP015001.1.fa paeruginosa 277
CP015003.1.fa paeruginosa 277
CP021380.2.fa paeruginosa 277
LVXB01.1.fsa_nt paeruginosa 277
PHSS01.1.fsa_nt paeruginosa 277
PHST01.1.fsa_nt paeruginosa 277
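The table above can be summarised per sequence type with a short awk call. This is a minimal sketch, assuming the tab-separated three-column layout (file, scheme, ST) written by the script; the function name count_sts is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original scripts.
# Counts how many genomes carry each sequence type in the mlst TSV.
# Column 3 holds the ST; "-" means mlst could not assign one.
count_sts() {
    awk -F'\t' 'NF >= 3 { n[$3]++ }
                END { for (st in n) printf "%s\t%d\n", st, n[st] }' "$1" |
        LC_ALL=C sort
}

# Example call, using the output path from the script above:
# count_sts /home/sequser/Downloads/GACRPA-Project/gacrpa_results/gacrpa_mlst.tsv
```

For the eight genomes listed above, this reports ST 277 for seven genomes and one genome without an assigned ST (-).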
Step 3: Screening for AMR, Virulence and Plasmid determinants
Abricate
Setup
conda create -n abricate -c bioconda abricate
List databases:
conda activate abricate
Version=abricate 1.0.1
abricate --list
All databases available in abricate are listed.
Output
DATABASE SEQUENCES DBTYPE DATE
ecoli_vf 2701 nucl 2021-Mar-27
resfinder 3142 nucl 2023-Nov-10
ecoh 597 nucl 2021-Mar-27
megares 6635 nucl 2021-Mar-27
ncbi 5386 nucl 2021-Mar-27
card 2631 nucl 2021-Mar-27
plasmidfinder 460 nucl 2021-Mar-27
argannot 2223 nucl 2021-Mar-27
vfdb 2597 nucl 2021-Mar-27
We used the CARD, ResFinder, VFDB, and PlasmidFinder databases.
CARD
The Comprehensive Antibiotic Resistance Database (CARD; http://arpcard.mcmaster.ca) is a manually curated resource containing high quality reference data on the molecular basis of antimicrobial resistance (AMR), with an emphasis on the genes, proteins and mutations involved in AMR.
ResFinder
ResFinder uses BLAST for identification of acquired antimicrobial resistance genes in whole-genome data.
VFDB
The virulence factor database (VFDB, http://www.mgc.ac.cn/VFs/) provides up-to-date knowledge of virulence factors (VFs) of various bacterial pathogens.
Bash script
#!/bin/bash
#Authors: Silantoi & Kevin
#Date: 27/11/2023
#Modified Date: 07/12/2023
#Script name: gacrpa.abricate.sh
#Usage: bash gacrpa.abricate.sh
#Description: For loop bash script to screen for AMR, virulence and plasmid determinants
#Tool: ABRicate
#Databases: ResFinder and CARD for AMR genes, VFDB for virulence factors, PlasmidFinder for plasmids
#Results directory
out=/home/sequser/Downloads/GACRPA-Project/gacrpa_results
#Screen every assembly (.fa and .fsa_nt) against each database in turn
for a in *.fa *.fsa_nt
do
#AMR genes (ResFinder, then CARD)
abricate --db resfinder --csv "$a" >> "$out"/gacrpa_resfinder.csv
abricate --db card --csv "$a" >> "$out"/gacrpa_card.csv
#Virulence factors
abricate --db vfdb --csv "$a" >> "$out"/gacrpa_vfdb.csv
#Plasmid replicons
abricate --db plasmidfinder --csv "$a" >> "$out"/gacrpa_plasmids.csv
done
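Each abricate call in the script writes its own header line, so the appended CSV files contain one header per genome. A minimal sketch for keeping only the first header is shown below; the function name dedupe_header and the output file name in the example are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original scripts.
# Prints the CSV with the first line kept as the header and every later
# line identical to that header dropped.
dedupe_header() {
    awk 'NR == 1 { header = $0; print; next }
         $0 != header' "$1"
}

# Example call (the output file name is an assumption):
# dedupe_header gacrpa_resfinder.csv > gacrpa_resfinder.clean.csv
```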
Results
Step 4: Genome Annotation using Prokka 1.11
Genome annotation is the process of identifying functional elements along the sequence of a genome, thus giving meaning to it. It is necessary because the sequencing of DNA produces sequences of unknown function.
Setup
conda create -n prokka
conda activate prokka
conda install -c bioconda prokka
Version=prokka 1.11
Script
#!/bin/bash
#Authors: Silantoi & Kevin
#Date: 27/11/2023
#Modified Date: 07/12/2023
#Script name: gacrpa.prokka.sh
#Usage: bash gacrpa.prokka.sh
#Description: Annotation of genomes using PROKKA: rapid prokaryotic genome annotation tool
for file in *.fa *.fsa_nt
do
#Strip the file extension (.fa or .fsa_nt) to get the accession
tag=${file%.*}
prokka --cpus 16 --prefix "$tag" --locustag "$tag" --outdir "$tag"_prokka "$file"
done
#Move the output to another directory, creating it first if needed
mkdir -p ./gacrpa_results/prokka
mv *_prokka ./gacrpa_results/prokka
Output
AE004091.2_prokka
CP014999.1_prokka
CP015001.1_prokka
CP015003.1_prokka
CP021380.2_prokka
LVXB01.1_prokka
PHSS01.1_prokka
PHST01.1_prokka
Step 5: Gene alignment using Roary 3.13.0
Roary is a high-speed, stand-alone pan-genome pipeline that takes annotated assemblies in GFF3 format (produced by Prokka (Seemann, 2014)) and calculates the pan genome.
Setup
conda create -n roary
conda activate roary
mamba install roary
Version=roary 3.13.0
Prokka writes its annotation in the .gff file format. The Roary pipeline uses these .gff files to align genes.
We changed (cd) into each Prokka results directory and copied the .gff files to a new directory called GFF.
GFF output
AE004091.2.gff
CP014999.1.gff
CP015001.1.gff
CP015003.1.gff
CP021380.2.gff
LVXB01.1.gff
PHSS01.1.gff
PHST01.1.gff
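The copy step described above can also be scripted. This is a minimal sketch, assuming the layout produced in Step 4 (one *_prokka directory per genome under a common parent) and a GFF directory located outside that parent; the function name collect_gff is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original scripts.
# Copies the .gff file from every Prokka output directory under
# prokka_root into a single directory for Roary.
collect_gff() {
    local prokka_root=$1 gff_dir=$2
    mkdir -p "$gff_dir"
    # Depth 2 = files directly inside each <accession>_prokka directory
    find "$prokka_root" -mindepth 2 -maxdepth 2 -type f -name '*.gff' \
        -exec cp {} "$gff_dir"/ \;
}

# Example call (the GFF destination path is an assumption):
# collect_gff ./gacrpa_results/prokka ./gacrpa_results/GFF
```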
Roary command to align the .gff files
roary -f roaryresults -e -n -v -p 16 *.gff
Output
accessory_binary_genes.fa
accessory.header.embl
clustered_proteins
core_accessory.tab
gene_presence_absence.csv
number_of_genes_in_pan_genome.Rtab
pan_genome_reference.fa
accessory_binary_genes.fa.newick
accessory.tab
core_accessory_graph.dot
core_alignment_header.embl
gene_presence_absence.Rtab
number_of_new_genes.Rtab
summary_statistics.txt
accessory_graph.dot
blast_identity_frequency.Rtab
core_accessory.header.embl
core_gene_alignment.aln
number_of_conserved_genes.Rtab
number_of_unique_genes.Rtab
The aligned core gene sequences are in core_gene_alignment.aln.
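Before building a tree, it can help to confirm that the alignment holds one sequence per genome and that all sequences have the same length. The sketch below assumes core_gene_alignment.aln is a multi-FASTA file; the function name alignment_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original scripts.
# Reports the number of sequences and the alignment length of a
# multi-FASTA alignment; exits non-zero if sequence lengths differ.
alignment_summary() {
    awk '
        function check() {
            if (!have_first) { first = len; have_first = 1 }
            else if (len != first) bad = 1
        }
        /^>/ { if (seen) check(); seen = 1; n++; len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (seen) check()
            if (bad) exit 1
            printf "%d sequences, %d columns\n", n, first
        }
    ' "$1"
}

# Example call:
# alignment_summary core_gene_alignment.aln
```

For this project the sequence count should equal the number of .gff files passed to Roary (eight).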
Step 6: Phylogenetic construction using RAxML-NG 0.9.0
Setup
conda create -n raxml
conda activate raxml
conda install raxml-ng
Run raxml-ng
raxml-ng --all --msa core_gene_alignment.aln --model LG+G8+F --tree pars{10} --bs-trees 200
Output
core_gene_alignment.aln.raxml.bestModel
core_gene_alignment.aln.raxml.bootstraps
core_gene_alignment.aln.raxml.mlTrees
core_gene_alignment.aln.raxml.startTree
core_gene_alignment.aln.raxml.bestTree
core_gene_alignment.aln.raxml.log
core_gene_alignment.aln.raxml.rba
core_gene_alignment.aln.raxml.support
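Note that Roary's core_gene_alignment.aln is a nucleotide alignment, whereas LG is an amino-acid substitution matrix. The sketch below shows how the same run could be repeated with a DNA model. GTR+G is an assumption chosen for illustration, not the model used in this project, and the function name run_raxml_dna and the DRY_RUN switch are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper, not part of the original workflow.
# Runs the same raxml-ng analysis as above with a nucleotide model
# (default GTR+G, an assumption). With DRY_RUN set, the command is
# printed instead of executed.
run_raxml_dna() {
    local msa=$1 model=${2:-GTR+G}
    set -- raxml-ng --all --msa "$msa" --model "$model" --tree 'pars{10}' --bs-trees 200
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Print the command without running it:
# DRY_RUN=1 run_raxml_dna core_gene_alignment.aln
# Run it:
# run_raxml_dna core_gene_alignment.aln
```

Because the default output prefix is the alignment name, running this in the same directory would overwrite the files listed above.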
iTOL
We visualized the phylogenetic tree using iTOL (Interactive Tree Of Life).
- Upload the raxml-ng output file core_gene_alignment.aln.raxml.bestTree.
- Once uploaded, under Branch lengths on the left side of the screen, select "ignore" to visualize the tree.
Results
Step 7: Comparative Genomic Analysis Using BRIG 0.95
We followed YouTube tutorials to install and run BRIG.
- To install BRIG, we followed a YouTube tutorial.
- After installing, we followed a second YouTube tutorial to run BRIG.