Anvi'o Pangenomic Workflow - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki
You can run these commands directly on the command line or split them into bash scripts to submit to SLURM.
If you are running Anvio on Docker, this will get you into the Docker environment:
docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest
When you want to close out of Docker, first press Ctrl+C to kill the Anvio server that is listening on port 8080. Then you can log out of the Docker environment by pressing Ctrl+D. Failure to do so will require you to restart Docker entirely, or find a way to kill the process on port 8080.
I separated the old files out just for organizational purposes, but you can reorganize however you see fit.
module load anvio
for i in *.fa
do
anvi-script-reformat-fasta "$i" -o "r_$i" --min-len 0 --simplify-names
done
mkdir originalFastaFiles
shopt -s extglob  # the !(...) pattern needs bash's extended globbing
mv -- !(r_*.fa|originalFastaFiles) originalFastaFiles
Load Prokka and run it in the same folder as all the 'renamed' fasta files. This script takes a while, so I suggest running it as a batch job; that way, if your connection drops or you need to close your terminal, the program will keep running.
mkdir prok_docs
module load prokka
for i in `ls *.fa | awk -F "/" '{print $1}' | sed 's/\.fa$//'`
do
prokka --outdir prok_docs/pk_$i --prefix $i $i.fa --cpus 20
done
rmdir tmp
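Since the Prokka loop is a good candidate for a batch job, one way to wrap it for SLURM is sketched below. This is only a sketch: the file name prokka_job.sh, the job name, the memory request, the time limit, and the log file name are all placeholders (assumptions) that you should adjust for your own cluster and allocation. The loop does the same work as the one above, but strips the .fa extension with shell parameter expansion.

```shell
# Write a SLURM batch script that wraps the Prokka loop from above.
# The file name prokka_job.sh and every #SBATCH value are placeholders (assumptions).
cat > prokka_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=prokka_annot
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=prokka_%j.log

module load prokka
mkdir -p prok_docs
for f in *.fa
do
    i="${f%.fa}"    # file name without the .fa extension
    prokka --outdir "prok_docs/pk_$i" --prefix "$i" --cpus 20 "$f"
done
EOF
echo "wrote prokka_job.sh; submit it with: sbatch prokka_job.sh"
```

Run this in the folder with the reformatted fasta files, then submit the generated script with sbatch prokka_job.sh.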
Separate and organize the files so that you have the reformatted .fa and .gff files in the same place.
mkdir reform_fa_gff
cp prok_docs/pk_*/*.gff reform_fa_gff/
mv r_*.fa reform_fa_gff
cd reform_fa_gff
Parse the .gff files using gff_parser.py to get gene annotations and gene calls. First you need to download the gff_parser tool from GitHub.
wget https://raw.githubusercontent.com/karkman/gff_parser/master/gff_parser.py -O gff_parser.py
If you are NOT running on the HiperGator, you will need to make sure that you have gffutils and argparse installed.
pip install gffutils
pip install argparse
Now we can run this code.
module load python
for i in `ls *.gff | awk -F "/" '{print $1}' | sed 's/\.gff$//'`
do
python gff_parser.py "$i".gff --gene-calls "$i"_gene_calls.txt --annotation "$i"_gene_annot.txt
done
There should be a _gene_calls.txt and a _gene_annot.txt file for every .fa and .gff file. Store them in the folder with the reformatted files.
Something optional you can do is set up NCBI's Clusters of Orthologous Groups (COG) functions.
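The one-set-of-files-per-genome check described above can also be scripted. The sketch below is a hypothetical helper (check_gene_files is not part of Anvio, Prokka, or gff_parser) that assumes the naming used in this workflow, where every NAME.fa should sit next to NAME.gff, NAME_gene_calls.txt, and NAME_gene_annot.txt.

```shell
# Report any reformatted fasta file that is missing one of its companion files.
# Usage: check_gene_files [directory]   (defaults to the current directory)
# Returns a non-zero exit status if anything is missing or empty.
check_gene_files() (
    dir="${1:-.}"
    missing=0
    for f in "$dir"/*.fa
    do
        [ -e "$f" ] || continue          # the directory has no .fa files
        base="${f%.fa}"                  # path without the .fa extension
        for suffix in .gff _gene_calls.txt _gene_annot.txt
        do
            if [ ! -s "$base$suffix" ]
            then
                echo "missing or empty: $base$suffix"
                missing=$((missing + 1))
            fi
        done
    done
    echo "$missing file(s) missing"
    [ "$missing" -eq 0 ]
)
```

Calling check_gene_files with no arguments inside reform_fa_gff lists whatever still needs to be generated before the contigs databases are built.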
mkdir COGS
anvi-setup-ncbi-cogs --cog-data-dir COGS #or whatever path to the COG files
This is one that I do suggest running as a SLURM job just because of how long it takes. Make sure to run it in the same folder as the _gene_calls.txt, _gene_annot.txt, and .fa files. If done correctly, this should output a .db file for every .fa file.
module load anvio
for i in `ls *.fa | awk -F "/" '{print $1}' | sed 's/\.fa$//'`
do
anvi-gen-contigs-database -f "$i".fa -o "$i".db --external-gene-calls "$i"_gene_calls.txt -n "$i"
anvi-import-functions -c "$i".db -i "$i"_gene_annot.txt
anvi-run-hmms -c "$i".db
anvi-run-ncbi-cogs -c "$i".db --cog-data-dir COGS/ #Optional for if you setup the COGS
done
You will need to make a tab-delimited text file called external-genomes.txt. At its simplest it should contain two columns, name and contigs_db_path: the name of the genome and the path to its contigs .db file. See an example external-genomes.txt here. Make sure you have this in the same folder as all of your .db files.
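If you have many genomes, the file can be generated instead of typed. The sketch below is a hypothetical helper (make_external_genomes is not an Anvio command) that assumes every .db file in the folder is a contigs database, apart from any -GENOMES.db or -PAN.db files, and that the header columns are name and contigs_db_path. Anvio can be picky about genome names, so anything other than letters, digits, and underscores is replaced with an underscore; check the result before using it.

```shell
# Print an external-genomes table (name <TAB> contigs_db_path) for the .db files
# in a directory. Usage: make_external_genomes [directory] > external-genomes.txt
make_external_genomes() (
    dir="${1:-.}"
    printf 'name\tcontigs_db_path\n'
    for db in "$dir"/*.db
    do
        [ -e "$db" ] || continue                      # no .db files found
        case "$db" in
            *-GENOMES.db|*-PAN.db) continue ;;        # skip databases that are not contigs databases
        esac
        # Genome name: file name without .db, with unusual characters turned into underscores
        name=$(basename "$db" .db | tr -c 'A-Za-z0-9_\n' '_')
        printf '%s\t%s\n' "$name" "$db"
    done
)
```

For example, make_external_genomes . > external-genomes.txt writes the table for the current folder.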
module load anvio
anvi-gen-genomes-storage -e external-genomes.txt -o _inputNameOfProject_-GENOMES.db --gene-caller Prodigal
Now we can run the pangenomic analysis on our genomes database. This should be run in the same directory as your _inputNameOfProject_-GENOMES.db file.
anvi-pan-genome -g _inputNameOfProject_-GENOMES.db -n _inputNameOfProject_
You cannot use the HiperGator for the actual display. You'll need to transfer the inputNameOfProject folder created in the step above, along with inputNameOfProject-GENOMES.db, to your own computer and then run the commands locally. First start up the Anvio Docker environment.
docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest
Now we can display our results with
anvi-display-pan -p _inputNameOfProject_-PAN.db -g _inputNameOfProject_-GENOMES.db
Open up Google Chrome and go to http://localhost:8080 to see your results. When you're finished, make sure you press Ctrl+C to kill the server before closing the terminal or Docker.
Take a look at layer-additional-data.txt in vi or vim to make sure that it is tab-delimited.
anvi-import-misc-data layer-additional-data.txt \
-p HaloEndo/HaloEndo-PAN.db \
--target-data-table layers
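The tab check that goes with this import can also be done with a quick script rather than by eye. The sketch below is a hypothetical helper (check_tab_delimited is not an Anvio command); it only verifies that every line has the same number of tab-separated columns as the header line, which catches the common case of spaces sneaking in where tabs should be.

```shell
# Check that every line of a file has the same number of tab-separated columns
# as its header line. Usage: check_tab_delimited layer-additional-data.txt
# Prints a message for each problem line and returns non-zero if any are found.
check_tab_delimited() {
    awk -F '\t' '
        NR == 1    { ncol = NF }
        NF != ncol { printf "line %d has %d column(s), expected %d\n", NR, NF, ncol; bad = 1 }
        END {
            # A header with fewer than two columns usually means spaces were used instead of tabs
            if (NR == 0 || ncol < 2) { print "no tab-separated columns found"; bad = 1 }
            exit bad
        }
    ' "$1"
}
```

Running check_tab_delimited layer-additional-data.txt prints nothing when the file looks consistent.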
You can choose whichever category you want.
# You can choose whichever category you want for --category
anvi-get-enriched-functions-per-pan-group -p HaloEndo/HaloEndo-PAN.db \
-g HaloEndo-GENOMES.db \
--category isolate_genus \
--annotation-source Prokka:Prodigal \
-o HaloEndo-PAN-enriched-functions-genus.txt \
--functional-occurrence-table-output HaloEndo-functions-occurrence-genus.txt
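The sed expressions below need tab characters after [^[:alnum:], after s/^, and after /name. Typing literal tabs by hand is error-prone, so here they are supplied through a TAB variable using bash's $'\t' syntax. Running this as a script and checking the output with vi is a good precaution.
TAB=$'\t'
sed "s/[^[:alnum:]${TAB}_]/_/g" HaloEndo-functions-occurrence-genus.txt | \
tr -s \_ _ | \
sed "s/^${TAB}/name${TAB}/" \
> HaloEndo-functions-occurrence-fixed-a-little.txt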
Cut down on repeat names and merge some instances
cut -f 1 HaloEndo-functions-occurrence-fixed-a-little.txt | sort | uniq -d
wget https://gist.githubusercontent.com/ShaiberAlon/aff0b2493637a370c7d52e1a5aacecea/raw/7e2647fa391bd55617cd4d7685c0056600ec4eae/fix_functional_occurrence_table.py
python fix_functional_occurrence_table.py HaloEndo-functions-occurrence-fixed-a-little.txt HaloEndo-functions-occurrence-fixed.txt
anvi-matrix-to-newick HaloEndo-functions-occurrence-fixed.txt \
-o HaloEndo-functions-tree.txt
anvi-matrix-to-newick HaloEndo-functions-occurrence-fixed.txt \
-o HaloEndo-functions-layers-tree.txt \
--transpose
anvi-interactive -p HaloEndo-functions-manual-profile.db \
--tree HaloEndo-functions-tree.txt \
-d HaloEndo-functions-occurrence-fixed.txt \
--manual \
--dry-run
echo -e "item_name\tdata_type\tdata_value" > HaloEndo-functions-layers-order.txt
echo -e "HaloEndo_functions_tree\tnewick\t`cat HaloEndo-functions-layers-tree.txt`" \
>> HaloEndo-functions-layers-order.txt
anvi-import-misc-data HaloEndo-functions-layers-order.txt \
-p HaloEndo-functions-manual-profile.db \
-t layer_orders \
--just-do-it
Export the layer information from the previous PAN.db
anvi-export-misc-data -p HaloEndo/HaloEndo-PAN.db \
-t layers \
-o HaloEndo-layer-additional-data.txt
Add that data to the manual Database
anvi-import-misc-data HaloEndo-layer-additional-data.txt \
-p HaloEndo-functions-manual-profile.db \
-t layers
anvi-interactive -p HaloEndo-functions-manual-profile.db \
-t HaloEndo-functions-tree.txt \
-d HaloEndo-functions-occurrence-fixed.txt \
--title "HaloEndo Pan - functional occurrence" \
--manual