Anvi'o Pangenomic Workflow - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki

You can run this code directly on the command line, or segment it into bash scripts to submit to SLURM.
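For the SLURM route, a minimal submission-script sketch; the job name, time limit, and resource values below are placeholders you should adjust for your cluster and for the step you are running:

```shell
# write a minimal SLURM batch script; every #SBATCH value here is a placeholder
cat > run_anvio_step.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=anvio_step
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=20
#SBATCH --mem=32gb
#SBATCH --output=anvio_step_%j.log

module load anvio
# paste the commands for one workflow step here
EOF
```

Submit it with `sbatch run_anvio_step.sh` and check on it with `squeue -u yourusername`.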

Information about running on Docker

If you are running Anvi'o on Docker, this will get you into the Docker environment:

      docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest

When you want to CLOSE OUT OF DOCKER, first press Ctrl+C to kill the Anvi'o server that is listening on port 8080. THEN you can log out of the Docker environment by pressing Ctrl+D. Failure to do so will require you to restart Docker entirely, or find a way to kill the process on port 8080.
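If you do end up with a stray server on port 8080, something like this should find and kill it (this assumes `lsof` is available, as it is on most Linux and macOS systems):

```shell
# find any PID listening on TCP port 8080; kill it if one exists
pid=$(lsof -ti :8080 || true)
if [ -n "$pid" ]; then
    kill $pid
fi
```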

Reformat Fasta Files

I separated the old files out just for organizational purposes, but you can reorganize them however you see fit.

      module load anvio
      for i in *.fa
      do
           anvi-script-reformat-fasta $i -o r_$i --min-len 0 --simplify-names
      done
      mkdir originalFastaFiles
      shopt -s extglob  #enable the !( ) pattern if your shell doesn't have it on already
      mv -- !(r_*.fa|originalFastaFiles) originalFastaFiles

Generate Gene Calls with Prokka

Load Prokka and run it in the same folder as all the renamed fasta files. This step takes a while, so I suggest running it as a batch job; that way, if your connection goes out or you need to close your terminal, the program will keep running.

      mkdir prok_docs
      module load prokka
      for i in `ls *.fa | awk -F "/" '{print $1}' | sed "s/.fa//g"`
      do
           prokka --outdir prok_docs/pk_$i --prefix $i $i.fa --cpus 20
      done
      rmdir tmp

Separate and organize the files so that you have the reformatted .fa and .gff files in the same place.

      mkdir reform_fa_gff
      cp prok_docs/pk_*/*.gff reform_fa_gff/
      mv r_*.fa reform_fa_gff
      cd reform_fa_gff

Parse gff files

Parse gff files using gff_parser.py to get gene annotations and gene calls. First you need to download the gff_parser tool from github.

      wget https://raw.githubusercontent.com/karkman/gff_parser/master/gff_parser.py -O gff_parser.py

If you are NOT running on the HiperGator you will need to make sure that you have gffutils and argparse installed.

      pip install gffutils
      pip install argparse

Now we can run this code.

      module load python
      for i in `ls *.gff | awk -F "/" '{print $1}' | sed "s/.gff//g"`
      do
           python gff_parser.py "$i".gff --gene-calls "$i"_gene_calls.txt --annotation "$i"_gene_annot.txt
      done

There should be a _gene_calls.txt and a _gene_annot.txt file for every .fa and .gff file. Store them in the folder with the reformatted files.
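A quick sanity check that nothing was skipped; this only compares filenames, nothing anvi'o-specific:

```shell
# report any genome whose gff_parser output is missing
for g in *.gff; do
    base=${g%.gff}
    [ -f "${base}_gene_calls.txt" ] || echo "missing gene calls for $base"
    [ -f "${base}_gene_annot.txt" ] || echo "missing annotations for $base"
done
```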

Generate Contigs Database

Something optional you can do is set up NCBI's Clusters of Orthologous Groups (COG) functions.

      mkdir COGS
      anvi-setup-ncbi-cogs --cog-data-dir COGS #or whatever path to the COG files

This is one that I do suggest running as a SLURM job, just because of how long it takes. Make sure to run it in the same folder as the _gene_calls, _gene_annot, and .fa files. If done correctly, this should output a .db file for every .fa file.

      module load anvio
      for i in `ls *.fa | awk -F "/" '{print $1}' | sed "s/.fa//g"`
      do
           anvi-gen-contigs-database -f "$i".fa -o "$i".db --external-gene-calls "$i"_gene_calls.txt -n "$i"
           anvi-import-functions -c "$i".db -i "$i"_gene_annot.txt
           anvi-run-hmms -c "$i".db
           anvi-run-ncbi-cogs -c "$i".db --cog-data-dir COGS/ #Optional, if you set up the COGs
      done

Generate Genome Database

You will need to make a tab-delimited text file called external-genomes.txt. At its simplest, it should contain the name of each genome and the path to its contigs database (.db). Make sure you have this file in the same folder as all of your .db files.
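For reference, a sketch of what the file can look like; the genome names and .db paths below are made-up examples, while the two header column names (`name`, `contigs_db_path`) are what anvi'o expects:

```shell
# build a minimal external-genomes.txt; the genome names/paths are example values
printf 'name\tcontigs_db_path\n'     >  external-genomes.txt
printf 'genome_01\tr_genome_01.db\n' >> external-genomes.txt
printf 'genome_02\tr_genome_02.db\n' >> external-genomes.txt
```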

      module load anvio
      anvi-gen-genomes-storage -e external-genomes.txt -o _inputNameOfProject_-GENOMES.db --gene-caller Prodigal

Run Pangenome Analysis

Now we can run the pangenomic analysis on our genome database. This should be run in the same directory as your _inputNameOfProject_-GENOMES.db file.

      anvi-pan-genome -g _inputNameOfProject_-GENOMES.db -n _inputNameOfProject_

Display

You cannot use the HiperGator for the actual display; you'll need to transfer the _inputNameOfProject_ folder created in the step above, along with _inputNameOfProject_-GENOMES.db, and then run the commands locally. First start up the Anvi'o Docker container.

      docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest

Now we can display our results with

   anvi-display-pan -p _inputNameOfProject_-PAN.db -g _inputNameOfProject_-GENOMES.db

Open up a browser such as Google Chrome and go to http://localhost:8080 to see your results. When you're finished, make sure you press Ctrl+C to kill the server before closing the terminal or Docker.

Layer Additional Data

Take a look at layer-additional-data.txt in vi or vim to make sure that it is tab-delimited.
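You can also check from the shell; `cat -A` prints tabs as `^I` (GNU coreutils; on macOS use `cat -vet`), and the awk line prints one field count per distinct row width, so more than one line of output means a ragged table:

```shell
# tabs show up as ^I, line endings as $
cat -A layer-additional-data.txt | head

# one number per distinct tab-separated field count
awk -F'\t' '{print NF}' layer-additional-data.txt | sort -u
```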

      anvi-import-misc-data layer-additional-data.txt \
                            -p HaloEndo/HaloEndo-PAN.db \
                            --target-data-table layers

Get Enriched Functions

You can choose whichever category you want.

      anvi-get-enriched-functions-per-pan-group -p HaloEndo/HaloEndo-PAN.db \
                                                -g HaloEndo-GENOMES.db \
                                                --category isolate_genus \
                                                --annotation-source Prokka:Prodigal \
                                                -o HaloEndo-PAN-enriched-functions-genus.txt \
                                                --functional-occurrence-table-output HaloEndo-functions-occurrence-genus.txt

Clean up the functional occurrence table

YOU HAVE TO MANUALLY ADD literal TAB characters in three places below: inside the bracket expression after `[:alnum:]`, in the last sed's match after `s/^`, and in its replacement after `name`. Copy-pasting usually turns them into spaces, so type them by hand (Ctrl+V then Tab on the command line) and check the script with vi before running it.

      sed "s/[^[:alnum:]  _]/_/g" HaloEndo-functions-occurrence-genus.txt | \
      tr -s \_ _ | \
      sed 's/^  /name     /' \
      > HaloEndo-functions-occurrence-fixed-a-little.txt
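If you would rather not type literal tabs, bash's `$'...'` quoting expands `\t` to a real tab character, so an equivalent version of the pipeline above (same input and output filenames) can be written as:

```shell
# $'\t' expands to a literal tab, so no hand-typed tabs are needed
sed $'s/[^[:alnum:]\t_]/_/g' HaloEndo-functions-occurrence-genus.txt | \
tr -s '_' | \
sed $'s/^\t/name\t/' \
> HaloEndo-functions-occurrence-fixed-a-little.txt
```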

Cut down on repeat names and merge some instances

      cut -f 1 HaloEndo-functions-occurrence-fixed-a-little.txt | sort | uniq -d
      wget https://gist.githubusercontent.com/ShaiberAlon/aff0b2493637a370c7d52e1a5aacecea/raw/7e2647fa391bd55617cd4d7685c0056600ec4eae/fix_functional_occurrence_table.py
      python fix_functional_occurrence_table.py HaloEndo-functions-occurrence-fixed-a-little.txt HaloEndo-functions-occurrence-fixed.txt

Create Trees for the Interface

      anvi-matrix-to-newick HaloEndo-functions-occurrence-fixed.txt \
                       -o HaloEndo-functions-tree.txt

      anvi-matrix-to-newick HaloEndo-functions-occurrence-fixed.txt \
                       -o HaloEndo-functions-layers-tree.txt \
                       --transpose

Create a dry-run of the Database

      anvi-interactive -p HaloEndo-functions-manual-profile.db \
                  --tree HaloEndo-functions-tree.txt \
                  -d HaloEndo-functions-occurrence-fixed.txt \
                  --manual \
                  --dry-run

Import layers for the tree

           echo -e "item_name\tdata_type\tdata_value" > HaloEndo-functions-layers-order.txt
           echo -e "HaloEndo_functions_tree\tnewick\t`cat HaloEndo-functions-layers-tree.txt`" \
                                   >> HaloEndo-functions-layers-order.txt

           anvi-import-misc-data HaloEndo-functions-layers-order.txt \
                            -p HaloEndo-functions-manual-profile.db \
                            -t layer_orders \
                            --just-do-it

Import Information from Previous PAN.db

           anvi-export-misc-data -p HaloEndo/HaloEndo-PAN.db \
                -t layers \
                -o HaloEndo-layer-additional-data.txt

Add that data to the manual Database

                anvi-import-misc-data HaloEndo-layer-additional-data.txt \
                  -p HaloEndo-functions-manual-profile.db \
                  -t layers

Visualization

      anvi-interactive -p HaloEndo-functions-manual-profile.db \
                       -t HaloEndo-functions-tree.txt \
                       -d HaloEndo-functions-occurrence-fixed.txt \
                       --title "HaloEndo Pan - functional occurrence" \
                       --manual