Plastid Genome Annotation with GeSeq - barrettlab/2021-Genomics-bootcamp GitHub Wiki

Plastid Genome Annotation with GeSeq

I have a complete, circular plastid genome. WooHooo! Now what do I do?

In order to publish one or more genomes, you will need to conduct genome annotation.

Good annotations are important for many reasons:

  • They allow you to conduct many downstream analyses, e.g. phylogenomics based on coding regions, molecular evolutionary analyses, population genomics…

  • They provide a useful resource for others conducting genomic analyses

  • Nearly every journal will require you to make your annotated genomes publicly available (NCBI GenBank, etc.) when you submit a manuscript.

You should now have an assembled, complete plastome as a FASTA file that you generated using FastPlast for Sorghum.

  1. You can open this file in a text editor, or using ‘nano’ in the UNIX terminal. Edit the FASTA header any way you wish, but I would suggest something informative like:
>Sorghum_halepense_McKain_1234_Alabama
ATGGCGCGTACACCGGT...
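
If you would rather script the header edit than do it by hand in nano, something like the following could work. This is only a minimal sketch: the function name `rename_fasta_header` and the starting header `>assembly_1` are made up for illustration, and it assumes the file holds a single record whose first line is the header.

```python
def rename_fasta_header(fasta_text, new_name):
    """Return FASTA text with the first header line replaced by '>new_name'."""
    lines = fasta_text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA header line ('>')")
    # Replace whitespace with underscores: many tools cut headers at the first space.
    lines[0] = ">" + "_".join(new_name.split())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical record standing in for the assembler's output.
    original = ">assembly_1\nATGGCGCGTACACCGGT\n"
    renamed = rename_fasta_header(original, "Sorghum_halepense_McKain_1234_Alabama")
    print(renamed, end="")
```

To use it on a real file, read the file into a string, pass it through the function, and write the result to a new file so the original assembly stays untouched.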
  2. Download the file and navigate to GeSeq.

  3. You’ll see a lot of options; here we will walk through one way to annotate your plastome.

  4. First, upload your FASTA plastome file in the top left, and choose either ‘circular’ or ‘linear,’ which determines how your plastome figure is drawn.

  • Circular is good for single plastomes, and it lets GeSeq automatically detect the inverted repeat (IR) regions (although you should already know these from the FastPlast output).

  • Linear is good for annotating multiple plastomes, especially if you want to make a nice figure later comparing plastid genome structure among several species/accessions. The downside is that for some reason GeSeq won’t allow OGDraw to show the IR in linear mode.

  5. Now it is time to choose some options. If you chose ‘circular,’ check the box that says ‘Annotate plastid Inverted Repeat (IR),’ and also the box that says ‘Annotate plastid trans-spliced rps12.’
  • rps12 is the only gene in (most) plastomes that is trans-spliced, meaning the 5’ and 3’ exons are not adjacent. (To make matters worse, 2 of the 3 exons of rps12 are in the IR, which is a pain to annotate, but GeSeq will do it for you.)
  6. Under Annotation Support, click ‘Support annotation by Chloë’ and under Annotation Revision click ‘Keep best annotation only.’ We are only going to use Chloë here for simplicity, but you can choose multiple 3rd party annotators if you want to see whether the annotations they create differ. Leave the other boxes below unchecked.

  7. Now, under ‘BLAT Reference Sequences’ leave everything unchecked, but make sure in the box below (3rd Party Stand-Alone Annotators) that ‘Chloë’ is checked, as well as the three boxes for ‘CDS’ (i.e. protein-coding sequences), ‘tRNA’ (transfer RNA genes), and ‘rRNA’ (ribosomal RNA genes).

  8. Under Output Options (top right), click the box for ‘Generate multi-GenBank.’ A nice feature here is that you can annotate many FASTA files in one ‘run’ of GeSeq. Here we are only annotating one plastome, but if you have e.g. 50 plastomes, you can put them in a single multi-FASTA file with the following structure:

>Genus_species_accession1
ATGGTGGTTGGACAGCCA…

>Genus_species_accession2
ATGGTGGTTGGTCAGCCA…

>Genus_species_accession3
ATGGTCGTTGGTCAGCCA…

>Genus_species_accession4
ATGGTCGTTGGTCAGGCA…

etc…
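
If your plastomes live in separate files, a short script can assemble the multi-FASTA for you. The sketch below is one possible way to do it, not anything GeSeq itself provides: `build_multifasta` is a hypothetical helper, the 60-character line width is an arbitrary choice, and the duplicate-header check is there only because identical names would make the annotated records hard to tell apart later.

```python
def build_multifasta(records, width=60):
    """Join (name, sequence) pairs into one multi-FASTA string."""
    seen = set()
    chunks = []
    for name, seq in records:
        header = "_".join(name.split())  # no spaces in headers
        if header in seen:
            raise ValueError(f"duplicate FASTA header: {header}")
        seen.add(header)
        seq = "".join(seq.split()).upper()  # drop line breaks and spaces
        if not seq:
            raise ValueError(f"empty sequence for {header}")
        # Wrap the sequence into fixed-width lines.
        wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
        chunks.append(">" + header + "\n" + "\n".join(wrapped) + "\n")
    return "".join(chunks)


if __name__ == "__main__":
    # Toy sequences copied from the example above; real plastomes are far longer.
    demo = [
        ("Genus_species_accession1", "ATGGTGGTTGGACAGCCA"),
        ("Genus_species_accession2", "ATGGTGGTTGGTCAGCCA"),
    ]
    print(build_multifasta(demo), end="")
```

The resulting text can be saved as a single file and uploaded to GeSeq in place of a one-record FASTA.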
  9. Now, we are ready to annotate! (This is so much easier than when I was a grad student/postdoc!!!) Accept the disclaimer and click submit.

  10. After waiting a few minutes, GeSeq will start to output some things under Results. The things you will be most interested in are:

  • Annotation -- multi-GenBank

  • Visualization -- OGDraw

  11. Click ‘OGDraw’ to see a nice, pretty diagram of your plastome.

https://github.com/barrettlab/2021-Genomics-bootcamp/blob/main/Sorghum_trial1_Sorghum_halepense-L0011-1_OGDRAW.jpg

  12. Click on ‘multi-GenBank’ and have a look at the format. You can download it, or copy/paste it into a text file, and call it ‘Sorghum_XXXX.gb’ or something similar (you will often see .gb or .gbk as the file extension). This is a GenBank Flat File. You can then edit this file to add information to each of your annotations. At a minimum you should include the following:
  • specimen_voucher

  • location/country of collection

  • source of material

  • etc. (basically, any information you can include to make the annotation as useful as possible)
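
Editing the flat file by hand is fine for one plastome, but for many records a small script may be less error-prone. The sketch below assumes the record already contains a `source` feature with a single-line location, and it simply inserts qualifier lines beneath it at the standard 21-space indent. The helper name `add_source_qualifiers` and the example values are hypothetical, and the qualifier names you pass in (here `specimen_voucher` and `country`) should be checked against what GenBank currently accepts. It does not wrap long values or check for qualifiers that are already present.

```python
QUALIFIER_INDENT = " " * 21  # qualifiers start at column 22 in GenBank flat files


def add_source_qualifiers(gb_text, qualifiers):
    """Insert /key="value" lines directly after each 'source' feature line."""
    out = []
    for line in gb_text.splitlines():
        out.append(line)
        # Feature keys are indented by five spaces in the FEATURES table.
        if line.startswith("     source "):
            for key, value in qualifiers.items():
                out.append(f'{QUALIFIER_INDENT}/{key}="{value}"')
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Tiny made-up fragment of a flat file, just enough to show the layout.
    record = (
        "FEATURES             Location/Qualifiers\n"
        "     source          1..100\n"
        '                     /organism="Sorghum halepense"\n'
    )
    extra = {"specimen_voucher": "McKain 1234", "country": "USA: Alabama"}
    print(add_source_qualifiers(record, extra), end="")
```

Because the function adds the same qualifiers after every `source` line it finds, it is best run on one record at a time when a multi-GenBank file holds accessions with different vouchers or localities.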

  13. Other considerations

A. Submitting to GenBank (a whole other lecture/workshop in itself). Unfortunately, GenBank does not accept its own format (GenBank Flat Files). I never understood that. Instead, you need to convert the .gb file to another format, .sqn. Luckily, there is a link on the left to a script that converts to this format (GB2sequin).

B. Getting a publication-quality plastome diagram with OGDRAW. You can upload your resulting GenBank Flat File (.gb, .gbk, etc.) to OGDRAW and export a vector file (.pdf, .svg, .ps), to be edited with e.g. Adobe Illustrator.

C. Weird, non-conformist plastomes. If you are working with a green photosynthetic plant species, chances are you will be fine following everything we have done here. However, some of us (i.e. my lab members and I) work with leafless, parasitic orchids, which show all kinds of extreme modifications of the plastome. These include genomic deletions, inversions, expansions/contractions of the IR, and divergent pseudogenes. These kinds of plastomes require a lot of extra work to distinguish functional CDS from pseudogenes.

  • In this case, GeSeq might be a good starting point, but you will need to do some additional annotation yourself (Geneious is good for this, and can export .gb files). GeSeq will not annotate pseudogenes, so you will just see a blank spot in the output.

  • If you are lucky enough to have a closely related reference genome, you can use this as an annotation database instead of the NCBI RefSeq database used by e.g. BLAT and Chloë in GeSeq.

  • To do this, under BLAT -- 3rd Party References, you can search for specific plastomes in GenBank, or provide your own (e.g. if you have an unpublished reference annotation or set of genes) under User References.