Dictionary: Gene - petermr/CEVOpen GitHub Wiki

Owner

Vasant

Collaborators

Giulia

Dictionary

Gene (gene)

Overview

This dictionary contains the names of plant genes. It will be used to search paper corpora about terpenes, a class of plant chemical compounds with properties of interest both to the plants that make them and to humans (e.g. pesticide activity, medicinal properties, material properties). The open question: by using this dictionary via ami search, together with other dictionaries if needed, can we find out more about the toxicology of terpenes and the part genes play in it? [Giulia]

Source

Choice of organisms

I think the following species make sense to include, though we are by no means limited to them. They are a starting point: species of particular interest (commercial or academic) because of the groups they belong to:

Algal models

Chlamydomonas reinhardtii

Ostreococcus tauri

Penium margaritaceum

Non-vascular models

Physcomitrella patens

Marchantia polymorpha

Marchantia paleacea

Anthoceros agrestis

Seedless models

Selaginella moellendorffii

Azolla filiculoides

Flowering models

Arabidopsis thaliana

Nicotiana benthamiana

Medicago truncatula

Brachypodium distachyon

Crop models

Oryza sativa

Zea mays

Solanum lycopersicum

Solanum tuberosum

Glycine max

Triticum aestivum

Gossypium hirsutum

Hordeum vulgare

Medicinal plants

Papaver somniferum

Digitalis purpurea

Catharanthus roseus

Artemisia annua

Cannabis sativa

Tree models

Populus trichocarpa

Eucalyptus grandis

Eucalyptus globulus

Picea abies

Picea glauca

Pinus taeda

I took inspiration from this paper for some of these models: https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC5068971&blobtype=pdf

In particular, trees might be interesting to explore because of their terpenes and essential oils. Cannabis is also an interesting model that has become very popular as a source of valuable metabolites, including terpenes: http://europepmc.org/article/PMC/4740396

Retrieval of gene names

We can find genes in the major plant genomic databases, which are Phytozome and Ensembl Plants:

https://phytozome-next.jgi.doe.gov/

http://plants.ensembl.org/species.html

I used Phytozome's BioMart. It lets you select a number of species (I picked 21 for this test; the selection can be expanded) and export fields such as gene name, gene description, species name and many others. The data set needs cleaning, but I think it could be a good starting point for building a gene dictionary from a list of terms.

Starting from Phytozome's home page, I selected species of interest from the phylogeny on the right. I made sure the species I selected were covered by the UNRST data policy (green rectangle next to the species name), which allows unrestricted use of the genomic data associated with the species provided that the source is cited.

I could not find all the species from the list I made in the previous section, but I could find representatives from many of the groups. I selected 21 species to test whether I could get gene names in bulk or at least locus names (the physical address of a gene in the genome).

Genes retrieved for the following species

Chlamydomonas reinhardtii

Ostreococcus lucimarinus

Selaginella moellendorffii

Arabidopsis thaliana

Medicago truncatula

Salix purpurea

Eucalyptus grandis

Brachypodium distachyon

Physcomitrium patens (previously Physcomitrella patens)

Marchantia polymorpha

Oryza sativa

Zea mays

Solanum tuberosum

Olea europaea

Coffea arabica

Vitis vinifera

Zea mays

Glycine max

Solanum lycopersicum

Gossypium hirsutum

Populus trichocarpa

After selecting the species, I clicked "build custom data sets" at the top right and was redirected to the BioMart page. There, I left Filters unchanged and clicked on Attributes to request the gene name, the transcript name, the gene description and the species name. I also asked for the PFAM code and description (these contain information about the protein domains and functions encoded by the gene).

I cleaned up the raw dataset and obtained a list of gene names. The input files, scripts and output files are here: https://github.com/petermr/CEVOpen/tree/master/dictionary/eoGene
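The cleanup step can be sketched roughly as follows. This is a minimal illustration, not the script in the repository linked above: the column headers (`gene_name`, `transcript_name`, `description`, `organism`) and the sample rows are hypothetical, and a real BioMart export may label its columns differently.

```python
import csv
import io

# Hypothetical tab-separated excerpt standing in for a BioMart export.
# Headers, descriptions and the GENE_B row are placeholders for illustration.
SAMPLE_EXPORT = (
    "gene_name\ttranscript_name\tdescription\torganism\n"
    "AT5G47500\tAT5G47500.1\tplaceholder description\tspecies A\n"
    "AT5G47500\tAT5G47500.2\tplaceholder description\tspecies A\n"
    "\tUNNAMED.1\t\tspecies A\n"
    "GENE_B\tGENE_B.1\t\tspecies B\n"
)


def unique_gene_names(tsv_text, column="gene_name"):
    """Return gene names in first-seen order, without blanks or duplicates."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    names = []
    for row in reader:
        # Missing or empty cells are treated as "no gene name" and skipped.
        name = (row.get(column) or "").strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


gene_names = unique_gene_names(SAMPLE_EXPORT)
print(gene_names)
```

Several transcripts of one gene share a gene name, so deduplicating on that column collapses them into a single dictionary term.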

[Giulia]

Issues

In every record, the gene name is the same as the locus name. Many genes have multiple names, either because of the family they belong to or because they were renamed when a new genome version was released. We'll need to find a way to integrate common names (e.g. AT5G47500 will show up as PME5 in papers, unless the authors mention the locus in the Materials and Methods section). Maybe we can just use regex for that?
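A regex along these lines could pick up Arabidopsis locus identifiers in text. It is a sketch that assumes the AGI locus format (AT, then a chromosome code of 1 to 5, C or M, then G and five digits); other species use different locus schemes and would need their own patterns. The alias table is a hypothetical stand-in holding only the pair mentioned above, and `find_loci` is an illustrative helper name.

```python
import re

# Assumed AGI locus format: "AT" + chromosome (1-5, C, M) + "G" + five digits.
# IGNORECASE also catches the mixed-case style "At5g47500" used in many papers.
AGI_LOCUS = re.compile(r"\bAT[1-5CM]G\d{5}\b", re.IGNORECASE)

# Hypothetical alias table; only the pair mentioned in the text is included.
LOCUS_TO_COMMON = {"AT5G47500": "PME5"}


def find_loci(text):
    """Return the distinct locus IDs found in text, upper-cased, in order."""
    found = []
    for match in AGI_LOCUS.finditer(text):
        locus = match.group(0).upper()
        if locus not in found:
            found.append(locus)
    return found


sample = "Plants lacking At5g47500 were compared with AT1G01010 controls."
loci = find_loci(sample)
# Map each locus to its common name where one is known, else None.
aliases = {locus: LOCUS_TO_COMMON.get(locus) for locus in loci}
print(loci)
print(aliases)
```

A regex only finds locus IDs. Linking them to common names still needs a curated locus-to-name table, so the two approaches complement each other.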

Collaborators: Vasant and Sagar

To create the plant gene dictionary, we have selected a few plants, such as A. thaliana, M. domestica, E. grandis and V. vinifera.

  • All the locus IDs were provided by Gita ma'am as an Excel sheet

A. thaliana

M. domestica, E. grandis