Dictionary: Gene - petermr/CEVOpen GitHub Wiki
Owner
Vasant
Collaborators
Giulia
Dictionary
Gene (gene)
Overview
This dictionary contains the names of genes in plants. It will be used to search paper corpora about terpenes, a class of plant chemical compounds whose properties are of interest both to the plants that make them and to us as humans (e.g. pesticide activity, medicinal properties, material properties). The question is whether using this dictionary via ami search, along with more dictionaries if needed, will tell us more about the toxicology of terpenes and the part genes play in it.
[Giulia]
Source
Choice of organisms
I think the following species make sense to include - we are by no means limited to these. It's more of a starting point with species of particular interest (commercial or academic) because of the groups they belong to:
Algal models
Chlamydomonas reinhardtii
Ostreococcus tauri
Penium margaritaceum
Non-vascular models
Physcomitrella patens
Marchantia polymorpha
Marchantia paleacea
Anthoceros agrestis
Seedless models
Selaginella moellendorffii
Azolla filiculoides
Flowering models
Arabidopsis thaliana
Nicotiana benthamiana
Medicago truncatula
Brachypodium distachyon
Crop models
Oryza sativa
Zea mays
Solanum lycopersicum
Solanum tuberosum
Glycine max
Triticum aestivum
Gossypium hirsutum
Hordeum vulgare
Medicinal plants
Papaver somniferum
Digitalis purpurea
Catharanthus roseus
Artemisia annua
Cannabis sativa
Tree models
Populus trichocarpa
Eucalyptus grandis
Eucalyptus globulus
Picea abies
Picea glauca
Pinus taeda
I took inspiration from this paper for some of these models: https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC5068971&blobtype=pdf
In particular, trees might be interesting to explore because of their terpenes and essential oils. Cannabis is also an interesting model, which has become very popular as a source of valuable metabolites including terpenes: http://europepmc.org/article/PMC/4740396
Retrieval of gene names
We can find genes in the major plant genomic databases, which are Phytozome and Ensembl Plants:
https://phytozome-next.jgi.doe.gov/
http://plants.ensembl.org/species.html
I used Phytozome's BioMart. It allows you to select a number of species (I picked 21 for this test; the selection can be expanded) and export details such as gene name, gene description, species name and many others. The data set needs cleaning, but I think it could be a good starting point for making a gene dictionary from a list of terms.
Starting from Phytozome's home page, I selected species of interest from the phylogeny on the right. I made sure the species I selected were associated with the UNRST data policy (green rectangle next to the species name), which allows unrestricted use of the genomic data associated with the species provided that the source is cited.
I could not find all the species from the list I made in the previous section, but I could find representatives from many of the groups. I selected 21 species to test whether I could get gene names in bulk or at least locus names (the physical address of a gene in the genome).
Genes retrieved for the following species
Chlamydomonas reinhardtii
Ostreococcus lucimarinus
Selaginella moellendorffii
Arabidopsis thaliana
Medicago truncatula
Salix purpurea
Eucalyptus grandis
Brachypodium distachyon
Physcomitrium patens (previously Physcomitrella patens)
Marchantia polymorpha
Oryza sativa
Zea mays
Solanum tuberosum
Olea europaea
Coffea arabica
Vitis vinifera
Zea mays
Glycine max
Solanum lycopersicum
Gossypium hirsutum
Populus trichocarpa
After selecting the species, I selected build custom data sets on the top right and was redirected to the BioMart page. There, I left Filters as it was and clicked on Attributes to specify that I wanted the name of the gene, the name of the transcript, the description of the gene and the species name. I also asked for the PFAM code and description (it contains information about the protein domains and functions encoded by the gene).
I cleaned up the raw dataset and obtained a list of gene names. The input files, scripts and output files are here: https://github.com/petermr/CEVOpen/tree/master/dictionary/eoGene
[Giulia]
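The clean-up step described above can be sketched roughly as follows. This is not the script stored in the repository; it is a minimal illustration that assumes the BioMart export is tab-separated and has a column called gene_name (a hypothetical header, so adjust it to the actual export). The sample rows are made up, apart from the locus AT5G47500 mentioned on this page.

```python
import csv
import io

def unique_gene_names(tsv_text, name_column="gene_name"):
    """Return gene names from a tab-separated export, deduplicated in first-seen order.

    name_column is an assumed header; rows with an empty name are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    names = []
    for row in reader:
        name = (row.get(name_column) or "").strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names

# Made-up sample: a gene repeated once per transcript, and a row with no gene name.
sample = (
    "gene_name\ttranscript_name\tdescription\torganism\n"
    "AT5G47500\tAT5G47500.1\texample description\tArabidopsis thaliana\n"
    "AT5G47500\tAT5G47500.2\texample description\tArabidopsis thaliana\n"
    "\tunnamed.1\texample description\tArabidopsis thaliana\n"
    "GENE_B\tGENE_B.1\texample description\tExample species\n"
)

gene_names = unique_gene_names(sample)
print(gene_names)
```

Keeping first-seen order rather than sorting makes it easier to compare the cleaned list against the raw export by eye.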
Issues
For all instances, gene name = locus name. Many genes have multiple names, either because of the family they belong to or because they changed names when a new genome version was released. We'll need to find a way to integrate common names (e.g. AT5G47500 will show up as PME5 in papers, unless the authors mention the locus in the Materials and Methods section). Maybe we can just use regex for that?
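A rough sketch of how the regex idea might look for Arabidopsis follows. It assumes locus identifiers follow the usual AGI shape (AT, then a chromosome 1-5, C or M, then G and five digits). A regex can only spot the locus pattern itself; matching a common name such as PME5 still needs a lookup table, so the sketch pairs the regex with a small synonym table. The table, the function name and the test sentences are hypothetical; only the AT5G47500 / PME5 pair comes from the text above.

```python
import re

# Assumed AGI locus shape: "AT" + chromosome (1-5, C, M) + "G" + five digits.
AGI_LOCUS = re.compile(r"\bAT[1-5CM]G\d{5}\b", re.IGNORECASE)

# Hypothetical synonym table: locus ID -> common names used in papers.
SYNONYMS = {"AT5G47500": ["PME5"]}

def find_genes(text, synonyms=SYNONYMS):
    """Return the set of locus IDs mentioned in text, by locus ID or by common name."""
    # Direct hits on the locus pattern, normalised to upper case.
    found = {match.group(0).upper() for match in AGI_LOCUS.finditer(text)}
    # Hits on common names, mapped back to their locus ID.
    for locus, names in synonyms.items():
        for name in names:
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.add(locus)
    return found

print(find_genes("Expression of PME5 was reduced in the mutant."))
print(find_genes("The at5g47500 locus was amplified."))
```

Common names are matched case-sensitively here because short gene symbols can collide with ordinary words; whether that is the right trade-off depends on the corpus.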
Collaborators: Vasant and Sagar
To create the plant gene dictionary we have selected a few plants, such as A. thaliana, M. domestica, E. grandis and V. vinifera.
- All the locus IDs were provided by Gita ma'am as an Excel sheet.
A. thaliana
- To get synonyms for each A. thaliana locus ID, I used the https://www.arabidopsis.org/ website.
- The website contains all the data related to Arabidopsis.
- Uploaded the CSV file to GitHub: https://github.com/petermr/CEVOpen/blob/master/dictionary/eoGene/plant_gene%20Arabidopsis.csv
M. domestica, E. grandis
- Information on synonyms was retrieved from the literature: https://github.com/petermr/CEVOpen/tree/master/dictionary
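Once locus IDs and synonyms are collected, they could be combined into a single dictionary file. The sketch below is only an illustration: the XML layout (a dictionary element with entry children carrying term and name attributes, plus synonym children) is an assumption to check against the existing dictionaries in the repository, and build_dictionary is a hypothetical helper, not part of ami.

```python
import xml.etree.ElementTree as ET

def build_dictionary(title, synonyms_by_locus):
    """Serialise a locus -> synonyms mapping as XML (assumed layout, see lead-in)."""
    root = ET.Element("dictionary", {"title": title})
    for locus, synonyms in sorted(synonyms_by_locus.items()):
        # One entry per locus; the locus ID doubles as term and name here.
        entry = ET.SubElement(root, "entry", {"term": locus, "name": locus})
        for synonym in synonyms:
            ET.SubElement(entry, "synonym").text = synonym
    return ET.tostring(root, encoding="unicode")

# Only the AT5G47500 / PME5 pair is taken from this page.
xml_text = build_dictionary("gene", {"AT5G47500": ["PME5"]})
print(xml_text)
```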