crop5 - petermr/CEVOpen GitHub Wiki
crop5 miniproject template
- Five miniprojects for DBT/KARYA interns 2021-09.
- Duration 2 months
- Each intern chooses a project from a list of 7 crops (see TIGR2ESS 2019 workshop)
- The project is phased; some phases are iterative.
We take maize (Zea mays, Zm) as a typical project. Each intern will substitute their own crop.
Communal resources
- terpene synthase dictionary (Sagar)
- terpenes (eo_compound)
- eo_plants (useful to see which plants are co-studied)
- eo_plant_parts
goals of each miniproject
- Manually and rapidly (within hours) assess whether the literature on Zm + TPS is large enough to be useful. If not, select another plant. This may need communal discussion.
- Each intern builds separate mini-dictionaries for:
  - Zm genes or enzymes, keyed on enzyme name. Start with Sagar's dictionary. We want to find what is mentioned in the literature.
  - Zm enzyme products (mainly terpenes)
- Search EPMC using the mini-dictionaries to assess scope/feasibility.
- Increase the size or precision of the dictionaries by snowballing (particularly important for abbreviations, if they are common).
- Refine the minicorpus to contain high-precision content on Zm enzymes. At this stage the minicorpus will be a collection of papers which are primarily about terpene synthases and their products in Zm.
- Communally compare dictionaries and corpus (mainly by term frequency) to decide:
  - which TPS are most important in each plant
  - which terpenes are most important in each plant
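The term-frequency comparison mentioned in the goals can be sketched in a few lines of Python. This is only an illustration, assuming each paper is already available as a plain-text string; the function name `term_frequencies`, the terpene list and the two snippets are all invented for the example and are not real results.

```python
import re
from collections import Counter


def term_frequencies(terms, documents):
    """Count case-insensitive whole-term occurrences of each term across documents."""
    counts = Counter()
    for term in terms:
        # Lookarounds stop a term matching inside a longer word,
        # while still allowing hyphenated names such as "beta-caryophyllene".
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(doc)) for doc in documents)
    return counts


# Hypothetical dictionary terms and paper snippets, for illustration only.
terpenes = ["linalool", "limonene", "beta-caryophyllene"]
papers = [
    "Maize TPS10 produces (E)-beta-caryophyllene; linalool was also detected.",
    "Linalool and limonene emission increased after herbivory.",
]

freq = term_frequencies(terpenes, papers)
for term, n in freq.most_common():
    print(term, n)
```

Running the same count with each intern's dictionary over each crop's minicorpus would give comparable tables to discuss communally.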
project architecture
- Each intern has their own wiki (e.g. Zea_mays).
- They record everything daily on the wiki. For large data they create a subdirectory (see TIGR2ESS projects https://github.com/petermr/tigr2ess/tree/master/crops ).
- A daily standup report with links to the wiki.
- Create an initial minicorpus (Zm100), whose main purpose is to snowball terms, abbreviations, etc.
- Create skeleton dictionaries by searching Zm100. The goal is to find out which genes or compounds are most frequently reported and what syntaxes are used.
  - TPSgene => Zm100gene. Initially a list of enzyme names. Gradually add synonyms, new enzymes and abbreviations. NOTE: The genes may or may not have the form ZmTPS or ZmHMGR, etc. Abbreviations may be standard or highly variable. This will be messy, but valuable.
  - eo_compounds => Zm100terp. A list of compounds created by terpene synthases. There will be many synonyms and possibly some abbreviations.
- Each intern has a major project that they are responsible for, and a minor project that they help with.
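As a starting point for snowballing gene abbreviations from Zm100, a simple regular expression can pull out tokens of the form ZmTPS or ZmHMGR. The sketch below assumes that naming pattern (two-letter species prefix, upper-case abbreviation, optional number); as the note above says, real papers will often deviate from it, so the output is a list of candidates to check by hand. The pattern, the function name `candidate_genes` and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Assumed naming pattern: species prefix such as "Zm" (upper + lower case letter),
# then two or more upper-case letters, then optional digits. Only a first guess.
GENE_PATTERN = re.compile(r"\b([A-Z][a-z][A-Z]{2,}\d*)\b")


def candidate_genes(text, prefix="Zm"):
    """Return a Counter of tokens that look like <prefix><ABBREV><number>."""
    hits = GENE_PATTERN.findall(text)
    return Counter(h for h in hits if h.startswith(prefix))


# Invented sample text, for illustration only.
sample = (
    "ZmTPS10 and ZmTPS23 were induced by herbivory, whereas ZmHMGR "
    "expression was unchanged. ZmTPS10 is a sesquiterpene synthase."
)

found = candidate_genes(sample)
for gene, n in found.most_common():
    print(gene, n)
```

Changing `prefix` (for example to "Cs" or "Vv") adapts the sketch to another crop; candidates that survive manual checking can then be added to the Zm100gene list.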
Crop specific TPS dictionary (KARYA Interns)
- Generate hand-created terms in a text file.
- Install pygetpapers (see https://github.com/petermr/pygetpapers/blob/main/README.md ) and ami.
- Run a pygetpapers query:
pygetpapers -q "terpene synthase volatile Camellia AND (((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))" -o CamelliaTPS -x -p -s
This creates a folder "CamelliaTPS" containing the downloaded papers.
Interns can also use the following queries.
"terpene synthase volatile Mentha"
"terpene synthase volatile Citrus sinensis"
"terpene synthase volatile Zea mays"
"terpene synthase volatile Vitis vinifera"
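To avoid retyping the source filter for every crop, the full command lines can be generated from the crop names. The sketch below reuses only the flags and filter shown in the Camellia query above; the helper name `pygetpapers_command` and the output folder names other than CamelliaTPS are made up for the example. It prints the commands rather than running them, and `shlex.join` produces POSIX-shell quoting, so on Windows the quoting may need adjusting.

```python
import shlex

# Source/publication-type filter copied from the Camellia query above.
EPMC_FILTER = '(((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))'


def pygetpapers_command(crop, output_dir):
    """Build a shell-quoted pygetpapers command line for one crop."""
    query = f"terpene synthase volatile {crop} AND {EPMC_FILTER}"
    argv = ["pygetpapers", "-q", query, "-o", output_dir, "-x", "-p", "-s"]
    # shlex.join quotes each argument so the inner "Review" quotes survive the shell.
    return shlex.join(argv)


# Output folder names are hypothetical (only CamelliaTPS appears in the original).
crops = {
    "Camellia": "CamelliaTPS",
    "Mentha": "MenthaTPS",
    "Citrus sinensis": "CitrusTPS",
    "Zea mays": "ZeaTPS",
    "Vitis vinifera": "VitisTPS",
}

for crop, out in crops.items():
    print(pygetpapers_command(crop, out))
```

Each printed line can be pasted into a terminal once pygetpapers is installed.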
- Focus only on research articles.
- Go through each paper using the find function (Ctrl+F), searching for TPS.
- Collect gene names (terms) such as CsTPS, MonoTPS and so on. Put those terms into an Excel file as a list and save the list as a plain-text file named gene.txt.
- Use this command to create a dictionary:
amidict -v --dictionary eo_Gene --directory gene --input gene.txt create --informat list --outformats xml
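Before gene.txt is passed to amidict, the hand-collected terms can be tidied so that stray spaces, blank rows and repeated entries from the spreadsheet do not end up in the dictionary. This is a minimal sketch, assuming that the list input expects one term per line (the original does not state the exact format); the helper names and the sample terms are illustrative, and the example writes to a temporary folder rather than a real project directory.

```python
import tempfile
from pathlib import Path


def clean_terms(raw_terms):
    """Strip whitespace, drop blanks, and remove exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for term in raw_terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    return cleaned


def write_term_list(terms, path):
    """Write one term per line (assumed input format for the dictionary step)."""
    Path(path).write_text("\n".join(terms) + "\n", encoding="utf-8")


# Invented terms as they might be copied out of a spreadsheet column.
raw = ["CsTPS", " MonoTPS ", "", "CsTPS", "CsTPS1"]
terms = clean_terms(raw)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "gene.txt"
    write_term_list(terms, out)
    written = out.read_text(encoding="utf-8")

print(written)
```

Duplicates are matched exactly, so CsTPS and cstps are kept as separate terms; whether to merge case variants is a decision for each intern.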
- Create a corpus using this command:
pygetpapers -q "terpene synthase TPS plant volatile" -o TPSvolatile -x -p -k <number of papers>
- Test the dictionary against the corpus:
ami -p "TPSvolatile" section
ami -p "TPSvolatile" search --dictionary eo_Gene.xml