crop5 - petermr/CEVOpen GitHub Wiki
crop5 miniproject template
- Five miniprojects for DBT/KARYA interns 2021-09.
- Duration 2 months
- Each intern chooses a project from a list of 7 crops (see TIGR2ESS 2019 workshop)
- The project is phased, and some phases are iterative
We take Maize (Zea mays, Zm) as a typical project. Each intern will substitute their own crop.
Communal resources
- terpene synthase dictionary (Sagar)
- terpenes (eo_compound)
- eo_plants (useful to see which plants are co-studied)
- eo_plant_parts
goals of each miniproject
manually and rapidly (within hours) assess whether the literature on the crop + TPS is large enough to be useful. If not, select another plant. This may need communal discussion.
each intern builds separate mini-dictionaries for:
- Zm genes or enzymes, keyed on enzyme name. Start with Sagar's dictionary. We want to find what is mentioned in the literature.
- Zm enzyme products (mainly terpenes)
search EPMC using mini-dictionaries to assess scope/feasibility
increase size or precision of dictionaries by snowballing (particularly important for abbreviations - if they are common).
refine minicorpus to contain high precision content on Zm enzymes. At this stage the minicorpus will be a collection of papers which are primarily about terpene synthases and their products in Zm.
communally compare dictionaries and corpus (mainly by term frequency) to decide:
- which TPS are most important in each plant
- which terpenes are most important in each plant
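The term-frequency comparison described above could be sketched as follows. This is a minimal illustration in Python, not part of the ami toolchain: the function name `term_frequencies` and the example sentences are made up for the demonstration, and a real run would read the text of the papers in each minicorpus.

```python
import re
from collections import Counter

def term_frequencies(texts, terms):
    """Count case-insensitive whole-term occurrences of each term across texts."""
    counts = Counter()
    for term in terms:
        # Lookarounds act like word boundaries but also work for terms that
        # start or end with punctuation, such as "(E)-beta-farnesene".
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(text)) for text in texts)
    return counts

# Made-up sentences standing in for papers in a minicorpus.
texts = [
    "ZmTPS10 was studied together with (E)-beta-farnesene; TPS10 was also named.",
    "Linalool and (E)-beta-farnesene were measured. ZmTPS10 was mentioned again.",
]
gene_counts = term_frequencies(texts, ["ZmTPS10", "ZmTPS23"])
terpene_counts = term_frequencies(texts, ["linalool", "(E)-beta-farnesene"])
print(gene_counts.most_common())
print(terpene_counts.most_common())
```

Ranking each dictionary's terms this way for every crop gives a simple table to compare communally. Terms with a zero count are candidates for removal or for a missing synonym.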
project architecture
each intern has their own wiki
they record everything daily on the wiki. For large data they create a subdirectory (see TIGR2ESS projects)
a daily standup report with links to wiki
create an initial minicorpus (Zm100), whose main purpose is to snowball terms, abbreviations, etc.
create skeleton dictionaries by searching Zm100. The goal is to find out which genes or compounds are most frequently reported and what syntaxes are used
- TPSgene => Zm100gene. Initially a list of enzyme names. Gradually add synonyms, new enzymes and abbreviations.
NOTE: Gene names may or may not follow a consistent form. Abbreviations may be standard or highly variable. This will be messy, but valuable.
- eo_compounds => Zm100terp. A list of compounds created by terpene synthases. There will be many synonyms and possibly some abbreviations.
Each intern has a major project that they are responsible for, and a minor project that they help with.
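Snowballing gene names and abbreviations from the initial minicorpus can be partly automated. The sketch below is one possible approach in Python, under a stated assumption: that candidate names look like an optional run of letters followed by "TPS" and optional digits (for example ZmTPS10 or MonoTPS). Real gene names vary, so the hits are only candidates to be checked by hand. The function name and example sentence are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: optional letter prefix, then "TPS", then optional digits.
# This is a guess at common naming, not a rule taken from any nomenclature.
CANDIDATE = re.compile(r"\b[A-Za-z]*TPS\d*\b")

def candidate_gene_terms(texts):
    """Count every TPS-like token found in the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(CANDIDATE.findall(text))
    return counts

# Made-up sentence standing in for text from a Zm100 paper.
sample = [
    "ZmTPS10 and ZmTPS23 were cloned; TPS10 expression rose. "
    "ZmTPS10 again, plus CsTPS and MonoTPS."
]
candidates = candidate_gene_terms(sample)
for term, n in candidates.most_common():
    print(term, n)
```

Frequent candidates that are missing from the dictionary can then be added as new entries or synonyms. Note that ZmTPS10 and TPS10 are counted separately here, which is useful for spotting abbreviations.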
Crop specific TPS dictionary (KARYA Interns)
Generation of hand-created terms in a text file
Installation of pygetpapers and ami
pygetpapers query
pygetpapers -q "terpene synthase volatile Camellia AND (((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))" -o CamelliaTPS -x -p -s
This creates a folder "CamelliaTPS" containing the downloaded papers.
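Because each intern runs the same query with a different crop, the command can be generated rather than retyped. The following Python sketch assembles the command line shown above for any crop name. It only builds and prints the command and does not run pygetpapers. The helper name `pygetpapers_command` and the output folder naming (crop name without spaces plus "TPS", mirroring CamelliaTPS) are assumptions for illustration.

```python
import shlex

# Source and publication-type filter copied from the Camellia query above.
SOURCE_FILTER = '(((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))'

def pygetpapers_command(crop, output_dir):
    """Return a shell-quoted pygetpapers command line for one crop."""
    query = f"terpene synthase volatile {crop} AND {SOURCE_FILTER}"
    args = ["pygetpapers", "-q", query, "-o", output_dir, "-x", "-p", "-s"]
    # shlex.join (Python 3.8+) quotes each argument for a POSIX shell.
    return shlex.join(args)

for crop in ["Mentha", "Citrus sinensis", "Zea mays", "Vitis vinifera"]:
    # Hypothetical folder naming: "Zea mays" -> "ZeamaysTPS".
    print(pygetpapers_command(crop, crop.replace(" ", "") + "TPS"))
```

The quoting produced here is for POSIX shells such as bash. On Windows the quoting rules differ, so the printed command may need adjusting.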
Interns can also use the following queries.
"terpene synthase volatile Mentha"
"terpene synthase volatile Citrus sinensis"
"terpene synthase volatile Zea mays"
"terpene synthase volatile Vitis vinifera"
Focus only on research articles.
Go through each paper, using Ctrl+F to search for TPS.
Collect gene name terms such as CsTPS, MonoTPS and so on. Put those terms into an Excel file as a list, then save the list as a text file named gene.txt.
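Hand-collected term lists often contain stray spaces, blank rows and repeats. As an optional tidy-up before building the dictionary, a short Python sketch like the one below could write the list with one unique term per line. The function name `write_term_list` is hypothetical, and the demo writes into a temporary folder so that it does not overwrite a real gene.txt.

```python
import tempfile
from pathlib import Path

def write_term_list(terms, path):
    """Write unique, non-empty terms one per line, keeping first-seen order."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned

# Example terms as they might be copied from a spreadsheet column.
collected = ["CsTPS", " MonoTPS ", "", "CsTPS"]
gene_file = Path(tempfile.mkdtemp()) / "gene.txt"
terms_written = write_term_list(collected, gene_file)
print(terms_written)
```

Duplicates are matched exactly, so CsTPS and cstps would both be kept. That is deliberate, since case can matter in gene names.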
Use this command to create a dictionary
amidict -v --dictionary eo_Gene --directory gene --input gene.txt create --informat list --outformats xml
Create corpus using this command
pygetpapers -q "terpene synthase TPS plant volatile" -o TPSvolatile -x -p -k <number of papers>
Testing dictionary
ami -p "TPSvolatile" section
ami -p "TPSvolatile" search --dictionary eo_Gene.xml
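As a rough cross-check that is independent of ami, the number of papers mentioning each term can be counted directly from the downloaded files. The Python sketch below assumes the pygetpapers layout of one subfolder per paper containing a fulltext.xml file. It uses plain case-insensitive substring matching on the raw XML, so it is cruder than a dictionary search. The function name and the tiny fake corpus are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def papers_mentioning(corpus_dir, terms):
    """Return {term: number of fulltext.xml files containing the term}."""
    hits = {term: 0 for term in terms}
    for xml_file in sorted(Path(corpus_dir).glob("*/fulltext.xml")):
        text = xml_file.read_text(encoding="utf-8", errors="ignore").lower()
        for term in terms:
            if term.lower() in text:
                hits[term] += 1
    return hits

# Build a tiny fake corpus; the folder names and contents are invented.
corpus = Path(tempfile.mkdtemp())
fake_papers = [
    ("PMC0000001", "<p>CsTPS1 was cloned.</p>"),
    ("PMC0000002", "<p>No gene names here.</p>"),
]
for folder, body in fake_papers:
    (corpus / folder).mkdir()
    (corpus / folder / "fulltext.xml").write_text(body, encoding="utf-8")

paper_hits = papers_mentioning(corpus, ["CsTPS", "MonoTPS"])
print(paper_hits)
```

For a real run, `corpus` would point at the TPSvolatile folder and the terms would be read from gene.txt. Large differences between these counts and the ami search results would be worth investigating.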