crop5 - petermr/CEVOpen GitHub Wiki

crop5 miniproject template

  • Five miniprojects for DBT/KARYA interns 2021-09.
  • Duration 2 months
  • Each intern chooses a project from a list of 7 crops (see TIGR2ESS 2019 workshop)
  • Each project is phased; some phases are iterative.

We take Maize (Zea mays, Zm) as a typical project. Each intern will substitute their crop.

Communal resources

  • terpene synthase dictionary (Sagar)
  • terpenes (eo_compound)
  • eo_plants (useful to see which plants are co-studied)
  • eo_plant_parts

goals of each miniproject

  • rapidly (within hours) assess by hand whether the literature on Zm + TPS (terpene synthase) is large enough to be useful. If not, select another plant. This may need communal discussion.

  • each intern builds separate mini-dictionaries for:

    • Zm genes or enzymes keyed on enzyme name. Start with Sagar's dictionary. We want to find what is mentioned in the literature.

    • Zm enzyme products (mainly terpenes)

  • search EPMC using mini-dictionaries to assess scope/feasibility

  • increase the size or precision of the dictionaries by snowballing (particularly important for abbreviations, if they are common).

  • refine minicorpus to contain high precision content on Zm enzymes. At this stage the minicorpus will be a collection of papers which are primarily about terpene synthases and their products in Zm.

  • communally compare dictionaries and corpus (mainly by term frequency) to decide:

    • which TPS are most important in each plant
    • which terpenes are most important in each plant
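
The term-frequency comparison can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not part of ami or pygetpapers: the function name `term_frequencies`, the gene names in `genes` and the sample sentences are all invented for the example, and matching is case-insensitive on whole words.

```python
import re
from collections import Counter

def term_frequencies(texts, terms):
    """Count case-insensitive whole-word occurrences of each term across texts."""
    counts = Counter()
    for term in terms:
        # Lookarounds stop "ZmTPS1" from matching inside "ZmTPS10".
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(text)) for text in texts)
    return counts

# Invented sentences standing in for the text of papers in a minicorpus.
texts = [
    "ZmTPS10 was detected in leaves. zmtps10 transcripts increased.",
    "ZmTPS23 and ZmTPS10 were compared.",
]
# Hypothetical mini-dictionary of gene names.
genes = ["ZmTPS10", "ZmTPS23", "ZmTPS1"]

freq = term_frequencies(texts, genes)
for term, n in freq.most_common():
    print(f"{n}\t{term}")
```

Running the same function over each intern's corpus with the shared dictionaries gives comparable counts per plant, which is the raw material for deciding which TPS and terpenes matter most.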

project architecture

  • each intern has their own wiki (e.g. Zea_mays)

  • they record everything daily on the wiki. For large data they create a subdirectory (see TIGR2ESS projects https://github.com/petermr/tigr2ess/tree/master/crops )

  • a daily standup report with links to wiki

  • create an initial minicorpus (Zm100), main purpose to snowball terms, abbreviations, etc.

  • create skeleton dictionaries by searching Zm100. The goal is to find out which genes or compounds are most frequently reported and what syntaxes are used to name them.

    • TPSgene => Zm100gene. Initially a list of enzyme names. Gradually add synonyms, new enzymes and abbreviations. NOTE: The genes may or may not have the form ZmTPS or ZmHMGR , etc. Abbreviations may be standard or highly variable. This will be messy, but valuable.
    • eo_compounds => Zm100terp. A list of compounds created by terpene synthases. There will be many synonyms and possibly some abbreviations.
  • Each intern has a major project that they are responsible for, and a minor project that they help with.
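
Snowballing gene names from the initial minicorpus can be partly automated. The sketch below is a rough aid, not a replacement for reading the papers: it assumes names look like a short letter prefix followed by "TPS" and an optional number (e.g. ZmTPS10, CsTPS, MonoTPS). The pattern, the function name `candidate_terms` and the sample snippets are all assumptions made for illustration. As noted above, real names may not follow this form, so anything the pattern misses still has to be found by hand.

```python
import re
from collections import Counter

# Assumed pattern: up to six letters of prefix (e.g. "Zm", "Cs", "Mono"),
# then "TPS", then an optional number. Names that do not follow this
# shape will be missed and still need manual reading.
CANDIDATE = re.compile(r"\b[A-Za-z]{0,6}TPS\d*\b")

def candidate_terms(texts):
    """Count every string matching CANDIDATE across the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(CANDIDATE.findall(text))
    return counts

# Invented snippets standing in for text from the Zm100 minicorpus.
snippets = [
    "ZmTPS10 and ZmTPS23 were cloned; both TPS genes respond to herbivory.",
    "Unlike CsTPS1, ZmTPS10 is expressed in leaves.",
]

candidates = candidate_terms(snippets)
for term, n in candidates.most_common():
    print(f"{n}\t{term}")
```

The frequency-sorted output is a list of candidates to review by hand before adding them, with synonyms and abbreviations, to the Zm100gene list.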

Crop specific TPS dictionary (KARYA Interns)

  • Generate hand-created terms in a text file

  • Install pygetpapers and ami; for pygetpapers see https://github.com/petermr/pygetpapers/blob/main/README.md

  • pygetpapers query

    pygetpapers -q "terpene synthase volatile Camellia AND (((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:\"Review\")))" -o CamelliaTPS -x -p -s

    This creates a folder "CamelliaTPS" containing the downloaded papers.

    Interns can also use the following search terms in place of "terpene synthase volatile Camellia":

    "terpene synthase volatile Mentha"

    "terpene synthase volatile Citrus sinensis"

    "terpene synthase volatile Zea mays"

    "terpene synthase volatile Vitis vinifera"

  • Focus only on research articles.

  • Go through each paper, using Ctrl+F to search for TPS.

  • Collect gene name terms such as CsTPS, MonoTPS and so on. Put those terms into an Excel file as a list and save it as a text file named gene.txt.

  • Use this command to create a dictionary:

    amidict -v --dictionary eo_Gene --directory gene --input gene.txt create --informat list --outformats xml

  • Create a corpus using this command:

    pygetpapers -q "terpene synthase TPS plant volatile" -o TPSvolatile -x -p -k <number of papers>

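  • Test the dictionary. First split the papers in the corpus into sections:

    ami -p "TPSvolatile" section

    Then search the sectioned corpus with the dictionary:

    ami -p "TPSvolatile" search --dictionary eo_Gene.xml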
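
For the step where the collected gene names are saved as gene.txt, a small script can tidy the list before it is passed to amidict. This is a minimal sketch, assuming that `--informat list` expects one term per line (worth checking against the ami documentation). The function name `write_term_list` and the sample terms are invented for illustration.

```python
import tempfile
from pathlib import Path

def write_term_list(terms, path):
    """Write unique, non-empty terms to `path`, one per line, keeping first-seen order."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()  # drop stray spaces copied from spreadsheet cells
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned

# Terms as they might be copied from a spreadsheet column (invented example).
raw_terms = ["CsTPS", " MonoTPS ", "", "CsTPS", "CsTPS1"]

# Written to a temporary directory here; in practice use the project folder.
out = Path(tempfile.mkdtemp()) / "gene.txt"
written = write_term_list(raw_terms, out)
print(out, written)
```

Removing blanks and duplicates at this stage keeps the generated dictionary free of empty or repeated entries.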