GoldStandard: Abstracts of oil26 - petermr/CEVOpen GitHub Wiki

We are devising a Gold Standard for future work on classification and machine learning. As a primer, we are doing a fun activity: manually annotating the abstracts of papers in the oil26 corpus. Specifically, we are looking for mentions of plants, compounds, activities, countries and plant parts in the papers.

The end result will be a curated .csv with stand-off annotations from all annotators. We can then compare the annotations to check for agreement. This is a blinded experiment, meaning each annotator will annotate individually, without help from the others. Annotators will have to agree on a date to publish their work.

Mini-corpus

Dictionaries and the criteria:

eoPlant

  • Should contain:
    • Two words (usually in italics), sometimes with the first word abbreviated
  • Will not include:
    • common names, synonyms, family names, etc.
    • a genus name on its own
  Note: You might want to spend time understanding the hierarchy of organism classification before the exercise.
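
The eoPlant criteria above can be approximated by a shape check on a candidate string, for example the content of an italic element. This is only a sketch: the pattern and the function name `looks_like_binomial` are assumptions made for illustration, not part of the project's tooling. The check accepts any capitalised word followed by a lower-case word, so it still needs a dictionary or human judgment behind it.

```python
import re

# Assumed shape of a binomial: a capitalised genus, either written in full
# or abbreviated to one letter plus a period, followed by a lower-case epithet.
BINOMIAL_SHAPE = re.compile(r"(?:[A-Z][a-z]+|[A-Z]\.)\s+[a-z][a-z-]+")

def looks_like_binomial(candidate):
    """Return True if the whole candidate string has the two-word binomial shape."""
    return BINOMIAL_SHAPE.fullmatch(candidate.strip()) is not None

# Candidates taken from the example abstract on this page.
for candidate in ["Kundmannia anatolica", "K. anatolica", "Kundmannia", "Apiaceae"]:
    print(candidate, looks_like_binomial(candidate))
```

A genus on its own and a family name both fail the check because they are single words, which matches the "Will not include" list.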

plant_part

  • Will include:
    • plant parts like root, rhizome, leaf, stem, flower, seed, etc., in English
  • Will not include:
    • terms like trees, branches, etc., used in a computer science context

eoPlant_compound

  • Will include:
    • specific chemicals
    • well-known common names
    • compounds likely created by a plant
  • Will not include:
    • generic terms (e.g., sugar, amino acid)
  Note: An essential oil is, sometimes, a combination of different plant compounds.

country

  • Will include:
    • countries with ISO codes
  • Will not include:
    • regions, cities, geographical location, etc.

activity

  • Should contain:
    • Biological activity like antiseptic, antimicrobial, and so on.

Coordinators

  • Bhavini
  • Shweata

Annotators

  • Bhavini
  • Chaitanya
  • Radhu
  • Kanishka
  • Sagar
  • Talha
  • Vasant

Initial annotation (PMR)

We annotate the terms that we would expect the system to annotate automatically, using our own scientific experience and judgment. There is no right or wrong. We do this "blind" to see how well we humans agree. By definition, the machine cannot do better than the humans. The degree of agreement between the humans is the "inter-annotator agreement".
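
This page does not say how inter-annotator agreement is measured. As one possible sketch, the code below assumes Jaccard overlap between the sets of terms that two annotators recorded for the same paper and category. The annotator labels `A1` to `A3` and their term sets are hypothetical, chosen only to show the calculation.

```python
from itertools import combinations

def jaccard(terms_a, terms_b):
    """Shared terms divided by all distinct terms; 1.0 when both sets are empty."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def pairwise_agreement(annotations):
    """Map each pair of annotators to the Jaccard overlap of their term sets."""
    return {
        (first, second): jaccard(annotations[first], annotations[second])
        for first, second in combinations(sorted(annotations), 2)
    }

# Hypothetical annotations of the compound column for a single abstract.
example = {
    "A1": {"β-Pinene", "α-Thujene", "l-Limonene"},
    "A2": {"β-Pinene", "l-Limonene"},
    "A3": {"β-Pinene", "α-Thujene", "l-Limonene"},
}
for pair, score in pairwise_agreement(example).items():
    print(pair, round(score, 2))
```

Other measures, such as Cohen's kappa, correct for chance agreement and may suit the final analysis better. Set overlap is used here only because it is easy to follow by hand.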

If two humans disagree, we discuss how to create a rule, and then we all agree on this rule. We then encode this rule by:

  • creating a term in the dictionary
  • creating search rules which the machine must carry out

Any human using the dictionary and following the rules should get the same results as any other. But it's often difficult to write the rules precisely for humans and to encode them for machines. We also expect different agreements for different dictionaries - e.g. a dictionary of binomial plant names will lead to better agreement than a dictionary of common taxon names.
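
A minimal sketch of what "a dictionary plus search rules" could look like for a machine is given below. It assumes two rules that are not stated on this page: matching is case-insensitive, and only whole words count. The function names and the tiny plant_part term list are hypothetical. Each hit is returned as a stand-off span (start offset, end offset, matched text) so the source text itself is left unchanged.

```python
import re

def build_matcher(terms):
    """Compile one pattern for all dictionary terms, longest terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Assumed rules: whole-word matches only, ignoring case.
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

def annotate(text, terms):
    """Return stand-off spans (start, end, matched text) for every dictionary hit."""
    if not terms:
        return []
    matcher = build_matcher(terms)
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]

# Hypothetical plant_part dictionary; the sentence comes from the example abstract.
plant_parts = {"fruit", "fruits", "leaf", "root"}
sentence = "Essential oils of the fruits were collected using hydro distillation method"
for start, end, matched in annotate(sentence, plant_parts):
    print(start, end, matched)
```

Under the whole-word rule the entry "fruit" alone would not match "fruits". This is exactly the kind of decision (list plurals in the dictionary, or add a stemming rule) that has to be agreed on and written down.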

Instructions:

  • Each one of you will go through individual abstracts, looking for mentions of:

    • plants (Binomial names)
    • Essential Oils (EO) produced (PMR: this should be plantCompounds; one oil contains many compounds)
    • the associated activity (antimicrobial, antiseptic, antifungal, and so on)
    • the country in which the plant sample was collected
    • plant part from which the EO was obtained
  • We are, however, not validating whether these terms are present in our dictionaries or not.

  • Record your annotations in a .csv table. Here is an example:


<?xml version="1.0" encoding="UTF-8"?>
<abstract>
 <sec id="st1">
  <title>Background:</title>
  <p>
   <italic>Kundmannia anatolica</italic> Hub.-Mor. is an endemic specie of Apiaceae diversified in Turkey. Several parts of the plant may contain essential oils in different quantity which can be influenced by environmental factors, mainly altitude. The aim of this study was to test whether there is any altitude effect on volatile chemical constituents of essential oil obtained from the fruits of 
   <italic>K. anatolica</italic> growing spontaneously in different altitudes of Lakes Region in Turkey.
  </p>
 </sec>
 <sec id="st2">
  <title>Materials and Methods:</title>
  <p>
   <italic>K. anatolica</italic> was collected in 2015 at different altitudes (400, 820, 1002 and 1560 m) of Lakes Region Turkey. The fruits of the plants were distilled for 3 h using a Clevenger type apparatus according to the British Pharmacopiea (1980). Essential oils of the fruits were collected using hydro distillation method and analyzed by GC-MS/FID.
  </p>
 </sec>
 <sec id="st3">
  <title>Results:</title>
  <p>Essential oil contents of fruits increased by corresponding increase in altitude level. Predominant compounds were a-Pinene (27.87-61.94%) and β-Pinene (24.92-36.46%) of the total oil of 
   <italic>K. anatolica</italic>. Other important compounds were α-Thujene (2.66-8.15%), l-Limonene (1.83-8.23%), α-Phellandrene (1.85-5.01%) and these compounds were higher in low altitudes.
  </p>
 </sec>
 <sec id="st4">
  <title>Conclusion:</title>
  <p>Altitude change affected the terpenoid biosynthesis and oxygenated monoterpenes generally and were greatest when low; while sesquiterpene constituents were greatest at high altitudes. The influence of altitude seems to be an important factor for yielding the chemical profile of 
   <italic>K. anatolica</italic> essential oils. Thus, the location of the plant must be taken into account depending on the intended use.
  </p>
 </sec>
</abstract>

Here is the annotation:

| Paper | plant | Country | Plant Part | Plant Compound | Activity |
| --- | --- | --- | --- | --- | --- |
| PMC5411863 | Kundmannia anatolica X 3 | Turkey | Fruits | a-Pinene, β-Pinene, α-Thujene, l-Limonene, α-Phellandrene | |
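
To record annotations in the same column layout, a small helper along the following lines may be useful. It is a sketch under assumptions: the column names are copied from the example header, several values in one cell are joined with ", " (the page does not fix a separator), and the function name `write_annotations` is made up for illustration. The compound names are spelled as they appear in the abstract.

```python
import csv
import io

COLUMNS = ["Paper", "plant", "Country", "Plant Part", "Plant Compound", "Activity"]

def write_annotations(rows, handle):
    """Write one CSV line per paper; each cell holds a list of annotated terms."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        # Assumed convention: several terms in one cell are joined with ", ".
        # The csv module quotes such cells, so the commas do not split columns.
        writer.writerow({col: ", ".join(row.get(col, [])) for col in COLUMNS})

# The example annotation from this page, with the Activity cell left empty.
EXAMPLE_ROW = {
    "Paper": ["PMC5411863"],
    "plant": ["Kundmannia anatolica"],
    "Country": ["Turkey"],
    "Plant Part": ["Fruits"],
    "Plant Compound": ["a-Pinene", "β-Pinene", "α-Thujene", "l-Limonene", "α-Phellandrene"],
}

buffer = io.StringIO()
write_annotations([EXAMPLE_ROW], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of `io.StringIO`, open it with `newline=""` and `encoding="utf-8"` so that the Greek letters in compound names survive and no blank lines appear between rows.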