Aroma the Game - petermr/CEVOpen GitHub Wiki

Aroma the Game

An experiment in modern scientific informatics to engage citizens of many ages in understanding and recording the scientific process, based on semantic publications and dictionaries. Anyone will enjoy this but especially those who are interested in herbal teas, medicines, aromatherapy, gardening/horticulture, agriculture, green science, cooking and much more. No previous science is required but we welcome open minds.

Background

Gita Yadav has offered to show her research on plant science to colleagues at St. Edmund's College, Cambridge. We are preparing this as a game, with informatics prosthetics which all participants can run. The game will be run in May 2021 and will probably be a mixture of virtual and real-life sessions. We've tried to build in flexibility. There is a lot that is new, so it's likely that we'll be continually making adjustments! We've been asked to fill a Saturday morning with tea/coffee (or other plants!) in the middle. We plan for about 30 attendees, organized in teams, but also with casual observers online and in real life.

Essential oils

"Essential" relates to volatile organic compounds (VOCs), which are often obtained as oils (e.g. "oil of cloves/garlic/bergamot/oregano ..."). Please read about them in Wikipedia (https://en.wikipedia.org/wiki/Essential_oil) and get a feel for the terminology. Many common garden plants (e.g. herbs) are rich in essential oils. NOTE: Essential oils are not by definition safe or beneficial to humans, and we are not making any recommendations.

Goals

There are many high-level aims:

to showcase the scientific process and how it is reported in the literature
to show how diverse plants are and how they use volatile chemicals for many purposes
to show the international, multicultural and multilingual nature of science
to introduce the power of Wikidata ("data from Wikipedia") and the precision of annotation
to show the power of the aggregated scientific literature and the need for it to be free for humans and machines

Procedure

We are asking you to find and interpret articles on plants, their essential oils and the uses. Fortunately the actual science has a clear framework which can be gamified.

Game components

In real life we'd use a board and pieces/cards because it's fun and we'll simulate these online. We'll use:

tools to search EuropePubMedCentral
a set of scientific articles on essential oils extracted from plants
semantic dictionaries of
- relevant plants (ca. 700)
- chemical compounds in essential oils (ca. 1100)
- countries (ca. 300)
- medicinal activities (ca. 150)
- plant parts (seeds, twigs...) (150)
local and remote software for searching the articles
an electronic table and map of game progress (in a social meeting tool such as Jitsi, Zoom, GatherTown, etc.)

We'll create everything in Python and may invite players to use their own laptops or online devices (not mobile-friendly).

Your components

MiniCorpus

You will have a corpus of (maybe 180) tested articles on plants, their processing, the oils, their chemicals and medicinal activities. These are randomly selected from the 250,000 articles in the literature.

Dictionaries

The core semantic dictionaries are:

Country
(Oil)Plant
PlantPart
ChemicalCompound
Medicinal or anti-organism activity
Invasive Plants

Section

You can choose which part of the article you search in. Major sections are:

article title
abstract
introduction
methods/experimental
results and discussion

suggested game plan

Let's assume 4 teams of 6 people. Each team has a "room" where they can meet. The team should have the following capabilities (maybe several in the same person):

a hacker - who can run the software
a bioscientist - who is comfy with scientific papers
a Wikidata wizard (ok, someone who has run a SPARQL query there and understands Wikidata IDs)
a EuropePMC expert who can search the site using standard (Boolean) queries
a captain
a scribe

manual

This is to show participants the science behind the informatics. It will illustrate where terms occur in the document.

A. Search EPMC. Ask the teams to run 6 simple queries and report the number of hits. Typical examples:

papers with Ocimum basilicum
papers with Rajasthan OR Kerala
papers with rhizomes NOT seeds
papers with carvacrol AND eugenol
papers with Cambridge AND NIPGR
papers on essential oils in 2020

The winner must get (roughly correct) numbers and do it in the shortest time. This will also help the teams to organize themselves. The queries can be shared out within the team, and it can be a straight race between the teams.

I'm guessing that a well-organized team can do this in 10 mins. (I've done this before with highlighters, which works very nicely, but is probably not so good online.) I think the first time through might take 20 mins, as some people will struggle with the format of the paper.

Might be useful to have a second round...
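For checking the teams' answers in part A, the hit counts could also be fetched programmatically. The following is a minimal sketch, assuming the public Europe PMC REST search service and its JSON `hitCount` field; the function names `build_search_url` and `count_hits` are illustrative and not part of any game software, and the way each example is turned into a query string is left to the caller.

```python
import json
import urllib.parse
import urllib.request

# Europe PMC REST search endpoint (public web service).
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query: str) -> str:
    """Return the search URL for a Boolean query, asking for JSON output.

    pageSize=1 keeps the response small: only the hit count is needed.
    """
    params = {"query": query, "format": "json", "pageSize": 1}
    return EPMC_SEARCH + "?" + urllib.parse.urlencode(params)


def count_hits(query: str) -> int:
    """Fetch the number of hits for a query (needs network access)."""
    with urllib.request.urlopen(build_search_url(query), timeout=30) as response:
        return int(json.load(response)["hitCount"])
```

A referee could then call, for example, `count_hits("Rajasthan OR Kerala")` and compare the result with the number each team recorded. Counts change as the literature grows, so only rough agreement should be expected.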

B. Each team gets 6 PDF papers and must search them for:

the country where the work was done
the plant used
the part of the plant used
the 3 most abundant compounds in the oil
the reported activity

(If there are several in a category, the first is taken)

The answers, as Wikidata IDs, are put in a spreadsheet. The organizers may use the spreadsheet to display the results with added information such as pictures and maps.
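One possible shape for a spreadsheet row is sketched below. It assumes one row per paper with the five categories from the list above (three columns for the compounds); the class name `PaperAnswer`, the column names and the `paper` label are hypothetical. The only fact relied on is that Wikidata item IDs have the form "Q" followed by a number, which allows a simple check for typing mistakes.

```python
import csv
import io
import re
from dataclasses import asdict, dataclass, fields

# Wikidata item IDs are "Q" followed by a positive integer, e.g. Q42.
WIKIDATA_ID = re.compile(r"Q[1-9][0-9]*")


def is_wikidata_id(value: str) -> bool:
    """True if the value looks like a Wikidata item ID (format check only)."""
    return WIKIDATA_ID.fullmatch(value) is not None


@dataclass
class PaperAnswer:
    """One team's answers for one paper (hypothetical column layout)."""
    paper: str        # any label for the paper, not a Wikidata ID
    country: str
    plant: str
    plant_part: str
    compound_1: str
    compound_2: str
    compound_3: str
    activity: str

    def invalid_fields(self):
        """Names of the answer columns that do not look like Wikidata IDs."""
        return [
            name for name, value in asdict(self).items()
            if name != "paper" and not is_wikidata_id(value)
        ]


def to_csv(answers) -> str:
    """Serialize answers to CSV text that any spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(PaperAnswer)])
    writer.writeheader()
    for answer in answers:
        writer.writerow(asdict(answer))
    return buffer.getvalue()
```

The check only confirms the format of an ID; whether the ID really refers to the right plant or compound still has to be judged by a person.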

analysis and discussion

How well did teams do? Did they understand it? Do they realise that grad students have to do this manually on thousands of papers?

with tools

The teams can inspect the dictionaries.

We'll let them run manual searches for practice. If they use Jupyter, they can configure the facets of the search. The goal is to find papers with COUNTRY, PLANT, PLANT_PART, COMPOUND, ACTIVITY that nobody else has found. This is rather like the current TV game Pointless. Currently I suggest:

each team creates a query for EPMC for 50 papers. They do this to try to optimize the likelihood of novel papers. They can use the search results on OIL186 to guess which terms would be useful. Note that a common term would be likely to be guessed by others, a rare one might not find any papers.
they download them, analyze them with standard dictionaries and create a spreadsheet (Pandas?) of the results
the referees then compare spreadsheets (with software) and eliminate duplicates and announce the winner.

Probably repeat several times. This means that more than one team can win.
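The referees' comparison step could look roughly like the sketch below. It assumes that each team submits a list of paper identifiers and that a team scores one point for every paper no other team submitted; both the scoring rule and the function names are assumptions for illustration, not the agreed rules of the game.

```python
from collections import Counter


def unique_finds(team_papers):
    """Map each team to the set of papers that no other team found.

    team_papers: dict of team name -> iterable of paper identifiers.
    Duplicates within one team's own list are counted once.
    """
    counts = Counter(
        paper for papers in team_papers.values() for paper in set(papers)
    )
    return {
        team: {paper for paper in set(papers) if counts[paper] == 1}
        for team, papers in team_papers.items()
    }


def winners(team_papers):
    """Return the team(s) with the most unique papers; ties share the win."""
    scores = {team: len(found) for team, found in unique_finds(team_papers).items()}
    best = max(scores.values(), default=0)
    return sorted(team for team, score in scores.items() if score == best)
```

The same idea would work on tuples of (COUNTRY, PLANT, PLANT_PART, COMPOUND, ACTIVITY) instead of paper identifiers, if the referees prefer to reward novel combinations rather than novel papers.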

Biomedical Informatics

Much science is published as articles (papers) in journals in a form (PDF) that specialist scientists can read, although every journal is irritatingly different. Machines can't easily understand PDF, so there's a format (XML) that is designed for them (we use "semantic" to mean machine-understandable).

Current search engines are not optimised for detailed scientific questions. Google doesn't particularly care about searching science and nor do most scholarly publishers. There's a high rate of false positives. But the biomedical community has an impressive tool, PubMedCentral, which contains the text of all Open Access biomedical articles. Fortunately they also index a lot of the plant literature and they create JATS-XML for machines (we owe NIH our thanks). But it's only partially semantic; it knows what the sections are for (abstract, method, results, etc.) but not what individual words mean ("rose" could be a flower (Noun) or "ascended" (Verb)). So we create "dictionaries" to annotate them.
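The idea of searching one section with a dictionary can be illustrated with a small sketch. It is not the AMI implementation: the XML fragment is hand-written (using the JATS element names `article-title`, `abstract`, `sec` and `title`), the dictionary is invented, and its identifiers are placeholders rather than real Wikidata IDs.

```python
import re
import xml.etree.ElementTree as ET

# A tiny hand-written fragment with JATS element names; not a real article.
SAMPLE_JATS = """
<article>
  <front><article-meta>
    <title-group><article-title>Essential oil of a rose cultivar</article-title></title-group>
    <abstract><p>The oil yield rose after distillation of the petals.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Methods</title><p>Petals were steam distilled.</p></sec>
  </body>
</article>
"""

# Hypothetical dictionary: term -> placeholder identifier (NOT real Wikidata IDs).
PLANT_PART_DICT = {"petals": "ID_PETAL", "rhizomes": "ID_RHIZOME"}


def section_text(root, section):
    """Return the plain text of 'title', 'abstract' or a body section by its heading."""
    if section == "title":
        nodes = root.findall(".//article-title")
    elif section == "abstract":
        nodes = root.findall(".//abstract")
    else:
        nodes = [
            sec for sec in root.findall(".//body/sec")
            if (sec.findtext("title") or "").strip().lower() == section
        ]
    return " ".join(" ".join(node.itertext()) for node in nodes)


def annotate(text, dictionary):
    """Find whole-word, case-insensitive dictionary terms: (offset, word, identifier)."""
    hits = []
    for term, ident in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), ident))
    return sorted(hits)
```

With `root = ET.fromstring(SAMPLE_JATS)`, calling `annotate(section_text(root, "methods"), PLANT_PART_DICT)` finds "Petals" in the methods section only. Note that a plain word match like this cannot tell the two meanings of "rose" apart; a dictionary entry for the plant would match the verb in the abstract as well, which is why the choice of section and of dictionary terms matters.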

AMI/Wikidata dictionaries

AMI ("scholarly amanuensis") is our software for getting, searching and annotating scientific literature. It's created for serious work and can analyse thousands of articles in an hour. It consists of:

(py)getpapers tool to search EuropePMC
ami section to divide paper into sections
amidict to create dictionaries from Wikidata
ami search to search with dictionaries and regular expressions
display/analysis tools