Dictionary: Overview - petermr/CEVOpen GitHub Wiki
This overview is a modified version of the dictionary overview given in the openVirus
project.
The purposes of Dictionaries in the CEVOpen project are:
- to identify words and phrases ("entities") in the documents (running text and images).
- to provide (computable) links to their meaning and context ("ontologies").
- to collect a subset of terms representing a high-level concept ("species", "pests", "chemical compound",...).
The benefits include:
- understanding the meanings of words.
- background reading.
- aggregation ("searching") for the same or related entities in the corpus (collection of documents).
- building computable knowledge networks/graphs.
- classifying documents.
This can be described as ontological annotations in semantic networks.
There are many established uses of such annotations:
We are often put off by unfamiliar terms, e.g. "trichome". Wikipedia has an article on trichomes (https://en.wikipedia.org/wiki/Trichome):
Trichomes (/ˈtraɪkoʊmz/ or /ˈtrɪkoʊmz/), from the Greek τρίχωμα (trichōma) meaning "hair", are fine outgrowths or appendages on plants, algae, lichens, and certain protists. They are of diverse structure and function. Examples are hairs, glandular hairs, scales, and papillae.
Shown as a mouseover or footnote, such a definition can dramatically improve reading speed.
Annotations are easily aggregated in indexes or search engines.
People may confuse trees (a group of diverse organisms) with plants (one of life's kingdoms which includes land plants and certain algae).
As an example from Wikipedia (https://en.wikipedia.org/wiki/Phytophthora_infestans)
Phytophthora infestans is an oomycete or water mold, a microorganism that causes the serious potato and tomato disease known as late blight or potato blight.
This sentence links potato blight to Phytophthora infestans. Indeed we can write:
- Potato blight isA disease
- Potato blight isCausedBy Phytophthora infestans
Ami's annotations allow software to discover and use such annotations. For example, we can find all diseases that are caused by oomycetes (isCausedBy).
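The statements above are subject-predicate-object triples, and the oomycete query is a small join over them. The following is a minimal sketch only, assuming a plain in-memory list of triples. The third triple is read off the Wikipedia sentence quoted above, and the function names objects and diseases_caused_by are hypothetical (this is not how ami stores or queries annotations).

```python
# Triples (subject, predicate, object) taken from the statements above.
TRIPLES = [
    ("potato blight", "isA", "disease"),
    ("potato blight", "isCausedBy", "Phytophthora infestans"),
    ("Phytophthora infestans", "isA", "oomycete"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def diseases_caused_by(organism_class, triples=TRIPLES):
    """Find diseases caused by at least one agent that isA `organism_class`."""
    diseases = {s for s, p, o in triples if p == "isA" and o == "disease"}
    return sorted(
        d for d in diseases
        if any(organism_class in objects(agent, "isA", triples)
               for agent in objects(d, "isCausedBy", triples))
    )


print(diseases_caused_by("oomycete"))
```

A real knowledge graph such as Wikidata answers the same kind of question with far more triples, but the shape of the query is the same.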
What's "moss"? https://en.wikipedia.org/wiki/Moss_(disambiguation) tells us:
Moss is a small, soft, non-vascular plant that does not have flowers or seeds.
Moss may also refer to:
- Moss (language), a musical language designed by Jackson Moore
- Moss Bros, a menswear outfitters in the United Kingdom
- Moss Brothers Aircraft, an English aircraft manufacturer (1936–1955)
- Moss FK, a Norwegian football club
... and many more ...
We can label the different concepts by using a unique identifier system as in Wikidata.
Dictionaries have a simple format, best supported by XML or JSON (currently mainly XML). This defines certain elements and attributes (in the form <element att1="attval1" att2="attval2" ... >). We are developing validation software. In general:
- unknown elements are ignored
- <desc>, <entry> and <alternative> are optional and repeatable
- all attributes except dictionary/@title are optional (at this stage)
- order of elements and attributes is irrelevant (but worth making pretty and consistent)
The <dictionary> element is the root element. It contains the title, which MUST be a single word and MUST be the base of the filename, e.g. pests.xml must have the structure
<dictionary title="pests">
...
</dictionary>
There is no XML namespace.
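The title rule can be checked mechanically. The sketch below is not the validation software mentioned above. It is a small illustration using Python's standard xml.etree.ElementTree, with a hypothetical function name check_root and the assumption that "single word" means "contains no whitespace".

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_root(xml_text, filename):
    """Return a list of problems with the root element (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    # No namespace is expected, so the tag must be exactly "dictionary".
    if root.tag != "dictionary":
        problems.append(f"root element is <{root.tag}>, expected <dictionary>")
        return problems
    title = root.get("title")
    if title is None:
        problems.append("dictionary/@title is missing")
    elif len(title.split()) != 1:
        # Assumption: a "single word" is a non-empty string without whitespace.
        problems.append(f"title {title!r} is not a single word")
    elif title != Path(filename).stem:
        problems.append(f"title {title!r} does not match filename {filename!r}")
    return problems


print(check_root('<dictionary title="pests"></dictionary>', "pests.xml"))
print(check_root('<dictionary title="pests"></dictionary>', "weeds.xml"))
```

The first call reports no problems; the second reports the mismatch between title and filename.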
There is a header of zero or more <desc> description elements, though we may enforce mandatory elements later. These can describe metadata such as dates, maintenance, provenance, authors etc. They are not yet standardised but will be. Here is a snippet from the eoPlant dictionary (contains plant species names):
<dictionary title="eoPlant">
<desc>A dictionary of 1678 plant names extracted mentioned in the 186 test articles downloaded from PubMed. Of the 1678 entries, 1567 had their names normalized and tagged with corresponding Wikidata IDs</desc>
<authors>Dr. Gitanjali Yadav, Ph.D., Computational Biology Laboratory, NIPGR National Institute of Plant Genome Research, Lecturer, University of Cambridge Dept. of Plant Sciences; Ambarish Kumar</authors>
<contributors>Shruthi Mohan; Emanuel Arruda, President https://www.verriclar.com, https://www.verriclar.com.br/; Peter Murray-Rust, Reader Emeritus in Molecular Informatics, Unilever Centre, Dept. Of Chemistry University of Cambridge</contributors>
<datasource>http://www.nipgr.ac.in/Essoildb/</datasource>
</dictionary>
The main components of a dictionary are its entries, whose format is still evolving slightly. An entry is a well-defined object which can normally be mapped / linked to a Wikidata item. This gives it a unique identifier (Q-number), removing the need to maintain identifiers. A typical entry (with the new element alternative and more use of desc with new attributes):
<dictionary title="miniterpenes">
<entry term="borneol" wikipedia="borneol" wikidata="Q27089413" name="(-)-borneol" description="chemical compound" id="CM.myterpenes.0" term.hi="बोर्निऑल" term.it="borneolo" term.zh="冰片" regex="(\([+-]\)\-)?[Bb]orneol">
<desc date="2020-07-22">added Bornyl-alcohol synonym</desc>
<alternative>(-)-Bornyl alcohol</alternative>
</entry>
...
</dictionary>
- the term is the unique lexical string (word) defining the entry. Terms are always lowercase and always start with a letter. The term may or may not be the linguistic entity in documents.
- the name is the preferred name for the term. It is case-sensitive and will often occur in text; name and term may or may not be identical words.
- term.xx can occur as language equivalents, where xx is the appropriate 2- or 3-letter language code. See https://en.wikipedia.org/wiki/ISO_639-2. These can often be picked up from the links to Wikipedia pages from a Wikidata item (bottom of page). (Experimental).
- regex is a regular expression for locating possible matches in text. This one finds (-)-borneol, (+)-borneol, and borneol.
- description is a human-readable string describing the entry. However it is often created directly from Wikidata and may be used for grouping or disambiguation.
- wikipedia is the name of the Wikipedia page. It is often the term (for single words). It may not have spaces and may have escaped punctuation. It resolves to (e.g. for EN) https://en.wikipedia.org/wiki/<wikipedia>.
- wikidata is the identifier of the Wikidata item, always of the form Qddddd.. (occasionally Pddd...). It resolves to https://wikidata.org/wiki/<wikidata>. There is only one identifier for a Wikidata item and the relationships and graphs are language-independent.
- id is a local autogenerated ID and is not stable.
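The attributes above can be read with any XML parser. The following is a hedged sketch using Python's standard library, not part of ami: it parses a shortened copy of the borneol entry, collects the term.xx language equivalents, builds the two link targets from the URL patterns given above, and runs the regex over a made-up sample string. The regex is written here with the character class as [Bb]orneol so that it matches "borneol".

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the borneol entry; only some attributes are kept.
ENTRY_XML = r'''<entry term="borneol" wikipedia="borneol" wikidata="Q27089413"
    name="(-)-borneol" term.it="borneolo" term.zh="冰片"
    regex="(\([+-]\)\-)?[Bb]orneol"/>'''

entry = ET.fromstring(ENTRY_XML)

# Language equivalents: attributes named term.xx, keyed by the language code xx.
translations = {
    key.split(".", 1)[1]: value
    for key, value in entry.attrib.items()
    if key.startswith("term.")
}

# Link targets, following the URL patterns described in the list above.
wikipedia_url = "https://en.wikipedia.org/wiki/" + entry.get("wikipedia")
wikidata_url = "https://wikidata.org/wiki/" + entry.get("wikidata")

# Locate possible matches in running text (made-up sample string).
pattern = re.compile(entry.get("regex"))
text = "sample text: (+)-borneol, (-)-borneol and Borneol"
matches = [m.group(0) for m in pattern.finditer(text)]

print(translations)
print(wikipedia_url)
print(wikidata_url)
print(matches)
```

Because the optional group comes first in the pattern, the stereo prefix is included in the match whenever it is present.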
We are introducing 2 children of entry:
- <desc> has the same semantics as <desc> for dictionary.
- <alternative>. These are alternative lexical forms for the term. There are deliberately no semantics. They may or may not be exact synonyms, and may or may not be narrower/broader terms. These ontological relations can often be obtained from Wikidata.
- dictionaries will provide search terms (term, name, regex, alternative) for ami, Lucene/Solr or KNIME.
- dictionaries provide a link to Wikipedia pages or Wikidata Items. Annotation software can create hyperlinks for humans to follow.
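Both uses can be illustrated together. This is a rough sketch under stated assumptions, not the ami, Lucene/Solr or KNIME implementation: the function names search_terms and annotate are hypothetical, matching is a simple case-insensitive literal search, the regex attribute is left out for brevity, and the sample text is made up.

```python
import re
import xml.etree.ElementTree as ET

DICTIONARY_XML = """
<dictionary title="miniterpenes">
  <entry term="borneol" name="(-)-borneol" wikidata="Q27089413">
    <alternative>(-)-Bornyl alcohol</alternative>
  </entry>
</dictionary>
"""


def search_terms(entry):
    """Collect the literal search strings of one entry: term, name, alternatives."""
    candidates = [entry.get("term"), entry.get("name")]
    candidates += [alt.text for alt in entry.findall("alternative")]
    return {c.strip() for c in candidates if c and c.strip()}


def annotate(text, dictionary):
    """Wrap each dictionary match in an HTML link to its Wikidata item."""
    lookup = {}
    for entry in dictionary.findall("entry"):
        qid = entry.get("wikidata")
        if qid:
            for s in search_terms(entry):
                lookup[s.lower()] = qid
    if not lookup:
        return text
    # Longest strings first, so "(-)-borneol" is preferred over "borneol".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered), re.IGNORECASE)

    def link(match):
        qid = lookup[match.group(0).lower()]
        return f'<a href="https://wikidata.org/wiki/{qid}">{match.group(0)}</a>'

    return pattern.sub(link, text)


dictionary = ET.fromstring(DICTIONARY_XML.strip())
print(annotate("borneol and (-)-Bornyl alcohol", dictionary))
```

Both strings end up linked to the same Wikidata item, which is what lets a reader or a search engine treat them as the same entity.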
Conventional dictionaries take a lot of effort to create and maintain, particularly if they contain ontological relationships. Often only specialist maintainers can do this. ContentMine dictionaries remove this problem by reducing it to a selection of relevant terms. Often this selection has already been made, in Wikipedia pages or other collections. Many dictionaries are thus "views" (subsets) of Wikidata. There are several ways of doing this (see other sections of this wiki).
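Once a selection of terms exists, turning it into a dictionary file is mostly mechanical. As a minimal sketch, assuming the selection is already available as (term, Wikidata ID) pairs: the function name build_dictionary is hypothetical, the single pair is taken from the borneol example, and nothing here queries Wikidata or Wikipedia.

```python
import xml.etree.ElementTree as ET


def build_dictionary(title, selection):
    """Build a <dictionary> element from (term, wikidata_id) pairs."""
    root = ET.Element("dictionary", {"title": title})
    for term, qid in selection:
        # Terms are stored in lowercase, as described for the term attribute.
        ET.SubElement(root, "entry", {"term": term.lower(), "wikidata": qid})
    return root


# A one-term "view": the term and Q-number come from the borneol example.
dictionary = build_dictionary("miniterpenes", [("borneol", "Q27089413")])
ET.indent(dictionary)  # pretty-printing; requires Python 3.9 or later
print(ET.tostring(dictionary, encoding="unicode"))
```

Attributes such as name, description and the term.xx equivalents could be added in the same loop once they have been looked up.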