Our Project and Machine Learning

Our goal is to build an automatic system that reads the scientific literature and extracts structure and meaning (semantics) from it. The literature is largely semi-structured text and needs preprocessing before it can be analysed automatically. This is Natural Language Processing (NLP), which relies heavily on Machine Learning (ML) (https://en.wikipedia.org/wiki/Machine_learning). We will be using both Supervised Learning (using dictionaries) and Unsupervised Learning (often for initial exploration).

We build the system from components, and we therefore work as a team. Each person's contribution relies on other people's, and our discipline is to plan in public and report at very regular intervals ("standups"). The key goals are that contributions must be:

  • documented
  • tested
  • reliable
  • interoperable
  • maintainable and upgradable

We want each person to benefit from their time by learning new detailed skills (e.g. NLP) and new transferable skills (teamwork, testing, presentations). In return, they will contribute a modular piece of the system (code, data, documentation, annotation, presentation/outreach, etc.) that can be used by everyone else and can be maintained and adopted. Where possible we automate our processes.

In general, an ML workflow consists of:

  1. finding data
  2. cleaning data
  3. annotating it (classification, feature extraction)
  4. analysis

In our case, 1) is largely solved (we query existing repositories) and 2) is relatively straightforward as the input is marked-up XML. Our main need for ML is in 3) and 4). The sequence is often iterative: when we have classified a subset we may refine our questions and repeat 1) with a more precise query. It is important to automate the workflow as much as possible so we can iterate rapidly and without errors.

A typical sequence might be:

  1. search EPMC for "invasive plant species". This is usually very crude (it just matches the precise (stemmed) words) and will give thousands of hits. Some of these will be false positives (they may contain phrases such as "we omitted invasive plant species", or the phrase may only appear in the title of a reference). 1a) we download a subset (say 500) to analyse in detail.
  2. we create a Bag Of Words (e.g. with a Python Counter) which indicates the most frequent, and often most important, terms (see the sketch after this list). Some of these will be valuable terms for refining our query. We may make a dictionary of these words (in collaboration with the project owner of "invasive plant species"). These will help to identify papers that don't use the precise terms. We then re-run the search (step 1).
  3. we enhance our dictionary with links to Wikidata and make it semantic.
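The sketch below illustrates steps 1 and 2 of this sequence: a crude EPMC query followed by a Bag Of Words built with a Python Counter. The REST endpoint, parameters and JSON field names are my reading of the Europe PMC web service rather than project code, so treat them as assumptions and check them against the current API documentation.

```python
# Sketch of steps 1-2: query Europe PMC (EPMC) and build a Bag Of Words
# with a Python Counter. Endpoint and field names are assumptions about
# the EPMC REST API, not confirmed project code.

import re
from collections import Counter

import requests

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search_epmc(query, page_size=100):
    """Return a list of hit records (title, id, ...) for a crude EPMC query."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    response = requests.get(EPMC_SEARCH, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["resultList"]["result"]

def bag_of_words(records, stopwords=frozenset({"the", "of", "and", "in", "a", "to"})):
    """Count word frequencies over the titles of the downloaded subset."""
    counter = Counter()
    for record in records:
        words = re.findall(r"[a-z]+", record.get("title", "").lower())
        counter.update(w for w in words if w not in stopwords)
    return counter

hits = search_epmc('"invasive plant species"', page_size=100)
print(bag_of_words(hits).most_common(20))   # candidate terms for refining the query
```

The frequent terms printed at the end are the raw material for the human-curated dictionary in step 2, which can then be linked to Wikidata in step 3.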

Where dictionaries/supervised learning don't work well we look for linguistic patterns. Currently, we are extracting phrases using NLTK-RAKE. These phrases have weights (scores) that help us identify new terms. (Some years ago we did the same with TensorFlow.) It is often not possible to say which method will work best without experimenting.
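As an illustration, the following sketch uses the rake-nltk package, which I take to be the "NLTK-RAKE" mentioned above; the exact package and settings used in the project may differ.

```python
# Unsupervised phrase extraction with RAKE, assuming the rake-nltk package
# (pip install rake-nltk). The example sentence is invented for illustration.

import nltk
from rake_nltk import Rake

# RAKE's defaults use NLTK's English stopword list and sentence tokenizer.
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

text = (
    "Invasive plant species alter soil nutrient cycling and reduce native "
    "plant diversity in temperate grasslands."
)

rake = Rake()                                  # default: English stopwords, all phrase lengths
rake.extract_keywords_from_text(text)
for score, phrase in rake.get_ranked_phrases_with_scores():
    print(f"{score:5.1f}  {phrase}")           # higher score = more likely a useful term
```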

Then we try to get figures (Recall, Precision, F-measure) that give some idea of how well this works.
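For example, if a step produces a list of extracted terms and we have a small hand-annotated "gold" set, the figures can be computed as below; the term lists are invented purely for illustration.

```python
# Minimal sketch of computing Precision, Recall and F1 for an extraction step
# against a hand-annotated gold set. Term lists are hypothetical examples.

def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = ["fallopia japonica", "impatiens glandulifera", "heracleum mantegazzianum"]
predicted = ["fallopia japonica", "impatiens glandulifera", "soil nutrient cycling"]
p, r, f = precision_recall_f1(predicted, gold)
print(f"Precision={p:.2f}  Recall={r:.2f}  F1={f:.2f}")   # 0.67, 0.67, 0.67
```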