THESIS strategy - petermr/CEVOpen GitHub Wiki
Record all results ON THE WIKI
Each intern should:
- have one or two dictionaries
- build the dictionary and record the process; chapter: "how I built dictionary X"
- build a minicorpus for TESTING the dictionary; chapter: "how I used (py)getpapers to build my minicorpus". The corpus should have many examples of enriched papers, e.g.
getpapers -q "invasive plant species"
- test the search tools (ami search or search_lib) with DICTIONARY against MINICORPUS. Inspect results manually. Use sections you are familiar with. Record each match in a CSV file, classified as below (see https://en.wikipedia.org/wiki/False_positives_and_false_negatives):
  - True positives (TP): matches I expected and found
  - False positives (FP): matches that are unexpected/wrong
  - False negatives (FN): matches I expected but that were not found
  - (True negatives (TN): matches I did not expect and that were not found. NOT relevant here.)
- refine your dictionary, e.g. remove terms that give false positives; use synonyms, etc.
- refine your search, e.g. use terms from the dictionary (cedrol OR thymol)
- repeat ... you now have a better minicorpus and better dictionary
- test dictionary against OTHER minicorpora. Does this give more false results?
- develop ways of displaying results (e.g. plots or tables or hyperlinks)
- develop a scientific hypothesis. "do different countries / continents have different chemistry?", "do invasive plant species have different compounds?"
- test the hypothesis. OR
- do a classification. "Can a machine distinguish sections (e.g. intro and method) by their contents?"
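The manual TP / FP / FN record described above can be turned into summary numbers. What follows is a minimal sketch, not part of AMI or pygetpapers: it assumes a CSV with hypothetical columns `paper_id`, `term` and `outcome` (one of TP, FP, FN), and computes precision, recall and F1. The sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter


def score(rows):
    """Summarise manually assigned outcomes (TP / FP / FN) as precision, recall, F1."""
    counts = Counter(row["outcome"].strip().upper() for row in rows)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    # Precision: what fraction of the matches found were correct?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: what fraction of the expected matches were found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented sample data; the column names are an assumption, not a project standard.
SAMPLE_CSV = """paper_id,term,outcome
paper_001,thymol,TP
paper_001,cedrol,TP
paper_002,thymol,FP
paper_003,cedrol,FN
paper_003,thymol,TP
"""

result = score(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(result)
```

To use it on a real record, replace `io.StringIO(SAMPLE_CSV)` with an opened CSV file. Re-running it after each round of dictionary refinement shows whether the changes reduced false positives without losing true positives.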
Update (2021-05-24)
Just to add aspects that you should touch on in the thesis. The goal is to create a tested, re-usable toolkit that is well documented for the next cohort of users. You are unlikely to have enough time to create novel domain results, but you will be able to demonstrate how, with more time and a larger corpus, this could be done.
- what you are all doing is important novel science. I would call it Phytochemical Informatics (my own title at Cambridge is Reader in Molecular Informatics).
- You are building and validating tools. This is a very important scientific activity and comparable to building (parts of) telescopes, microscopes or spectrometers.
- within your internship you are building these from scratch or from a prototype. You should:
  - show you are reasonably comprehensive within a domain (e.g. the majority of oil-producing plants)
  - have a strong ontological basis for your entries (mainly linked to Wikidata - explain this)
  - disambiguate where necessary
  - provide a range of search entries (synonyms, foreign identifiers, non-EN labels, etc.)
  - show the additional information that links to Wikidata can add (e.g. countries, chemical properties, etc.)

  (These will vary from dictionary to dictionary)
- tools are not useful if they are not used. How can you persuade others to use yours? This could be through examples, tutorials, videos, etc.
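The dictionary requirements listed above (Wikidata links, a range of search entries, disambiguation) can be illustrated with a small sketch. This is not the actual AMI dictionary format: the `Entry` class, its field names and the `Q_PLACEHOLDER_*` identifiers are all hypothetical, and the clashing synonym is invented so that the check has something to report.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary entry (hypothetical structure, not the AMI format)."""
    term: str                                   # preferred label
    wikidata_id: str                            # placeholder IDs below, NOT real Wikidata items
    synonyms: list[str] = field(default_factory=list)

    def search_terms(self) -> set[str]:
        """All strings that should match this entry, lower-cased."""
        return {t.lower() for t in [self.term, *self.synonyms]}


def ambiguous_terms(entries):
    """Return search terms claimed by more than one entry (need disambiguation)."""
    owners = defaultdict(set)
    for entry in entries:
        for term in entry.search_terms():
            owners[term].add(entry.wikidata_id)
    return {t: sorted(ids) for t, ids in owners.items() if len(ids) > 1}


# Illustrative entries; the shared synonym is a deliberate, invented clash.
entries = [
    Entry("thymol", "Q_PLACEHOLDER_1", ["2-isopropyl-5-methylphenol"]),
    Entry("cedrol", "Q_PLACEHOLDER_2", ["cedar camphor"]),
    Entry("camphor", "Q_PLACEHOLDER_3", ["cedar camphor"]),
]

print(ambiguous_terms(entries))
```

A check like this is one way to find entries that need disambiguating before the dictionary is used for searching. The real identifiers should of course come from Wikidata itself.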
Because all your experience is stored on GitHub (and can also be stored by NIPGR or Universities or Zenodo - we need to decide which) you don't need to put every detail in the thesis. E.g. you can describe in detail 20 entries in your dictionary, illustrating the benefits and problems, and then link to the GitHub repo. Make sure this is VERSIONED so that even if it's amended later the reader can retrieve the actual instance.
The current AMI allows detection and plotting of frequency of occurrence of terms. How good is this? Does AMI detect all the occurrences of a term? (You will have to grep the text manually or use a different tool.) It's possible there will be more enhancements, but don't rely on them. This type of work is slightly fuzzy.
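One way to check AMI's counts independently is a small script that counts whole-word occurrences of a term in plain text. This is a rough sketch only: the function names are hypothetical, the `**/*.txt` corpus layout is an assumption, and the word-boundary matching will behave differently from AMI's own matching (for example, it will not handle terms that begin or end with punctuation).

```python
import re
from pathlib import Path


def count_term(text, term):
    """Count case-insensitive whole-word occurrences of term in text."""
    # \b only works well for terms that start and end with a letter or digit.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
    return len(pattern.findall(text))


def count_in_corpus(root, term, glob="**/*.txt"):
    """Count occurrences per file under root (assumed plain-text layout)."""
    return {
        str(path): count_term(path.read_text(encoding="utf-8", errors="replace"), term)
        for path in sorted(Path(root).glob(glob))
    }


# Invented sample sentence; "methylthymol" is not a whole-word match for "thymol".
sample = ("Thymol and carvacrol were detected. The thymol content was high; "
          "methylthymol was absent.")
print(count_term(sample, "thymol"))  # 2
```

If these counts disagree with AMI's, record both numbers and inspect the differing papers by hand. The disagreement may come from AMI, from this script, or from the text extraction.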
Among the key things are:
- questioning. Don't make assumptions about how well EPMC or AMI works - they will certainly have bugs.
- carefulness.
- recording at the time. "Writing up afterwards" is a flawed strategy.