Testing dictionaries against corpus - petermr/CEVOpen GitHub Wiki

Testing dictionaries against corpus

I created an "invasive plant" minicorpus of 500 research articles using:

getpapers -q "invasive plant" -o corpora1 -x -p -k 500 -f corpora1/log.txt
I then ran the following commands:

a. ami -p "corpora1" section

b. ami -p "corpora1" search --dictionary country.xml invasive_plant.xml

c. ami -p "corpora1" search --dictionary plant_compound.xml invasive_plant.xml
I obtained the following data tables:

Country Invasive plant

Compound Invasive plant

plant_material_history

plant_genus
The invasive_plant, country, plant_material_history, plant_genus and plant_compound dictionaries are working well. On analysing the results, I found that some of the values also come from the "Reference" section, so I will be analysing the results from ami_gui.
Tested the city.xml dictionary against "corpora1". It did not work because of an XML formatting problem.

I have made several attempts to create the dictionary (from a text list and from SPARQL). However, it gives the following error:
Tested the eo_Gene.xml dictionary against "corpora1". It did not work because of an XML formatting problem.
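Since both failures are put down to XML formatting, a quick well-formedness check before running `ami` may help narrow things down. The sketch below uses only Python's standard `xml.etree.ElementTree`. The helper name `check_well_formed` and the two sample strings are hypothetical, and the `<dictionary>`/`<entry>` layout is an assumption about what the dictionary files look like, not a copy of city.xml or eo_Gene.xml. It only tests whether the XML parses; it does not check whether `ami` accepts the dictionary.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The message includes the line and column where parsing stopped.
        return False, str(err)


# Hypothetical samples; the element and attribute names are assumptions.
good = '<dictionary title="city"><entry term="Delhi" name="Delhi"/></dictionary>'
bad = '<dictionary title="city"><entry term="Delhi" name="Delhi"></dictionary>'

print(check_well_formed(good))
print(check_well_formed(bad))
```

For a real file, read its contents and pass the text to the same function; the reported line and column usually point near the unclosed tag or stray character.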
Testing and documenting the plant_compound dictionary against the "oil186" corpus.

Presently documenting a CSV file containing section-wise term counts, with false positives, false negatives, true positives and true negatives.
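One possible way to summarise such counts per section is with precision and recall. This is a minimal sketch, not part of the current workflow: the function name `precision_recall` is hypothetical and the numbers in the usage line are purely illustrative, not results from the "oil186" corpus.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for a measure whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only (not measured data).
print(precision_recall(8, 2, 2))
```

True negatives are not needed for these two measures, which is convenient because they are hard to count in free text.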
- Difficulties in search:
  1. Terms in italics are not counted.
  2. Formatting differs from paper to paper.
  3. Isomeric and non-isomeric forms of a compound are counted differently.
  4. Counts include matches from the "Reference" section.
  5. Chemically modified forms of a compound are counted too (e.g. 1-decanol is counted as well as 1-decanol acetate).
  6. Capitalised terms are not detected (e.g. acetone is detected but not Acetone).
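Two of these difficulties (items 5 and 6) can be illustrated outside `ami` with a small regular-expression sketch. It assumes item 5 means that "1-decanol" is also counted inside "1-decanol acetate"; if the intended meaning is different, the longest-match step below is not the right fix. The function `count_terms` and the sample sentence are hypothetical. This is not how `ami search` works internally, only a way to cross-check its counts.

```python
import re


def count_terms(text, terms):
    """Count case-insensitive, non-overlapping matches, preferring longer terms."""
    # Longest terms first, so "1-decanol acetate" wins over "1-decanol".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])("
        + "|".join(re.escape(t) for t in ordered)
        + r")(?![\w-])",
        re.IGNORECASE,  # "Acetone" and "acetone" both match
    )
    counts = {t: 0 for t in terms}
    lookup = {t.lower(): t for t in terms}
    for match in pattern.finditer(text):
        counts[lookup[match.group(1).lower()]] += 1
    return counts


# Hypothetical sample sentence.
text = "Acetone and acetone were found; 1-decanol acetate exceeded 1-decanol."
terms = ["acetone", "1-decanol", "1-decanol acetate"]
print(count_terms(text, terms))
# → {'acetone': 2, '1-decanol': 1, '1-decanol acetate': 1}
```

Sorting by length before building the alternation is what keeps the modified compound from being counted twice; the lookarounds stop a term from matching inside a longer hyphenated or alphanumeric name.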