Testing dictionaries against corpus - petermr/CEVOpen GitHub Wiki

Testing dictionaries against corpus

  1. I created an "invasive plant" minicorpus of 500 research articles using:

    getpapers -q "invasive plant" -o corpora1 -x -p -k 500 -f corpora1/log.txt

  2. I ran the following commands:

    a. ami -p "corpora1" section

    b. ami -p "corpora1" search --dictionary country.xml invasive_plant.xml

    c. ami -p "corpora1" search --dictionary plant_compound.xml invasive_plant.xml

  3. I obtained the following data tables:

    Country Invasive plant

    Compound Invasive plant

    plant_material_history

    plant_genus

  4. The invasive_plant, country, plant_material_history, plant_genus and plant_compound dictionaries are working well. On analysis, I found that some values also come from the "Reference" section, so I will analyze the results from ami_gui.

  5. I tested the city.xml dictionary against "corpora1". It did not work because of an XML formatting problem.

    city

    I made several attempts to create the dictionary (from a text list and from SPARQL). However, each attempt gives the following error:

    mode

    Dictionary create

  6. I tested the eo_Gene.xml dictionary against "corpora1". It did not work because of an XML formatting problem.
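
    The XML formatting problems reported for city.xml and eo_Gene.xml could be narrowed down with a well-formedness check before running the search. The following is a minimal sketch using Python's standard library, not part of ami. The function name `check_well_formed` and the sample dictionary snippets are assumptions made for illustration; they are not the real content of either dictionary. An unescaped `&` in a term is used here only as one plausible cause of such an error.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, description)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # position is a (line, column) tuple reported by the parser
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, ""


# Hypothetical dictionary snippets, NOT the real city.xml content.
good = '<dictionary title="city"><entry term="Port of Spain"/></dictionary>'
# A bare "&" must be written as "&amp;" in XML, so this one fails to parse.
bad = '<dictionary title="city"><entry term="Trinidad & Tobago"/></dictionary>'

print(check_well_formed(good))
print(check_well_formed(bad))
```

    To check a real dictionary, the text of the file can be read and passed to the same function; the reported line and column point to where the parser gave up.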

  7. I am testing and documenting the plant_compound dictionary against the "oil186" corpus.

    oil186 full data table

    I am currently documenting a CSV file containing section-wise term counts, together with the false positives, false negatives, true positives and true negatives.
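
    Once the per-section counts are recorded, precision and recall can be derived from them. The sketch below assumes a CSV layout with the columns `section`, `tp`, `fp` and `fn`; that layout, the function names and all the numbers are made up for illustration and are not taken from the oil186 results.

```python
import csv
import io


def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with a zero denominator is 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
SAMPLE_CSV = """section,tp,fp,fn
Introduction,8,2,2
Results,6,0,4
References,0,5,0
"""


def summarize(csv_text):
    """Return a list of (section, precision, recall) tuples."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tp, fp, fn = int(row["tp"]), int(row["fp"]), int(row["fn"])
        rows.append((row["section"], *precision_recall(tp, fp, fn)))
    return rows


for section, precision, recall in summarize(SAMPLE_CSV):
    print(f"{section}: precision={precision:.2f} recall={recall:.2f}")
```

    True negatives are not needed for these two measures, which is convenient because they are hard to count in free text.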

    • Difficulties in search:
      1. Terms in italics are not counted.
      2. Formatting differs from paper to paper.
      3. Isomeric and non-isomeric forms of a compound are counted differently.
      4. Counts include matches from the "Reference" section.
      5. Chemically modified forms of a compound are also counted (e.g. counting 1-decanol acetate as well as 1-decanol).
      6. Capitalized terms are not detected (e.g. acetone is detected but Acetone is not).
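
Difficulties 4, 5 and 6 above can be illustrated with a small post-processing sketch in Python. This is not how ami search works internally; it only shows one way to count terms case-insensitively, skip a reference section, and ignore a term when it is followed by a derivative word. The function names, the section names, the `derivative_suffixes` list and the sample text are all assumptions for illustration.

```python
import re


def build_pattern(term, derivative_suffixes=("acetate",)):
    """Compile a case-insensitive pattern for a whole term.

    The term must not touch a letter, digit or hyphen on either side, and
    must not be followed by one of the (hypothetical) derivative suffixes.
    """
    suffixes = "|".join(re.escape(s) for s in derivative_suffixes)
    not_derivative = rf"(?!\s+(?:{suffixes})\b)" if suffixes else ""
    return re.compile(
        rf"(?<![\w-]){re.escape(term)}(?![\w-]){not_derivative}",
        re.IGNORECASE,
    )


def count_by_section(sections, term, skip=("references",)):
    """Count matches of term per section, leaving out skipped sections."""
    pattern = build_pattern(term)
    return {
        name: len(pattern.findall(body))
        for name, body in sections.items()
        if name.lower() not in skip
    }


# Made-up text, for illustration only.
sections = {
    "Introduction": "Acetone and acetone were used.",
    "Results": "1-decanol and 1-decanol acetate were detected.",
    "References": "Smith, Acetone review.",
}

print(count_by_section(sections, "acetone"))
print(count_by_section(sections, "1-decanol"))
```

In this sketch both "Acetone" and "acetone" are counted, the reference section is left out, and "1-decanol acetate" does not add to the count for "1-decanol". Whether a modified form should count toward its parent compound is a judgment call, so the suffix list is left as a parameter.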