Weekly reports - larsgw/contentmine-fellowship GitHub Wiki

Converting Blogger HTML to markdown every week is too much for me. Weekly reports and other posts are solely posted here.

Week 31

I was on holiday for the last two weeks, so I was not able to do all to much and I missed the second webinar, but I did read some on-topic papers about NLP. To find the occurrence of certain species and chemicals seems fairly easy to me, with the right dictionaries of course, but to find the relation between a species and a chemical in an unstructured sentence can't be done without understanding the sentence, and without a program to do this, you would have to extract the data manually. I first wanted to do this with RegExp, but now that I have read the OSCAR4 and ChemicalTagger papers, I know there are other, better options.

Especially OSCAR is said to be highly domain-independent. My conclusion is that this, and with this perhaps a modified version of ChemicalTagger can be used in my project to find the relation between occurring species, chemicals and diseases.

If this doesn't work, for some reason (i.e. it takes to long to build dictionaries, the grammar is to complex), there is always Google's SyntaxNet and it's English parser Parsey McParseface. Be as it may, this seems a bit over the top. They are made to figure out the grammar of less strict sentences, and therefore they have to use more complex systems, such as a "globally normalized transition-based neural network". This is needed to choose the best option in situations like the following.

One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:

The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.

If it's necessary, however, it's always an option.

Week 28

I made a test JavaScript file for NodeJS to practice it, and prepare more for the project. The result (here) is a small program with some options:

Options:

  -h, --help     output usage information
  -V, --version  output the version number
  -o <path>      File to write data to
  -c <path>      File to print
  -t <string>    Text to print
  -d <string>    Data to write to file
  -v <string>    Verbosity: DEBUG, INFO, LOG, WARN or ERROR

I made a custom logger and a program to parse passed arguments. The option to parse with the commander module, as seen here:

var program = require(commander);

program.parse(process.argv);

which is used in the quickscrape code as well, didn't really work for me. Maybe I did something wrong.

When I run the following code

var program = require('commander');

program
  .option('-t',
      'Test A')
  .option('-T',
      'Test B')
 .parse(process.argv);

console.log(program);

like this

node file.js -t 'Test_A' -T 'Test_B'

the console outputs this:

{ commands: [],
  options: 
  [ { flags: '-t',
required: 0,
optional: 0,
bool: true,
long: '-t',
description: 'Test A' },
    { flags: '-T',
  required: 0,
  optional: 0,
  bool: true,
  long: '-T',
  description: 'Test B' } ],
  _execs: {},
  _allowUnknownOption: false,
  _args: [],
  _name: 'file',
  Command: [Function: Command],
  Option: [Function: Option],
  _events: { '-t': [Function], '-T': [Function] },
  rawArgs: 
  [ 'node',
    '/home/larsw/public_html/cm-repo/js/file.js',
    '-t',
    'Test_A',
    '-T',
    'Test_B' ],
  T: true,
  args: [ 'Test_A', 'Test_B' ] }

This doesn't seem right to me. If is understand correctly, 'Test_A' is passed to -t and 'Test_B' to -T. Instead, commander seems to say -t and/or -T passed (T: true is present when you pass only -t too), and two strings are passed without context. Maybe there's something wrong with the program.option() things. Maybe my knowledge about command line arguments is incorrect, but I don't think to this extent. Just the same, I made a parser myself. Not sure if it follows any standard, but it works for me, for now. I will update the documentation shortly.

Next week I'll look into norma to see how it changes the format of papers from XML/HTML to sHTML, and ChemicalTagger to see how it recognises sentences.

Week 27

I updated Citation.js to accept quickscrape's results.json. Results can be viewed here.

I made a card prototype for the database HTML page, using data extracted with quickscrape. It's mainly, although not advanced, a design thing, but it's nice to see what information I can fit. Results are here.

I have taken a look at quickscrape. What I'd like to do is to practise building a similar program myself, first just outputting strings to the command line, later processing arguments, to eventually use in my project.

I'll probably try to do that next week.

Week 26

Start of the project. What I am going to do:

update Citation.js to accept quickscrape's results.json
look into quickscrape and the other programs, mainly to see how I can make a similar command for my own project
perhaps add/improve some scrapers
try to make a card prototype for the database

No real work yet, but some orientation and preparing until I have some more information.