Backend
YAML Documentation for routes
The documentation can be edited in your favorite IDE or text editor, but we recommend using the Swagger Editor, which nicely renders the API documentation. If you do so, you must copy the new file contents from the Swagger Editor back to the file above in this repository.
app.js
Defines the Node/Express properties. We parse the articles in root/data/ and store the result as an app property to make it accessible across submodules. Then, we make the public files accessible to the client. In app.js, you can easily toggle the VAND Graph/Multi frontend via `app.use(express.static(path.join(__dirname, 'public_multi')));`
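A minimal sketch of this setup, assuming a `parseArticles` helper on reader.js and the property name `data`; both are illustrative, not verbatim from the repository:

```js
// Sketch of the app.js pattern described above. parseArticles and
// app.set('data', ...) are assumed names, only the pattern is from the text.
const express = require('express');
const path = require('path');
const reader = require('./reader'); // the article parser module

const app = express();

// Parse articles in root/data/ and store them as an app property,
// so submodules can read them later via req.app.get('data').
app.set('data', reader.parseArticles(path.join(__dirname, 'data')));

// Serve public files to the client; swap 'public' for 'public_multi'
// to toggle the VAND Graph/Multi frontend.
app.use(express.static(path.join(__dirname, 'public')));
```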
routes/central.js
Each backend request is pipelined through central.js. A POST request with the path `/:actor/:f` calls the function `req.params.f`, defined in the module `req.params.actor`.
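A hedged sketch of this dispatch pattern; the module layout under actors/ matches the wiki's mention of actors/processor.js, but the error handling and exact resolution logic are assumptions:

```js
// Sketch of the central.js dispatch: a POST to /:actor/:f resolves the
// actor module and invokes the named function. Illustrative only.
const express = require('express');
const router = express.Router();

router.post('/:actor/:f', (req, res, next) => {
    let actor;
    try {
        // e.g. POST /processor/run -> actors/processor.js
        actor = require('./actors/' + req.params.actor);
    } catch (e) {
        return next(e);
    }
    const fn = actor[req.params.f];
    if (typeof fn !== 'function') return res.sendStatus(404);
    fn(req, res); // the actor function produces the response
});

module.exports = router;
```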
color.js
This module assigns color values to articles, referenced by title. Returns `[{index: id, values: [r, g, b, a]}]`. It is accessed by actors/processor.js.
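An illustrative sketch of the documented return shape; the palette and the `assign` signature are assumptions, only the output format and the title-based referencing come from the text above:

```js
// Assign one RGBA color per article, cycling through a fixed palette.
// Only the return shape [{index, values: [r, g, b, a]}] is documented.
function assign(articles) {
    const palette = [[31, 119, 180, 255], [255, 127, 14, 255], [44, 160, 44, 255]];
    return articles.map((article, i) => ({
        index: article.title,               // articles are referenced by title
        values: palette[i % palette.length] // [r, g, b, a]
    }));
}
```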
featureex.js
This module extracts features from a document array.
Function | Description |
---|---|
pos(docs, cb) | Extracts POS features from the document array and subsequently runs the callback with the extracted features. The actual feature extraction is done by sending a concatenated version of the documents to the Stanford Core NLP server. As a postprocessing step, particular POS types are filtered out and character offsets are set relative to the actual documents again (not relative to the concatenated version). Runs the callback with `[{type: 'pos', docs: [{id: docID, features: [...]}]}]`. |
ner(docs) | Currently deprecated. We recommend handling named entities as extended POS tags and submitting the NER parameter to the Stanford Core NLP server. |
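The interesting part of `pos` is the offset bookkeeping. The sketch below shows one way to do it, assuming CoreNLP's JSON token shape (`characterOffsetBegin`, `characterOffsetEnd`, `pos`); the helper names and separator are illustrative:

```js
// Concatenate documents before sending them to the Stanford Core NLP
// server, remembering where each document starts in the combined text.
const SEP = '\n\n';

function concatenate(docs) {
    const starts = [];
    let text = '', cursor = 0;
    docs.forEach((doc) => {
        starts.push(cursor);
        text += doc.content + SEP;
        cursor += doc.content.length + SEP.length;
    });
    return { text, starts };
}

// Map a token's global offsets back to the document it belongs to,
// so features are relative to the actual document again.
function relocate(token, starts, docs) {
    for (let i = starts.length - 1; i >= 0; i--) {
        if (token.characterOffsetBegin >= starts[i]) {
            return {
                id: docs[i].id,
                begin: token.characterOffsetBegin - starts[i],
                end: token.characterOffsetEnd - starts[i],
                pos: token.pos
            };
        }
    }
}
```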
metric.js
Preprocessing is done with the Node module `natural`.
Function | Description |
---|---|
cosine | Computes cosine similarity in tf-idf vector space, cf. the IR book. Docs are assumed to have a `content` attribute. Returns a reduced array of objects with `{source: int, target: int, value: double}`. |
jaccard | Computes Jaccard similarity with the help of the Node module `jaccard`. Returns a reduced array of objects with `{source: int, target: int, value: double}`. |
sherlock | Computes similarity with the help of the Sherlock program (you need to compile the `sherlock.c` file in the `cbin` folder first). Returns a reduced array of objects with `{source: int, target: int, value: double}`. |
jplag | Computes similarity with the help of JPlag. Returns a reduced array of objects with `{source: int, target: int, value: double}`. |
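A minimal sketch of the cosine metric, assuming the documented `content` attribute and using `natural`'s `TfIdf` class for the vector space; the function body is an illustration, not the repository's implementation:

```js
// Pairwise cosine similarity over tf-idf vectors, reduced to the
// documented {source, target, value} shape.
const natural = require('natural');

function cosine(docs) {
    const tfidf = new natural.TfIdf();
    docs.forEach((doc) => tfidf.addDocument(doc.content));

    // Build a sparse term -> weight map per document.
    const vectors = docs.map((_, i) => {
        const v = {};
        tfidf.listTerms(i).forEach((t) => { v[t.term] = t.tfidf; });
        return v;
    });

    const dot = (a, b) =>
        Object.keys(a).reduce((s, term) => s + a[term] * (b[term] || 0), 0);
    const norm = (a) => Math.sqrt(dot(a, a));

    const result = [];
    for (let i = 0; i < vectors.length; i++) {
        for (let j = i + 1; j < vectors.length; j++) {
            const denom = norm(vectors[i]) * norm(vectors[j]);
            result.push({
                source: i,
                target: j,
                value: denom ? dot(vectors[i], vectors[j]) / denom : 0
            });
        }
    }
    return result;
}
```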
reader.js
A simple parser for backend articles. It expects the formats described in the wiki resource 'Add Article Sets to your Server'.
segmentizer.js
This class implements our segmentation algorithm. Its principle is to build 2-, 3-, ..., 10-grams out of the features from featureex.js. Given an n-gram of the main article, we check each n-gram of each reference article for feature intersection. In case of an intersection, we trim the n-grams, i.e., given matching and non-matching features, we strip the non-matching ends: (nonmatch, match, nonmatch, match, nonmatch) -> (match, nonmatch, match), as sketched below.
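An illustrative helper for the trimming step; a plain sketch, not the repository's code:

```js
// Strip leading and trailing non-matching features from an n-gram.
// matches[i] is true when features[i] intersects the other n-gram.
function trim(features, matches) {
    let lo = 0, hi = features.length - 1;
    while (lo <= hi && !matches[lo]) lo++;
    while (hi >= lo && !matches[hi]) hi--;
    return features.slice(lo, hi + 1);
}

// trim(['a','b','c','d','e'], [false, true, false, true, false])
// -> ['b', 'c', 'd'], i.e. (nonmatch, match, nonmatch, match, nonmatch)
// -> (match, nonmatch, match)
```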
Function | Description |
---|---|
_getPOSMatches(docs, mainID) | Computes matches between the main article and the provided docs, using the matching algorithm above. Returns an object array whose items describe segment metadata such as character offsets and intersecting features. This information is forwarded to the frontend. |
_generateNGrams(f, id) | Generates the n-grams; `f` is a feature array, `id` denotes the corresponding document ID. |
The remaining functions in the segmentizer stem from an earlier matching algorithm, which takes only segment character length and identical boundary features into account. It is faster than the currently used algorithm, but produces more false positives.