Word2Vec RestServer - quhfus/DoSeR-Disambiguation GitHub Wiki
Word2Vec Rest Server
Our Python Word2Vec REST server returns word2vec similarities between Wikipedia/DBpedia entities. It also accepts sentences or documents and computes a similarity score between the document and one or more entities.
Requirements
The Word2Vec REST server has the following requirements:
Python Packages
- Python 2.7 or later
- Gensim 0.12.1 or later
- Gunicorn 19.3.0 or later
- Flask 0.10 or later
Hardware
Since all embeddings are held in memory, the underlying system needs at least 40 GB of RAM, depending on which embeddings are loaded. Smaller embedding files require less memory. If your system does not meet these requirements, you can use our hosted Word2VecRest server at one of the following links:
http://zaire.dimis.fim.uni-passau.de:8999/doser/Word2VecRest/w2vsim
or
http://zaire.dimis.fim.uni-passau.de:8999/doser/Word2VecRest/d2vsim
Setup
First, adapt the config.ini configuration file and set the correct paths for your embeddings. The entries "embeddings_w2v_wikipedia" and "embeddings_d2v_wikipedia" are mandatory for the disambiguation service. For instance, a configuration file might look like this:
[Word2VecRest]
embeddings_w2v_wikipedia = /mnt/ssd1/disambiguation/word2vec/WikiEntityModel_400_neg10_iter5.seq
embeddings_w2v_calbc = /mnt/ssd1/disambiguation/word2vec/calbcsmall_model_sg_500.bin
embeddings_d2v_wikipedia = /mnt/ssd1/disambiguation/word2vec/doc2vec/Wiki_Standard_Model/doc2vec_wiki_model.d2v
embeddings_d2v_wikipedia_german = /mnt/ssd1/disambiguation/word2vec/doc2vec/Wikipedia_Standard_German/doc2vec_model_german.d2v
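Before starting the server, it can help to check that the mandatory entries are present. The following is a minimal sketch using Python 3's standard configparser module; the helper name load_embedding_paths is hypothetical and not part of the DoSeR code, and the sample only reuses two paths from the example above.

```python
import configparser

# The two entries the wiki page describes as mandatory.
MANDATORY = ("embeddings_w2v_wikipedia", "embeddings_d2v_wikipedia")


def load_embedding_paths(config_text):
    """Parse config.ini content and return the [Word2VecRest] entries.

    Hypothetical helper for illustration; raises KeyError if a
    mandatory entry is missing.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["Word2VecRest"]
    missing = [key for key in MANDATORY if key not in section]
    if missing:
        raise KeyError("missing mandatory entries: " + ", ".join(missing))
    return dict(section)


SAMPLE = """
[Word2VecRest]
embeddings_w2v_wikipedia = /mnt/ssd1/disambiguation/word2vec/WikiEntityModel_400_neg10_iter5.seq
embeddings_d2v_wikipedia = /mnt/ssd1/disambiguation/word2vec/doc2vec/Wiki_Standard_Model/doc2vec_wiki_model.d2v
"""

paths = load_embedding_paths(SAMPLE)
for name, path in paths.items():
    print(name, "->", path)
```

To check a real file, pass its contents, for example load_embedding_paths(open("config.ini").read()).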
Run startserver to start the server. By default, it listens on http://127.0.0.1:5000. These settings can be changed in the constructor of GunicornApplication in Word2VecRest.py.
Possible settings:
- IP address/port: the IP address and port the server is bound to
- Workers: the number of worker processes, i.e. the number of requests handled in parallel (default: 5)
If the server should be reachable from another host, install a proxy server (e.g. nginx) to forward the requests, since with the default settings Flask and Gunicorn only accept connections from localhost.
Usage
Word2Vec
To compute the similarities between the DBpedia entities Alan_Turing and Computer_science as well as between Ada_Lovelace and Lord_Byron, we use the JSON below.
{
"domain":"DBpedia",
"data":["Alan_Turing|Computer_science", "Ada_Lovelace|Lord_Byron"]
}
The domain attribute specifies the kind of entities. Our service provides DBpedia entities by default, but the attribute must still be given explicitly as "DBpedia". To compute the word2vec similarity between one set of entities and another, concatenate the entities to be compared, separated by "|", as in the example above. Internally, the vectors of the respective entities are summed before the similarity is computed. In the example above, this combines the vectors of "Alan Turing" and "Computer Science" as well as those of "Ada Lovelace" and "Lord Byron".
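Whether the vectors are described as summed or averaged makes no difference to the score if the similarity is a cosine similarity, because scaling a vector does not change its direction. The sketch below illustrates this with made-up three-dimensional vectors. It assumes cosine similarity (the usual word2vec measure, though this page does not state it), and the grouping of vectors is purely illustrative rather than the server's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def vector_sum(vectors):
    """Component-wise sum of a list of vectors."""
    return [sum(components) for components in zip(*vectors)]


def vector_mean(vectors):
    """Component-wise average of a list of vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


# Toy vectors standing in for entity embeddings (not real model output).
group_a = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
group_b = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

sim_from_sums = cosine(vector_sum(group_a), vector_sum(group_b))
sim_from_means = cosine(vector_mean(group_a), vector_mean(group_b))

# Both values agree up to floating-point rounding.
print(sim_from_sums, sim_from_means)
```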
We note that the entity names are the same as provided by Wikipedia/DBpedia.
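A request like the one above could be built and sent from Python roughly as follows. This is a sketch under several assumptions: that the service expects an HTTP POST with a JSON body and answers with JSON, and that the endpoint path matches the hosted w2vsim URL listed under Hardware. The helper names build_w2v_request and post_json are hypothetical, and the response format is not documented on this page, so it is returned unparsed beyond JSON decoding.

```python
import json
import urllib.request


def build_w2v_request(entity_groups, domain="DBpedia"):
    """Build the JSON payload for a word2vec similarity query.

    entity_groups is a list of lists of entity names; each inner list
    is joined with '|' as in the example request.
    """
    return {"domain": domain, "data": ["|".join(group) for group in entity_groups]}


def post_json(url, payload, timeout=30):
    """Send payload as a JSON POST request and decode the JSON reply.

    Assumes the server accepts POST with Content-Type application/json.
    """
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_w2v_request(
    [["Alan_Turing", "Computer_science"], ["Ada_Lovelace", "Lord_Byron"]]
)
print(json.dumps(payload, indent=2))

# Not executed here because it needs network access:
# result = post_json(
#     "http://zaire.dimis.fim.uni-passau.de:8999/doser/Word2VecRest/w2vsim",
#     payload,
# )
```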
Doc2Vec
With Doc2Vec we can infer a vector from a text snippet. This vector is compared with the vectors of the given entities. In other words, we compute the similarity between the given text and the texts describing the given entities.
{
"document":[{
"surfaceForm":"Ada Lovelace",
"qryNr":"0",
"context":"Lovelace was born 10 December 1815 as the only legitimate child of the poet George, Lord Byron and his wife Anne Isabella Noel. All Byron's other children were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever four months later, eventually dying of disease in the Greek War of Independence when Ada was eight years old. Ada's mother remained bitter towards Lord Byron and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing what she saw as the insanity seen in her father, but Ada remained interested in him despite this (and was, upon her eventual death, buried next to him at her request).",
"candidates":["Ada_Lovelace", "Lord_Byron"]
}]
}
The resulting similarity value lies between 0 and 2, where 2 means that the documents are identical. Again, the entity names are the same as provided by Wikipedia/DBpedia.
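The Doc2Vec request body can be assembled programmatically in the same way. The sketch below only builds and serializes the payload; the helper name build_d2v_request is hypothetical, the field names are taken from the example request, and keeping qryNr as a string simply mirrors that example. Sending it (for instance as a JSON POST to the d2vsim endpoint) is left out, since the transport details and the response format are not documented here.

```python
import json


def build_d2v_request(queries):
    """Build the JSON payload for a doc2vec similarity query.

    queries is a list of (surface_form, context, candidates) tuples.
    Each query gets a running number as its qryNr, stored as a string
    to mirror the example request.
    """
    documents = []
    for number, (surface_form, context, candidates) in enumerate(queries):
        documents.append(
            {
                "surfaceForm": surface_form,
                "qryNr": str(number),
                "context": context,
                "candidates": list(candidates),
            }
        )
    return {"document": documents}


d2v_payload = build_d2v_request(
    [
        (
            "Ada Lovelace",
            "Lovelace was born 10 December 1815 as the only legitimate "
            "child of the poet George, Lord Byron and his wife Anne "
            "Isabella Noel.",
            ["Ada_Lovelace", "Lord_Byron"],
        )
    ]
)
print(json.dumps(d2v_payload, indent=2))
```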