Word2Vec Training - Turkish-Word-Embeddings/Word-Embeddings-Repository-for-Turkish GitHub Wiki

Word2Vec is a group of algorithms used to learn word embeddings from a large corpus of text. These models consist of two-layer neural networks that are trained to reconstruct the linguistic contexts of words [1]. You can train your own Word2Vec model using the scripts provided here.

TensorFlow Word2Vec (Skip-gram with Negative Sampling)

Word2Vec with negative sampling is implemented in TensorFlow, following this tutorial. You can run the .ipynb notebook or the .py files to test them on a smaller corpus. For large corpora, we recommend using the gensim implementation; the TensorFlow versions are provided only to make the inner structure of the word2vec algorithms easier to understand.
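The idea behind skip-gram with negative sampling can be summarized in a few lines: for each (center, context) pair, the loss rewards a high dot product with the true context vector and low dot products with a handful of randomly sampled "noise" vectors. Below is a minimal NumPy sketch of the per-pair loss; the function name and shapes are illustrative, not taken from the repository's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram negative-sampling loss for a single (center, context) pair.

    center_vec:    (d,)   input vector of the center word
    context_vec:   (d,)   output vector of the true context word
    negative_vecs: (k, d) output vectors of k sampled noise words
    """
    # Positive term: push sigma(u_context . v_center) toward 1
    pos = -np.log(sigmoid(context_vec @ center_vec))
    # Negative terms: push sigma(u_noise . v_center) toward 0
    neg = -np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))
    return pos + neg
```

Minimizing this over all pairs (with noise words drawn from a smoothed unigram distribution) is what both the TensorFlow and gensim implementations do under the hood.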

Gensim Word2Vec

You can run the notebook or the .py script to train Word2Vec word embeddings using the gensim library. You can select the specific training algorithm via the parameters sg and hs. The resulting training algorithm for each combination is given below:

| sg | hs | negative | Training Algorithm |
|----|----|----------|--------------------|
| 1 | 1 |  | Skip-Gram Hierarchical Softmax |
| 1 | 0 | ≠ 0 | Skip-Gram Negative Sampling |
| 1 | 0 | = 0 | No training |
| 0 | 1 |  | CBOW Hierarchical Softmax |
| 0 | 0 | ≠ 0 | CBOW Negative Sampling |
| 0 | 0 | = 0 | No training |

To train word embeddings with left- or right-aligned context windows, install the modified version of gensim:

```shell
pip install git+https://github.com/KarahanS/custom-gensim.git@window-alignment
```

If you don't want to use this version and prefer to train your model with centered windows, remove the window_alignment argument from the Word2Vec constructor. You can save the full model using model.save(), or only the word vectors using model.wv.save_word2vec_format().

You can also run the .py script instead of executing the individual cells in the notebook. To do so, provide the following arguments on the command line:

  • -i, --input: Input txt file. If file name includes space(s), enclose it in double quotes.
  • -o, --output: Output file (trained model). If file name includes space(s), enclose it in double quotes. Defaults to word2vec.model.
  • -m, --min_count: Minimum frequency for a word. All words with total frequency lower than this will be ignored. Defaults to 10 if not provided.
  • -e, --emb: Dimensionality of word vectors, defaults to 300.
  • -w, --window: Window size, defaults to 5.
  • -ep, --epoch: Number of epochs, defaults to 5.
  • -sg, --sg: Training algorithm: 1 for skip-gram, 0 for CBOW. Defaults to 1 (skip-gram).
  • -hs, --hs: Loss function: 1 for hierarchical softmax, 0 for negative sampling. Defaults to 0 (negative sampling).
  • -n, --negative: Number of negative samples, defaults to 5.
  • -wl, --window_alignment: Alignment of the training window: left if -1, right if 1, and centered if 0. Defaults to 0.
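A minimal argparse setup matching the flags above might look like the following sketch (this is an assumption about the script's structure, not its actual code):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Train gensim Word2Vec embeddings.")
    parser.add_argument("-i", "--input", required=True, help="input txt file")
    parser.add_argument("-o", "--output", default="word2vec.model", help="output model file")
    parser.add_argument("-m", "--min_count", type=int, default=10)
    parser.add_argument("-e", "--emb", type=int, default=300)
    parser.add_argument("-w", "--window", type=int, default=5)
    parser.add_argument("-ep", "--epoch", type=int, default=5)
    parser.add_argument("-sg", "--sg", type=int, default=1, choices=[0, 1])
    parser.add_argument("-hs", "--hs", type=int, default=0, choices=[0, 1])
    parser.add_argument("-n", "--negative", type=int, default=5)
    parser.add_argument("-wl", "--window_alignment", type=int, default=0, choices=[-1, 0, 1])
    return parser

# Example: only --input is mandatory; everything else falls back to its default.
args = build_parser().parse_args(["-i", "corpus.txt", "--emb", "200"])
```

These parsed values map directly onto the Word2Vec constructor parameters (emb → vector_size, epoch → epochs, and so on).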

An example call would be as follows:

```shell
python word2vec/word2vec.py -i "corpus/bounwebcorpus.txt" --min_count 10 --emb 300 --window 5 -o "word2vec.model"
```

Again, you can drop the window_alignment argument and install the official gensim release if you want to train your model only with centered context windows.

References

  1. https://en.wikipedia.org/wiki/Word_embedding