Gensim Word2Vec Modifications for Window Alignment - Turkish-Word-Embeddings/Word-Embeddings-Repository-for-Turkish GitHub Wiki
As of March 14th, 2023, the Gensim Word2Vec implementation does not support left- or right-aligned windows: only centered windows are available, where n words are taken from each of the left and right contexts of the current word. The `Word2Vec` class in Gensim takes the window size n as a parameter called `window`. To compare the performance of left- and right-aligned implementations against the centered one, we made some modifications to the Gensim architecture. To reproduce our results using Gensim, follow the steps below:
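To make the three alignment modes concrete before diving into the patch, here is a small illustrative sketch, independent of Gensim. The function name `context_words` is ours, not part of any library; the `window_alignment` values follow the convention used throughout this page (-1 left, 0 centered, 1 right):

```python
def context_words(tokens, i, window, window_alignment=0):
    """Return the context of tokens[i] under the given alignment.

    window_alignment: -1 -> left context only, 0 -> centered, 1 -> right context only.
    """
    left = tokens[max(0, i - window):i]     # up to `window` words before position i
    right = tokens[i + 1:i + 1 + window]    # up to `window` words after position i
    if window_alignment == -1:
        return left
    if window_alignment == 1:
        return right
    return left + right

sentence = ["the", "quick", "brown", "fox", "jumps", "over"]
# center word "fox" (index 3), window size 2
print(context_words(sentence, 3, 2, 0))    # centered: two words from each side
print(context_words(sentence, 3, 2, -1))   # left-aligned: preceding words only
print(context_words(sentence, 3, 2, 1))    # right-aligned: following words only
```

Note that with a centered window of size 5, a word sees up to 10 context words in total, while a left- or right-aligned window of the same size sees at most 5.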
- In `gensim.models.Word2Vec`, add a new parameter called `window_alignment` to the constructor of the `Word2Vec` class and update the documentation accordingly:
```python
class Word2Vec(utils.SaveLoad):
    def __init__(
            self, sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5,
            max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
            sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, epochs=5, null_word=0,
            trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
            comment=None, max_final_vocab=None, shrink_windows=True, window_alignment=0
    ):
        """
        ...
        window_alignment : int, optional
            When `window_alignment` is set to -1, only the left context of the current word is used for
            training. For instance, if the window size is set to 10, only the first 10 words on the left
            of the current word will be used. When `window_alignment` is set to 1, only the right context
            of the current word will be used. If `window_alignment` is set to 0 (the default), both the
            left and right contexts are used; that is, if the window size is 5, 5 words are taken from
            each side, adding up to a total of 10 words.
        ...
        """
```
- In `gensim.models.word2vec_inner.pyx`, add the new parameter `window_alignment` to the function `train_batch_sg`:
```python
def train_batch_sg(model, sentences, alpha, _work, compute_loss, window_alignment):
    """Update skip-gram model by training on a batch of sentences.

    Called internally from :meth:`~gensim.models.word2vec.Word2Vec.train`.

    Parameters
    ----------
    model : :class:`~gensim.models.word2vec.Word2Vec`
        The Word2Vec model instance to train.
    sentences : iterable of list of str
        The corpus used to train the model.
    alpha : float
        The learning rate.
    _work : np.ndarray
        Private working memory for each worker.
    compute_loss : bool
        Whether or not the training loss should be computed in this batch.
    window_alignment : int
        If -1, left-aligned context. If 1, right-aligned context. If 0, centered alignment.

    Returns
    -------
    int
        Number of words in the vocabulary actually used for training (they already existed in the
        vocabulary and were not discarded by negative sampling).

    """
```
- In `gensim.models.Word2Vec`, pass `self.window_alignment` as an argument at the call site of `train_batch_sg`:
```python
tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss, self.window_alignment)
```
- In `gensim.models.word2vec_inner.pyx`, update the training part of the `train_batch_sg` function as follows:
```python
with nogil:
    for sent_idx in range(effective_sentences):
        idx_start = c.sentence_idx[sent_idx]
        idx_end = c.sentence_idx[sent_idx + 1]
        # iterate over all words in the sentence
        for i in range(idx_start, idx_end):
            # centered context and left-aligned context
            j = i - c.window + c.reduced_windows[i]
            # right-aligned context
            if window_alignment == 1:
                j = i
            if j < idx_start:
                j = idx_start
            # centered context and right-aligned context
            k = i + c.window + 1 - c.reduced_windows[i]
            # left-aligned context
            if window_alignment == -1:
                k = i
            if k > idx_end:
                k = idx_end
            for j in range(j, k):
                if j == i:
                    continue
                if c.hs:
                    w2v_fast_sentence_sg_hs(c.points[i], c.codes[i], c.codelens[i], c.syn0, c.syn1, c.size, c.indexes[j], c.alpha, c.work, c.words_lockf, c.words_lockf_len, c.compute_loss, &c.running_training_loss)
                if c.negative:
                    c.next_random = w2v_fast_sentence_sg_neg(c.negative, c.cum_table, c.cum_table_len, c.syn0, c.syn1neg, c.size, c.indexes[i], c.indexes[j], c.alpha, c.work, c.next_random, c.words_lockf, c.words_lockf_len, c.compute_loss, &c.running_training_loss)
```
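The index arithmetic above can be checked outside Cython. The following pure-Python mirror of the loop is our own illustration, not Gensim code; it fixes `reduced_windows` at 0 (as if `shrink_windows=False`) and enumerates the (center, context) index pairs each alignment would train on for a single sentence:

```python
def sg_pairs(n_tokens, window, window_alignment):
    """Mirror of the modified train_batch_sg index logic for one sentence.

    Returns the (center, context) index pairs; reduced_windows is taken as 0.
    """
    idx_start, idx_end = 0, n_tokens
    pairs = []
    for i in range(idx_start, idx_end):
        # centered and left-aligned start of the window
        j = i - window
        if window_alignment == 1:   # right-aligned: start at the center word
            j = i
        if j < idx_start:
            j = idx_start
        # centered and right-aligned end of the window (exclusive)
        k = i + window + 1
        if window_alignment == -1:  # left-aligned: stop before the center word
            k = i
        if k > idx_end:
            k = idx_end
        for ctx in range(j, k):
            if ctx == i:
                continue
            pairs.append((i, ctx))
    return pairs

# 4-token sentence, window size 2
print(sg_pairs(4, 2, -1))  # left-aligned: every context index precedes its center
print(sg_pairs(4, 2, 1))   # right-aligned: every context index follows its center
print(sg_pairs(4, 2, 0))   # centered: contexts on both sides
```

Note that in the left-aligned case the center word is excluded automatically because `k = i` makes the range end before `i`, while in the right-aligned case the `if j == i: continue` guard does the skipping.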
- In `gensim/models/word2vec_inner.pxd`, add a new field called `window_alignment` to the struct `Word2VecConfig`:
```python
cdef struct Word2VecConfig:
    int hs, negative, sample, compute_loss, size, window, cbow_mean, workers, window_alignment
```
- Lastly, in `gensim/models/word2vec_inner.pyx`, copy the value of `window_alignment` from `model` to `c` in `init_w2v_config`:
```python
cdef init_w2v_config(Word2VecConfig *c, model, alpha, compute_loss, _work, _neu1=None):
    c[0].hs = model.hs
    c[0].negative = model.negative
    c[0].sample = (model.sample != 0)
    c[0].cbow_mean = model.cbow_mean
    c[0].window = model.window
    c[0].workers = model.workers
    c[0].window_alignment = model.window_alignment
```