Gensim Word2Vec Modifications for Window Alignment - Turkish-Word-Embeddings/Word-Embeddings-Repository-for-Turkish GitHub Wiki

As of March 14th, 2023, the Gensim Word2Vec implementation does not support left- or right-aligned context windows. It only supports centered windows, where n words are taken from each of the left and right neighborhoods of the current word; the Word2Vec class takes this window size n as a parameter called window. To compare the performance of left- and right-aligned implementations, we made some modifications to the Gensim source. To reproduce our results with Gensim, follow the steps below:
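For concreteness, the stock centered behavior can be sketched in plain Python (a toy illustration using our own helper name, not actual Gensim code):

```python
def centered_context(tokens, i, window):
    """Return the words a centered window pairs with tokens[i]:
    up to `window` words on each side, clipped at sentence boundaries."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

tokens = ["the", "quick", "brown", "fox", "jumps"]
# window=2, centered on "brown": two words from each side
print(centered_context(tokens, 2, 2))  # ['the', 'quick', 'fox', 'jumps']
```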

  • In gensim.models.Word2Vec, add a new parameter called window_alignment to the constructor of the Word2Vec class and update the documentation accordingly:
class Word2Vec(utils.SaveLoad):
    def __init__(
            self, sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5,
            max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
            sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, epochs=5, null_word=0,
            trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
            comment=None, max_final_vocab=None, shrink_windows=True, window_alignment=0
        ):
        """
        ...

        window_alignment : int, optional
            When window_alignment is set to -1, only the left context of the current word is used for training. For instance, if the window size is
            set to 10, only the 10 words immediately to the left of the current word are used. When window_alignment is set to 1, only the right
            context of the current word is used. If window_alignment is set to 0 (the default), both the left and right contexts are used: with
            window set to 5, 5 words are taken from each side, for a total of 10 context words.

        ...
        """
  • In gensim.models.word2vec_inner.pyx, add the new parameter window_alignment to the function train_batch_sg and document it:
def train_batch_sg(model, sentences, alpha, _work, compute_loss, window_alignment):
    """Update skip-gram model by training on a batch of sentences.

    Called internally from :meth:`~gensim.models.word2vec.Word2Vec.train`.

    Parameters
    ----------
    model : :class:`~gensim.models.word2vec.Word2Vec`
        The Word2Vec model instance to train.
    sentences : iterable of list of str
        The corpus used to train the model.
    alpha : float
        The learning rate.
    _work : np.ndarray
        Private working memory for each worker.
    compute_loss : bool
        Whether or not the training loss should be computed in this batch.
    window_alignment : int
        If -1, left-aligned context; if 1, right-aligned context; if 0, centered context.

    Returns
    -------
    int
        Number of words in the vocabulary actually used for training (They already existed in the vocabulary
        and were not discarded by negative sampling).

    """
  • In gensim.models.Word2Vec, pass self.window_alignment as an additional argument at the call site of train_batch_sg:
tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss, self.window_alignment)
  • In gensim.models.word2vec_inner.pyx, update the training part of the train_batch_sg function as follows:
with nogil:
    for sent_idx in range(effective_sentences):
        idx_start = c.sentence_idx[sent_idx]
        idx_end = c.sentence_idx[sent_idx + 1]
        # iterate over all words in the sentence
        for i in range(idx_start, idx_end):
            # window start for the centered and left-aligned contexts
            j = i - c.window + c.reduced_windows[i]
            # a right-aligned context starts at the current word
            if window_alignment == 1:
                j = i
            if j < idx_start:
                j = idx_start
            # window end for the centered and right-aligned contexts
            k = i + c.window + 1 - c.reduced_windows[i]
            # a left-aligned context ends at the current word
            if window_alignment == -1:
                k = i
            if k > idx_end:
                k = idx_end
            for j in range(j, k):
                if j == i:
                    continue
                if c.hs:
                    w2v_fast_sentence_sg_hs(c.points[i], c.codes[i], c.codelens[i], c.syn0, c.syn1, c.size, c.indexes[j], c.alpha, c.work, c.words_lockf, c.words_lockf_len, c.compute_loss, &c.running_training_loss)
                if c.negative:
                    c.next_random = w2v_fast_sentence_sg_neg(c.negative, c.cum_table, c.cum_table_len, c.syn0, c.syn1neg, c.size, c.indexes[i], c.indexes[j], c.alpha, c.work, c.next_random, c.words_lockf, c.words_lockf_len, c.compute_loss, &c.running_training_loss)
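The index arithmetic above can be mirrored in plain Python to check which skip-gram context positions each alignment produces (a sketch with our own function name; reduced_windows is fixed at 0 here):

```python
def context_positions(i, idx_start, idx_end, window, window_alignment, reduced=0):
    """Mirror of the patched loop bounds: compute the half-open range [j, k)
    and drop the center word, as the Cython training loop does."""
    j = i - window + reduced          # start for centered and left-aligned
    if window_alignment == 1:         # right-aligned: start at the center word
        j = i
    if j < idx_start:
        j = idx_start
    k = i + window + 1 - reduced      # end for centered and right-aligned
    if window_alignment == -1:        # left-aligned: stop at the center word
        k = i
    if k > idx_end:
        k = idx_end
    return [p for p in range(j, k) if p != i]

# 10-word sentence, window=3, center word at position 5
print(context_positions(5, 0, 10, 3, -1))  # [2, 3, 4]           left
print(context_positions(5, 0, 10, 3, 1))   # [6, 7, 8]           right
print(context_positions(5, 0, 10, 3, 0))   # [2, 3, 4, 6, 7, 8]  centered
```

Note that the boundary clamps still apply: a left-aligned window at the first word of a sentence yields no context positions at all, so that word contributes no training pairs in that batch.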
  • In gensim/models/word2vec_inner.pxd, create a new field called window_alignment for struct Word2VecConfig:
cdef struct Word2VecConfig:
    int hs, negative, sample, compute_loss, size, window, cbow_mean, workers, window_alignment
  • Lastly, in gensim/models/word2vec_inner.pyx, copy the value of window_alignment from the model into the config struct in init_w2v_config:
cdef init_w2v_config(Word2VecConfig *c, model, alpha, compute_loss, _work, _neu1=None):
    c[0].hs = model.hs
    c[0].negative = model.negative
    c[0].sample = (model.sample != 0)
    c[0].cbow_mean = model.cbow_mean
    c[0].window = model.window
    c[0].workers = model.workers
    c[0].window_alignment = model.window_alignment