SkipGram

Theoretical Background

Skip-gram Model

  • Goal: Learn vector embeddings for nodes by training a neural network that, given a center node, predicts the nodes surrounding it in a Random Walk.
    • The original Word2vec paper did this for words in sentences instead of nodes in random walks.
  • Key Idea: For a given center node in a random walk, the model predicts the nodes within a specified window around it. Each (center, context) pair forms a training example (see the sketch below the figure).
  • Model: The network has a single hidden layer; the input-to-hidden weights $W$ hold the node embeddings and the hidden-to-output weights $W'$ score the possible context nodes.
[Figure "Model": illustration of the model's layers]
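
To make the pair generation concrete, here is a minimal Python sketch. The function name, the `window_size` parameter and the example walk are purely illustrative and not the repository's actual API.

```python
def training_pairs(walk, window_size=2):
    """Yield (center, context) pairs from a single random walk.

    `walk` is a sequence of node IDs; every node within `window_size`
    positions of the center node becomes one of its context nodes.
    """
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, walk[j]

# Example: one short walk over the nodes 0..4
print(list(training_pairs([0, 3, 1, 4, 2], window_size=1)))
# [(0, 3), (3, 0), (3, 1), (1, 3), (1, 4), (4, 1), (4, 2), (2, 4)]
```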

Optimization: Negative Sampling

See NegativeSampler for our implementation of the sampling mechanism.

  • Problem:
    • In principle, each training pair would require two matrix-vector multiplications followed by a softmax over the entire output vector: $y = \operatorname{softmax}(W' \cdot (W \cdot x))$. Since the input vector $x$ is one-hot encoded, the first multiplication merely selects a single column of $W$. The second multiplication, however, requires $\text{embedding\_size} \cdot \text{num\_nodes}$ floating-point multiplications, and the softmax adds $O(\text{num\_nodes})$ exponentiations, additions and divisions. Backpropagation would then have to update all $\text{embedding\_size} \cdot \text{num\_nodes}$ weights in $W'$.
      • This would be computationally prohibitive, since it is repeated for every single training pair.
  • Solution:
    • The DeepWalk paper proposes using a hierarchical softmax, but later methods (e.g. node2vec) use negative sampling instead.
    • Update only:
      • the weights corresponding to the output component of the true ("positive") result, and
      • the weights corresponding to a small number ($k$) of nodes that shouldn't have been predicted by the model ("negative") per training instance.
    • This means we only have to compute $k+1$ dot products (see the Forward Pass section below for why), apply the sigmoid function to each of them ($O(1)$ per application) and then backpropagate, updating only $k+1$ rows of $W'$.
      • $k$ is much smaller than $\text{num\_nodes}$, typically around 5-20.
  • Loss Function:
    • See the Algorithm section below for an explanation of where these symbols and operations come from.
    • The training objective for a single pair $(c, o)$ with negative samples is
$$\log \sigma(\mathbf{v}'_o \cdot \mathbf{v}_c)+ \sum_{n \in \text{negatives}} \log \sigma(-\mathbf{v}'_n \cdot \mathbf{v}_c)$$

where $\sigma(x)=\frac{1}{1+e^{-x}}$.
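
As a minimal numpy sketch of this objective for one training pair, assuming `v_c` is the center node's embedding (a row of $W$) and the primed vectors are rows of $W'$ (all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_objective(v_c, v_o_prime, negatives_prime):
    """Negative-sampling objective for a single (center, context) pair.

    v_c             : embedding of the center node (row of W)
    v_o_prime       : output vector of the true context node (row of W')
    negatives_prime : (k, d) matrix of output vectors of the k negative samples
    """
    positive = np.log(sigmoid(v_o_prime @ v_c))
    negatives = np.sum(np.log(sigmoid(-negatives_prime @ v_c)))
    return positive + negatives  # this value is maximized during training
```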

Sources: original paper, simple explanation

Algorithm

Forward Pass (using Negative Sampling)

[Figure: forward pass with negative sampling]
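
The figure above summarizes the computation; the following numpy sketch spells it out, assuming $W$ and $W'$ are stored with one row per node. The layout and all names are assumptions for illustration, not the repository's actual code.

```python
import numpy as np

def forward(W, W_prime, center, context, negatives):
    """Forward pass for one training pair under negative sampling.

    W         : (num_nodes, embedding_size) input-side embeddings
    W_prime   : (num_nodes, embedding_size) output-side weights
    center    : index of the center node
    context   : index of the true ("positive") context node
    negatives : k indices drawn by the negative sampler

    Instead of a softmax over all nodes, only k + 1 dot products and
    sigmoids are evaluated.
    """
    v_c = W[center]                        # one-hot input -> simple row lookup
    idx = [context] + list(negatives)      # positive first, then the negatives
    scores = W_prime[idx] @ v_c            # k + 1 dot products
    probs = 1.0 / (1.0 + np.exp(-scores))  # sigmoid applied element-wise
    return v_c, idx, probs
```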

Backpropagation

Derivation of Gradients for the SkipGram Model with Negative Sampling

We start from the objective for a given center node vector $v_c$, a positive (target) output vector $v'_o$, and a set of negative samples with output vectors $v'_n$:

$$J = \log \sigma\bigl(v'_o \cdot v_c\bigr) + \sum_{n \in \text{negatives}} \log \sigma\bigl(-v'_n \cdot v_c\bigr),$$

where the sigmoid function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

  1. Gradient with Respect to $v'_o$

Focus on the positive term:

$$f(v'_o) = \log \sigma\bigl(v'_o \cdot v_c\bigr).$$

Let

$$x = v'_o \cdot v_c.$$

Then, using $\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$ and hence

$$\frac{d}{dx} \log \sigma(x) = \frac{\sigma'(x)}{\sigma(x)} = 1 - \sigma(x),$$

the chain rule gives:

$$\frac{\partial f}{\partial v'_o} = (1 - \sigma(x))\, \frac{\partial x}{\partial v'_o} = \bigl(1 - \sigma(v'_o \cdot v_c)\bigr) \, v_c.$$

Thus,

$$\boxed{ \frac{\partial J}{\partial v'_o} = \bigl(1 - \sigma(v'_o \cdot v_c)\bigr) \, v_c. }$$
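
This formula is easy to sanity-check numerically by comparing it against central finite differences of $J$; the following standalone numpy snippet uses arbitrary random vectors and illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 5
v_c = rng.normal(size=d)        # center vector
v_o = rng.normal(size=d)        # v'_o (positive output vector)
negs = rng.normal(size=(k, d))  # v'_n for the k negative samples

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
J = lambda v: np.log(sigmoid(v @ v_c)) + np.sum(np.log(sigmoid(-negs @ v_c)))

analytic = (1.0 - sigmoid(v_o @ v_c)) * v_c  # boxed formula above

eps = 1e-6
numeric = np.array([(J(v_o + eps * e) - J(v_o - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

print(np.max(np.abs(analytic - numeric)))  # should be tiny (~1e-9 or smaller)
```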

  2. Gradient with Respect to Each Negative Sample $v'_n$

For a negative sample, consider:

$$g(v'_n) = \log \sigma\bigl(-v'_n \cdot v_c\bigr).$$

Let

$$y = -v'_n \cdot v_c.$$

Then,

$$\frac{d}{dy} \log \sigma(y) = 1 - \sigma(y).$$

Applying the chain rule:

$$\frac{\partial g}{\partial v'_n} = (1 - \sigma(y)) \, \frac{\partial y}{\partial v'_n} = (1 - \sigma(-v'_n \cdot v_c))\, \bigl(-v_c\bigr).$$

Since

$$\sigma(-x) = 1 - \sigma(x) \quad \Longrightarrow \quad 1 - \sigma(-x) = \sigma(x),$$

we have:

$$\boxed{ \frac{\partial J}{\partial v'_n} = -\sigma\bigl(v'_n \cdot v_c\bigr) \, v_c. }$$

  3. Gradient with Respect to $v_c$

The center node vector $v_c$ appears in both the positive and the negative terms.

  • From the positive term:
$$\frac{\partial}{\partial v_c} \log \sigma\bigl(v'_o \cdot v_c\bigr) = \bigl(1 - \sigma(v'_o \cdot v_c)\bigr) \, v'_o.$$
  • From each negative term:
$$\frac{\partial}{\partial v_c} \log \sigma\bigl(-v'_n \cdot v_c\bigr) = (1 - \sigma(-v'_n \cdot v_c)) \, \frac{\partial (-v'_n \cdot v_c)}{\partial v_c}.$$

Since

$$\frac{\partial (-v'_n \cdot v_c)}{\partial v_c} = -v'_n,$$

and using

$$1 - \sigma(-v'_n \cdot v_c) = \sigma(v'_n \cdot v_c),$$

it follows that:

$$\frac{\partial}{\partial v_c} \log \sigma\bigl(-v'_n \cdot v_c\bigr) = -\sigma\bigl(v'_n \cdot v_c\bigr) \, v'_n.$$

Summing over all terms:

$$\boxed{ \frac{\partial J}{\partial v_c} = \bigl(1 - \sigma(v'_o \cdot v_c)\bigr) \, v'_o - \sum_{n \in \text{negatives}} \sigma\bigl(v'_n \cdot v_c\bigr) \, v'_n. }$$
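
Putting the three boxed gradients together, one stochastic gradient ascent step on $J$ for a single training pair might look like the sketch below. The parameter layout, names and learning rate are illustrative assumptions, not the repository's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(W, W_prime, center, context, negatives, lr=0.025):
    """Apply the three boxed gradients for one (center, context) pair.

    Only the center node's row of W and the k + 1 touched rows of W_prime
    are updated (gradient *ascent*, since J is maximized).
    """
    v_c = W[center].copy()             # embedding of the center node
    grad_c = np.zeros_like(v_c)

    # positive (true context) term: dJ/dv'_o = (1 - sigma(v'_o . v_c)) * v_c
    s_o = sigmoid(W_prime[context] @ v_c)
    grad_c += (1.0 - s_o) * W_prime[context]
    W_prime[context] += lr * (1.0 - s_o) * v_c

    # negative samples: dJ/dv'_n = -sigma(v'_n . v_c) * v_c
    for n in negatives:
        s_n = sigmoid(W_prime[n] @ v_c)
        grad_c -= s_n * W_prime[n]
        W_prime[n] -= lr * s_n * v_c

    # center node: dJ/dv_c = (1 - s_o) * v'_o - sum_n s_n * v'_n
    W[center] += lr * grad_c
```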