Word2Vec Transformations
If I create word embeddings from the reddit data, how do those embeddings change over time?
First, the data:
- 49,950 reddit comments from 3/1/2017 ("batch 0")
- 49,949 reddit comments from 3/1/2017 ("batch 1"), disjoint from batch 0 (by comment ID)
I've heard that if you run word2vec twice on the same data, the embeddings you get should just be orthogonal transformations of each other. So let's train four embeddings: X, Y, and Z on batch 0, and W on batch 1. I take the embeddings of the 50,000 most common words; however, since those differ between datasets, I'll only look at the 26,307 words that are in the top 50k for both batch 0 and batch 1. Thus X, Y, Z, and W are all 26,307 x 128 matrices, as each word is embedded into R^128^.
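For concreteness, here's a rough sketch of how the aligned matrices could be assembled, assuming each trained model gives a word-to-vector mapping and per-batch word counts (the names `emb_X`, `emb_W`, `freqs_batch0`, etc. are made up for illustration):

```python
import numpy as np

def top_k_vocab(freqs, k=50000):
    """The k most frequent words in a {word: count} dict."""
    return set(sorted(freqs, key=freqs.get, reverse=True)[:k])

def aligned_matrix(emb, vocab):
    """Stack embeddings for a shared vocabulary in a fixed word order."""
    words = sorted(vocab)
    return np.vstack([emb[w] for w in words])

# Words in the top 50k of both batches (26,307 of them here).
shared = top_k_vocab(freqs_batch0) & top_k_vocab(freqs_batch1)

X = aligned_matrix(emb_X, shared)  # 26,307 x 128, trained on batch 0
W = aligned_matrix(emb_W, shared)  # 26,307 x 128, trained on batch 1
```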
First, let's look at the transformation A: X→Y, found by solving XA = Y in the least-squares sense. Now, what does this matrix look like? When I figure out how to put images in here, there'll be images. The heatmap of the matrix doesn't look very interesting, which is somewhat expected. However, looking at the eigenvalues, we see something interesting: they all have magnitude less than 0.12. In fact, this is more than interesting; it's problematic. If this were representative of a random transformation of this kind, we'd expect the same to hold for the reverse map, which should behave like A^-1^, and that can't happen: the eigenvalues of A^-1^ are the inverses of the eigenvalues of A, so they should all have magnitude greater than 8.
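As a sketch, the fit and its spectrum can be computed like this, assuming the aligned matrices from above (this is just the obvious least-squares setup, not necessarily the exact script used):

```python
import numpy as np

# A minimizes ||XA - Y||_F over all 128 x 128 matrices.
A, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Eigenvalue magnitudes, sorted largest first; here they all come out below 0.12.
eig_mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
print(eig_mags[:10])
```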
And this is, in fact, exactly what we see: regressing B: Y→X and C: X→Z, we find that A, B, and C all have very similar spectra. In addition to being problematic, this might also be useful. Recall that our purpose is to study the change in word meanings over time by looking at different contexts. Obviously, word2vec does not produce consistent results (not even up to isometry); however, maps between embeddings from the same context are stable! Of course, that's only interesting if maps between embeddings from different contexts are different.
Indeed, they are! While the eigenvalues for the maps among (X, Y, Z) all look similar, with most of them around 0.08 in magnitude, the eigenvalues for D: X→W drop off much faster. This raises the next question: if I train another embedding on batch 1, call it U, what will E: W→U look like? Turns out the answer is: very different. In this case the eigenvalues of E have the more classic exponential profile, starting out around magnitude 30 and decaying to zero. This is much more reasonable.
So why are they so different? Well, E might actually be doing a better job estimating the transformation W→U than A does for X→Y. Indeed, A has MSE 0.827, while E has MSE 0.422. It's worth noting that all of the vectors in X, Y, etc. are unit length, so the zero map would have MSE 1. In some sense, the embeddings on batch 0 just seem to be terrible: all of X→Y, X→Z, and Y→X have MSE around 0.8.
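For reference, the MSE here is taken to be the mean squared residual norm per word; that reading is an assumption on my part, but it matches the observation that the zero map scores exactly 1 on unit-length vectors:

```python
import numpy as np

def mse(A, X, Y):
    """Mean over words of the squared error of the mapped vector."""
    return np.mean(np.sum((X @ A - Y) ** 2, axis=1))

# A, X, Y as in the least-squares sketch above.
print(mse(A, X, Y))                     # ~0.83 for A: X -> Y
print(np.mean(np.sum(Y ** 2, axis=1)))  # ~1.0, i.e. the zero-map baseline
```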
Changing Word2Vecs
Alright, to make sure this isn't because of a stupid mistake on my end, I'm switching from the basic word2vec from tensorflow to gensim's. This also means that our embeddings are into R^100^ rather than R^128^. More impactfully, I'm going to restrict the vocabulary to words that occur at least 10 times, which drops it down to about 9k words per batch; intersecting the two batches' vocabularies gives 7,792 words.
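A rough sketch of that gensim setup (parameter names follow gensim 4.x, where the dimension is `vector_size`; older releases call it `size`), with `min_count=10` dropping the rare words; `batch0_comments` is a hypothetical list of raw comment strings:

```python
from gensim.models import Word2Vec

tokenized = [c.lower().split() for c in batch0_comments]

model = Word2Vec(tokenized, vector_size=100, min_count=10, workers=4)

vocab = set(model.wv.index_to_key)  # roughly 9k words at min_count=10

# Unit-length vectors (the MSE baseline above relies on this normalization).
vecs = {w: model.wv.get_vector(w, norm=True) for w in vocab}
```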
Table break!
Reminder: we have three embeddings of batch 0: X, Y, Z; and three embeddings of batch 1: U, V, W.
First, let's look at MSE of the maps from one embedding to another:
| | X | Y | Z | U | V | W |
|---|---|---|---|---|---|---|
X | - | 0.368 | 0.380 | 1.081 | 1.079 | 1.077 |
Y | 0.369 | - | 0.376 | 1.078 | 1.081 | 1.074 |
Z | 0.381 | 0.375 | - | 1.077 | 1.076 | 1.073 |
U | 1.093 | 1.089 | 1.091 | - | 0.371 | 0.374 |
V | 1.093 | 1.092 | 1.092 | 0.373 | - | 0.379 |
W | 1.089 | 1.085 | 1.088 | 0.375 | 0.379 | - |
Ok, this is weird, but after a break I come back and produce the table again, and now I get the following:
| | X | Y | Z | U | V | W |
|---|---|---|---|---|---|---|
X | - | 0.538 | 0.548 | 0.918 | 0.917 | 0.916 |
Y | 0.539 | - | 0.545 | 0.917 | 0.916 | 0.915 |
Z | 0.548 | 0.544 | - | 0.917 | 0.915 | 0.914 |
U | 0.923 | 0.922 | 0.922 | - | 0.542 | 0.544 |
V | 0.922 | 0.921 | 0.921 | 0.543 | - | 0.546 |
W | 0.922 | 0.919 | 0.920 | 0.544 | 0.546 | - |
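A sketch of how a table like this could be generated, assuming `embeddings` maps each name to its aligned, row-normalized matrix over the 7,792 shared words (again, this is my reconstruction, not necessarily the exact script):

```python
import numpy as np

names = ["X", "Y", "Z", "U", "V", "W"]

def fit_mse(src, dst):
    """Least-squares map src -> dst and its mean squared residual per word."""
    A, *_ = np.linalg.lstsq(src, dst, rcond=None)
    return np.mean(np.sum((src @ A - dst) ** 2, axis=1))

for a in names:
    cells = ["-" if a == b else f"{fit_mse(embeddings[a], embeddings[b]):.3f}"
             for b in names]
    print(a, *cells, sep=" | ")
```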
Anyway, now what happens if we force the maps to be orthogonal?
| | X | Y | Z | U | V | W |
|---|---|---|---|---|---|---|
X | - | 17.092 | 0.606 | 10.246 | 5.148 | 4.623 |
Y | 17.092 | - | 17.200 | 27.286 | 22.134 | 21.596 |
Z | 0.606 | 17.200 | - | 10.139 | 5.044 | 4.519 |
U | 10.246 | 27.286 | 10.139 | - | 5.203 | 5.737 |
V | 5.148 | 22.134 | 5.044 | 5.203 | - | 0.825 |
W | 4.623 | 21.596 | 4.519 | 5.737 | 0.825 | - |
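For the orthogonal case, one standard way to fit the map is orthogonal Procrustes via an SVD; the sketch below is an assumption about the method (and possibly about the error scale), not necessarily what produced the numbers above:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal R minimizing ||XR - Y||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

R = procrustes(embeddings["X"], embeddings["W"])
err = np.mean(np.sum((embeddings["X"] @ R - embeddings["W"]) ** 2, axis=1))
```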
Um, this is kind of shit. However, I more or less have to use orthogonal transformations: with unconstrained maps I can always get an MSE of at most 1 just by sending everything to zero, since everything has norm 1. On the flip side, maybe it's OK that we can't tell the difference between these two batches; after all, they are randomly sampled from the same data, so it's maybe fine that the embeddings are indistinguishable via this route. So let's look at pulling from different days (rows and columns below are labeled day-batch).
| | 1-1 | 1-2 | 1-3 | 3-1 | 3-2 | 3-3 | 5-1 | 5-2 | 5-3 | 7-1 | 7-2 | 7-3 | 9-1 | 9-2 | 9-3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1-1 | 0.00 | 20.03 | 34.49 | 102.21 | 112.84 | 82.63 | 66.09 | 88.53 | 34.79 | 22.57 | 29.93 | 84.35 | 62.20 | 81.30 | 52.47 |
1-2 | 20.03 | 0.00 | 14.54 | 122.19 | 132.83 | 102.61 | 86.07 | 108.51 | 54.76 | 42.53 | 49.90 | 104.33 | 82.19 | 101.28 | 72.45 |
1-3 | 34.49 | 14.54 | 0.00 | 136.67 | 147.31 | 117.09 | 100.55 | 123.00 | 69.25 | 57.01 | 64.39 | 118.82 | 96.67 | 115.77 | 86.93 |
3-1 | 102.21 | 122.19 | 136.67 | 0.00 | 10.71 | 19.62 | 36.15 | 13.73 | 67.44 | 79.68 | 72.31 | 17.90 | 40.03 | 20.95 | 49.77 |
3-2 | 112.84 | 132.83 | 147.31 | 10.71 | 0.00 | 30.24 | 46.78 | 24.34 | 78.08 | 90.32 | 82.94 | 28.52 | 50.66 | 31.57 | 60.40 |
3-3 | 82.63 | 102.61 | 117.09 | 19.62 | 30.24 | 0.00 | 16.59 | 6.04 | 47.87 | 60.10 | 52.73 | 2.11 | 20.47 | 1.78 | 30.20 |
5-1 | 66.09 | 86.07 | 100.55 | 36.15 | 46.78 | 16.59 | 0.00 | 22.48 | 31.33 | 43.56 | 36.19 | 18.31 | 4.07 | 15.27 | 13.68 |
5-2 | 88.53 | 108.51 | 123.00 | 13.73 | 24.34 | 6.04 | 22.48 | 0.00 | 53.77 | 66.01 | 58.63 | 4.37 | 26.36 | 7.34 | 36.10 |
5-3 | 34.79 | 54.76 | 69.25 | 67.44 | 78.08 | 47.87 | 31.33 | 53.77 | 0.00 | 12.30 | 5.01 | 49.59 | 27.45 | 46.54 | 17.73 |
7-1 | 22.57 | 42.53 | 57.01 | 79.68 | 90.32 | 60.10 | 43.56 | 66.01 | 12.30 | 0.00 | 7.47 | 61.82 | 39.68 | 58.77 | 29.95 |
7-2 | 29.93 | 49.90 | 64.39 | 72.31 | 82.94 | 52.73 | 36.19 | 58.63 | 5.01 | 7.47 | 0.00 | 54.45 | 32.31 | 51.40 | 22.58 |
7-3 | 84.35 | 104.33 | 118.82 | 17.90 | 28.52 | 2.11 | 18.31 | 4.37 | 49.59 | 61.82 | 54.45 | 0.00 | 22.18 | 3.28 | 31.91 |
9-1 | 62.20 | 82.19 | 96.67 | 40.03 | 50.66 | 20.47 | 4.07 | 26.36 | 27.45 | 39.68 | 32.31 | 22.18 | 0.00 | 19.14 | 9.82 |
9-2 | 81.30 | 101.28 | 115.77 | 20.95 | 31.57 | 1.78 | 15.27 | 7.34 | 46.54 | 58.77 | 51.40 | 3.28 | 19.14 | 0.00 | 28.86 |
9-3 | 52.47 | 72.45 | 86.93 | 49.77 | 60.40 | 30.20 | 13.68 | 36.10 | 17.73 | 29.95 | 22.58 | 31.91 | 9.82 | 28.86 | 0.00 |
Alternatively, I have a heatmap image of the MSE matrix for the first ten days, three batches each, but it doesn't show a strong relationship. I'd like to see 3x3 blocks, or at least clear demarcations between days, but I don't really see that at all. In other words, this approach doesn't appear to work for this problem.
What next?
- Actually train word2vec on sentences rather than on entire comments
- Increase batch size (50k to 150k)
Training on Sentences
Didn't help.
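For the record, "training on sentences" here could look like the following sketch, splitting each comment with a sentence tokenizer before handing the result to gensim (NLTK's `sent_tokenize` and the `batch0_comments` list are assumptions for illustration):

```python
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")
from gensim.models import Word2Vec

# Split each raw comment into sentences, then tokenize each sentence.
sentences = [s.lower().split()
             for comment in batch0_comments
             for s in sent_tokenize(comment)]

model = Word2Vec(sentences, vector_size=100, min_count=10, workers=4)
```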