tfword2vecTest - juedaiyuer/researchNote GitHub Wiki

word2vec test

Beginner's notes: a set of test notes that should be readable with or without prior background.

Straight to the point: a minimal implementation can be found in tensorflow/tensorflow/examples/tutorials/word2vec/word2vec_basic.py. The code in this basic example downloads some data, runs a short training pass, and displays the results.

1. Download and read the data file

How the file is read

Use zipfile to read the zip contents as a string and split it into a list of words.

import zipfile

import tensorflow as tf

def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

Code snippet

vocabulary = read_data(filename)
print('Data size', len(vocabulary))

Sample output

Data size 17005207
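
The download step itself is not shown above. Below is a minimal sketch modeled on the tutorial's maybe_download helper (the URL and expected byte count follow word2vec_basic.py; treat them as assumptions if you use a different corpus):

import os

from six.moves import urllib

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download the file if it is not already present, then verify its size."""
  if not os.path.exists(filename):
    filename, _ = urllib.request.urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size != expected_bytes:
    raise Exception('Failed to verify ' + filename + '. Size does not match.')
  return filename

filename = maybe_download('text8.zip', 31344016)  # size as listed in the tutorial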

2. Build the dictionary

(figure: dl/word2vec/712028-beca102c0d97bfaa.png.jpeg)

vocabulary_size limits this demo to learning a total of 50000 distinct words.

# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000


def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    """assign id to word"""
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      """ 统计下输入数据里有多少词不在这个dictionary里,按照个数增加UNK的数量 """
      index = 0  # dictionary['UNK']
      unk_count += 1
    """ translate word to id """
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  """ data:ids count:list of [word,num] dictionary:word->id """
  return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
del vocabulary  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0

2.1 Code notes

Collect the (word, occurrence count) pairs for the most common words and append them to the count list.

The UNK entry in count (that is, the placeholder for unknown words: rare words whose frequency falls below the cutoff).

count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(n_words - 1))

After count is built, dictionary is derived from it: the frequency counts are discarded and each word is instead assigned its rank in the frequency ordering (from most to least frequent) as an integer id. In dictionary the word itself is the key and this id is the value.

At this point the text has been turned into a sequence of numeric ids, plus a vocabulary table and a reversed vocabulary table. The reversed dictionary simply uses the id as the key and the word as the value.

zip takes a series of iterables as arguments, packs their corresponding elements into tuples, and returns the list of those tuples.

reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))

Here data will be used to train the model, while dictionary (and its reverse) serves as the lookup table for translating between words and ids, and ultimately between words and their vectors.
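
A toy illustration of how dictionary and reversed_dictionary relate (a hypothetical three-word vocabulary, ordered by descending frequency):

# Hypothetical miniature vocabulary: word -> id, where id is the frequency rank.
dictionary = {'UNK': 0, 'the': 1, 'of': 2}
reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
print(reversed_dictionary)     # {0: 'UNK', 1: 'the', 2: 'of'}
print(reversed_dictionary[1])  # 'the'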

2.2 Output

Running this snippet produces the following output:

The word-frequency list count:

Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]

Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']

3. Build the scanner (batch generator)

The skip-gram model predicts the surrounding context from a target word.

The scanner function:

def generate_batch(batch_size, num_skips, skip_window):

batch_size is the number of (input, label) pairs produced per scan, skip_window is how many context words to take on each side, and num_skips is how many times each input word is reused to generate a label. Suppose the scanner first scans the first 8 words of the text, taking 1 word on each side and reusing each input twice. We would then observe the following result:

(figure: dl/word2vec/712028-b2f6369819a63854.jpeg)

With the step above we have constructed the inputs and labels, so supervised learning can now proceed.

Assertions are used for these precondition checks; the program crashes if a condition is violated:

assert batch_size % num_skips == 0
assert num_skips <= 2 * skip_window

3.1 Source code

# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]: the target word plus skip_window context words on each side
  buffer = collections.deque(maxlen=span)  # buffer holds the ids of the target word and its surrounding context
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):  # integer division: the number of distinct target words in one batch
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [skip_window]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  # Backtrack a little bit to avoid skipping words in the end of a batch
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

3.2 Output

For easy comparison, the data is as follows:

Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']

The output of this code is shown below, with batch_size=8, num_skips=2, skip_window=1: 8 pairs in total, each input reused twice, and a context width of 1 on each side.

3084 originated -> 5239 anarchism
3084 originated -> 12 as
12 as -> 6 a
12 as -> 3084 originated
6 a -> 195 term
6 a -> 12 as
195 term -> 6 a
195 term -> 2 of
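
As a further usage example (hypothetical parameters, reusing the generate_batch, data and reverse_dictionary defined above), the window can be widened to 2 words per side with each input reused 4 times:

# Hypothetical wider-context call: skip_window=2, num_skips=4.
# The asserts still hold: 8 % 4 == 0 and 4 <= 2 * 2.
batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])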

4. Build and train the skip-gram model

4.1 Source code

# Step 4: Build and train a skip-gram model.
# In this code, batch_size is the number of words in a batch, not the number of sentences.
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default():

  # Input data.
  # We only feed in word ids here: if batch_size is 128, the first step feeds the
  # ids of the first 128 words of the text.
  # labels holds the same kind of ids as inputs; the only difference is that one
  # is a row vector (tensor) and the other a column vector (tensor).
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    # Define the embedding parameter matrix. We initialize this large matrix with uniform random values.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # look up the embedding vectors for the words in the batch

    # Construct the variables for the NCE loss
    # The noise-contrastive estimation (NCE) loss is computed with a logistic regression model.
    # For that we need a weight and a bias for every word in the vocabulary (think of them as
    # the output weights, paired with the input embeddings). They are defined below.
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Add variable initializer.
  init = tf.global_variables_initializer()

4.2 Code notes

Constructing the computation units

embeddings = tf.Variable( tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  • Construct a [vocabulary_size, embedding_size] matrix as the container for the embeddings
  • Each row vector represents one vocabulary word
  • Each component of each vector is initialized randomly between -1 and 1

Calling tf.nn.embedding_lookup indexes the vectors that correspond to train_inputs; it is equivalent to treating each entry of train_inputs as an id and retrieving the embedding row that matches that id from the matrix.

embed = tf.nn.embedding_lookup(embeddings, train_inputs)

In this example batch_size and embedding_size happen to both be 128, but they are independent settings: each step looks up the embeddings of batch_size words, so embed has shape [batch_size, embedding_size].
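
A standalone sketch (made-up sizes, TensorFlow 1.x API as used in this example) showing that tf.nn.embedding_lookup is essentially row indexing of the embedding matrix:

import numpy as np
import tensorflow as tf

# Hypothetical small sizes for illustration only.
vocab_size, emb_dim = 10, 4
emb = tf.Variable(tf.random_uniform([vocab_size, emb_dim], -1.0, 1.0))
ids = tf.constant([3, 7, 0], dtype=tf.int32)
looked_up = tf.nn.embedding_lookup(emb, ids)  # shape [3, emb_dim]

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  rows, full = sess.run([looked_up, emb])
  # embedding_lookup gathers rows, the same as fancy indexing:
  assert np.allclose(rows, full[[3, 7, 0]])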

5. Start training

# Step 5: Begin training.
num_steps = 100001

with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  print('Initialized')

  average_loss = 0
  for step in xrange(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
      if step > 0:
        average_loss /= 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step ', step, ': ', average_loss)
      average_loss = 0

    # Note that this is expensive (~20% slowdown if computed every 500 steps)
    # This block is expensive; how was that slowdown figure actually measured?
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in xrange(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
        log_str = 'Nearest to %s:' % valid_word
        for k in xrange(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log_str = '%s %s,' % (log_str, close_word)
        print(log_str)
  final_embeddings = normalized_embeddings.eval()

5.1 Code notes
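
One note on the nearest-neighbour lookup in the loop above: (-sim[i, :]).argsort() sorts the vocabulary ids from most to least similar, and the slice [1:top_k + 1] skips the first entry, which is the validation word itself. A small numpy sketch with a hypothetical similarity row:

import numpy as np

# Hypothetical similarity row for one validation word against an 8-word vocabulary.
sim_row = np.array([0.1, 0.95, 0.3, 0.7, 0.2, 0.99, 0.4, 0.05])
top_k = 3
# Sorting the negated row gives ids from most to least similar; position 0
# would be the validation word itself in the real code, hence the slice.
nearest = (-sim_row).argsort()[1:top_k + 1]
print(nearest)  # [1 3 6]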

5.2 Output

On the first reporting loop, with a validation set of valid_size=16 words:

Average loss at step  0 :  271.869750977
Nearest to that: korn, lanterns, aise, seton, renounced, minardi, para, renders,
Nearest to nine: buying, breeders, synchronized, sobriquet, hhs, heraclitus, dwellers, whimsical,
Nearest to most: equites, seamus, rossi, gestae, albuquerque, fdp, apartheid, subway,
Nearest to four: indirect, brewery, pivotal, spleen, defeat, resides, alphabetically, arthropods,
Nearest to often: implanted, analyzer, inductee, surrendered, marpol, uncompressed, ducas, grampus,
Nearest to d: chosroes, deo, channing, eraserhead, feistel, vu, hylas, prix,
Nearest to for: handily, alexandria, igf, bryozoa, wangenheim, insulted, rapprochement, theorised,
Nearest to also: disbanded, blackwood, adverse, eugenicists, ibelin, batteries, mcenroe, uptown,
Nearest to many: buckingham, spinster, aral, ashland, unruly, millwall, preface, gymnastics,
Nearest to these: pharos, resort, luck, stallman, mosquito, blurry, chthonic, dukes,
Nearest to up: another, matrimony, zak, sunda, archeologist, columbine, pedals, val,
Nearest to over: proceedings, liu, optical, fleshed, corrino, pinching, transposition, cladistic,
Nearest to united: aonb, rgya, suriname, garret, strunk, astros, keynesian, atmospheres,
Nearest to his: tae, midtown, albatross, euboea, enzyme, ynys, perturbations, ccc,
Nearest to state: hana, madman, premium, virtuosic, descended, armoured, fever, borer,
Nearest to system: workers, dancer, prefaced, luxor, interfaces, funk, meades, hinged,
Average loss at step  2000 :  114.559190983
Average loss at step  4000 :  52.4917419915
Average loss at step  6000 :  32.9205914216
Average loss at step  8000 :  23.5268910366
Average loss at step  10000 :  18.0076291623

6. Visualize the embeddings

tsne.png is generated in Python's current working directory.

(figure: dl/word2vec/tsne.png)
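
The plotting code is not reproduced in these notes. Below is a sketch modeled on the tutorial's Step 6; it assumes scikit-learn and matplotlib are installed and reuses final_embeddings, reverse_dictionary and the Python 2 xrange from the steps above:

# Step 6 (sketch): project the first 500 embeddings to 2-D with t-SNE and plot them.
import matplotlib
matplotlib.use('Agg')  # write to a file instead of opening a window
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [reverse_dictionary[i] for i in xrange(plot_only)]

plt.figure(figsize=(18, 18))
for i, label in enumerate(labels):
  x, y = low_dim_embs[i, :]
  plt.scatter(x, y)
  plt.annotate(label, xy=(x, y), xytext=(5, 2),
               textcoords='offset points', ha='right', va='bottom')
plt.savefig('tsne.png')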

A simplified view of word2vec

(figure: dl/word2vec/word2vecdemo.jpeg)

The common Softmax + Cross-Entropy

NCE_LOSS
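
For comparison with the NCE loss used in Step 4: a full softmax with cross-entropy over all 50000 words is expensive, so the example samples a handful of negatives instead. A hedged sketch of the sampled-softmax alternative, assuming the same graph and names as Step 4 (train_labels, embed, num_sampled, vocabulary_size, embedding_size, math); softmax_weights and softmax_biases are hypothetical renamed copies of the NCE variables:

# Sketch only: swapping tf.nn.nce_loss for tf.nn.sampled_softmax_loss.
softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights,
                               biases=softmax_biases,
                               labels=train_labels,
                               inputs=embed,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))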

An alternative version of the source code

source