Course5
Character-level language models are more computationally expensive and harder to train, so they are usually reserved for more specialized cases, e.g. domains with a specialized vocabulary. They generally handle unknown words better than word-level language models, while word-level language models are better at capturing long-range dependencies.
Vanilla RNNs are weak at capturing long-term dependencies because they suffer from the vanishing gradient problem.
They may also suffer from the exploding gradient problem. This one is easier to spot because the parameters blow up and you will often see NaNs (Not A Number), indicating numerical overflow. A robust solution is gradient clipping: monitor the gradient vectors, and whenever one exceeds some threshold, rescale it back within that threshold.
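A minimal numpy sketch of gradient clipping by norm; the threshold value and the shapes of the hypothetical RNN gradients are made up for illustration (element-wise clipping with np.clip to a fixed interval is another common variant):

```python
import numpy as np

def clip_gradients(gradients, max_norm=5.0):
    """Rescale each gradient so its L2 norm does not exceed max_norm."""
    clipped = {}
    for name, grad in gradients.items():
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        clipped[name] = grad
    return clipped

# Hypothetical gradients of an RNN cell, just for illustration.
gradients = {"dWax": np.random.randn(100, 50) * 10,
             "dWaa": np.random.randn(100, 100) * 10,
             "db":   np.random.randn(100, 1) * 10}
gradients = clip_gradients(gradients, max_norm=5.0)
```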
The LSTM is more powerful and flexible than the GRU and is the more frequently used of the two.
A BRNN (bidirectional RNN) is powerful because it uses both past and future information from the sequence. The downside is that you can only make predictions after the whole sequence has been processed.
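A short sketch, assuming TensorFlow/Keras, of a bidirectional LSTM for per-token predictions; all layer sizes are made-up examples:

```python
import tensorflow as tf

vocab_size, embed_dim, hidden_units = 10000, 50, 64  # made-up sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # The Bidirectional wrapper runs one LSTM left-to-right and another
    # right-to-left, then concatenates their outputs, so every position
    # sees both past and future context.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden_units, return_sequences=True)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. a per-token binary tag
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```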
The vocabulary is the set of all distinct words occurring in your corpus. Each word in this vocabulary can then be represented as a one-hot vector.
You may order the words of the vocabulary alphabetically. Assume the vocabulary size is 10000 and the word "awesome" appears at position 475; then "awesome"'s one-hot representation looks like [0,0,0,......,1,.....,0,0,0...]: a vector of shape (1, 10000) that is 1 at position 475 and 0 everywhere else. In practice the vocabulary is usually very large, and since every word is represented this way, these vectors are extremely sparse. Moreover, one-hot vectors cannot capture semantic relationships or similarities between words. For example, "man" and "woman" are closely related terms, yet the inner product of their one-hot vectors is 0; in fact the inner product of every pair of distinct words is 0.
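A tiny sketch of this orthogonality, using a made-up four-word vocabulary instead of a real 10000-word one:

```python
import numpy as np

# Toy vocabulary just for illustration (a real one would have ~10,000 words).
vocab = ["apple", "awesome", "man", "woman"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a (1, vocab_size) row vector with a 1 at the word's index."""
    v = np.zeros((1, vocab_size))
    v[0, word_to_index[word]] = 1
    return v

o_man, o_woman = one_hot("man"), one_hot("woman")
print((o_man @ o_woman.T).item())  # 0.0 -- one-hot vectors of distinct words are orthogonal
```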
To overcome the above-mentioned problems, we instead use word embeddings to represent words. The word embedding of a word is a dense feature vector, the same size for every word and usually much smaller than the one-hot vector. For instance, suppose the features are ['gender', 'food', 'age']; then the feature values for the words "man" and "woman" might be [0.3, 0.002, 0.01] and [0.4, 0.003, 0.02] respectively, while the word "apple" might be [-0.2, 0.9, 0.04]. This tells us that the concepts of "man" and "woman" are associated mainly with the 'gender' feature, whereas "apple" is associated mainly with the 'food' feature, which implies that "man" is semantically more similar to "woman" than to "apple". In reality, however, the learned features are not as intuitive to interpret as this.
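A small sketch that plugs the toy feature vectors above into cosine similarity, showing that "man" and "woman" come out far more similar than "man" and "apple":

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings with features ['gender', 'food', 'age'] from the example above.
e_man   = np.array([0.3, 0.002, 0.01])
e_woman = np.array([0.4, 0.003, 0.02])
e_apple = np.array([-0.2, 0.9, 0.04])

print(cosine_similarity(e_man, e_woman))  # close to 1: very similar
print(cosine_similarity(e_man, e_apple))  # much lower: dissimilar
```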
Word embeddings trained (see 5.2.7 for more details) on very large datasets of unlabeled text can be reused for other NLP tasks such as Named Entity Recognition, whose training data are much smaller (and harder to collect); this is a form of transfer learning.
- Learn word embeddings from a large text corpus (1 ~ 100 billion words), or download pre-trained word embeddings.
- Transfer the embeddings to a new task with a smaller training set (say, 100k words).
- Optional: continue to fine-tune the word embeddings with the new data (see the sketch below).
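A minimal Keras sketch of these three steps; the pretrained matrix below is random, standing in for real downloaded embeddings, and the layer sizes and the 9-class NER output are made-up examples:

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10000, 300                       # made-up sizes
pretrained = np.random.randn(vocab_size, embed_dim).astype("float32")
# In practice `pretrained` would be loaded from e.g. word2vec/GloVe/fastText files.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained),
        trainable=False),  # set trainable=True for the optional fine-tuning step
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(9, activation="softmax"),  # e.g. 9 NER tag classes
])
```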
Transfer learning from word embeddings is useful for Named Entity Recognition, Text Summarization, Co-reference Resolution, and Parsing; it is less useful for Language Modeling and Machine Translation, which typically already have large datasets of their own.
Analogy reasoning: use cosine similarity to find the word w whose embedding e_w makes e_man - e_woman ≈ e_king - e_w hold, i.e. pick the w that maximizes sim(e_w, e_king - e_man + e_woman).
Why does word2vec use cosine-similarity?
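A sketch of this analogy completion over a tiny made-up embedding table (real embeddings would come from a trained model):

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, embeddings):
    """Find word d such that e_a - e_b ≈ e_c - e_d (here: man - woman ≈ king - ?)."""
    target = embeddings[word_c] - embeddings[word_a] + embeddings[word_b]
    best_word, best_sim = None, -np.inf
    for word, e in embeddings.items():
        if word in (word_a, word_b, word_c):
            continue
        sim = cosine_similarity(e, target)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Tiny made-up embedding table, just to show the mechanics.
embeddings = {
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "king":  np.array([ 1.0, 1.0, 0.3]),
    "queen": np.array([-1.0, 1.0, 0.3]),
    "apple": np.array([ 0.0, 0.0, 1.0]),
}
print(complete_analogy("man", "woman", "king", embeddings))  # expected: "queen"
```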
Stacking all word embeddings, which are column vectors, gives the so-called embedding matrix E, whose shape is (#features, vocab_size); multiplying E by a word's one-hot vector (or simply indexing the corresponding column) picks out that word's embedding. Word2Vec is a learning algorithm for learning this embedding matrix.
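A numpy sketch of that lookup, with tiny made-up dimensions:

```python
import numpy as np

n_features, vocab_size = 3, 5                # tiny made-up sizes
E = np.random.randn(n_features, vocab_size)  # embedding matrix, columns are word embeddings

word_index = 2                               # hypothetical index of some word
o = np.zeros((vocab_size, 1))
o[word_index] = 1                            # one-hot column vector

e_via_matmul = E @ o                         # (n_features, 1): the word's embedding
e_via_lookup = E[:, [word_index]]            # same thing, but a cheap column lookup
assert np.allclose(e_via_matmul, e_via_lookup)
```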
In practice we usually just use pre-trained word embeddings.
fastText is one of the most popular libraries for this; it supports many different languages.
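For example, pre-trained fastText vectors in .vec format can be loaded with gensim; this sketch assumes gensim is installed and that the English wiki-news vectors (file name is just an example) have been downloaded:

```python
from gensim.models import KeyedVectors

# The .vec files published by fastText use the word2vec text format.
vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
print(vectors["awesome"].shape)                # (300,)
print(vectors.most_similar("awesome", topn=3)) # nearest neighbours by cosine similarity
```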
In addition to the tutorial indicated above, note that word2vec is essentially a self-supervised learning algorithm, because the labeled data are word pairs generated automatically from raw text.
After training, the model can predict whether a new, unseen pair of words is a genuine context-word/target-word pair.
Recommended values for the number of negative examples k: 5 ~ 20 for smaller datasets; 2 ~ 5 for larger datasets.
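A minimal sketch of how such (context, word, label) examples with k negatives might be generated; the corpus is made up, and negatives are drawn uniformly rather than from the usual unigram^(3/4) heuristic:

```python
import random

corpus = "the quick brown fox jumps over the lazy dog".split()  # made-up corpus
vocab = sorted(set(corpus))

def make_training_examples(corpus, vocab, window=2, k=5):
    """Yield (context, word, label): label 1 for true pairs, 0 for sampled negatives."""
    examples = []
    for i, context in enumerate(corpus):
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((context, corpus[j], 1))   # positive (context, target) pair
            for _ in range(k):                         # k sampled negative words
                examples.append((context, random.choice(vocab), 0))
    return examples

print(make_training_examples(corpus, vocab, window=2, k=2)[:6])
```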
This can be used for evaluating machine translation systems or image captioning systems, but not for speech recognition.