[Text Mining][Week 4-2] Vector Semantics - mingoori0512/minggori GitHub Wiki

Distributed representation

  • Vector representation that encodes information about the distribution of contexts a word appears in

  • Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis).

  • We have several different ways we can encode the notion of "context".

Term-document matrix

context = appearing in the same document.

If two term vectors are (nearly) linearly dependent, the terms can be considered similar: their frequency distributions over documents are alike.

Vectors

Vector representation of the term; vector size = number of documents

Each entry records how often the term appears in the corresponding document.
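Below is a minimal sketch of such a term-document matrix in Python; the three toy documents are made up for illustration and are not from the notes.

```python
# Term-document matrix sketch: rows are terms, columns are documents,
# each cell = how often the term appears in that document.
from collections import Counter

docs = {
    "doc1": "the battle of the five armies",
    "doc2": "the battle was long",
    "doc3": "a good soldier wins the battle",
}

counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({w for c in counts.values() for w in c})

# Each term's vector has one entry per document (vector size = number of documents).
term_doc = {term: [counts[name][term] for name in docs] for term in vocab}

print(term_doc["battle"])  # [1, 1, 1] -- appears once in every document
print(term_doc["the"])     # [2, 1, 1]
```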

Cosine Similarity

  • We calculate the cosine similarity of two vectors to judge the degree of their similarity

  • Euclidean distance measures the magnitude of the distance between two points

  • Cosine similarity measures their orientation

  • As a rough rule of thumb, a cosine similarity of 0.9 or higher means the two vectors are quite similar
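A small sketch of the cosine similarity computation, in plain Python with no external libraries:

```python
# cos(x, y) = (x . y) / (|x| |y|): measures orientation, not magnitude.
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0: same direction, different magnitude
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0: orthogonal vectors
```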

Weighting dimensions

  • Not all dimensions are equally informative

TF-IDF

  • Term frequency-inverse document frequency

  • A scaling that represents a feature as a function of how frequently it appears in a data point, while accounting for its frequency in the overall collection

  • IDF for a given term = the number of documents in the collection / the number of documents that contain the term

ex. Articles like "a" and "the" make it hard to characterize a document. If a word appears commonly across the whole collection, its weight is discounted by that much.

  • Term frequency = the number of times term t occurs in document d; several variants exist (e.g., passing the raw count through a log function).

  • Inverse document frequency = the inverse of the fraction of documents that contain the term among the total number of documents N

  • IDF captures the informativeness of terms when comparing documents (a word that appears in every document gives no clue for distinguishing a particular document)
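A minimal TF-IDF sketch following the definitions above; the toy documents and the log-scaled IDF variant are illustrative choices, not prescribed by the notes.

```python
# tf(t, d) = count of term t in document d
# idf(t)   = log(N / df(t)), where df(t) = number of documents containing t
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency

def tf_idf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = tf_idf(tokenized[0])
print(weights["the"], weights["cat"])  # "the" is down-weighted relative to the rarer "cat"
```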

Evaluation

Intrinsic Evaluation

  • Relatedness: the correlation (Spearman/Pearson) between the vector similarity of word pairs and human judgements

  • Analogical reasoning (Mikolov et al. 2013): inferring relations by analogy. For the analogy

    Germany: Berlin :: France: ???

    find the closest vector to v("Berlin") - v("Germany") + v("France")
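A hedged sketch of this analogy lookup by vector arithmetic; `embeddings` is an assumed dict mapping words to numpy vectors (e.g. loaded from a pretrained model) and is not defined in the notes.

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Return the word closest (by cosine similarity) to v(b) - v(a) + v(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):  # skip the query words themselves
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# analogy("Germany", "Berlin", "France", embeddings)  # ideally returns "Paris"
```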

Sparse vectors

"aardvark"

A V-dimensional vector with a single 1 at the position identifying the word (a one-hot encoding)
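A tiny sketch of such a one-hot encoding, with a made-up four-word vocabulary:

```python
vocab = ["aardvark", "cat", "dog", "the"]         # illustrative vocabulary of size V = 4
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)      # V-dimensional vector of zeros
    vec[word_to_id[word]] = 1   # single 1 marking the word's identity
    return vec

print(one_hot("aardvark"))  # [1, 0, 0, 0]
```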

Dense vectors

The context so far has given us sparse TF and TF-IDF vectors.

The question: how do we compress that context information into the form of a dense vector?

Dense vectors from prediction

  • Learning low-dimensional representations of words by framing a prediction task: using context to predict words in a surrounding window

  • Transform this into a supervised prediction problem; similar to language modeling but we're ignoring order within the context window

  • Skipgram model (Mikolov et al. 2013): given a single word in a sentence, predict the words in a context window around it.

ex. a cocktail with gin and seltzer

window: the words surrounding "gin"
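A short sketch of how that window turns a sentence into skipgram training pairs (the window size and pair format are illustrative choices):

```python
def skipgram_pairs(tokens, window=2):
    """For each target word, pair it with every word within +/- window positions."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "a cocktail with gin and seltzer".split()
for target, context in skipgram_pairs(sentence):
    if target == "gin":
        print(target, "->", context)  # gin -> cocktail, with, and, seltzer
```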

Dimensionality reduction

"the" is a point in V-dimensional space->prediction task->"the" is a point in 2-dimensional space

Word Embeddings

  • Can you predict the output word from a vector representation of the input word?

  • Rather than seeing the input as a one-hot encoded vector specifying the word in the vocabulary we're conditioning on, we can see it as indexing into the appropriate row in the weight matrix W

  • Similarly, V has one H-dimensional vector for each element in the vocabulary (for the words that are being predicted)

  • Why this behavior? dog, cat show up in similar positions

-> to make the same predictions, these numbers need to be close to each other.
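A hedged PyTorch sketch of this setup (the sizes are illustrative; the notes do not specify an implementation): the embedding matrix W holds one H-dimensional row per vocabulary word, and looking up a word's row is equivalent to multiplying its one-hot vector by W.

```python
import torch
import torch.nn as nn

V, H = 10000, 100  # illustrative vocabulary size and embedding size

class SkipGram(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.W = nn.Embedding(vocab_size, hidden_size)             # input word -> its row of W
        self.out = nn.Linear(hidden_size, vocab_size, bias=False)  # one H-dim vector per predicted word

    def forward(self, word_ids):
        h = self.W(word_ids)  # (batch, H): the dense representation of the input word
        return self.out(h)    # (batch, V): scores over possible context words

model = SkipGram(V, H)
logits = model(torch.tensor([42]))                             # predict the context of word id 42
loss = nn.functional.cross_entropy(logits, torch.tensor([7]))  # true context word id 7
```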


Analogical inference (= analogical reasoning)

  • Mikolov et al. 2013 show that vector representations have some potential for analogical reasoning through vector arithmetic.

Low dimensional distributed representations

  • Low-dimensional, dense word representations are extraordinarily powerful (and are arguably responsible for much of the gains that neural network models have achieved in NLP).

  • They let your representations of the input share statistical strength with words that behave similarly in terms of their distributional properties (often synonyms or words that belong to the same class).

Two kinds of "training" data

  • The labeled data for a specific task (e.g. labeled sentiment for movie reviews): ~2K labels/reviews, ~1.5M words -> used to train a supervised model

  • General text (Wikipedia, the web, books, etc.): ~trillions of words -> used to train distributed word representations -> unsupervised or semi-supervised learning

Using dense vectors

  • In neural models (CNNs, RNNs, LMs), replace the V-dimensional sparse vector with the much smaller K-dimensional dense one.

  • Can also take the derivative of the loss function with respect to those representations to optimize for a particular task (i.e., updating the embeddings).
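A hedged sketch of this in PyTorch: pretrained vectors initialize an embedding layer, and because the layer stays trainable, the task loss also updates the embeddings. The random `pretrained` tensor is a stand-in for real word2vec/GloVe vectors.

```python
import torch
import torch.nn as nn

V, K = 10000, 100                      # illustrative vocabulary size and embedding size
pretrained = torch.randn(V, K)         # stand-in for pretrained dense word vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # keep embeddings trainable
classifier = nn.Linear(K, 2)           # e.g. positive/negative sentiment

word_ids = torch.tensor([[1, 5, 42]])  # a tiny toy review as word ids
features = embedding(word_ids).mean(dim=1)  # average the K-dimensional word vectors
logits = classifier(features)
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()                        # gradients flow into embedding.weight as well
```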

General text -> word embeddings

Solving the skipgram task over this general text yields embeddings that fit it better.

emoji2vec (emojis with similar shapes get similar vectors), node2vec (node embeddings can also be learned on graphs, enabling grouping of nodes)

Trained embeddings

Pre-trained embeddings that are ready to use:

  • Word2Vec (widely used; developed at Google; trained on Wikipedia)

  • GloVe (widely used; developed from skipgram-style ideas; trained on Wikipedia and web data; from Stanford)

  • Levy/Goldberg dependency embeddings
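A hedged sketch of loading such pretrained embeddings with gensim's downloader; the dataset names come from the gensim-data catalog and downloading them pulls large files, so treat their availability as an assumption about your environment.

```python
import gensim.downloader as api

word2vec = api.load("word2vec-google-news-300")  # Word2Vec vectors (Mikolov et al.)
glove = api.load("glove-wiki-gigaword-100")      # GloVe vectors (Stanford)

# Analogy via vector arithmetic, as in the Berlin/Germany/France example above.
print(word2vec.most_similar(positive=["Berlin", "France"], negative=["Germany"], topn=1))
print(glove.similarity("cat", "dog"))
```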