# L. Linguistics
## Definition of a Vector
In the context of linguistics, a vector is a mathematical representation of a word, phrase, or even a sentence in a multi-dimensional space. Each dimension of this space represents some aspect of the word's meaning, often derived from its usage in a large corpus of text.
A word vector in a high-dimensional space might look like this:
$$ \vec{w} = [w_1, w_2, w_3, \dots, w_n] $$
where $w_i$ are the components of the vector corresponding to different dimensions of the word's contextual usage.
For instance, in a simple model, the word "cat" might be represented in a three-dimensional space as
$$ \vec{\text{cat}} = [0.7, 0.2, 0.1] $$
where each dimension could hypothetically represent associations with "pet", "animal", and "domestic".
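As a toy illustration, such a vector could be stored and inspected as shown below. Note that the dimension labels here are hypothetical: the axes of a real trained embedding have no human-readable names.

```python
import numpy as np

# Hypothetical dimension labels for the toy "cat" vector above;
# real embedding dimensions are not individually interpretable.
dimensions = ["pet", "animal", "domestic"]
cat = np.array([0.7, 0.2, 0.1])

for label, weight in zip(dimensions, cat):
    print(f"{label}: {weight:.1f}")
```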
## Practical Implications and Applications
### Semantic Similarity
#### Cosine Similarity
One common method to measure the similarity between two word vectors is cosine similarity, which calculates the cosine of the angle between two vectors. If two words have vectors that point in the same direction, their cosine similarity will be close to 1, indicating high semantic similarity.
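Formally, for two vectors $\vec{a}$ and $\vec{b}$, cosine similarity is the dot product divided by the product of their magnitudes:

$$ \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|} $$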
For example, the vectors for "cat" and "dog" might be closer to each other than the vectors for "cat" and "car", reflecting their semantic similarity.
### Example
Let's consider a practical example to illustrate how vectors work in linguistics:
Imagine a text corpus where we observe the following co-occurrences for the words "cat", "dog", and "fish":
- "cat" appears with "pet", "animal", and "fur".
- "dog" appears with "pet", "animal", and "bark".
- "fish" appears with "water", "swim", and "gill".
By counting these co-occurrences, we might construct vectors like:
- 🔴 $\vec{\text{cat}} = [3, 5, 2, 0, 0, 0]$ over the dimensions ("pet", "animal", "fur", "bark", "water", "gill").
- 🟠 $\vec{\text{dog}} = [3, 5, 0, 2, 0, 0]$ over the same dimensions.
- 🔵 $\vec{\text{fish}} = [0, 0, 0, 0, 5, 3]$ over the same dimensions.
Here, the vectors capture contextual relationships: comparing them shows that "cat" and "dog" are more similar to each other than either is to "fish".
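Working through the numbers: $\vec{\text{cat}} \cdot \vec{\text{dog}} = 3 \cdot 3 + 5 \cdot 5 = 34$, and both vectors have magnitude $\sqrt{9 + 25 + 4} = \sqrt{38}$, giving a cosine similarity of $34/38 \approx 0.89$; meanwhile $\vec{\text{cat}} \cdot \vec{\text{fish}} = 0$, so their similarity is exactly 0. A minimal NumPy sketch confirms this:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy co-occurrence vectors over the dimensions
# ("pet", "animal", "fur", "bark", "water", "gill")
cat  = np.array([3, 5, 2, 0, 0, 0])
dog  = np.array([3, 5, 0, 2, 0, 0])
fish = np.array([0, 0, 0, 0, 5, 3])

print(f"cat vs dog:  {cosine_similarity(cat, dog):.3f}")   # ≈ 0.895
print(f"cat vs fish: {cosine_similarity(cat, fish):.3f}")  # 0.000
```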
Several Python libraries provide pre-trained word vectors (embeddings) that represent word meanings in a high-dimensional space. The example below uses Gensim to load a pre-trained Word2Vec model.
```bash
pip install gensim
```
```python
import gensim.downloader as api

# Load the pre-trained Word2Vec model (large download on first use)
model = api.load('word2vec-google-news-300')

# Get the 300-dimensional vectors for the words
vector_machine = model['machine']
vector_learning = model['learning']
vector_dromedary = model['dromedary']

# Compute pairwise cosine similarities
similarity_machine_learning = model.similarity('machine', 'learning')
similarity_machine_dromedary = model.similarity('machine', 'dromedary')
similarity_learning_dromedary = model.similarity('learning', 'dromedary')

print(f"Similarity between 'machine' and 'learning': {similarity_machine_learning}")
print(f"Similarity between 'machine' and 'dromedary': {similarity_machine_dromedary}")
print(f"Similarity between 'learning' and 'dromedary': {similarity_learning_dromedary}")
```