On Embeddings - doraithodla/notes GitHub Wiki

Introduction to embeddings in natural language processing using Artificial Neural Network and Gensim. https://towardsdatascience.com/generating-word-embeddings-from-text-data-using-skip-gram-algorithm-and-deep-learning-in-python-a8873b225ab6

Notes and Quotes

"Word embedding is used in natural language processing (NLP) to describe how words are represented for text analysis. Typically, this representation takes the form of a real-valued vector that encodes the word’s meaning with the expectation that words that are closer to one another in the vector space will have similar meanings. In a process known as word embedding, each word is represented as real-valued vectors in a predetermined vector space. The method is called deep learning since each word is assigned to a single vector, and the vector values are learned like a neural network (Jason Brownlee, 2017)."
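The idea that "words closer to one another in the vector space have similar meanings" can be illustrated with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, hand-picked so related words point in similar directions
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much smaller
```

With trained embeddings (e.g. from Gensim's Word2Vec), the same similarity computation is what powers nearest-neighbor queries over the vocabulary.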

Vectorization

Vectorization is the process of transforming scalar operations into vector operations. In other words, vectorization is simply a way to convey to the computer to execute operations on entire arrays, without using loops or other control structures that would slow down the calculations.

We need vectorization because it significantly reduces the time required to perform mathematical operations, thus making our programs run faster and more efficiently. Performing mathematical operations on an array in a vectorized manner is many times faster than repeating the same operation for each element of the array.

In Python, vectorization is often achieved by using the NumPy library, which provides optimized, vectorized functions for mathematical calculations on arrays. For example, instead of using a for loop to add two arrays element-wise, we can simply use the + operator to add the arrays directly, like this:

```python
import numpy as np

# Create two arrays of equal length
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Add the arrays element-wise
c = a + b

# Output the result
print(c)
```

This code performs the element-wise addition of a and b in a vectorized manner, resulting in the array [5, 7, 9].
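To see the speed difference described above, one can time an explicit Python loop against the equivalent NumPy operation. The sketch below is illustrative; the exact timings depend on the machine, but the vectorized version is typically orders of magnitude faster.

```python
import time
import numpy as np

n = 1_000_000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Element-wise addition with an explicit Python loop
start = time.perf_counter()
c_loop = [a[i] + b[i] for i in range(n)]
loop_time = time.perf_counter() - start

# The same addition, vectorized by NumPy
start = time.perf_counter()
c_vec = a + b
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```

The vectorized version is faster because NumPy dispatches the whole operation to optimized compiled code instead of interpreting one addition at a time.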

TF/IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. This statistic is used often in natural language processing, information retrieval, and machine learning.

TF-IDF takes into account two factors:

  1. Term Frequency (TF): The number of times a word appears in a document. Words that are repeated more often in a document are considered more important.

  2. Inverse Document Frequency (IDF): A measure of how rare a word is across the corpus, commonly computed as the logarithm of the total number of documents divided by the number of documents that contain the word. Words that occur in fewer documents are considered more important.
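The two factors above can be combined in a few lines of Python. This sketch uses the textbook formula tf × log(N/df) on a tiny hand-made corpus; note that library implementations (such as scikit-learn's) use smoothed variants, so their exact numbers differ.

```python
import math

# A tiny corpus of tokenized documents, made up for illustration
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "is", "the", "second", "second", "document"],
    ["and", "the", "third", "one"],
]

def tf(term, doc):
    # Term frequency: share of the document's words that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency, textbook form: log(N / df)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF (and hence TF-IDF) is zero
print(tfidf("the", docs[0], docs))
# "second" is frequent in one document and rare in the corpus, so it scores high
print(tfidf("second", docs[1], docs))
```

This shows the intuition directly: common words like "the" carry no discriminating weight, while words concentrated in few documents do.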

Count Vectors represent a document as a vector of the counts of each word in the document; TF-IDF vectors additionally weight those counts by inverse document frequency. Both representations are commonly used in text classification, information retrieval, and other natural language processing tasks.

The CountVectorizer class from the scikit-learn library computes plain count vectors for a given corpus of documents (for TF-IDF weighting, scikit-learn provides a separate TfidfVectorizer class). Here's an example:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

In this example, we first define a corpus of four documents. Then, we create a CountVectorizer object and fit it to the corpus using the fit_transform() method. Finally, we print out the feature names (i.e., the unique words in the corpus) and the resulting count vector for each document.