An embedding is a numerical representation of a real-world object, such as a word, an image, or a video. The representation captures the semantic meaning of the object, so objects with similar meanings end up close together in the vector space. For example, the sentence "What is the main benefit of voting?" can be embedded as a vector (e.g., [0.84, 0.42, ..., 0.02]) that encodes its meaning.


The process for turning raw data into embeddings and placing them into the vector space¹²
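
To make the idea of "closeness" in the vector space concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three 4-dimensional vectors are hypothetical values chosen for illustration; they do not come from any real model, and real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings (assumed values, for illustration only).
voting_benefit = [0.84, 0.42, 0.10, 0.02]  # "What is the main benefit of voting?"
why_vote = [0.80, 0.45, 0.12, 0.05]        # a sentence with a similar meaning
pasta_recipe = [0.05, 0.10, 0.90, 0.40]    # an unrelated sentence

similar = cosine_similarity(voting_benefit, why_vote)
unrelated = cosine_similarity(voting_benefit, pasta_recipe)
print(f"similar: {similar:.3f}, unrelated: {unrelated:.3f}")
```

With these assumed values, the two voting sentences score much higher than the unrelated pair, which is the property that search and recommendation systems rely on.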

Embeddings can represent a wide variety of objects and data types. Common examples of things that can be embedded include:

Types of objects


Types of objects that can be embedded¹³

Word Embeddings: These represent words in a continuous vector space. Word2Vec, GloVe, and FastText are popular word embedding techniques.
Image Embeddings: Images can be embedded into vectors, allowing us to compare and analyze them. Convolutional neural networks (CNNs) often generate image embeddings.
Document Embeddings: These capture the meaning of entire documents or paragraphs. Doc2Vec and BERT-based encoders are examples of document embedding methods.
Graph Embeddings: Used for graph-structured data (e.g., social networks, knowledge graphs). Graph neural networks (GNNs) create graph embeddings.
Entity Embeddings: Represent entities (e.g., users, products) in recommendation systems.
Time Series Embeddings: Useful for time-dependent data (e.g., stock prices, sensor readings).

Applications of Embeddings

Search Engines: Google uses embeddings to match text queries to relevant documents or web pages.
Recommendation Systems: Embeddings help recommend products, movies, or music based on user preferences.
Chatbots: Chatbots use embeddings to understand user input and generate contextually relevant responses.
Image Search and Classification: Image embeddings enable efficient image retrieval and classification.
Social Media: Platforms like Snapchat use embeddings to serve personalized ads.
Natural Language Understanding: Embeddings enhance tasks like sentiment analysis, named entity recognition, and text summarization.

Commonly used embeddings for natural language and coding

Embedding	Perk	Drawback
Cohere Embeddings	High-quality embeddings for various NLP tasks	Fewer model options compared to OpenAI
	Competitive pricing	Community support still growing
	Fast and efficient API responses
	Straightforward integration
	Responsive support and growing community
-----------------------	-----------------------	----------------------
OpenAI Embeddings	State-of-the-art accuracy and versatility	Generally higher pricing
	Comprehensive documentation and strong community support	Can experience slower performance during peak times
	Wide range of models, allowing fine-tuned customizations
	Frequent updates and continuous improvements
	Highly scalable for various application sizes
----------------------	-----------------------	----------------------
Google Universal Sentence Encoder	Easy to use with strong performance on semantic similarity	Limited customization options
	Pre-trained on a large, diverse dataset	May not be as cutting-edge as newer models
	Free to use
	Good integration with TensorFlow and other Google tools
----------------------	-----------------------	----------------------
GloVe (Global Vectors for Word Representation)	Efficient and effective for many NLP tasks	Static embeddings, do not handle polysemy
	Pre-trained on large corpora, widely used in research	Not as accurate as contextual embeddings
	Easy to integrate with various ML frameworks	No longer state-of-the-art
	Free to use
----------------------	-----------------------	----------------------
FastText Embeddings	Handles out-of-vocabulary words by using subword information	Static embeddings, do not handle polysemy
	Pre-trained models available for many languages	Not as accurate as contextual embeddings
	Good balance of efficiency and performance
	Free to use
----------------------	-----------------------	----------------------
BERT	Contextual embeddings that handle polysemy effectively	Computationally intensive and slower inference
	State-of-the-art performance for many NLP tasks	Larger models require significant resources
	Pre-trained models available for various tasks	May require fine-tuning for specific tasks
	Strong support and community
	Highly versatile and adaptable
----------------------	-----------------------	----------------------
Doc2Vec	Captures semantic meaning of entire documents	Requires substantial training data
	Good for document-level tasks	Slower training compared to word embeddings
	Handles variable length text	Not as widely supported as other embeddings
	Free to use
----------------------	-----------------------	----------------------
Code Embeddings	Specialized for source code representation	Limited to programming-related tasks
	Useful for code search and understanding	Can be complex to train and fine-tune
	Captures syntactic and semantic code features	Not as mature as text embeddings
----------------------	-----------------------	----------------------
Graph Embeddings	Captures relationships and structures in graph data	Requires knowledge of graph theory
	Effective for network analysis and link prediction	Computationally intensive
	Useful for recommendation systems and social networks	Complexity in model selection and training
	Free to use

Choosing the right embedding model

When choosing an embedding model for a project, there are a few things to take into consideration:

The information we are inputting:
- The embedding must match the type of information the model will receive or be trained on. As discussed above, this can range from single words to entire documents, images, and even relational data such as graphs;
- It may also be relevant to give the model some domain context. With BERT, for example, the model is first pre-trained on generic language data and then trained a second time on domain-specific data, such as medical or legal text (https://www.featureform.com/post/the-definitive-guide-to-embeddings)
Types of embeddings:
- Dense Embeddings:
  - Dense embeddings are continuous, real-valued vectors that capture overall semantic meaning.
  - Suitable for tasks like dense retrieval and semantic search.
  - Examples include embeddings from models like OpenAI's Ada or sentence transformers.
- Sparse Embeddings:
  - Sparse embeddings have mostly zero values, so the few non-zero dimensions emphasize the most relevant information.
  - Beneficial for specialized domains with rare terms (e.g., medical field).
  - Overcome limitations of Bag-of-Words (BOW) models.


Sparse Vectors vs Dense Vectors¹⁴
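
The difference between the two representations can be sketched in a few lines. The values and the vocabulary indices below are hypothetical; the point is only that a dense vector stores every dimension, while a sparse vector stores just the non-zero ones.

```python
# Dense vector: every dimension holds a value.
dense_query = [0.12, -0.40, 0.33, 0.08]
dense_doc = [0.10, -0.35, 0.30, 0.11]

# Sparse vector: only non-zero dimensions are stored, keyed by a
# (hypothetical) vocabulary index. All other dimensions are implicitly zero.
sparse_query = {17: 1.2, 4096: 0.7}
sparse_doc = {17: 0.9, 250: 0.4, 4096: 0.5}


def dense_dot(a, b):
    """Dot product over all dimensions."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product that only visits dimensions present in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(weight * b[index] for index, weight in a.items() if index in b)


dense_score = dense_dot(dense_query, dense_doc)
sparse_score = sparse_dot(sparse_query, sparse_doc)
print(dense_score, sparse_score)
```

Because the sparse score depends only on shared non-zero dimensions, a rare term that appears in both the query and the document can contribute strongly, which is one reason sparse representations are attractive in specialized domains.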

Multi-Vector Embeddings (ColBERT):
- Late interaction models where query and document representations interact after encoding.
- Efficient for large document collections due to pre-computed document representations.


ColBERT — Late Interaction Models¹⁵
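
A minimal sketch of the late interaction scoring step (often called MaxSim): for each query token vector, take the best match among the document token vectors, then sum those maxima. The 2-dimensional token vectors are hypothetical stand-ins for what an encoder would produce, and the document vectors are assumed to be pre-computed.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Hypothetical per-token embeddings (one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # pre-computed offline
doc_b = [[0.1, 0.9], [0.3, 0.3]]              # pre-computed offline

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Only the query has to be encoded at search time; the document side is read from storage, which is what makes the approach practical for large collections.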

Long Context Embeddings:
- Address challenges in embedding long documents.
- Models like BGE-M3 allow encoding sequences up to 8,192 tokens.
Variable Dimension Embeddings (Matryoshka Representation Learning):
- Nested lower-dimensional embeddings (like Matryoshka Dolls).
- Efficiently pack information at logarithmic granularities.
- Models like OpenAI's text-embedding-3-small and Nomic's Embed v1.5 use this approach.
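
With a Matryoshka-trained model, a shorter embedding is typically obtained by keeping the leading dimensions and rescaling the result to unit length. The sketch below assumes that convention; the 8-dimensional vector is a hypothetical example, not real model output.

```python
import math


def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


# Hypothetical full-size embedding (assumed values).
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]

short = truncate_and_normalize(full, 4)
print(len(short), short)
```

The shorter vector costs less to store and compare, at the price of some accuracy. This only works as intended for models trained with this nesting in mind; truncating an ordinary embedding this way may discard important information.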
Code Embeddings:
- Transform how developers interact with codebases.
- Semantic understanding for code snippets and functionalities.
- Models like OpenAI's text-embedding-3-small and jina-embeddings-v2-base-code facilitate code search and assistance.

How to Measure Embedding Performance:
- Retrieval Metrics and MTEB Benchmark:
  - Retrieval metrics are used to evaluate the performance of embeddings.
  - The Massive Text Embedding Benchmark (MTEB) is widely recognized for this purpose.
  - MTEB evaluates embeddings using datasets containing a corpus, queries, and mappings to relevant documents.
  - The goal is to identify pertinent documents based on similarity scores calculated using cosine similarity.
  - Metrics like nDCG@10 are commonly used to assess performance.
- Limitations of MTEB:
  - While MTEB provides insights into top embedding models, it doesn't determine the best choice for specific domains or tasks.
  - It's essential to evaluate embeddings on your own dataset to find the optimal model.
- Chunk Attribution:
  - In scenarios where raw text is available, assessing Retrieval-Augmented Generation (RAG) performance on user queries is crucial.
  - Chunk attribution helps identify which retrieved chunks or documents were used by the model to generate an answer.
  - An attribution score of 0 indicates that necessary documents weren't retrieved.
  - The average score represents the ratio of utilized chunks at a run level.
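
As an illustration of the nDCG@10 metric mentioned above, here is a small sketch. It uses the linear-gain formulation of DCG, which is one common variant; the document ids and relevance labels are hypothetical, and the ranking is assumed to come from sorting documents by cosine similarity to the query.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ground truth: d1 and d3 are relevant to the query.
relevance = {"d1": 1, "d3": 1}

# Hypothetical rankings produced by two embedding models.
ranking_model_a = ["d1", "d2", "d3"]  # relevant documents at ranks 1 and 3
ranking_model_b = ["d1", "d3", "d2"]  # relevant documents at ranks 1 and 2

score_a = ndcg_at_k(ranking_model_a, relevance)
score_b = ndcg_at_k(ranking_model_b, relevance)
print(f"model A: {score_a:.3f}, model B: {score_b:.3f}")
```

Model B places both relevant documents first and therefore gets the maximum score, while model A is penalized for ranking an irrelevant document above a relevant one.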
Choosing the Right Embedding Model for RAG Systems:

The process of selecting an optimal embedding model for a Retrieval-Augmented Generation (RAG) system can be enhanced by using chunk attribution to identify which model best fits a specific use case. Galileo’s GenAI Studio offers a practical demonstration using 10-K annual financial reports from Nvidia over the past four years.

Data Preparation
- Retrieval and Parsing: The 10-K reports are parsed using the PyPDF library, producing approximately 700 large text chunks.
- Question Generation: GPT-turbo with a zero-shot instruction prompt generates a question for each chunk. A subset of 100 chunks is randomly selected to ensure questions cover all reports.
Evaluation Metrics
- RAG Metrics:
  - Chunk Attribution: Boolean metric indicating whether a chunk contributed to the response.
  - Chunk Utilization: Measures the extent of chunk text used in responses.
  - Completeness: Assesses how much of the provided context was used in generating a response.
  - Context Adherence: Evaluates if the LLM’s output aligns with the given context.
Safety Metrics:
- Personally Identifiable Information (PII): Flags instances of PII such as credit card numbers and email addresses.
- Toxicity: Binary classification to detect hateful or toxic information.
- Tone: Classifies response tone into nine emotional categories.
System Metrics:
- Latency: Measures the response time of LLM calls.
Workflow for Model Evaluation

A function is created to run various sweep parameters, testing different embedding models to identify the optimal one. Steps include:
1. Loading the embedding model.
2. Managing the vector index.
3. Vectorizing chunks and adding them to the index.
4. Loading the chain and defining tags.
5. Preparing Galileo callback with metrics and tags.
6. Running the chain with questions to generate answers.
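
The shape of such a sweep function can be sketched without any external service. Everything below is a simplified, hypothetical stand-in: the two "models" are toy counting functions rather than real embedding models, the index is a plain list, the chunks and questions are made up, and the chain and Galileo callback steps (4 and 5) are left out entirely. The score used here is top-1 hit rate, not Galileo's metrics.

```python
import math
from collections import Counter


def embed_words(text):
    """Hypothetical stand-in model: bag-of-words counts."""
    return Counter(text.lower().split())


def embed_char_trigrams(text):
    """Hypothetical stand-in model: character-trigram counts."""
    lowered = text.lower()
    return Counter(lowered[i:i + 3] for i in range(len(lowered) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def run_sweep(models, chunks, questions):
    """Return {model name: fraction of questions whose top hit is the expected chunk}."""
    results = {}
    for name, embed in models.items():               # step 1: load the embedding model
        index = [embed(chunk) for chunk in chunks]    # steps 2-3: build the index from chunks
        hits = 0
        for question, expected in questions:          # step 6: run the questions
            query = embed(question)
            best = max(range(len(chunks)), key=lambda i: cosine(query, index[i]))
            hits += best == expected
        results[name] = hits / len(questions)
    return results


# Made-up chunks and questions; each question is paired with the index of its source chunk.
chunks = [
    "Revenue for fiscal year 2023 grew on data center demand.",
    "Income tax expense decreased compared with the prior year.",
    "Research and development spending increased for new products.",
]
questions = [
    ("How did income tax expense change?", 1),
    ("What drove revenue growth in fiscal year 2023?", 0),
]

results = run_sweep(
    {"words": embed_words, "char_trigrams": embed_char_trigrams},
    chunks,
    questions,
)
print(results)
```

In a real setup, each entry in `models` would wrap an actual embedding model, and the per-question answers would be logged together with the tags so that runs can be compared side by side.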
Failure Analysis

Instances with an attribution score of 0 (indicating retrieval failure) can be easily identified. For example, failures occurred when chunks mentioned income tax but did not reference the specific year in question.
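
A small sketch of this kind of failure analysis, assuming each run record holds one Boolean attribution flag per retrieved chunk. The record layout and the example questions are hypothetical and do not reflect Galileo's actual export format.

```python
# Hypothetical run records: one Boolean per retrieved chunk, True if the
# chunk contributed to the generated answer.
runs = [
    {"question": "What was the income tax expense in the latest year?",
     "chunk_attributed": [False, False, False, False]},
    {"question": "Which segment grew fastest?",
     "chunk_attributed": [True, False, True, False]},
    {"question": "How much was spent on research and development?",
     "chunk_attributed": [True, True, False, False]},
]


def attribution_score(flags):
    """Fraction of retrieved chunks that were attributed to the answer."""
    return sum(flags) / len(flags) if flags else 0.0


scores = [attribution_score(run["chunk_attributed"]) for run in runs]

# A score of 0 means none of the retrieved chunks were used: a retrieval failure.
failures = [run["question"] for run, score in zip(runs, scores) if score == 0]

# Run-level average: the overall ratio of utilized chunks.
average_score = sum(scores) / len(scores)

print(failures)
print(f"average attribution: {average_score:.2f}")
```

Questions that land in `failures` are the ones worth inspecting by hand, for example to check whether the retrieved chunks mention the right topic but the wrong year.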

Practical embeddings associated with each of the platforms provided in LangChain

OpenAI:
- text-embedding-3-small
- text-embedding-3-large
- text-embedding-ada-002
Cohere:
- embed-english-light-v2.0
- embed-english-light-v3.0
- embed-english-v2.0
- embed-english-v3.0
- embed-multilingual-light-v3.0
- embed-multilingual-v2.0
- embed-multilingual-v3.0
Anthropic:
- Voyage AI: Anthropic does not provide its own embedding models, but Voyage AI offers a wide variety of options. Factors to weigh when selecting among them include dataset size, architecture, inference performance, and customization.
- voyage-large-2
- voyage-code-2
- voyage-2
- voyage-lite-02-instruct
Ollama:
- mxbai-embed-large
- nomic-embed-text
- all-minilm

References

1: Getting Started With Embeddings - Hugging Face. https://huggingface.co/blog/getting-started-with-embeddings.

2: What are embeddings in machine learning? | Cloudflare. https://www.cloudflare.com/learning/ai/what-are-embeddings/.

3: Embeddings in Machine Learning: Types, Models & Best Practices. https://swimm.io/learn/large-language-models/embeddings-in-machine-learning-types-models-and-best-practices.

4: The Full Guide to Embeddings in Machine Learning | Encord. https://encord.com/blog/embeddings-machine-learning/.

5: Understanding Word Embeddings: The Building Blocks of NLP and GPTs. https://www.freecodecamp.org/news/understanding-word-embeddings-the-building-blocks-of-nlp-and-gpts/.

6: Word Embedding Methods in Natural Language Processing: a Review. https://doaj.org/article/918fe89bfb7a472fa1ac48c6a8c5d212.

7: Introducing text and code embeddings | OpenAI. https://openai.com/blog/introducing-text-and-code-embeddings/.

8: Understanding Encoders and Embeddings in Large Language Models ... - Medium. https://medium.com/@sharifghafforov00/understanding-encoders-and-embeddings-in-large-language-models-llms-1e81101b2f87.

9: Neural Network Embeddings Explained - Towards Data Science. https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526.

10: https://openai.com/_next/static/chunks/1420.023ea14fc18e2250.js%29.

11: Mastering RAG: How to Select an Embedding Model. https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model.

12: What are Vector Embeddings? https://qdrant.tech/articles/what-are-embeddings/.

13: What are Vector Embeddings? https://www.techtarget.com/searchenterpriseai/definition/vector-embeddings.

14: Dense Vectors in Natural Language Processing. https://medium.com/@yasindusanjeewa8/dense-vectors-in-natural-language-processing-06818dff5cd7.

15: ColBERT — A Late Interaction Model For Semantic Search. https://medium.com/@zz1409/colbert-a-late-interaction-model-for-semantic-search-da00f052d30e.

Embeddings - runtimerevolution/labs GitHub Wiki
