Embeddings - runtimerevolution/labs GitHub Wiki

An embedding is a numerical representation of a real-world object, such as words, images, or videos. These representations capture the semantic meaning of the object, making them useful for various applications. For example, consider a sentence like "What is the main benefit of voting?" We can create an embedding for this sentence, represented as a vector (e.g., [0.84, 0.42, ..., 0.02]). This vector encodes the meaning of the sentence.
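To make "encodes the meaning" concrete, the sketch below compares embeddings with cosine similarity, a common choice for this. The three vectors are made-up 4-dimensional values chosen for illustration; real models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
voting_question = [0.84, 0.42, 0.10, 0.02]
election_question = [0.80, 0.45, 0.12, 0.05]
cooking_question = [0.05, 0.10, 0.90, 0.40]

sim_related = cosine_similarity(voting_question, election_question)
sim_unrelated = cosine_similarity(voting_question, cooking_question)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Sentences with similar meaning should end up with a higher score than unrelated ones, which is what the applications later on this page rely on.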

Figure: The process for turning raw data into embeddings and placing them into the vector space [12]

Embeddings can represent a wide variety of objects and data types. Common examples of things that can be embedded include:

Types of objects

Figure: Types of objects that can be embedded [13]
  • Word Embeddings: These represent words in a continuous vector space. Word2Vec, GloVe, and FastText are popular word embedding techniques.
  • Image Embeddings: Images can be embedded into vectors, allowing us to compare and analyze them. Convolutional neural networks (CNNs) often generate image embeddings.
  • Document Embeddings: These capture the meaning of entire documents or paragraphs. Doc2Vec and BERT are examples of document embedding methods.
  • Graph Embeddings: Used for graph-structured data (e.g., social networks, knowledge graphs). Graph neural networks (GNNs) create graph embeddings.
  • Entity Embeddings: Represent entities (e.g., users, products) in recommendation systems.
  • Time Series Embeddings: Useful for time-dependent data (e.g., stock prices, sensor readings).

Applications of Embeddings

  • Search Engines: Google uses embeddings to match text queries to relevant documents or web pages.
  • Recommendation Systems: Embeddings help recommend products, movies, or music based on user preferences.
  • Chatbots: Chatbots use embeddings to understand user input and generate contextually relevant responses.
  • Image Search and Classification: Image embeddings enable efficient image retrieval and classification.
  • Social Media: Platforms like Snapchat use embeddings to serve personalized ads.
  • Natural Language Understanding: Embeddings enhance tasks like sentiment analysis, named entity recognition, and text summarization.
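Several of the applications above (search, recommendation, image retrieval) come down to the same operation: rank stored vectors by similarity to a query vector. The sketch below shows that idea with a tiny in-memory index. The document ids and 3-dimensional vectors are hypothetical, and production systems typically use a vector database rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query_vec, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]

# Hypothetical pre-computed document embeddings.
index = {
    "voting-guide": [0.9, 0.1, 0.0],
    "pasta-recipe": [0.0, 0.2, 0.9],
    "election-faq": [0.8, 0.3, 0.1],
}
query = [0.9, 0.15, 0.0]  # hypothetical embedding of a voting-related query

top = search(query, index, k=2)
print(top)
```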

Commonly used embeddings for natural language and coding

| Embedding | Perk | Drawback |
| --- | --- | --- |
| Cohere Embeddings | High-quality embeddings for various NLP tasks | Fewer model options compared to OpenAI |
| | Competitive pricing | Community support still growing |
| | Fast and efficient API responses | |
| | Straightforward integration | |
| | Responsive support and growing community | |
| OpenAI Embeddings | State-of-the-art accuracy and versatility | Generally higher pricing |
| | Comprehensive documentation and strong community support | Can experience slower performance during peak times |
| | Wide range of models, allowing fine-tuned customizations | |
| | Frequent updates and continuous improvements | |
| | Highly scalable for various application sizes | |
| Google Universal Sentence Encoder | Easy to use with strong performance on semantic similarity | Limited customization options |
| | Pre-trained on a large, diverse dataset | May not be as cutting-edge as newer models |
| | Free to use | |
| | Good integration with TensorFlow and other Google tools | |
| GloVe (Global Vectors for Word Representation) | Efficient and effective for many NLP tasks | Static embeddings, do not handle polysemy |
| | Pre-trained on large corpora, widely used in research | Not as accurate as contextual embeddings |
| | Easy to integrate with various ML frameworks | No longer state-of-the-art |
| | Free to use | |
| FastText Embeddings | Handles out-of-vocabulary words by using subword information | Static embeddings, do not handle polysemy |
| | Pre-trained models available for many languages | Not as accurate as contextual embeddings |
| | Good balance of efficiency and performance | |
| | Free to use | |
| BERT | Contextual embeddings that handle polysemy effectively | Computationally intensive and slower inference |
| | State-of-the-art performance for many NLP tasks | Larger models require significant resources |
| | Pre-trained models available for various tasks | May require fine-tuning for specific tasks |
| | Strong support and community | |
| | Highly versatile and adaptable | |
| Doc2Vec | Captures semantic meaning of entire documents | Requires substantial training data |
| | Good for document-level tasks | Slower training compared to word embeddings |
| | Handles variable length text | Not as widely supported as other embeddings |
| | Free to use | |
| Code Embeddings | Specialized for source code representation | Limited to programming-related tasks |
| | Useful for code search and understanding | Can be complex to train and fine-tune |
| | Captures syntactic and semantic code features | Not as mature as text embeddings |
| Graph Embeddings | Captures relationships and structures in graph data | Requires knowledge of graph theory |
| | Effective for network analysis and link prediction | Computationally intensive |
| | Useful for recommendation systems and social networks | Complexity in model selection and training |
| | Free to use | |
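The FastText entry above mentions handling out-of-vocabulary words through subword information. A minimal sketch of that idea follows: each word is wrapped in boundary markers and split into character n-grams, so an unseen word still shares pieces with known words. This is a simplified illustration, not FastText's implementation; it leaves out the whole-word token and the hashing of n-grams into buckets, and the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# 3-grams only, for readability.
trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)

# An unseen word shares some n-grams with a known one, so a vector for it
# can still be assembled from the vectors of those shared pieces.
shared = set(char_ngrams("voting")) & set(char_ngrams("voter"))
print(sorted(shared))
```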

Choosing the right embedding model

When choosing an embedding model for a project, there are a few things to take into consideration:

  1. The information we are inputting:

    • The embedding must match the type of information the model will receive or be trained on. As discussed above, inputs can range from single words to entire documents, images, and even relational data such as graphs;
    • It can also help to give the model domain context. For example, BERT is first pre-trained on a general-language corpus and can then be fine-tuned on domain-specific data, such as medical or legal text (https://www.featureform.com/post/the-definitive-guide-to-embeddings).
  2. Types of embeddings:

    • Dense Embeddings:

      • Dense embeddings are continuous, real-valued vectors that capture overall semantic meaning.
      • Suitable for tasks like dense retrieval and semantic search.
      • Examples include embeddings from models like OpenAI's Ada or sentence transformers.
    • Sparse Embeddings:

      • Sparse embeddings emphasize relevant information by having most values as zero.
      • Beneficial for specialized domains with rare terms (e.g., medical field).
      • Overcome limitations of Bag-of-Words (BOW) models.
Figure: Sparse Vectors vs Dense Vectors [14]
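The difference between the two representations can be sketched in a few lines. Here a sparse vector is stored as a dictionary from dimension index to value, so only the non-zero entries take up space, while the dense form stores every dimension. The dimension indices, weights, and the vocabulary size of 10,000 are invented for illustration.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[dim] for dim, value in a.items() if dim in b)

def to_dense(sparse, size):
    """Expand a sparse vector into a full list of `size` floats."""
    dense = [0.0] * size
    for dim, value in sparse.items():
        dense[dim] = value
    return dense

# Hypothetical term weights: only three of 10,000 dimensions are non-zero.
doc = {12: 1.5, 4051: 0.7, 9980: 2.1}
query = {4051: 1.0, 9980: 0.5}

score = sparse_dot(query, doc)
dense_doc = to_dense(doc, 10_000)
print(score, len(doc), len(dense_doc))
```

The score only involves dimensions both vectors share, which is why rare domain terms can dominate a match in the sparse case.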
  • Multi-Vector Embeddings (ColBERT):
    • Late interaction models where query and document representations interact after encoding.
    • Efficient for large document collections due to pre-computed document representations.
Figure: ColBERT: Late Interaction Models [15]
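Late interaction can be sketched as follows: the query and the document are each encoded into one vector per token, and the score sums, for every query token, its best match among the document tokens (often called MaxSim). The 2-dimensional token vectors below are hypothetical; a real ColBERT model produces normalized vectors of much higher dimension.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the highest similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical per-token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # covers only the first

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)
```

Because the document token vectors do not depend on the query, they can be computed once and stored, which is the efficiency argument made above.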
  • Long Context Embeddings:

    • Address challenges in embedding long documents.
    • Models like BGE-M3 allow encoding sequences up to 8,192 tokens.
  • Variable Dimension Embeddings (Matryoshka Representation Learning):

    • Nested lower-dimensional embeddings (like Matryoshka Dolls).
    • Efficiently pack information at logarithmic granularities.
    • Models like OpenAI's text-embedding-3-small and Nomic's Embed v1.5 use this approach.
  • Code Embeddings:

    • Transform how developers interact with codebases.
    • Semantic understanding for code snippets and functionalities.
    • Models like OpenAI's text-embedding-3-small and jina-embeddings-v2-base-code facilitate code search and assistance.
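For the variable-dimension (Matryoshka) embeddings described above, the practical consequence is that a vector can be shortened by keeping its leading dimensions. A minimal sketch, assuming the model was trained so that prefixes remain meaningful: truncate, then re-normalize so cosine and dot-product comparisons stay consistent. The 8-dimensional vector is made up for illustration.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Hypothetical full-size embedding.
full = [0.40, 0.35, 0.30, 0.25, 0.10, 0.08, 0.05, 0.02]

short = truncate_and_normalize(full, 4)
print(short)
```

Shorter vectors reduce storage and speed up search, usually at some cost in retrieval quality that is worth measuring on your own data.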
  3. How to Measure Embedding Performance:

    • Retrieval Metrics and MTEB Benchmark:

      • Retrieval metrics are used to evaluate the performance of embeddings.
      • The Massive Text Embedding Benchmark (MTEB) is widely recognized for this purpose.
      • MTEB evaluates embeddings using datasets containing a corpus, queries, and mappings to relevant documents.
      • The goal is to identify pertinent documents based on similarity scores calculated using cosine similarity.
      • Metrics like nDCG@10 are commonly used to assess performance.
    • Limitations of MTEB:

      • While MTEB provides insights into top embedding models, it doesn't determine the best choice for specific domains or tasks.
      • It's essential to evaluate embeddings on your own dataset to find the optimal model.
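When evaluating on your own dataset, the nDCG@10 metric mentioned above can be computed directly from relevance judgments. The sketch below is simplified: it assumes `ranked_relevances` lists the judged relevance of each retrieved document in rank order, and it derives the ideal ordering from that same list, whereas a full benchmark would build the ideal ranking from all relevant documents in the corpus.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance discounted by log2 of (rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the actual ranking divided by DCG of the best possible ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

perfect = ndcg_at_k([1, 1, 0, 0])   # relevant documents ranked first
swapped = ndcg_at_k([0, 1])         # the relevant document is ranked second
print(perfect, round(swapped, 4))
```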
    • Chunk Attribution:

      • In scenarios where raw text is available, assessing Retrieval-Augmented Generation (RAG) performance on user queries is crucial.
      • Chunk attribution helps identify which retrieved chunks or documents were used by the model to generate an answer.
      • An attribution score of 0 indicates that necessary documents weren't retrieved.
      • The average score represents the ratio of utilized chunks at a run level.
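One way to read the chunk attribution description above is sketched below. It assumes each query already has one boolean per retrieved chunk (True if the chunk was used for the answer), which in practice comes from an evaluation tool rather than from this code. The function name, the data layout, and the sample values are all hypothetical.

```python
def attribution_report(runs):
    """Return (queries with no attributed chunk, ratio of used chunks over the run).

    `runs` maps a query to a list of booleans, one per retrieved chunk.
    """
    failures = [query for query, flags in runs.items() if not any(flags)]
    total_chunks = sum(len(flags) for flags in runs.values())
    used_chunks = sum(sum(flags) for flags in runs.values())
    ratio = used_chunks / total_chunks if total_chunks else 0.0
    return failures, ratio

# Hypothetical attribution flags for three queries, three retrieved chunks each.
runs = {
    "q1": [True, False, False],
    "q2": [False, False, False],  # nothing useful was retrieved
    "q3": [True, True, False],
}

failures, ratio = attribution_report(runs)
print(failures, round(ratio, 3))
```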
  4. Choosing the Right Embedding Model for RAG Systems:

    The process of selecting an optimal embedding model for a Retrieval-Augmented Generation (RAG) system can be enhanced by using chunk attribution to identify which model best fits a specific use case. Galileo’s GenAI Studio offers a practical demonstration using 10-K annual financial reports from Nvidia over the past four years.

  • Data Preparation

    • Retrieval and Parsing: The 10-K reports are parsed using the PyPDF library, producing approximately 700 large text chunks.
    • Question Generation: GPT-turbo with a zero-shot instruction prompt generates a question for each chunk. A subset of 100 chunks is randomly selected to ensure questions cover all reports.
  • Evaluation Metrics

    • RAG Metrics:
      • Chunk Attribution: Boolean metric indicating whether a chunk contributed to the response.
      • Chunk Utilization: Measures the extent of chunk text used in responses.
      • Completeness: Assesses how much of the provided context was used in generating a response.
      • Context Adherence: Evaluates if the LLM’s output aligns with the given context.
    • Safety Metrics:
      • Personally Identifiable Information (PII): Flags instances of PII such as credit card numbers and email addresses.
      • Toxicity: Binary classification to detect hateful or toxic information.
      • Tone: Classifies response tone into nine emotional categories.
    • System Metrics:
      • Latency: Measures the response time of LLM calls.
  • Workflow for Model Evaluation

    A function is created to run various sweep parameters, testing different embedding models to identify the optimal one. Steps include:

    1. Loading the embedding model.
    2. Managing the vector index.
    3. Vectorizing chunks and adding them to the index.
    4. Loading the chain and defining tags.
    5. Preparing Galileo callback with metrics and tags.
    6. Running the chain with questions to generate answers.
  • Failure Analysis

    Instances with an attribution score of 0 (indicating retrieval failure) can be easily identified. For example, failures occurred when chunks mentioned income tax but did not reference the specific year in question.
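The sweep described in the workflow above can be sketched without any external service. Everything below is a stand-in: the two "embedding models" are toy functions, the chunks and questions are invented, and retrieval hit rate replaces the attribution metrics, since those need an LLM and an evaluation tool. It shows only the shape of the loop: build an index per model, run the questions, and collect the failures.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

# Hypothetical stand-ins for real embedding models.
VOCAB = ["revenue", "tax", "income", "2023", "2022", "risk"]

def keyword_embedder(text):
    """Counts occurrences of a few fixed terms."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def length_embedder(text):
    """Deliberately poor model: only encodes the text length."""
    return [float(len(text)), 1.0]

def evaluate(embed, chunks, questions, k=1):
    """Index the chunks with one model, then check retrieval for each question."""
    index = [(chunk_id, embed(text)) for chunk_id, text in chunks.items()]
    failures = []
    for question, expected_id in questions:
        q_vec = embed(question)
        ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
        retrieved = [chunk_id for chunk_id, _ in ranked[:k]]
        if expected_id not in retrieved:
            failures.append(question)  # retrieval failure to inspect later
    hit_rate = 1 - len(failures) / len(questions)
    return hit_rate, failures

chunks = {
    "c1": "income tax expense for fiscal 2023",
    "c2": "income tax expense for fiscal 2022",
    "c3": "risk factors affecting revenue",
}
questions = [
    ("what was income tax in 2023", "c1"),
    ("what was income tax in 2022", "c2"),
    ("which risk factors affect revenue", "c3"),
]

models = {"keyword": keyword_embedder, "length": length_embedder}
results = {name: evaluate(fn, chunks, questions) for name, fn in models.items()}
for name, (hit_rate, failed) in results.items():
    print(name, hit_rate, failed)
```

The two tax chunks differ only by year, which mirrors the failure case described above: a model that cannot separate them will retrieve the wrong report.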

Embedding models available for each of the providers supported in LangChain

  1. OpenAI:

    • text-embedding-3-small
    • text-embedding-3-large
    • text-embedding-ada-002
  2. Cohere:

    • embed-english-light-v2.0
    • embed-english-light-v3.0
    • embed-english-v2.0
    • embed-english-v3.0
    • embed-multilingual-light-v3.0
    • embed-multilingual-v2.0
    • embed-multilingual-v3.0
  3. Anthropic:

    • Voyage AI: Anthropic does not provide its own embedding models, but Voyage AI offers a wide variety of options. Factors to weigh when comparing them include dataset size, architecture, inference performance, and customization.
    • voyage-large-2
    • voyage-code-2
    • voyage-2
    • voyage-lite-02-instruct
  4. Ollama:

    • mxbai-embed-large
    • nomic-embed-text
    • all-minilm

References

1: Getting Started With Embeddings - Hugging Face. https://huggingface.co/blog/getting-started-with-embeddings.

2: What are embeddings in machine learning? | Cloudflare. https://www.cloudflare.com/learning/ai/what-are-embeddings/.

3: Embeddings in Machine Learning: Types, Models & Best Practices. https://swimm.io/learn/large-language-models/embeddings-in-machine-learning-types-models-and-best-practices.

4: The Full Guide to Embeddings in Machine Learning | Encord. https://encord.com/blog/embeddings-machine-learning/.

5: Understanding Word Embeddings: The Building Blocks of NLP and GPTs. https://www.freecodecamp.org/news/understanding-word-embeddings-the-building-blocks-of-nlp-and-gpts/.

6: Word Embedding Methods in Natural Language Processing: a Review. https://doaj.org/article/918fe89bfb7a472fa1ac48c6a8c5d212.

7: Introducing text and code embeddings | OpenAI. https://openai.com/blog/introducing-text-and-code-embeddings/.

8: Understanding Encoders and Embeddings in Large Language Models ... - Medium. https://medium.com/@sharifghafforov00/understanding-encoders-and-embeddings-in-large-language-models-llms-1e81101b2f87.

9: Neural Network Embeddings Explained - Towards Data Science. https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526.

11: Mastering RAG: How to Select an Embedding Model. https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model.

12: What are Vector Embeddings? https://qdrant.tech/articles/what-are-embeddings/.

13: What are Vector Embeddings? https://www.techtarget.com/searchenterpriseai/definition/vector-embeddings.

14: Dense Vectors in Natural Language Processing. https://medium.com/@yasindusanjeewa8/dense-vectors-in-natural-language-processing-06818dff5cd7.

15: ColBERT — A Late Interaction Model For Semantic Search. https://medium.com/@zz1409/colbert-a-late-interaction-model-for-semantic-search-da00f052d30e.
