Embeddings - runtimerevolution/labs GitHub Wiki
An embedding is a numerical representation of a real-world object, such as words, images, or videos. These representations capture the semantic meaning of the object, making them useful for various applications. For example, consider a sentence like "What is the main benefit of voting?" We can create an embedding for this sentence, represented as a vector (e.g., [0.84, 0.42, ..., 0.02]). This vector encodes the meaning of the sentence.
![]() |
|---|
| The process for turning raw data into embeddings and placing them into the vector space [12] |
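
To make the idea of "vectors that encode meaning" concrete, the sketch below compares embeddings with cosine similarity, the usual measure of how close two vectors are in the vector space. The three 4-dimensional vectors are invented for illustration; real models produce hundreds or thousands of dimensions, and the variable names are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example (not model output).
voting_question = [0.84, 0.42, 0.10, 0.02]
election_question = [0.80, 0.45, 0.12, 0.05]
cooking_question = [0.05, 0.10, 0.90, 0.40]

# Sentences with related meaning should score closer to 1.0.
print(cosine_similarity(voting_question, election_question))
print(cosine_similarity(voting_question, cooking_question))
```

Under these made-up values, the two politics-related vectors score much higher than the unrelated pair, which is the behaviour a good embedding model aims for.
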
Embeddings can represent a wide variety of objects and data types. Common examples of things that can be embedded include:
![]() |
|---|
| Types of objects that can be embedded [13] |
- Word Embeddings: These represent words in a continuous vector space. Word2Vec, GloVe, and FastText are popular word embedding techniques.
- Image Embeddings: Images can be embedded into vectors, allowing us to compare and analyze them. Convolutional neural networks (CNNs) often generate image embeddings.
- Document Embeddings: These capture the meaning of entire documents or paragraphs. Doc2Vec and BERT are examples of document embedding methods.
- Graph Embeddings: Used for graph-structured data (e.g., social networks, knowledge graphs). Graph neural networks (GNNs) create graph embeddings.
- Entity Embeddings: Represent entities (e.g., users, products) in recommendation systems.
- Time Series Embeddings: Useful for time-dependent data (e.g., stock prices, sensor readings).

Embeddings also power a wide range of applications:
- Search Engines: Google uses embeddings to match text queries to relevant documents or web pages.
- Recommendation Systems: Embeddings help recommend products, movies, or music based on user preferences.
- Chatbots: Chatbots use embeddings to understand user input and generate contextually relevant responses.
- Image Search and Classification: Image embeddings enable efficient image retrieval and classification.
- Social Media: Platforms like Snapchat use embeddings to serve personalized ads.
- Natural Language Understanding: Embeddings enhance tasks like sentiment analysis, named entity recognition, and text summarization.
| Embedding | Perk | Drawback |
|---|---|---|
| Cohere Embeddings | High-quality embeddings for various NLP tasks | Fewer model options compared to OpenAI |
| | Competitive pricing | Community support still growing |
| | Fast and efficient API responses | |
| | Straightforward integration | |
| | Responsive support and growing community | |
| OpenAI Embeddings | State-of-the-art accuracy and versatility | Generally higher pricing |
| | Comprehensive documentation and strong community support | Can experience slower performance during peak times |
| | Wide range of models, allowing fine-tuned customizations | |
| | Frequent updates and continuous improvements | |
| | Highly scalable for various application sizes | |
| Google Universal Sentence Encoder | Easy to use with strong performance on semantic similarity | Limited customization options |
| | Pre-trained on a large, diverse dataset | May not be as cutting-edge as newer models |
| | Free to use | |
| | Good integration with TensorFlow and other Google tools | |
| GloVe (Global Vectors for Word Representation) | Efficient and effective for many NLP tasks | Static embeddings, do not handle polysemy |
| | Pre-trained on large corpora, widely used in research | Not as accurate as contextual embeddings |
| | Easy to integrate with various ML frameworks | No longer state-of-the-art |
| | Free to use | |
| FastText Embeddings | Handles out-of-vocabulary words by using subword information | Static embeddings, do not handle polysemy |
| | Pre-trained models available for many languages | Not as accurate as contextual embeddings |
| | Good balance of efficiency and performance | |
| | Free to use | |
| BERT | Contextual embeddings that handle polysemy effectively | Computationally intensive and slower inference |
| | State-of-the-art performance for many NLP tasks | Larger models require significant resources |
| | Pre-trained models available for various tasks | May require fine-tuning for specific tasks |
| | Strong support and community | |
| | Highly versatile and adaptable | |
| Doc2Vec | Captures semantic meaning of entire documents | Requires substantial training data |
| | Good for document-level tasks | Slower training compared to word embeddings |
| | Handles variable length text | Not as widely supported as other embeddings |
| | Free to use | |
| Code Embeddings | Specialized for source code representation | Limited to programming-related tasks |
| | Useful for code search and understanding | Can be complex to train and fine-tune |
| | Captures syntactic and semantic code features | Not as mature as text embeddings |
| Graph Embeddings | Captures relationships and structures in graph data | Requires knowledge of graph theory |
| | Effective for network analysis and link prediction | Computationally intensive |
| | Useful for recommendation systems and social networks | Complexity in model selection and training |
| | Free to use | |

When choosing an embedding model for a project, there are a few things to take into consideration:
- The information we are inputting:
  - Depending on the type of information our model will receive or be trained on, we must choose an appropriate embedding. As discussed above, inputs can range from single words to entire documents, images, and even relational data such as graphs.
  - It may also be relevant to give the model domain context. For example, a BERT model is first pre-trained on a generic language corpus and can then be trained a second time on domain-specific data, such as medical or legal text (https://www.featureform.com/post/the-definitive-guide-to-embeddings).
- Types of embeddings:
  - Dense Embeddings:
    - Dense embeddings are continuous, real-valued vectors that capture overall semantic meaning.
    - Suitable for tasks like dense retrieval and semantic search.
    - Examples include embeddings from models like OpenAI's Ada or sentence transformers.
  - Sparse Embeddings:
    - Sparse embeddings emphasize relevant information by having most values as zero.
    - Beneficial for specialized domains with rare terms (e.g., medical field).
    - Overcome limitations of Bag-of-Words (BOW) models.

![]() |
|---|
| Sparse Vectors vs Dense Vectors [14] |
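
The contrast between sparse and dense vectors can be sketched in a few lines. The example below uses a plain bag-of-words count vector as the simplest possible sparse representation; learned sparse embedding models are more sophisticated, so treat this only as an illustration of the "mostly zeros" shape. The vocabulary and the dense vector are hypothetical.

```python
import re
from collections import Counter

def sparse_bow(text, vocabulary):
    """Bag-of-words count vector: one dimension per vocabulary term."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts.get(term, 0) for term in vocabulary]

def nonzero_fraction(vector):
    """Share of dimensions that carry a non-zero value."""
    return sum(1 for value in vector if value != 0) / len(vector)

# Hypothetical domain vocabulary; real vocabularies hold tens of thousands of terms.
vocabulary = ["aspirin", "benefit", "dosage", "election", "fever",
              "insulin", "tumor", "vaccine", "voting", "weather"]

sparse = sparse_bow("What is the main benefit of voting?", vocabulary)
dense = [0.84, 0.42, -0.13, 0.02]  # invented dense vector, every dimension used

print(sparse, nonzero_fraction(sparse))
print(dense, nonzero_fraction(dense))
```

Only the dimensions for terms that actually occur in the sentence are non-zero in the sparse vector, while the dense vector spreads information across every dimension.
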
  - Multi-Vector Embeddings (ColBERT):
    - Late interaction models where query and document representations interact after encoding.
    - Efficient for large document collections due to pre-computed document representations.

![]() |
|---|
| ColBERT — Late Interaction Models [15] |
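
Late interaction can be illustrated with the MaxSim scoring rule used by ColBERT: each query token vector is matched against its most similar document token vector, and those maxima are summed. The sketch below assumes tiny hand-written 2-dimensional token embeddings; in practice the document vectors would be pre-computed by the model and stored in an index.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

# Hypothetical token embeddings: one vector per token, not one per text.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]              # covers only the first query token

print(maxsim_score(query, doc_a))
print(maxsim_score(query, doc_b))
```

Because the document token vectors do not depend on the query, they can be computed once ahead of time, which is what makes this approach practical for large collections.
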
  - Long Context Embeddings:
    - Address challenges in embedding long documents.
    - Models like BGE-M3 allow encoding sequences up to 8,192 tokens.
  - Variable Dimension Embeddings (Matryoshka Representation Learning):
    - Nested lower-dimensional embeddings (like Matryoshka Dolls).
    - Efficiently pack information at logarithmic granularities.
    - Models like OpenAI's text-embedding-3-small and Nomic's Embed v1.5 use this approach.
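
In practice, using a Matryoshka-style embedding at a lower dimension usually means keeping the leading values of the vector and re-normalizing. The sketch below assumes the vector came from a model trained with Matryoshka Representation Learning; truncating vectors from other models this way generally loses much more information. The 8-dimensional vector is invented for illustration.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Hypothetical full-size embedding (real ones have hundreds of dimensions).
full = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]

short = truncate_embedding(full, 4)
print(len(short), short)
```

The shorter vector is cheaper to store and compare, at the cost of some retrieval quality; how much is lost depends on the model and should be measured on your own data.
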
  - Code Embeddings:
    - Transform how developers interact with codebases.
    - Semantic understanding for code snippets and functionalities.
    - Models like OpenAI's text-embedding-3-small and jina-embeddings-v2-base-code facilitate code search and assistance.
- How to Measure Embedding Performance:
  - Retrieval Metrics and MTEB Benchmark:
    - Retrieval metrics are used to evaluate the performance of embeddings.
    - The Massive Text Embedding Benchmark (MTEB) is widely recognized for this purpose.
    - MTEB evaluates embeddings using datasets containing a corpus, queries, and mappings to relevant documents.
    - The goal is to identify pertinent documents based on similarity scores calculated using cosine similarity.
    - Metrics like nDCG@10 are commonly used to assess performance.
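
As a rough illustration of the metric named above, nDCG@10 discounts each relevant result by its rank and divides by the score of an ideal ordering. The sketch below is a simplified version: it assumes the input list holds the relevance label of every judged document in ranked order, so the ideal ordering can be derived by sorting that same list. Benchmark implementations handle more cases.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# 1 = relevant, 0 = not relevant, listed in the order the system ranked them.
print(ndcg_at_k([1, 1, 0, 0]))  # relevant documents ranked first
print(ndcg_at_k([0, 1, 0, 1]))  # relevant documents ranked lower
```

A ranking that places every relevant document at the top reaches the maximum score, and pushing relevant documents down the list lowers it.
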
  - Limitations of MTEB:
    - While MTEB provides insights into top embedding models, it doesn't determine the best choice for a specific domain or task.
    - It's essential to evaluate embeddings on your own dataset to find the optimal model.
  - Chunk Attribution:
    - In scenarios where raw text is available, assessing retrieval-augmented generation (RAG) performance on user queries is crucial.
    - Chunk attribution helps identify which retrieved chunks or documents were used by the model to generate an answer.
    - An attribution score of 0 indicates that necessary documents weren't retrieved.
    - The average score represents the ratio of utilized chunks at a run level.
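
The two statements about attribution scores can be made concrete with a small sketch. It assumes each query comes with one boolean per retrieved chunk (True if the model used that chunk); how an evaluation tool derives those booleans is not covered here, and the function names and data are hypothetical.

```python
def attribution_ratio(chunk_flags):
    """Fraction of retrieved chunks that were used for one query."""
    return sum(chunk_flags) / len(chunk_flags) if chunk_flags else 0.0

def run_level_score(per_query_flags):
    """Ratio of utilized chunks across every query in a run."""
    all_flags = [flag for flags in per_query_flags for flag in flags]
    return attribution_ratio(all_flags)

# Invented run: three queries, four retrieved chunks each.
run = [
    [True, False, False, True],
    [False, False, False, False],  # nothing attributed: likely retrieval failure
    [True, True, False, False],
]

failed_queries = [i for i, flags in enumerate(run) if attribution_ratio(flags) == 0]
print(failed_queries)
print(run_level_score(run))
```

Queries with a ratio of 0 are the ones worth inspecting first, since none of their retrieved chunks helped produce the answer.
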
- Choosing the Right Embedding Model for RAG Systems: The process of selecting an optimal embedding model for a Retrieval-Augmented Generation (RAG) system can be enhanced by using chunk attribution to identify which model best fits a specific use case. Galileo’s GenAI Studio offers a practical demonstration using 10-K annual financial reports from Nvidia over the past four years.
  - Data Preparation:
    - Retrieval and Parsing: The 10-K reports are parsed using the PyPDF library, producing approximately 700 large text chunks.
    - Question Generation: GPT-turbo with a zero-shot instruction prompt generates a question for each chunk. A subset of 100 chunks is randomly selected to ensure questions cover all reports.
  - Evaluation Metrics:
    - RAG Metrics:
      - Chunk Attribution: Boolean metric indicating whether a chunk contributed to the response.
      - Chunk Utilization: Measures the extent of chunk text used in responses.
      - Completeness: Assesses how much of the provided context was used in generating a response.
      - Context Adherence: Evaluates if the LLM’s output aligns with the given context.
    - Safety Metrics:
      - Personally Identifiable Information (PII): Flags instances of PII such as credit card numbers and email addresses.
      - Toxicity: Binary classification to detect hateful or toxic information.
      - Tone: Classifies response tone into nine emotional categories.
    - System Metrics:
      - Latency: Measures the response time of LLM calls.
  - Workflow for Model Evaluation: A function is created to run various sweep parameters, testing different embedding models to identify the optimal one. Steps include:
    - Loading the embedding model.
    - Managing the vector index.
    - Vectorizing chunks and adding them to the index.
    - Loading the chain and defining tags.
    - Preparing Galileo callback with metrics and tags.
    - Running the chain with questions to generate answers.
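
The overall shape of such a sweep can be sketched without any external service. The example below is not the Galileo or LangChain workflow: it swaps in two toy "embedding models" (word counts and character trigrams), an in-memory index, and a simple top-1 hit rate instead of the RAG metrics listed earlier. All names, chunks, and questions are hypothetical and only illustrate the loop of load model, build index, run questions, compare.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def word_count_embedder(text):
    """Toy model A: sparse word-count vector stored as a Counter."""
    return Counter(tokenize(text))

def char_trigram_embedder(text):
    """Toy model B: sparse character-trigram count vector."""
    joined = " ".join(tokenize(text))
    return Counter(joined[i:i + 3] for i in range(len(joined) - 2))

def cosine(a, b):
    dot = sum(value * b[key] for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def evaluate(embed, chunks, questions, top_k=1):
    """Rebuild the index for one model and return its top-k hit rate."""
    index = [(chunk_id, embed(text)) for chunk_id, text in chunks.items()]
    hits = 0
    for question, expected_id in questions:
        query_vector = embed(question)
        ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]),
                        reverse=True)
        retrieved = [chunk_id for chunk_id, _ in ranked[:top_k]]
        hits += expected_id in retrieved
    return hits / len(questions)

# Invented chunks and questions, standing in for parsed report text.
chunks = {
    "c1": "Income tax expense increased in fiscal year 2023.",
    "c2": "Data center revenue grew due to GPU demand.",
}
questions = [
    ("What happened to income tax expense in 2023?", "c1"),
    ("Why did data center revenue grow?", "c2"),
]

models = {"word_count": word_count_embedder, "char_trigram": char_trigram_embedder}
results = {name: evaluate(embed, chunks, questions) for name, embed in models.items()}
best_model = max(results, key=results.get)
print(results, best_model)
```

In a real sweep, each entry in `models` would be an actual embedding model, the index would be a vector store that is cleared and repopulated per model, and the score would come from the attribution and utilization metrics rather than a hit rate.
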
  - Failure Analysis: Instances with an attribution score of 0 (indicating retrieval failure) can be easily identified. For example, failures occurred when chunks mentioned income tax but did not reference the specific year in question.

Embedding models available for each of the platforms supported in LangChain:
- OpenAI:
  - text-embedding-3-small
  - text-embedding-3-large
  - text-embedding-ada-002
- Cohere:
  - embed-english-light-v2.0
  - embed-english-light-v3.0
  - embed-english-v2.0
  - embed-english-v3.0
  - embed-multilingual-light-v3.0
  - embed-multilingual-v2.0
  - embed-multilingual-v3.0
- Anthropic:
  - Voyage AI: While Anthropic itself doesn't provide embeddings, Voyage AI offers a wide variety of options. Its models differ in factors such as dataset size, architecture, inference performance, and customization.
    - voyage-large-2
    - voyage-code-2
    - voyage-2
    - voyage-lite-02-instruct
- Ollama:
  - mxbai-embed-large
  - nomic-embed-text
  - all-minilm

1: Getting Started With Embeddings - Hugging Face. https://huggingface.co/blog/getting-started-with-embeddings.
2: What are embeddings in machine learning? | Cloudflare. https://www.cloudflare.com/learning/ai/what-are-embeddings/.
3: Embeddings in Machine Learning: Types, Models & Best Practices. https://swimm.io/learn/large-language-models/embeddings-in-machine-learning-types-models-and-best-practices.
4: The Full Guide to Embeddings in Machine Learning | Encord. https://encord.com/blog/embeddings-machine-learning/.
5: Understanding Word Embeddings: The Building Blocks of NLP and GPTs. https://www.freecodecamp.org/news/understanding-word-embeddings-the-building-blocks-of-nlp-and-gpts/.
6: Word Embedding Methods in Natural Language Processing: a Review. https://doaj.org/article/918fe89bfb7a472fa1ac48c6a8c5d212.
7: Introducing text and code embeddings | OpenAI. https://openai.com/blog/introducing-text-and-code-embeddings/.
8: Understanding Encoders and Embeddings in Large Language Models ... - Medium. https://medium.com/@sharifghafforov00/understanding-encoders-and-embeddings-in-large-language-models-llms-1e81101b2f87.
9: Neural Network Embeddings Explained - Towards Data Science. https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526.
10: https://openai.com/_next/static/chunks/1420.023ea14fc18e2250.js%29.
11: Mastering RAG: How to Select an Embedding Model. https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model.
12: What are Vector Embeddings? https://qdrant.tech/articles/what-are-embeddings/.
13: What are Vector Embeddings? https://www.techtarget.com/searchenterpriseai/definition/vector-embeddings.
14: Dense Vectors in Natural Language Processing. https://medium.com/@yasindusanjeewa8/dense-vectors-in-natural-language-processing-06818dff5cd7.
15: ColBERT — A Late Interaction Model For Semantic Search. https://medium.com/@zz1409/colbert-a-late-interaction-model-for-semantic-search-da00f052d30e.



