How to choose an embedding model - trankhoidang/RAG-wiki GitHub Wiki

How to choose an embedding model

✏️ Page Contributors: Khoi Tran Dang

🕛 Creation date: 26/06/2024

📥 Last Update: 10/07/2024

Text Embedding models from Hugging Face

Hugging Face is a well-known AI platform for building, training, and deploying machine learning models. It offers thousands of publicly available pre-trained models for natural language tasks, including, in our case, embedding models.

The quality of text embeddings depends on the embedding model used, and there are thousands of models to choose from. For this reason, MTEB (the Massive Text Embedding Benchmark) was created to help us choose an embedding model.

The benchmark consists of 58 datasets across 8 tasks and covers up to ~120 languages. It is updated frequently, and the comparisons are presented as a leaderboard.
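Whichever model you pick, retrieval quality ultimately comes down to comparing embedding vectors, most commonly by cosine similarity. A minimal sketch with toy vectors (the 4-dimensional "embeddings" below are illustrative stand-ins for real model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model output.
query = np.array([0.1, 0.9, 0.2, 0.0])
doc_relevant = np.array([0.2, 0.8, 0.1, 0.1])
doc_unrelated = np.array([0.9, 0.0, 0.1, 0.7])

print(cosine_similarity(query, doc_relevant))   # high similarity
print(cosine_similarity(query, doc_unrelated))  # low similarity
```

Real models differ in how well these similarities reflect semantic relatedness on your data, which is exactly what MTEB measures at scale.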

Private embedding models

Companies like OpenAI, Cohere, and Google provide proprietary embedding APIs (at a cost) and frequently release new model enhancements.

Typical examples in the RAG community include:

  • text-embedding-ada-002 (OpenAI)
  • text-embedding-3-large (OpenAI)
  • embed-multilingual-v3.0 (Cohere)
  • text-multilingual-embedding-preview-0409 (Google)
  • ...

Considerations when picking a text embedding model

  • Model design
    • some models are designed for broad understanding of text, while others are fine-tuned for a downstream task
    • some models are better at retrieval, others at reranking, ...
    • some models support many languages, others have limited support
  • Performance on general benchmark
    • MTEB for example
  • Performance on YOUR data
    • in most cases, your own data matters most, and it differs from general benchmarks
  • Embedding dimensions and max input chunk length
  • Latency and resources requirements
    • speed and throughput vary with model size and with local vs. API deployment
  • Usage/API cost
    • costs of API calls to cloud-based embedding service
    • costs of computational and storage of local embedding models
  • Flexibility
    • can you easily fine-tune the model on your data?
  • Privacy
    • can your data be sent to a third-party API, or must it stay on-premise?

In the end, it comes down to a trade-off between quality, speed, and cost, with additional concerns around deployment, privacy, and security. The best embedding model depends mostly on your use case.

Guideline to choose a text embedding model

Below we give a guideline as a starting point for choosing an embedding model.

  • identify your use case

    • does your use case consist of specific domains?
    • what language will you need?
    • is it just text, or other modalities also?
    • how many resources do you have? (small or big model)

    Generally, a public general-purpose text embedding model is sufficient as a starting point.

  • select a baseline model from MTEB leaderboard (for retrieval task)

    • take into account the model size and memory storage
    • smaller embedding dimensions are more efficient; does your use case need large embedding dimensions?
    • the maximum number of input tokens is also important to look at: a limit that is too small can constrain your chunking strategy
  • (extra) pick some models (public or private), evaluate against your baseline on your use case

  • (extra++) see if you need to fine-tune your embedding model
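The "evaluate against your baseline" step can be as simple as measuring hit rate (recall@k) on a small labeled set of query → relevant-document pairs. A minimal sketch with toy pre-computed embeddings (in practice the vectors would come from each candidate model):

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant_idx, k=2):
    """Fraction of queries whose relevant document appears in the
    top-k cosine-similarity results."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                           # (num_queries, num_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of top-k docs
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return sum(hits) / len(hits)

# Toy data: 2 queries, 3 documents; query i's relevant doc is relevant_idx[i].
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
relevant_idx = [0, 1]
print(recall_at_k(queries, docs, relevant_idx, k=1))  # 1.0 on this toy set
```

Running the same labeled set through each candidate model and comparing recall@k gives a use-case-specific ranking that a general leaderboard cannot.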

Important notes

  • The embedding model is used at different stages of RAG (document embedding, query embedding, reranking, tool embedding, conversational embedding), each with different objectives. The trade-offs above should be reconsidered for each subtask.

Further reading

← Previous: S05_Embedding

Next: S06_Indexing →
