# How to choose an embedding model
✏️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 26/06/2024
📥 Last Update: 10/07/2024
Hugging Face is a well-known AI platform for building, training, and deploying machine learning models. It offers many publicly available pre-trained models for natural language tasks, including, in our case, embedding models.
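As a quick illustration, such a model can be loaded and queried in a few lines with the `sentence-transformers` library (the model name below is just a small, popular example):

```python
from sentence_transformers import SentenceTransformer

# any Hugging Face embedding model ID works here; this one is a common lightweight default
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [
    "What is Retrieval-Augmented Generation?",
    "RAG combines retrieval with generation.",
]
embeddings = model.encode(texts)
print(embeddings.shape)  # (2, 384) -- 384 is this model's embedding dimension
```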
The quality of text embeddings depends on the embedding model used, and there are thousands of models to choose from. For this reason, MTEB (the Massive Text Embedding Benchmark) was created to help choose an embedding model.
The benchmark consists of 58 datasets across 8 tasks, covering up to ~120 languages. It is updated frequently, and the comparisons are presented as a leaderboard.
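For reference, the `mteb` Python package can score any `sentence-transformers`-compatible model on individual benchmark tasks. A minimal sketch, assuming a library version where tasks can be selected by name (task and model chosen arbitrarily):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

# limit the run to a single small retrieval task instead of the full benchmark
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(model, output_folder=f"results/{model_name}")
```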
Companies such as OpenAI, Cohere, and Google provide proprietary embedding APIs (paid) and frequently release new model improvements.
Typical examples in the RAG community include:
- text-embedding-ada-002 (OpenAI)
- text-embedding-3-large (OpenAI)
- embed-multilingual-v3.0 (Cohere)
- text-multilingual-embedding-preview-0409 (Google)
- ...
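Calling such an API is typically a one-liner; for example, with the OpenAI Python SDK (assuming an `OPENAI_API_KEY` is set in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["What is Retrieval-Augmented Generation?"],
)
embedding = response.data[0].embedding
print(len(embedding))  # 3072 dimensions for text-embedding-3-large
```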
Beyond the leaderboard, several criteria are worth weighing:
- Model design
  - some models are designed for broad text understanding, while others are fine-tuned for a downstream task
  - some models are better at retrieval, others at reranking, ...
  - some models support many languages, others have limited language coverage
- Performance on general benchmarks
  - MTEB, for example
- Performance on YOUR data
  - in most cases, your own data is what matters, and it differs from general benchmarks
- Embedding dimensions and maximum input length (see the snippet after this list for how to inspect these)
- Latency and resource requirements
  - speed and throughput vary with model size and with local vs. API serving
- Usage/API cost
  - cost of API calls to a cloud-based embedding service
  - compute and storage costs of hosting a local embedding model
- Flexibility
  - can you easily fine-tune the model on your data?
- Privacy
  - can your data be sent to a third-party API, or must it stay on-premise?
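For a local model, `sentence-transformers` exposes the embedding dimension and maximum sequence length directly, and a rough latency figure is easy to measure. A small sketch (numbers will vary with hardware):

```python
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

print(model.get_sentence_embedding_dimension())  # embedding dimension, e.g. 384
print(model.max_seq_length)                      # max input length in tokens, e.g. 256

# crude throughput estimate on a small batch
batch = ["some example sentence"] * 256
start = time.perf_counter()
model.encode(batch, batch_size=64)
elapsed = time.perf_counter() - start
print(f"{len(batch) / elapsed:.0f} sentences/second")
```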
In the end, it comes down to a trade-off between quality, speed, and cost, with additional concerns around deployment, privacy, and security. The best embedding model depends mostly on your use case.
Below is a guideline to use as a starting point for choosing an embedding model.
1. Identify your use case
   - does your use case involve specific domains?
   - which languages do you need to support?
   - is it text only, or other modalities as well?
   - how much compute do you have? (small or large model)

   Generally, a public general-purpose text embedding model is sufficient as a starting point.
2. Select a baseline model from the MTEB leaderboard (for the retrieval task)
   - take into account model size and memory footprint
   - smaller embedding dimensions are more efficient; does your use case really need large ones?
   - the maximum number of input tokens is also important: too small a limit can constrain your chunking strategy
3. (extra) Pick a few candidate models (public or private) and evaluate them against your baseline on your use case
   - to do this, please refer to Further Reading
4. (extra++) Assess whether you need to fine-tune your embedding model (a minimal fine-tuning sketch follows below)

Note that the embedding model is used at different stages of RAG (document embedding, query embedding, reranking, tool embedding, conversational embedding), each with different objectives. The trade-offs discussed above should be reconsidered for each subtask.
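If you do decide to fine-tune, `sentence-transformers` keeps the basic setup short. A minimal sketch using in-batch negatives; the training pairs below are hypothetical placeholders for your own in-domain (query, relevant passage) data:

```python
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# hypothetical (query, relevant passage) pairs -- replace with your own data
train_examples = [
    InputExample(texts=["what is RAG?", "Retrieval-Augmented Generation grounds an LLM in retrieved documents."]),
    InputExample(texts=["how to chunk documents?", "Chunking splits long documents into retrievable passages."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats the other passages in a batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("models/my-finetuned-embedder")
```

In practice, a few thousand in-domain pairs are usually needed before fine-tuning measurably beats a strong general-purpose baseline, so evaluate before and after (step 3) rather than assuming an improvement.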