# S05_Embedding
✏️️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 25/06/2024
📥 Last Update: 10/07/2024
Embeddings are machine-readable numerical vector representations of data. They can encapsulate information from various sources, including text (words and documents) and non-textual data (audio, images, and more). Here we focus on textual embeddings.
In a RAG pipeline, when we talk about embeddings, we usually refer to the step that prepares the documents and the query for the retrieval phase.
Specifically, document chunks are embedded into numerical vectors. These vector embeddings are used for similarity search against the user query (also embedded into a vector), in order to retrieve the most "relevant" chunks.
The process of converting document chunks into numerical vectors is critical, as the quality of these embeddings directly impacts the accuracy of the retrieval process. High-quality embeddings ensure that the system retrieves the most "pertinent" information, thereby enhancing the overall performance of the RAG system.
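As a minimal illustration, the sketch below embeds a few chunks and a query with the sentence-transformers library and ranks the chunks by cosine similarity. The model name and the example texts are arbitrary choices for the sketch, not a recommendation.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works here; this is a small public example model.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Embeddings map text to numerical vectors.",
    "BM25 scores documents by term frequency.",
    "Paris is the capital of France.",
]
query = "How is text represented as vectors?"

chunk_vecs = model.encode(chunks)   # one dense vector per chunk
query_vec = model.encode(query)     # the query is embedded the same way

scores = util.cos_sim(query_vec, chunk_vecs)[0]  # cosine similarity per chunk
best = scores.argmax()
print(chunks[best])  # -> the chunk about embeddings
```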
There are two main types of embeddings: dense and sparse.
- Dense embeddings aim to capture the nuances and meaning of text in vectors. Similarity search over dense embeddings is then called semantic search.
- Sparse embeddings are vectors with mostly zero values. In RAG, sparse embeddings are based on the relative word weights per document, and are mainly used for keyword search.
Dense embeddings are created by what are called embedding models. There are two types of embedding models:
- static: a word has a fixed representation regardless of the surrounding words. Examples include Word2Vec, GloVe, and FastText.
- dynamic: the surrounding words are taken into account for better contextualized representations. Examples include RNN-based and Transformer-based embedding models (see the sketch after this list).
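To make the static/dynamic distinction concrete, here is a small sketch (using bert-base-uncased as an arbitrary example model) showing that a Transformer produces different vectors for the same word depending on its context, whereas a static model would return one fixed vector per word:

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word` inside `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    idx = tok.convert_ids_to_tokens(inputs["input_ids"][0]).index(word)
    return hidden[idx]

v_river = word_vector("i sat by the river bank .", "bank")
v_money = word_vector("i deposited cash at the bank .", "bank")

# With a static model (e.g. Word2Vec), the two vectors would be identical;
# with a Transformer, they differ because the contexts differ.
print(torch.cosine_similarity(v_river, v_money, dim=0))  # noticeably below 1.0
```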
For RAG, the most performant embedding models typically come from the Transformer-based family. There is usually no single universal embedding model; rather, each model is trained with different objectives and trade-offs.
Different embedding models are provided by HuggingFace (public models) or by companies (proprietary models) such as OpenAI, Cohere, or Amazon Bedrock. For more information on which embedding model to use, please check How to choose an embedding model?.
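As an example of a proprietary model served over an API, the sketch below calls OpenAI's embeddings endpoint. The model name text-embedding-3-small is just one of their options, and an OPENAI_API_KEY environment variable is assumed to be set:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["first document chunk", "second document chunk"],
)
vectors = [item.embedding for item in resp.data]  # one dense vector per input
print(len(vectors), len(vectors[0]))              # 2 vectors, 1536 dims each
```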
Sparse embeddings do not capture semantic nuances; instead, they are based on the frequency of words within the document, plus some additional statistical factors. One consequence is that sparse embeddings do not inherently match synonyms, since two synonyms are technically two different words with different surface forms.
Sparse embeddings are beneficial when dealing with specific terminology or rare keywords, for example in the medical or legal domains.
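The following sketch (using scikit-learn's TfidfVectorizer on made-up toy sentences) shows both properties at once: the resulting vectors are mostly zeros, and two sentences that differ only by a synonym get a lower similarity score than a semantic model would assign:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is very fast", "the automobile is very fast"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix: mostly zero entries

# "car" and "automobile" are different tokens, so they contribute no overlap;
# only the shared words ("the", "is", "very", "fast") raise the similarity.
print(cosine_similarity(X[0], X[1]))
```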
Sparse embeddings are generated using algorithms like:
- bag-of-words
- TF-IDF
- BM25
- SPLADE
In RAG applications, BM25 and SPLADE are the typical methods used to generate sparse embeddings.
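As an illustration, here is a minimal BM25 sketch using the rank_bm25 package (one common implementation among several; the corpus and whitespace tokenization are deliberately simplistic):

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "the patient was prescribed amoxicillin",
    "contract law governs private agreements",
    "embeddings map text to vectors",
]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "amoxicillin dosage".split()
print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # best-matching document
```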
- How to choose an embedding model?
- In a RAG pipeline that is not purely text-based, we may need a multimodal embedding model rather than only a text embedding model. For more information, please check Multimodal RAG approach 1.
To summarize, here is a quick comparison of dense and sparse embeddings:

| | Dense embeddings | Sparse embeddings |
|---|---|---|
| What they capture | Semantic meaning and nuances of text | Relative word weights / term frequency |
| Typical search | Semantic search | Keyword search |
| Synonyms | Handled inherently | Not matched (different tokens) |
| Strengths | Paraphrases and conceptual queries | Specific terminology, rare keywords (e.g. medical or legal domains) |
| Typical methods | Transformer-based embedding models | BM25, SPLADE |

In practice, the two are complementary, and many RAG systems combine them (hybrid search) to benefit from both semantic matching and exact keyword matching.