Multimodal RAG approaches
✏️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 05/07/2024
📥 Last Update: 05/07/2024
There are three main approaches to Multimodal RAG, outlined below:
- Approach 1: Embed all modalities into a unified vector space
- Approach 2: Build separate vector stores for each modality
- Approach 3: Ground all modalities to a single modality and then embed
The first approach, embedding all modalities into a unified vector space, relies on multimodal embedding models such as CLIP or ImageBind that map different data types into a single shared space. Images, text, and table text are all embedded into this space, and the most relevant chunks are retrieved together with their relevancy scores, regardless of modality.
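A minimal sketch of this approach, assuming the sentence-transformers library with a CLIP checkpoint; the chunk texts, file names, and query below are placeholders:

```python
from PIL import Image
import torch
from sentence_transformers import SentenceTransformer, util

# CLIP maps text and images into the same vector space.
# Note: CLIP's text encoder truncates inputs at 77 tokens.
model = SentenceTransformer("clip-ViT-B-32")

# Placeholder text chunks and image files.
text_chunks = ["Quarterly revenue grew 12% year over year.", "Table: revenue by region ..."]
image_paths = ["chart_q3.png", "org_diagram.png"]

text_emb = model.encode(text_chunks, convert_to_tensor=True)
image_emb = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

# One index over all modalities: concatenate embeddings and track provenance.
corpus_emb = torch.cat([text_emb, image_emb], dim=0)
corpus_items = [("text", c) for c in text_chunks] + [("image", p) for p in image_paths]

# Retrieve the top-K items for a query, regardless of modality.
query_emb = model.encode("How did revenue change last quarter?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
for score, idx in zip(*scores.topk(3)):
    modality, item = corpus_items[int(idx)]
    print(f"{float(score):.3f}  [{modality}]  {item}")
```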
The second approach builds a separate vector store for each modality. Text and table text are embedded with a text embedding model, while images are embedded with an image embedding model such as CLIP. For a given query, the top-K image chunks and the top-K text/table chunks are retrieved separately; a multimodal reranking model or a fusion technique is then needed to produce the final K chunks.
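One simple fusion technique is reciprocal rank fusion (RRF) over the two ranked result lists. A minimal sketch, where the two hit lists are placeholders standing in for the top-K results returned by a text retriever and an image retriever:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine several ranked lists of chunk IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder top-K results from the two modality-specific stores.
text_hits = ["text_07", "table_02", "text_11"]
image_hits = ["img_03", "img_07", "img_01"]

final_chunks = reciprocal_rank_fusion([text_hits, image_hits])[:3]
print(final_chunks)  # fused top-3 chunks across both modalities
```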
The third approach, grounding all modalities to a single modality, converts images into text before embedding. Each image is summarized or captioned, the resulting summaries are embedded alongside the text data with a text embedding model, and the top-K relevant chunks are retrieved.
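A minimal sketch of this approach, assuming a BLIP captioning model to ground images into text and a standard text embedding model for retrieval; the file names, chunk texts, and query are placeholders, and any captioning or multimodal model could generate the image summaries:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

# 1) Ground: turn each image into a short text description.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe(image_path):
    inputs = processor(Image.open(image_path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True)

image_summaries = [describe(p) for p in ["chart_q3.png", "org_diagram.png"]]  # placeholder files

# 2) Embed: image summaries go through the same text embedder as the text chunks.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["Quarterly revenue grew 12% year over year.", "Table: revenue by region ..."] + image_summaries
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

# 3) Retrieve the top-K relevant chunks for a query.
query_emb = embedder.encode("How did revenue change last quarter?", convert_to_tensor=True)
top_scores, top_idx = util.cos_sim(query_emb, corpus_emb)[0].topk(3)
print([corpus[int(i)] for i in top_idx])
```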
After retrieval, you can either use the retrieved image directly as context with a multimodal LLM for generation, or take an additional step to summarize the image as text and use that summary as context for a unimodal (text-only) LLM.
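The first option can be sketched as follows, assuming an OpenAI-compatible multimodal chat model; the model name, file name, and prompt are placeholders, and any multimodal LLM could be substituted:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder: a retrieved image chunk, sent to the model as base64.
with open("chart_q3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Using the attached chart as context, answer: how did revenue change last quarter?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```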
Approach | Pros | Cons
---|---|---
Embed all modalities into a unified vector space | Simplifies retrieval by using a single multimodal embedding model. | Requires an embedding model that performs well across various data types and complexity levels. Requires a context length sufficient for embedding texts; the 77-token limit of CLIP models is often too short for many applications.
Build separate vector stores for each modality | Allows the embedding model for each modality to be optimized independently. | Increases complexity: requires efficient fusion and reranking strategies, and a suitable multimodal reranker.
Ground all modalities to a single modality | Text descriptions and metadata enhance understanding and retrieval accuracy. Easier to adapt advanced text-based RAG techniques. | High preprocessing cost. Text descriptions may miss essential information in the images.