Multimodal RAG - trankhoidang/RAG-wiki GitHub Wiki
✏️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 05/07/2024
📥 Last Update: 10/07/2024
Multimodal Retrieval-Augmented Generation (RAG) is an emerging technology that extends traditional text-based RAG by incorporating diverse data types such as images, audio, video, and code. This creates a more comprehensive and contextually rich retrieval and generation system.
- Enhanced Context Understanding: Like human perception, which integrates multiple senses, multimodal RAG develops a nuanced understanding of queries and context by incorporating text, images, audio, and other data types.
- Broader Applications: Useful in various fields such as multimedia content creation and the analysis of medical images and patient records.
- Improved Retrieval Accuracy: By cross-referencing and validating information across multiple data modalities, it enhances the precision and reliability of information retrieval.
Multimodal RAG involves integrating and processing different modalities across various stages:
- Input query
- Knowledge base
- Retrieved context
- Context used for augmentation
- Output answer
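The stages above can be sketched as a minimal retrieve-then-augment loop. Everything here is a hypothetical placeholder: the `embed` function stands in for a real multimodal encoder (such as CLIP), and images in the toy knowledge base are represented by their captions rather than pixels.

```python
import math

# Placeholder embedding: a normalized bag-of-characters vector.
# A real system would use a multimodal encoder; this only illustrates
# that every modality must map into a common vector space.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Hypothetical knowledge base mixing modalities. Image and table entries
# are stored via text surrogates (captions, serialized cells).
KNOWLEDGE_BASE = [
    {"modality": "text",  "content": "Quarterly revenue grew by ten percent."},
    {"modality": "image", "content": "Bar chart of quarterly revenue by region."},
    {"modality": "table", "content": "Revenue per region: north, south, east, west."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    # Rank knowledge-base entries by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda d: cosine(q, embed(d["content"])),
                    reverse=True)
    return ranked[:k]

def augment(query: str, docs: list[dict]) -> str:
    # Assemble the retrieved context into a prompt for the generator.
    context = "\n".join(f"[{d['modality']}] {d['content']}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    query = "How did revenue change?"
    print(augment(query, retrieve(query)))
```

In a real deployment, `retrieve` would query a vector store holding embeddings produced by modality-appropriate encoders, and `augment` would feed a multimodal LLM.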
An illustrative diagram showing the different modalities involved in various stages of Multimodal RAG is available here: hymie122/RAG-Survey: Collecting awesome papers of RAG for AIGC. (github.com)
While promising, multimodal RAG faces multiple challenges:
- Capturing information from each modality
  - Example: general multimodal embedding models may not capture the nuances of different image types, such as natural images, plots, and images with embedded text.
- Aligning representations across modalities
  - Example: the representation of an image may not align with that of its text description.
  - Example: the representation of a table (serialized as text) may not align with that of the query (also text).
- Extracting different modalities from unstructured data
  - Example: much enterprise data consists of PDFs, from which extracting text, tables, and images is not trivial. For more information, see PDF Parsing for RAG.
- Limited access to advanced multimodal models
- Computational and other costs (pre-processing, storage, maintenance)
- Training and fine-tuning
  - Multimodal models require more resources and more complex optimization than text-only models.
- Cross-modality imbalance
  - Addressing the dominance of one modality over another (e.g., text over images) requires specialized techniques.
- Evaluation
  - Assessing multimodal RAG is challenging due to the integration of multiple data types and the lack of standardized benchmarks, which are still under development.
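One common mitigation for cross-modality imbalance is to normalize retrieval scores within each modality before merging ranked lists, so that a modality whose scores sit on a higher scale does not always dominate. The candidate scores below are invented for illustration; the per-modality z-scoring is one simple technique among several:

```python
from statistics import mean, pstdev

# Hypothetical retrieval scores: text scores sit on a different scale
# than image scores, so naive merging would always favour text results.
candidates = [
    {"id": "t1", "modality": "text",  "score": 0.91},
    {"id": "t2", "modality": "text",  "score": 0.88},
    {"id": "i1", "modality": "image", "score": 0.34},
    {"id": "i2", "modality": "image", "score": 0.12},
]

def normalize_per_modality(cands: list[dict]) -> list[dict]:
    """Z-score each candidate within its own modality so that score
    scales become comparable, then merge into a single ranking."""
    by_modality: dict[str, list[float]] = {}
    for c in cands:
        by_modality.setdefault(c["modality"], []).append(c["score"])
    stats = {m: (mean(s), pstdev(s) or 1.0) for m, s in by_modality.items()}
    out = []
    for c in cands:
        mu, sigma = stats[c["modality"]]
        out.append({**c, "norm_score": (c["score"] - mu) / sigma})
    return sorted(out, key=lambda c: c["norm_score"], reverse=True)

if __name__ == "__main__":
    for c in normalize_per_modality(candidates):
        print(c["id"], round(c["norm_score"], 2))
```

After normalization, the top results include the best candidate from each modality instead of only text hits; production systems often use related schemes such as min-max scaling or reciprocal rank fusion for the same purpose.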
A broader schema of RAG applications, including multimodal ones, is discussed in the same survey: hymie122/RAG-Survey: Collecting awesome papers of RAG for AIGC. (github.com)
Next pages in Multimodal RAG section: