Multimodal RAG - trankhoidang/RAG-wiki GitHub Wiki

Introducing Multimodal RAG

✏️ Page Contributors: Khoi Tran Dang

🕛 Creation date: 05/07/2024

📥 Last Update: 10/07/2024

Overview

Multimodal Retrieval-Augmented Generation (RAG) is an emerging technology that extends traditional text-based RAG by incorporating diverse data types such as images, audio, video, and code. This creates a more comprehensive and contextually rich retrieval and generation system.

  • Enhanced Context Understanding: Like human perception, which integrates multiple senses, multimodal RAG develops a nuanced understanding of queries and context by incorporating text, images, audio, and other data types.
  • Broader Applications: Useful in various fields such as multimedia content creation and the analysis of medical images and patient records.
  • Improved Retrieval Accuracy: By cross-referencing and validating information across multiple data modalities, it enhances the precision and reliability of information retrieval.
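One simple way to realize this cross-modal validation is to run a separate retriever per modality and fuse their rankings, so that documents corroborated by several modalities rise to the top. Below is a minimal sketch using reciprocal rank fusion (RRF); the document ids and per-modality result lists are illustrative stand-ins, not output from any particular library.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists from several retrievers (one per modality).

    rankings: list of lists of document ids, each ordered best-first.
    k: smoothing constant from the standard RRF formula.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Documents surfaced by several modalities accumulate a higher score.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: a text retriever and an image retriever over the same corpus.
text_hits = ["doc_a", "doc_b", "doc_c"]
image_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([text_hits, image_hits])
print(fused[0])  # "doc_b": ranked highly by both modalities
```

Here `doc_b` wins because both retrievers place it near the top, which is exactly the cross-referencing effect described above.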

Supported modalities in different RAG phases

Multimodal RAG involves integrating and processing different modalities across various stages:

  • Input query
  • Knowledge base
  • Retrieved context
  • Context used for augmentation

  • Output answer
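To make these stages concrete, here is a minimal, text-only sketch of the pipeline with the stage boundaries marked in comments. The `Doc` class, the keyword-overlap retriever, and the stubbed LLM call are illustrative assumptions, not part of any particular framework; in a multimodal setting each stage could carry images, audio, or other payloads.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:
    """One knowledge-base entry; non-text modalities are referenced by path."""
    text: str = ""
    image_path: Optional[str] = None

def retrieve(query_text, knowledge_base, top_k=2):
    """Toy retriever: rank docs by keyword overlap with the query."""
    q_words = set(query_text.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda d: len(q_words & set(d.text.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer(query_text, knowledge_base):
    """Walk the stages: input query -> knowledge base -> retrieved context
    -> context used for augmentation -> output answer (stubbed LLM call)."""
    retrieved = retrieve(query_text, knowledge_base)          # retrieval
    context = "\n".join(d.text for d in retrieved)            # augmentation
    prompt = f"Context:\n{context}\n\nQuestion: {query_text}"
    return f"[stubbed LLM answer grounded in {len(retrieved)} docs]", prompt

kb = [
    Doc("The Eiffel Tower is in Paris.", image_path="eiffel.jpg"),
    Doc("Paris is the capital of France."),
    Doc("Mount Fuji is in Japan.", image_path="fuji.jpg"),
]
out, prompt = answer("Where is the Eiffel Tower?", kb)
```

Replacing the keyword retriever with a multimodal embedding index, and the stub with a vision-language model, turns this skeleton into a multimodal RAG system.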

An illustrative diagram showing the different modalities involved in various stages of multimodal RAG is available here: hymie122/RAG-Survey: Collecting awesome papers of RAG for AIGC (github.com)

Challenges

While promising, multimodal RAG faces multiple challenges:

  • Capturing information from each modality
    • Example: a general-purpose multimodal embedding model might not capture the nuances of different image types, such as general photographs, plot/chart images, and images with embedded text
  • Aligning representations across modalities
    • Example: the representation of an image may not align with the representation of its text description
    • Example: the representation of a table (serialized as text) may not align with a plain-text query
  • Extracting different modalities from unstructured data
    • Example: much enterprise data consists of PDFs, from which extracting text, tables, and images is non-trivial. For more information, see PDF Parsing for RAG
  • Limited access to advanced multimodal models
  • Computational and operational costs (pre-processing, storage, maintenance)
  • Training and fine-tuning
    • Multimodal models require more resources and complex optimization compared to text-only models.
  • Cross-modality imbalance
    • Addressing the dominance of one modality over another (e.g., text vs. images) necessitates specialized techniques.
  • Evaluation
    • Assessing multimodal RAG is difficult due to the integration of multiple data types and the lack of mature, standardized benchmarks.

Applications

A schema of RAG applications, with a focus on multimodality, is discussed here: hymie122/RAG-Survey: Collecting awesome papers of RAG for AIGC (github.com)

Further reading

Next pages in Multimodal RAG section:

← Previous: Introducing RAG

Next: Approaches in Multimodal RAG →
