# Data Collection
✏️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 24/06/2024
📥 Last Update: 05/07/2024
A key aspect of RAG is how it leverages an external knowledge base, which can be as small as a single PDF for a simple chat-with-PDF application or as large as the datasets needed for industrial and organizational use. From one use case to another, the knowledge base can differ significantly in terms of the following (a minimal collection sketch follows the list):
- Conception - What should it consist of?
  - legal documents for legal search
  - medical records for healthcare applications
  - scientific papers for factual question answering
  - customer service logs for customer support
  - product catalogs for e-commerce recommendations
  - ...
- Data sources - Where to collect the data?
  - document datasets (online or local, public or private)
  - web pages
  - databases
  - APIs
  - cloud-storage services (e.g. Google Drive, AWS S3, ...)
  - online communities (e.g. Discord, Reddit, ...)
- Data formats - How is the raw data stored?
  - files (.txt, .pdf, .csv, .xlsx, .docx, .epub, .hwp, .ipynb, .jpeg, .jpg, .md, .mp3, .mp4, .png, .ppt, .pptm, .pptx, .json, .mbox, ...)
  - database records
RAG is data-driven, and since every dataset is different, there is (currently) no one-size-fits-all RAG workflow. When collecting data for RAG, it is important to ensure the following (simple automated checks are sketched after the list):
- Data quality (is it relevant to the specific use case? does it cover the scope of the application? is it accurate? up to date? free of redundancy?)
- Data source reliability
- Data privacy and security
- Data ethical considerations
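As an illustration of the data-quality point, here is a minimal sketch of two automated checks: exact deduplication by content hash and removal of empty or near-empty documents. The document structure (a dict with a `text` field) and the `min_chars` threshold are assumptions for illustration; real pipelines usually add relevance and freshness checks on top.

```python
# Minimal sketch of basic quality checks on collected documents.
# The document structure (text + metadata fields) is an assumption for illustration.
import hashlib

def deduplicate(documents: list[dict]) -> list[dict]:
    """Drop documents whose text content is an exact duplicate."""
    seen_hashes: set[str] = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_docs.append(doc)
    return unique_docs

def drop_empty_or_tiny(documents: list[dict], min_chars: int = 50) -> list[dict]:
    """Remove documents that are empty or too short to be useful for retrieval."""
    return [doc for doc in documents if len(doc["text"].strip()) >= min_chars]

docs = [
    {"text": "Refund policy: items can be returned within 30 days.", "source": "faq.md"},
    {"text": "Refund policy: items can be returned within 30 days.", "source": "faq_copy.md"},
    {"text": "", "source": "empty.txt"},
]
cleaned = drop_empty_or_tiny(deduplicate(docs))
print(len(cleaned))  # 1
```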
It is also worth noting that a RAG workflow involves other types of data, including:
- User input data
- External knowledge base (including raw data and data after transformations)
- System-generated data (including various types of logs and outputs of each component)
- Evaluation datasets
- Datasets used for further fine-tuning or other advanced strategies
In most RAG applications, the data consists of text (and possibly tables). However, it can also include images, audio, video, and more. While we won't go into detail here, these topics of multimodality are discussed in the section [TO LINK Multimodal RAG].
User input data is generally a prompt or a query (question) provided by the user to the system. It can also include uploaded files, such as documents, spreadsheets, images, or other media, which the system adds to the knowledge base for RAG.
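As a small illustration, user input could be represented with a structure like the following; the field names are hypothetical and not a standard schema.

```python
# Minimal sketch of how user input to a RAG system might be represented.
# Field names are illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class UserInput:
    query: str                                                  # the question or prompt
    uploaded_files: list[Path] = field(default_factory=list)    # optional user documents
    metadata: dict = field(default_factory=dict)                # e.g. user id, timestamp, language

user_input = UserInput(
    query="What is our refund policy?",
    uploaded_files=[Path("contract.pdf")],  # hypothetical upload to add to the knowledge base
)
```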
Raw data from the external knowledge base can consist of simple .txt files, the text content of web pages, text extracted from images, PDFs, tables from spreadsheets, code files, and more, depending on the application use case.
Once the raw data is identified and accessible, the initial steps in a RAG pipeline typically involve the following (a minimal sketch follows the list):
- Extracting/parsing: this step focuses on reading and extracting content (predominantly text) from the raw data. Parsing refers to converting data from one format to another, a common requirement in RAG when extracting meaningful content from structured, semi-structured, or unstructured data.
- Transforming/cleaning: this step refines the extracted data by modifying or removing inaccuracies, duplicates, or irrelevant information.
- Loading/ingesting: the cleaned data is loaded into manageable objects before the chunking, embedding, and indexing stages.
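Here is a minimal sketch of these three steps for plain-text and PDF files. `pypdf` is used as one example of a PDF parser, and the `data/raw` folder and the dict-based document structure are illustrative assumptions; dedicated ingestion frameworks provide richer readers and document objects.

```python
# Minimal sketch of extract -> clean -> load, assuming plain-text and PDF inputs.
# pypdf is used here as one example of a PDF parser; other tools work equally well.
from pathlib import Path

from pypdf import PdfReader  # third-party: pip install pypdf

def extract_text(path: Path) -> str:
    """Extracting/parsing: read raw content out of a supported file."""
    if path.suffix.lower() == ".pdf":
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return path.read_text(encoding="utf-8", errors="ignore")

def clean_text(text: str) -> str:
    """Transforming/cleaning: normalize whitespace and drop empty lines."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

def load_documents(paths: list[Path]) -> list[dict]:
    """Loading/ingesting: wrap cleaned text into objects ready for chunking and embedding."""
    documents = []
    for path in paths:
        cleaned = clean_text(extract_text(path))
        if cleaned:
            documents.append({"text": cleaned, "metadata": {"source": str(path)}})
    return documents

docs = load_documents(list(Path("data/raw").glob("*")))  # hypothetical folder
```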
We won't detail log data here; more information can be found in [TO LINK Productionized-level RAG implementation].
The output of the retrieval component, a list of passages from the external knowledge base that are relevant to the query, is crucial for further processing and can be stored for later evaluation or monitoring.
In most cases, the final output of the system is the answer to the user query, typically in text format. Depending on the use case, it may also include references, metadata, or other useful information.
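A simple way to keep both the retrieval output and the final answer for later evaluation or monitoring is to append them to a log. The JSONL format, file location, and field names below are assumptions for illustration.

```python
# Minimal sketch: persisting retrieval results and the final answer for later
# evaluation or monitoring. The JSONL log format and field names are assumptions.
import json
import time
from pathlib import Path

LOG_FILE = Path("logs/rag_runs.jsonl")  # hypothetical log location

def log_rag_run(query: str, retrieved_passages: list[dict], answer: str) -> None:
    """Append one RAG interaction (query, retrieved context, answer) to a JSONL log."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "retrieved": retrieved_passages,  # e.g. [{"text": ..., "source": ..., "score": ...}]
        "answer": answer,
    }
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```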
Evaluation and benchmarking are an important part of RAG development. In short, a RAG evaluation dataset consists of carefully selected questions, ground-truth answers (optional), and ground-truth relevant contexts (optional).
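As a concrete illustration, one entry of such an evaluation dataset could look like the following; the field names and contents are hypothetical.

```python
# Minimal sketch of one entry in a RAG evaluation dataset.
# Only the question is mandatory; ground-truth fields are optional, as noted above.
evaluation_example = {
    "question": "Within how many days can a customer return an item?",
    "ground_truth_answer": "Items can be returned within 30 days.",   # optional
    "ground_truth_contexts": [                                        # optional
        "Refund policy: items can be returned within 30 days of purchase."
    ],
}
```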
For more information on evaluation datasets, please refer to [TO LINK RAG evaluation datasets].
For some advanced techniques, extra curated datasets for proxy tasks may be needed. For more information, please refer to [Fine-tuning embeddings], [Fine-tuning Retrievers], [Fine-tuning LLMs for RAG], or [Query classification].