
Document Extraction/Parsing for RAG

✏️ Page Contributors: Khoi Tran Dang

🕛 Creation date: 24/06/2024

📥 Last Update: 26/06/2024

No matter how good our algorithms are, if the initial data is of poor quality, the RAG pipeline will likely fail. At the same time, no matter how good our initial data is, if we cannot extract meaningful content from it, the RAG pipeline will also likely fail.

Once you have collected the external data for your application, the next step is to extract the data from the different data sources. In the extraction phase, data can be categorized as:

  • structured
  • semi-structured
  • unstructured

Note: In text-based RAG, extraction typically involves text and table extraction. In multimodal RAG, image extraction can also be expected.

Structured, Semi-structured and Unstructured data

Structured data follows a predefined data schema. Typical examples include relational databases, JSON, and tabular formats like Excel or CSV.

  • how to extract: use data manipulation tools like pandas, or a custom database reader with the correct schema (see the sketch after this list).
  • extraction quality: often nearly perfect, provided the data respects the predefined schema.
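
As an illustration, here is a minimal sketch of structured-data extraction with pandas. The file name and its columns are hypothetical placeholders; the point is that a known schema makes extraction almost mechanical.

```python
# Minimal sketch: extracting structured data with pandas.
# "products.csv" and its columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("products.csv")  # schema is known in advance

# Turn each row into a small text "document" for downstream indexing
documents = [
    ", ".join(f"{col}: {row[col]}" for col in df.columns)
    for _, row in df.iterrows()
]
print(documents[0])
```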

Semi-structured data does not adhere to a strict schema, but has some organizational properties that make extraction easier. Typical examples include XML and HTML (with tags).

  • how to extract: use XML/HTML parsers with rule-based extraction that you define (for example, which tags to target), as sketched after this list.
  • extraction quality: can be high but may vary, depending on the consistency of the tags and on the strategy for choosing which tags to extract.
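
For example, here is a minimal sketch of rule-based HTML extraction with BeautifulSoup. The sample HTML and the tag-selection rules are assumptions for illustration, not a fixed recipe.

```python
# Minimal sketch: rule-based HTML extraction with BeautifulSoup
# (pip install beautifulsoup4). The sample HTML and the tag rules
# below are illustrative assumptions.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Quarterly Report</h1>
  <p>Revenue grew in Q2.</p>
  <div class="ad">Ignore this banner</div>
  <p>Costs stayed flat.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Rule: keep headings and paragraphs, skip everything else
texts = [tag.get_text(strip=True) for tag in soup.find_all(["h1", "h2", "p"])]
print("\n".join(texts))
```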

Unstructured data does not adhere to a schema and is based on character and binary data, such as text files, images, audio, PDFs, and more. It does, however, have some intrinsic patterns or visual cues from which we can extract data.

  • how to extract: methods range from simple text readers and OCR for images to rule-based and more sophisticated pipeline-based PDF parsers (see the sketch after this list).
  • extraction quality: high for simple text files but varies significantly for other document types; extraction becomes increasingly difficult when dealing with complex document structures (such as complex PDFs with embedded tables or images).
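
As a starting point, here is a minimal sketch of plain-text extraction from a PDF with pypdf. The file name is a placeholder; scanned pages or complex layouts would instead need OCR or a pipeline-based parser.

```python
# Minimal sketch: plain-text extraction from a PDF with pypdf
# (pip install pypdf). "report.pdf" is a placeholder. Quality drops
# sharply on scanned pages or complex layouts, where OCR
# (e.g. pytesseract) or a pipeline-based parser is needed instead.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```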

Extraction Tools (RAG's perspective)

Simple case

We can either use a dedicated tool for each file format, or use frameworks like LlamaIndex or LangChain, which integrate many data loaders/connectors to support RAG implementation, as sketched below.
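
For instance, here is a minimal sketch using LlamaIndex's SimpleDirectoryReader, which picks a loader per file extension. The "./data" directory is a placeholder, and the import path assumes a recent LlamaIndex release; older versions use a different module layout.

```python
# Minimal sketch: loading mixed file types with LlamaIndex
# (pip install llama-index). "./data" is a placeholder directory;
# the import path below assumes a recent (v0.10+) LlamaIndex release.
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
print(len(documents), "documents loaded")
print(documents[0].text[:200])
```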

Complex case

Extraction becomes challenging with complex PDFs, or even with structured data like Excel files whose schemas are inconsistent.

There are multiple tools/APIs under development for parsing complex data for RAG. Some provide parsing only, some aim to be ETL pipelines tailored to RAG, and some aim to cover the complete RAG workflow. For now, however, there is no consensus on which tools to use.


Depending on our data and objectives, we may need to build our own parsing pipeline. This requires significant effort and a deep understanding of the different approaches.

PDF Parsing for RAG

PDF parsing for RAG is a substantial subject. For more information on the different approaches to parsing PDFs, please refer to PDFs Parsing for RAG.

← Previous: S01_Data Collection

Next: S03_Data Cleaning →
