How to select chunking strategy - trankhoidang/RAG-wiki GitHub Wiki
✏️ Page Contributors: Khoi Tran Dang
🕛 Creation date: 25/06/2024
📥 Last Update: 25/06/2024
As Greg Kamradt said here, the goal of chunking is to prepare your data for its anticipated tasks, and the right question to ask is "What is the optimal way to pass the data my language model needs for its task?".
Chunking strategies and their effectiveness depend on the dataset and the use case. Thus, for some initial insight, it is worth considering:
- What is the data you are working with?
  - Long documents such as articles, or short messages? Uniform structure or not?
- What is your use case?
  - Fact checking? Question answering? Summarization? ...
- What is the average length and complexity of the queries?
  - The chunk length should align with the query for better retrieval.
- What is your embedding model (and its token capacity)?
  - Which chunk size is best suited to that embedding model?
- What is the length of the LLM context window?
  - How many retrieved chunks can be passed to the LLM? This in turn limits the chunk size.
- (Extra) Do you want to tailor different chunking strategies to different document structures? Do you have the budget for their implementation and maintenance?
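As a concrete illustration of the embedding-model question above, a pipeline can check that each chunk stays within the model's token capacity before embedding it. The sketch below uses a naive whitespace token count as a stand-in; a real pipeline should use the embedding model's own tokenizer (e.g. `tiktoken` for OpenAI models), and the 512-token limit shown is an assumed example, not a property of any particular model.

```python
# Sketch: verifying chunks fit an embedding model's token capacity.
# ASSUMPTION: whitespace splitting approximates tokenization; swap in
# the real tokenizer of your embedding model for accurate counts.

def count_tokens(text: str) -> int:
    """Very rough token estimate: number of whitespace-separated words."""
    return len(text.split())

def fits_embedding_model(chunk: str, max_tokens: int = 512) -> bool:
    """Return True if the chunk fits within the model's token capacity."""
    return count_tokens(chunk) <= max_tokens

chunk = "Retrieval-augmented generation combines search with LLMs."
print(fits_embedding_model(chunk, max_tokens=512))  # short chunk -> True
```

Chunks that fail this check would need to be split further or truncated before embedding.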
Depending on the use case, documents may or may not have a consistent and uniform structure. Below is an example of a guideline for determining the chunking strategy based on the degree of document structure.
A discussion of chunking approaches by document structure is given here: Developing a RAG solution - Chunking phase - Azure Architecture Center | Microsoft Learn. Note that this does not hold in all cases; it is just an example of an insight from which to start.
- There is no approach that is optimal for all cases, and you can combine different chunking methods if you have multiple downstream tasks.
- Where possible, it is important to compare strategies quantitatively under different parameter tunings.
- That said, depending on your budget, chunking may not be the first thing you want to optimize in the RAG workflow.
Each strategy has parameters that can affect the effectiveness of chunking.
The most common parameter to adjust is the chunk size. Typically, you measure both retrieval and generation performance in relation to your chunking strategy and its parameters. The evaluation metrics may vary depending on your specific objectives.
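To make the chunk-size sweep concrete, here is a minimal sketch of a fixed-size, word-based chunker whose `chunk_size` and `overlap` parameters can be varied during evaluation. The word-based splitting and the parameter values are illustrative assumptions; production systems typically chunk by tokens or by document structure, and the retrieval/generation evaluation loop itself is not shown.

```python
# Sketch: fixed-size chunking with overlap, the parameters most often
# swept when tuning a chunking strategy. ASSUMPTION: chunks are measured
# in words; overlap must be smaller than chunk_size.

def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into word chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

doc = " ".join(f"word{i}" for i in range(10))
for size in (3, 5):  # candidate chunk sizes to evaluate
    print(size, chunk_text(doc, chunk_size=size, overlap=1))
```

Each candidate `(chunk_size, overlap)` pair would then be scored with your retrieval and generation metrics to pick the best configuration for your data.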
Examples of such workflows:
- Optimizing RAG with Advanced Chunking Techniques (antematter.io)
- Mastering RAG: Advanced Chunking Techniques for LLM Applications - Galileo (rungalileo.io)