chunking strategies - trankhoidang/RAG-wiki GitHub Wiki

Chunking strategies

โœ๏ธ Page Contributors: Khoi Tran Dang

๐Ÿ•› Creation date: 01/07/2024

๐Ÿ“ฅ Last Update: 01/07/2024

5 levels of chunking

The simplest strategy is to split documents into fixed-size chunks (referred to below as level 1).

  • Pros: straightforward; useful for brief, consistent texts such as some social media posts.
  • Cons: ignores context and text structure (boundaries such as paragraphs or sentences).

One could imagine a subsequent strategy of splitting documents into chunks of random size.

  • Pros: might (or might not) capture more context.
  • Cons: can break the text uncontrollably in a non-logical manner, producing less meaningful chunks.

This video gives an overview of 5 different levels of chunking:

In short, the five levels include:

  • level 1: Character splitting

    • Idea: split text based on character count or token count
    • Parameters: chunk_size, chunk_overlap
  • level 2: Recursive character splitting

    • Idea: progressively split the text into smaller chunks by defining a list of separators (at the level of paragraphs, sentences, words or characters) to account for the structure of the text
    • Parameters: ["\n\n", "\n", " ", ""], chunk_size
  • level 3: Document specific splitting

    • Idea: chunking strategy varies upon different data formats (PDFs, code files, markdown files, json files, ...)
  • level 4: Semantic splitting

    • Idea: take into account the actual meaning of the text by using semantic text embeddings
    • Parameters: the number of sentences to group and compute semantic similarity, the text embedding model, similarity threshold
  • level 5: Agentic splitting

    • Idea: instruct the LLM to chunk "like a human", i.e. start at the top, treat the first part as the first chunk, then go down and decide whether each piece should be merged into the current chunk or start a new chunk
    • Parameters: the LLM model and instructions given to the LLM, algorithm for proposition
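To make level 2 concrete, here is a minimal sketch of recursive character splitting. This is a simplified reimplementation for illustration, not LangChain's actual `RecursiveCharacterTextSplitter` (which also handles overlap and re-merging of small pieces); the separator order mirrors the default `["\n\n", "\n", " ", ""]` shown above.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    """Split on the coarsest separator first; recurse into finer separators
    only for pieces that still exceed chunk_size (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return [c for c in chunks if c]
```

Because paragraph breaks are tried before word breaks, chunks tend to align with the text's natural structure, which is exactly what level 1 lacks.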

Chunk size is the maximum length of one text chunk. Typically, the size refers to the number of tokens or characters. This parameter is crucial to tune and monitor in almost every chunking strategy.

Chunk overlap is the portion of text that is shared between consecutive chunks. The idea is to better capture the context around chunk boundaries; however, it requires more storage and may introduce redundant or irrelevant information.
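The interaction of `chunk_size` and `chunk_overlap` can be sketched with a few lines of level-1 character splitting (a simplified illustration, not any particular library's implementation):

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character splitting: each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one."""
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # the text is fully covered
    return chunks
```

For example, `fixed_size_chunks("abcdefghij", chunk_size=5, chunk_overlap=2)` yields `["abcde", "defgh", "ghij"]`: each chunk repeats the last 2 characters of its predecessor, which is the redundancy (and extra storage) the paragraph above mentions.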

One can play with the parameters and test some LangChain implementations of the first 3 levels here:

Other chunking strategies

Another strategy would be to employ natural language processing tools like NLTK or spaCy to segment sentences more effectively, taking linguistic structure into account.
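In practice one would call NLTK's `nltk.tokenize.sent_tokenize` or iterate over spaCy's `doc.sents`; both handle abbreviations and other edge cases. As a dependency-free stand-in, the idea can be sketched with a naive regex splitter (an assumption for illustration only, deliberately simpler than either library):

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by
    whitespace. Real tools (NLTK, spaCy) are far more robust."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Sentence-level chunks like these can then be grouped back up to the desired `chunk_size` without ever cutting a sentence in half.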

Hybrid chunking refers to the combination of multiple strategies to optimize the results. Examples include:

  • fixed-size chunking for short, concise text such as headlines, and semantic chunking for paragraphs.
  • fixed-size chunking for quick indexing, followed by semantic chunking during the retrieval phase (mentioned here).

Some other methods of chunking are used as a preparation step for advanced retrieval strategies. The main idea is to differentiate the chunks used for retrieval from the chunks fed to the LLM, since the optimal chunk for retrieval might be different from the optimal chunk to use as context.

Typical examples include:

  • Sentence window chunking: split the document into sentences and retain the surrounding sentences (the window) in the chunk's metadata. Retrieval is based on the chunk embedding (which is very specific); at the augmentation phase, the chunk is replaced with the whole window.
  • Hierarchical chunking: split texts into hierarchies of chunk sizes (for example, 3 layers of 128, 512 and 2048 tokens). Retrieve using the smallest chunk size; then, if multiple child chunks within a parent chunk are retrieved, use the parent chunk for response synthesis.
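The sentence-window idea above can be sketched as follows (a minimal illustration; frameworks such as LlamaIndex ship production versions of this pattern, and the dictionary layout here is an assumption, not any library's actual schema):

```python
def sentence_window_chunks(sentences, window_size=1):
    """For each sentence, store the sentence itself (to embed and retrieve)
    plus a window of neighbouring sentences (to hand to the LLM instead)."""
    chunks = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        chunks.append({
            "text": sent,                          # embedded; very specific
            "window": " ".join(sentences[lo:hi]),  # substituted at augmentation
        })
    return chunks
```

Retrieval scores only the single sentence in `"text"`, while the broader `"window"` is what actually reaches the LLM, which is exactly the retrieve-small / read-big separation described above.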

More information is given in the [TO LINK advanced retrieval small-to-big].

Unstructured.io also offers some methods for chunking after the document layout recognition phase. Some methods are given in Chunking strategies - Unstructured:

  • "by_title"
  • "by_page"
  • "by_similarity"

There are also new chunking methods currently being studied; one example would be:

Further reading

  • The effectiveness of a chunking strategy depends on the dataset and the use case. Thus, choosing the right chunking strategy and its corresponding parameters is an essential subject. For more information, please check How to select chunking strategy for RAG.

โ† Previous: S04_Chunking

Next: S05_Embedding โ†’

โš ๏ธ **GitHub.com Fallback** โš ๏ธ