LLM - Crack and Chunk Data

llm_rag_crack_and_chunk

Overview

Creates chunks no larger than chunk_size from input_data. Extracted document titles are prepended to each chunk.

LLMs have token limits for the prompts passed to them. This is a limiting factor at embedding time and even more so at prompt completion time, because only so much context can be passed along with the instructions to the LLM and the user query. Chunking splits source data of various formats into small but coherent snippets of information that can be 'packed' into LLM prompts when asking for answers to user queries related to the source documents.
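The idea can be illustrated with a minimal sketch. This is not the component's implementation: whitespace-separated words stand in for model tokens, the `Title:` prefix is a hypothetical format, and reserving room for the title inside `chunk_size` is an assumption made so that each chunk stays within the limit.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Yield windows of at most chunk_size tokens, sharing chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input


def chunk_document(title, text, chunk_size=768, chunk_overlap=0):
    """Split text into chunks and prepend the document title to each one."""
    # Assumption: words approximate tokens; a real tokenizer would differ.
    title_tokens = f"Title: {title}".split()
    # Assumption: the title counts toward chunk_size.
    budget = chunk_size - len(title_tokens)
    if budget <= chunk_overlap:
        raise ValueError("chunk_size is too small for the title and overlap")
    body_tokens = text.split()
    return [
        " ".join(title_tokens + window)
        for window in chunk_tokens(body_tokens, budget, chunk_overlap)
    ]


chunks = chunk_document("Guide", "a b c d e f g h i j", chunk_size=6, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a one-token overlap, the last word of each chunk body reappears as the first word of the next, which helps keep sentences that straddle a boundary retrievable from either side.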

Supported formats: md, txt, html/htm, pdf, ppt(x), doc(x), xls(x), py

Version: 0.0.85

Inputs

Input AzureML Data

Name	Description	Type	Default	Optional	Enum
input_data	Uri Folder containing files to be chunked.	uri_folder

Files to handle from source

Name	Description	Type	Default	Optional	Enum
input_glob	Limit files opened from `input_data`, defaults to '**/*'.	string		True
allowed_extensions	Comma-separated list of extensions to include. If not provided, the default list of supported extensions is used. e.g. '.md,.txt,.html,.py,.pdf'	string		True
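How the two options combine can be sketched as follows. This is an illustration under assumptions, not the component's code: `select_files` is a hypothetical helper, the glob is applied with `pathlib`, and the fallback extension set is simply the "Supported formats" list above written out in full.

```python
from pathlib import Path

# Assumption: expanded from the "Supported formats" list on this page.
DEFAULT_EXTENSIONS = {
    ".md", ".txt", ".html", ".htm", ".pdf",
    ".ppt", ".pptx", ".doc", ".docx", ".xls", ".xlsx", ".py",
}


def select_files(input_data, input_glob, allowed_extensions=None):
    """Return files under input_data matching the glob and extension filter."""
    if allowed_extensions:
        # Parse a comma-separated string such as ".md,.txt,.py".
        extensions = {
            ext.strip().lower()
            for ext in allowed_extensions.split(",")
            if ext.strip()
        }
    else:
        extensions = DEFAULT_EXTENSIONS
    selected = []
    for path in sorted(Path(input_data).glob(input_glob)):
        # Globs can match directories too, so keep regular files only.
        if path.is_file() and path.suffix.lower() in extensions:
            selected.append(path)
    return selected
```

For example, `select_files("docs", "**/*", ".md,.py")` would walk a hypothetical `docs` folder recursively and keep only Markdown and Python files.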

Chunking options

Name	Description	Type	Default	Optional	Enum
chunk_size	Maximum number of tokens to put in each chunk.	integer	768
chunk_overlap	Number of tokens to overlap between chunks.	integer	0
doc_intel_connection_id	Connection id for Document Intelligence service. If provided, it will be used to extract content from .pdf documents.	string		True
data_source_url	Base URL to join with file paths to create full source file URL for chunk metadata.	string		True
document_path_replacement_regex	A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source url. e.g. '{"match_pattern": "(.*)/articles/(.*)(\\.[^.]+)$", "replacement_pattern": "\\1/\\2"}' would remove '/articles' from the middle of the url.	string		True
max_sample_files	Number of files to chunk. Specify -1 to chunk all documents in input path.	integer	-1
use_rcts	Whether to use RecursiveCharacterTextSplitter to split documents into chunks	string	True		['True', 'False']
output_format	Format of the output chunk file	string	jsonl		['csv', 'jsonl']
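The interaction between `data_source_url` and `document_path_replacement_regex` can be sketched as below. `build_source_url` is a hypothetical helper, and the way the base URL and file path are joined is an assumption; only the use of `re.sub` with the two JSON fields comes from the description above. The pattern uses `(.*)` groups, and `json.dumps` produces the backslash escaping that a valid JSON string requires.

```python
import json
import re


def build_source_url(data_source_url, relative_path, replacement_json=None):
    """Join the base URL with a file path, then apply the optional rewrite."""
    # Assumption: a plain slash join; the component may join differently.
    url = data_source_url.rstrip("/") + "/" + relative_path.lstrip("/")
    if replacement_json:
        spec = json.loads(replacement_json)
        url = re.sub(spec["match_pattern"], spec["replacement_pattern"], url)
    return url


# Build the JSON string programmatically so the backslashes are escaped correctly.
replacement = json.dumps({
    "match_pattern": r"(.*)/articles/(.*)(\.[^.]+)$",
    "replacement_pattern": r"\1/\2",
})

# The third group (the file extension) is not referenced in the replacement,
# so this pattern drops the extension as well as '/articles'.
print(build_source_url(
    "https://example.com/docs", "articles/setup/install.md", replacement
))
# prints https://example.com/docs/setup/install
```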

Outputs

Name	Description	Type
output_chunks	Uri Folder containing chunks. Each chunk will be a separate file in the folder	uri_folder
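When `output_format` is `jsonl`, the output folder can be read back with a few lines of Python. This is a generic sketch: `read_chunk_records` is a hypothetical helper, the `.jsonl` file extension is an assumption, and no field names are assumed because the record schema is not described here.

```python
import json
from pathlib import Path


def read_chunk_records(output_chunks):
    """Load every JSON Lines record found under the output folder."""
    records = []
    for path in sorted(Path(output_chunks).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    records.append(json.loads(line))
    return records
```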

Environment

azureml:llm-rag-embeddings:76
