components llm_rag_crack_and_chunk - Azure/azureml-assets GitHub Wiki
Creates chunks no larger than chunk_size from input_data; extracted document titles are prepended to each chunk.
LLMs have token limits for the prompts passed to them. This is a limiting factor at embedding time and even more so at prompt completion time, because only so much context can be passed along with the instructions to the LLM and the user query. Chunking splits source data of various formats into small but coherent snippets of information that can be 'packed' into LLM prompts when asking for answers to user queries related to the source documents.
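
The sliding-window idea behind chunk_size and chunk_overlap can be sketched as follows. This is a minimal illustration, not the component's implementation: whitespace splitting stands in for a real tokenizer, the helper names `chunk_tokens` and `chunk_document` are hypothetical, the "Title:" prefix format is assumed, and the title is not counted against the token budget here.

```python
def chunk_tokens(tokens, chunk_size=768, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_document(text, title, chunk_size=768, chunk_overlap=0):
    """Chunk a document and prepend its title to every chunk (assumed format)."""
    tokens = text.split()  # assumption: whitespace "tokens", not model tokens
    return [
        f"Title: {title}\n\n" + " ".join(window)
        for window in chunk_tokens(tokens, chunk_size, chunk_overlap)
    ]


if __name__ == "__main__":
    for chunk in chunk_document("one two three four five", "Demo", chunk_size=2):
        print(repr(chunk))
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which trades some extra storage for a lower chance of cutting a sentence's context in half.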
Supported formats: md, txt, html/htm, pdf, ppt(x), doc(x), xls(x), py
Version: 0.0.77
Preview
View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_crack_and_chunk/version/0.0.77
Input AzureML Data
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
input_data | Uri Folder containing files to be chunked. | uri_folder |
Files to handle from source
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
input_glob | Limit files opened from input_data, defaults to '**/*'. | string | True | ||
allowed_extensions | Comma separated list of extensions to include. If not provided, the default list of supported extensions will be used. e.g. '.md,.txt,.html,.py,.pdf' | string | True |
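
As a rough illustration of how input_glob and allowed_extensions might combine, the sketch below filters a folder with `pathlib`. It is an assumption about the behavior, not the component's code: `select_files` is a hypothetical helper, and the fallback extension set is simply derived from the supported formats listed above.

```python
from pathlib import Path

# Assumed fallback, derived from the "Supported formats" line of this page.
DEFAULT_EXTENSIONS = {
    ".md", ".txt", ".html", ".htm", ".pdf",
    ".ppt", ".pptx", ".doc", ".docx", ".xls", ".xlsx", ".py",
}


def select_files(input_dir, input_glob="**/*", allowed_extensions=None):
    """Return files under input_dir matching the glob and extension filter."""
    if allowed_extensions:
        # Parse a comma separated string such as ".md,.txt" into a set.
        extensions = {
            ext.strip().lower()
            for ext in allowed_extensions.split(",")
            if ext.strip()
        }
    else:
        extensions = DEFAULT_EXTENSIONS

    selected = []
    for path in sorted(Path(input_dir).glob(input_glob)):
        # Globs also match directories, so keep regular files only.
        if path.is_file() and path.suffix.lower() in extensions:
            selected.append(path)
    return selected


if __name__ == "__main__":
    for path in select_files(".", allowed_extensions=".md,.txt"):
        print(path)
```

The extension comparison is case-insensitive in this sketch; whether the component treats `.PDF` and `.pdf` alike is not stated on this page.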
Chunking options
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
chunk_size | Maximum number of tokens to put in each chunk. | integer | 768 | ||
chunk_overlap | Number of tokens to overlap between chunks. | integer | 0 | ||
doc_intel_connection_id | Connection id for Document Intelligence service. If provided, will be used to extract content from .pdf documents. | string | True | ||
data_source_url | Base URL to join with file paths to create full source file URL for chunk metadata. | string | True | ||
document_path_replacement_regex | A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source url. e.g. '{"match_pattern": "(.*)/articles/(.*)(\\.[^.]+)$", "replacement_pattern": "\\1/\\2"}' would remove '/articles' from the middle of the url. | string | True | ||
max_sample_files | Number of files to chunk. Specify -1 to chunk all documents in input path. | integer | -1 | ||
use_rcts | Whether to use RecursiveCharacterTextSplitter to split documents into chunks | string | True | ['True', 'False'] | |
output_format | Format of the output chunk file | string | jsonl | ['csv', 'jsonl'] |
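
To make the interaction between data_source_url and document_path_replacement_regex concrete, here is a small sketch. The helper `build_source_url`, the way the base URL and file path are joined, and the example URL are all assumptions for illustration; only the use of `re.sub` with 'match_pattern' and 'replacement_pattern' comes from the parameter description above.

```python
import json
import re


def build_source_url(data_source_url, relative_path, replacement_regex_json=None):
    """Join a base URL with a file path, then optionally rewrite it with re.sub."""
    # Assumed join rule: exactly one slash between base URL and file path.
    url = data_source_url.rstrip("/") + "/" + relative_path.lstrip("/")
    if replacement_regex_json:
        spec = json.loads(replacement_regex_json)
        url = re.sub(spec["match_pattern"], spec["replacement_pattern"], url)
    return url


# json.dumps handles the backslash escaping that a JSON string requires.
spec = json.dumps({
    "match_pattern": r"(.*)/articles/(.*)(\.[^.]+)$",
    "replacement_pattern": r"\1/\2",
})

if __name__ == "__main__":
    print(build_source_url("https://example.com/docs", "articles/guide/intro.md", spec))
    # prints https://example.com/docs/guide/intro
```

Because the replacement pattern keeps only the first two groups, this particular pattern also drops the file extension captured by the third group, in addition to removing '/articles'.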
Outputs
Name | Description | Type |
---|---|---|
output_chunks | Uri Folder containing chunks. Each chunk will be a separate file in the folder | uri_folder |
Environment
azureml:llm-rag-embeddings@latest