components llm_ingest_dataset_to_acs_basic - Azure/azureml-assets GitHub Wiki

LLM - Dataset to ACS Pipeline

llm_ingest_dataset_to_acs_basic

Overview

Single-job pipeline that chunks data from an AzureML data asset and creates an Azure Cognitive Search (ACS) embeddings index.

Version: 0.0.85

Tags

Preview

View in Studio: https://ml.azure.com/registries/azureml/components/llm_ingest_dataset_to_acs_basic/version/0.0.85

Inputs

llm_model config

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| llm_config | JSON describing the LLM provider and model details to use for prompt generation. | string | {"type": "azure_open_ai", "model_name": "gpt-35-turbo", "deployment_name": "gpt-35-turbo", "temperature": 0, "max_tokens": 2000} | | |
| llm_connection | Azure OpenAI workspace connection ARM ID | string | | True | |
| acs_config | JSON describing the ACS index to create or update. | string | | | |
| acs_connection | Azure Cognitive Search workspace connection ARM ID | string | | True | |
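
Because `llm_config` and `acs_config` are passed as JSON strings rather than objects, it can help to build them with `json.dumps` instead of writing them by hand. The sketch below is illustrative only: the `index_name` key in `acs_config`, the index name, and the `connection_arm_id` helper with its placeholder subscription, resource group, workspace, and connection names are assumptions, not values taken from this page.

```python
import json

# Default llm_config from the table above, serialized to a JSON string.
llm_config = json.dumps({
    "type": "azure_open_ai",
    "model_name": "gpt-35-turbo",
    "deployment_name": "gpt-35-turbo",
    "temperature": 0,
    "max_tokens": 2000,
})

# ASSUMPTION: the acs_config schema is not documented here; "index_name"
# and its value are hypothetical.
acs_config = json.dumps({"index_name": "my-docs-index"})


def connection_arm_id(subscription, resource_group, workspace, connection):
    """Hypothetical helper: format a workspace connection ARM ID."""
    return (
        f"/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices"
        f"/workspaces/{workspace}"
        f"/connections/{connection}"
    )


# Inputs for this group, keyed by the parameter names in the table.
llm_model_inputs = {
    "llm_config": llm_config,
    "llm_connection": connection_arm_id("<sub-id>", "<rg>", "<ws>", "<aoai-conn>"),
    "acs_config": acs_config,
    "acs_connection": connection_arm_id("<sub-id>", "<rg>", "<ws>", "<acs-conn>"),
}
```

Both connection inputs are optional, so the two `*_connection` entries can be dropped when they are not needed.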

register settings

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| embeddings_dataset_name | Name of the vector index | string | EmbeddingsOutput | True | |

compute settings

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| serverless_instance_count | Instance count to use for the serverless compute | integer | 1 | True | |
| serverless_instance_type | The Instance Type to be used for the serverless compute | string | Standard_E8s_v3 | True | |

data to import

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| input_data | Input AzureML data asset UriFolder to bring in data from. | uri_folder | | | |

Data Chunker

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| chunk_size | Chunk size (by token) to pass into the text splitter before performing embeddings | integer | 1024 | | |
| chunk_overlap | Overlap of content (by token) between the chunks | integer | 0 | | |
| input_glob | Glob pattern to filter files from the input folder, e.g. `'articles/**/*'` | string | | True | |
| max_sample_files | Number of files read in during QA test data generation | integer | -1 | True | |
| data_source_url | The URL which can be appended to file names to form citation links for documents | string | | | |
| document_path_replacement_regex | A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source URL. E.g. `'{"match_pattern": "(.*)/articles/(.*)(\\.[^.]+)$", "replacement_pattern": "\\1/\\2"}'` would remove '/articles' from the middle of the URL. | string | | True | |
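
The two less obvious settings in this group, `chunk_overlap` and `document_path_replacement_regex`, can be illustrated with a short sketch. It is not the component's implementation: `chunk_tokens` and `rewrite_source_url` are hypothetical helpers, the token list stands in for the output of a real tokenizer, the example URL is made up, and the capture groups in the pattern are assumed to be `(.*)`.

```python
import json
import re


def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size sharing chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def rewrite_source_url(url, replacement_json):
    """Apply match_pattern / replacement_pattern to a URL via re.sub."""
    cfg = json.loads(replacement_json)
    return re.sub(cfg["match_pattern"], cfg["replacement_pattern"], url)


# json.dumps handles the backslash escaping that a JSON string requires.
replacement_json = json.dumps({
    "match_pattern": r"(.*)/articles/(.*)(\.[^.]+)$",
    "replacement_pattern": r"\1/\2",
})

chunks = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
# chunks == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

rewritten = rewrite_source_url(
    "https://example.com/docs/articles/guide/intro.md", replacement_json
)
# rewritten == "https://example.com/docs/guide/intro"
```

With this replacement pattern the file extension captured by the third group is dropped along with '/articles', since only groups 1 and 2 are written back.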

Embeddings components

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| embeddings_container | Folder to contain generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Will compare input data to existing embeddings and only embed changed/new data, reusing existing chunks. | uri_folder | | True | |
| embeddings_model | The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}' | string | azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002 | | |
| embedding_connection | Azure OpenAI workspace connection ARM ID for embeddings | string | | True | |
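
The `embeddings_model` value follows a URI-like layout whose parts depend on the provider. As a rough sketch of how the two documented forms break down (the `parse_embeddings_model` helper is hypothetical and is not the component's own parser):

```python
def parse_embeddings_model(uri):
    """Split an embeddings_model string into its provider-specific parts."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in {uri!r}")
    parts = rest.split("/")
    if scheme == "azure_open_ai":
        # Documented layout: deployment/{deployment_name}/model/{model_name}
        if len(parts) != 4 or parts[0] != "deployment" or parts[2] != "model":
            raise ValueError(f"unexpected azure_open_ai layout: {uri!r}")
        return {
            "kind": scheme,
            "deployment_name": parts[1],
            "model_name": parts[3],
        }
    if scheme == "hugging_face":
        # Documented layout: model/{model id, which may itself contain '/'}
        if len(parts) < 2 or parts[0] != "model":
            raise ValueError(f"unexpected hugging_face layout: {uri!r}")
        return {"kind": scheme, "model_name": "/".join(parts[1:])}
    raise ValueError(f"unsupported provider: {scheme!r}")


default_model = parse_embeddings_model(
    "azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002"
)
hf_model = parse_embeddings_model(
    "hugging_face://model/sentence-transformers/all-mpnet-base-v2"
)
```

In the Azure OpenAI form the deployment name and model name are separate fields, so a deployment does not have to be named after its model.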

Outputs

| Name | Description | Type |
| --- | --- | --- |
| acs_index | Folder containing the ACS MLIndex. Deserialized using azureml.rag.mlindex.MLIndex(uri). | uri_folder |
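
As a minimal sketch of the deserialization mentioned above, assuming the `azureml-rag` package is installed and that `index_uri` points at the `acs_index` output folder (the `load_acs_index` wrapper is a hypothetical name):

```python
def load_acs_index(index_uri):
    """Return an MLIndex object for the folder produced as acs_index."""
    # Imported lazily so this snippet can be read and imported without
    # azureml-rag installed; the call itself requires the package.
    from azureml.rag.mlindex import MLIndex

    return MLIndex(index_uri)


# Usage (not executed here; substitute the real output URI):
# index = load_acs_index("<uri of the acs_index output>")
```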

defaults: compute: azureml:cpu-cluster
