llm_ingest_dataset_to_faiss_basic
Single job pipeline to chunk data from an AzureML data asset and create a FAISS embeddings index
Version: 0.0.82
Preview
View in Studio: https://ml.azure.com/registries/azureml/components/llm_ingest_dataset_to_faiss_basic/version/0.0.82
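The tables below list the pipeline component's inputs, grouped as they appear in the component definition. As a rough orientation, the sketch below shows one way to pull the component from the `azureml` registry and submit it with the azure-ai-ml SDK; the subscription, workspace, and data asset names are placeholders, and only a couple of the inputs documented below are set explicitly.

```python
# Minimal sketch (not an official sample): load this pipeline component from the
# shared "azureml" registry and run it against a uri_folder data asset.
# Placeholders: subscription, resource group, workspace, and the data asset path.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Client scoped to the shared "azureml" registry that hosts the component.
registry_client = MLClient(credential=credential, registry_name="azureml")
ingest = registry_client.components.get(
    name="llm_ingest_dataset_to_faiss_basic", version="0.0.82"
)

# Client scoped to the workspace that will run the pipeline job.
ws_client = MLClient(
    credential=credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

@pipeline(description="Chunk a data asset and build a FAISS index")
def faiss_ingest(input_data):
    # Inputs not set here fall back to the defaults documented in the tables below.
    ingest(
        input_data=input_data,
        embeddings_dataset_name="VectorIndexDS",
    )

job = faiss_ingest(input_data=Input(type="uri_folder", path="azureml:my_docs:1"))
ws_client.jobs.create_or_update(job, experiment_name="faiss-ingest")
```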
llm_model config
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
llm_config | JSON describing the LLM provider and model details to use for prompt generation. | string | `{"type": "azure_open_ai", "model_name": "gpt-35-turbo", "deployment_name": "gpt-35-turbo", "temperature": 0, "max_tokens": 2000}` | | |
llm_connection | Azure OpenAI workspace connection ARM ID | string | | True | |
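The `llm_config` input is passed as a JSON string. A minimal sketch of building an override in Python; the keys mirror the default value shown above, and the values are yours to tune for your deployment:

```python
import json

# Hypothetical llm_config override; the keys mirror the default value above.
llm_config = json.dumps({
    "type": "azure_open_ai",
    "model_name": "gpt-35-turbo",
    "deployment_name": "gpt-35-turbo",
    "temperature": 0,
    "max_tokens": 2000,
})
# Pass this string as the llm_config input when calling the component.
```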
register settings
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
embeddings_dataset_name | Name of the vector index | string | VectorIndexDS | True |
compute settings
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
serverless_instance_count | Instance count to use for the serverless compute | integer | 1 | True | |
serverless_instance_type | The Instance Type to be used for the serverless compute | string | Standard_E8s_v3 | True |
data to import
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
input_data | | uri_folder | | | |
Data Chunker
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
chunk_size | Chunk size (by token) to pass into the text splitter before performing embeddings | integer | 1024 | ||
chunk_overlap | Overlap of content (by token) between the chunks | integer | 0 | ||
input_glob | Glob pattern to filter files from the input folder, e.g. `articles/**/*` | string | | True | |
max_sample_files | Number of files read in during QA test data generation | integer | -1 | True | |
data_source_url | The URL which can be appended to file names to form citation links for documents | string | | | |
document_path_replacement_regex | A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source URL. e.g. `{"match_pattern": "(.*)/articles/(.*)(\.[^.]+)$", "replacement_pattern": "\1/\2"}` would remove '/articles' from the middle of the URL. | string | | True | |
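To make the `document_path_replacement_regex` behaviour concrete, here is a small sketch of how a match/replacement pair is applied with `re.sub`, using the example value from the table above (the URL itself is illustrative):

```python
import json
import re

# Build the JSON string from a dict to avoid hand-escaping backslashes.
# The patterns mirror the example value in the table above.
document_path_replacement_regex = json.dumps({
    "match_pattern": r"(.*)/articles/(.*)(\.[^.]+)$",
    "replacement_pattern": r"\1/\2",
})

# Sketch of how the chunker applies the patterns to a source URL.
spec = json.loads(document_path_replacement_regex)
url = "https://example.com/articles/getting-started.md"
citation_url = re.sub(spec["match_pattern"], spec["replacement_pattern"], url)
print(citation_url)  # -> https://example.com/getting-started
```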
Embeddings components
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
embeddings_container | Folder to contain generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Will compare input data to existing embeddings and only embed changed/new data, reusing existing chunks. | uri_folder | | True | |
embeddings_model | The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}' | string | azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002 | ||
embedding_connection | Azure OpenAI workspace connection ARM ID for embeddings | string | | True | |
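The `embeddings_model` value is a URI-style string; the two forms below restate the examples from the table, with the deployment and model names as placeholders:

```python
# URI-style embeddings_model values, mirroring the examples in the table above.
hf_embeddings_model = "hugging_face://model/sentence-transformers/all-mpnet-base-v2"
aoai_embeddings_model = "azure_open_ai://deployment/{deployment}/model/{model}".format(
    deployment="text-embedding-ada-002",  # your Azure OpenAI deployment name
    model="text-embedding-ada-002",       # the underlying model name
)
```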
Outputs
Name | Description | Type |
---|---|---|
faiss_index | Folder containing the FAISS MLIndex. Deserialized using `azureml.rag.mlindex.MLIndex(uri)`. | uri_folder |
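As the output description notes, the `faiss_index` folder is an MLIndex that can be deserialized with `azureml.rag.mlindex.MLIndex(uri)`. A small consumption sketch, assuming the azureml-rag package is installed; the langchain helper name may vary between package versions, and the datastore URI is a placeholder:

```python
from azureml.rag.mlindex import MLIndex

# Point at the faiss_index output folder (datastore path is a placeholder).
mlindex = MLIndex("azureml://datastores/workspaceblobstore/paths/<faiss-index-output>/")

# Turn the index into a langchain vector store and run a similarity search.
vectorstore = mlindex.as_langchain_vectorstore()
for doc in vectorstore.similarity_search("How do I create a FAISS index?", k=3):
    print(doc.metadata.get("source"), doc.page_content[:80])
```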
Defaults: `compute: azureml:cpu-cluster`
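If needed, the default compute can be overridden on the submitted pipeline job; a one-line sketch with the azure-ai-ml SDK, continuing the `job` object from the first example (the compute name is a placeholder):

```python
# Override the component's default compute ("azureml:cpu-cluster") on the pipeline job.
job.settings.default_compute = "my-cpu-cluster"
```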