components llm_ingest_dataset_to_faiss_user_id
Single job pipeline to chunk data from an AzureML data asset and create a FAISS embeddings index
Version: 0.0.81
Preview
View in Studio: https://ml.azure.com/registries/azureml/components/llm_ingest_dataset_to_faiss_user_id/version/0.0.81
llm_model config
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
llm_config | JSON describing the LLM provider and model details to use for prompt generation (see the example after this table). | string | {"type": "azure_open_ai", "model_name": "gpt-35-turbo", "deployment_name": "gpt-35-turbo", "temperature": 0, "max_tokens": 2000} | | |
llm_connection | Azure OpenAI workspace connection ARM ID | string | | True | |
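A minimal sketch (not part of the component spec) of building the llm_config value in Python, assuming the same Azure OpenAI deployment and model names as the default shown above:

```python
import json

# Sketch: build the llm_config JSON string passed to the component.
# The field names mirror the default value in the table above; the
# deployment/model names are assumptions for illustration only.
llm_config = json.dumps({
    "type": "azure_open_ai",
    "model_name": "gpt-35-turbo",
    "deployment_name": "gpt-35-turbo",
    "temperature": 0,
    "max_tokens": 2000,
})
```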
register settings
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
embeddings_dataset_name | Name of the vector index | string | VectorIndexDS | True | |
compute settings
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
serverless_instance_count | Instance count to use for the serverless compute | integer | 1 | True | |
serverless_instance_type | The Instance Type to be used for the serverless compute | string | Standard_E8s_v3 | True | |
data to import
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
input_data | | uri_folder | | | |
Data Chunker
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
chunk_size | Chunk size (by token) to pass into the text splitter before performing embeddings | integer | 1024 | | |
chunk_overlap | Overlap of content (by token) between the chunks | integer | 0 | | |
input_glob | Glob pattern to filter files from the input folder, e.g. 'articles/**/*' | string | | True | |
max_sample_files | Number of files read in during QA test data generation | integer | -1 | True | |
data_source_url | The URL which can be appended to file names to form citation links for documents | string | | | |
document_path_replacement_regex | A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source URL (see the sketch after this table). e.g. '{"match_pattern": "(.*)/articles/(.*)(\\.[^.]+)$", "replacement_pattern": "\\1/\\2"}' would remove '/articles' from the middle of the URL. | string | | True | |
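The description above mirrors how re.sub is applied; a small self-contained sketch, with a made-up source URL, of what the match/replacement patterns do:

```python
import json
import re

# Sketch of how document_path_replacement_regex is interpreted: the two
# fields are used as the pattern and replacement arguments of re.sub.
# The sample URL below is made up purely for illustration.
replacement = {
    "match_pattern": r"(.*)/articles/(.*)(\.[^.]+)$",
    "replacement_pattern": r"\1/\2",
}
document_path_replacement_regex = json.dumps(replacement)  # value passed to the component

source_url = "https://example.com/docs/articles/how-to/index.md"
citation_url = re.sub(
    replacement["match_pattern"], replacement["replacement_pattern"], source_url
)
print(citation_url)  # https://example.com/docs/how-to/index
```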
Embeddings components
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
embeddings_container | Folder to contain generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Will compare input data to existing embeddings and only embed changed/new data, reusing existing chunks. | uri_folder | | True | |
embeddings_model | The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}' | string | azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002 | | |
embedding_connection | Azure OpenAI workspace connection ARM ID for embeddings | string | | True | |
Outputs
Name | Description | Type |
---|---|---|
faiss_index | Folder containing the FAISS MLIndex. Deserialized using azureml.rag.mlindex.MLIndex(uri). | uri_folder |
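A hedged sketch of consuming the faiss_index output with the azureml-rag package, as the table notes; the local folder path is a placeholder, and the LangChain adapter call assumes a recent azureml-rag release:

```python
from azureml.rag.mlindex import MLIndex

# Sketch: load the faiss_index output produced by this pipeline.
# "./faiss_index" is a placeholder for the downloaded output folder.
mlindex = MLIndex("./faiss_index")

# Assumption: azureml-rag exposes a LangChain vector store adapter;
# adjust if your package version differs.
vectorstore = mlindex.as_langchain_vectorstore()
docs = vectorstore.similarity_search("What does the ingestion pipeline do?", k=3)
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:80])
```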
defaults:
  compute: azureml:cpu-cluster
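For context, a sketch of pulling this component version from the azureml registry and submitting it in a pipeline with the azure-ai-ml SDK; the workspace details, data asset, and the subset of inputs wired here are illustrative assumptions, not the component's required set:

```python
from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

# Sketch: fetch this component from the 'azureml' registry and run it in a
# pipeline. Workspace details and asset names below are placeholders.
credential = DefaultAzureCredential()
registry_client = MLClient(credential, registry_name="azureml")
ws_client = MLClient(
    credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

ingest_component = registry_client.components.get(
    name="llm_ingest_dataset_to_faiss_user_id", version="0.0.81"
)

@dsl.pipeline(name="ingest_to_faiss_demo")
def ingest_to_faiss(input_data):
    # Input names follow the tables above; the values are illustrative only.
    step = ingest_component(
        input_data=input_data,
        embeddings_model="azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002",
        chunk_size=1024,
        data_source_url="https://example.com/docs",
    )
    return {"faiss_index": step.outputs.faiss_index}

pipeline_job = ingest_to_faiss(
    input_data=Input(type="uri_folder", path="azureml:my_docs_asset:1")
)
ws_client.jobs.create_or_update(pipeline_job, experiment_name="faiss-ingest")
```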