components llm_rag_generate_embeddings_parallel - Azure/azureml-assets GitHub Wiki

LLM - Generate Embeddings Parallel

llm_rag_generate_embeddings_parallel

Overview

Generates embedding vectors for data chunks read from chunks_source.

chunks_source is expected to contain CSV files with two columns:

  • "Chunk" - Chunk of text to be embedded
  • "Metadata" - JSON object containing metadata for the chunk
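To make the expected layout concrete, here is a minimal sketch that writes and reads back a chunks file using only the Python standard library. The helper names and the metadata keys ("source", "page") are hypothetical and chosen for illustration; this page only states that the file has a "Chunk" column and a "Metadata" column holding a JSON object.

```python
import csv
import io
import json


def write_chunks_csv(rows, fileobj):
    """Write (chunk_text, metadata_dict) pairs as a two-column CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=["Chunk", "Metadata"])
    writer.writeheader()
    for text, metadata in rows:
        # Metadata is serialized as a JSON object inside a single CSV cell.
        writer.writerow({"Chunk": text, "Metadata": json.dumps(metadata)})


def read_chunks_csv(fileobj):
    """Read the two-column CSV back into (chunk_text, metadata_dict) pairs."""
    reader = csv.DictReader(fileobj)
    return [(row["Chunk"], json.loads(row["Metadata"])) for row in reader]


# Hypothetical example data; the metadata keys are assumptions.
rows = [
    ("Azure ML components are reusable pipeline steps.",
     {"source": "intro.md", "page": 1}),
    ("Text with a comma, a \"quote\", and\na newline.",
     {"source": "notes.md"}),
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
write_chunks_csv(rows, buffer)
buffer.seek(0)
round_tripped = read_chunks_csv(buffer)
```

Using the csv module rather than manual string joins keeps commas, quotes, and newlines inside chunk text correctly escaped.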

If previous embeddings are supplied (the embeddings_container input), input chunks are compared to the existing chunks in the Embeddings Container. Only changed or new chunks are embedded; existing chunks are reused.
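The reuse behaviour can be pictured with the sketch below. It is an illustration under stated assumptions, not the component's implementation: this page does not say how chunks are compared, so the content hash, the dictionary layout, and the names chunk_key, embed_incrementally, and fake_embed are all hypothetical.

```python
import hashlib


def chunk_key(text):
    """Assumed identity for a chunk: SHA-256 of its UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_incrementally(chunks, previous, embed_fn):
    """Embed only chunks whose key is missing from `previous`.

    `previous` maps chunk keys to vectors from an earlier run.
    Returns (key -> vector for the current chunks, count of new embeddings).
    """
    result = {}
    newly_embedded = 0
    for text in chunks:
        key = chunk_key(text)
        if key in previous:
            # Unchanged chunk: reuse the stored vector.
            result[key] = previous[key]
        elif key not in result:
            # New or changed chunk: call the (expensive) embedding function.
            result[key] = embed_fn(text)
            newly_embedded += 1
    return result, newly_embedded


def fake_embed(text):
    """Stand-in for a real embedding model, for demonstration only."""
    return [float(len(text)), float(text.count(" "))]


previous = {chunk_key("unchanged chunk"): [15.0, 1.0]}
current, n_new = embed_incrementally(
    ["unchanged chunk", "brand new chunk"], previous, fake_embed
)
```

Chunks that disappear from the input are simply absent from the result, so the output always mirrors the current contents of the source.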

Version: 0.0.76

Tags

Preview

View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_generate_embeddings_parallel/version/0.0.76

Inputs

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| chunks_source | Folder containing chunks to be embedded. | uri_folder | | | |

If adding to previously generated Embeddings

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| embeddings_container | Folder containing previously generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Input data is compared to the existing embeddings and only changed/new data is embedded; existing chunks are reused. | uri_folder | | True | |

Embeddings settings

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| embeddings_model | The model to use to embed data, e.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}'. | string | hugging_face://model/sentence-transformers/all-mpnet-base-v2 | | |
| deployment_validation | URI file indicating whether the Azure OpenAI deployments, if used, have been validated. | uri_file | | True | |
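The two embeddings_model formats shown above can be checked before a job is submitted. The following is a minimal sketch of such a check; parse_embeddings_model is a hypothetical helper, not part of the component, and it only recognises the two URI shapes given on this page. The deployment name in the usage lines is a made-up placeholder.

```python
def parse_embeddings_model(uri):
    """Split an embeddings_model URI into its parts.

    Handles only the two shapes documented here:
      hugging_face://model/<model_name>
      azure_open_ai://deployment/<deployment_name>/model/<model_name>
    """
    kind, sep, rest = uri.partition("://")
    if not sep or not kind or not rest:
        raise ValueError(f"Not a valid embeddings_model URI: {uri!r}")

    parts = rest.split("/")
    if kind == "hugging_face":
        # Hugging Face model names may themselves contain '/' (org/model).
        if parts[0] != "model" or len(parts) < 2 or not all(parts[1:]):
            raise ValueError(f"Expected hugging_face://model/<name>: {uri!r}")
        return {"kind": kind, "model": "/".join(parts[1:])}

    if kind == "azure_open_ai":
        if (len(parts) != 4 or parts[0] != "deployment"
                or parts[2] != "model" or not parts[1] or not parts[3]):
            raise ValueError(
                "Expected azure_open_ai://deployment/<deployment>/model/<model>: "
                f"{uri!r}"
            )
        return {"kind": kind, "deployment": parts[1], "model": parts[3]}

    raise ValueError(f"Unrecognised embeddings_model kind: {kind!r}")


default_model = parse_embeddings_model(
    "hugging_face://model/sentence-transformers/all-mpnet-base-v2"
)
# "my-deployment" is a placeholder deployment name.
aoai_model = parse_embeddings_model(
    "azure_open_ai://deployment/my-deployment/model/text-embedding-ada-002"
)
```

The string is split by hand because the scheme contains an underscore, which urllib.parse.urlparse does not treat as a valid scheme character.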

Outputs

| Name | Description | Type |
| --- | --- | --- |
| embeddings | Where to save data with embeddings. This should be a subfolder of the previous embeddings if supplied, typically named using '${name}', e.g. /my/prev/embeddings/${name}. | uri_folder |
| processed_file_names | Text file containing the names of the files that were processed. | uri_file |