components oss_distillation_generate_data_batch_preprocess - Azure/azureml-assets GitHub Wiki

OSS Distillation Generate Data Batch Scoring Preprocess

oss_distillation_generate_data_batch_preprocess

Overview

Component that prepares data for invoking the teacher model endpoint in batch.

Version: 0.0.1

View in Studio: https://ml.azure.com/registries/azureml/components/oss_distillation_generate_data_batch_preprocess/version/0.0.1

Inputs

Name	Description	Type	Default	Optional	Enum
train_file_path	Path to the registered training data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file
validation_file_path	Path to the registered validation data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file		True
teacher_model_endpoint_name	Teacher model endpoint name	string		True
teacher_model_endpoint_url	Teacher model endpoint url	string		True
teacher_model_endpoint_key	Teacher model endpoint key	string		True
teacher_model_max_new_tokens	Teacher model max_new_tokens inference parameter	integer	128
teacher_model_temperature	Teacher model temperature inference parameter	number	0.2
teacher_model_top_p	Teacher model top_p inference parameter	number	0.1
teacher_model_frequency_penalty	Teacher model frequency penalty inference parameter	number	0.0
teacher_model_presence_penalty	Teacher model presence penalty inference parameter	number	0.0
teacher_model_stop	Teacher model stop inference parameter	string		True
enable_chain_of_thought	Enable chain-of-thought prompting for data generation	string	false	True
enable_chain_of_density	Enable chain-of-density prompting for text summarization	string	false	True
max_len_summary	Maximum summary length for text summarization	integer	80	True
data_generation_task_type	Data generation task type. Supported values: NLI (natural language inference data), CONVERSATION (single-turn or multi-turn conversational data), NLU_QA (natural language understanding data for question answering), MATH (math data with numerical responses), SUMMARIZATION (key summary of an article).	string			['NLI', 'CONVERSATION', 'NLU_QA', 'MATH', 'SUMMARIZATION']
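
The defaults and the enum in the table above can be sketched in Python as follows. This is an illustration only: the `TeacherInferenceParams` class and the `validate_task_type` and `to_request_parameters` helpers are hypothetical names, not part of the component, and the dictionary keys are an assumption about how the parameters might be passed on, since this page does not describe the payload format.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Enum values copied from the data_generation_task_type row.
SUPPORTED_TASK_TYPES = ("NLI", "CONVERSATION", "NLU_QA", "MATH", "SUMMARIZATION")


@dataclass
class TeacherInferenceParams:
    """Hypothetical container; defaults mirror the Inputs table."""

    max_new_tokens: int = 128
    temperature: float = 0.2
    top_p: float = 0.1
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    stop: Optional[str] = None  # optional input with no default


def validate_task_type(task_type: str) -> str:
    """Return task_type unchanged if it is one of the documented enum values."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(
            f"Unsupported task type {task_type!r}; expected one of {SUPPORTED_TASK_TYPES}"
        )
    return task_type


def to_request_parameters(params: TeacherInferenceParams) -> dict:
    """Convert the parameters to a dict, dropping optional values left unset."""
    return {key: value for key, value in asdict(params).items() if value is not None}
```

Calling `to_request_parameters(TeacherInferenceParams())` yields only the five parameters that have defaults, because `stop` is unset. The check in `validate_task_type` is case-sensitive, matching the upper-case enum values listed in the table.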

Output of validation component.

Name	Description	Type	Default	Optional	Enum
validation_info	Validation status.	uri_file		True

Outputs

Name	Description	Type
generated_train_payload_path	Directory containing the training payload to be sent to the teacher model.	mltable
generated_validation_payload_path	Directory containing the validation payload to be sent to the teacher model.	mltable
hash_train_data	JSONL file containing the hash of each training payload.	uri_file
hash_validation_data	JSONL file containing the hash of each validation payload.	uri_file
batch_config_connection	Config file path that contains deployment configurations	uri_file
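
To illustrate what a per-payload hash file could look like, here is a minimal sketch. This page does not state the hash algorithm or the record layout used by the component, so SHA-256 over canonical JSON and the `hash` key are assumptions, and `payload_hash` and `hash_records` are hypothetical helper names.

```python
import hashlib
import json
from typing import Iterable, List


def payload_hash(payload: dict) -> str:
    """Hash a payload deterministically (assumed scheme: SHA-256 of canonical JSON)."""
    # Sorting keys and fixing separators makes the hash independent of key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hash_records(payloads: Iterable[dict]) -> List[str]:
    """Return one JSONL line per payload, each holding that payload's hash."""
    return [json.dumps({"hash": payload_hash(p)}) for p in payloads]
```

Writing the returned lines to a file, one per line, gives a JSONL file with one hash per payload. Because the JSON is canonicalized before hashing, two payloads that differ only in key order produce the same hash.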

Environment

azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/76
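
The environment reference above follows the `azureml://registries/<registry>/<asset type>/<name>/versions/<version>` pattern. As a small sketch, it can be split into its parts like this; `RegistryAsset` and `parse_registry_asset` are hypothetical names introduced for illustration, not part of any Azure SDK.

```python
import re
from typing import NamedTuple


class RegistryAsset(NamedTuple):
    registry: str
    asset_type: str
    name: str
    version: str


# Matches URIs shaped like the environment reference shown on this page.
_ASSET_URI = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)/(?P<asset_type>[^/]+)"
    r"/(?P<name>[^/]+)/versions/(?P<version>[^/]+)$"
)


def parse_registry_asset(uri: str) -> RegistryAsset:
    """Split a registry asset URI into registry, asset type, name, and version."""
    match = _ASSET_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a registry asset URI: {uri!r}")
    return RegistryAsset(**match.groupdict())
```

For the environment above, `parse_registry_asset` returns registry `azureml`, asset type `environments`, name `acft-hf-nlp-gpu`, and version `76`. The version is kept as a string because the pattern accepts any non-slash characters there.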
