components oss_distillation_generate_data_batch_preprocess - Azure/azureml-assets GitHub Wiki

OSS Distillation Generate Data Batch Scoring Preprocess

oss_distillation_generate_data_batch_preprocess

Overview

Component to prepare data to invoke teacher model enpoint in batch

Version: 0.0.1

View in Studio: https://ml.azure.com/registries/azureml/components/oss_distillation_generate_data_batch_preprocess/version/0.0.1

Inputs

Inputs

Name Description Type Default Optional Enum
train_file_path Path to the registered training data asset. The supported data formats are jsonl, json, csv, tsv and parquet. uri_file
validation_file_path Path to the registered validation data asset. The supported data formats are jsonl, json, csv, tsv and parquet. uri_file True
teacher_model_endpoint_name Teacher model endpoint name string True
teacher_model_endpoint_url Teacher model endpoint url string True
teacher_model_endpoint_key Teacher model endpoint key string True
teacher_model_max_new_tokens Teacher model max_new_tokens inference parameter integer 128
teacher_model_temperature Teacher model temperature inference parameter number 0.2
teacher_model_top_p Teacher model top_p inference parameter number 0.1
teacher_model_frequency_penalty Teacher model frequency penalty inference parameter number 0.0
teacher_model_presence_penalty Teacher model presence penalty inference parameter number 0.0
teacher_model_stop Teacher model stop inference parameter string True
enable_chain_of_thought Enable Chain of thought for data generation string false True
enable_chain_of_density Enable Chain of density for text summarization string false True
max_len_summary Maximum Length Summary for text summarization integer 80 True
data_generation_task_type Data generation task type. Supported values are: 1. NLI: Generate Natural Language Inference data 2. CONVERSATION: Generate conversational data (multi/single turn) 3. NLU_QA: Generate Natural Language Understanding data for Question Answering data 4. MATH: Generate Math data for numerical responses 5. SUMMARIZATION: Generate Key Summary for an Article string ['NLI', 'CONVERSATION', 'NLU_QA', 'MATH', 'SUMMARIZATION']

Output of validation component.

Name Description Type Default Optional Enum
validation_info Validation status. uri_file True

Outputs

Name Description Type
generated_train_payload_path directory containing the payload to be sent to the model. mltable
generated_validation_payload_path directory containing the payload to be sent to the model. mltable
hash_train_data jsonl file containing the hash for each payload. uri_file
hash_validation_data jsonl file containing the hash for each payload. uri_file
batch_config_connection Config file path that contains deployment configurations uri_file

Environment

azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/76

⚠️ **GitHub.com Fallback** ⚠️