components oss_distillation_generate_data - Azure/azureml-assets GitHub Wiki
Component to generate data from a teacher model endpoint
Version: 0.0.8
View in Studio: https://ml.azure.com/registries/azureml/components/oss_distillation_generate_data/version/0.0.8
Inputs
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
train_file_path | Path to the registered training data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | | |
validation_file_path | Path to the registered validation data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
teacher_model_endpoint_name | Teacher model endpoint name | string | | True | |
teacher_model_endpoint_url | Teacher model endpoint URL | string | | True | |
teacher_model_endpoint_key | Teacher model endpoint key | string | | True | |
teacher_model_max_new_tokens | Teacher model max_new_tokens inference parameter | integer | 128 | | |
teacher_model_temperature | Teacher model temperature inference parameter | number | 0.2 | | |
teacher_model_top_p | Teacher model top_p inference parameter | number | 0.1 | | |
teacher_model_frequency_penalty | Teacher model frequency penalty inference parameter | number | 0.0 | | |
teacher_model_presence_penalty | Teacher model presence penalty inference parameter | number | 0.0 | | |
teacher_model_stop | Teacher model stop inference parameter | string | | True | |
request_batch_size | Number of data records sent to the teacher model endpoint in one batch | integer | 10 | | |
min_endpoint_success_ratio | The minimum value of (successful_requests / total_requests) required for classifying inference as successful. If (successful_requests / total_requests) < min_endpoint_success_ratio, the experiment will be marked as failed. The default is 0.7 (0 means all requests are allowed to fail, while 1 means no request may fail). | number | 0.7 | | |
enable_chain_of_thought | Enable chain of thought for data generation | string | false | True | |
enable_chain_of_density | Enable chain of density for text summarization | string | false | True | |
max_len_summary | Maximum summary length for text summarization | integer | 80 | True | |
data_generation_task_type | Data generation task type. Supported values are: NLI (generate Natural Language Inference data); CONVERSATION (generate conversational data, multi or single turn); NLU_QA (generate Natural Language Understanding data for question answering); MATH (generate math data for numerical responses); SUMMARIZATION (generate a key summary for an article). | string | | | ['NLI', 'CONVERSATION', 'NLU_QA', 'MATH', 'SUMMARIZATION'] |
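The interaction between request_batch_size and min_endpoint_success_ratio can be illustrated with a minimal Python sketch. It is not the component's implementation: the function names `batched` and `run_inference`, the `call_teacher` callable, and the choice to send each batch concurrently with a thread pool are all assumptions made for illustration. Only the two parameter names and the failure rule (successful_requests / total_requests < min_endpoint_success_ratio) come from the table above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Optional, Sequence, Tuple


def batched(records: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices holding at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def run_inference(
    records: Sequence[dict],
    call_teacher: Callable[[dict], dict],  # hypothetical client for the endpoint
    request_batch_size: int = 10,
    min_endpoint_success_ratio: float = 0.7,
) -> Tuple[List[Optional[dict]], float]:
    """Send records batch by batch and enforce the success-ratio threshold."""
    if request_batch_size < 1:
        raise ValueError("request_batch_size must be a positive integer")
    if not 0.0 <= min_endpoint_success_ratio <= 1.0:
        raise ValueError("min_endpoint_success_ratio must lie in [0, 1]")

    responses: List[Optional[dict]] = []
    successes = 0
    with ThreadPoolExecutor(max_workers=request_batch_size) as pool:
        for batch in batched(records, request_batch_size):
            # Assumption: the records of one batch are sent concurrently.
            futures = [pool.submit(call_teacher, record) for record in batch]
            for future in futures:
                try:
                    responses.append(future.result())
                    successes += 1
                except Exception:
                    # A failed request keeps its position but has no response.
                    responses.append(None)

    total = len(records)
    ratio = successes / total if total else 1.0
    if ratio < min_endpoint_success_ratio:
        raise RuntimeError(
            f"success ratio {ratio:.2f} is below the required "
            f"{min_endpoint_success_ratio:.2f}"
        )
    return responses, ratio
```

With the defaults, a run over 10 records tolerates up to 3 failed requests (7 / 10 = 0.7 still passes), while a fourth failure drops the ratio to 0.6 and the run is treated as failed.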
Output of validation component.
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
validation_output | Validation status. | uri_file |
Outputs
Name | Description | Type |
---|---|---|
generated_train_file_path | Generated train data | uri_file |
generated_validation_file_path | Generated validation data | uri_file |
Environment
azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/76
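The asset URI above packs the registry, asset type, asset name and version into one string. The following Python sketch splits it into those parts. The `RegistryAsset` type, the `parse_registry_asset` helper and the regular expression are hypothetical, and the pattern is inferred only from the single URI shown on this page, not from a published specification.

```python
import re
from typing import NamedTuple


class RegistryAsset(NamedTuple):
    registry: str
    asset_type: str
    name: str
    version: str


# Assumed layout: azureml://registries/<registry>/<asset_type>/<name>/versions/<version>
_ASSET_URI = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)"
    r"/(?P<asset_type>[^/]+)"
    r"/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$"
)


def parse_registry_asset(uri: str) -> RegistryAsset:
    """Split a registry asset URI into its four components."""
    match = _ASSET_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised registry asset URI: {uri!r}")
    return RegistryAsset(**match.groupdict())


asset = parse_registry_asset(
    "azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/76"
)
print(asset.name, asset.version)
```

The version is kept as a string because the sketch does not assume that versions are always integers.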