components dataset_preprocessor - Azure/azureml-assets GitHub Wiki

Dataset Preprocessor

dataset_preprocessor

Overview

Dataset Preprocessor

Version: 0.0.9

View in Studio: https://ml.azure.com/registries/azureml/components/dataset_preprocessor/version/0.0.9

Inputs

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| dataset | Path to load the dataset. | uri_file | | False | |
| template_input | JSON-serialized dictionary that drives preprocessing of the dataset. Each key is the name of an output column enclosed in double quotes, and each value is a Jinja template expression used to extract the corresponding value from the dataset. Example format: `{"<user_column_name>": {{key in the json file for this column}}, ....}`. The processed output is written to a .jsonl file in this format: `{"<user_column_name>": "", ....}`. | string | | True | |
| script_path | Path to a custom preprocessor Python script provided by the user. If both this input and `template_input` are provided, `template_input` is ignored. This [base template](https://github.com/Azure/azureml-assets/tree/main/assets/aml-benchmark/scripts/custom_dataset_preprocessors/base_preprocessor_template.py) should be used to create a custom preprocessor script. | uri_file | | True | |
| encoder_config | JSON-serialized dictionary used to map column values. Must contain the key-value pair `"column_name": "<actual_column_name>"` naming the column whose values need mapping, followed by key-value pairs that define the id-to-label or label-to-id mapping. Example format: `{"column_name":"label", "0":"NEUTRAL", "1":"ENTAILMENT", "2":"CONTRADICTION"}` | string | | True | |
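
To make the `encoder_config` format concrete, the following is a minimal sketch of how such a mapping could be applied to one record, using the example value from the `encoder_config` description above. The helper `apply_encoder` and the sample row are hypothetical, and passing unmapped values through unchanged is an assumption. The component's actual implementation may differ.

```python
import json

# Example encoder_config value from the description above.
encoder_config = (
    '{"column_name": "label", "0": "NEUTRAL", '
    '"1": "ENTAILMENT", "2": "CONTRADICTION"}'
)


def apply_encoder(row: dict, config_json: str) -> dict:
    """Return a copy of `row` with the configured column mapped (hypothetical helper)."""
    config = json.loads(config_json)
    # "column_name" names the column to map; the remaining pairs are the mapping.
    column = config.pop("column_name")
    encoded = dict(row)
    # JSON object keys are always strings, so look up the string form of the value.
    key = str(row[column])
    # Assumption: a value without a mapping entry is passed through unchanged.
    encoded[column] = config.get(key, row[column])
    return encoded


row = {"premise": "A man is running.", "label": 1}  # made-up sample record
print(apply_encoder(row, encoder_config))
```

Because JSON object keys are strings, integer ids such as `1` are converted with `str` before the lookup. A label-to-id mapping would work the same way with the keys and values swapped.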

Outputs

| Name | Description | Type |
| --- | --- | --- |
| output_dataset | Path to the processed output .jsonl file. | uri_file |
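
As an illustration of how a `template_input` value relates to the lines of the output .jsonl file, here is a rough sketch. It does not use the Jinja engine. Instead it uses a simplified regex stand-in that handles only bare `{{ key }}` placeholders. The template, the record fields, and the helper `render_row` are hypothetical, and JSON-encoding each substituted value is an assumption about quoting.

```python
import json
import re

# Hypothetical template_input: output column names mapped to keys of the source record.
template_input = '{"question": {{query}}, "answer": {{response}}}'

# Simplified stand-in for Jinja: matches only bare {{ key }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_row(template: str, record: dict) -> dict:
    """Fill each {{ key }} with the JSON-encoded value from `record`, then parse the result."""
    rendered = PLACEHOLDER.sub(lambda m: json.dumps(record[m.group(1)]), template)
    return json.loads(rendered)


# Made-up source records; the "source" field is not referenced by the template.
records = [
    {"query": "What is 2 + 2?", "response": "4", "source": "demo"},
    {"query": "Capital of France?", "response": "Paris", "source": "demo"},
]

# One JSON object per line, which is the .jsonl layout.
jsonl_text = "\n".join(json.dumps(render_row(template_input, r)) for r in records)
print(jsonl_text)
```

Only the columns named in the template appear in each output line, so the `source` field of the sample records is dropped.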

Environment

azureml://registries/azureml/environments/model-evaluation/labels/latest
