components dataset_preprocessor - Azure/azureml-assets GitHub Wiki

Dataset Preprocessor

dataset_preprocessor

Overview

Dataset Preprocessor

Version: 0.0.11

View in Studio: https://ml.azure.com/registries/azureml/components/dataset_preprocessor/version/0.0.11

Inputs

Name	Description	Type	Default	Optional	Enum
dataset	Path to load the dataset.	uri_file		False
template_input	JSON-serialized dictionary that defines the preprocessing to apply to the dataset. Each key is an output column name enclosed in " ", and each value is a Jinja template expression used to extract the corresponding value from the dataset. Example format: {"<user_column_name>": {{key in the json file for this column}}, ....}. The processed output is written to a .jsonl file in this format: {"<user_column_name>": "", ....}.	string		True
script_path	Path to the custom preprocessor Python script provided by the user. If both this input and `template_input` are provided, `template_input` is ignored. This [base template](https://github.com/Azure/azureml-assets/tree/main/assets/aml-benchmark/scripts/custom_dataset_preprocessors/base_preprocessor_template.py) should be used to create a custom preprocessor script.	uri_file		True
encoder_config	JSON-serialized dictionary that defines a value mapping. Must contain the key-value pair "column_name": "<actual_column_name>", naming the column whose values need mapping, followed by the key-value pairs of the idtolabel or labeltoid mapping. Example format: {"column_name":"label", "0":"NEUTRAL", "1":"ENTAILMENT", "2":"CONTRADICTION"}	string		True
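
To make the shape of `template_input` and `encoder_config` concrete, here is a minimal sketch of how one source record might become one line of output. It is an illustration only, not the component's implementation: the source keys (`sentence1`, `sentence2`, `gold_label`), the output column names, and the helper functions are hypothetical, and a small regex substitution stands in for real Jinja rendering. The pass-through of unmapped values is also an assumption.

```python
import json
import re

# Hypothetical template_input value: quoted keys are the output column names,
# and each {{ ... }} placeholder names a key in the source record.
template_input = (
    '{"premise": {{sentence1}}, "hypothesis": {{sentence2}}, "label": {{gold_label}}}'
)

# encoder_config example taken from the Inputs table.
encoder_config = json.dumps(
    {"column_name": "label", "0": "NEUTRAL", "1": "ENTAILMENT", "2": "CONTRADICTION"}
)

# Matches placeholders such as {{name}} or {{ name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, record: dict) -> dict:
    """Stand-in for Jinja: swap each placeholder for the JSON-encoded source value."""
    rendered = PLACEHOLDER.sub(lambda m: json.dumps(record[m.group(1)]), template)
    return json.loads(rendered)


def apply_encoder(row: dict, config: str) -> dict:
    """Map the configured column through the id/label pairs (assumed behavior)."""
    mapping = json.loads(config)
    column = mapping.pop("column_name")
    encoded = dict(row)
    # JSON object keys are strings, so compare against the stringified value.
    # Values without a mapping entry are passed through unchanged (assumption).
    encoded[column] = mapping.get(str(row[column]), row[column])
    return encoded


def preprocess(record: dict) -> str:
    """Return one .jsonl line for one source record."""
    row = render_template(template_input, record)
    return json.dumps(apply_encoder(row, encoder_config))


source_record = {
    "sentence1": "A man is eating.",
    "sentence2": "Someone eats.",
    "gold_label": 1,
}
print(preprocess(source_record))
# {"premise": "A man is eating.", "hypothesis": "Someone eats.", "label": "ENTAILMENT"}
```

Note that the `template_input` string is not valid JSON until its placeholders are filled in, which is why the sketch renders first and parses second.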

Outputs

Name	Description	Type
output_dataset	Path to the processed output .jsonl file.	uri_file
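
Because the output is a .jsonl file (one JSON object per line), it can be inspected with the standard library alone. The sketch below assumes a local copy of the file; the file name, the sample rows, and the `read_jsonl` helper are hypothetical and only serve to make the example self-contained.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a .jsonl file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Write a small stand-in for output_dataset, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output_dataset.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('{"premise": "A man is eating.", "label": "ENTAILMENT"}\n')
        handle.write('{"premise": "It is raining.", "label": "NEUTRAL"}\n')
    rows = read_jsonl(demo_path)

print(len(rows), rows[0]["label"])
```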

Environment

azureml://registries/azureml/environments/model-evaluation/labels/latest
