components text_generation_datapreprocess - Azure/azureml-assets GitHub Wiki

Text Generation DataPreProcess

text_generation_datapreprocess

Overview

Component that preprocesses data for the text generation task.

Version: 0.0.79

View in Studio: https://ml.azure.com/registries/azureml/components/text_generation_datapreprocess/version/0.0.79

Inputs

Text Generation task arguments

Name	Description	Type	Default	Optional	Enum
text_key	Key for text in an example. Format your data keeping in mind that, during fine-tuning, text is concatenated with ground_truth in the form text + ground_truth. For example, "text"="knock knock\n", "ground_truth"="who's there" is treated as "knock knock\nwho's there".	string		False
ground_truth_key	Key for ground_truth in an example. A separate column is used for ground_truth so that tasks such as summarization, translation, and question answering can be recast as text generation, where both text and ground_truth are needed. This separation is also useful for calculating metrics. For example, "text"="Summarize this dialog:\n{input_dialogue}\nSummary:\n", "ground_truth"="{summary of the dialogue}".	string		True
batch_size	Number of examples to batch before calling the tokenization function	integer	1000	True
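
The concatenation and batching described in the table above can be sketched in plain Python. This is a minimal illustration, not the component's actual implementation: the helper names `build_training_text` and `iter_batches` are hypothetical, and treating a missing `ground_truth_key` as "use the text only" is an assumption.

```python
from typing import Dict, Iterator, List, Optional


def build_training_text(example: Dict[str, str], text_key: str,
                        ground_truth_key: Optional[str] = None) -> str:
    """Join text and ground truth as text + ground_truth (hypothetical helper)."""
    text = example[text_key]
    # Assumption: when no ground_truth_key is given, only the text is used.
    ground_truth = example[ground_truth_key] if ground_truth_key else ""
    return text + ground_truth


def iter_batches(examples: List[Dict[str, str]],
                 batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive groups of at most batch_size examples (hypothetical helper)."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


example = {"text": "knock knock\n", "ground_truth": "who's there"}
joined = build_training_text(example, "text", "ground_truth")
print(repr(joined))  # "knock knock\nwho's there"
```

Because the two fields are joined with no separator, any whitespace or newline that should appear between prompt and answer has to be part of the `text` value itself, as in the `"knock knock\n"` example.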

Tokenization params

Name	Description	Type	Default	Optional	Enum
pad_to_max_length	If set to true, the returned sequences are padded according to the model's padding side and padding index, up to `max_seq_length`. If no `max_seq_length` is specified, padding is done up to the model's max length.	string	false	True	['true', 'false']
max_seq_length	The default is -1, which means padding is done up to the model's max length. Otherwise, sequences are padded to `max_seq_length`.	integer	-1	True
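
The interaction between `pad_to_max_length` and `max_seq_length` can be summarized with a small sketch. The function `resolve_pad_length` and its `model_max_length` argument are hypothetical names used only for illustration. Returning `None` when `pad_to_max_length` is `false` is an assumption, since the table does not describe that case.

```python
from typing import Optional


def resolve_pad_length(pad_to_max_length: str, max_seq_length: int,
                       model_max_length: int) -> Optional[int]:
    """Return the length sequences are padded to, or None (hypothetical helper)."""
    # The component takes pad_to_max_length as the string 'true' or 'false'.
    if pad_to_max_length.lower() != "true":
        # Assumption: no fixed-length padding is applied in this case.
        return None
    # -1 is the documented default: fall back to the model's max length.
    if max_seq_length == -1:
        return model_max_length
    return max_seq_length
```

For instance, under these assumptions a model with a max length of 2048 would be padded to 2048 with the default `max_seq_length`, and to 512 when `max_seq_length` is set to 512.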

Inputs

Name	Description	Type	Default	Optional	Enum
train_file_path	Path to the registered training data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file		True
validation_file_path	Path to the registered validation data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file		True
test_file_path	Path to the registered test data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`.	uri_file		True
train_mltable_path	Path to the registered training data asset in `mltable` format.	mltable		True
validation_mltable_path	Path to the registered validation data asset in `mltable` format.	mltable		True
test_mltable_path	Path to the registered test data asset in `mltable` format.	mltable		True
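
As an illustration of the `jsonl` format, the sketch below writes a small training file with one JSON object per line and reads it back. The file name `train.jsonl`, the column names `text` and `ground_truth`, and the sample rows are all made up for this example. The column names only need to match the values passed as `text_key` and `ground_truth_key`.

```python
import json
from pathlib import Path

# Made-up sample rows; the keys must match text_key and ground_truth_key.
rows = [
    {"text": "Summarize this dialog:\nA: Hi\nB: Hello\nSummary:\n",
     "ground_truth": "A and B greet each other."},
    {"text": "knock knock\n", "ground_truth": "who's there"},
]

out_path = Path("train.jsonl")  # hypothetical file name

# json.dumps escapes embedded newlines, so each example stays on one line.
with out_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm it round-trips.
with out_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

The resulting file would then need to be registered as a data asset before its path is supplied as `train_file_path`.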

Dataset parameters

Name	Description	Type	Default	Optional	Enum
model_selector_output	Output folder of the model selector, containing model metadata such as config, checkpoints, and tokenizer config.	uri_folder		False

Validation parameters

Name	Description	Type	Default	Optional	Enum
system_properties	Validation parameters propagated from pipeline.	string		True

Outputs

Name	Description	Type
output_dir	Folder containing the tokenized train, validation, and test data, along with the tokenizer files used to tokenize the data.	uri_folder

Environment

azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/105
