components text_generation_datapreprocess - Azure/azureml-assets GitHub Wiki
Component to preprocess data for the text generation task.
Version: 0.0.79
View in Studio: https://ml.azure.com/registries/azureml/components/text_generation_datapreprocess/version/0.0.79
Text Generation task arguments
| Name | Description | Type | Default | Optional | Enum |
|---|---|---|---|---|---|
| text_key | Key for the text field in each example. During finetuning, text is concatenated with ground_truth in the form text + ground_truth, so format your data accordingly. For example, "text"="knock knock\n" and "ground_truth"="who's there" are treated as "knock knock\nwho's there". | string | | False | |
| ground_truth_key | Key for the ground_truth field in each example. Ground truth is kept in a separate column so that use cases such as summarization, translation, and question answering can be recast as text generation, where both text and ground_truth are needed. The separation is also useful for calculating metrics. For example, "text"="Summarize this dialog:\n{input_dialogue}\nSummary:\n" and "ground_truth"="{summary of the dialogue}". | string | | True | |
| batch_size | Number of examples to batch before calling the tokenization function. | integer | 1000 | True | |
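
The concatenation described for text_key and ground_truth_key can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the component's implementation: the helper `build_training_string` is hypothetical, and the keys `text` and `ground_truth` are chosen only to match the example in the table.

```python
from typing import Optional


def build_training_string(example: dict, text_key: str,
                          ground_truth_key: Optional[str] = None) -> str:
    """Join text and ground truth as text + ground_truth (hypothetical helper)."""
    text = example[text_key]
    # ground_truth_key is optional; in this sketch, omitting it means only the text is used.
    ground_truth = example[ground_truth_key] if ground_truth_key else ""
    return text + ground_truth


# Sample record taken from the description of text_key.
example = {"text": "knock knock\n", "ground_truth": "who's there"}
combined = build_training_string(example, "text", "ground_truth")
print(repr(combined))
```

Because no separator is inserted between the two fields, any whitespace or delimiter the model should see (such as the trailing newline above) has to be part of the text value itself.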
Tokenization params
| Name | Description | Type | Default | Optional | Enum |
|---|---|---|---|---|---|
| pad_to_max_length | If set to true, the returned sequences are padded according to the model's padding side and padding index, up to max_seq_length. If no max_seq_length is specified, sequences are padded up to the model's max length. | string | false | True | ['true', 'false'] |
| max_seq_length | Length to which sequences are padded. The default of -1 means padding is done up to the model's max length; any other value pads to max_seq_length. | integer | -1 | True | |
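
The interaction between pad_to_max_length and max_seq_length can be sketched as follows. This is an illustrative sketch only, not the component's tokenization code. The function names, the `model_max_length` value, the `pad_id` of 0, and right-side padding are all assumptions; in practice the padding side and padding index come from the model's tokenizer. The component takes pad_to_max_length as the string 'true' or 'false', which is represented here as a boolean. Truncation of longer sequences is not modeled.

```python
from typing import List


def resolve_target_length(max_seq_length: int, model_max_length: int) -> int:
    """Return the padding target: -1 falls back to the model's max length."""
    return model_max_length if max_seq_length == -1 else max_seq_length


def pad_ids(token_ids: List[int], pad_to_max_length: bool, max_seq_length: int,
            model_max_length: int, pad_id: int = 0) -> List[int]:
    """Right-pad token_ids to the resolved target length (hypothetical helper)."""
    if not pad_to_max_length:
        return list(token_ids)
    target = resolve_target_length(max_seq_length, model_max_length)
    # Sequences already at or beyond the target are returned unchanged in this sketch.
    missing = max(0, target - len(token_ids))
    return list(token_ids) + [pad_id] * missing


padded_explicit = pad_ids([5, 6, 7], True, max_seq_length=5, model_max_length=8)
padded_default = pad_ids([5, 6, 7], True, max_seq_length=-1, model_max_length=8)
unpadded = pad_ids([5, 6, 7], False, max_seq_length=5, model_max_length=8)
```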
Inputs
| Name | Description | Type | Default | Optional | Enum |
|---|---|---|---|---|---|
| train_file_path | Path to the registered training data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
| validation_file_path | Path to the registered validation data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
| test_file_path | Path to the registered test data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
| train_mltable_path | Path to the registered training data asset in mltable format. | mltable | | True | |
| validation_mltable_path | Path to the registered validation data asset in mltable format. | mltable | | True | |
| test_mltable_path | Path to the registered test data asset in mltable format. | mltable | | True | |
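
As an illustration of one of the supported file formats, the sketch below writes a small jsonl training file using only the Python standard library. The records, the field names `text` and `ground_truth`, and the helpers `write_jsonl` and `read_jsonl` are made up for this example. Registering the resulting file as a data asset, or passing it as a uri_file input, is a separate step that is not shown.

```python
import json
import os
import tempfile

# Made-up sample records; the keys would be passed as text_key and ground_truth_key.
records = [
    {"text": "Summarize this dialog:\nA: Hi\nB: Hello\nSummary:\n",
     "ground_truth": "A and B greet each other."},
    {"text": "knock knock\n", "ground_truth": "who's there"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines, to check that it round-trips."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


train_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(train_path, records)
loaded = read_jsonl(train_path)
```

Since `json.dumps` escapes newlines inside string values, each record stays on a single line even when the text itself contains `\n`.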
Dataset parameters
| Name | Description | Type | Default | Optional | Enum |
|---|---|---|---|---|---|
| model_selector_output | Output folder of the model selector, containing model metadata such as config, checkpoints, and tokenizer config. | uri_folder | | False | |
Validation parameters
| Name | Description | Type | Default | Optional | Enum |
|---|---|---|---|---|---|
| system_properties | Validation parameters propagated from pipeline. | string | | True | |
Outputs
| Name | Description | Type |
|---|---|---|
| output_dir | Folder containing the tokenized train, validation, and test data, along with the tokenizer files used to tokenize the data. | uri_folder |
Environment: azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/105