components chat_completion_datapreprocess - Azure/azureml-assets GitHub Wiki
Component to preprocess data for chat completion task. See docs to learn more.
Version: 0.0.65
View in Studio: https://ml.azure.com/registries/azureml/components/chat_completion_datapreprocess/version/0.0.65
Task arguments
Sample example
[ { "messages": [ { "content": "how can identity protection services help protect me against identity theft", "role": "user" }, { "content": "Identity protection services can help protect you against identity theft in several ways:\n\n1. Monitoring: Many identity protection services monitor your credit reports, public records, and other sources for signs of identity theft. If they detect any suspicious activity, they will alert you so you can take action.\n2. Credit freeze: Some identity protection services can help you freeze your credit, which makes it more difficult for thieves to open new accounts in your name.\n3. Identity theft insurance: Some identity protection services offer insurance that can help you recover financially if you become a victim of identity theft.\n4. Assistance: Many identity protection services offer assistance if you become a victim of identity theft. They can help you file a police report, contact credit bureaus, and other steps to help you restore your identity.\n\nOverall, identity protection services can provide you with peace of mind and help you take proactive steps to protect your identity. However, it's important to note that no service can completely guarantee that you will never become a victim of identity theft. It's still important to take steps to protect your own identity, such as being cautious with personal information and regularly monitoring your credit reports.", "role": "assistant" } ] } ]
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
batch_size | Number of examples to batch before calling the tokenization function | integer | 1000 | True |
Tokenization params
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
pad_to_max_length | If set to True, the returned sequences will be padded according to the model's padding side and padding index, up to their max_seq_length . If no max_seq_length is specified, the padding is done up to the model's max length. |
string | false | True | ['true', 'false'] |
max_seq_length | Controls the maximum length to use when pad_to_max_length parameter is set to true . Default is -1 which means the padding is done up to the model's max length. Else will be padded to max_seq_length . |
integer | -1 | True |
Data inputs Please note that either train_file_path
or train_mltable_path
needs to be passed. In case both are passed, mltable path
will take precedence. The validation and test paths are optional and an automatic split from train data happens if they are not passed. If both validation and test files are missing, 10% of train data will be assigned to each of them and the remaining 80% will be used for training If anyone of the file is missing, 20% of the train data will be assigned to it and the remaining 80% will be used for training
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
train_file_path | Path to the registered training data asset. The supported data formats are jsonl , json , csv , tsv and parquet . |
uri_file | True | ||
validation_file_path | Path to the registered validation data asset. The supported data formats are jsonl , json , csv , tsv and parquet . |
uri_file | True | ||
test_file_path | Path to the registered test data asset. The supported data formats are jsonl , json , csv , tsv and parquet . |
uri_file | True | ||
train_mltable_path | Path to the registered training data asset in mltable format. |
mltable | True | ||
validation_mltable_path | Path to the registered validation data asset in mltable format. |
mltable | True | ||
test_mltable_path | Path to the registered test data asset in mltable format. |
mltable | True |
Model input
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
model_selector_output | output folder of model selector containing model metadata like config, checkpoints, tokenizer config | uri_folder | False |
Name | Description | Type |
---|---|---|
output_dir | The folder contains the tokenized output of the train, validation and test data along with the tokenizer files used to tokenize the data | uri_folder |
azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/81