Token Classification Pipeline

token_classification_pipeline

Overview

Pipeline component to finetune Hugging Face pretrained models for the token classification task. The component supports optimizations such as LoRA, DeepSpeed, and ONNX Runtime for performance enhancement. See docs to learn more.

Version: 0.0.80

View in Studio: https://ml.azure.com/registries/azureml/components/token_classification_pipeline/version/0.0.80

Inputs

Name	Description	Type	Default	Optional
instance_type_model_import	Instance type to be used for the model_import component when using serverless compute, e.g. standard_d12_v2. The parameter compute_model_import must be set to 'serverless' for instance_type to be used.	string	Standard_d12_v2	True
instance_type_preprocess	Instance type to be used for the preprocess component when using serverless compute, e.g. standard_d12_v2. The parameter compute_preprocess must be set to 'serverless' for instance_type to be used.	string	Standard_d12_v2	True
instance_type_finetune	Instance type to be used for the finetune component when using serverless compute, e.g. standard_nc24rs_v3. The parameter compute_finetune must be set to 'serverless' for instance_type to be used.	string	Standard_nc24rs_v3	True
instance_type_model_evaluation	Instance type to be used for the model_evaluation component when using serverless compute, e.g. standard_nc24rs_v3. The parameter compute_model_evaluation must be set to 'serverless' for instance_type to be used.	string	Standard_nc24rs_v3	True
shm_size_finetune	Shared memory size to be used for finetune component. It is useful while using Nebula (via DeepSpeed) which uses shared memory to save model and optimizer states.	string	5g	True
num_nodes_finetune	Number of nodes to be used for finetuning (used for distributed training).	integer	1	True
number_of_gpu_to_use_finetuning	Number of GPUs to be used per node for finetuning. This should equal the number of GPUs per node in the compute SKU used for finetune.	integer	1	True

Model Import parameters (See docs to learn more)

Name	Description	Type	Optional
huggingface_id	The string can be any valid Hugging Face id from the Hugging Face models webpage. Models from Hugging Face are subject to third party license terms available on the Hugging Face model details page. It is your responsibility to comply with the model's license terms. Special characters like \ and ' are invalid in the parameter value.	string	True
pytorch_model_path	Pytorch model asset path. Special characters like \ and ' are invalid in the parameter value.	custom_model	True
mlflow_model_path	MLflow model asset path. Special characters like \ and ' are invalid in the parameter value.	mlflow_model	True
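
Several parameters on this page state that special characters like \ and ' are invalid in the value. A minimal sketch of a pre-submission check follows. The helper name `find_invalid_chars` and the parameter values are hypothetical, and the character set is an assumption: this page names only \ and ' as examples, so the component may reject others as well.

```python
# Characters this page names as invalid in parameter values.
# Assumption: the list may not be exhaustive.
INVALID_CHARS = ("\\", "'")


def find_invalid_chars(value: str) -> list[str]:
    """Return the invalid characters that occur in a parameter value."""
    return [ch for ch in INVALID_CHARS if ch in value]


# Hypothetical parameter values, one of them deliberately bad.
params = {
    "huggingface_id": "bert-base-uncased",
    "token_key": "tokens",
    "tag_key": "ner'tags",
}

# Map each offending parameter name to the characters found in it.
problems = {}
for name, value in params.items():
    found = find_invalid_chars(value)
    if found:
        problems[name] = found

print(problems)
```

Running a check like this before submitting the pipeline surfaces a bad value early instead of at job start.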

Data PreProcess parameters (See docs to learn more)

Name	Description	Type	Default	Optional	Enum
task_name	NamedEntityRecognition task type	string	NamedEntityRecognition	False	['NamedEntityRecognition']
token_key	Key for tokens in each example line. Special characters like \ and ' are invalid in the parameter value.	string		False
tag_key	Key for tags in each example line. Special characters like \ and ' are invalid in the parameter value.	string		False
batch_size	Number of examples to batch before calling the tokenization function	integer	1000	True
pad_to_max_length	If set to "true", the returned sequences will be padded according to the model's padding side and padding index, up to their max_seq_length. If no max_seq_length is specified, the padding is done up to the model's max length.	string	false	True	['true', 'false']

Name	Description	Type	Default	Optional
max_seq_length	Controls the maximum length to pad to when the pad_to_max_length parameter is set to `true`. The default of -1 means padding is done up to the model's max length; any other value pads sequences to `max_seq_length`.	integer	-1	True
train_file_path	Path to the registered training data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`. Special characters like \ and ' are invalid in the parameter value.	uri_file		True
validation_file_path	Path to the registered validation data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`. Special characters like \ and ' are invalid in the parameter value.	uri_file		True
test_file_path	Path to the registered test data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`. Special characters like \ and ' are invalid in the parameter value.	uri_file		True
train_mltable_path	Path to the registered training data asset in `mltable` format. Special characters like \ and ' are invalid in the parameter value.	mltable		True
validation_mltable_path	Path to the registered validation data asset in `mltable` format. Special characters like \ and ' are invalid in the parameter value.	mltable		True
test_mltable_path	Path to the registered test data asset in `mltable` format. Special characters like \ and ' are invalid in the parameter value.	mltable		True
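
To illustrate how `token_key` and `tag_key` relate to a `jsonl` data file, the sketch below writes a tiny training file and reads it back. The key names `tokens` and `ner_tags`, the example sentences, and the record layout (a list of tokens with a parallel list of string tags) are all assumptions for illustration; check the component docs for the exact format the preprocess step expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical key names; the same strings would be passed as
# token_key and tag_key to the pipeline.
TOKEN_KEY = "tokens"
TAG_KEY = "ner_tags"

# Assumed layout: one JSON object per line, one tag per token.
examples = [
    {TOKEN_KEY: ["Ada", "Lovelace", "lived", "in", "London"],
     TAG_KEY: ["B-PER", "I-PER", "O", "O", "B-LOC"]},
    {TOKEN_KEY: ["Contoso", "hired", "her"],
     TAG_KEY: ["B-ORG", "O", "O"]},
]

# Write the examples as JSON Lines into a temporary directory.
path = Path(tempfile.mkdtemp()) / "train.jsonl"
with path.open("w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Read the file back and verify tokens and tags stay aligned.
rows = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
for row in rows:
    assert len(row[TOKEN_KEY]) == len(row[TAG_KEY])
```

A file like this would then be registered as a data asset and referenced through `train_file_path`.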

Finetune parameters (See docs to learn more)

Name	Description	Type	Default	Optional	Enum
apply_lora	If "true", enables LoRA.	string	false	True	['true', 'false']
merge_lora_weights	If "true", the LoRA weights are merged with the base Hugging Face model weights before saving.	string	true	True	['true', 'false']
lora_alpha	Alpha attention parameter for LoRA.	integer	128	True
lora_r	LoRA dimension.	integer	8	True
lora_dropout	LoRA dropout value.	number	0.0	True
num_train_epochs	Number of epochs to run for finetune.	integer	1	True
max_steps	If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. When using a finite iterable dataset, training may stop before reaching the set number of steps once all data is exhausted.	integer	-1	True
per_device_train_batch_size	Per-GPU batch size used for training. The effective training batch size is per_device_train_batch_size * num_gpus * num_nodes.	integer	1	True
per_device_eval_batch_size	Per-GPU batch size used for validation. The effective validation batch size is per_device_eval_batch_size * num_gpus * num_nodes.	integer	1	True
auto_find_batch_size	If set to "true" and the provided per_device_train_batch_size causes an Out Of Memory (OOM) error, auto_find_batch_size finds a workable batch size by iteratively halving it until the OOM error is resolved.	string	false	True	['true', 'false']
optim	Optimizer to be used while training	string	adamw_torch	True	['adamw_torch', 'adafactor']
learning_rate	Start learning rate used for training.	number	2e-05	True
warmup_steps	Number of steps for the learning rate scheduler warmup phase.	integer	0	True
weight_decay	Weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in AdamW optimizer	number	0.0	True
adam_beta1	beta1 hyperparameter for the AdamW optimizer	number	0.9	True
adam_beta2	beta2 hyperparameter for the AdamW optimizer	number	0.999	True
adam_epsilon	epsilon hyperparameter for the AdamW optimizer	number	1e-08	True
gradient_accumulation_steps	Number of update steps to accumulate the gradients for, before performing a backward/update pass	integer	1	True
eval_accumulation_steps	Number of prediction steps to accumulate before moving the tensors to the CPU. Passed as None if set to -1.	integer	-1	True
lr_scheduler_type	learning rate scheduler to use.	string	linear	True	['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup']
precision	Apply mixed precision training. This can reduce memory footprint by performing operations in half-precision.	string	32	True	['32', '16']
seed	Random seed that will be set at the beginning of training	integer	42	True
enable_full_determinism	Ensure reproducible behavior during distributed training. Check this link https://pytorch.org/docs/stable/notes/randomness.html for more details.	string	false	True	['true', 'false']
dataloader_num_workers	Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process.	integer	0	True
ignore_mismatched_sizes	Not setting this flag will raise an error if some of the weights from the checkpoint do not have the same size as the weights of the model.	string	true	True	['true', 'false']
max_grad_norm	Maximum gradient norm (for gradient clipping)	number	1.0	True
evaluation_strategy	The evaluation strategy to adopt during training. If set to "steps", either `evaluation_steps_interval` or `eval_steps` must be specified to determine the steps at which the model is evaluated; otherwise evaluation happens at the end of each epoch.	string	epoch	True	['epoch', 'steps']
evaluation_steps_interval	Evaluation interval expressed as a fraction of the steps in one epoch. Overrides eval_steps if not 0.	number	0.0	True
eval_steps	Number of update steps between two evals if evaluation_strategy='steps'	integer	500	True
logging_strategy	The logging strategy to adopt during training. If set to "steps", `logging_steps` decides the logging frequency; otherwise logging happens at the end of each epoch.	string	steps	True	['epoch', 'steps']
logging_steps	Number of update steps between two logs if logging_strategy='steps'	integer	10	True
metric_for_best_model	metric to use to compare two different model checkpoints	string	loss	True	['loss', 'f1', 'accuracy', 'precision', 'recall']
resume_from_checkpoint	If set to "true", resumes training from the last saved checkpoint. Along with the saved weights, the saved optimizer, scheduler and random states are loaded if they exist.	string	false	True	['true', 'false']
save_total_limit	If a positive value is passed, it limits the total number of checkpoints saved. A value of -1 saves all checkpoints; otherwise, if the number of checkpoints exceeds save_total_limit, the older checkpoints are deleted.	integer	-1	True
apply_early_stopping	If set to "true", early stopping is enabled.	string	false	True	['true', 'false']
early_stopping_patience	Stop training when the metric specified through metric_for_best_model worsens for early_stopping_patience evaluation calls. This value is only valid if apply_early_stopping is set to true.	integer	1	True
early_stopping_threshold	Denotes how much the specified metric must improve to satisfy early stopping conditions. This value is only valid if apply_early_stopping is set to true.	number	0.0	True
apply_deepspeed	If set to true, enables DeepSpeed for training.	string	false	True	['true', 'false']
deepspeed	DeepSpeed config to be used for finetuning. Special characters like \ and ' are invalid in the parameter value.	uri_file		True
deepspeed_stage	Selects which default DeepSpeed config to use, stage2 or stage3. The default choice is stage2. This parameter applies only when the user does not pass any config through the deepspeed port.	string	2	True	['2', '3']
apply_ort	If set to true, uses ONNX Runtime training.	string	false	True	['true', 'false']
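
The batch size arithmetic described for `per_device_train_batch_size` and the halving behaviour described for `auto_find_batch_size` can be sketched as below. Both helper names are hypothetical, the numbers are illustrative, and the `fits` callback only simulates an out-of-memory check; this is not the component's actual implementation.

```python
from typing import Callable


def effective_batch_size(per_device_batch_size: int, num_gpus: int, num_nodes: int) -> int:
    """Effective batch size as stated above: per-device size * GPUs per node * nodes."""
    return per_device_batch_size * num_gpus * num_nodes


def find_batch_size(start: int, fits: Callable[[int], bool]) -> int:
    """Halve the batch size until `fits` reports that it no longer runs out of memory."""
    size = start
    while size >= 1:
        if fits(size):
            return size
        size //= 2
    raise RuntimeError("no batch size fits in memory")


# Illustrative values: 4 examples per GPU, 4 GPUs per node, 2 nodes.
total = effective_batch_size(per_device_batch_size=4, num_gpus=4, num_nodes=2)
print(total)  # 32

# Simulated memory limit: pretend anything above 5 examples per device hits OOM.
chosen = find_batch_size(start=16, fits=lambda size: size <= 5)
print(chosen)  # 4 (16 and 8 fail, 4 fits)
```

Setting `number_of_gpu_to_use_finetuning` and `num_nodes_finetune` therefore scales the effective batch size even when the per-device value stays fixed.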

Model Evaluation parameters

Name	Description	Type	Default	Optional	Enum
evaluation_config	Additional parameters for Computing Metrics. Special characters like \ and ' are invalid in the parameter value.	uri_file		True
evaluation_config_params	Additional parameters as a JSON serialized string	string		True

Compute parameters

Name	Description	Type	Default	Optional
compute_model_import	Compute to be used for model_import, e.g. provide 'FT-Cluster' if your compute is named 'FT-Cluster'. Special characters like \ and ' are invalid in the parameter value. If a compute cluster name is provided, the instance_type field is ignored and the respective cluster is used.	string	serverless	True
compute_preprocess	Compute to be used for preprocess, e.g. provide 'FT-Cluster' if your compute is named 'FT-Cluster'. Special characters like \ and ' are invalid in the parameter value. If a compute cluster name is provided, the instance_type field is ignored and the respective cluster is used.	string	serverless	True
compute_finetune	Compute to be used for finetune, e.g. provide 'FT-Cluster' if your compute is named 'FT-Cluster'. Special characters like \ and ' are invalid in the parameter value. If a compute cluster name is provided, the instance_type field is ignored and the respective cluster is used.	string	serverless	True
compute_model_evaluation	Compute to be used for model_evaluation, e.g. provide 'FT-Cluster' if your compute is named 'FT-Cluster'. Special characters like \ and ' are invalid in the parameter value. If a compute cluster name is provided, the instance_type field is ignored and the respective cluster is used.	string	serverless	True
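
The interaction between the `compute_*` and `instance_type_*` parameters can be sketched as a small selection rule. The function `resolve_compute` and its return shape are hypothetical and only mirror the behaviour described in the tables on this page: the instance type applies on serverless compute and is ignored when a cluster name is given.

```python
from typing import Optional


def resolve_compute(compute: str = "serverless",
                    instance_type: str = "Standard_d12_v2") -> dict[str, Optional[str]]:
    """Mirror the documented rule: instance_type only matters for serverless compute."""
    if compute == "serverless":
        return {"compute": "serverless", "instance_type": instance_type}
    # A named cluster is used as-is and instance_type is ignored.
    return {"compute": compute, "instance_type": None}


# Defaults: serverless compute with the documented default instance type.
print(resolve_compute())

# A named cluster: the instance type passed alongside it has no effect.
print(resolve_compute("FT-Cluster", "Standard_nc24rs_v3"))
```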

Outputs

Name	Description	Type
pytorch_model_folder	Output folder containing the best model as defined by metric_for_best_model. Along with the best model, the output folder contains checkpoints saved after every evaluation, as defined by the evaluation_strategy. Each checkpoint contains the model weight(s), config, tokenizer, optimizer, scheduler and random number states.	uri_folder
mlflow_model_folder	output folder containing best finetuned model in mlflow format.	mlflow_model
evaluation_result	Test Data Evaluation Results	uri_folder
