components dataset_sampler - Azure/azureml-assets GitHub Wiki
Samples a dataset containing JSONL file(s).
Version: 0.0.9
View in Studio: https://ml.azure.com/registries/azureml/components/dataset_sampler/version/0.0.9
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
dataset | Path to the input directory or `.jsonl` file from which the data will be sampled. | uri_folder | | False | |
sampling_style | The sampling method to use. Use `head` to sample from the beginning of the file, `tail` to sample from the end of the file, `random` to sample randomly, and `duplicate` to append the input file to itself until the correct output size is reached. | string | head | False | ['random', 'head', 'tail', 'duplicate'] |
sampling_ratio | Portion of the dataset to be sampled. If `sampling_style` is not `duplicate`, must be a float in (0,1]; must be null if `n_samples` is specified. NOTE: If `sampling_style` is `duplicate`, the component duplicates the data in a "round robin" fashion, going over the input several times. This operation is very slow, so be cautious when using it on large datasets. | number | | True | |
n_samples | Absolute number of samples to be taken (alternative to `sampling_ratio`); must be null if `sampling_ratio` is specified. | integer | | True | |
random_seed | Random seed for sampling; if not specified, 0 is used. Used only when `sampling_style` is `random`. | integer | | True | |
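
The way `sampling_style`, `sampling_ratio`, and `n_samples` interact can be sketched in plain Python. This is a minimal illustration of the semantics described in the table, not the component's actual implementation. The helper name `sample_lines` is hypothetical, and three behaviors are assumptions: rounding the ratio-based size down, clamping the sample size to the input length for non-`duplicate` styles, and returning an empty list for empty input.

```python
import random
from typing import List, Optional


def sample_lines(
    lines: List[str],
    sampling_style: str = "head",
    sampling_ratio: Optional[float] = None,
    n_samples: Optional[int] = None,
    random_seed: int = 0,
) -> List[str]:
    """Hypothetical helper mirroring the documented input semantics."""
    # Exactly one of sampling_ratio / n_samples may be set; the other must be null.
    if (sampling_ratio is None) == (n_samples is None):
        raise ValueError("Specify exactly one of sampling_ratio or n_samples.")

    if sampling_ratio is not None:
        # Outside of 'duplicate', the ratio must lie in (0, 1].
        if sampling_style != "duplicate" and not (0 < sampling_ratio <= 1):
            raise ValueError("sampling_ratio must be in (0, 1] unless duplicating.")
        # Assumption: the target size is rounded down.
        n_samples = int(len(lines) * sampling_ratio)

    if n_samples < 0:
        raise ValueError("The number of samples must not be negative.")
    if not lines:
        return []  # Assumption: empty input yields empty output.

    if sampling_style == "duplicate":
        # Round robin: cycle over the input until n_samples lines are produced.
        return [lines[i % len(lines)] for i in range(n_samples)]

    # Assumption: other styles cannot return more lines than the input holds.
    n = min(n_samples, len(lines))
    if sampling_style == "head":
        return lines[:n]
    if sampling_style == "tail":
        return lines[len(lines) - n:]
    if sampling_style == "random":
        # The seed is only consulted for the 'random' style.
        return random.Random(random_seed).sample(lines, n)
    raise ValueError(f"Unknown sampling_style: {sampling_style!r}")
```

For example, `sample_lines(["a", "b", "c", "d"], "duplicate", n_samples=6)` cycles over the input one and a half times, which illustrates why `duplicate` is the only style where the output can be larger than the input.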
Name | Description | Type |
---|---|---|
output_dataset | Path to the jsonl file where the sampled dataset will be saved. | uri_file |
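
Because `dataset` may point to either a directory or a single `.jsonl` file while `output_dataset` is always a single file, the input/output handling can be sketched as follows. This is an illustrative sketch only, not the component's code: the function names are hypothetical, and the non-recursive `*.jsonl` glob, the sorted file order, and the skipping of blank lines are all assumptions.

```python
import json
from pathlib import Path
from typing import List


def read_jsonl_lines(dataset: str) -> List[str]:
    """Collect JSONL lines from a single file or from every .jsonl file in a directory."""
    path = Path(dataset)
    # Assumption: directories are scanned non-recursively, in sorted file order.
    files = sorted(path.glob("*.jsonl")) if path.is_dir() else [path]
    lines: List[str] = []
    for file in files:
        with open(file, encoding="utf-8") as f:
            # Assumption: blank lines are ignored.
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    return lines


def write_jsonl_lines(lines: List[str], output_dataset: str) -> None:
    """Write the sampled lines to a single JSONL file, one JSON document per line."""
    out = Path(output_dataset)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", encoding="utf-8") as f:
        for line in lines:
            json.loads(line)  # Raises ValueError if a line is not valid JSON.
            f.write(line + "\n")
```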
Environment: azureml://registries/azureml/environments/model-evaluation/labels/latest