components dataset_sampler - Azure/azureml-assets GitHub Wiki

Dataset Sampler

dataset_sampler

Overview

Samples a dataset containing JSONL file(s).

Version: 0.0.11

View in Studio: https://ml.azure.com/registries/azureml/components/dataset_sampler/version/0.0.11

Inputs

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| dataset | Path to the input directory or .jsonl file from which the data will be sampled. | uri_folder | | False | |
| sampling_style | The sampling method to use. Use `head` to sample from the beginning of the file, `tail` to sample from the end of the file, `random` to sample randomly, and `duplicate` to append the input file to itself until the correct output size is reached. | string | head | False | ['random', 'head', 'tail', 'duplicate'] |
| sampling_ratio | Portion of the dataset to be sampled. If sampling_style is not `duplicate`, must be a float in (0,1]; must be null if n_samples is specified. NOTE: If sampling_style is `duplicate`, the component duplicates the data in a "round robin" fashion, going over the input several times. This operation is very slow, so be cautious when using it on large datasets. | number | | True | |
| n_samples | Absolute number of samples to be taken (alternative to sampling_ratio); must be null if sampling_ratio is specified. | integer | | True | |
| random_seed | Random seed for sampling; if not specified, 0 is used. Used only when sampling_style is `random`. | integer | | True | |
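
The four sampling styles and the sampling_ratio / n_samples rule can be illustrated with a minimal Python sketch. This is not the component's source code. The helper names `resolve_sample_count` and `sample_lines` are hypothetical, and several behaviors are assumptions: the ratio is rounded down to a whole row count, `random` draws without replacement, and the case where neither sampling_ratio nor n_samples is given is rejected.

```python
import random
from itertools import cycle, islice
from typing import List, Optional


def resolve_sample_count(total: int, sampling_style: str,
                         sampling_ratio: Optional[float] = None,
                         n_samples: Optional[int] = None) -> int:
    """Turn sampling_ratio or n_samples into an absolute row count."""
    # Assumption: exactly one of the two must be given in this sketch.
    if (sampling_ratio is None) == (n_samples is None):
        raise ValueError("Specify exactly one of sampling_ratio or n_samples.")
    if n_samples is not None:
        return n_samples
    # The (0, 1] bound applies only when the style is not "duplicate".
    if sampling_style != "duplicate" and not (0 < sampling_ratio <= 1):
        raise ValueError("sampling_ratio must be in (0, 1] for this style.")
    return int(total * sampling_ratio)  # assumption: round down


def sample_lines(lines: List[str], sampling_style: str = "head",
                 sampling_ratio: Optional[float] = None,
                 n_samples: Optional[int] = None,
                 random_seed: int = 0) -> List[str]:
    """Return a sample of JSONL lines according to sampling_style."""
    count = resolve_sample_count(len(lines), sampling_style,
                                 sampling_ratio, n_samples)
    if sampling_style == "head":
        return lines[:count]
    if sampling_style == "tail":
        return lines[-count:] if count else []
    if sampling_style == "random":
        # Assumption: sampling without replacement, seeded for repeatability.
        return random.Random(random_seed).sample(lines, count)
    if sampling_style == "duplicate":
        # Round robin: cycle over the input until enough rows are produced.
        return list(islice(cycle(lines), count))
    raise ValueError(f"Unknown sampling_style: {sampling_style}")


rows = ['{"id": %d}' % i for i in range(6)]
first_half = sample_lines(rows, "head", sampling_ratio=0.5)
last_two = sample_lines(rows, "tail", n_samples=2)
repeated = sample_lines(rows, "duplicate", sampling_ratio=2.5)
```

In this sketch `first_half` holds the first three rows, `last_two` the final two, and `repeated` holds 15 rows that cycle through the six inputs in order.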

Outputs

| Name | Description | Type |
| --- | --- | --- |
| output_dataset | Path to the jsonl file where the sampled dataset will be saved. | uri_file |
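
Since output_dataset is a single JSONL file, a downstream step can read it one record per line. The following is a small sketch using only the Python standard library. The function name `read_jsonl` is hypothetical, and skipping blank lines is an assumption about how tolerant a reader should be, not something the component documents.

```python
import json
from pathlib import Path
from typing import Any, List


def read_jsonl(path: str) -> List[Any]:
    """Parse a JSONL file: one JSON value per non-empty line."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # assumption: blank lines are skipped, not errors
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl` on the downloaded output file returns the sampled records as a list, which is a quick way to confirm that the number of rows matches the requested sampling_ratio or n_samples.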

Environment

azureml://registries/azureml/environments/model-evaluation/labels/latest
