Cell Sampling - VizierDB/vizier-scala GitHub Wiki

Sampling cells allow you to generate partial samples of a dataset. Vizier currently supports three forms of sampling:

Basic
Manually Stratified
Automatically Stratified

Basic Sample

This cell generates a new dataset consisting of a randomly selected subset of the records in the input. Records are chosen using a single sampling rate that applies uniformly to all records.

Input Dataset: The dataset to sample
Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records)
Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with _sample appended)
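
As a rough illustration of what a uniform sampling rate means, the sketch below keeps each record independently with probability equal to the rate. This is not Vizier's implementation; the function name `basic_sample` and the `seed` parameter are hypothetical and exist only for this example.

```python
import random


def basic_sample(records, rate, seed=None):
    """Illustrative only: keep each record independently with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # hypothetical seed, used here for repeatability
    # rng.random() lies in [0, 1), so rate=1 keeps everything and rate=0 keeps nothing.
    return [r for r in records if rng.random() < rate]


if __name__ == "__main__":
    rows = [{"id": i} for i in range(1000)]
    sample = basic_sample(rows, 0.1, seed=42)
    print(len(sample), "of", len(rows), "records kept")
```

Under this per-record scheme the size of the result is only approximately the sampling rate times the size of the input.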

Manually Stratified Sample

This cell generates a new dataset consisting of a subset of the records in the input. Records are chosen using manually provided sampling rates, one for each value of a categorical attribute.

Input Dataset: The dataset to sample
Column: The categorical attribute used to select a sampling rate
Strata: Sampling rates for each value of Column
- Column Value: Of the records where Column has this value...
- Sampling Rate: ...include this fraction.
Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with _sample appended)
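
The idea of per-value sampling rates can be sketched as follows. This is an illustration under stated assumptions, not Vizier's implementation: the function name `manual_stratified_sample` is hypothetical, records are modeled as dictionaries, and records whose Column value has no entry in the strata are assumed to be dropped (the behavior of the actual cell in that case is not documented here).

```python
import random


def manual_stratified_sample(records, column, strata, seed=None):
    """Illustrative only: keep each record with the rate listed for its column value.

    `strata` maps a column value to a sampling rate between 0 and 1.
    Assumption: values missing from `strata` get a rate of 0.
    """
    for value, rate in strata.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate for {value!r} must be between 0 and 1")
    rng = random.Random(seed)  # hypothetical seed, used here for repeatability
    return [r for r in records if rng.random() < strata.get(r[column], 0.0)]


if __name__ == "__main__":
    rows = [{"kind": "common"}] * 900 + [{"kind": "rare"}] * 100
    # Keep about 10% of the common records but every rare record.
    sample = manual_stratified_sample(rows, "kind", {"common": 0.1, "rare": 1.0}, seed=7)
    print(len(sample), "records kept")
```

A configuration like the one in the usage example is a typical reason to stratify manually: a rare category is kept in full while a dominant one is thinned out.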

Automatically Stratified Sample

This cell generates a new dataset consisting of a subset of the records in the input. Records are chosen to ensure equal representation of each value of a specified categorical attribute. Note that if too few records exist for one or more values of the categorical attribute, the cell will generate an error.

Input Dataset: The dataset to sample
Column: The categorical attribute used to select a sampling rate. Every distinct value of this column will have (roughly) even representation in the final sample.
Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records). If this value is too high (i.e., some categories would be under-represented in the result), an error message will indicate the maximum value of this field.
Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with _sample appended)
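
One way to see why a sampling rate can be too high is the sketch below. It is not Vizier's implementation, and the function name `auto_stratified_sample` is hypothetical. It assumes the target sample size is the sampling rate times the number of input records, split evenly across the distinct values of Column; under that assumption the largest feasible rate is (number of distinct values × size of the smallest category) / (number of records). For simplicity the sketch draws exactly equal counts per value, whereas the cell itself only promises roughly even representation.

```python
import random
from collections import defaultdict


def auto_stratified_sample(records, column, rate, seed=None):
    """Illustrative only: draw the same number of records for every column value."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    groups = defaultdict(list)
    for r in records:
        groups[r[column]].append(r)
    if not groups:
        return []

    total = len(records)
    k = len(groups)
    smallest = min(len(g) for g in groups.values())
    # Assumed feasibility bound: every category must be able to supply its share.
    max_rate = k * smallest / total
    if rate > max_rate:
        raise ValueError(f"sampling rate too high; maximum is {max_rate}")

    per_group = min(int(rate * total / k), smallest)
    rng = random.Random(seed)  # hypothetical seed, used here for repeatability
    sample = []
    for g in groups.values():
        sample.extend(rng.sample(g, per_group))  # sample without replacement
    return sample
```

For example, with 9 records in one category and 1 in another, the bound under these assumptions is 2 × 1 / 10 = 0.2, so any higher rate is rejected.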