Cell Sampling - VizierDB/vizier-scala GitHub Wiki
Sampling cells allow you to generate partial samples of a dataset. Vizier currently supports three forms of sampling:
- Basic
- Manually Stratified
- Automatically Stratified
Basic Sample
This cell generates a new dataset consisting of a randomly selected subset of the records in the input. Records are chosen using a single, uniform sampling rate that applies to every record.
- Input Dataset: The dataset to sample
- Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records)
- Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with `_sample` appended)
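As a rough illustration of what a uniform sampling rate means, the sketch below keeps each record independently with probability equal to the rate. This is not Vizier's implementation: the function name `basic_sample` and the `seed` parameter are hypothetical, and whether Vizier decides per record or draws an exact count is an assumption here.

```python
import random


def basic_sample(records, rate, seed=None):
    """Keep each record independently with probability `rate`.

    Hypothetical helper for illustration only; not part of Vizier.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so rate=1 keeps everything and rate=0 keeps nothing.
    return [record for record in records if rng.random() < rate]
```

With this approach the size of the result is only approximately `rate` times the size of the input, since each record is decided independently.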
Manually Stratified Sample
This cell generates a new dataset consisting of a subset of the records in the input. Each record is sampled at a manually provided rate that depends on the record's value for a categorical attribute.
- Input Dataset: The dataset to sample
- Column: The categorical attribute used to select a sampling rate
- Strata: Sampling rates for each value of Column
  - Column Value: Of the records where Column has this value...
  - Sampling Rate: ...include this fraction.
- Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with `_sample` appended)
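The Strata parameter can be thought of as a mapping from column values to sampling rates. The following sketch applies such a mapping to a list of dictionary records. The function name `manually_stratified_sample` is hypothetical, and dropping records whose value has no listed rate is an assumption, since the behavior for unlisted values is not described here.

```python
import random


def manually_stratified_sample(records, column, strata, seed=None):
    """Sample each record at the rate listed for its value of `column`.

    Hypothetical helper for illustration only; not part of Vizier.
    `strata` maps a column value to a sampling rate between 0 and 1.
    """
    rng = random.Random(seed)
    sample = []
    for record in records:
        # Assumption: values without a listed rate are sampled at rate 0.
        rate = strata.get(record[column], 0.0)
        if rng.random() < rate:
            sample.append(record)
    return sample
```

For example, `strata={"red": 1.0, "blue": 0.1}` would keep every record where the column is `"red"` and about one in ten where it is `"blue"`.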
Automatically Stratified Sample
This cell generates a new dataset consisting of a subset of the records in the input. Records are chosen to ensure equal representation of each value of a specified categorical attribute. Note that if too few records exist for one or more values of the categorical attribute, the cell will generate an error.
- Input Dataset: The dataset to sample.
- Column: The categorical attribute used to select a sampling rate. Every distinct value of this column will have (roughly) even representation in the final sample.
- Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records). If this value is too high (i.e., some categories would be under-represented in the result), an error message will indicate the maximum value of this field.
- Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with `_sample` appended)
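One way to picture why a high sampling rate can fail: if every distinct value must contribute the same number of records, the smallest category limits the total sample size. The sketch below works through that arithmetic under stated assumptions. It is not Vizier's implementation; the function name `auto_stratified_sample`, the exact rounding, and the error wording are all hypothetical.

```python
import random
from collections import defaultdict


def auto_stratified_sample(records, column, rate, seed=None):
    """Draw the same number of records from every distinct value of `column`.

    Hypothetical helper for illustration only; not part of Vizier.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record)
    if not groups:
        return []

    # Assumption: the overall sample size (rate * total) is split evenly
    # across the distinct values of the column.
    per_group = rate * len(records) / len(groups)
    smallest = min(len(group) for group in groups.values())
    if per_group > smallest:
        # The smallest category cannot supply its share, so report the
        # largest rate at which it still could.
        max_rate = smallest * len(groups) / len(records)
        raise ValueError(f"sampling rate too high; maximum is {max_rate:.4f}")

    rng = random.Random(seed)
    sample = []
    for group in groups.values():
        # Draw without replacement from each category.
        sample.extend(rng.sample(group, round(per_group)))
    return sample
```

Under these assumptions, an input with 8 records of one value and 2 of another supports a rate of at most 0.4, because the smaller category can contribute at most 2 of the 10 input records' worth of each category's share.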