sample_aggregate - south-coast-science/scs_analysis GitHub Wiki



DESCRIPTION

The sample_aggregate utility provides regression midpoints for data delivered on stdin, over specified units of time, or over the entire dataset. It can perform this operation for one, several, or all nodes of the input documents.

When each time checkpoint is encountered in the input stream, the midpoint values - together with min and max, if requested - are computed and reported. These values are marked with the datetime indicating the end of that period. When the input stream is closed, any remaining values are reported and marked with the next checkpoint.

Checkpoints are specified in the form HH:MM:SS, in a format similar to that for Unix crontab:

Specification    Meaning
**               all values
NN               exactly matching NN
/N               repeated every N

For example, **:/5:30 indicates 30 seconds past the minute, every 5 minutes, during every hour.
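As a rough illustration only - this is not the utility's own parser - the following Python sketch shows how a checkpoint field such as '**', 'NN' or '/N' might be matched against the corresponding component of a sample's datetime:

# Illustrative sketch: matching a crontab-style checkpoint specification
# (e.g. '**:/5:30') against a datetime. Not the scs_analysis implementation.

from datetime import datetime


def field_matches(spec, value):
    if spec == '**':                            # '**' matches every value
        return True

    if spec.startswith('/'):                    # '/N' matches every N units
        return value % int(spec[1:]) == 0

    return value == int(spec)                   # 'NN' matches exactly NN


def is_checkpoint(checkpoint, dt):
    hours, minutes, seconds = checkpoint.split(':')

    return (field_matches(hours, dt.hour) and field_matches(minutes, dt.minute)
            and field_matches(seconds, dt.second))


print(is_checkpoint('**:/5:30', datetime(2018, 12, 11, 10, 45, 30)))    # True
print(is_checkpoint('**:/5:30', datetime(2018, 12, 11, 10, 46, 30)))    # False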

If no checkpoint is specified, then a single document is output - this is the aggregation of all input data.

Data sources are specified as paths into the input JSON document, in the same format as used by the node utility. Any number of paths can be specified, including none (process all paths). If a path to an internal node in the JSON document is specified, then all of the leaf-node descendants of that node will be processed.

Note that the leaf-node paths to be processed are obtained from the paths provided on the command line, and the actual paths found in the first JSON document. Paths that do not exist in the first document are ignored.
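The sketch below shows one way the set of leaf-node paths could be resolved from the requested paths and the first document; the helper names are hypothetical and this is not the utility's own code.

# Illustrative sketch: expanding requested paths to the leaf-node paths found
# in the first JSON document. Hypothetical helper names; not scs_analysis code.

def leaf_paths(node, prefix=''):
    if isinstance(node, dict):                          # internal node: recurse
        for key, value in node.items():
            yield from leaf_paths(value, prefix + key + '.')
    else:                                               # leaf node: report its path
        yield prefix[:-1]


def selected_paths(first_document, requested):
    leaves = list(leaf_paths(first_document))

    if not requested:                                   # no paths given: process all
        return leaves

    # keep leaves that are, or sit beneath, a requested path; requested paths
    # not present in the first document are ignored
    return [leaf for leaf in leaves
            if any(leaf == path or leaf.startswith(path + '.') for path in requested)]


document = {"rec": "2018-12-11T10:45:09Z", "val": {"hmd": 51.4, "tmp": 21.1}}

print(selected_paths(document, ['val']))                # ['val.hmd', 'val.tmp']
print(selected_paths(document, ['val.hmd', 'val.xyz'])) # ['val.hmd']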

The input JSON document must contain a field labelled 'rec', providing an ISO 8601 localised datetime. If this field is not present then the document is skipped. Note that the timezone of the output rec datetimes is the same as the input rec values. Rows with successive duplicate rec values are ignored.
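A minimal sketch of this input filtering - assuming one JSON document per line on stdin, and not reproducing the utility's own code - might be:

# Illustrative sketch: skip documents with no 'rec' field, and skip rows whose
# 'rec' repeats the previous row's value. Not the scs_analysis implementation.

import json
import sys


def filtered(lines, iso_path='rec'):
    previous_rec = None

    for line in lines:
        document = json.loads(line)
        rec = document.get(iso_path)

        if rec is None:                     # no datetime field: skip the document
            continue

        if rec == previous_rec:             # successive duplicate datetime: skip
            continue

        previous_rec = rec
        yield document


for document in filtered(sys.stdin):
    print(json.dumps(document))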

Leaf node values may be numeric or strings. Numeric values are processed according to a simple linear regression. String values are processed using a simple categorical regression.

Input                     Min              Mid     Max
values are all equal      value            value   value
values are different      minimum value    None    maximum value

If the input document does not contain a specified path - or if the value is null - then the value is ignored.
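As a hedged sketch of the behaviour summarised in the table above - assuming, for illustration, that the numeric midpoint is the value of a least-squares line evaluated at the middle of the period, which may differ from the utility's own regression classes:

# Illustrative sketch: min / mid / max for one group of samples. Numeric values
# use a least-squares line evaluated at the middle sample position; strings
# follow the categorical rules in the table above. Assumptions only.

def numeric_aggregate(values):
    n = len(values)
    xs = range(n)

    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)

    slope = 0.0 if var_x == 0 else \
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / var_x
    intercept = mean_y - slope * mean_x

    mid = slope * (n - 1) / 2 + intercept           # line at the midpoint position

    return {'min': min(values), 'mid': round(mid, 1), 'max': max(values)}


def string_aggregate(values):
    if len(set(values)) == 1:                       # all values equal
        return {'min': values[0], 'mid': values[0], 'max': values[0]}

    return {'min': min(values), 'mid': None, 'max': max(values)}


print(numeric_aggregate([21.1, 21.1, 21.2]))    # {'min': 21.1, 'mid': 21.1, 'max': 21.2}
print(string_aggregate(['NO2', 'NO2', 'NO']))   # {'min': 'NO', 'mid': None, 'max': 'NO2'}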

If a checkpoint is specified and the --exclude-remainder flag is used, then all of the input documents after the last complete checkpoint period are ignored. When a checkpoint is specified, the --rule flag is also available. If used, individual aggregates are rejected if fewer than 75% of the expected data points are present. In this case, a timedelta must be supplied, indicating the expected interval between the input samples. The interval may be found using the aws_topic_history utility.
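For example, with a **:/15:00 checkpoint and a :10 sampling interval, 90 data points are expected per period, so an aggregate built from fewer than 68 of them would be rejected. A small sketch of this check (illustrative only; not the utility's own code):

# Illustrative sketch: the 75% rule. 'period' is the checkpoint period and
# 'interval' the expected sampling interval, both timedeltas. Assumptions only.

from datetime import timedelta


def satisfies_rule(count, period, interval, threshold=0.75):
    expected = period / interval                    # expected samples per period
    return count >= threshold * expected


period = timedelta(minutes=15)                      # checkpoint **:/15:00
interval = timedelta(seconds=10)                    # --rule :10

print(satisfies_rule(70, period, interval))         # True  (70 >= 67.5)
print(satisfies_rule(60, period, interval))         # False (60 <  67.5)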

WARNING: The sample_aggregate utility uses the first input document to determine the data type for each regression. If csv_reader is being used to supply data, then the csv_reader --nullify flag should be used - this prevents numeric fields from being incorrectly identified as strings.

SYNOPSIS

sample_aggregate.py [-p HH:MM:SS [-x] [-r { [DD-]HH:MM[:SS] | :SS }]] [-m] [-i ISO] [-v] [PATH_1..PATH_N]

Options
--version show program's version number and exit
-h, --help show this help message and exit
-p CHECKPOINT, --checkpoint=CHECKPOINT a time specification, e.g. **:/05:00
-x, --exclude-remainder ignore data points after the last complete period
-r INTERVAL, --rule=INTERVAL apply 75% rule with sampling INTERVAL
-m, --min-max report min and max in addition to midpoint
-i ISO, --iso-path=ISO path for ISO 8601 datetime field (default 'rec')
-v, --verbose report narrative to stderr

EXAMPLES

csv_reader.py -n gases.csv | sample_aggregate.py -v -r :10 -p **:/15:00
aws_topic_history.py -v -c super -p 00:00:00 -s 2023-12-01T00:00:00Z -e 2024-01-01T00:00:00Z south-coast-science-production/reference/loc/531/particulates | node.py tag rec ver src val.sfr val.sht exg | sample_aggregate.py | csv_writer.py -e 531-particulates-2023-12.csv

DOCUMENT EXAMPLE - OUTPUT

Without min-max:

{"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z", "val": {"hmd": 51.4, "tmp": 21.1}}
{"tag": "scs-be2-2", "rec": "2018-12-11T10:50:00Z", "val": {"hmd": 51.4, "tmp": 21.2}}

With min-max:

{"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z", "val": {"hmd": {"min": 51.4, "mid": 51.4, "max": 51.4}, "tmp": {"min": 21.1, "mid": 21.1, "max": 21.2}}}
{"tag": "scs-be2-2", "rec": "2018-12-11T10:50:00Z", "val": {"hmd": {"min": 51.4, "mid": 51.4, "max": 51.4}, "tmp": {"min": 21.1, "mid": 21.2, "max": 21.2}}}

SEE ALSO

scs_analysis/aws_topic_history
scs_analysis/csv_reader
scs_analysis/node
scs_analysis/node_shift
scs_analysis/sample_average

RESOURCES

ISO 8601