sample_aggregate - south-coast-science/scs_analysis GitHub Wiki



DESCRIPTION

The sample_aggregate utility provides regression midpoints for data delivered on stdin, over specified units of time, or over the entire dataset. It can perform this operation for one, several, or all nodes of the input documents.

When each time checkpoint is encountered in the input stream, the midpoint values - together with min and max, if requested - are computed and reported. These values are marked with the datetime indicating the end of that period. When the input stream is closed, any remaining values are reported and marked with the next checkpoint.

Checkpoints are specified in the form HH:MM:SS, in a format similar to that for Unix crontab:

Specification    Meaning
**               all values
NN               exactly matching NN
/N               repeated every N

For example, **:/5:30 indicates 30 seconds past the minute, every 5 minutes, during every hour.
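As a rough illustration only - this is not the utility's own parser - the following Python sketch shows how a checkpoint field such as '**', 'NN' or '/N' might be matched against the corresponding component of a sample's datetime:

# Illustrative sketch: matching a crontab-style checkpoint specification
# (e.g. '**:/5:30') against a datetime. Not the scs_analysis implementation.

from datetime import datetime


def field_matches(spec, value):
    if spec == '**':                            # '**' matches every value
        return True

    if spec.startswith('/'):                    # '/N' matches every N units
        return value % int(spec[1:]) == 0

    return value == int(spec)                   # 'NN' matches exactly NN


def is_checkpoint(checkpoint, dt):
    hours, minutes, seconds = checkpoint.split(':')

    return (field_matches(hours, dt.hour) and field_matches(minutes, dt.minute)
            and field_matches(seconds, dt.second))


print(is_checkpoint('**:/5:30', datetime(2018, 12, 11, 10, 45, 30)))    # True
print(is_checkpoint('**:/5:30', datetime(2018, 12, 11, 10, 46, 30)))    # False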

If no checkpoint is specified, then a single document is output - this is the aggregation of all input data.

Data sources are specified as paths into the input JSON document, in the same format as used by the node utility. Any number of paths can be specified, including none (process all paths). If a path to an internal node in the JSON document is specified, then all of the leaf-node descendants of that node will be processed.

Note that the leaf-node paths to be processed are obtained from the paths provided on the command line, and the actual paths found in the first JSON document. Paths that do not exist in the first document are ignored.
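The sketch below shows one way the set of leaf-node paths could be resolved from the requested paths and the first document; the helper names are hypothetical and this is not the utility's own code.

# Illustrative sketch: expanding requested paths to the leaf-node paths found
# in the first JSON document. Hypothetical helper names; not scs_analysis code.

def leaf_paths(node, prefix=''):
    if isinstance(node, dict):                          # internal node: recurse
        for key, value in node.items():
            yield from leaf_paths(value, prefix + key + '.')
    else:                                               # leaf node: report its path
        yield prefix[:-1]


def selected_paths(first_document, requested):
    leaves = list(leaf_paths(first_document))

    if not requested:                                   # no paths given: process all
        return leaves

    # keep leaves that are, or sit beneath, a requested path; requested paths
    # not present in the first document are ignored
    return [leaf for leaf in leaves
            if any(leaf == path or leaf.startswith(path + '.') for path in requested)]


document = {"rec": "2018-12-11T10:45:09Z", "val": {"hmd": 51.4, "tmp": 21.1}}

print(selected_paths(document, ['val']))                # ['val.hmd', 'val.tmp']
print(selected_paths(document, ['val.hmd', 'val.xyz'])) # ['val.hmd']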

The input JSON document must contain a field labelled 'rec', providing an ISO 8601 localised datetime. If this field is not present then the document is skipped. Note that the timezone of the output rec datetimes is the same as the input rec values. Rows with successive duplicate rec values are ignored.
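A minimal sketch of this input filtering - assuming one JSON document per line on stdin, and not reproducing the utility's own code - might be:

# Illustrative sketch: skip documents with no 'rec' field, and skip rows whose
# 'rec' repeats the previous row's value. Not the scs_analysis implementation.

import json
import sys


def filtered(lines, iso_path='rec'):
    previous_rec = None

    for line in lines:
        document = json.loads(line)
        rec = document.get(iso_path)

        if rec is None:                     # no datetime field: skip the document
            continue

        if rec == previous_rec:             # successive duplicate datetime: skip
            continue

        previous_rec = rec
        yield document


for document in filtered(sys.stdin):
    print(json.dumps(document))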

Leaf node values may be numeric or strings. Numeric values are processed according to a simple linear regression. String values are processed using a simple categorical regression.

Input                     Min              Mid     Max
values are all equal      value            value   value
values are different      minimum value    None    maximum value

If the input document does not contain a specified path - or if the value is null - then the value is ignored.
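As a hedged sketch of the behaviour summarised in the table above - assuming, for illustration, that the numeric midpoint is the value of a least-squares line evaluated at the middle of the period, which may differ from the utility's own regression classes:

# Illustrative sketch: min / mid / max for one group of samples. Numeric values
# use a least-squares line evaluated at the middle sample position; strings
# follow the categorical rules in the table above. Assumptions only.

def numeric_aggregate(values):
    n = len(values)
    xs = range(n)

    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)

    slope = 0.0 if var_x == 0 else \
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / var_x
    intercept = mean_y - slope * mean_x

    mid = slope * (n - 1) / 2 + intercept           # line at the midpoint position

    return {'min': min(values), 'mid': round(mid, 1), 'max': max(values)}


def string_aggregate(values):
    if len(set(values)) == 1:                       # all values equal
        return {'min': values[0], 'mid': values[0], 'max': values[0]}

    return {'min': min(values), 'mid': None, 'max': max(values)}


print(numeric_aggregate([21.1, 21.1, 21.2]))    # {'min': 21.1, 'mid': 21.1, 'max': 21.2}
print(string_aggregate(['NO2', 'NO2', 'NO']))   # {'min': 'NO', 'mid': None, 'max': 'NO2'}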

If a checkpoint is specified and the --exclude-remainder flag is used, then all of the input documents after the last complete checkpoint period are ignored. When a checkpoint is specified, the --rule flag is also available. If used, individual aggregates are rejected if fewer than 75% of the expected data points are present. In this case, a timedelta must be supplied, indicating the expected interval between the input samples. The interval may be found using the aws_topic_history utility.
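For example, with a **:/15:00 checkpoint and a :10 sampling interval, 90 data points are expected per period, so an aggregate built from fewer than 68 of them would be rejected. A small sketch of this check (illustrative only; not the utility's own code):

# Illustrative sketch: the 75% rule. 'period' is the checkpoint period and
# 'interval' the expected sampling interval, both timedeltas. Assumptions only.

from datetime import timedelta


def satisfies_rule(count, period, interval, threshold=0.75):
    expected = period / interval                    # expected samples per period
    return count >= threshold * expected


period = timedelta(minutes=15)                      # checkpoint **:/15:00
interval = timedelta(seconds=10)                    # --rule :10

print(satisfies_rule(70, period, interval))         # True  (70 >= 67.5)
print(satisfies_rule(60, period, interval))         # False (60 <  67.5)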

WARNING: The sample_aggregate utility uses the first input document to determine the data type for each regression. If csv_reader is being used to supply data, then the csv_reader --nullify flag should be used - this prevents numeric fields from being incorrectly identified as strings.

SYNOPSIS

sample_aggregate.py [-p HH:MM:SS [-x] [-r { [DD-]HH:MM[:SS] | :SS }]] [-m] [-i ISO] [-v] [PATH_1..PATH_N]

Options
--version show program's version number and exit
-h, --help show this help message and exit
-p CHECKPOINT, --checkpoint=CHECKPOINT a time specification, e.g. **:/05:00
-x, --exclude-remainder ignore data points after the last complete period
-r INTERVAL, --rule=INTERVAL apply 75% rule with sampling INTERVAL
-m, --min-max report min and max in addition to midpoint
-i ISO, --iso-path=ISO path for ISO 8601 datetime field (default 'rec')
-v, --verbose report narrative to stderr

EXAMPLES

csv_reader.py -n gases.csv | sample_aggregate.py -v -r :10 -p **:/15:00
aws_topic_history.py -v -c super -p 00:00:00 -s 2023-12-01T00:00:00Z -e 2024-01-01T00:00:00Z south-coast-science-production/reference/loc/531/particulates | node.py tag rec ver src val.sfr val.sht exg | sample_aggregate.py | csv_writer.py -e 531-particulates-2023-12.csv

DOCUMENT EXAMPLE - OUTPUT

Without min-max:

{"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z", "val": {"hmd": 51.4, "tmp": 21.1}}
{"tag": "scs-be2-2", "rec": "2018-12-11T10:50:00Z", "val": {"hmd": 51.4, "tmp": 21.2}}

With min-max:

{"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z", "val": {"hmd": {"min": 51.4, "mid": 51.4, "max": 51.4}, "tmp": {"min": 21.1, "mid": 21.1, "max": 21.2}}}
{"tag": "scs-be2-2", "rec": "2018-12-11T10:50:00Z", "val": {"hmd": {"min": 51.4, "mid": 51.4, "max": 51.4}, "tmp": {"min": 21.1, "mid": 21.2, "max": 21.2}}}

SEE ALSO

scs_analysis/aws_topic_history
scs_analysis/csv_reader
scs_analysis/node
scs_analysis/node_shift
scs_analysis/sample_average

RESOURCES

ISO 8601