sample_aggregate - bruno-beloff/scs_analysis GitHub Wiki
DESCRIPTION
The sample_aggregate utility provides regression midpoints for data delivered on stdin, over specified units of time, or over the entire dataset. It can perform this operation for one, many, or all nodes of the input documents.
When each time checkpoint is encountered in the input stream, the midpoint values - together with min and max, if requested - are computed and reported. These values are marked with the datetime indicating the end of that period. When the input stream is closed, any remaining values are reported and marked with the next checkpoint.
Checkpoints are specified in the form HH:MM:SS, in a format similar to that for Unix crontab:
Specification | Meaning |
---|---|
** | all values |
NN | exactly matching NN |
/N | repeated every N |
For example, **:/5:30 indicates 30 seconds past the minute, every 5 minutes, during every hour.
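The matching rules in the table above can be sketched in a few lines of Python. This is an illustration only, not the utility's own code: the names `field_matches` and `next_checkpoint` are hypothetical, and the sketch assumes that `/N` matches values that are exact multiples of N and that a checkpoint must fall strictly after the given time.

```python
from datetime import datetime, timedelta, timezone


def field_matches(spec: str, value: int) -> bool:
    """Return True if value satisfies one field of a checkpoint spec."""
    if spec == "**":
        return True                          # all values
    if spec.startswith("/"):
        # Assumption: "/N" means every value divisible by N
        return value % int(spec[1:]) == 0
    return value == int(spec)                # exactly matching NN


def next_checkpoint(spec: str, after: datetime) -> datetime:
    """Return the first whole second strictly after `after` that matches spec."""
    hour_spec, minute_spec, second_spec = spec.split(":")
    candidate = after.replace(microsecond=0) + timedelta(seconds=1)
    # Brute-force scan over at most one day of seconds (fine for a sketch)
    for _ in range(24 * 60 * 60):
        if (field_matches(hour_spec, candidate.hour)
                and field_matches(minute_spec, candidate.minute)
                and field_matches(second_spec, candidate.second)):
            return candidate
        candidate += timedelta(seconds=1)
    raise ValueError("no matching time found for %r" % spec)


start = datetime(2018, 12, 11, 10, 42, 10, tzinfo=timezone.utc)
print(next_checkpoint("**:/5:30", start).isoformat())
# prints 2018-12-11T10:45:30+00:00
```

The one-second scan keeps the logic obviously aligned with the table; a production implementation would more likely compute the next match arithmetically.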
If no checkpoint is specified, then a single document is output - this is the aggregation of all input data.
Data sources are specified as a path into the input JSON document in the same format as the node command. Any number of paths can be specified, including none (process all paths). If a path to an internal node in the JSON document is specified, then all of the leaf-node descendants of that node will be processed.
Note that the leaf-node paths to be processed are determined by combining the paths provided on the command line with the actual paths found in the first JSON document. Paths that do not exist in the first document are ignored.
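The path resolution described above might look like the following. This is a minimal sketch, assuming dot-separated paths (as in `val.sfr`); the function names `leaf_paths` and `selected_leaf_paths` are hypothetical, and the sketch does not treat the datetime field specially.

```python
def leaf_paths(document: dict, prefix: str = "") -> list:
    """List the dot-separated paths of every leaf node in a nested dict."""
    paths = []
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(leaf_paths(value, path))   # internal node: descend
        else:
            paths.append(path)                      # leaf node
    return paths


def selected_leaf_paths(first_document: dict, requested: list) -> list:
    """Resolve requested paths against the leaves of the first document."""
    leaves = leaf_paths(first_document)
    if not requested:
        return leaves                               # no paths given: process all
    # Keep a leaf if it is a requested path or a descendant of one;
    # requested paths absent from the first document simply match nothing.
    return [leaf for leaf in leaves
            if any(leaf == r or leaf.startswith(r + ".") for r in requested)]


first = {"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z",
         "val": {"hmd": 51.4, "tmp": 21.1}}
print(selected_leaf_paths(first, ["val"]))              # ['val.hmd', 'val.tmp']
print(selected_leaf_paths(first, ["val.hmd", "val.x"])) # ['val.hmd']
```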
The input JSON document must contain a field labelled 'rec', providing an ISO 8601 localised datetime. If this field is not present then the document is skipped. Note that the timezone of the output rec datetimes is the same as the input rec values. Rows with successive duplicate rec values are ignored.
Leaf node values may be numeric or strings. Numeric values are processed according to a simple linear regression. String values are processed using a simple categorical regression.
Input | Min | Mid | Max |
---|---|---|---|
values are all equal | value | value | value |
values are different | minimum value | None | maximum value |
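To make the two kinds of regression concrete, here is a hedged Python sketch. It assumes that the numeric "midpoint" is the least-squares line evaluated at the middle of the observed time span, and that string minimum and maximum follow ordinary lexicographic ordering; neither detail is confirmed by this page. The function names `linear_midpoint` and `categorical_summary` are hypothetical.

```python
def linear_midpoint(points):
    """points: list of (seconds, value) pairs.
    Fit a least-squares line and evaluate it at the middle of the time span
    (an assumption about what 'midpoint' means here)."""
    n = len(points)
    if n == 0:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return mean_v                       # single instant: no slope to fit
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    times = [t for t, _ in points]
    mid_t = (min(times) + max(times)) / 2
    return mean_v + slope * (mid_t - mean_t)


def categorical_summary(values):
    """Apply the table above to string values; nulls are ignored."""
    values = [v for v in values if v is not None]
    if not values:
        return None
    lo, hi = min(values), max(values)
    # Mid is only defined when every value is the same
    return {"min": lo, "mid": lo if lo == hi else None, "max": hi}


print(linear_midpoint([(0, 1.0), (10, 2.0), (20, 3.0)]))   # 2.0
print(categorical_summary(["scs-be2-2", "scs-be2-2"]))
print(categorical_summary(["a", "b"]))
```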
If the input document does not contain a specified path - or if the value is null - then the value is ignored.
If a checkpoint is specified and the --exclude-remainder flag is used, then all input documents after the last complete checkpoint period are ignored. Where a checkpoint is specified, a --rule flag is also available. If used, individual aggregates are rejected if fewer than 75% of the expected data points are present. In this case, a timedelta must be supplied, indicating the expected interval between the input samples. The interval may be found using the aws_topic_history utility.
WARNING: The sample_aggregate utility uses the first input document to determine the data type for the regressions. If csv_reader is being used to supply data, then the csv_reader's --nullify flag should be used - this will prevent numeric fields being incorrectly identified as strings.
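The arithmetic behind the 75% rule can be illustrated as follows. This is a sketch under stated assumptions: the expected count is taken to be the checkpoint period divided by the sampling interval, and an aggregate with exactly 75% of the expected points is accepted. The function name `has_sufficient_samples` is hypothetical.

```python
from datetime import timedelta


def has_sufficient_samples(sample_count: int, period: timedelta,
                           interval: timedelta, threshold: float = 0.75) -> bool:
    """Return True if the aggregate should be kept under the 75% rule."""
    expected = period / interval            # timedelta / timedelta gives a float
    return sample_count >= threshold * expected


# 15-minute checkpoints with 10-second samples: 90 points expected, 67.5 required
period = timedelta(minutes=15)
interval = timedelta(seconds=10)
print(has_sufficient_samples(68, period, interval))   # True
print(has_sufficient_samples(67, period, interval))   # False
```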
SYNOPSIS
sample_aggregate.py [-p HH:MM:SS [-x] [-r { [DD-]HH:MM[:SS] | :SS }]] [-m] [-i ISO] [-v] [PATH_1..PATH_N]
Options | |
---|---|
--version | show program's version number and exit |
-h, --help | show this help message and exit |
-p CHECKPOINT, --checkpoint=CHECKPOINT | a time specification, e.g. **:/05:00 |
-x, --exclude-remainder | ignore data points after the last complete period |
-r INTERVAL, --rule=INTERVAL | apply 75% rule with sampling INTERVAL |
-m, --min-max | report min and max in addition to midpoint |
-i ISO, --iso-path=ISO | path for ISO 8601 datetime field (default 'rec') |
-v, --verbose | report narrative to stderr |
EXAMPLES
csv_reader.py -n gases.csv | sample_aggregate.py -v -r :10 -p **:/15:00
aws_topic_history.py -v -c super -p 00:00:00 -s 2023-12-01T00:00:00Z -e 2024-01-01T00:00:00Z south-coast-science-production/reference/loc/531/particulates | node.py tag rec ver src val.sfr val.sht exg | sample_aggregate.py | csv_writer.py -e 531-particulates-2023-12.csv
DOCUMENT EXAMPLE - OUTPUT
Without min-max:
{"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z", "val": {"hmd": 51.4, "tmp": 21.1}}
{"tag": "scs-be2-2", "rec": "2018-12-11T10:50:00Z", "val": {"hmd": 51.4, "tmp": 21.2}}
With min-max:
{"tag": "scs-be2-2", "rec": "2018-12-11T10:45:00Z", "val": {"hmd": {"min": 51.4, "mid": 51.4, "max": 51.4}, "tmp": {"min": 21.1, "mid": 21.1, "max": 21.2}}}
{"tag": "scs-be2-2", "rec": "2018-12-11T10:50:00Z", "val": {"hmd": {"min": 51.4, "mid": 51.4, "max": 51.4}, "tmp": {"min": 21.1, "mid": 21.2, "max": 21.2}}}
SEE ALSO
scs_analysis/aws_topic_history
scs_analysis/csv_reader
scs_analysis/node
scs_analysis/node_shift
scs_analysis/sample_average