csv_segmentor - bruno-beloff/scs_analysis GitHub Wiki


DESCRIPTION

The csv_segmentor utility segments an input stream of JSON documents into CSV files whose rows have contiguous datetime values.

Contiguity is defined by the --max-interval flag. If the time interval between a document and the previous document is greater than this interval, then the current CSV file is closed and a new file is opened. File names (and sub-directories) are specified by the --file-prefix flag. The datetime of the first row of each CSV file is appended to the prefix.
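
The contiguity rule can be sketched in a few lines of Python. This is an illustration of the rule as described above, not the utility's own code: the function name segment and the (datetime, payload) tuple layout are assumptions made for the example.

```python
from datetime import datetime, timedelta


def segment(records, max_interval):
    """Split (datetime, payload) pairs into contiguous blocks.

    Hypothetical helper: a new block starts whenever the gap to the
    previous record is greater than max_interval.
    """
    blocks = []
    prev = None
    for rec, payload in records:
        if prev is None or rec - prev > max_interval:
            blocks.append([])          # close the current block, open a new one
        blocks[-1].append((rec, payload))
        prev = rec
    return blocks


rows = [
    (datetime(2019, 1, 4, 10, 4, 41), "a"),
    (datetime(2019, 1, 4, 10, 4, 51), "b"),
    (datetime(2019, 1, 4, 10, 29, 21), "c"),   # 24 min 30 s after the previous row
    (datetime(2019, 1, 4, 10, 29, 31), "d"),
]

blocks = segment(rows, timedelta(minutes=6))
print([len(block) for block in blocks])        # two blocks of two rows each
```

With a six-minute maximum interval, the gap before row "c" starts a second block, which corresponds to a second CSV file in the utility's output.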

The input documents must contain a field carrying an ISO 8601 datetime. If the field in a given document is empty or malformed, the document is ignored. If the field is not present in any document, the csv_segmentor utility terminates.
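
The field handling described above might look like the following sketch. It is not taken from the utility: the function name extract_datetime is hypothetical, only a top-level field is handled (no nested paths), and "not present" is read here as the key being absent from the document, which is an interpretation of the text.

```python
from datetime import datetime, timezone


def extract_datetime(doc, path="rec"):
    """Return the datetime held at doc[path], or None if it is unusable.

    Hypothetical helper. Raises KeyError when the field is absent, so the
    caller can terminate; returns None for empty or malformed values, so
    the caller can skip the document.
    """
    value = doc[path]                      # KeyError: field not present
    if not isinstance(value, str) or not value:
        return None                        # empty field
    try:
        # Older Python versions reject a trailing 'Z' in fromisoformat()
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None                        # malformed field


print(extract_datetime({"rec": "2019-01-01T00:00:01Z"}))
print(extract_datetime({"rec": ""}))
print(extract_datetime({"rec": "yesterday"}))
```

The default path 'rec' matches the default of the --iso-path flag.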

The csv_segmentor utility generates a report giving the specifications of each contiguous block. If no file prefix is given, then the CSV files are not generated, but the report is still produced.

SYNOPSIS

csv_segmentor.py -m { [[DD-]HH:]MM[:SS] | :SS } [-i ISO] [-f FILE_PREFIX] [-v]

Options
--version                                       show program's version number and exit
-h, --help                                      show this help message and exit
-m MAX_INTERVAL, --max-interval=MAX_INTERVAL    maximum permitted interval
-i ISO, --iso-path=ISO                          path for ISO 8601 datetime field (default 'rec')
-f FILE_PREFIX, --file-prefix=FILE_PREFIX       file prefix for contiguous CSVs
-v, --verbose                                   report narrative to stderr

EXAMPLES

csv_reader.py -v scs-bgx-508-gases-2020-Q1.csv | csv_segmentor.py -m 06:00 -f segments/scs-bgx-508-gases-2020-Q1 -v | csv_writer.py -v segments/scs-bgx-508-gases-2020-Q1-report.csv

DOCUMENT EXAMPLE - REPORT OUTPUT

{"start": "2019-01-01T00:00:01Z", "end": "2019-01-04T10:04:51Z", "prev-interval": "", "max-interval": "00-00:00:11", "count": 29550}
{"start": "2019-01-04T10:29:21Z", "end": "2019-01-04T10:37:41Z", "prev-interval": "00-00:24:30", "max-interval": "00-00:00:10", "count": 51}
{"start": "2019-01-04T11:35:49Z", "end": "2019-01-04T11:41:19Z", "prev-interval": "00-00:58:07", "max-interval": "00-00:00:10", "count": 34}

FILES

Output file names are of the form: FILE-PREFIX_BLOCK-START-DATETIME.csv

SEE ALSO

scs_analysis/csv_collator
scs_analysis/csv_reader
scs_analysis/csv_writer

RESOURCES

ISO 8601