sample_duplicates - bruno-beloff/scs_analysis GitHub Wiki



DESCRIPTION

The sample_duplicates utility finds duplicate values in a sequence of input JSON documents, optionally at a specified node path. It is particularly useful when searching for duplicate recording datetimes.

If an input document does not contain the specified path, then it is ignored.

In the default mode, the utility outputs the documents that are duplicates (or that contain duplicate values at the specified path). If the --exclude flag is set, then sample_duplicates instead generates a version of the input data that contains no duplicates.

In the --counts mode, the output report is a sequence of JSON dictionaries, one for each value where duplicates were found, mapping that value to the number of matching documents.
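
The following is a minimal sketch of this duplicate-detection approach, not the utility's actual source. The node() and main() helpers, the mode names, and the exact rule used for --exclude (here, keep only the first occurrence of each value) are assumptions for illustration.

#!/usr/bin/env python3

import json
import sys
from collections import Counter


def node(document, path):
    # descend a dot-separated node path such as "val.hmd"; return None if absent
    value = document
    for key in path.split('.'):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def main(path, mode='default'):
    documents = [json.loads(line) for line in sys.stdin if line.strip()]

    # first pass: count the occurrences of each value found at the node path
    counts = Counter(str(node(doc, path)) for doc in documents
                     if node(doc, path) is not None)

    if mode == 'counts':
        # report one single-field dictionary per duplicated value
        for value, count in counts.items():
            if count > 1:
                print(json.dumps({value: count}))
        return

    seen = set()

    for doc in documents:
        value = node(doc, path)

        if value is None:
            continue                            # documents without the path are ignored

        if mode == 'default' and counts[str(value)] > 1:
            print(json.dumps(doc))              # report every member of a duplicate group

        elif mode == 'exclude' and str(value) not in seen:
            seen.add(str(value))
            print(json.dumps(doc))              # keep the first occurrence only (assumed rule)


if __name__ == '__main__':
    main('val.hmd', 'default')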

SYNOPSIS

sample_duplicates.py [{ -x | -c }] [-v] [PATH]

Options
--version show program's version number and exit
-h, --help show this help message and exit
-x, --exclude output non-duplicate documents only
-c, --counts only list the count of matching documents
-v, --verbose report narrative to stderr

EXAMPLES

csv_reader.py climate.csv | sample_duplicates.py -v val.hmd
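
The counts report shown below can be produced with the same pipeline by adding the -c flag (an illustrative invocation, built from the options listed above):

csv_reader.py climate.csv | sample_duplicates.py -c val.hmd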

DOCUMENT EXAMPLE - OUTPUT

default mode:

{"val": {"hmd": 17.5, "tmp": 25.7}, "rec": "2019-02-25T15:28:18Z", "tag": "scs-bgx-303"} {"val": {"hmd": 17.5, "tmp": 25.7}, "rec": "2019-02-25T15:31:18Z", "tag": "scs-bgx-303"}

counts mode:

{"17.5": 2}