sample_duplicates - bruno-beloff/scs_analysis GitHub Wiki



DESCRIPTION

The sample_duplicates utility finds duplicate values in a sequence of input JSON documents, optionally at a specified node path. It is particularly useful when searching for duplicate recording datetimes.

If an input document does not contain the specified path, then it is ignored.

In the default mode, the utility outputs the documents that are duplicates (or that contain duplicate values at the specified path). If the --exclude flag is set, then sample_duplicates instead generates a version of the input data that contains no duplicates.

In the --counts mode, the output report is a sequence of JSON dictionaries, one for each value where duplicates were found, mapping that value to the number of matching documents.
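
The following is a minimal sketch of this duplicate-detection approach, not the utility's actual source. The node() and main() helpers, the mode names, and the exact rule used for --exclude (here, keep only the first occurrence of each value) are assumptions for illustration.

#!/usr/bin/env python3

import json
import sys
from collections import Counter


def node(document, path):
    # descend a dot-separated node path such as "val.hmd"; return None if absent
    value = document
    for key in path.split('.'):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def main(path, mode='default'):
    documents = [json.loads(line) for line in sys.stdin if line.strip()]

    # first pass: count the occurrences of each value found at the node path
    counts = Counter(str(node(doc, path)) for doc in documents
                     if node(doc, path) is not None)

    if mode == 'counts':
        # report one single-field dictionary per duplicated value
        for value, count in counts.items():
            if count > 1:
                print(json.dumps({value: count}))
        return

    seen = set()

    for doc in documents:
        value = node(doc, path)

        if value is None:
            continue                            # documents without the path are ignored

        if mode == 'default' and counts[str(value)] > 1:
            print(json.dumps(doc))              # report every member of a duplicate group

        elif mode == 'exclude' and str(value) not in seen:
            seen.add(str(value))
            print(json.dumps(doc))              # keep the first occurrence only (assumed rule)


if __name__ == '__main__':
    main('val.hmd', 'default')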

SYNOPSIS

sample_duplicates.py [{ -x | -c }] [-v] [PATH]

Options
--version show program's version number and exit
-h, --help show this help message and exit
-x, --exclude output non-duplicate documents only
-c, --counts only list the count of matching documents
-v, --verbose report narrative to stderr

EXAMPLES

csv_reader.py climate.csv | sample_duplicates.py -v val.hmd
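
The counts report shown below can be produced with the same pipeline by adding the -c flag (an illustrative invocation, built from the options listed above):

csv_reader.py climate.csv | sample_duplicates.py -c val.hmd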

DOCUMENT EXAMPLE - OUTPUT

default mode:

{"val": {"hmd": 17.5, "tmp": 25.7}, "rec": "2019-02-25T15:28:18Z", "tag": "scs-bgx-303"} {"val": {"hmd": 17.5, "tmp": 25.7}, "rec": "2019-02-25T15:31:18Z", "tag": "scs-bgx-303"}

counts mode:

{"17.5": 2}