sample_duplicates - bruno-beloff/scs_analysis GitHub Wiki
DESCRIPTION
The sample_duplicates utility is used to find duplicate values in a sequence of input JSON documents, optionally for a specified node path. It is particularly useful when searching for duplicate recording datetimes.
If an input document does not contain the specified path, then it is ignored.
In the default mode, the utility outputs the documents that were duplicates (or contained duplicate field values). If the --exclude flag is set, then sample_duplicates instead generates a version of the input data that contains no duplicates.
In the --counts mode, the output report is a sequence of JSON dictionaries with a field for each value for which duplicates were found, whose value is the number of matching documents.
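The behaviour can be pictured with the short Python sketch below. This is a minimal illustration, not the scs_analysis implementation: it buffers all of stdin rather than streaming, hard-codes the val.hmd path, and the node_value helper and the exact --exclude / --counts semantics are assumptions drawn from the description above.

```python
#!/usr/bin/env python3
"""A minimal sketch of sample_duplicates-style behaviour (not the scs_analysis code)."""

import json
import sys

from collections import Counter


def node_value(document, path):
    # walk a dot-separated path (e.g. 'val.hmd') through nested dicts;
    # return None if any segment is missing, so the document is ignored
    node = document
    for key in path.split('.'):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def run(documents, path, exclude=False, counts=False):
    values = [node_value(document, path) for document in documents]
    occurrences = Counter(str(value) for value in values if value is not None)

    if counts:
        # counts mode: one field per duplicated value, giving the number of matching documents
        return [{value: count} for value, count in occurrences.items() if count > 1]

    selected = []
    for document, value in zip(documents, values):
        if value is None:
            continue                                # path not present: document is ignored

        duplicated = occurrences[str(value)] > 1

        if duplicated != exclude:                   # default: duplicates only; --exclude: the rest
            selected.append(document)

    return selected


if __name__ == '__main__':
    docs = [json.loads(line) for line in sys.stdin if line.strip()]

    for report in run(docs, 'val.hmd'):             # path hard-coded for the sketch
        print(json.dumps(report))
```

Because a value can only be confirmed as duplicated once all documents have been seen, the sketch makes a counting pass before selecting output documents; the utility's own streaming behaviour may differ.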
SYNOPSIS
sample_duplicates.py [{ -x | -c }] [-v] [PATH]
| Option | Description |
| --- | --- |
| --version | show program's version number and exit |
| -h, --help | show this help message and exit |
| -x, --exclude | output non-duplicate documents only |
| -c, --counts | only list the count of matching documents |
| -v, --verbose | report narrative to stderr |
EXAMPLES
csv_reader.py climate.csv | sample_duplicates.py -v val.hmd
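The same pattern works with the other flags, for example to report only duplicate counts for a path, or to exclude documents with duplicate recording datetimes (illustrative pipelines; the rec field appears in the output example below):

csv_reader.py climate.csv | sample_duplicates.py -v -c val.hmd

csv_reader.py climate.csv | sample_duplicates.py -v -x rec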
DOCUMENT EXAMPLE - OUTPUT
default mode:
{"val": {"hmd": 17.5, "tmp": 25.7}, "rec": "2019-02-25T15:28:18Z", "tag": "scs-bgx-303"} {"val": {"hmd": 17.5, "tmp": 25.7}, "rec": "2019-02-25T15:31:18Z", "tag": "scs-bgx-303"}
counts mode:
{"17.5": 2}