Keys - Titousensei/sisyphus GitHub Wiki
Keys are a powerful family of classes used by Sisyphus. They can be used as indexes (like database indexes), look-up tables, counters, or set operations.
Keys are based on NativeHash family of classes: they store hashes, of the native type long (8 bytes), with an optional value, which can be an int, a long or a double. Since the data in Keys uses native types, they are extremely memory-efficient. (For comparison, java HashSet can hold 10M Long values per GB.)
-
KeySet: just a set of hashes (long), without a lookup value. KeySets are typically used for uniqueness detection, or to remember a list of hashes. KeySets can hold 126M hashes per GB (8.1 bytes per entry in average).
-
KeyMap: the most versatile Key, which keep hash->int relationships. Since the value is an int, it can represent a count of things. That's why KeyMap provides arithmetic operations for the values, such as increment. KeyMap is used to keep counters, indices, 32-bits crcs or ids. KeyMaps can hold 84M hash->int pairs per GB (12.1 bytes per entry in average).
-
KeyBinding: used to keep id-to-id bindings (mappings). KeyBinding keeps hash->hash relationships, and thus does not provide arithmetic operations for its values. KeyBinding can hold 63M hash-hash pairs per GB (16.1 bytes per entry in average).
-
KeyDouble: used for arithmetic calculations in a double. Provides increment and multiply of values. KeyDouble can hold 63M hash->double pairs per GB (16.1 bytes per entry in average).
All these classes use the same base class Key, which provide the basic add, remove, contains and get operations. They can be serialized to disk and loaded from disk (uses java serialization, not compressed). Typically, a 100M-entries KeyBinding is saved in 20s, and loaded in 40s. All Keys can be casted to a KeySet using src.asKeySet(): this will create a new KeySet with a pointer to the hash part of src (a KeyMap, KeyBinding or KeyDouble), without allocating new memory (shared with the original object).
When serializing to disk, the filename recommendations are:
- _key.<keyname>
- _map.<keyname>.<valuename>
- _bind.<keyname>.<valuename>
- _double.<keyname>.<valuename> (This not currently enforced, but might be in the future.)
Alternatively, there is a utility method to load from a TSV file (slower than deserialization):
- Key.loadFromFile(): load 1 or 2 columns from a TSV file
All Keys are iterable, and can be used as an Input using their corresponding Input wrapper: InputKey, InputKeyMap, InputKeyBinding, InputKeyDouble. Their schema is carried automatically.
Keys can also be populated using their corresponding Output wrappers: OutputKey, OutputKeyMap, OutputKeyBinding, OutputKeyDouble.
Set operations can be done with the following utility methods:
- key.filterExclude(Key ref): remove from "key" all hashes in "ref".
- key.filterOnly(Key ref): remove from "key" all hashes not in "ref".
- key.filterExclude(Test tst): remove from "key" all entries for which "tst" returns true.
KeyMap also has a utility function to do a "groupby+count" of the values of a source KeyMap or a KeyBinding. CountValues increment the values where the hashes are the value of the source. The column name of the destination hash must be the same as source value.
- dst.countValues(KeyMap src)
- dst.countValues(KeyBinding src)
Finally, all the Key classes have a main method to query a serialized file. Example:
./run com.thefind.sisyphus.KeyMap <filepath> [<command>] [<list of hashes or values>]
where <command> can be one of:
- key <list of hashes> (default): lookup the entries with hashes in this list
- value <list of values>: find the entries with values in this list
- count <list of values>: "groupby+count" the entries with values in this list
- prompt: open a shell to lookup hashes