Collision ratio comparison - Cyan4973/xxHash GitHub Wiki
Collision study
The tool provided in tests/collisions
is able to directly probe the collision efficiency of 64-bit hashes, by generating a massive amount of hashes, it then counts the number of collisions produced. The test requires a large amount of RAM, and can last a while.
The generated input patterns are relatively close to each others (low hamming distances), with only a few bits of difference. This use case is expected to be challenging to low quality hashes.
The test result indicates the expected nb of collisions for the amount of generated hash values. The tested hash algorithms do not have to produce a strictly identical nb of collisions. Being in the vicinity is good enough. For example, with 100G 64-bit hashes, the expected average amount of collisions is 312.5, but anything between about 280 and 350 collisions is statistically indistinguishable from expectation, so they are all "good", and can't be ranked solely on this information.
Testing 64-bit hashes on large inputs :
An input of 256 bytes is presumed to always trigger the "long" mode for hash algorithms featuring multiple modes depending on input length.
Algorithm | Input Len | Nb Hashes | Expected | Nb Collisions | Notes |
---|---|---|---|---|---|
XXH3 | 256 | 100 Gi | 312.5 | 314 | |
XXH64 | 256 | 100 Gi | 312.5 | 294 | |
XXH128 low 64-bit | 256 | 100 Gi | 312.5 | 314 | |
XXH128 high 64-bit | 512 | 100 Gi | 312.5 | 291 | |
blake2b low 64-bit | 256 | 100 Gi | 312.5 | 296 | very long test : 48h ! |
md5 low 64-bit | 256 | 100 Gi | 312.5 | 301 | very long test : 58h ! |
sha256 low 64 | 256 | 25 Gi | 19.5 | 14 | very long test |
city64 | 256 | 100 Gi | 312.5 | 303 | |
fnv64 | 256 | 100 Gi | 312.5 | 303 | long test : 23h30, compared with 3h20 for XXH3 |
mmh64a | 256 | 100 Gi | 312.5 | 310 | |
highwayhash64 | 256 | 100 Gi | 312.5 | 321 | very long test : 35h ! |
t1ha2 | 256 | 100 Gi | 312.5 | 761 | a bit too many, loses ~1bit of distribution |
wyhash | 256 | 100 Gi | 312.5 | 1202 | a bit too many, loses ~2bits of distribution |
mumv2 | 256 | 100 Gi | 312.5 | 1229 | a bit too many, loses ~2bits of distribution |
fletcher4 | 256 | 1 Gi | 0.0 | 404908304 | is that a hash ? |
The last part of the table contains hash algorithms with less-than-ideal collision probabilities. t1ha2
's distribution is closer to 63-bit, while mumv2
and wyhash
are closer to 62-bit. Such level of inefficiency is not "horrible", these algorithms can still be considered "reasonable" for many practical usages. Below that, on the other hand, listed algorithms should not be considered for any scenario.
Testing 64-bit non-portable hashes :
AES
hashes are based on Intel's AES-NI
instruction extension, introduced alongside SSE4.2
. These algorithms can't run on different platforms.
Algorithm | Type | Input Len | Nb Hashes | Expected | Nb Collisions | Notes |
---|---|---|---|---|---|---|
aquahash low64 | AES | 256 | 100 Gi | 312.5 | 3488136 | unusable |
falkhash low64 | AES | 256 | 100 Gi | 312.5 | 18430 | not good |
meowhash v0.4 low64 | AES | 256 | 100 Gi | 312.5 | 18200 | not good |
ahash | AES | 256 | 100 Gi | 312.5 | 758 | okay (loses -1.5 bit dist.) |
meowhash v0.5 low64 | AES | 256 | 100 Gi | 312.5 | 300 | fine |
AES
implying cryptographic properties, one could believe that it's going to be good enough for simpler properties such as dispersion and randomness.
This assumption appears incorrect, and AES
do not offer any such guarantee "by default".
The interesting Meowhash example shows that, with proper care, an AES
-based hash can provide the expected amount of collisions as predicted by the birthday paradox. But such outcome should not be considered "trivial" to achieve, and one would be well inspired to consult an encryption expert in order to produce a good quality hash using AES
instructions.
Testing 64-bit hashes on small inputs :
The test on input 8 bytes => 8 bytes output can check if there is some "space reduction" happening at this length. If there is no space reduction, then the transformation is a perfect bijection, which guarantees no collisions for 2 different inputs.
Algorithm | Input Len | Nb Hashes | Expected | Nb Collisions | Notes |
---|---|---|---|---|---|
XXH3 | 8 | 100 Gi | 312.5 | 0 | XXH3 is bijective for len==8 . It guarantees no collision. |
XXH64 | 8 | 100 Gi | 312.5 | 0 | XXH64 is also bijective for len==8 |
mmh64a | 8 | 100 Gi | 312.5 | 0 | bijective |
city64 | 8 | 100 Gi | 312.5 | 278 | not bijective, but an acceptable collision ratio |
mumv2 | 8 | 100 Gi | 312.5 | 544 | a bit too large, loses ~1bit of distribution |
t1ha2 | 8 | 100 Gi | 312.5 | 607 | a bit too large, loses ~1bit of distribution |
wyhash | 8 | 100 Gi | 312.5 | 652 | a bit too large, loses ~1bit of distribution |
ahash | 8 | 30 Gi | 28.1 | 3841 | pretty bad, loses ~7bit of distribution |
Other lengths :
Algorithm | Input Len | Nb Hashes | Expected | Nb Collisions | Notes |
---|---|---|---|---|---|
XXH3 | 16 | 100 Gi | 312.5 | 332 | |
XXH3 | 32 | 100 Gi | 312.5 | 287 | |
XXH3 | 1024 | 100 Gi | 312.5 | 300 |
Testing 128-bit hashes :
The only acceptable score for these tests is always 0
. If the hash algorithm offers 128-bit of dispersion, the probability for a single collision to show up is smaller than winning the national lottery twice in a row.
In contrast with the 64-bit tests, due to resource limitation, the test does not provide a precise 128-bit
collision estimation. It merely identifies algorithms with < 80-bit of effective distribution.
Algorithm | Input Len | Nb Hashes | Expected | Nb Collisions | Notes |
---|---|---|---|---|---|
XXH128 | 256 | 100 Gi | 0.0 | 0 | |
cityhash128 | 256 | 100 Gi | 0.0 | 0 | |
t1ha2 128-bit | 256 | 100 Gi | 0.0 | 14 | distribution equivalent to ~70 bits |
Tests on small input lengths
Algorithm | Input Len | Nb Hashes | Expected | Nb Collisions | Notes |
---|---|---|---|---|---|
XXH128 | 16 | 25 Gi | 0.0 | 0 | range 9-16 |
XXH128 | 32 | 25 Gi | 0.0 | 0 | range 17-128 |
XXH128 | 100 | 13 Gi | 0.0 | 0 | range 17-128 |
XXH128 | 200 | 13 Gi | 0.0 | 0 | range 129-240 |
XXH128 | 1024 | 100 Gi | 0.0 | 0 |