Collision ratio comparison - Cyan4973/xxHash GitHub Wiki

Collision study

The tool provided in tests/collisions is able to directly probe the collision efficiency of 64-bit hashes, by generating a massive amount of hashes, it then counts the number of collisions produced. The test requires a large amount of RAM, and can last a while.

The generated input patterns are relatively close to each others (low hamming distances), with only a few bits of difference. This use case is expected to be challenging to low quality hashes.

The test result indicates the expected nb of collisions for the amount of generated hash values. The tested hash algorithms do not have to produce a strictly identical nb of collisions. Being in the vicinity is good enough. For example, with 100G 64-bit hashes, the expected average amount of collisions is 312.5, but anything between about 280 and 350 collisions is statistically indistinguishable from expectation, so they are all "good", and can't be ranked solely on this information.

Testing 64-bit hashes on large inputs :

An input of 256 bytes is presumed to always trigger the "long" mode for hash algorithms featuring multiple modes depending on input length.

Algorithm Input Len Nb Hashes Expected Nb Collisions Notes
XXH3 256 100 Gi 312.5 314
XXH64 256 100 Gi 312.5 294
XXH128 low 64-bit 256 100 Gi 312.5 314
XXH128 high 64-bit 512 100 Gi 312.5 291
blake2b low 64-bit 256 100 Gi 312.5 296 very long test : 48h !
md5 low 64-bit 256 100 Gi 312.5 301 very long test : 58h !
sha256 low 64 256 25 Gi 19.5 14 very long test
city64 256 100 Gi 312.5 303
fnv64 256 100 Gi 312.5 303 long test : 23h30, compared with 3h20 for XXH3
mmh64a 256 100 Gi 312.5 310
highwayhash64 256 100 Gi 312.5 321 very long test : 35h !
t1ha2 256 100 Gi 312.5 761 a bit too many, loses ~1bit of distribution
wyhash 256 100 Gi 312.5 1202 a bit too many, loses ~2bits of distribution
mumv2 256 100 Gi 312.5 1229 a bit too many, loses ~2bits of distribution
fletcher4 256 1 Gi 0.0 404908304 is that a hash ?

The last part of the table contains hash algorithms with less-than-ideal collision probabilities. t1ha2's distribution is closer to 63-bit, while mumv2 and wyhash are closer to 62-bit. Such level of inefficiency is not "horrible", these algorithms can still be considered "reasonable" for many practical usages. Below that, on the other hand, listed algorithms should not be considered for any scenario.

Testing 64-bit non-portable hashes :

AES hashes are based on Intel's AES-NI instruction extension, introduced alongside SSE4.2. These algorithms can't run on different platforms.

Algorithm Type Input Len Nb Hashes Expected Nb Collisions Notes
aquahash low64 AES 256 100 Gi 312.5 3488136 unusable
falkhash low64 AES 256 100 Gi 312.5 18430 not good
meowhash v0.4 low64 AES 256 100 Gi 312.5 18200 not good
ahash AES 256 100 Gi 312.5 758 okay (loses -1.5 bit dist.)
meowhash v0.5 low64 AES 256 100 Gi 312.5 300 fine

AES implying cryptographic properties, one could believe that it's going to be good enough for simpler properties such as dispersion and randomness. This assumption appears incorrect, and AES do not offer any such guarantee "by default". The interesting Meowhash example shows that, with proper care, an AES-based hash can provide the expected amount of collisions as predicted by the birthday paradox. But such outcome should not be considered "trivial" to achieve, and one would be well inspired to consult an encryption expert in order to produce a good quality hash using AES instructions.

Testing 64-bit hashes on small inputs :

The test on input 8 bytes => 8 bytes output can check if there is some "space reduction" happening at this length. If there is no space reduction, then the transformation is a perfect bijection, which guarantees no collisions for 2 different inputs.

Algorithm Input Len Nb Hashes Expected Nb Collisions Notes
XXH3 8 100 Gi 312.5 0 XXH3 is bijective for len==8. It guarantees no collision.
XXH64 8 100 Gi 312.5 0 XXH64 is also bijective for len==8
mmh64a 8 100 Gi 312.5 0 bijective
city64 8 100 Gi 312.5 278 not bijective, but an acceptable collision ratio
mumv2 8 100 Gi 312.5 544 a bit too large, loses ~1bit of distribution
t1ha2 8 100 Gi 312.5 607 a bit too large, loses ~1bit of distribution
wyhash 8 100 Gi 312.5 652 a bit too large, loses ~1bit of distribution
ahash 8 30 Gi 28.1 3841 pretty bad, loses ~7bit of distribution
Other lengths :
Algorithm Input Len Nb Hashes Expected Nb Collisions Notes
XXH3 16 100 Gi 312.5 332
XXH3 32 100 Gi 312.5 287
XXH3 1024 100 Gi 312.5 300

Testing 128-bit hashes :

The only acceptable score for these tests is always 0. If the hash algorithm offers 128-bit of dispersion, the probability for a single collision to show up is smaller than winning the national lottery twice in a row. In contrast with the 64-bit tests, due to resource limitation, the test does not provide a precise 128-bit collision estimation. It merely identifies algorithms with < 80-bit of effective distribution.

Algorithm Input Len Nb Hashes Expected Nb Collisions Notes
XXH128 256 100 Gi 0.0 0
cityhash128 256 100 Gi 0.0 0
t1ha2 128-bit 256 100 Gi 0.0 14 distribution equivalent to ~70 bits

Tests on small input lengths

Algorithm Input Len Nb Hashes Expected Nb Collisions Notes
XXH128 16 25 Gi 0.0 0 range 9-16
XXH128 32 25 Gi 0.0 0 range 17-128
XXH128 100 13 Gi 0.0 0 range 17-128
XXH128 200 13 Gi 0.0 0 range 129-240
XXH128 1024 100 Gi 0.0 0