
Hash Codes

Introduction:

We have studied hash tables that associate data with integer keys made up of w bits. Often, however, we have keys that are not integers: they could be strings, objects, arrays, or other compound structures. To use hash tables for these types of data, we must map them to w-bit hash codes. A hash code mapping should have the following properties:

  1. If x and y are equal, then x.hashCode() and y.hashCode() are equal.
  2. If x and y are not equal, then the probability that x.hashCode() = y.hashCode() should be small (close to 1/2^w).
The first property ensures that if we store x in a hash table and later look up a value, y, equal to x, then we will find x. The second property minimises the loss from converting our objects to integers: it ensures that unequal objects usually have different hash codes and are therefore likely to be stored at different locations in a hash table.
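
To see what the first property demands in Java terms, here is a minimal sketch; the Point class and its fields are made up for illustration. Its hashCode() is computed only from the fields compared by equals(), so equal objects always produce equal hash codes.

```java
// A made-up example: equal Points must produce equal hash codes (property 1).
public class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // derived only from the fields compared in equals(), so equal
        // Points always return the same value
        return 31 * x + y;
    }
}
```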

Hash Codes for Primitive Data Types

It is generally simple to find hash codes for the smaller primitive types such as byte, char, float and int. These data types always have a binary representation, and this representation often consists of w or fewer bits. (In Java, byte is an 8-bit type and float is a 32-bit type.) In these cases, we can simply treat the bits as the representation of an integer in the range {0,...,2^w - 1}. If two values are different, they get different hash codes; if they are the same, they get the same hash code.
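
For example, a sketch of this idea using standard library methods that expose these bit patterns (Float.floatToIntBits returns the 32-bit IEEE 754 representation of a float) might look like this:

```java
// Hash codes for small primitives: reinterpret the value's bits as an int.
public class PrimitiveHashes {
    public static void main(String[] args) {
        byte b = 42;
        char c = 'A';
        float f = 1.5f;

        int hb = b;                        // a byte's 8 bits already fit in an int
        int hc = c;                        // likewise for a char's 16 bits
        int hf = Float.floatToIntBits(f);  // the float's 32-bit pattern as an int

        System.out.println(hb + " " + hc + " " + hf);
    }
}
```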

Several primitive data types are made up of more than w bits, usually c·w bits for some constant integer c. (In Java, the long and double types are examples of this with c = 2.) These data types can be treated as compound objects made of c parts.
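
For instance, with w = 32 a long can be viewed as a compound object with c = 2 parts, its high and low 32-bit halves, which can then be combined using the method described in the next section. A minimal sketch of the split:

```java
// Split a 64-bit long (c = 2, w = 32) into two w-bit parts.
public class LongParts {
    static int[] parts(long v) {
        int hi = (int) (v >>> 32);         // high-order 32 bits
        int lo = (int) (v & 0xffffffffL);  // low-order 32 bits
        return new int[] { hi, lo };
    }

    public static void main(String[] args) {
        long v = 0x0123456789abcdefL;
        int[] p = parts(v);
        System.out.printf("%08x %08x%n", p[0], p[1]);
    }
}
```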

Hash Codes for Compound Objects

For a compound object, we want to create a hash code by combining the individual hash codes of its constituent parts. This is not quite as simple as it sounds. Although there are many hacks for this (such as combining the hash codes via bitwise xor operations), many of them will have a large set of objects that share the same hash code. However, there are simple and robust methods available through arithmetic with 2w bits of precision. As an example, suppose we have an object made up of parts P_0,...,P_{r-1} whose hash codes are x_0,...,x_{r-1}. We can then choose mutually independent random w-bit integers z_0,...,z_{r-1} and a random 2w-bit odd integer z and compute a hash code for the object with

$$h(x_0,\ldots,x_{r-1}) = \Big(\big((z_0 x_0 + \cdots + z_{r-1} x_{r-1}) \bmod 2^{2w}\big)\, z \bmod 2^{2w}\Big) \,\mathrm{div}\, 2^w$$

It should be noted that this hash code has a final step, multiplying by z and dividing by 2^w, that uses the multiplicative hash function from the Chained Hash Tables section. This step takes the 2w-bit intermediate result and reduces it to a w-bit final result. Here is a code example of this method applied to a simple compound object with three parts, x0, x1 and x2:

```java
int hashCode() {
    // random numbers from rand.org
    long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL};
    long zz = 0xbea0107e5067d19dL;
    // convert the (unsigned) hash codes of the three parts to longs
    long h0 = x0.hashCode() & ((1L << 32) - 1);
    long h1 = x1.hashCode() & ((1L << 32) - 1);
    long h2 = x2.hashCode() & ((1L << 32) - 1);
    // combine the parts, multiply by zz (all mod 2^64) and keep the high 32 bits
    return (int) (((z[0] * h0 + z[1] * h1 + z[2] * h2) * zz) >>> 32);
}
```

The following theorem shows that, in addition to being straightforward to implement, this method is provably robust:

Theorem h-Codes.1

Let x_0,...,x_{r-1} and y_0,...,y_{r-1} each be sequences of w-bit integers in {0,...,2^w - 1}, and assume x_i ≠ y_i for at least one index i ∈ {0,...,r-1}. Then Pr{h(x_0,...,x_{r-1}) = h(y_0,...,y_{r-1})} ≤ 3/2^w.

Proof:

First, define the 2w-bit intermediate result

$$h'(x_0,\ldots,x_{r-1}) = (z_0 x_0 + \cdots + z_{r-1} x_{r-1}) \bmod 2^{2w},$$

and suppose that h'(x_0,...,x_{r-1}) = h'(y_0,...,y_{r-1}). This equation can be rewritten as

$$z_i(x_i - y_i) \bmod 2^{2w} = t,$$

where

$$t = \Big(\sum_{j \ne i} z_j (y_j - x_j)\Big) \bmod 2^{2w}.$$

Assuming, without loss of generality, that x_i > y_i, this becomes simply z_i(x_i - y_i) = t (with no modular reduction), since each of z_i and (x_i - y_i) is at most 2^w - 1, so their product is at most 2^{2w} - 2^{w+1} + 1 < 2^{2w} - 1. By assumption, x_i - y_i ≠ 0, so this equation has at most one solution in z_i. Therefore, since z_i and t are independent (z_0,...,z_{r-1} are mutually independent), the probability that we select z_i so that h'(x_0,...,x_{r-1}) = h'(y_0,...,y_{r-1}) is at most 1/2^w.

The last step of the hash function applies multiplicative hashing to reduce the 2w-bit intermediate result h'(x_0,...,x_{r-1}) to a w-bit final result h(x_0,...,x_{r-1}). By the analysis of multiplicative hashing from the Chained Hash Tables section, if h'(x_0,...,x_{r-1}) ≠ h'(y_0,...,y_{r-1}), then Pr{h(x_0,...,x_{r-1}) = h(y_0,...,y_{r-1})} ≤ 2/2^w.

In summary,

$$\begin{aligned}
\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\}
  &\le \Pr\{h'(x_0,\ldots,x_{r-1}) = h'(y_0,\ldots,y_{r-1})\} \\
  &\quad + \Pr\{h'(x_0,\ldots,x_{r-1}) \ne h'(y_0,\ldots,y_{r-1})
        \text{ and } h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \\
  &\le 1/2^w + 2/2^w = 3/2^w.
\end{aligned}$$

Hash Codes for Arrays and Strings

The previous method works well for objects that have a fixed, constant number of components. However, since it requires a random w-bit integer z_i for each component, it breaks down when we want to use it for objects with a variable number of components. We could use a pseudorandom sequence to generate as many z_i's as we need, but then the z_i's are not mutually independent, and it becomes difficult to prove that the pseudorandom numbers do not interact badly with the hash function being used. In particular, the values of t and z_i in the proof of Theorem h-Codes.1 are no longer independent.

A more rigorous approach is to base our hash codes on polynomials over prime fields; these are just regular polynomials that are evaluated modulo some prime number p. This method is based on the following theorem, which says that polynomials over prime fields behave much like usual polynomials:

Theorem h-Codes.2:

Let p be a prime number, and let f(z) = x_0 z^0 + x_1 z^1 + ... + x_{r-1} z^{r-1} be a non-trivial polynomial with coefficients x_i ∈ {0,...,p-1}. Then the equation f(z) mod p = 0 has at most r-1 solutions for z ∈ {0,...,p-1}.
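
For intuition, here is a small brute-force check of the theorem for a toy polynomial; the prime p = 13 and the coefficients are arbitrary choices for illustration:

```java
// Count the roots of f(z) = 1 + 2z + 3z^2 modulo p = 13.
// Theorem h-Codes.2 says a non-trivial polynomial of this form (r = 3)
// can have at most r - 1 = 2 roots in {0,...,p-1}.
public class PolyRoots {
    public static void main(String[] args) {
        int p = 13;
        int[] coeff = {1, 2, 3};   // x_0, x_1, x_2
        int roots = 0;
        for (int z = 0; z < p; z++) {
            int f = 0, zi = 1;     // zi holds z^i mod p
            for (int c : coeff) {
                f = (f + c * zi) % p;
                zi = (zi * z) % p;
            }
            if (f == 0) roots++;
        }
        System.out.println(roots + " root(s); at most 2 are possible");
    }
}
```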

To use this theorem, we hash a sequence of integers x_0,...,x_{r-1}, with each x_i ∈ {0,...,p-2}, using a random integer z ∈ {0,...,p-1} via the formula

$$h(x_0,\ldots,x_{r-1}) = \left(x_0 z^0 + \cdots + x_{r-1} z^{r-1} + (p-1)z^r\right) \bmod p.$$

It should be noted that the extra (p-1)z^r term at the end of the formula acts as a last element, x_r = p-1, appended to the sequence x_0,...,x_{r-1}. This element differs from every other element in the sequence (each of which is in the set {0,...,p-2}), so we can think of p-1 as a marker for the end of the sequence.
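
Below is a minimal sketch of how this polynomial hash might be implemented for an Object[] sequence, assuming w = 32 and the prime p = 2^32 - 5; the constants z and z2 stand in for random values, and each component's hashCode() is first reduced to 31 bits by multiplicative hashing with d = 31 so that every intermediate product fits in a long (this is the trade-off the caveat at the end of this page refers to):

```java
// A sketch of the prime-field polynomial hash for a sequence of objects.
public class SequenceHash {
    static final long p  = (1L << 32) - 5; // the prime modulus, 2^32 - 5
    static final long z  = 0x64b6055aL;    // stand-in for a random value in {0,...,p-1}
                                           // (kept below 2^31 so zi * z fits in a long)
    static final int  z2 = 0x5067d19d;     // stand-in for a random odd 32-bit multiplier

    static int hashCode(Object[] x) {
        long s  = 0; // value of the polynomial evaluated at z, modulo p
        long zi = 1; // z^i mod p
        for (int i = 0; i < x.length; i++) {
            // multiplicative hashing with d = 31: reduce the component's hash code
            // to a 31-bit coefficient, which is guaranteed to lie in {0,...,p-2}
            long xi = (x[i].hashCode() * z2) >>> 1;
            s  = (s + zi * xi) % p;
            zi = (zi * z) % p;
        }
        // add the end-of-sequence marker coefficient p-1:
        // zi*(p-1) mod p equals (p - zi) mod p, written this way to avoid overflow
        s = (s + (p - zi)) % p;
        return (int) s;
    }

    public static void main(String[] args) {
        Object[] a = {"h", "a", "s", "h"};
        Object[] b = {"h", "a", "s"};   // a proper prefix of a
        // thanks to the marker term, a prefix hashes differently from the full sequence
        System.out.println(hashCode(a) + " " + hashCode(b));
    }
}
```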

The next theorem, which considers the case of two sequences of the same length, shows that this hash function gives a good return for the small amount of randomisation needed to choose z:

Theorem h-Codes.3:

Let p > 2^w + 1 be a prime, let x_0,...,x_{r-1} and y_0,...,y_{r-1} each be sequences of w-bit integers in {0,...,2^w - 1}, and assume x_i ≠ y_i for at least one index i ∈ {0,...,r-1}. Then

$$\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \le (r-1)/p.$$

Proof:

The equation h(x_0,...,x_{r-1}) = h(y_0,...,y_{r-1}) can be rewritten as ((x_0 - y_0)z^0 + ... + (x_{r-1} - y_{r-1})z^{r-1}) mod p = 0.

Since x_i ≠ y_i for some i, this polynomial is non-trivial. Therefore, by Theorem h-Codes.2, it has at most r-1 solutions in z. The probability that we pick z to be one of these solutions is therefore at most (r-1)/p.

It should be noted that this hash function also deals with the case of two sequences of different lengths, even when one of the sequences is a prefix of the other. This is because the function effectively hashes the infinite sequence x_0,...,x_{r-1}, p-1, 0, 0, ... .

So, it guarantees that if we have two sequences of length r and r' with r > r', then these two sequences differ at index i = r. In this case, the equation ((x_0 - y_0)z^0 + ... + (x_{r-1} - y_{r-1})z^{r-1}) mod p = 0 becomes

$$\left(\sum_{i=0}^{r'-1}(x_i - y_i)z^i + \big(x_{r'} - (p-1)\big)z^{r'} + \sum_{i=r'+1}^{r-1} x_i z^i + (p-1)z^r\right) \bmod p = 0,$$

which, according to Theorem h-Codes.2, has at most r solutions in z. Combining this with Theorem h-Codes.3 proves the following, more general theorem.

Theorem h-Codes.4:

Let p > 2^w + 1 be a prime, and let x_0,...,x_{r-1} and y_0,...,y_{r'-1} be distinct sequences of w-bit integers in {0,...,2^w - 1}. Then Pr{h(x_0,...,x_{r-1}) = h(y_0,...,y_{r'-1})} ≤ max{r, r'}/p.

Caveat:

It should be noted that the coded implementation sacrifices some collision probability for the sake of convenience. In particular, it applies the multiplicative hash function with d = 31 to reduce each x[i].hashCode() to a 31-bit value, so that the additions and multiplications done modulo the prime p = 2^32 - 5 fit into 64-bit arithmetic without overflow. As a result, the collision probability is slightly larger than the max{r, r'}/(2^32 - 5) bound specified in Theorem h-Codes.4.
