
Hash Codes

Introduction:

We have studied hash tables that associate data with integer keys made up of w bits. Often, however, we have keys that are not integers: they could be strings, objects, arrays, or other compound structures. To use hash tables for these types of data, we must map them to w-bit hash codes. A hash code mapping should have the following properties:

  1. If x and y are equal, then x.hashCode() and y.hashCode() are equal.
  2. If x and y are not equal, then the probability that x.hashCode() = y.hashCode() should be small (close to 1/2^w).
The first property ensures that if we store x in a hash table and later look up a value, y, equal to x, then we will find x. The second property minimises the loss from converting our objects to integers: it ensures that unequal objects usually have different hash codes and are therefore likely to be stored at different locations in a hash table.
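
To see what the first property demands in Java terms, here is a minimal sketch; the Point class and its fields are made up for illustration. Its hashCode() is computed only from the fields compared by equals(), so equal objects always produce equal hash codes.

```java
// A made-up example: equal Points must produce equal hash codes (property 1).
public class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // derived only from the fields compared in equals(), so equal
        // Points always return the same value
        return 31 * x + y;
    }
}
```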

Hash Codes for Primitive Data Types

It is generally simple to find hash codes for the smaller primitive types such as byte, char, float and int. These data types always have a binary representation, and this representation often consists of w or fewer bits. (In Java, byte is an 8-bit type and float is a 32-bit type.) In these cases, we can simply treat the bits as the representation of an integer in the range {0,...,2^w - 1}. If two values are different, they get different hash codes; if they are the same, they get the same hash code.
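
For example, a sketch of this idea using standard library methods that expose these bit patterns (Float.floatToIntBits returns the 32-bit IEEE 754 representation of a float) might look like this:

```java
// Hash codes for small primitives: reinterpret the value's bits as an int.
public class PrimitiveHashes {
    public static void main(String[] args) {
        byte b = 42;
        char c = 'A';
        float f = 1.5f;

        int hb = b;                        // a byte's 8 bits already fit in an int
        int hc = c;                        // likewise for a char's 16 bits
        int hf = Float.floatToIntBits(f);  // the float's 32-bit pattern as an int

        System.out.println(hb + " " + hc + " " + hf);
    }
}
```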

Several primitive data types are made up of more than w bits, usually c·w bits for some constant integer c. (In Java, the long and double types are examples of this with c = 2.) These data types can be treated as compound objects made of c parts.
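
For instance, with w = 32 a long can be viewed as a compound object with c = 2 parts, its high and low 32-bit halves, which can then be combined using the method described in the next section. A minimal sketch of the split:

```java
// Split a 64-bit long (c = 2, w = 32) into two w-bit parts.
public class LongParts {
    static int[] parts(long v) {
        int hi = (int) (v >>> 32);         // high-order 32 bits
        int lo = (int) (v & 0xffffffffL);  // low-order 32 bits
        return new int[] { hi, lo };
    }

    public static void main(String[] args) {
        long v = 0x0123456789abcdefL;
        int[] p = parts(v);
        System.out.printf("%08x %08x%n", p[0], p[1]);
    }
}
```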

Hash Codes for Compound Objects

For a compound object, we want to create a hash code by combining the individual hash codes of its constituent parts. This is not quite as simple as it sounds. Although there are many hacks for this (such as combining the hash codes via bitwise xor operations), many of them will have a large set of objects that share the same hash code. However, there are simple and robust methods available through arithmetic with 2w bits of precision. As an example, suppose we have an object made up of parts P_0,...,P_{r-1} whose hash codes are x_0,...,x_{r-1}. We can then choose mutually independent random w-bit integers z_0,...,z_{r-1} and a random 2w-bit odd integer z and compute a hash code for the object with

$$h(x_0,\ldots,x_{r-1}) = \Big(\big((z_0 x_0 + \cdots + z_{r-1} x_{r-1}) \bmod 2^{2w}\big)\, z \bmod 2^{2w}\Big) \,\mathrm{div}\, 2^w$$

It should be noted that this hash code has a final step, multiplying by z and dividing by 2^w, that uses the multiplicative hash function from the Chained Hash Tables section. This step takes the 2w-bit intermediate result and reduces it to a w-bit final result. Here is a code example of this method applied to a simple compound object with three parts, x0, x1 and x2:

```java
int hashCode() {
    // random numbers from rand.org
    long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL};
    long zz = 0xbea0107e5067d19dL;
    // convert the (unsigned) hash codes of the three parts to longs
    long h0 = x0.hashCode() & ((1L << 32) - 1);
    long h1 = x1.hashCode() & ((1L << 32) - 1);
    long h2 = x2.hashCode() & ((1L << 32) - 1);
    // combine the parts, multiply by zz (all mod 2^64) and keep the high 32 bits
    return (int) (((z[0] * h0 + z[1] * h1 + z[2] * h2) * zz) >>> 32);
}
```

The following theorem shows that, in addition to being straightforward to implement, this method is provably robust:

Theorem h-Codes.1

Let x_0,...,x_{r-1} and y_0,...,y_{r-1} each be sequences of w-bit integers in {0,...,2^w - 1}, and assume x_i ≠ y_i for at least one index i ∈ {0,...,r-1}. Then Pr{h(x_0,...,x_{r-1}) = h(y_0,...,y_{r-1})} ≤ 3/2^w.

Proof:

First, define the 2w-bit intermediate result

$$h'(x_0,\ldots,x_{r-1}) = (z_0 x_0 + \cdots + z_{r-1} x_{r-1}) \bmod 2^{2w},$$

and suppose that h'(x_0,...,x_{r-1}) = h'(y_0,...,y_{r-1}). This equation can be rewritten as

$$z_i(x_i - y_i) \bmod 2^{2w} = t,$$

where

$$t = \Big(\sum_{j \ne i} z_j (y_j - x_j)\Big) \bmod 2^{2w}.$$

Assuming, without loss of generality, that x_i > y_i, this becomes simply z_i(x_i - y_i) = t (with no modular reduction), since each of z_i and (x_i - y_i) is at most 2^w - 1, so their product is at most 2^{2w} - 2^{w+1} + 1 < 2^{2w} - 1. By assumption, x_i - y_i ≠ 0, so this equation has at most one solution in z_i. Therefore, since z_i and t are independent (z_0,...,z_{r-1} are mutually independent), the probability that we select z_i so that h'(x_0,...,x_{r-1}) = h'(y_0,...,y_{r-1}) is at most 1/2^w.

The last step of the hash function applies multiplicative hashing to reduce the 2w-bit intermediate result h'(x_0,...,x_{r-1}) to a w-bit final result h(x_0,...,x_{r-1}). By the analysis of multiplicative hashing from the Chained Hash Tables section, if h'(x_0,...,x_{r-1}) ≠ h'(y_0,...,y_{r-1}), then Pr{h(x_0,...,x_{r-1}) = h(y_0,...,y_{r-1})} ≤ 2/2^w.

In summary,

$$\begin{aligned}
\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\}
  &\le \Pr\{h'(x_0,\ldots,x_{r-1}) = h'(y_0,\ldots,y_{r-1})\} \\
  &\quad + \Pr\{h'(x_0,\ldots,x_{r-1}) \ne h'(y_0,\ldots,y_{r-1})
        \text{ and } h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \\
  &\le 1/2^w + 2/2^w = 3/2^w.
\end{aligned}$$

Hash Codes for Arrays and Strings

The previous method works well for objects that have a fixed, constant number of components. However, since it requires a random w-bit integer z_i for each component, it breaks down when we want to use it for objects with a variable number of components. We could use a pseudorandom sequence to generate as many z_i's as we need, but then the z_i's are not mutually independent, and it becomes difficult to prove that the pseudorandom numbers do not interact badly with the hash function being used. In particular, the values of t and z_i in the proof of Theorem h-Codes.1 are no longer independent.

A more rigorous approach is to base our hash codes on polynomials over prime fields; these are just regular polynomials that are evaluated modulo some prime number p. This method is based on the following theorem, which says that polynomials over prime fields behave much like usual polynomials:

Theorem h-Codes.2:

Let p be a prime number, and let f(z) = x_0 z^0 + x_1 z^1 + ... + x_{r-1} z^{r-1} be a non-trivial polynomial with coefficients x_i ∈ {0,...,p-1}. Then the equation f(z) mod p = 0 has at most r-1 solutions for z ∈ {0,...,p-1}.
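
For intuition, here is a small brute-force check of the theorem for a toy polynomial; the prime p = 13 and the coefficients are arbitrary choices for illustration:

```java
// Count the roots of f(z) = 1 + 2z + 3z^2 modulo p = 13.
// Theorem h-Codes.2 says a non-trivial polynomial of this form (r = 3)
// can have at most r - 1 = 2 roots in {0,...,p-1}.
public class PolyRoots {
    public static void main(String[] args) {
        int p = 13;
        int[] coeff = {1, 2, 3};   // x_0, x_1, x_2
        int roots = 0;
        for (int z = 0; z < p; z++) {
            int f = 0, zi = 1;     // zi holds z^i mod p
            for (int c : coeff) {
                f = (f + c * zi) % p;
                zi = (zi * z) % p;
            }
            if (f == 0) roots++;
        }
        System.out.println(roots + " root(s); at most 2 are possible");
    }
}
```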

To use this theorem, we hash a sequence of integers x_0,...,x_{r-1}, with each x_i ∈ {0,...,p-2}, using a random integer z ∈ {0,...,p-1} via the formula

$$h(x_0,\ldots,x_{r-1}) = \left(x_0 z^0 + \cdots + x_{r-1} z^{r-1} + (p-1)z^r\right) \bmod p.$$

It should be noted that the extra (p-1)z^r term at the end of the formula acts as a last element, x_r = p-1, appended to the sequence x_0,...,x_{r-1}. This element differs from every other element in the sequence (each of which is in the set {0,...,p-2}), so we can think of p-1 as a marker for the end of the sequence.
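
Below is a minimal sketch of how this polynomial hash might be implemented for an Object[] sequence, assuming w = 32 and the prime p = 2^32 - 5; the constants z and z2 stand in for random values, and each component's hashCode() is first reduced to 31 bits by multiplicative hashing with d = 31 so that every intermediate product fits in a long (this is the trade-off the caveat at the end of this page refers to):

```java
// A sketch of the prime-field polynomial hash for a sequence of objects.
public class SequenceHash {
    static final long p  = (1L << 32) - 5; // the prime modulus, 2^32 - 5
    static final long z  = 0x64b6055aL;    // stand-in for a random value in {0,...,p-1}
                                           // (kept below 2^31 so zi * z fits in a long)
    static final int  z2 = 0x5067d19d;     // stand-in for a random odd 32-bit multiplier

    static int hashCode(Object[] x) {
        long s  = 0; // value of the polynomial evaluated at z, modulo p
        long zi = 1; // z^i mod p
        for (int i = 0; i < x.length; i++) {
            // multiplicative hashing with d = 31: reduce the component's hash code
            // to a 31-bit coefficient, which is guaranteed to lie in {0,...,p-2}
            long xi = (x[i].hashCode() * z2) >>> 1;
            s  = (s + zi * xi) % p;
            zi = (zi * z) % p;
        }
        // add the end-of-sequence marker coefficient p-1:
        // zi*(p-1) mod p equals (p - zi) mod p, written this way to avoid overflow
        s = (s + (p - zi)) % p;
        return (int) s;
    }

    public static void main(String[] args) {
        Object[] a = {"h", "a", "s", "h"};
        Object[] b = {"h", "a", "s"};   // a proper prefix of a
        // thanks to the marker term, a prefix hashes differently from the full sequence
        System.out.println(hashCode(a) + " " + hashCode(b));
    }
}
```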

The next theorem, which considers the case of two sequences of the same length, shows that this hash function gives a good return for the small amount of randomisation needed to choose z:

Theorem h-Codes.3:

Let p > 2^w + 1 be a prime, let x_0,...,x_{r-1} and y_0,...,y_{r-1} each be sequences of w-bit integers in {0,...,2^w - 1}, and assume x_i ≠ y_i for at least one index i ∈ {0,...,r-1}. Then

$$\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \le (r-1)/p.$$

Proof:

The equation h(x_0,...,x_{r-1}) = h(y_0,...,y_{r-1}) can be rewritten as ((x_0 - y_0)z^0 + ... + (x_{r-1} - y_{r-1})z^{r-1}) mod p = 0.

Since x_i ≠ y_i for some i, this polynomial is non-trivial. Therefore, by Theorem h-Codes.2, it has at most r-1 solutions in z. The probability that we pick z to be one of these solutions is therefore at most (r-1)/p.

It should be noted that this hash function also deals with the case of two sequences of different lengths, even when one of the sequences is a prefix of the other. This is because the function effectively hashes the infinite sequence x_0,...,x_{r-1}, p-1, 0, 0, ... .

So, it guarantees that if we have two sequences of length r and r' with r > r', then these two sequences differ at index i = r. In this case, the equation ((x_0 - y_0)z^0 + ... + (x_{r-1} - y_{r-1})z^{r-1}) mod p = 0 becomes

$$\left(\sum_{i=0}^{r'-1}(x_i - y_i)z^i + \big(x_{r'} - (p-1)\big)z^{r'} + \sum_{i=r'+1}^{r-1} x_i z^i + (p-1)z^r\right) \bmod p = 0,$$

which, according to Theorem h-Codes.2, has at most r solutions in z. Combining this with Theorem h-Codes.3 proves the following, more general theorem.

Theorem h-Codes.4:

Let p > 2^w + 1 be a prime, and let x_0,...,x_{r-1} and y_0,...,y_{r'-1} be distinct sequences of w-bit integers in {0,...,2^w - 1}. Then Pr{h(x_0,...,x_{r-1}) = h(y_0,...,y_{r'-1})} ≤ max{r, r'}/p.

Caveat:

It should be noted that the coded implementation sacrifices some collision probability for the sake of convenience. In particular, it applies the multiplicative hash function with d = 31 to reduce each x[i].hashCode() to a 31-bit value, so that the additions and multiplications done modulo the prime p = 2^32 - 5 fit into 64-bit arithmetic without overflow. As a result, the collision probability is slightly larger than the max{r, r'}/(2^32 - 5) bound specified in Theorem h-Codes.4.
