Hash Codes - WilfullMurder/DataStructures-Java GitHub Wiki
We studied hash tables that are used to associate data with integer keys comprised of w bits. Often, we have keys that are not integers. They could be strings, objects, arrays or other compound structures. In order to use hash tables for these types of data, we must map these data types to w-bit hash codes. Hash code mapping should have the following properties:
- If x and y are equal then x.hashCode() and y.hashCode() are equal.
- If x and y are not equal, then the probability that x.hashCode() = y.hashCode() should be small (close to $1/2^w$).
It is generally simple to find hash codes for the smaller primitive types such as byte, char, float and int. These data types always have a binary representation, and this usually consists of w or fewer bits. In Java, for example, byte is an 8-bit type and float is a 32-bit type. In these cases, we can simply treat the bits as the representation of an integer in the range $\{0,\ldots,2^w-1\}$. If two values are different, they get different hash codes. If they are the same, they get the same hash code.
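As a rough illustration of treating the bits of a small primitive as an integer, here is a minimal sketch assuming w = 32. The class name `PrimitiveHashCodes` and its `hash` overloads are hypothetical helpers, not part of the repository; `Float.floatToIntBits` is used because it agrees with `Float.hashCode`.

```java
// Sketch only: hash codes for primitives that fit in w = 32 bits.
// PrimitiveHashCodes and hash(...) are illustrative names.
final class PrimitiveHashCodes {
    // 8 bits, read as an unsigned value in 0..255
    static int hash(byte b) { return Byte.toUnsignedInt(b); }

    // 16 bits, already an unsigned value in 0..65535
    static int hash(char c) { return c; }

    // exactly w = 32 bits, used as-is
    static int hash(int i) { return i; }

    // 32-bit IEEE 754 pattern, reinterpreted as an int
    static int hash(float f) { return Float.floatToIntBits(f); }
}
```

For example, `PrimitiveHashCodes.hash(1.0f)` yields the bit pattern `0x3f800000`, and distinct byte values always map to distinct integers.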
Several primitive data types are made up of more than w bits, usually cw bits for some constant integer c. In Java, both long and double types are examples of this with c=2. These data types can be treated as compound objects made of c parts.
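A minimal sketch of viewing a 64-bit primitive as a compound object with c = 2 parts of w = 32 bits each. The class `WideParts` and the method `parts` are hypothetical names introduced for illustration only.

```java
// Sketch only: split 64-bit primitives into two 32-bit parts.
// WideParts and parts(...) are illustrative names.
final class WideParts {
    // High 32 bits first, then low 32 bits.
    static int[] parts(long v) {
        return new int[] { (int) (v >>> 32), (int) v };
    }

    // A double is split via its 64-bit IEEE 754 bit pattern.
    static int[] parts(double d) {
        return parts(Double.doubleToLongBits(d));
    }
}
```

The two parts can then be combined with any of the compound-object techniques that follow.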
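For a compound object, we want to create a hash code by combining the individual hash codes of its constituent parts. This is not quite as simple as it sounds. Although there are many hacks for this (such as combining the hash codes via bitwise xor operations), many of them leave large sets of objects sharing the same hash code. However, there are simple and robust methods available through arithmetic with 2w bits of precision. As an example, suppose we have an object made up of parts $P_0,\ldots,P_{r-1}$ whose hash codes are $x_0,\ldots,x_{r-1}$. We can then choose mutually independent random w-bit integers $z_0,\ldots,z_{r-1}$ and a random 2w-bit odd integer $z$, and compute a hash code for the object with

$$h(x_0,\ldots,x_{r-1}) = \left(\left(z\sum_{i=0}^{r-1} z_i x_i\right) \bmod 2^{2w}\right) \operatorname{div} 2^w.$$

The final step (multiplying by $z$ and dividing by $2^w$) is multiplicative hashing, which reduces the 2w-bit intermediate result to a w-bit final result. For an object with three parts x0, x1 and x2 and w = 32: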
    public int hashCode() {
        // random numbers from rand.org
        long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL};
        long zz = 0xbea0107e5067d19dL; // random odd 64-bit multiplier
        // convert (unsigned) hash codes to long by masking to the low 32 bits
        long h0 = x0.hashCode() & ((1L<<32)-1);
        long h1 = x1.hashCode() & ((1L<<32)-1);
        long h2 = x2.hashCode() & ((1L<<32)-1);
        // multiply in 64-bit arithmetic and keep the top 32 bits
        return (int)(((z[0]*h0 + z[1]*h1 + z[2]*h2)*zz)>>>32);
    }
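To see why xor-style combining is fragile, the following sketch compares it with the multiply-and-add scheme on inputs chosen to collide under xor. The class `XorVersusMultiply` and its method names are hypothetical; the constants are simply reused from the example above.

```java
// Sketch only: xor combining versus the multiply-add scheme (w = 32, r = 3).
// XorVersusMultiply, xorHash and mulHash are illustrative names.
final class XorVersusMultiply {
    // Naive combination: order-insensitive, and equal parts cancel out.
    static int xorHash(int a, int b, int c) {
        return a ^ b ^ c;
    }

    // Multiply-add combination followed by multiplicative hashing.
    static int mulHash(int a, int b, int c) {
        long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL};
        long zz = 0xbea0107e5067d19dL;
        long mask = (1L << 32) - 1;
        long h0 = a & mask, h1 = b & mask, h2 = c & mask;
        return (int) (((z[0] * h0 + z[1] * h1 + z[2] * h2) * zz) >>> 32);
    }
}
```

Under `xorHash`, every permutation of the same three parts collides, and any object whose first two parts are equal hashes to its third part alone. With `mulHash`, the parts (1, 2, 3) and (3, 2, 1) receive different hash codes.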
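Let $x_0,\ldots,x_{r-1}$ and $y_0,\ldots,y_{r-1}$ each be sequences of w-bit integers in $\{0,\ldots,2^w-1\}$ and assume $x_i \ne y_i$ for at least one index $i \in \{0,\ldots,r-1\}$. Then $\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \le 3/2^w$.

Firstly, ignore the final multiplicative hashing step and define:

$$h'(x_0,\ldots,x_{r-1}) = \left(\sum_{j=0}^{r-1} z_j x_j\right) \bmod 2^{2w}.$$

Suppose that $h'(x_0,\ldots,x_{r-1}) = h'(y_0,\ldots,y_{r-1})$. This can be rewritten as $z_i(x_i-y_i) \bmod 2^{2w} = t$, where $t = \left(\sum_{j \ne i} z_j(y_j-x_j)\right) \bmod 2^{2w}$ does not depend on $z_i$. Assuming without loss of generality that $x_i > y_i$, the product $z_i(x_i-y_i)$ is at most $(2^w-1)^2 < 2^{2w}$, so the equation becomes $z_i(x_i-y_i) = t$, which has at most one solution in $z_i$. Since $z_i$ and $t$ are independent, the probability that we select $z_i$ so that $h'(x_0,\ldots,x_{r-1}) = h'(y_0,\ldots,y_{r-1})$ is at most $1/2^w$.

The last step of the hash function is to apply multiplicative hashing to reduce the 2w-bit intermediate result $h'(x_0,\ldots,x_{r-1})$ to a w-bit final result $h(x_0,\ldots,x_{r-1})$. According to Theorem h-Codes.1, if $h'(x_0,\ldots,x_{r-1}) \ne h'(y_0,\ldots,y_{r-1})$, then $\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \le 2/2^w$.

In summation,

$$\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \le \frac{1}{2^w} + \frac{2}{2^w} = \frac{3}{2^w},$$

where the first term bounds the probability that the intermediate values $h'$ collide and the second bounds the probability that they differ but the final multiplicative hashing step maps them to the same value.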
The previous method works well for objects with a fixed, constant number of components. However, since it requires a random w-bit integer $z_i$ for each component, it breaks down when we want to use it with objects that have a variable number of components. It is possible to use a pseudorandom sequence to generate as many $z_i$'s as we need, but then the $z_i$'s are not mutually independent, and it becomes difficult to prove that the pseudorandom numbers don't interact badly with the hash function being used. In particular, the values of $t$ and $z_i$ in the proof of Theorem h-Codes.1 are no longer independent.
A more rigorous approach is to base our hash codes on polynomials over prime fields. These are just regular polynomials that are evaluated modulo some prime number, p. This method is based on the following theorem, which says that polynomials over prime fields behave much like usual polynomials:
Let p be a prime number, and let $f(z) = x_0z^0 + x_1z^1 + \cdots + x_{r-1}z^{r-1}$ be a non-trivial polynomial with coefficients $x_i \in \{0,\ldots,p-1\}$. Then the equation $f(z) \bmod p = 0$ has at most $r-1$ solutions for $z \in \{0,\ldots,p-1\}$.
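The theorem can be checked by brute force for small moduli. The sketch below counts the roots of a polynomial modulo m using Horner's rule; `PolynomialRoots` and `countRoots` are hypothetical names, and the approach is only practical for tiny m.

```java
// Sketch only: count roots of f(z) = coeffs[0] + coeffs[1]*z + ... modulo m.
// PolynomialRoots and countRoots are illustrative names.
final class PolynomialRoots {
    static int countRoots(long[] coeffs, long m) {
        int roots = 0;
        for (long z = 0; z < m; z++) {
            long value = 0;
            // Horner's rule, highest-degree coefficient first
            for (int i = coeffs.length - 1; i >= 0; i--) {
                value = (value * z + coeffs[i]) % m;
            }
            if (value == 0) {
                roots++;
            }
        }
        return roots;
    }
}
```

For the polynomial $z^2 - 1$ (here r = 3), `countRoots(new long[]{6, 0, 1}, 7)` finds the two roots 1 and 6 modulo the prime 7, within the bound r - 1 = 2. Modulo the composite 8, `countRoots(new long[]{7, 0, 1}, 8)` finds four roots (1, 3, 5 and 7), which shows why the modulus needs to be prime.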
To use this theorem, we hash a sequence of integers $x_0,\ldots,x_{r-1}$ with each $x_i \in \{0,\ldots,p-2\}$ using a random integer $z \in \{0,\ldots,p-1\}$ via the formula

$$h(x_0,\ldots,x_{r-1}) = \left(x_0z^0 + \cdots + x_{r-1}z^{r-1} + (p-1)z^r\right) \bmod p.$$
Note the extra $(p-1)z^r$ term at the end of the formula. It helps to think of $p-1$ as the last element, $x_r$, in the sequence $x_0,\ldots,x_r$. This element differs from every other element in the sequence (each of which is in the set $\{0,\ldots,p-2\}$), so we can think of $p-1$ as an end-of-sequence marker.

The next theorem, which considers the case of two sequences of the same length, shows that this hash function gives a good return for the small amount of randomisation needed to choose $z$:
Let $p > 2^w+1$ be a prime, let $x_0,\ldots,x_{r-1}$ and $y_0,\ldots,y_{r-1}$ each be sequences of w-bit integers in $\{0,\ldots,2^w-1\}$, and assume $x_i \ne y_i$ for at least one index $i \in \{0,\ldots,r-1\}$. Then

$$\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})\} \le (r-1)/p.$$
The equation $h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r-1})$ can be rewritten as

$$\left((x_0-y_0)z^0 + \cdots + (x_{r-1}-y_{r-1})z^{r-1}\right) \bmod p = 0.$$

Since $x_i \ne y_i$, this polynomial is non-trivial. Therefore, according to the preceding theorem on polynomials modulo a prime, it has at most $r-1$ solutions in $z$. The probability that we pick $z$ to be one of these solutions is at most $(r-1)/p$.
Note that this hash function also deals with the case of two sequences of different lengths, even when one of the sequences is a prefix of the other. This is because the function effectively hashes the infinite sequence $x_0,\ldots,x_{r-1},p-1,0,0,\ldots$.
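This guarantees that if we have two sequences of lengths $r$ and $r'$ with $r > r'$, then these two sequences differ at index $i = r$. In this case the equation $\left((x_0-y_0)z^0 + \cdots + (x_{r-1}-y_{r-1})z^{r-1}\right) \bmod p = 0$ becomes

$$\left(\sum_{i=0}^{r'-1}(x_i-y_i)z^i + (x_{r'}-p+1)z^{r'} + \sum_{i=r'+1}^{r-1} x_iz^i + (p-1)z^r\right) \bmod p = 0,$$

which is a non-trivial polynomial of degree $r$ and therefore has at most $r$ solutions in $z$. Combined with the equal-length case, this gives the following more general result: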
Let $p > 2^w+1$ be a prime, and let $x_0,\ldots,x_{r-1}$ and $y_0,\ldots,y_{r'-1}$ be distinct sequences of w-bit integers in $\{0,\ldots,2^w-1\}$. Then $\Pr\{h(x_0,\ldots,x_{r-1}) = h(y_0,\ldots,y_{r'-1})\} \le \max\{r,r'\}/p$.
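One way the polynomial hash might be coded for an array of objects is sketched below. This is a sketch under stated assumptions rather than the repository's implementation: the class name `SequenceHash`, the constants `Z` and `Z2` (placeholders for values that would be chosen at random), and the use of `Long.remainderUnsigned` to avoid signed overflow are all choices made for this illustration. The prime is $2^{32}-5$, and each element's hash code is first reduced to 31 bits by multiplicative hashing.

```java
// Sketch only: polynomial hashing of a sequence modulo the prime 2^32 - 5.
// SequenceHash, Z and Z2 are illustrative; Z and Z2 stand in for random values.
final class SequenceHash {
    static final long P = (1L << 32) - 5; // prime modulus 2^32 - 5
    static final long Z = 0x64b6055aL;    // assumed "random" z in {0,...,P-1}
    static final int Z2 = 0x5067d19d;     // assumed "random" odd 32-bit multiplier

    static int hash(Object[] x) {
        long s = 0;   // running sum modulo P
        long zi = 1;  // z^i modulo P
        for (Object o : x) {
            // multiplicative hashing with d = 31: keep the top 31 bits
            long xi = (o.hashCode() * Z2) >>> 1;
            // zi < 2^32 and xi < 2^31, so the sum stays below 2^63
            s = (s + zi * xi) % P;
            // zi * Z < 2^64, so treat the product as unsigned
            zi = Long.remainderUnsigned(zi * Z, P);
        }
        // end-of-sequence marker: (P - 1) * z^r is congruent to -z^r modulo P
        s = (s + P - zi) % P;
        return (int) s;
    }
}
```

Because of the end marker, a sequence and its extension by a zero-valued element, such as `{1, 2}` and `{1, 2, 0}`, hash differently for this choice of `Z`, even though the extra element contributes nothing to the sum.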
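Note that the coded implementation sacrifices some collision probability for the sake of convenience. In particular, it applies the multiplicative hash function with d = 31 to reduce x[i].hashCode() to a 31-bit value. This is so that the additions and multiplications that are done modulo the prime $p = 2^{32}-5$ can be carried out using unsigned 63-bit arithmetic. As a result, the probability that two different sequences, the longer of which has length $r$, receive the same hash code is at most $2/2^{31} + r/(2^{32}-5)$, as opposed to the $r/(2^{32}-5)$ specified in Theorem h-Codes.4.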