Chained Hash Table - WilfullMurder/DataStructures-Java GitHub Wiki
A chained hash table data structure uses hashing with chaining to store data as an array, t, of lists.
An integer, n, keeps track of the total number of items in all lists (see fig.1)
The hash value of a data item x, denoted as hash(x) is a value in the range {0,...,t.length-1}.
All items with hash value i are stored in the list at t[i].
To ensure that lists don't get too long, we maintain the invariant nβ€t.length so that the average number of elements stored in one of these lists is n/t.lengthβ€1.
Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic and integer division. It uses the div operator, which calculates the integral part of the quotient, while discarding the remainder.
Formally for any integers aβ₯0 and bβ₯1, adivb = βa/bβ
In multiplicative hashing, we use a hash table of size 2d for some integer d (called the dimension). The formula for hashing an integer xβ{0,...,2w-1} is:
hash(x) - ((z*x)mod2w)div2w-d
Here, z is a randomly chosen odd integer in {1,...,2w-1}. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo 2w where w is the number of bits in an integer (see fig. 2). Furthermore, integer division by 2w-d is equivalent to dropping the rightmost w-d bits in a binary representation.
The following lemma (CHT.1) shows that multiplicative hashing does a good job of avoiding collisions:
Let x,y be any two values in {0,...,2w-1} with xβ y. Then Pr{hash(x)=hash(y)}β€2/2d
With lemma CHT.1, the performance of remove(x), and find(x) are easy to analyse:
For any data value x, the expected length of the list t[hash(x)] is at most nx+2, where nx is the number of occurences of x in the hash table.
Let π be the (multi-)set of elements stored in the hash table that are not equal to x. For an element yβπ, define the indicator variable
and notice that, by Lemma CHT.1 E[Iy]β€2/2d = 2/t.length. The expected length of the list t[hash(x)] is given by
as required.
In order to prove Lemma CHT.1 we need a result from number theory. In the following proof we use the notation (br,...,b0)2 to denote , where each bi is a bit, either 0 or 1. In other words, (br,...,b0)2 is the integer whose binary representation is given by br,...,b0. We use β to denote a bit of unknown value.
Let π be the set of odd integers in {1,...,2w-1}; let q and i be any two elements in π. Then there is exactly on value zβπ such that zq mod 2w = i.
Suppose, for the sake of contradiction, that there are two such values z and z', with z > z'. Then
zq mod 2w = z'q mod 2w = i
So
(z-z')q mod 2w = 0
But this means that
(z-z')q = k2w (CH.1.1)
for some integer k. Thinking in terms of binary numbers, we have
so that the w trailing bits in the binary representation of (z-z')q are all 0's. Futhermore, kβ 0, since qβ 0 and z-z'β 0. Since q is odd, it has no trailing 0's in its binary representation:
q={β,...,β,1}2.
Since |z-z'| < 2w, z-z' has fewer than w trailing 0's inits binary representation:
Therefore, the product (z-z')q has fewer than w trailing 0's in its binary representation:
Therefore (z-z')q cannot satisfy (z-z')q = k2w, yielding a contradiction and completing the proof.
The utility of Lemma CHT.3 comes from the following observation: If z is chosen uniformly at random from π, then zt is uniformly distributed over π. In the following proof, it helps to think of the binary representation of z, which consists of w-1 random bits followed by a 1.
First we note that the condition hash(x)=hash(y) is equivalent to the statement:
"the highest-order d bits of zx mod 2w and the highest-order bits of zy mod 2w are the same."
A necessary condition of that statement is that the highest-order d bits in the binary representation of z(x-y) mod 2w are either all 0's or all 1's. That is,
(CHT.1.2)
when zx mod 2w > zy mod 2w or
(CHT.1.3)
when zx mod 2w < zy mod 2w. Therefore, we only have to bound the probability that z(x-y) mod 2w looks like CHT.1.1 or CHT.1.2.
Let q be the unique odd integer such that (x-y) mod 2w = zq2r mod 2w has w-r-1 random bits, followed by a 1, followed by r 0's:
We can now finish the proof: If r>w-d, then the d higher-order bits of z(x-y) mod 2w contain both 0's and 1's, so the probability that z(x-y) mod 2w looks like CHT.1.2 or CHT.1.3 is 0. If r = w-d, then the probability of looking like CHT.1.2 is 0, but the probability of looking like CHT.1.3 is 1/2d-1 = 2/2d (since we must have b1,...,bd-1 = 1,...,1). If r < w-d, then we must have bw-r-1,...,bw-r-d = 0,...,0 or bw-r-1,...,bw-r-d = 1,...,1. The probability of each of these cases is 1/2d and they are mutually exclusive, so the probability of either cases is 2/2d. Completing the proof
The following theorem summarizes the performance of a ChainedHashTable data structure:
A ChainedHashTable implements the USet interface. Ignoring the cost of calls to grow(), a ChainedHashTable supports the operations add(x), remove(x), and find(x) in O(1) expected time per operation.
Furthermore, beginning with an empty ChainedHashTable, any sequence of m add(x) and remove(x) operations results in a total of O(m) time spent during all calls to grow().