Chained Hash Table - WilfullMurder/DataStructures-Java GitHub Wiki

Chained Hash Table

A chained hash table data structure uses hashing with chaining to store data as an array, t, of lists.

An integer, n, keeps track of the total number of items in all lists (see fig.1)

An example of a chained hash table with n = 14 and t.length = 16

The hash value of a data item x, denoted as hash(x) is a value in the range {0,...,t.length-1}.

All items with hash value i are stored in the list at t[i].

To ensure that lists don't get too long, we maintain the invariant n≀t.length so that the average number of elements stored in one of these lists is n/t.length≀1.

Multiplicative Hashing

Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic and integer division. It uses the div operator, which calculates the integral part of the quotient, while discarding the remainder.

Formally for any integers aβ‰₯0 and bβ‰₯1, adivb = ⌊a/bβŒ‹

In multiplicative hashing, we use a hash table of size 2d for some integer d (called the dimension). The formula for hashing an integer x∈{0,...,2w-1} is:

hash(x) - ((z*x)mod2w)div2w-d

Here, z is a randomly chosen odd integer in {1,...,2w-1}. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo 2w where w is the number of bits in an integer (see fig. 2). Furthermore, integer division by 2w-d is equivalent to dropping the rightmost w-d bits in a binary representation.

The operation of the multiplicative hash function with w = 32 and d = 8

The following lemma (CHT.1) shows that multiplicative hashing does a good job of avoiding collisions:

Lemma CHT.1:

Let x,y be any two values in {0,...,2w-1} with xβ‰ y. Then Pr{hash(x)=hash(y)}≀2/2d

With lemma CHT.1, the performance of remove(x), and find(x) are easy to analyse:

Lemma CHT.2:

For any data value x, the expected length of the list t[hash(x)] is at most nx+2, where nx is the number of occurences of x in the hash table.

Proofs

Lemma CHT.2:

Let 𝑆 be the (multi-)set of elements stored in the hash table that are not equal to x. For an element yβˆˆπ‘†, define the indicator variable

indicator variable

and notice that, by Lemma CHT.1 E[Iy]≀2/2d = 2/t.length. The expected length of the list t[hash(x)] is given by

proof formula

as required.

In order to prove Lemma CHT.1 we need a result from number theory. In the following proof we use the notation (br,...,b0)2 to denote Sum of bsubi formula, where each bi is a bit, either 0 or 1. In other words, (br,...,b0)2 is the integer whose binary representation is given by br,...,b0. We use ⋆ to denote a bit of unknown value.

Lemma CHT.3:

Let 𝑆 be the set of odd integers in {1,...,2w-1}; let q and i be any two elements in 𝑆. Then there is exactly on value zβˆˆπ‘† such that zq mod 2w = i.

Suppose, for the sake of contradiction, that there are two such values z and z', with z > z'. Then

zq mod 2w = z'q mod 2w = i

So

(z-z')q mod 2w = 0

But this means that

(z-z')q = k2w (CH.1.1)

for some integer k. Thinking in terms of binary numbers, we have

binary representation

so that the w trailing bits in the binary representation of (z-z')q are all 0's. Futhermore, k≠0, since q≠0 and z-z'≠0. Since q is odd, it has no trailing 0's in its binary representation:

q={⋆,...,⋆,1}2.

Since |z-z'| < 2w, z-z' has fewer than w trailing 0's inits binary representation:

trailing 0's

Therefore, the product (z-z')q has fewer than w trailing 0's in its binary representation:

trailing 0's

Therefore (z-z')q cannot satisfy (z-z')q = k2w, yielding a contradiction and completing the proof.

The utility of Lemma CHT.3 comes from the following observation: If z is chosen uniformly at random from 𝑆, then zt is uniformly distributed over 𝑆. In the following proof, it helps to think of the binary representation of z, which consists of w-1 random bits followed by a 1.

Lemma CHT.1:

First we note that the condition hash(x)=hash(y) is equivalent to the statement:

"the highest-order d bits of zx mod 2w and the highest-order bits of zy mod 2w are the same."

A necessary condition of that statement is that the highest-order d bits in the binary representation of z(x-y) mod 2w are either all 0's or all 1's. That is,

(CHT.1.2)

when zx mod 2w > zy mod 2w or

(CHT.1.3)

when zx mod 2w < zy mod 2w. Therefore, we only have to bound the probability that z(x-y) mod 2w looks like CHT.1.1 or CHT.1.2.

Let q be the unique odd integer such that (x-y) mod 2w = zq2r mod 2w has w-r-1 random bits, followed by a 1, followed by r 0's:

We can now finish the proof: If r>w-d, then the d higher-order bits of z(x-y) mod 2w contain both 0's and 1's, so the probability that z(x-y) mod 2w looks like CHT.1.2 or CHT.1.3 is 0. If r = w-d, then the probability of looking like CHT.1.2 is 0, but the probability of looking like CHT.1.3 is 1/2d-1 = 2/2d (since we must have b1,...,bd-1 = 1,...,1). If r < w-d, then we must have bw-r-1,...,bw-r-d = 0,...,0 or bw-r-1,...,bw-r-d = 1,...,1. The probability of each of these cases is 1/2d and they are mutually exclusive, so the probability of either cases is 2/2d. Completing the proof

Summary

The following theorem summarizes the performance of a ChainedHashTable data structure:

Theorem CHTT.1:

A ChainedHashTable implements the USet interface. Ignoring the cost of calls to grow(), a ChainedHashTable supports the operations add(x), remove(x), and find(x) in O(1) expected time per operation.

Furthermore, beginning with an empty ChainedHashTable, any sequence of m add(x) and remove(x) operations results in a total of O(m) time spent during all calls to grow().

⚠️ **GitHub.com Fallback** ⚠️