
Linear Hash Tables

The Chained Hash Table data structure uses an array of lists, where the ith list stores all elements x such that hash(x) = i. An alternative, called open addressing, is to store the elements directly in an array, t, with each array location in t storing at most one value. This approach is taken by the Linear Hash Table described in this section. In some places this data structure is described as open addressing with linear probing.

The main idea behind a Linear Hash Table is that we would, ideally, like to store the element x with hash value i=hash(x) in the table location t[i]. If we cannot do this (because some element is already stored there) then we try to store it at location t[(i+1) mod t.length]; if that's not possible, then we try t[(i+2) mod t.length], and so on, until we find a place for x.

There are three types of entries stored in t:

  1. data values: actual values in the USet that we are representing.
  2. null values: at array locations where no data has ever been stored.
  3. del values: at array locations where data was once stored but has since been deleted.
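
To make the probing rule concrete, here is a minimal sketch of find(x). The names are assumptions drawn from the description above, not necessarily the repository's exact code: a table field t, a unique sentinel object del marking deleted slots, and a hash(x) method returning a value in {0,...,t.length-1}.

```java
T find(T x) {
    int i = hash(x);
    while (t[i] != null) {                    // stop at the first null entry
        if (t[i] != del && x.equals(t[i]))
            return t[i];                      // found x
        i = (i == t.length - 1) ? 0 : i + 1;  // i = (i + 1) mod t.length
    }
    return null;                              // a null entry means x is absent
}
```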

In addition to the counter, n, that keeps track of the number of elements in the Linear Hash Table, a counter, q, keeps track of the number of elements of types 1 and 3. That is, q is equal to n plus the number of del entries in t. To make this work efficiently, we need t to be considerably larger than q, so that there are lots of null values in t. The operations on a Linear Hash Table therefore maintain the invariant that t.length ≥ 2q.

To summarize, a Linear Hash Table contains an array, t, that stores data elements, and integers n and q that keep track of the number of data elements and non-null values of t, respectively. Because many hash functions only work for table sizes that are a power of 2, we also keep an integer d and maintain the invariant that t.length = 2^d.
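
Putting the pieces together, the fields and the add(x) operation might look like the following sketch. The field names and the exact occupancy check are assumptions that mirror the invariant t.length ≥ 2q; resize() is assumed to rebuild the table at a suitable power-of-2 size.

```java
// Sketch of the fields implied above (names assumed, not necessarily the repo's):
T[] t;     // the table; t.length == 1 << d
int n;     // number of data values stored
int q;     // number of non-null entries (data values + del markers)
int d;     // dimension: t.length == 2^d

boolean add(T x) {
    if (find(x) != null) return false;     // USet: no duplicates
    if (2 * (q + 1) > t.length) resize();  // preserve t.length >= 2q
    int i = hash(x);
    while (t[i] != null && t[i] != del)    // probe for a null or del slot
        i = (i == t.length - 1) ? 0 : i + 1;
    if (t[i] == null) q++;                 // reusing a del slot leaves q unchanged
    t[i] = x;
    n++;
    return true;
}
```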

Analysis of Linear Probing

Notice that each operation, add(x), remove(x), or find(x), finishes as soon as (or before) it discovers the first null entry in t. The intuition behind the analysis of linear probing is that, since at least half of the elements in t are equal to null, an operation should not take long to complete because it will very quickly come across a null entry. We should not rely too heavily on this intuition, though, because it would lead us to the (incorrect) conclusion that the expected number of locations in t examined by an operation is at most 2.

To make it possible to analyse linear probing we can make the not unrealistic assumption that all hash values are independently and uniformly distributed in {0,...,t.length-1}. Later we will describe a method, called tabulation hashing, that produces a hash function that is "good enough" for linear probing. We will also assume that all indices into the positions of t are taken modulo t.length, so that t[i] is really shorthand for t[i mod t.length].

We say that a run of length k starts at i when all the table entries t[i], t[i+1],...,t[i+k-1] are non-null and t[i-1] = t[i+k] = null. The number of non-null elements of t is exactly q, and the add(x) method ensures that, at all times, q ≤ t.length/2. There are q elements x_1,...,x_q that have been inserted into t since the last resize() operation. By our assumption, each of these has a hash value, hash(x_j), that is uniform and independent of the rest. With this setup, we can prove the main lemma required to analyse linear probing.

Lemma LHT.1:

Fix a value i ∈ {0,...,t.length-1}. Then the probability that a run of length k starts at i is O(c^k) for some constant 0 < c < 1.

Proof

If a run of length k starts at i, then there are exactly k elements x_j such that hash(x_j) ∈ {i,...,i+k-1}. The probability that this occurs is exactly

$$p_k = \binom{q}{k}\left(\frac{k}{\texttt{t.length}}\right)^{k}\left(\frac{\texttt{t.length}-k}{\texttt{t.length}}\right)^{q-k}$$

since, for each choice of k elements, these k elements must hash to one of the k locations and the remaining q-k elements must hash to the other t.length-k table locations. Note that p_k overestimates the probability that a run of length k starts at i, since the definition of p_k does not include the requirement t[i-1] = t[i+k] = null.

In the following derivation we will cheat a little and replace r! with (r/e)^r. Stirling's Approximation, r! = √(2πr) · (r/e)^r · (1 + O(1/r)), shows that this is only a factor of O(√r) from the truth. This is done to make the derivation simpler. The value of p_k is maximised when t.length is minimum, and the data structure maintains the invariant that t.length ≥ 2q, so

$$
\begin{aligned}
p_k &\le \binom{q}{k}\left(\frac{k}{2q}\right)^{k}\left(\frac{2q-k}{2q}\right)^{q-k} \\
&= \frac{q!}{(q-k)!\,k!}\left(\frac{k}{2q}\right)^{k}\left(\frac{2q-k}{2q}\right)^{q-k} \\
&\approx \frac{q^{q}}{(q-k)^{q-k}k^{k}}\left(\frac{k}{2q}\right)^{k}\left(\frac{2q-k}{2q}\right)^{q-k} && \text{(Stirling)} \\
&= \frac{q^{q}k^{k}(2q-k)^{q-k}}{(q-k)^{q-k}k^{k}(2q)^{q}} \\
&= \frac{q^{q}(2q-k)^{q-k}}{(q-k)^{q-k}(2q)^{q}} \\
&= \left(\frac{2q-k}{2(q-k)}\right)^{q-k}\left(\frac{1}{2}\right)^{k} \\
&= \left(1+\frac{k}{2(q-k)}\right)^{q-k}\left(\frac{1}{2}\right)^{k} \\
&\le \left(\frac{\sqrt{e}}{2}\right)^{k}
\end{aligned}
$$

(In the last step, we use the inequality (1+1/x)^x ≤ e, which holds for all x > 0.) Since √e/2 < 0.824360636 < 1, this completes the proof with c = √e/2.
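
To spell out that final inequality: for 0 < k < q, substituting x = 2(q-k)/k into (1+1/x)^x ≤ e gives

$$
\left(1+\frac{k}{2(q-k)}\right)^{q-k}
=\left(\left(1+\frac{1}{x}\right)^{x}\right)^{k/2}\le e^{k/2},
\qquad x=\frac{2(q-k)}{k},
$$

and hence $p_k \le e^{k/2}(1/2)^k = (\sqrt{e}/2)^k$.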

Using Lemma LHT.1 to prove upper-bounds on the expected running time of find(x), add(x), and remove(x) is now fairly straightforward. Consider the simplest case, where we execute find(x) for some value x that has never been stored in the Linear Hash Table. In this case, i=hash(x) is a random value in {0,...,t.length-1} independent of the contents of t. If i is part of a run of length k, then the time it takes to execute the find(x) operation is at most O(1+k). Thus, the expected running time can be upper-bounded by

$$O\left(1+\frac{1}{\texttt{t.length}}\sum_{i=0}^{\texttt{t.length}-1}\sum_{k=0}^{\infty} k\,\Pr\{\text{$i$ is part of a run of length $k$}\}\right)$$

Note that each run of length k contributes to the inner sum k times, for a total contribution of k^2, so the above sum can be rewritten as

$$
\begin{aligned}
& O\left(1+\frac{1}{\texttt{t.length}}\sum_{i=0}^{\texttt{t.length}-1}\sum_{k=0}^{\infty} k^2\,\Pr\{\text{a run of length $k$ starts at $i$}\}\right) \\
\le\;& O\left(1+\frac{1}{\texttt{t.length}}\sum_{i=0}^{\texttt{t.length}-1}\sum_{k=0}^{\infty} k^2 p_k\right) \\
=\;& O\left(1+\sum_{k=0}^{\infty} k^2 p_k\right) \\
=\;& O\left(1+\sum_{k=0}^{\infty} k^2\cdot O(c^k)\right) \\
=\;& O(1).
\end{aligned}
$$

The last step in this derivation comes from the fact that $\sum_{k=0}^{\infty} k^2 c^k$ is an exponentially decreasing series. Therefore, we conclude that the expected running time of the find(x) operation for a value x that is not contained in the Linear Hash Table is O(1).
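
For completeness, this series has a simple closed form for any constant 0 < c < 1, which is all the last step needs:

$$\sum_{k=0}^{\infty} k^2 c^k = \frac{c(1+c)}{(1-c)^3} = O(1).$$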

If we ignore the cost of the resize() operation, then the above analysis gives us all we need to analyze the cost of operations on a Linear Hash Table.

First of all, the analysis of find(x) given above applies to the add(x) operation when x is not contained in the table. To analyse the find(x) operation when x is contained in the table, we need only note that this is the same as the cost of the add(x) operation that previously added x to the table. Finally, the cost of a remove(x) operation is the same as the cost of a find(x) operation.
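
For reference, here is a sketch of remove(x) under the same assumptions as the earlier sketches: it probes exactly like find(x) and, on success, leaves a del marker so that later searches continue past the vacated slot. The shrinking threshold 8n < t.length shown here is an assumed choice, not necessarily the repository's.

```java
T remove(T x) {
    int i = hash(x);
    while (t[i] != null) {                    // same probe sequence as find(x)
        T y = t[i];
        if (y != del && x.equals(y)) {
            t[i] = del;                       // mark deleted; don't break the run
            n--;
            if (8 * n < t.length) resize();   // assumed shrink rule
            return y;
        }
        i = (i == t.length - 1) ? 0 : i + 1;
    }
    return null;                              // x was not in the table
}
```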

In summary, if we ignore the cost of calls to resize(), all operations on a Linear Hash Table run in O(1) expected time. Accounting for the cost of resize can be done using the same type of amortized analysis performed for the ArrayStack data structure.
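
For concreteness, here is a sketch of resize() consistent with the invariants above: it discards the del markers (so q becomes n) and reinserts the data values into a fresh table whose size is a power of 2. The specific growth rule (the smallest d with 2^d ≥ 3n, as used, e.g., in Open Data Structures) is an assumption here.

```java
@SuppressWarnings("unchecked")
void resize() {
    d = 1;
    while ((1 << d) < 3 * n) d++;             // smallest power of 2 >= 3n (assumed rule)
    T[] told = t;
    t = (T[]) new Object[1 << d];             // fresh table, all null
    q = n;                                    // del markers are discarded
    for (T y : told) {
        if (y != null && y != del) {          // reinsert only real data values
            int i = hash(y);
            while (t[i] != null)
                i = (i == t.length - 1) ? 0 : i + 1;
            t[i] = y;
        }
    }
}
```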

Summary

The following theorem summarizes the performance of the Linear Hash Table data structure:

Theorem LHT.1:

A Linear Hash Table implements the USet interface. Ignoring the cost of calls to resize(), a Linear Hash Table supports the operations add(x), remove(x), and find(x) in O(1) expected time per operation.

Furthermore, beginning with an empty Linear Hash Table, any sequence of m add(x) and remove(x) operations results in a total of O(m) time spent during all calls to resize().

Tabulation Hashing

While analysing the Linear Hash Table structure, we made a very strong assumption: that for any set of elements, {x_1,...,x_n}, the hash values hash(x_1),...,hash(x_n) are independently and uniformly distributed over the set {0,...,t.length-1}. One way to achieve this is to store a giant array, tab, of length 2^w, where each entry is a random w-bit integer, independent of all the other entries. In this way, we could implement hash(x) by extracting a d-bit integer from tab[x.hashCode()]:

```java
int idealHash(T x) {
    // Conceptual only: an array of length 2^w is far too large to allocate.
    return tab[x.hashCode()] >>> (w - d);  // keep the top d bits of a random w-bit entry
}
```

Unfortunately, storing an array of size 2^w is prohibitive in terms of memory usage. The approach used by tabulation hashing is, instead, to treat w-bit integers as being made up of w/r integers, each having only r bits. In this way, tabulation hashing only needs w/r arrays, each of length 2^r. All the entries in these arrays are independent random w-bit integers. To obtain the value hash(x) we split x.hashCode() up into w/r r-bit integers and use these as indices into these arrays. We then combine all these values with the bitwise exclusive-or operator to obtain hash(x). For example, with w = 32 and r = 8, tab is a two-dimensional array with w/r = 4 rows, each containing 2^r = 256 random w-bit integers.
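
As a concrete sketch for w = 32 and r = 8 (so w/r = 4 byte-sized blocks), hash(x) might look like the following. The tab array of four rows of 256 random integers, and the fields w and d, are assumptions matching the description above, not necessarily the repository's exact code.

```java
// Sketch: w = 32, r = 8. tab is an int[4][256] filled once with random
// 32-bit integers; d is the current table dimension (t.length == 2^d).
int hash(T x) {
    int h = x.hashCode();
    return (tab[0][h & 0xff]             // low byte indexes row 0
          ^ tab[1][(h >>> 8) & 0xff]     // next byte indexes row 1
          ^ tab[2][(h >>> 16) & 0xff]
          ^ tab[3][(h >>> 24) & 0xff])
          >>> (w - d);                   // keep the top d bits of the XOR
}
```

Each byte of x.hashCode() selects one random entry from its own row; XOR-ing the four entries gives a random-looking w-bit value whose top d bits are used as the hash.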

We can easily verify that, for any x, hash(x) is uniformly distributed over {0,...,2^d-1}. With a little work, we can even verify that any pair of values has independent hash values. This implies that tabulation hashing could be used in place of multiplicative hashing for the Chained Hash Table implementation.

However, it is not true that any set of n distinct values gives a set of n independent hash values. Nevertheless, when tabulation hashing is used, the bound of Theorem LHT.1 still holds.
