Hashing - adamtew/CS2420-Exam2 GitHub Wiki

Instructor Notes

Hashing	Concepts
Hash Function	Hash Function
Collisions	Linear Probing
	Clustering
	Quadratic Probing

Hash Function

This function is called a hash function because it “makes hash” of its inputs

A hash function is a function that:

When applied to an Object, returns a number
When applied to equal Objects, returns the same number for each
When applied to unequal Objects, is very unlikely to return the same number for each
This function has no other purpose

Hash function with sliding bits and exclusive or

unsigned int  hash(string key)
{unsigned int hashVal=0;
   for (i=0; i < key.length(); i++)
      hashVal =  (hashVal << 5) ^ key[i]  ^ hashVal;
   return hashVal %TABLESIZE;
}

Properties of a good hash function:

Minimize collisions
Fast to compute
Scatter data evenly through hash table. Data may have patterns to them.
Uses all bits of the key – generally helps in scattering non-random data
If mod is used, base should be prime

Collisions

When two values hash to the same array location, this is called a collision

Solution #1: Search from desired spot for an empty location (with wrap)
Solution #2: Try again with a second hash function ...and a third, and a fourth, and a fifth, ... to find a better location.
Solution #3: Create a linked list of values that hash to this location

Linear Probing

Use a vacant spot in table following HASH(key) – “open” addressing means that we find an unused address and use it. Evaluation

Longer search times
Clustering gets worse and worse (snowballs).
Primary clustering is when two keys that hash onto different values compete for same locations in successive hashes.
Better if successive probe locations are picked in “scrambled” order to lessen clustering.
One problem with the “linear probing” is the tendency to form “clusters”

Clustering

A cluster is a group of items not containing any open slots

The bigger a cluster gets, the more likely it is that new values will hash into the cluster, and make it ever bigger
Clusters cause efficiency to degrade
Primary Clustering

is the tendency for certain open-addressing hash tables collision resolution schemes to create long sequences of filled slots.

Secondary Clustering

when different keys that hash to same locations compete for successive hash locations.

Quadratic Probing

Quadratic probing eliminates primary clustering, but not secondary clustering.

Note: For both linear and quadratic probing, the sequences checked are key independent. Two different keys which hash to same location, keep competing.

Double Hashing

Have two functions, Step gives the increment for Hash.

Define: Step(key) = 1 + key % (TABLESIZE - 2) – gives personalized increment.
Notice, the location of Step is never used directly as the hash value. It is only an auxiliary function.
If TABLESIZE and TABLESIZE–2 are primes, it works better.
Hash(key) = key % TABLESIZE
If Hash(key) is full, successively add increment as specified by Step(key).
example: for key = 38 and TABLESIZE = 13
- Each of the probe sequences visits all the table locations if the size of the table and the size of the increment are relatively prime with respect to each other.

Bucket

Another way to handle collisions is to have each entry in the hash table hold more than one (B) item.

Each entry is called a bucket.
This doesn’t really solve the problem. As soon as the bucket fills up, you still have to deal with collisions.
I think it would be better to just have the original table B times as large.

Separate Chaining

Each location in the hash table is a pointer to a linked list of things that hashed to that location.

Advantages:

Saves space if records are long
Collisions are no problem
Overflow is solved as space is dynamic
Deletion is easy
Disadvantages:
Links require space
Following linked list is more time consuming

Works well, but we have the overhead of pointers.

Rehashing

Rehash – create a larger (or smaller) array. Rehash all old elements into new array.

Efficiency

Hash tables are actually surprisingly efficient

Until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3
Sophisticated mathematical analysis is required to prove that the expected cost of inserting into a hash table, or looking something up in the hash table, is O(1)
Even if the table is nearly full (leading to occasional long searches), efficiency is usually still quite high
load factor = number-of-elements / TABLESIZE
Should not exceed 2/3.
With chaining, the concern is the average size of a chain.

Deletions

solution: Use lazy deletions

Cuckoo Hashing

Cuckoo hashing worst-case constant lookup time.

The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

There are two hash functions, one for each table.
If a key can’t go in either location, it tries to move the existing key to its alternative spot in the other table.
When a new key is inserted, the new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location.
This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place using new hash functions: