Hashtables

Terminology

Terms related to hashtables you should be familiar with:

The Basic Idea

The basic idea is simple. To store data under a given key (usually an integer or string),

hash is a function that takes any key and returns a valid index into A.

Any function will work, but to get the desired features of hashtables, hash must

If both of these properties are true, and the table is not very full (at least 20% empty) then

This obviously is better than the best binary search tree algorithms we had (AVL and red-black), which were O(log N) time operations.

There is no free lunch, of course. The price we pay is:

Hashtables aren't really constant time. As the textbook clearly points out, access time is a function of density, i.e., how full the table is. This means that we can get as close to a constant time retrieval as we want, if we're willing to make the table big enough.

Making It Work

There are two key problems to solve in making hashtables work well:

Minimizing Collisions

Getting cheap hashing functions is easy. Getting low-collision functions is trickier.

There are two basic approaches:

In the first case, we move characters in the key around and add parts together, not unlike shuffling a deck of cards to get rid of patterns caused by the previous round of play.

In the second case, we use the calculation formula that converts an internal seed into a random number in a pseudo-random number generator, using the key (converted to an integer in some simple way if need be) as the seed.

In either case, the resulting number has to be turned into a valid index with

value % size(A)

and it's best to have size(A) be a prime number.

Handling Collisions

It's impossible to avoid collisions unless we have a perfect hashing function, and we can only get one of those if we know exactly what keys we're going to get in advance and have a table big enough to hold them all.

Therefore, we need collision resolution strategies, which answer the question "when hash(k1) = hash(k2) where should the record with k2 go?" k1 and k2 are called synonyms.

Hash to Lists

One simple approach is to keep lists in A:

As long as there are few collisions, there will usually be only 1 or 2 items in any list, and so retrieval will still be effectively constant time.

This approach is simple and often used. It is wasteful of space, however, compared to the following approaches, and there is no bound on how big the hashtable may get.

Linear Probing

Linear probing is a straw man approach. You never want to use it, but it's a good starting point for understanding the other approaches.

The idea is simple:

The problem is that collisions pile up and form clusters or chains of adjacent synonyms. Furthermore, these clusters can merge with other clusters and make even larger clusters.

Quadratic Rehashing

We can avoid the merging of clusters if we "hopscotch" around more looking for places to put collisions.

Quadratic rehashing modifies linear probing as follows:

This will try i, i + 1, i + 4, i + 9, and so on.

Problems with this approach are:

Double Hashing

To stop synonyms from chasing each other, you can use a secondary hash function to calculate the skip amount. First you hash the key k as usual. If you get a collision, you modify linear probing thus:

Assuming you've picked an independent secondary hash function, the odds of two keys have the same values for both h and h' are small enough to make clustering a non-issue. If the table size is a prime, we also know that eventually we have to find an empty spot, if any exists.

Applications of Hashtables

Keyed Collections/Associative Tables

A number of programming languages, particularly scripting languages like Hypertalk, JavaScript, awk, and Perl, provide what are called associative tables or keyed collections.

This is a very simple and useful abstract data type. A keyed collection C lets us store and retrieve data using keys -- especially strings -- for indices, the way arrays let us use integers for indices:

For example, to count all the words in a file, assuming we have a function that reads the next word as a string, we'd write something like this:

while not eof(stream) do
   counts[readWord(stream)]++

Hard to beat that for simplicity.

A keyed collection would also be ideal for storing the course and section data for our course scheduler for easy retrieval by course name.

Compilers make heavy use of keyed collections to store data about variable and function names.

Rabin-Karp String Search

An interesting use of hashing that doesn't involve storing data is the Rabin-Karp probabilistic string search algorithm.

This is described in fair detail in Section 11.3, but the key idea is this. Normal string search in the worst case is O(N * M) where N is the length of the string and M is the length of the text being searched. This is because you have to compare up to M characters, then, if the comparison fails, you shift right one character in the text, and start again.

Rabin-Karp's algorithm is O(N + M) -- big difference. The algorithm is (roughly)

Let k = h(search-string)
Let m = length(search-string)
Let i = 0
while ( h(substring(text, i, m) != k ) i++

That is, we keep hashing M-long segments of the text to see if they are a synonym of the search string. Notice that we never store anything in a table, we're just using the hashing idea. If we find a synonym, we check to see if it really is a match.

For this to work, we need a hash function that is very cheap and has a low collision rate. The hashing function used in Rabin-Karp takes advantage of the fact that each substring is only one character different than the previous one.

Spell Checkers

Ever wonder how spell checkers can read thousands of words and quickly tell if they're in a dictionary with tens of thousands of words? There are lots of pieces to the trick but one element is the hashtable.

The idea, similar to the Rabin-Karp algorithm, is this:

Obviously there's more to this, including handing word morphology and efficiently compacting N hashtables into something much smaller. The clever compaction comes from noting that the table contains only bits and that we only really care about something that has all N bits true.

But wait! If these spell checkers don't have a dictionary of words available, how do they find words to suggest as corrections? Simple. They apply a set of transformation rules to the misspelled word (transpose every adjacent pair of letters, for every letter c from A - Z, insert c once in each position in the word, drop one letter, etc.). Then they return those transformations that hash successfully into the tables.