CISC 3130

Unit 6 Reading Guide

It's probably useful to start understanding the chapter by checking out what "hash" means. For example, on Wikipedia, a "hash function" is described as any function "that can be used to map data of arbitrary size to data of fixed size." And indeed, at the beginning of 5.1, the book says, "The hash value of a data item x, denoted hash(x) is a value in the range 0...t.length-1." That is, no matter what kind of thing x is, hash() will always produce a value between 0 and t.length-1. Data of arbitrary size (x) is being mapped to data of fixed size (the index of the array).

The first approach to hash tables the book discusses is "hashing with chaining." First question: what's the "chaining"? Hint: what is the type of the array elements?

In the initial discussion of hashing with chaining, the book talks about the function hash() but doesn't actually define the function. What makes for a "good" hash function versus a "bad" hash function? What if hash() is defined as the simple function hash(x) = 1? Is that good or bad?

Note that the code in 5.1.1 is just a bit different from the formula in the text—the text talks about z · x mod 2w, but the code talks about hashCode(x) instead of just x. Don't worry about that for now--imagine the code just says x: section 5.3 will clear this up.

You should understand what Lemmas 5.1 and 5.2 say, as well as Theorem 5.1, but don't worry about the proofs. What do Lemmas 5.1 and 5.2 say? Restate them in your own words, maybe with a little less mathematics. Going back to Chapter 1, what data structure does a hash table implement?

Moving on to 5.2. What is the essential difference in approach between hashing with chaining and linear probing? (What does "probing" mean in this context?)

This is tricky but important: why does the book distinguish between null values, where no value has ever been stored, and del values, where values used to be stored? Why can't we just do something that might be a little more "natural," like just letting null mean "there's no value stored here right now"? (Hint: where is del used in the code?)

The book says, "we need t to be considerably larger than q, so that there are lots of null values in t." OK. Note that this section doesn't have any diagrams. Use the code and the text to diagram what t might look like if, say, q is 16. You might consider what an arbitrary sequence of add() and remove() operations would do to the hash table.

The efficiency analysis (again, don't worry about the proofs) about hash table operations in 5.2.1 and 5.2.2 yields some pretty amazing results. What is the "price" we pay for these good results?

Don't worry much about tabulation hashing for now.

Moving on to 5.3. In the previous two sections, what types of values were we putting into hash tables? in this section, we look how to put any kind of value into hash tables. Think carefully about the two properties of "hash code mappings." Why is each of them important?

As the notation x.hashCode() = y.hashCode() suggests, it's natural to use object-oriented thinking to associate hash codes with arbitrary objects. Indeed, every Java object has a hashCode() method. According to the documentation at this link, does the Java hashCode() method follow the two properties the book requires?

Sections 5.3.2 and 5.3.3 use a fair amount of mathematics. Don't worry about the math. But why do we need to that much fiddly stuff? What's the problem we're trying to solve? Why wouldn't a simpler solution be just as good?

Finally, let's talk about how hash tables are used in the libraries of programming languages. In Java, one very commonly used class is HashMap; the C++ STL offers a similar template class, unordered_map. In both cases, the idea is to make an association between a key value and a mapped value (sometimes just called key and value). The mapped values are stored in a hash table, at a position determined by the hash of the key value. You can think of this "map" idea as being similar to a dictionary: the key values are the words in a dictionary, and the mapped values are the definitions. If I know a word, I can very quickly find its definition. (Some words have multiple definitions—how does that relate to the hashmap?) Of course, Reading the documentation for the C++ and Java implementations, can you tell whether these hash tables are using chaining or linear probing?

Take a look at these exercises: 5.1   5.5   5.6   5.7   5.8   5.9 (this requires a little bit of careful thought).