When you use an Array, you can only look up data by index 0, 1, 2.
But what if you want to look up data by string, like this?
You might think that it's very simple to do this, and that you just have to store an array of strings, or something like "Alice=11"
, "Bob:22"
, "Charlie:33"
. However, if you do that, finding the value of "Bob" or "Charlie" potentially requires looking at every string in the array, which is very slow. We want lookups to be instant.
To get around that, you might think to store a mapping like this:
But where do you store the mapping "Alice"→0, "Bob"→1, "Charlie"→2? You can't store the mapping in an array, or you get the same problem!
It's possible to jump to the right element in just one step, and the mapping idea is correct - you just can't store the mapping in an array, because that always requires a search which is slow.
The trick to doing this efficiently is to use a function that creates the mapping, called a "hash function".
To find the entry where "Charlie" should be stored/retrieved from, you pass "Charlie" through the hash function h
to get a huge number. You then mod (%) that number by the size of the array. This gives you a valid index in the array for the word "Charlie".
You use the same hash function h
to get the index for every entry.
One hash function h
might give you these indexes:
And a different hash function might give you these indexes:
Once you pick a hash function you don't control the indices, but that's what makes it fast to lookup. All indices are "precomputed" in this hash function so you don't need to search for them.
Sometimes indexes "collide" and point to the same location. Collisions are impossible to avoid (if you try, you end up doing the same number of operations), so you need to allow both items to be stored in the same spot. (internally, this is implemented by storing all elements in a LinkedList node - we cover LinkedLists next, but you get the idea, store the items in the same slot):
Collisions are bad because they make things slower. To get Charlie's phone number you first need to walk through Bob's phone number. But collisions are pretty rare. The computer makes sure they're rare by always keeping half the slots in the Array empty. If you add too many items, it just doubles the size of the Array. Even with this extra cost of doubling the array, the average number of operations is O(1), or "just one step".
Summary: a "HashMap" is an Array, but where you look data up by strings and numbers that you choose, instead of just indexes 0,1,2,3. It's named after the "hash" function. Lookups are O(1) on average because collisions are rare and array doubling is rare.
Behind the scenes, HashMaps figure out which slot each key goes in, using this formula: key→ h(key) % len(array)