Data Structures

Making the HashMap

Lesson

When you use an Array, you can only look up data by index 0, 1, 2.

But what if you want to look up data by string? For example, what if you're storing your friends' phone numbers, and you want to quickly lookup a person's phone number given their name? You really want a data structure like this:

You might think that it's very simple to store this information, and that you can just do it by storing a bunch of entries like "Alice:11", "Bob:22", "Charlie:33" in an Array. However, if you did that, finding the phone number of say "Charlie" would actually worst-case require looping though every single entry to check if it's the entry for Charlie, which is a slow brute force search!

We want lookups to be instant.

Imagine you're in the shoes of the first computer scientists, who only knew how to implement Arrays. The question is, how do you use Arrays to build something where you can lookup data by string, not simply index?

The main idea is to convert the strings like "Charlie" into an index, so that you can jump right to an index for all the names, which would mean lookups are fast. It's not so obvious how to do the conversion, though - if you store something like "Alice:0", "Bob:1", "Charlie:2" in an Array to do the conversion, then you get the same problem as before, where you need to brute force through all the entries to find the right one!

The trick to doing this efficiently is to use a function to do the conversion, called a "hash function". That way, you avoid the brute force problem. To find the entry where "Charlie" should be stored/retrieved from, you pass "Charlie" through the hash function h to get a huge number. You then mod (%) that number by the size of the array. This gives you a valid index in the array for the word "Charlie".

You use the same hash function h to get the index for every entry.

One hash function h might give you these indexes:

And a different hash function might give you these indexes:

Once you pick a hash function you don't control the indices, but that's what makes it fast to lookup - it uses randomness to come up with an array index, not storage (which we saw always requires a brute force search). Another way of looking at this is that all the indices are "precomputed" in the hash function so you don't need to search for them.

Collisions

Sometimes indexes "collide" and point to the same location. Collisions are impossible to avoid (if you try, you end up doing the same number of operations), so you need to allow both items to be stored in the same spot. (internally, this is implemented by storing all elements in a LinkedList node - we cover LinkedLists next, but you get the idea, store the items in the same slot):

Collisions are bad because they make things slower. To get Charlie's phone number you first need to walk through Bob's phone number. But collisions are pretty rare. The computer makes sure they're rare by always keeping half the slots in the Array empty. If you add too many items, it just doubles the size of the Array. Even with this extra cost of doubling the array, the average number of operations is $O(1)$ , or "just one step".

A Complete Recap

TLDR: A "HashMap" is an Array, but where you can look up/add/remove data by strings * , instead of just by indexes 0, 1, 2, 3. It's named after the "hash" function. Lookups are $O(1)$ on average because collisions are rare thanks to array size doubling, and array size doubling is rare.

Behind the scenes, HashMaps figure out which slot each key goes in, using this formula: $\texttt{key} \rightarrow \texttt{ h(key) } \texttt{\%} \texttt{ len(array)}$

Mark as Completed:

Submits:

test

Test your code to get an output here!