Sunday, 29 September 2013

Java Hash Maps demystified

New programmers often use hash maps in Java without really knowing how they work. Knowing how HashMap works internally helps them use it judiciously.

How an entry is added into a HashMap :

Let's begin with what happens when we add an entry to a hash map. For that, we use the put() method of the HashMap class, which takes a key and a value. When we pass Key and Value objects to put(), the HashMap implementation calls hashCode() on the Key object and feeds the returned hash code into its internal hashing function. The hashing function returns a bucket location for this entry. I say entry because it consists of both the Key and the Value. An important point to remember is that HashMap stores both the Key and the Value object in the bucket, not just the Value. Why it does so will be clear shortly. Once it has the bucket location, it first checks whether the bucket already holds an entry whose key equals() the new key; if so, only the value is replaced. Otherwise it adds a new entry object to the bucket. Each bucket holds a linked list of entry objects (HashMap's own singly linked chain of entries, not java.util.LinkedList). In Java 7 and earlier, a new entry is added at the head of the list to avoid traversing to the tail. Java 8 and later append at the tail instead and can convert a very long chain into a balanced tree.
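To make the steps concrete, here is a minimal sketch of put() with head insertion. TinyHashMap, bucketIndex() and keysInBucket() are names made up for this post, the bucket count is fixed, and a plain modulo stands in for the real hashing function, so treat it as an illustration of the idea rather than the actual java.util.HashMap code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical teaching class, not java.util.HashMap: fixed bucket count, no resizing.
class TinyHashMap<K, V> {

    // One node of a bucket's chain. It keeps the key as well as the value.
    static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    TinyHashMap(int bucketCount) {
        buckets = (Node<K, V>[]) new Node[bucketCount];
    }

    // Stand-in for the hashing function (an assumption): hashCode modulo bucket count.
    int bucketIndex(Object key) {
        int h = (key == null) ? 0 : key.hashCode();
        return Math.floorMod(h, buckets.length);
    }

    // Returns the previous value for the key, or null if the key is new.
    V put(K key, V value) {
        int i = bucketIndex(key);
        // If an equal key is already in the chain, only its value is replaced.
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        // Otherwise the new node becomes the head of the chain.
        buckets[i] = new Node<>(key, value, buckets[i]);
        return null;
    }

    // Keys of one bucket in chain order, head first (for demonstration only).
    List<K> keysInBucket(int i) {
        List<K> keys = new ArrayList<>();
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            keys.add(n.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        TinyHashMap<Integer, String> map = new TinyHashMap<>(16);
        map.put(1, "one");
        map.put(17, "seventeen"); // 17 mod 16 == 1, so it shares bucket 1 with key 1
        map.put(1, "uno");        // same key: the value is replaced, no new node
        System.out.println(map.keysInBucket(1)); // [17, 1]
    }
}
```

Key 17 shows up before key 1 because it was inserted later and went to the head of the chain.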

Collision resolution :

HashMap stores both the Key and the Value object so that it can resolve collisions. If two Key objects have the same hashCode, they get the same bucket location, because the hashing function returns the same bucket location for a given hashCode every time. Keys with different hashCodes can also land in the same bucket, since there are far fewer buckets than possible hash codes. At a given bucket location, HashMap keeps a linked list of Entry objects (an Entry consists of a Key and a Value). Since every lookup has to walk this list, a long list defeats the purpose of a HashMap: retrieval degrades from near constant time to a linear scan. So we should implement the hashCode() method of the Key class judiciously.
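A quick way to see a collision with the real java.util.HashMap is to use the strings "Aa" and "BB", which happen to have the same hashCode (2112). The class name CollisionDemo and the helper build() are just made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: two different keys with the same hashCode.
class CollisionDemo {

    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put("Aa", "first");
        map.put("BB", "second"); // same hashCode as "Aa", so same bucket
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        Map<String, String> map = build();
        System.out.println(map.size());    // 2
        System.out.println(map.get("Aa")); // first
        System.out.println(map.get("BB")); // second
    }
}
```

Both entries survive because the bucket stores the keys too, so the map can tell them apart with equals().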

How a value is retrieved from a HashMap :

Now let's see how HashMap retrieves a Value object. The method provided for this purpose is get(), and it takes a Key object as an argument. When we call get(), the HashMap obtains the hash code of the Key object by calling its hashCode() method. It then passes that hash code through the hashing function to calculate the bucket location of the Entry. Since a bucket may hold a linked list of several Entries, HashMap walks the list and uses key.equals() to identify the correct node. If no node matches, get() returns null. This is the reason why both the Key and the Value object are stored: without the stored Key there would be nothing to compare against.
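The small experiment below, run against the real java.util.HashMap, counts how often get() calls hashCode() and equals() on a key. LookupTrace and its Key class are hypothetical names for this post, and the constant hash code is there on purpose to force both keys into one bucket. The exact counts may differ between JDK versions, so the sketch only prints them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: traces the calls get() makes on the key.
class LookupTrace {
    static int hashCodeCalls = 0;
    static int equalsCalls = 0;

    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public int hashCode() {
            hashCodeCalls++;
            return 42; // constant on purpose: every key collides
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Key && ((Key) o).name.equals(name);
        }
    }

    // Fills a map with two colliding keys, then traces a single get().
    static String lookup() {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second");
        hashCodeCalls = 0; // count only what get() does
        equalsCalls = 0;
        return map.get(new Key("b"));
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // second
        System.out.println("hashCode() calls during get(): " + hashCodeCalls);
        System.out.println("equals() calls during get(): " + equalsCalls);
    }
}
```

The lookup key is a fresh object, so get() has to hash it to find the bucket and then rely on equals() to pick the right entry from the chain.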

Rehashing :

What happens when the number of entries in the HashMap exceeds a given threshold? The threshold is the current capacity multiplied by the load factor (0.75 by default), so with the default capacity of 16 it is 12 entries. Once the size crosses it, the HashMap resizes itself. Similar to other collection classes like ArrayList, HashMap resizes itself by creating a new bucket array of twice the previous size and then moving every old entry into the new array. This process is called rehashing, since the bucket location of each entry is computed again for the new array size. But wait! If you think about rehashing for a while, you will see a big problem: there is a potential race condition. What if two threads find that the HashMap needs resizing at the same time? In Java 7 and earlier, the entries of a bucket's linked list end up in reverse order after a resize, because each entry is inserted at the head of the new list instead of the tail to avoid tail traversing. If two threads do this to the same bucket at once, the list can end up with a cycle, and a later get() or put() on that bucket loops forever. This is one of the reasons why HashMap is not thread safe and must not be modified by several threads without synchronization. In such scenarios, ConcurrentHashMap should be used. There are other options besides ConcurrentHashMap as well, such as wrapping the map with Collections.synchronizedMap().
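The arithmetic behind resizing can be sketched in a few lines. ResizeSketch, threshold() and indexFor() are illustrative names, and indexFor() assumes a power-of-two capacity and an already computed hash value; the real class applies an extra bit-mixing step to the hashCode first.

```java
// Hypothetical helper class illustrating the resize arithmetic.
class ResizeSketch {

    // Number of entries beyond which the table is resized.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Bucket index for a power-of-two capacity: keeps the low bits of the hash.
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f)); // 12
        System.out.println(threshold(32, 0.75f)); // 24

        // Hashes 5 and 21 share bucket 5 while the capacity is 16 ...
        System.out.println(indexFor(5, 16));  // 5
        System.out.println(indexFor(21, 16)); // 5

        // ... and are split apart once the capacity doubles to 32.
        System.out.println(indexFor(5, 32));  // 5
        System.out.println(indexFor(21, 32)); // 21
    }
}
```

This is why every entry has to be visited during a resize: the same hash can map to a different bucket once the array is larger.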

Good Hash Map Keys :

Having good keys for a HashMap is very important. If generating a hashCode takes a long time, the HashMap performs very badly, since every insertion and every retrieval calls hashCode() on the key that is passed in. HashMap does keep the computed hash of each stored key inside its entry, and some classes, such as String, cache their own hash code after computing it once, but the key handed to get() or put() is always hashed. Also, if all the objects are mapped to the same bucket, the map behaves just like a linked list, where retrieval is very expensive. Now let's have a look at one more aspect of a good hash key. The JVM does not track changes to an object's state: hashCode() simply runs every time it is called and returns whatever the current state produces. So if we modify a key after putting it into the map, and the modified fields take part in hashCode(), the entry stays in the bucket that was chosen from the old hash code while lookups go by the new one, and the entry can no longer be found. For this reason, it is recommended to use immutable keys.
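Here is a small demonstration of what goes wrong with a mutable key. MutableKeyPitfall and MutableKey are hypothetical classes written only for this example; the map itself is the real java.util.HashMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: a key that is changed after insertion gets "lost".
class MutableKeyPitfall {

    static final class MutableKey {
        int id; // mutable on purpose

        MutableKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey(1);
        map.put(key, "value");
        System.out.println(map.get(key)); // value

        key.id = 2; // the hash code changes while the entry stays where it was

        System.out.println(map.get(key));               // null: hash no longer matches
        System.out.println(map.get(new MutableKey(1))); // null: old hash, but equals() fails
        System.out.println(map.size());                 // 1: the entry is still in there
    }
}
```

The entry is still inside the map, yet neither the modified key nor a copy of the original key can reach it.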

Very broadly, a good HashMap key should have the following traits :
1) The hashCode() and equals() methods of the Key object should be fast.
2) hashCode should be well distributed to minimize hash collisions.
3) Key class should be immutable.
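
As an illustration of these three traits, here is one possible shape for a key class. EmployeeId and its fields are invented for this post; any small immutable class with consistent equals() and hashCode() would do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class: final class, final fields, cheap equals() and hashCode().
final class EmployeeId {
    private final String department;
    private final int number;

    EmployeeId(String department, int number) {
        this.department = Objects.requireNonNull(department);
        this.number = number;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof EmployeeId)) {
            return false;
        }
        EmployeeId other = (EmployeeId) o;
        return number == other.number && department.equals(other.department);
    }

    // Uses exactly the fields that equals() compares, so equal keys hash alike.
    @Override
    public int hashCode() {
        return 31 * department.hashCode() + number;
    }

    public static void main(String[] args) {
        Map<EmployeeId, String> names = new HashMap<>();
        names.put(new EmployeeId("eng", 7), "Asha");
        System.out.println(names.get(new EmployeeId("eng", 7))); // Asha
        System.out.println(names.get(new EmployeeId("ops", 7))); // null
    }
}
```

Because the fields are final, the hash code of a key can never change while it sits in a map.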
