Sunday, 29 September 2013

Java Hash Maps demystified

Often new programmers use Hash Maps in Java without really knowing how they work. Knowing how Java Hash Maps work would help them to use them judiciously.

How an entry is added into a HashMap :

Lets begin with what happens when we add an entry to a hash map. For that, we use put() method of HashMap class. This method takes a key and a value. When we pass Key and Value object to put() method of Java HashMap, HashMap implementation calls hashCode method on Key object and applies returned hashCode into its hashing function. The hashing function returns a bucket location for this entry. I say entry because it comprises of both Key and Value. An important point to remember is that HashMap in Java stores both Key and Value object. Why it does so, would be clear shortly. Once it gets the bucket location, it adds the entry object to the bucket. A bucket location uses LinkedList to store various entry objects. When we add an entry object to a bucket, it is added at the head of the LinkedList to avoid tail traversing.

Collision resolution :

Java HashMap stores both Key and Value object to resolve collisions. If two Key objects have the same hashCode, they would have the same bucket location because given a hashCode, the hashing function would return the same bucket location every time. At a given bucket location Java HashMap uses LinkedList to store Entry objects (Entry object comprises of Key and Value objects). An important point to note is that if two objects have the same hashCode, they would go into the same bucket location. Since we use LinkedList to store values at a bucket location, having a long LinkedList can defeat the purpose of HashMap. So, we should judiciously implement hashCode() method of the Key object.

How a value is retrieved from a HashMap :

Now let's see how Java HashMap retrieves a Value object. The method provided for this purpose is get() and it takes a Key object as an argument. When we call get() method, the HashMap calculates hashCode of the Key object by calling hashCode method of the Key object. Once it gets the hashCode of the Key object, it calculates the bucket location of the Entry object by using hashing function. Since HashMap stores a LinkedList of Entries at a given bucket location, it uses key.equals() method to identify the correct node in the LinkedList. This explains the reason why we store both Key and Value object.

Rehashing :

What happens when the size of the HashMap exceeds a given threshold ? If it crosses a given threshold known as load-factor (which is 0.75), it resizes itself. Similar to other collection classes like ArrayList, Java HashMap resizes it self by creating a new bucket of size twice the previous size and then starts putting every old element into that new bucket array. This process is called rehashing since it applies hash function to find new bucket location. But wait !!! If you think of rehashing for a while, you would see big problem. There is a potential race condition. What if two threads find that HashMap needs resizing at the same time. In the process of resizing the HashMap, the element in bucket which is stored in LinkedList gets reversed in order because HashMap appends the new element at the head instead of the tail to avoid tail traversing. If race condition happens, you would end up in an infinite loop. That is the reason, Java HashMap is not thread safe. In such scenarios, ConcurrentHashMap should be used. There are more data structures and ways to obtain concurrency apart from ConcurrentHashMap.

Good Hash Map Keys :

Having good keys for a HashMap is very important. If the time taken to generate a hashCode is high, HashMap would perform very badly since every insertion calls hashCode method. Please note that retrieval does not compute hashCode since JVM calculates hashCode of each Key and provides it on demand. Also, if all the objects are mapped to the same bucket, it just as a LinkedList where retrieval would be very expensive. Now let's have a look at one more aspect of a good hash key. When we modify an object's state, JVM sets a flag which indicates that the object is modified and hashCode umust be recomputed. So, the next time when we call hashCode() method of that object,hashCode is recomputed. For this basic reasoning, it recommended to have immutable keys.

Very broadly, a good HashMap key should have the following traits :
1) hashCode() and equals() method of Key object should be fast.
2) hashCode should be well distributed to minimize hash collisions.
3) Key class should be immutable.

Saturday, 21 September 2013

Java String comparison using "==" Vs equals()

Comparing Strings with "==" instead of equals() method is a common mistake made by new Java developers. What makes it interesting is that "String literal" when compared with "==" works perfectly fine, which makes Java developers think that it is the correct way of comparing Strings which in fact is not. Actually, "==" operator returns true if both the operands point to the same String object. When we create a String literal, a new String object is not created every time. First it is searched in the pool if that object is already present. If it is present, that object is returned. If it is not present, a new object is created and added to the pool. So, "==" operator works fine if both the operands are String literals. To avoid subtle bugs, we should always compare Strings using equals() method.

The following Java program makes it clear :


Friday, 9 August 2013

Designing an efficient and scalable cache

Nearly all applications use some form of caching. Reusing the results of previous computation can decrease latency and increase throughput at the cost of some additional memory. Caching usually looks to be simpler than it is. Most of the naive cache implementations are designed taking single threaded application into consideration. But this design often turns into scalability bottleneck when application scales. Here I would design an efficient and scalable cache for a computational expensive function. I would be using Java for these examples. The codes are not complete and are only for clearing the concept.

Lets start with a naive cache implementation and try to improve it by finding various faults in it. Whenever we think of cache, HashMap comes to the mind. So, lets start with HashMap. Have a look at the following code snippet :

In the above code snippet, Computable declares an function with argument of type A and return type V. ExpensiveFunction implements this interface which takes a lot of time to compute. To implement naive cache, we would have a wrapper that records the results of previous computation and encapsulates the caching process. The technique is called Memoization. CacheImpl1 class is a first attempt at this which uses HashMap. The compute method in the wrapper first checks whether the result has already been computed. If it is already computed, it returns the result. If not, it computes the result and stores it in the cache for future use.

As we know, HashMap is not thread-safe, so to make the program thread-safe, the above method synchronizes the entire compute method. Though this approach ensures thread safety, but it has a performance bottleneck. Only one thread can execute compute at a time. Suppose one thread is busy executing compute which is an expensive function, no other thread would be able to execute compute even though the result has already been computed. If multiple threads are waiting for compute, it may actually take more time than without memoization. This is what we had not expected from caching.

Let's improve it a little bit by using ConcurrentHashMap which is thread safe. So, we need not synchronize the entire method. CacheImpl2 in the below snippet follows this approach, but it also has some bottlenecks. Two threads calling compute may end up computing the same value. Find out how ?? So, whats the benefit of caching. The mere advantage of caching is to prevent the same value from being calculated multiple times.

The problem with CacheImpl2 is that if one thread starts an expensive computation, the other thread is not aware that a computation is already in progress and may start the same computation. FutureTask class in Java can be handy here. Its worth going through this class first. This class provides methods to start and cancel a computation, query to see is the computation is complete and retrieve the results of a computation. FutureTask,get() is the answer to our requirement. This method returns the result if it is available, else it blocks till the result is computed and then returns it. CacheImpl3 is the below snippet uses FutureTask.
CacheImpl3 implementation above looks to be perfect, but it also has a flaw. There is still a case, wherein 2 threads can compute the same value. The culprit is the if block in the compute method which is non-atomic. Think !! With a little thought you should be able to think of that scenario. How to correct this flaw ??? putIfAbsent method of ConcurrentHashMap is the answer. Its clear from the function name, what it does. If you are still not sure, please go through the documentation. Let's look at some more corner cases. What is the computation is cancelled or it fails. Future attempts to compute this results will also fail. To counter this corner case, cache implementation should remove the Future from the cache in such cases. Here is the final implementation of cache :

Thursday, 8 August 2013

Implication of Open Interest in Futures Markets

What is Open Interest ?

The bookish definition says that Open Interest is the total number of outstanding contracts that are held by market participants at the end of the day. In other words, Open Interest is the total number of future contracts or option contracts that have not yet been exercised, expired or filled. Open Interest makes sense only for Futures market.

Open Interest measures the flow of money into the futures market and hence is used by traders to confirm trends and trend reversals in futures market. The open interest position that is reported each day represents the increase or decrease in the number of contracts for that day and hence is shown as a positive or negative number.

Open Interest Calculation :

A contract has both a buyer and a seller. The open interest for a day is the increase or decrease in the number of contracts for that day. For example, if 2 parties in a trade are initiating a new position i.e. 1 new buyer and 1 new seller are entering into the market, open interest would increase by one contract. On the other hand, if 2 parties in a trade are closing an existing position, open interest would decrease by one contract. There is one more possibility wherein one old trader passes its position to a new trader. In this case, open interest does not change.

Benefits of monitoring Open Interest :

In short, increasing open interest means that money is flowing into the market. Due to this, the present trend will continue. If the current trend is upward, it would remain upward. If it is downward, it would remain downward. Declining Open Interest means that current market trend is coming to an end. If the price jumps significantly but open interest remains leveled, it is an early warning to the end of bull market and vice-versa. Hence, an increase or decrease in prices while open interest remains same is an early indication of trend reversal.

Further reading:
Investopedia has more details on this.

Saturday, 27 July 2013

Interesting Programming Puzzle - Separate Engineers and Managers

There are n people of which more than half are engineers and the rest are managers. An engineer always tells the truth. A manager may  or may not speak the truth. to every ith guy, you can ask whether the jth guy is an engineer.
Find out a way to separate out engineers and managers in the most efficient way.

Solution :

Here's an n-1 query solution
1. Maintain three sets of people: UNSEEN, STACK, and DISCARD.
2. Initialize the process by picking one arbitrarily to be the STACK, everything else is UNSEEN.
3. Repeat the following step until UNSEEN is empty:
        a. Pick an UNSEEN element x, remove it from UNSEEN.
        b. Ask the top of the STACK y about x. If y says "manager" pop y off the stack and DISCARD both x and y. If it says "engineer" add x to the top of the STACK.

After all elements have been processed in this way (n-1 comparisons), the top of the stack must be an engineer. Once we get an engineer, we can separate out engineers and managers by asking the engineer since an engineer always speaks truth.

Why does this work? 

First observe that whenever we discard a pair, at least one of them is a manager(If you don't understand this, check by taking various cases). So among the rest of them (STACK and UNSEEN) a majority must still be engineers. So at the end,when UNSEEN is empty, there must be an engineer in the stack, therefore the top of the stack must be an engineer.

This can be improved to n-2 simply by stopping one earlier. When there's one UNSEEN left, if the stack is empty, that UNSEEN one is an engineer. Otherwise, the top of the stack must be an engineer. If is n is even and n>=4, as a first step we can just throw out one person, and apply this algorithm to the remaining n-1 obtaining n-3 comparisons. This gives the optimal algorithm.

Friday, 21 June 2013

Birthday Paradox

This is one of the very interesting problems in Probability theory. So here it goes. Birthday Paradox concerns  the probability, that in a set of n randomly chosen people, some pair of them will have the same birthday.

Consider the following example: Assuming for a moment that birthdays are evenly distributed throughout the year, if you're sitting in a room with forty people in it, what are the chances that two of those people have the same birthday? For simplicity's sake, we'll ignore leap years. A reasonable, intelligent person might point out that the odds don't reach 100% until there are 366 people in the room (the number of days in a year + 1)... and forty is about 11% of 366... so such a person might conclude that the odds of two people in forty sharing a birthday are about 11%. In reality, due to Math's convoluted reasoning, the odds are about 90%. This phenomenon is known as the Birthday Paradox.

It turns out that this phenomenon is very useful in cryptography and hashing algorithms. The birthday paradox is strange, counter-intuitive, and completely true. It’s only a “paradox” because our brains can’t handle the compounding power of exponents. We expect probabilities to be linear and only consider the scenarios we’re involved in (both faulty assumptions, by the way). So, without diverging away lets dive into the main part.

Lets work with some numbers. Suppose there are 23 people in a room. What are the chances that 2 people share a birthday ?
Lets list down all the possible ways, 23 people can have birthdays and count the ways where at least 2 people share a birthday. After this calculating this, the required probability is a cake walk. But since the number of ways are huge, this is hard to do. Lets flip the problem. Instead of finding all the ways where at 2 people share a birthday, lets find all the ways where no 2 people share a birthday. Then we can easily compute the probability that at least 2 people share a birthday by subtracting from 1. 

With 23 people, we have 253 combinations. 
(23 X 22) / 2 = 253
The chance of 2 people having different birthdays is :
1 - (1/365)  = (364 / 365) = 0.997260  --> This should be pretty easy to understand
Makes sense. 364 out of 365 days are Ok.

Now, all the 253 pairs should be different. We use exponents to find the probability. So, the probability that all the 253 pairs have different birthdays is 
(364 / 365) ^ 253 = 0.4995

Now the chances that out of 23 people in a room, no 2 people share a birthday is 1 - 0.4995 = 0.5005 = 50.05%

The generalized formula is :
P(n) = 1 - (364 / 365)^C(n,2), where n is the count of people in the room.

There is another way to look at this problem which may seem easier for some people:
Let's say you have a big wall calendar with all 365 days on it. You walk in and put a big X on your birthday. The next person who walks in has only a 364 possible open days available, so the probability of the two dates not colliding is 364/365. The next person has only 363 open days, so the probability of not colliding is 363/365. If you multiply the probabilities for all 23 people not colliding, then you get:
364/365 × 363/365 × … 365-23+1/365 = Chances of no collisions

Sunday, 14 April 2013

Catastrophic Backtracking

Lets start off with a small puzzle.

Here is the code :
What would be the output of the above code ? Here are the options :
a) 99
b) 100
c) Throw an exception
d) None of the above

As clear from the code, the regular expression will match "no" odd number strings of "a". Most the people would think that the answer is 99 since we have 99 even number strings of a's, but Oups !! Wait a minute !!

The correct answer is d option. Strange !! Isn't it ??
Well to add more to your confusion, the above program runs for more than 10^15 years before printing anything. How is this possible ?

This is due to "Catastrophic BackTracking" which is caused by our regex. Lets take a close look at it : (aa|aab?)+

Lets say we are matching a string that does not below to the pattern specified, example "aaaaa". Here is the DFA for this :


As seen from the DFA in the image above, the DFA engine traverses the "whole tree". The time complexity of the above code is exponential in the length of string : O(2^(n/2)) to be precise.


How can we fix it ?

The solution is simple. While nesting repetition operators, make sure that "there is only one way to match the same match". 
In our code, there are multiple ways to match a string. For example, both "aa" pattern is matched by both "aa" and "aab?". So, lets change our code to use a regular expression which adheres to the above principle :


The regex"(aab?)+" matches exactly the same strings as "(aa|aab?)+", but doesn't exhibit "catastrophic backtracking". The above code does produce an output and it prints 99.