Saturday, 21 September 2013

Java String comparison using "==" Vs equals()

Comparing Strings with "==" instead of equals() is a common mistake made by new Java developers. What makes it interesting is that comparing String literals with "==" works perfectly fine, which leads developers to believe that it is the correct way of comparing Strings, which in fact it is not. The "==" operator returns true only if both operands point to the same String object. When we create a String literal, a new String object is not created every time. The string pool is first searched for an object with that value: if one is present, it is returned; if not, a new object is created and added to the pool. So the "==" operator happens to work when both operands are String literals. To avoid subtle bugs, we should always compare Strings using the equals() method.

The following Java program makes it clear :
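The program originally embedded here is not preserved on this page; below is a short sketch demonstrating the behaviour described above:

```java
public class StringComparison {
    public static void main(String[] args) {
        String a = "hello";              // literal: goes into the string pool
        String b = "hello";              // literal: reuses the pooled object
        String c = new String("hello");  // explicitly creates a new object

        System.out.println(a == b);      // true  : both point to the pooled object
        System.out.println(a == c);      // false : different objects, same contents
        System.out.println(a.equals(c)); // true  : equals() compares contents
    }
}
```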


Friday, 9 August 2013

Designing an efficient and scalable cache

Nearly all applications use some form of caching. Reusing the results of a previous computation can decrease latency and increase throughput at the cost of some additional memory. Caching usually looks simpler than it is. Most naive cache implementations are designed with a single-threaded application in mind, and this design often turns into a scalability bottleneck when the application scales. Here I will design an efficient and scalable cache for a computationally expensive function, using Java for the examples. The code snippets are not complete; they are only meant to illustrate the concepts.

Let's start with a naive cache implementation and try to improve it by finding its various faults. Whenever we think of a cache, HashMap comes to mind. So, let's start with HashMap. Have a look at the following code snippet:
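The snippet itself is missing from this page; here is a minimal sketch consistent with the description that follows (the body of ExpensiveFunction, which sleeps to simulate work, is an assumption):

```java
import java.util.HashMap;
import java.util.Map;

// A function from A to V whose results are worth caching.
interface Computable<A, V> {
    V compute(A arg) throws InterruptedException;
}

// Stand-in for an expensive computation; the sleep simulates the real work.
class ExpensiveFunction implements Computable<String, Integer> {
    public Integer compute(String arg) throws InterruptedException {
        Thread.sleep(100); // pretend this takes a long time
        return arg.length();
    }
}

// First attempt: a HashMap guarded by synchronizing the whole method.
class CacheImpl1<A, V> implements Computable<A, V> {
    private final Map<A, V> cache = new HashMap<>();
    private final Computable<A, V> c;

    CacheImpl1(Computable<A, V> c) { this.c = c; }

    // Only one thread at a time can be inside compute.
    public synchronized V compute(A arg) throws InterruptedException {
        V result = cache.get(arg);
        if (result == null) {
            result = c.compute(arg);
            cache.put(arg, result);
        }
        return result;
    }
}
```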

In the above code snippet, Computable declares a function with argument type A and return type V. ExpensiveFunction implements this interface and takes a long time to compute. To implement the naive cache, we use a wrapper that records the results of previous computations and encapsulates the caching process; this technique is called memoization. The CacheImpl1 class is a first attempt at this, using HashMap. The compute method in the wrapper first checks whether the result has already been computed: if so, it returns the cached result; if not, it computes the result and stores it in the cache for future use.

As we know, HashMap is not thread-safe, so to make the program thread-safe, the implementation above synchronizes the entire compute method. Though this approach ensures thread safety, it has a performance bottleneck: only one thread can execute compute at a time. Suppose one thread is busy executing compute, which is an expensive function; no other thread can execute compute, even for results that have already been computed. If multiple threads are waiting on compute, it may actually take more time than without memoization, which is the opposite of what we expect from caching.

Let's improve it a bit by using ConcurrentHashMap, which is thread-safe, so we need not synchronize the entire method. CacheImpl2 in the snippet below follows this approach, but it too has a bottleneck: two threads calling compute at the same time may end up computing the same value. Find out how! That defeats the very point of caching, which is to prevent the same value from being calculated multiple times.
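Again the original snippet is lost; a sketch of what CacheImpl2 plausibly looked like (the Computable interface is repeated so the fragment stands alone):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface Computable<A, V> {
    V compute(A arg) throws InterruptedException;
}

// Second attempt: ConcurrentHashMap, no synchronization on compute.
class CacheImpl2<A, V> implements Computable<A, V> {
    private final Map<A, V> cache = new ConcurrentHashMap<>();
    private final Computable<A, V> c;

    CacheImpl2(Computable<A, V> c) { this.c = c; }

    public V compute(A arg) throws InterruptedException {
        V result = cache.get(arg);
        if (result == null) {
            // Two threads can both observe null here and both start the
            // same expensive computation: the race discussed below.
            result = c.compute(arg);
            cache.put(arg, result);
        }
        return result;
    }
}
```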

The problem with CacheImpl2 is that if one thread starts an expensive computation, other threads are not aware that the computation is already in progress and may start the same computation. The FutureTask class in Java comes in handy here; it is worth going through this class first. It provides methods to start and cancel a computation, query whether the computation is complete, and retrieve the result of the computation. FutureTask.get() is the answer to our requirement: this method returns the result if it is available, and otherwise blocks until the result is computed and then returns it. CacheImpl3 in the snippet below uses FutureTask.
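The snippet is again missing; a sketch of CacheImpl3 under the same assumptions as before:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

interface Computable<A, V> {
    V compute(A arg) throws InterruptedException;
}

// Third attempt: cache a Future<V> so that threads arriving while a
// computation is in progress block on get() instead of recomputing.
class CacheImpl3<A, V> implements Computable<A, V> {
    private final Map<A, Future<V>> cache = new ConcurrentHashMap<>();
    private final Computable<A, V> c;

    CacheImpl3(Computable<A, V> c) { this.c = c; }

    public V compute(final A arg) throws InterruptedException {
        Future<V> f = cache.get(arg);
        if (f == null) {
            // Non-atomic check-then-act: two threads can still slip
            // through here together -- the flaw discussed below.
            FutureTask<V> ft = new FutureTask<>(() -> c.compute(arg));
            f = ft;
            cache.put(arg, ft);
            ft.run(); // the computation runs in the calling thread
        }
        try {
            return f.get(); // blocks until the result is ready
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }
}
```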
The CacheImpl3 implementation above looks perfect, but it also has a flaw: there is still a window in which two threads can compute the same value. The culprit is the if block in the compute method, which is a non-atomic check-then-act; with a little thought you should be able to construct the scenario. How do we correct this flaw? The putIfAbsent method of ConcurrentHashMap is the answer. Its name makes clear what it does; if you are still not sure, please go through the documentation. Let's also look at some corner cases. What if the computation is cancelled, or fails? Future attempts to compute the same result would then fail as well. To counter this, the cache implementation should remove the Future from the map in such cases. Here is the final implementation of the cache:
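The final snippet is also missing here; the sketch below mirrors the well-known Memoizer pattern from Java Concurrency in Practice, which matches the text's description (putIfAbsent plus removal of cancelled or failed Futures):

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

interface Computable<A, V> {
    V compute(A arg) throws InterruptedException;
}

// Final version: putIfAbsent closes the race window, and a cancelled or
// failed Future is removed so that later calls can retry.
class Memoizer<A, V> implements Computable<A, V> {
    private final ConcurrentMap<A, Future<V>> cache = new ConcurrentHashMap<>();
    private final Computable<A, V> c;

    Memoizer(Computable<A, V> c) { this.c = c; }

    public V compute(final A arg) throws InterruptedException {
        while (true) {
            Future<V> f = cache.get(arg);
            if (f == null) {
                FutureTask<V> ft = new FutureTask<>(() -> c.compute(arg));
                f = cache.putIfAbsent(arg, ft); // atomic: only one task wins
                if (f == null) {
                    f = ft;
                    ft.run();
                }
            }
            try {
                return f.get();
            } catch (CancellationException e) {
                cache.remove(arg, f); // don't poison the cache; retry
            } catch (ExecutionException e) {
                cache.remove(arg, f); // don't cache a failed computation
                throw new RuntimeException(e.getCause());
            }
        }
    }
}
```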

Thursday, 8 August 2013

Implication of Open Interest in Futures Markets

What is Open Interest ?

The bookish definition says that Open Interest is the total number of outstanding contracts held by market participants at the end of the day. In other words, Open Interest is the total number of futures or options contracts that have not yet been exercised, expired or filled. As a trading indicator, Open Interest is most meaningful for the futures market.

Open Interest measures the flow of money into the futures market and hence is used by traders to confirm trends and trend reversals in futures market. The open interest position that is reported each day represents the increase or decrease in the number of contracts for that day and hence is shown as a positive or negative number.

Open Interest Calculation :

A contract has both a buyer and a seller. The open interest reported for a day is the increase or decrease in the number of contracts for that day. For example, if both parties in a trade are initiating a new position, i.e. a new buyer and a new seller are entering the market, open interest increases by one contract. On the other hand, if both parties are closing an existing position, open interest decreases by one contract. There is a third possibility, wherein an old trader passes his position to a new trader; in this case, open interest does not change.

Benefits of monitoring Open Interest :

In short, increasing open interest means that money is flowing into the market, so the present trend is likely to continue: if the current trend is upward, it should remain upward; if it is downward, it should remain downward. Declining open interest means that the current market trend is coming to an end. If the price jumps significantly but open interest remains flat, it is an early warning of the end of a bull market, and vice versa. Hence, a significant price move while open interest stays the same is an early indication of a trend reversal.

Further reading:
Investopedia has more details on this.

Saturday, 27 July 2013

Interesting Programming Puzzle - Separate Engineers and Managers

There are n people, of which more than half are engineers and the rest are managers. An engineer always tells the truth; a manager may or may not speak the truth. You can ask any ith person whether the jth person is an engineer.
Find a way to separate the engineers from the managers in the most efficient way.

Solution :

Here's an (n-1)-query solution:
1. Maintain three sets of people: UNSEEN, STACK, and DISCARD.
2. Initialize the process by picking one person arbitrarily to be the STACK; everyone else is UNSEEN.
3. Repeat the following step until UNSEEN is empty:
        a. Pick an UNSEEN element x and remove it from UNSEEN.
        b. Ask the top of the STACK, y, about x. If y says "manager", pop y off the stack and DISCARD both x and y. If y says "engineer", push x onto the top of the STACK.

After all elements have been processed in this way (at most n-1 questions), the top of the stack must be an engineer. Once we have an engineer, we can separate the engineers from the managers by asking that engineer about everyone else, since an engineer always speaks the truth.
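The steps above can be sketched in Java as follows; the simulation harness (a truth array plus managers who answer randomly) is an illustration of my own, not part of the original puzzle:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// isEngineer[i] holds the ground truth; the algorithm never reads it
// directly, only through ask(), which models the interview.
class EngineerFinder {
    private final boolean[] isEngineer;
    private final Random rnd = new Random(42); // managers answer arbitrarily

    EngineerFinder(boolean[] isEngineer) { this.isEngineer = isEngineer; }

    // Ask person i whether person j is an engineer.
    boolean ask(int i, int j) {
        if (isEngineer[i]) return isEngineer[j]; // engineers tell the truth
        return rnd.nextBoolean();                // managers may say anything
    }

    // Returns the index of someone guaranteed to be an engineer.
    int findEngineer() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0); // step 2: seed the stack with an arbitrary person
        for (int x = 1; x < isEngineer.length; x++) {
            if (stack.isEmpty()) {      // stack emptied out: re-seed it
                stack.push(x);
            } else if (ask(stack.peek(), x)) {
                stack.push(x);          // "engineer": keep x on the stack
            } else {
                stack.pop();            // "manager": discard both x and y
            }
        }
        return stack.peek();            // top must be an engineer
    }
}
```

Because every discarded pair contains at least one manager, engineers stay in the majority among the rest, so the result holds no matter how the managers answer.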

Why does this work? 

First observe that whenever we discard a pair, at least one of the two is a manager (if you don't see this, check the possible cases). So among the rest (STACK and UNSEEN) engineers must still form a majority. Therefore at the end, when UNSEEN is empty, the stack must contain an engineer, and hence the top of the stack must be an engineer.

This can be improved to n-2 simply by stopping one step earlier: when there is one UNSEEN person left, if the stack is empty, that UNSEEN person is an engineer; otherwise, the top of the stack must be an engineer. If n is even and n >= 4, as a first step we can simply throw out one person and apply the algorithm to the remaining n-1, obtaining n-3 comparisons. This gives the optimal algorithm.

Friday, 21 June 2013

Birthday Paradox

This is one of the very interesting problems in probability theory. So here it goes. The Birthday Paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.

Consider the following example: Assuming for a moment that birthdays are evenly distributed throughout the year, if you're sitting in a room with forty people in it, what are the chances that two of those people have the same birthday? For simplicity's sake, we'll ignore leap years. A reasonable, intelligent person might point out that the odds don't reach 100% until there are 366 people in the room (the number of days in a year + 1)... and forty is about 11% of 366... so such a person might conclude that the odds of two people in forty sharing a birthday are about 11%. In reality, due to Math's convoluted reasoning, the odds are about 90%. This phenomenon is known as the Birthday Paradox.

It turns out that this phenomenon is very useful in cryptography and hashing algorithms. The birthday paradox is strange, counter-intuitive, and completely true. It's only a "paradox" because our brains can't handle the compounding power of exponents. We expect probabilities to be linear and only consider the scenarios we're involved in (both faulty assumptions, by the way). So, without digressing further, let's dive into the main part.

Let's work with some numbers. Suppose there are 23 people in a room. What are the chances that 2 people share a birthday?
We could list down all the possible ways 23 people can have birthdays and count the ways in which at least 2 people share a birthday; after that, calculating the required probability is a cakewalk. But since the number of ways is huge, this is hard to do. So let's flip the problem: instead of counting the ways in which at least 2 people share a birthday, let's count the ways in which no 2 people share a birthday. Then we can easily compute the probability that at least 2 people share a birthday by subtracting from 1.

With 23 people, we have 253 combinations. 
(23 X 22) / 2 = 253
The chance of 2 people having different birthdays is :
1 - (1/365)  = (364 / 365) = 0.997260  --> This should be pretty easy to understand
Makes sense. 364 out of 365 days are Ok.

Now, all 253 pairs must be different. Treating the pairs as independent (they are not exactly, which is why this formula is an approximation), the probability that all 253 pairs have different birthdays is
(364 / 365) ^ 253 = 0.4995

Now the chance that, out of 23 people in a room, at least 2 people share a birthday is 1 - 0.4995 = 0.5005 = 50.05%

The generalized (approximate) formula is:
P(n) ≈ 1 - (364 / 365)^C(n,2), where n is the number of people in the room.

There is another way to look at this problem which may seem easier for some people:
Let's say you have a big wall calendar with all 365 days on it. You walk in and put a big X on your birthday. The next person who walks in has only 364 possible open days available, so the probability of the two dates not colliding is 364/365. The next person has only 363 open days, so the probability of not colliding is 363/365. If you multiply the probabilities for all 23 people not colliding, you get:
364/365 × 363/365 × … × (365-23+1)/365 = chance of no collision
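This running product is easy to carry out in code; a small sketch:

```java
// Exact probability that at least two of n people share a birthday,
// built from the "no collision so far" product described above.
class Birthday {
    static double atLeastOneShared(int n) {
        double noCollision = 1.0;
        for (int k = 0; k < n; k++) {
            // the (k+1)-th person must avoid the k days already taken
            noCollision *= (365.0 - k) / 365.0;
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        System.out.println(atLeastOneShared(23)); // about 0.507
        System.out.println(atLeastOneShared(40)); // about 0.891
    }
}
```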

Sunday, 14 April 2013

Catastrophic Backtracking

Let's start off with a small puzzle.

Here is the code :
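The code block originally embedded here is lost; below is a reconstruction of its idea. The loop bound of 199 a's is an assumption, chosen because 1..199 contains exactly 99 even lengths, matching option (a) below; the call in main is capped at a much smaller bound so that the demo actually terminates.

```java
import java.util.regex.Pattern;

class RegexPuzzle {
    static final Pattern P = Pattern.compile("(aa|aab?)+");

    // Counts how many of the strings "a", "aa", ..., "a"*maxLen match P.
    static int countMatches(int maxLen) {
        int count = 0;
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= maxLen; i++) {
            sb.append('a');
            if (P.matcher(sb).matches()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The puzzle's version: System.out.println(countMatches(199));
        // Each failing odd-length string costs O(2^(n/2)) backtracking
        // steps, so with 199 it would run for an astronomically long time.
        System.out.println(countMatches(21)); // capped so the demo finishes
    }
}
```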
What would be the output of the above code ? Here are the options :
a) 99
b) 100
c) Throw an exception
d) None of the above

As is clear from the code, the regular expression matches no odd-length string of a's. Most people would think that the answer is 99, since we have 99 even-length strings of a's, but oops! Wait a minute!

The correct answer is option d. Strange, isn't it?
Well, to add more to your confusion, the above program would run for more than 10^15 years before printing anything. How is this possible?

This is due to "catastrophic backtracking" caused by our regex. Let's take a close look at it: (aa|aab?)+

Let's say we are matching a string that does not belong to the pattern, for example "aaaaa". Here is the tree of paths the engine explores:


As seen in the image above, the backtracking engine traverses the whole tree of possibilities. The time complexity of the above code is exponential in the length of the string: O(2^(n/2)) to be precise.


How can we fix it ?

The solution is simple: when nesting repetition operators, make sure that there is only one way to match any given string.
In our regex there are multiple ways to match a string: for example, "aa" is matched by both of the alternatives "aa" and "aab?". So, let's change our code to use a regular expression which adheres to the above principle:
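The corrected snippet is also lost; a sketch of it, again assuming a loop bound of 199 a's:

```java
import java.util.regex.Pattern;

class FixedRegexPuzzle {
    // "(aab?)+" matches the same strings as "(aa|aab?)+", but there is
    // only one way to match any given string: no catastrophic backtracking.
    static final Pattern P = Pattern.compile("(aab?)+");

    // Counts how many of the strings "a", "aa", ..., "a"*maxLen match P.
    static int countMatches(int maxLen) {
        int count = 0;
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= maxLen; i++) {
            sb.append('a');
            if (P.matcher(sb).matches()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 99: the even lengths 2, 4, ..., 198 -- and it finishes fast
        System.out.println(countMatches(199));
    }
}
```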


The regex "(aab?)+" matches exactly the same strings as "(aa|aab?)+", but doesn't exhibit catastrophic backtracking. The above code does produce an output, and it prints 99.

Saturday, 6 April 2013

Java Garbage Collection

Garbage Collection is the process of looking at the heap memory, identifying which objects are in use and which are not, and deleting the unused objects. When we say that an object is in use, we mean that some part of our program still has a pointer to that object. An unused object on the other hand, is no longer referenced by any part of our program. Hence the memory used by the unused object can be reclaimed. This is what garbage collection does.

In Java, unlike C or C++, memory is not explicitly allocated and deallocated in the program code. Some developers set object references to null or call System.gc() to free memory explicitly. Explicitly nulling object references is rarely necessary, since modern JVMs are smart enough to detect when an object goes out of scope in almost all cases; calling System.gc(), on the other hand, can affect system performance drastically and should be avoided. In Java, the developer does not explicitly free memory in the program code; the garbage collector takes care of it. The Java garbage collector is based on the following 2 hypotheses, which together are called the weak generational hypothesis:

  1. Most objects become unreachable soon.
  2. References from old objects to new objects only exist in small numbers.
In HotSpot JVM, the heap space is divided into 3 regions :
  1. Young Generation : Most newly created objects are located in this region. Since, according to the above hypothesis, most objects become unreachable soon, most objects die here. When we say objects die here, we mean that they are collected by a minor GC. The objects which survive this generation are transferred to the old generation.
  2. Old Generation : The objects which survive the young generation are copied here. It is generally larger than the young generation, and since it is bigger in size, GC is less frequent here than in the young generation. It should be noted that the size of the generations is configurable and is tuned based on the application's requirements, but most of the time the old generation is bigger than the young generation. When objects die in the old generation, we say that a major GC has occurred.
  3. Permanent Generation : The permanent generation holds metadata for classes and methods. A GC can occur in this region as well, and it is counted as a major GC. This generation is populated by the JVM at runtime based on the classes in use by the application. In addition, Java SE library classes and methods may be stored here.
The following figure makes it clear.

Figure 1: GC Area & Data Flow.

Why Java uses different generations in the Heap ?

The JVM uses the Mark and Sweep algorithm for garbage collection. The first step in this process is called marking: the garbage collector identifies which objects are in use and which are not. This can be a time-consuming process, since we need to scan all the objects. The second step is deletion: unreferenced objects are removed, leaving behind the referenced objects and pointers to the free space. The following figures explain the 2 steps:
To improve performance, in addition to removing the unreferenced objects, we can compact the remaining referenced objects by moving them together. This makes new memory allocations easier and faster. Having to mark all the objects in the JVM is inefficient: as more and more objects are allocated, the list of objects grows, leading to longer garbage collection times. Instead, we can make use of the weak generational hypothesis explained previously. That is why Java uses generational garbage collection.

Composition of Young Generation :

The young generation is composed of 3 spaces: one Eden space and two Survivor spaces. The majority of newly created objects are allocated in the Eden space. After one GC of the Eden space, the surviving objects are moved to one of the survivor spaces. At subsequent GCs, objects that survive are copied to the other survivor space, and objects that survive long enough are eventually promoted to the old generation.

Some Garbage Collectors in JVM :

Some types of garbage collectors:
  1. Serial GC
  2. Parallel GC
  3. Concurrent Mark & Sweep (CMS) GC

Serial GC : 

This GC first marks the surviving objects in the old generation. It then sweeps the heap, leaving only the surviving objects behind, and finally compacts them into one part of the heap. Serial GC is stop-the-world, i.e. the application is paused while it runs.

Parallel GC : 

This GC is also stop-the-world. While Serial GC uses only one thread for GC, Parallel GC uses multiple threads and is hence more efficient than Serial GC. It makes sense to use this GC when we have multiple cores.

Concurrent Mark & Sweep GC : 

By default, this GC uses a single thread for its concurrent work, which runs alongside the application. It performs the following steps:
  • Initial Mark
  • Concurrent Marking
  • Remark
  • Concurrent Sweeping
As is clear from the figure below, Concurrent Mark & Sweep GC stops the world only during the initial mark and remark phases. During the concurrent marking and sweeping phases, this GC runs alongside the application threads. The concurrent phases can also be parallelized in the HotSpot JVM. This GC is used when response time is crucial.

Figure 5: Serial GC & CMS GC.


JVM options for different GC algorithms :

  • Serial generational collector (-XX:+UseSerialGC).
  • Parallel for young space, serial for old space generational collector (-XX:+UseParallelGC).
  • Parallel young and old space generational collector (-XX:+UseParallelOldGC).
  • Concurrent mark sweep with serial young space collector (-XX:+UseConcMarkSweepGC -XX:-UseParNewGC).
  • Concurrent mark sweep with parallel young space collector (-XX:+UseConcMarkSweepGC -XX:+UseParNewGC).
  • G1 garbage collector (-XX:+UseG1GC).
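As an illustration, the options above are passed straight on the java command line (the class name MyApp and the heap sizes here are placeholders):

```shell
# Run with the CMS collector plus the parallel young-space collector,
# printing GC details so the pauses can be inspected.
java -Xms512m -Xmx512m \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails \
     MyApp
```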