Sunday 14 April 2013

Catastrophic Backtracking

Lets start off with a small puzzle.

Here is the code :
What would be the output of the above code ? Here are the options :
a) 99
b) 100
c) Throw an exception
d) None of the above

As clear from the code, the regular expression will match "no" odd number strings of "a". Most the people would think that the answer is 99 since we have 99 even number strings of a's, but Oups !! Wait a minute !!

The correct answer is d option. Strange !! Isn't it ??
Well to add more to your confusion, the above program runs for more than 10^15 years before printing anything. How is this possible ?

This is due to "Catastrophic BackTracking" which is caused by our regex. Lets take a close look at it : (aa|aab?)+

Lets say we are matching a string that does not below to the pattern specified, example "aaaaa". Here is the DFA for this :


As seen from the DFA in the image above, the DFA engine traverses the "whole tree". The time complexity of the above code is exponential in the length of string : O(2^(n/2)) to be precise.


How can we fix it ?

The solution is simple. While nesting repetition operators, make sure that "there is only one way to match the same match". 
In our code, there are multiple ways to match a string. For example, both "aa" pattern is matched by both "aa" and "aab?". So, lets change our code to use a regular expression which adheres to the above principle :


The regex"(aab?)+" matches exactly the same strings as "(aa|aab?)+", but doesn't exhibit "catastrophic backtracking". The above code does produce an output and it prints 99.

Saturday 6 April 2013

Java Garbage Collection

Garbage Collection is the process of looking at the heap memory, identifying which objects are in use and which are not, and deleting the unused objects. When we say that an object is in use, we mean that some part of our program still has a pointer to that object. An unused object on the other hand, is no longer referenced by any part of our program. Hence the memory used by the unused object can be reclaimed. This is what garbage collection does.

Java does not explicitly allocate and deallocate memory in the program code like in C or C++. Some developers set the object references to null or call System.gc() to free the memory explicitly. Explicitly nulling of object references is not a big deal  since modern Java compilers are smart enough to interpret when an object becomes out of scope in "almost all" the cases. But System.gc() would affect the system performance drastically and should not be called. In Java, the developer does not explicitly free the memory in the program code. The garbage collector takes care of it. The Java garbage collector is based on the following 2 hypotheses which are called the weak generation hypotheses:

  1. Most objects become unreachable soon.
  2. References from old objects to new objects only exist in small numbers.
In HotSpot JVM, the heap space is divided into 3 regions :
  1. Young Generation : Most of the newly created objects are located in this region. Since according to the above hypotheses most objects become unreachable soon, most objects die here. When we say objects die here, we mean to say that they are collected by minor GC. The objects which survive this generation are transferred to the old generation.
  2. Old Generation : The objects which survive the young generation are copied here. It is "generally" larger than the young generation. Since it is bigger in size, GC is less frequent than in the young generation. It should be noted that the size of the generations is configurable and is tuned based on the application requirement, but most of the times old generation is bigger than the young generation. When objects die in the old generation we say that a major GC has occurred.
  3. Permanent Generation : The permanent generation is for methods and classes. A GC can occur is this as well and it is counted as a major GC. This generation is populated by the JVM at runtime based on the classes in use by the application. In addition, Java SE library classes and methods may be stored here.
The following figure makes it clear.

Figure 1: GC Area & Data Flow.

Why Java uses different generations in the Heap ?

JVM used Mark and Sweep algorithm for garbage collection. The first step in this process is called marking. In this step, garbage collector identifies which objects are in use and which are not. This can be a time consuming process since we need to scan all the objects. The second step is the deletion step. In this step, unreferenced objects are removed, leaving behind referenced objects and pointers to free space. The following figures explain the 2 steps :
To improve the performance, in addition to removing the unreferenced objects, we compact the remaining referenced objects by moving them together. This makes new memory allocations easier and faster. Having to mark all objects in JVM is inefficient. As more and more objects are allocated, list of objects grows leading to longer garbage collection times. Instead, we can make use of the weak generation hypothesis as explained previously. That is why Java uses Generational Garbage Collection.

Composition of Young Generation :

The young generation is composed of 3 spaces : one Eden space and two Survivor spaces. The majority of the newly created objects are created in the Eden space. After one GC in the Eden space, surviving objects are moved to one of the survivor spaces. Once this survivor space is full, surviving objects are moved to the other survivor space. And surviving objects from this survivor space are moved to the old generation.

Some Garbage Collectors in JVM :

Some types of garbage collectors:
  1. Serial GC
  2. Parallel GC
  3. Concurrent Mark & Sweep (CMS) GC

Serial GC : 

This GC first marks the surviving objects in the old generation. It then sweeps the heap leaving only the surviving objects behind. It then moves the survived objects to one part. Serial GC causes stop-the-world i.e. the application becomes inactive during Serial GC.

Parallel GC : 

This GC also causes stop-the world. While Serial GC uses only one thread for GC, Parallel GC uses multiple threads for GC and is hence efficient than Serial GC. It makes sense to use this GC when we have multiple cores.

Concurrent Mark & Sweep GC : 

This GC uses only one thread. This GC performs the following steps :
  • Initial Mark
  • Concurrent Marking
  • Remark
  • Concurrent Sweeping
As clear from the below figure that Concurrent Mark Sweep GC stops the world only during the initial mark and remark phases. During the concurrent marking and sweeping phases, this GC runs along with application thread. We can parallelize this with concurrency in HotSpot JVM. This GC is used when the response time is very crucial.

Figure 5: Serial GC & CMS GC.


JVM options for different GC algorithms :

  • Serial generational collector (-XX:+UseSerialGC).
  • Parallel for young space, serial for old space generational collector (-XX:+UseParallelGC).
  • Parallel young and old space generational collector (-XX:+UseParallelOldGC).
  • Concurrent mark sweep with serial young space collector (-XX:+UseConcMarkSweepGC -XX:-UseParNewGC).
  • Concurrent mark sweep with parallel young space collector (-XX:+UseConcMarkSweepGC -XX:+UseParNewGC).
  • G1 garbage collector (-XX:+UseG1GC).