Sunday, 14 April 2013

Catastrophic Backtracking

Lets start off with a small puzzle.

Here is the code :
What would be the output of the above code ? Here are the options :
a) 99
b) 100
c) Throw an exception
d) None of the above

As clear from the code, the regular expression will match "no" odd number strings of "a". Most the people would think that the answer is 99 since we have 99 even number strings of a's, but Oups !! Wait a minute !!

The correct answer is d option. Strange !! Isn't it ??
Well to add more to your confusion, the above program runs for more than 10^15 years before printing anything. How is this possible ?

This is due to "Catastrophic BackTracking" which is caused by our regex. Lets take a close look at it : (aa|aab?)+

Lets say we are matching a string that does not below to the pattern specified, example "aaaaa". Here is the DFA for this :


As seen from the DFA in the image above, the DFA engine traverses the "whole tree". The time complexity of the above code is exponential in the length of string : O(2^(n/2)) to be precise.


How can we fix it ?

The solution is simple. While nesting repetition operators, make sure that "there is only one way to match the same match". 
In our code, there are multiple ways to match a string. For example, both "aa" pattern is matched by both "aa" and "aab?". So, lets change our code to use a regular expression which adheres to the above principle :


The regex"(aab?)+" matches exactly the same strings as "(aa|aab?)+", but doesn't exhibit "catastrophic backtracking". The above code does produce an output and it prints 99.

No comments:

Post a Comment