Monday 14 November 2011

Using Lucene to search Java code


Several websites allow the software development community to share information by publishing developer guidelines, white papers, FAQs, and source code. As the information content grows, with several developers contributing to the repository, websites provide search engines to search all of the information present on the site. While these search engines do a very good job of retrieving text documents, they severely constrain a developer searching source code. Search engines treat source code files as plain text documents and hence are no different from a sophisticated grep tool capable of handling large source files.
In this article I propose the approach of using Lucene, the Java-based open source search engine, to search source code by extracting and indexing relevant source code elements. I restrict the search to Java source code only. However, extending the search to any other programming language's source code should not be very different.
The article gives you a brief overview of the important aspects of a search engine in the context of Lucene.

Overview

Lucene is one of the most popular open source search engine libraries. Lucene consists of core APIs that allow indexing and searching of text. Given a set of text files, Lucene can create indexes and allow you to search those indexes with complex queries such as +title:Lucene -content:Search, search AND Lucene, or +search +code. Before going into the details of searching, let me introduce you to some of the functional aspects of Lucene.

Indexing Text in Lucene

Search engines scan all of the data that need to be searched and store them in a structure that allows efficient retrieval. The most well-known structure is called the inverted index. For example, consider indexing a set of conference proceedings. First, each paper or document of the proceedings is considered to have distinct parts, or fields, such as title, author, email, abstract, and content. The text of each field is tokenized and keywords or terms are extracted. The inverted index for the proceedings will be stored as shown in the following table.
Field      Term              Document Frequency    Documents
Author     Erik Hatcher      1                     Doc1
           Bill Inmon        1                     Doc6
           ...
Email      erik@abc.com      1                     Doc1
           xyz@mit.edu       1                     Doc15
           ...
Abstract   lucene            1                     Doc1
           j2ee              2                     Doc6, Doc7
           system            5                     Doc1, Doc2, Doc6, Doc15, Doc18
           ...
For each term in a field, the number of documents it occurs in (the document frequency) and the IDs of the documents in which it occurs are stored. Other details are stored for each term as well: the number of times it occurs in each document and the positions where it occurs. However, all we need to know is that indexing a text file with Lucene means storing it in a format that allows for efficient querying and retrieval.
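To make this concrete, here is a minimal indexing sketch using the Lucene 3.x-era API (current around the time of writing); the field names and values follow the proceedings example above, and the index directory path is only illustrative:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ProceedingsIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("proceedings-index"));
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        IndexWriter writer = new IndexWriter(dir, config);

        // Each paper becomes a Document made up of the fields from the table above.
        Document doc = new Document();
        doc.add(new Field("author", "Erik Hatcher", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("email", "erik@abc.com", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("abstract", "Lucene is a search engine library ...",
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.close();   // commits the inverted index to disk
    }
}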

Analyzing Text to Be Indexed

Lucene processes the text to be indexed with analyzers. Analyzers are used to tokenize text, extract relevant words, discard common words, stem the words (reduce them to the root form, meaning that bowling, bowler, and bowls are reduced to bowl), and perform any other desired processing before storing it in the index. The common analyzers provided by Lucene are:
  • SimpleAnalyzer: Tokenizes the string to a set of words and converts them to lower case.
  • StandardAnalyzer: Tokenizes the string to a set of words identifying acronyms, email addresses, host names, etc., discarding the basic English stop words (a, an, the, to) and stemming the words.
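A quick way to see what an analyzer does is to print the tokens it produces. The following sketch (again Lucene 3.x-era API) feeds a sentence through StandardAnalyzer and prints each term that would end up in the index; the field name is arbitrary:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        TokenStream stream = analyzer.tokenStream("abstract",
                new StringReader("The system sends an email to erik@abc.com"));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString());   // stop words like "the" and "an" are dropped
        }
        stream.end();
        stream.close();
    }
}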

Searching Indexes

Once created, indexes can be searched by providing complex queries that specify the field and the term to be searched. For example, the user query abstract:system AND email:xyz@mit.edu will return all documents that contain system in the abstract and have the email address xyz@mit.edu. Hence, a search on the sample index of the earlier table would return Doc15. Documents that match the query are ranked based on the number of times the terms occur in the document and the number of documents that contain the terms. Lucene implements a ranking mechanism and gives us the flexibility to modify it if required.
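A minimal search over such an index might look like this (Lucene 3.x-era API; the directory path matches the indexing sketch above):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ProceedingsSearcher {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("proceedings-index"));
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));

        // Parse the same kind of query discussed above; "abstract" is the default field.
        QueryParser parser = new QueryParser(Version.LUCENE_34, "abstract",
                new StandardAnalyzer(Version.LUCENE_34));
        Query query = parser.parse("abstract:system AND email:xyz@mit.edu");

        TopDocs hits = searcher.search(query, 10);   // top 10 documents, ranked by score
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("author") + " scored " + hit.score);
        }
        searcher.close();
    }
}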

Saturday 12 November 2011

Efficient way to find the longest unique substring in a given string

The algorithm I am proposing takes O(n) time and O(1) space. A unique substring in this problem is a substring in which all the characters are distinct.

I am assuming that the string contains ASCII characters. ASCII defines 128 characters. If we represent each ASCII character by 1 bit, we need 128 bits, i.e. 16 bytes or 4 integers, to represent them all.

ALGORITHM
It's very simple. Take an array of 4 integers, used as a 128-bit bitmap, initialized to 0. Keep reading characters of the string and check whether each character is already in the current substring by checking whether its corresponding bit in the bitmap is set.
        If it is not set, advance the end pointer of the substring, then check whether the current unique substring is longer than the longest unique substring found so far and update accordingly.
        If it is set, advance the start pointer of the current substring until that bit becomes unset, and continue the loop.

CODE:
Here is the C code for it:

'start' and 'end' mark the current substring being processed.
'mstart' and 'mend' mark the longest substring found so far.
str[] is the given string.
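The original C listing is not reproduced here, so the following is a rough sketch of the same sliding-window idea in Java, keeping the variable names described above (a boolean array stands in for the 4-integer bitmap):

public class LongestUniqueSubstring {
    public static String longestUnique(String str) {
        boolean[] seen = new boolean[128];      // one flag per ASCII character
        int start = 0, mstart = 0, mend = 0;    // mend is exclusive
        for (int end = 0; end < str.length(); end++) {
            char c = str.charAt(end);
            while (seen[c]) {                   // character already in the window: shrink from the left
                seen[str.charAt(start)] = false;
                start++;
            }
            seen[c] = true;                     // extend the window to include str[end]
            if (end + 1 - start > mend - mstart) {
                mstart = start;                 // new longest unique substring found
                mend = end + 1;
            }
        }
        return str.substring(mstart, mend);
    }

    public static void main(String[] args) {
        System.out.println(longestUnique("abcabcbb"));   // prints "abc"
    }
}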


At the end, the substring between mstart and mend is the required substring.


Running partitional clustering using Hadoop MapReduce



Here I will be discussing k-means clustering, which I implemented using Hadoop MapReduce.

The k-means clustering:
We need vectors to represent our data. We also need k centers to start with. These centers are vectors too; sometimes they are just a subset of the input vectors, and sometimes they are random points or points of interest around which we are going to cluster. Since this is a MapReduce version, we have to figure out what keys and values we are using. We are just using vectors, and a vector can be a cluster center as well. So we always treat the cluster-center vectors as keys, and the input vectors are simply values. The clustering itself works like this:

In the map step
 Read the cluster centers into memory from a SequenceFile.
 For each input key/value pair, iterate over all cluster centers.
 Measure the distances and keep the nearest center, i.e. the one with the lowest distance to the vector.
 Write that cluster center together with its vector to the filesystem.

In the reduce step
 Iterate over each value vector and calculate the average vector (sum the vectors and divide each component by the number of vectors received).
 This is the new center; save it into a SequenceFile.
 Check the convergence between the cluster center stored in the key object and the new center.
 If they are not equal, increment an update counter.
 Run this whole job repeatedly until nothing is updated anymore.


LIST PROCESSING
Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program will do this twice, using two different list processing idioms: map, and reduce.

MAPPING LISTS
The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.

REDUCING LISTS
 Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.

MapReduce Data Flow
Now that we have seen the components that make up a basic MapReduce job, we can see how everything works together at a higher level. It is worth noting that a single node can run many mapper processes at the same time.


The distance measurement
We need a measure of the distance between two vectors, in particular between a center and a vector. I've gone with the Manhattan distance because it doesn't require much computational overhead (no square-rooting, unlike the Euclidean distance) and it is not too complex. Let's have a look:
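A sketch of such a DistanceMeasurer, with the vectors simplified to plain double arrays (the real code works on the vector class mentioned later), could be:

public class DistanceMeasurer {

    // Manhattan distance: sum of absolute component differences, no square root needed.
    public static double measureDistance(double[] center, double[] vector) {
        double sum = 0.0;
        for (int i = 0; i < center.length; i++) {
            sum += Math.abs(center[i] - vector[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(measureDistance(new double[] {1, 2}, new double[] {4, 6}));   // 7.0
    }
}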


InputSplits:
An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.

By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits.

The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident. An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can. The on-node parallelism is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
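For example, the minimum split size can be raised on a per-job basis roughly like this (old-style mapred API, as referenced above; the 128 MB value and the class name are only illustrative):

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SplitSizeExample.class);
        // Ask for splits of at least 128 MB instead of the default block-sized chunks.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        System.out.println("mapred.min.split.size = " + conf.get("mapred.min.split.size"));
    }
}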


The Mapper
Let's assume that there is a list, or a list-like SequenceFile-iterating interface, called centers. It contains ClusterCenter objects that represent the current centers. The DistanceMeasurer class contains the static method we defined previously.

It's just looping and measuring, always keeping a reference to the nearest center; afterwards we write it out.
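A sketch of that mapper could look as follows; ClusterCenter and VectorWritable stand for the WritableComparable classes mentioned in the summary below, loading the centers in setup() is left out, and measureDistance is assumed to be adapted to those types (the earlier sketch used plain double arrays):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper
        extends Mapper<ClusterCenter, VectorWritable, ClusterCenter, VectorWritable> {

    // Filled from the centers SequenceFile in setup() (omitted here).
    private final List<ClusterCenter> centers = new ArrayList<ClusterCenter>();

    @Override
    protected void map(ClusterCenter key, VectorWritable value, Context context)
            throws IOException, InterruptedException {
        ClusterCenter nearest = null;
        double nearestDistance = Double.MAX_VALUE;
        for (ClusterCenter center : centers) {
            double distance = DistanceMeasurer.measureDistance(center, value);
            if (distance < nearestDistance) {
                nearest = center;
                nearestDistance = distance;
            }
        }
        // The nearest center becomes the key, so the reducer receives all vectors of one cluster.
        context.write(nearest, value);
    }
}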


The Reducer
 Once again let's have a list, or a list-like SequenceFile-iterating interface, called centers. Here we need it for storage reasons.
 Let me explain:
 The first loop only dumps the values in the iterable into a list and sums up each component of the vectors in a newly created center. Then we average it in another loop, and we write the new center along with each vector we held in memory the whole time. Afterwards we just check whether the center has changed; this method simply delegates to the underlying vector's compareTo. If the centers are not equal it returns true, and in that case we increment a counter.

Controlling Hadoop MapReduce job recursion
First off, you should know that in Hadoop you have counters; you can see them after a job has run, or in the web interface of the JobTracker. "Famous" counters are "Map input records" and "Reduce output records". Best of all, we can set up our own counters, just with the use of enums.
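Putting that together, the reducer might look roughly like this. ClusterCenter, VectorWritable and the averaging helper are assumed, writing the new center into the next iteration's SequenceFile is left out, and the counter usage is explained in the next section:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer
        extends Reducer<ClusterCenter, VectorWritable, ClusterCenter, VectorWritable> {

    public enum UpdateCounter { UPDATED }

    @Override
    protected void reduce(ClusterCenter key, Iterable<VectorWritable> values, Context context)
            throws IOException, InterruptedException {
        // First loop: copy the vectors (Hadoop reuses the value object); summing is folded into average() below.
        List<VectorWritable> vectors = new ArrayList<VectorWritable>();
        for (VectorWritable value : values) {
            vectors.add(new VectorWritable(value));   // copy constructor assumed
        }

        // Average the collected vectors to get the new center (assumed helper).
        ClusterCenter newCenter = ClusterCenter.average(vectors);

        // Second loop: write the new center along with every vector that belongs to it.
        for (VectorWritable vector : vectors) {
            context.write(newCenter, vector);
        }

        // If the center moved, bump the counter that keeps the recursion going.
        if (newCenter.compareTo(key) != 0) {
            context.getCounter(UpdateCounter.UPDATED).increment(1);
        }
    }
}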


How to set up the counter?
The simplest approach is to define an enum like this:

public enum UpdateCounter {
    UPDATED
}

Now you can manipulate the counter using:

context.getCounter(UpdateCounter.UPDATED).increment(1);

"context" is the context object you get in your mapper or your reducer. This line increments your update counter by 1.

 How to fetch the counter?
This is as easy as setting up an enum. You are submitting a job like this:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("Graph explorer");
job.setMapperClass(DatasetImporter.class);
job.setReducerClass(ExplorationReducer.class);
// leave out the stuff with paths etc.
job.waitForCompletion(true);

You can access your counter like this:
long counter = job.getCounters()
                  .findCounter(ExplorationReducer.UpdateCounter.UPDATED)
                  .getValue();


 How to get the recursion running?
 Now that we know how to fetch the counter, setting up the recursion is quite simple. The only thing you should watch out for is paths left over from older job runs.
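A sketch of such a driver loop, with illustrative class names and paths (and the counter from the reducer sketch above), could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        long updated = 1;
        int iteration = 0;
        while (updated > 0) {
            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJobName("k-means iteration " + iteration);
            job.setMapperClass(KMeansMapper.class);
            job.setReducerClass(KMeansReducer.class);
            // input format, key/value classes and input paths are left out, as above

            // clear the output path of an older run before starting the next iteration
            Path out = new Path("/kmeans/depth_" + iteration);
            FileSystem fs = FileSystem.get(conf);
            if (fs.exists(out)) {
                fs.delete(out, true);
            }
            FileOutputFormat.setOutputPath(job, out);

            job.waitForCompletion(true);
            updated = job.getCounters()
                         .findCounter(KMeansReducer.UpdateCounter.UPDATED)
                         .getValue();
            iteration++;
        }
    }
}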


Quick Summary:
 In my code I have taken a small data set, so the number of mapper tasks is 1, because the input file does not exceed the 64 MB limit; by default the input file is split into chunks of 64 MB, though we can very well change this limit. The recursion is controlled by counters as shown in the section above. Initially I took 2 random cluster centers (hence k=2 in k-means clustering). Then I implemented the WritableComparable interface for keys as well as values (i.e. the vector class and the cluster-center class in the code), and implemented the k-means clustering using MapReduce. After running the task, I found that the data set converged in 3 iterations, which is quite small because the data set was very small; hence the beauty of MapReduce could not be appreciated. Had the data set been larger (of the order of 100 MB or more), there would have been a significant performance improvement from running the task using Hadoop MapReduce.