Monday, 14 November 2011

Using Lucene to search Java code


Several websites allow the software development community to share information by publishing developer guidelines, white papers, FAQs, and source code. As the information content grows, with several developers contributing to the repository, websites provide search engines to search all of the information present on the site. While the search engines do a very good job in retrieving text documents, they severely constrain a developer when searching source code. Search engines consider source code files as plain text documents and hence and are no different from a sophisticated grep tool capable of handling large source files.
In this article I propose the approach of using Lucene, the Java-based open source search engine, to search source code by extracting and indexing relevant source code elements. I restrict the search to Java source code only. However, extending the search to any other programming language's source code should not be very different.
The article gives you a brief overview of the important aspects of a search engine in the context of Lucene.

Overview

Lucene is one of the most popular open source search engine libraries. Lucene consists of core APIs that allow indexing and searching of text. Given a set of text files, Lucene can create indexes and allow you to search those indexes with complex queries such as +title:Lucene -content:Search , search AND Lucene , +search +code. Before going into the details of searching, let me introduce you to some of the functional aspects of Lucene.

Indexing Text in Lucene

Search engines scan all of the data that need to be searched and store them in a structure that allows efficient retrieval. The most well-known structure is called the inverted index. For example, consider indexing a set of conference proceedings. First, each paper or document of the proceedings is considered to have distinct parts, or fields, such as title, author, email, abstract, and content. The text of each field is tokenized and keywords or terms are extracted. The inverted index for the proceedings will be stored as shown in the following table.
FieldTermDocument FrequencyDocument
AuthorErik Hatcher1Doc1
Bill Inmon1Doc6
....
Emailerik@abc.com1Doc1
xyz@mit.edu1Doc15
....
Abstractlucene1Doc1
j2ee2Doc6,Doc7
system5Doc1,Doc2,Doc6,Doc15,Doc18
....
For each term in a field, the number of the documents it occurs in (document frequency) and the IDs of the documents in which it occurs are stored. Other details are stored for each term: the number of times it occurs in each document and the position where it occurs. However, all that is important for us is to know that indexing a text file for Lucene means storing it in a format that allows for efficient querying and retrieval.

Analyzing Text to Be Indexed

Lucene processes the text to be indexed with analyzers. Analyzers are used to tokenize text, extract relevant words, discard common words, stem the words (reduce them to the root form, meaning that bowlingbowlerand bowls are reduced to bowl), and perform any other desired processing before storing it into the index. The common analyzers provided by Lucene are:
  • SimpleAnalyzer: Tokenizes the string to a set of words and converts them to lower case.
  • StandardAnalyzer: Tokenizes the string to a set of words identifying acronyms, email addresses, host names, etc., discarding the basic English stop words (a, an, the, to) and stemming the words.

Searching Indexes

Once created, indexes can be searched by providing complex queries specifying the field and the term that needs to be searched. For example, the user query abstract:system AND email:abc@mit.edu will result in all documents that contain system in the abstract and have the email address abc@mit.edu. Hence, a search on the sample index of the earlier table would return Doc15. Documents that match the query are ranked based on the number of times the terms occurs in the document and the number of documents that have the terms. Lucene implements a ranking mechanism and gives us the flexibility to modify it if required.

No comments:

Post a Comment