Idle Process: August 2012

Saturday, 25 August 2012

Thread Locals in Java

Thread Locals

Thread local is a scope, just like the ones we already have: static variables belong to a class, instance variables belong to an object, and thread local variables belong to a thread. Each thread has its own copy of a thread local variable, so threads can't modify each other's copies. Thread local variables are a sort of global variable whose visibility is restricted to a single thread.

When do we use Thread Local variables?

By default, data is shared between threads. You can refer to my previous post to get an idea of what is shared among threads. You can use Thread Local variables when you want each thread to have its own copy of something. One very important use case of thread locals is when we have an object that is not thread safe, but we want to avoid synchronizing access to that object. An example makes this clearer.

Suppose I have a requirement to use a Java Calendar object in my code. Since Calendar is not thread safe, we can either keep a Calendar object as an instance variable, or keep it as a class variable and synchronize access to it. The first approach is unsuitable for most production code because creating a Calendar object is an expensive operation, and if two threads of the same process access the same instance we still need synchronization. The second approach looks better since there is only one Calendar object, but we have to take care of synchronization ourselves.
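The second approach could look roughly like the sketch below. The class name SharedCalendar and the method yearOf are hypothetical names picked for illustration; the point is only that every caller has to go through the same lock.

```java
import java.util.Calendar;
import java.util.Date;

class SharedCalendar {
    // One Calendar for the whole class: created once, but not thread safe.
    private static final Calendar CALENDAR = Calendar.getInstance();

    // Every access takes the same lock, so threads use the Calendar one at a time.
    static int yearOf(Date date) {
        synchronized (CALENDAR) {
            CALENDAR.setTime(date);
            return CALENDAR.get(Calendar.YEAR);
        }
    }

    public static void main(String[] args) {
        System.out.println(yearOf(new Date()));
    }
}
```

This is correct, but under load the threads queue up on the lock, which is the cost we are trying to avoid.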

If we don't want to bother with synchronization, we can use a thread local variable, in which case each thread is given its own copy. Another alternative to thread locals or synchronization is to make the variable a local variable. Local variables are always thread safe. But in our case a local variable is not recommended: a Calendar object would be created each time the method is called, and since that is an expensive operation, it would slow the method down.

Another very important use case of Thread Local variables is when we want to associate state with a thread. Many frameworks use ThreadLocals to maintain some context related to the current thread. Web applications might store information about the current request and current session in thread local variables so that the application has easy access to them without passing them as parameters every time. Let me explain this with a scenario.

Let's say we have a Servlet which calls some methods. You have a requirement to generate a unique transaction ID for each request you receive and pass this transaction ID to the business methods for processing. One way is to generate a unique transaction ID each time the servlet receives a request and pass this transaction ID to the methods which require it. But this doesn't look good, since passing the transaction ID to every method which requires it is redundant and unnecessary. Instead, we can use a thread local variable to store the transaction ID. Every method which requires the transaction ID can access it through the thread local variable. The servlet might be receiving many requests, but each request is processed in a separate thread. So, each transaction ID is local to a thread and is accessible all through that thread's execution, which is what I mean when I say that Thread Local variables are global.
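Here is a minimal sketch of that scenario without a real servlet container. The class TransactionDemo and its methods handle, process and currentId are hypothetical stand-ins: handle plays the role of the servlet, process the role of a business method.

```java
import java.util.UUID;

class TransactionDemo {
    // Each thread reads and writes its own slot; a thread that never set it reads null.
    private static final ThreadLocal<String> TRANSACTION_ID = new ThreadLocal<>();

    static String currentId() {
        return TRANSACTION_ID.get();
    }

    // Stand-in for a business method: it takes no transaction ID parameter.
    static String process(String payload) {
        return currentId() + ":" + payload;
    }

    // Stand-in for the servlet: set the ID, do the work, always clear the ID afterwards.
    static String handle(String payload) {
        TRANSACTION_ID.set(UUID.randomUUID().toString());
        try {
            return process(payload);
        } finally {
            TRANSACTION_ID.remove();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable request = () -> System.out.println(
                Thread.currentThread().getName() + " -> " + handle("order"));
        Thread first = new Thread(request, "request-1");
        Thread second = new Thread(request, "request-2");
        first.start();
        second.start();
        first.join();
        second.join();
    }
}
```

Each simulated request prints a different ID, even though process never receives one as an argument.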

Usage of Thread Local variables in Java

Java provides a class named ThreadLocal through which you can set and get Thread Local variables.
Typically Thread Local variables are static fields in classes. The code below shows you how to create a Thread Local variable.
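A minimal sketch of such a declaration, continuing the Calendar example. The class name CalendarHolder is hypothetical, and ThreadLocal.withInitial assumes Java 8 or later (on older versions you would subclass ThreadLocal and override initialValue() instead).

```java
import java.util.Calendar;

class CalendarHolder {
    // A static field, yet each thread that calls get() lazily receives its own Calendar.
    private static final ThreadLocal<Calendar> CALENDAR =
            ThreadLocal.withInitial(() -> Calendar.getInstance());

    static Calendar get() {
        return CALENDAR.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Calendar mainCopy = get();
        Calendar[] workerCopy = new Calendar[1];
        Thread worker = new Thread(() -> workerCopy[0] = get());
        worker.start();
        worker.join();
        // Same thread: same instance. Different thread: different instance.
        System.out.println(mainCopy == get());          // true
        System.out.println(mainCopy == workerCopy[0]);  // false
    }
}
```

The Calendar is created once per thread rather than once per call, and no synchronization is needed because no two threads ever touch the same instance.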

Problems with Thread Locals

Thread Locals also come with problems of their own, and you have to be careful while using them. Thread Locals can lead to class loader leaks. They are especially dangerous in long running applications, where they can keep objects from being garbage collected. Let me explain this point a little.

If you use thread locals to store some object instance, there is a high risk that the object is never collected by the garbage collector when your application runs inside WebLogic Server. This is because WebLogic Server maintains a pool of worker threads that stay alive even after the code that set the thread local is no longer in use. So, if you do not clean up when you are done, any references the thread local holds as part of the webapp deployment remain on the heap and are never garbage collected. This problem can be solved through the proper use of Weak References with Thread Locals.
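The effect is easy to reproduce on a small scale. The sketch below is not WebLogic; it assumes a plain single-threaded ExecutorService as a stand-in for the server's worker pool, and the class ThreadLocalCleanup with its methods is hypothetical. It shows that a value left behind by one task is still attached to the pooled thread when the next task runs, unless the first task calls remove().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalCleanup {
    private static final ThreadLocal<byte[]> BUFFER = new ThreadLocal<>();

    // Returns true if this thread still holds a value left over from an earlier task.
    static boolean runTask(boolean cleanUp) {
        boolean stale = BUFFER.get() != null;
        BUFFER.set(new byte[1024]);
        try {
            // ... work with the buffer would go here ...
            return stale;
        } finally {
            if (cleanUp) {
                BUFFER.remove(); // detach the value from the pooled thread
            }
        }
    }

    // Runs two tasks one after the other on the same pooled thread
    // and reports what the second task observed.
    static boolean secondTaskSeesStaleValue(boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = () -> runTask(cleanUp);
            pool.submit(task).get();
            return pool.submit(task).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(secondTaskSeesStaleValue(false)); // true
        System.out.println(secondTaskSeesStaleValue(true));  // false
    }
}
```

Calling remove() in a finally block is the simplest cleanup habit: the value is released even if the task throws.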

Sunday, 19 August 2012

Processes and Threads

In this post I discuss processes and threads.

Process

A process is an instance of a program that is being executed. A process consumes operating system resources. Since many processes run at the same time, how does the OS manage its resources? To manage processes, an operating system keeps a process table. A process table is a data structure that holds, for each process, information such as the following:

Process ID
Process Owner
Process priority
Pointer to the executable code of the process
Parent Process
Environment variables
Process state
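Some of these fields can be inspected from a running Java program. The sketch below assumes Java 9 or later, where the ProcessHandle API is available; the class ProcessInfoDemo and its describe method are hypothetical. Not every field in the list is exposed this way: priority, for instance, is not, and the state is reduced to an alive flag.

```java
class ProcessInfoDemo {
    // Builds a one-line summary of what the JVM can see about a process.
    static String describe(ProcessHandle handle) {
        ProcessHandle.Info info = handle.info();
        return "pid=" + handle.pid()
                + ", owner=" + info.user().orElse("unknown")
                + ", executable=" + info.command().orElse("unknown")
                + ", parent=" + handle.parent().map(p -> String.valueOf(p.pid())).orElse("none")
                + ", alive=" + handle.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(describe(ProcessHandle.current()));
        System.out.println("environment variables: " + System.getenv().size());
    }
}
```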

A process can have many threads of execution. By default, a running program has a single thread of execution. A process has its own address space, which is generally not shared with any other process; the operating system can relax this restriction for inter-process communication, for example through shared memory.

Threads

A thread is the smallest unit of execution that can be scheduled by an OS. A thread is called a lightweight process because thread creation can be 10-100 times faster than process creation. This is because threads share an address space, unlike processes, which do not. Here I am talking about the threads of the same process. Threads of different processes, of course, do not share an address space. The main reason for having threads is that in many applications, many activities are going on at the same time. Some of these activities may block from time to time. By decomposing such an application into multiple threads, we can increase performance. On a single processor, threads yield no performance gain when all of them are CPU bound, but when there is a substantial amount of I/O as well as computing, having threads allows these activities to overlap, thus speeding up the application. Threads also allow parallel execution on a multiprocessor system. In this case, the programmer needs to be careful to avoid race conditions.
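The overlap of blocking activities can be seen in a small experiment. In the sketch below, Thread.sleep is an assumed stand-in for a blocking I/O call, and the class OverlapDemo and its methods are hypothetical. It times the same blocking calls run one after another and then run in separate threads.

```java
class OverlapDemo {
    // Stand-in for a blocking I/O call: the thread waits without using the CPU.
    static void blockingCall(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs the blocking calls one after another in the calling thread.
    static long sequentialMillis(int tasks, long millisEach) {
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            blockingCall(millisEach);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Runs each blocking call in its own thread so the waits overlap.
    static long threadedMillis(int tasks, long millisEach) throws InterruptedException {
        long start = System.nanoTime();
        Thread[] threads = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            threads[i] = new Thread(() -> blockingCall(millisEach));
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sequential: " + sequentialMillis(4, 100) + " ms");
        System.out.println("threaded:   " + threadedMillis(4, 100) + " ms");
    }
}
```

The sequential run takes roughly the sum of the waits, while the threaded run takes roughly the longest single wait. The exact numbers depend on the machine.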

A thread has the following information:

Thread ID
Program Counter
Register Set
Stack

So what does the thread share with other threads of the same process?

Code section
Data section
OS resources
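The split between private and shared state can be sketched in a few lines. The class SharedVsPrivate is hypothetical: the static array plays the role of the shared data section, while the local variable inside sumTo lives on the stack of whichever thread calls it.

```java
class SharedVsPrivate {
    // Shared data: one copy, visible to every thread of the process.
    static final int[] results = new int[2];

    static int sumTo(int n) {
        int total = 0; // local variable: lives on the calling thread's own stack
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    static int[] run() throws InterruptedException {
        // Each thread writes to its own slot, so the two writes do not conflict.
        Thread a = new Thread(() -> results[0] = sumTo(10));
        Thread b = new Thread(() -> results[1] = sumTo(100));
        a.start();
        b.start();
        a.join();
        b.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]); // 55 5050
    }
}
```

The main thread reads what the two workers wrote without any copying, because all three threads see the same array.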

Let's talk about the advantages of threads.

Thread Advantages:

Thread creation and destruction are faster than process creation and destruction.
A thread has lower context switching overhead than a process. This is because a thread carries less context than a process, since threads share an address space. Remember that I am talking about threads of the same process.
Information sharing between threads is easier and has less overhead because threads share address space. So, data produced by one thread is immediately available to all other threads of the same process.

Thread Disadvantages:

Since global variables are shared between threads, inadvertent modification of shared variables can be disastrous. This calls for concurrency control measures, which have their own complications.
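One such measure in Java is the synchronized keyword. The sketch below uses a hypothetical SafeCounter class: several threads increment one shared counter, and because both methods are synchronized, no increment is lost. Without synchronized, value++ is a read, an add and a write, and two threads can interleave those steps and overwrite each other's update.

```java
class SafeCounter {
    private int value; // shared between all threads that hold this counter

    synchronized void increment() {
        value++;
    }

    synchronized int get() {
        return value;
    }

    static int countWithThreads(int threads, int incrementsEach) throws InterruptedException {
        SafeCounter counter = new SafeCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsEach; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWithThreads(4, 100_000)); // 400000
    }
}
```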

Types of Thread Implementations:

There are 3 types of thread implementations:

User Level Threads
Kernel Level Threads
Hybrid implementation

User Level Threads:

This type of thread implementation puts the thread package entirely in user space. The kernel is not aware of threads; it just knows that it is managing single threaded processes. Just as an OS maintains a process table, a process maintains a thread table, which does the same job for the process as the process table does for the operating system. Each process has its own private thread table. In this implementation, when a thread wants to go to the blocked state, it notifies the run time system. The run time system saves the thread state in the thread table and looks for a ready thread in the thread table to run. So with user level threads, we don't trap to the kernel to switch threads. This is at least an order of magnitude faster than trapping to the kernel, as happens with kernel level threads.

The main problem with this type of thread implementation is that if any user-level thread blocks in the kernel, all threads of that process are blocked. Another problem with user-level threads is that we don't take advantage of multiprocessing, since the kernel is not aware of any threads.
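The thread table and run time system can be modelled with a toy scheduler. This is only an illustration under heavy assumptions: the class UserLevelScheduler is hypothetical, each "thread" is just a list of named steps, and the saved state is the position in that list. A real user-level thread package saves registers and a stack instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class UserLevelScheduler {
    // One entry in the thread table: a name plus the saved position in the thread's work.
    static final class Entry {
        final String name;
        final Iterator<String> steps; // remembers where this "thread" left off

        Entry(String name, List<String> steps) {
            this.name = name;
            this.steps = steps.iterator();
        }
    }

    // The thread table, kept as a queue of ready threads.
    private final Deque<Entry> ready = new ArrayDeque<>();

    void spawn(String name, String... steps) {
        ready.addLast(new Entry(name, Arrays.asList(steps)));
    }

    // The run time system: run one step of a ready thread, then switch to the next one.
    // Everything happens in user space, inside a single kernel-visible thread.
    List<String> run() {
        List<String> trace = new ArrayList<>();
        while (!ready.isEmpty()) {
            Entry current = ready.pollFirst();
            if (current.steps.hasNext()) {
                trace.add(current.name + ":" + current.steps.next());
            }
            if (current.steps.hasNext()) {
                ready.addLast(current); // still has work: back into the table
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        UserLevelScheduler scheduler = new UserLevelScheduler();
        scheduler.spawn("A", "read", "compute");
        scheduler.spawn("B", "send");
        System.out.println(scheduler.run()); // [A:read, B:send, A:compute]
    }
}
```

The model also shows the weakness described above: if any single step made a blocking system call, the whole loop, and with it every "thread", would stop.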

Kernel Level Threads:

In this type of thread implementation, every thread in a process is mapped to a kernel level thread. Switching between threads requires a switch into kernel mode, which is expensive. When a thread blocks, the kernel, at its option, can run either another thread from the same process or a thread from a different process. With user level threads, by contrast, the run time system keeps running threads from its own process until the kernel takes the CPU away from it.

Saturday, 4 August 2012

Website Parsing

Today I am writing about a website parser that I wrote. In this post, I show you how to parse the www.cricbuzz.com website, but the logic is nearly the same for other websites. You can play around by changing the logic according to your needs. I have used Python in my code, so you need to know Python to follow this post.

So let's begin parsing the cricbuzz site. First go to the website, then to the page that you want to parse. Suppose I want to parse the ongoing England vs South Africa Test match. Go to the Full scorecard page of cricbuzz as shown below:

If you are using the Google Chrome browser, press Ctrl + Shift + J to open the developer tools. You will see a new split window with some tabs, as shown below:

Then click on the Network Tab:

Then click on scorecard.json which is highlighted in the above picture.

We see that this site uses JSON, a lightweight data interchange format, to send the data. JSON is easy for machines to parse and generate. It is based on a subset of JavaScript syntax. Now you can use your own logic to parse the site. I will be using Python's json package to parse the JSON content. Let's start with the code. First import the json package. We need the URL of the JSON page to begin parsing, so get the URL by right clicking on scorecard.json. You can check the URL by pasting it into your web browser. You should see a page like this:

We need to fetch the data from this URL to begin parsing. We can use the urllib2 package (Python 2) for this task. The following statement fetches the data and parses it into the result object, where URL is the copied URL:

import json, urllib2
result = json.load(urllib2.urlopen(URL))

The logic I have used is that if the score has changed after 20 minutes, the script sends an email to the person. To handle the email part, we use Python's smtplib package.

So, here is the complete code:

The code I have used has very little practical value, but the idea was to make the concept clear. If the idea is clear, you can play around with the logic. :)
