
Saturday, 3 May 2014

Largest Rectangular Area in a Histogram

Problem :

Find the largest rectangular area possible in a given histogram, where a rectangle can be made of a number of contiguous bars. All bars have the same width of 1 unit.
For example, consider the following histogram with 7 bars of heights {6, 2, 5, 4, 5, 1, 6}. The largest rectangle possible has area 12 (see the figure below; the max-area rectangle is highlighted in red).

[figure: histogram with the max-area rectangle highlighted in red]

Solution:

A straightforward approach is to take each bar in the histogram, find the maximum possible area of a rectangle that includes that bar, and finally take the maximum of these values. This requires O(n^2) time. We will solve the problem in O(n) time instead.

There are a few observations that we can use for this problem :

  1. If we include bar i, the maximum possible height of a rectangle including that bar is h(i), the height of that bar.
  2. If we include bar i, the maximum possible width of a rectangle including that bar is L + R + 1, where :
    1. L is the number of adjacent bars to the left of the ith bar with height greater than or equal to h(i)
    2. R is the number of adjacent bars to the right of the ith bar with height greater than or equal to h(i)

Now our task is as follows:
1) Get L(i)
2) Get R(i)
3) Get the area of the rectangle for bar i : A(i) = h(i) * (L(i) + R(i) + 1)
4) Find the maximum value of A(i)

Here is the Java code :
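What follows is a minimal Java sketch of this approach, not the original listing: a stack finds, for each bar, the nearest smaller bar on each side in O(n) total time, which gives L(i) and R(i) directly (class and variable names are illustrative).

import java.util.ArrayDeque;
import java.util.Deque;

public class Histogram {

    // Largest rectangular area in histogram 'h', using the L/R idea above:
    // for each bar, find how far a rectangle of height h[i] extends left and right.
    static int largestRectangle(int[] h) {
        int n = h.length;
        int[] left = new int[n];   // index of first bar to the left with height < h[i]
        int[] right = new int[n];  // index of first bar to the right with height < h[i]

        Deque<Integer> stack = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {          // left boundaries in O(n)
            while (!stack.isEmpty() && h[stack.peek()] >= h[i]) stack.pop();
            left[i] = stack.isEmpty() ? -1 : stack.peek();
            stack.push(i);
        }
        stack.clear();
        for (int i = n - 1; i >= 0; i--) {     // right boundaries in O(n)
            while (!stack.isEmpty() && h[stack.peek()] >= h[i]) stack.pop();
            right[i] = stack.isEmpty() ? n : stack.peek();
            stack.push(i);
        }

        int maxArea = 0;
        for (int i = 0; i < n; i++) {          // A(i) = h(i) * (L + R + 1)
            int width = right[i] - left[i] - 1;
            maxArea = Math.max(maxArea, h[i] * width);
        }
        return maxArea;
    }

    public static void main(String[] args) {
        // the example from the post; expected output is 12
        System.out.println(largestRectangle(new int[]{6, 2, 5, 4, 5, 1, 6}));
    }
}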

Saturday, 26 October 2013

Posix Threads

In this post I will demonstrate the usage of Posix threads through a very simple problem. I will also demonstrate how we can synchronize Posix threads using a mutex. Though there are many synchronization primitives available, I will stick to a mutex for simplicity. Please note that the code presented here is for demonstration purposes only and is inefficient.

Before starting on the problem, let me give a brief introduction to Posix threads. Historically, hardware vendors developed their own proprietary implementations of threads, and these implementations differed substantially from each other, which made it very difficult to develop portable multi-threaded applications. A standard specification was therefore required to standardize the thread interface, and that interface is Posix threads.

Problem :

A string is to be processed by a given number of threads. The processing in this case is printing the string. Each thread should do a specific amount of work before passing control to the next thread; the work in our case is printing characters. Control is to be passed to the next thread in a round-robin manner until the string is processed completely.

The program should take 3 command-line arguments:
<Program name> <Input String> <Amount of work> <Number of threads>
Here is the desired output for a sample run.
ankit@ankit:~/problems/p1$ ./driver ABCDEFGHILKLMN 2 3
Thread 1 : AB
Thread 2 : CD
Thread 3 : EF
Thread 1 : GH
Thread 2 : IL
Thread 3 : KL
Thread 1 : MN
ankit@ankit:~/problems/p1$

Note that each thread is getting control in a round robin manner. In the beginning, thread 1 printed "AB" since the amount of work specified is 2. Then thread 2 printed "CD", then thread 3 printed "EF". After this, control again came to thread 1.
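Here is one way such a program might look in C. This is a sketch rather than the post's original code: a mutex guards a shared turn variable, and each thread polls with a short sleep (all names are illustrative).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *str;   /* input string */
static int work;          /* characters printed per turn */
static int nthreads;      /* number of worker threads */
static int turn = 0;      /* whose turn it is (0 .. nthreads-1) */
static int pos = 0;       /* next character of str to print */

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (pos >= (int)strlen(str)) {     /* string fully processed */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        if (turn == id) {
            printf("Thread %d : %.*s\n", id + 1, work, str + pos);
            pos += work;
            turn = (turn + 1) % nthreads;  /* pass control round robin */
        }
        pthread_mutex_unlock(&lock);
        usleep(1000);                      /* inefficient polling; see the note below */
    }
}

int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <string> <work> <threads>\n", argv[0]);
        return 1;
    }
    str = argv[1];
    work = atoi(argv[2]);
    nthreads = atoi(argv[3]);

    pthread_t tid[nthreads];
    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return 0;
}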


This program should be compiled with the -lpthread option, e.g.: gcc driver.c -lpthread -o driver

This problem throws light on many basic aspects of Posix threads, like thread creation and thread synchronization. The sleeping part of the program is not efficient; ideally we should make each thread wait and signal it when its turn comes, but sleeping keeps the demonstration simple.

Sunday, 20 October 2013

Subset Problems

In this post, I will discuss some basic programming problems on sets.

Problem Statement:

Given a list of numbers, find all the possible subsets of the list of numbers. For example, if the set of numbers given is {1,2,3}, the output should be {},{1},{2},{3},{1,2},{1,3},{2,3},{1,2,3}.

Solution:

The first solution is recursive and is pretty straightforward. Please follow the Java code given below.
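This is a sketch of such a recursive solution (names are illustrative): at each position we either exclude or include the element, and each leaf of the recursion prints one subset.

import java.util.ArrayList;
import java.util.List;

public class Subsets {

    // Prints all subsets of nums[start..]; 'current' holds the choices made so far.
    static void subsets(int[] nums, int start, List<Integer> current) {
        if (start == nums.length) {
            System.out.println(current);        // one complete subset
            return;
        }
        subsets(nums, start + 1, current);      // exclude nums[start]
        current.add(nums[start]);               // include nums[start]
        subsets(nums, start + 1, current);
        current.remove(current.size() - 1);     // backtrack
    }

    public static void main(String[] args) {
        subsets(new int[]{1, 2, 3}, 0, new ArrayList<>());
    }
}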

The second solution is iterative and avoids the recursion overhead of the first. It works on the principle that a set of n elements has exactly 2^n subsets. The binary representation of an integer i in the range 0 to (2^n - 1) is used to find the corresponding ith subset: a set bit at index j means the jth element is taken from the set. For example, if there are 3 elements in the set, we enumerate the binary representations of all the numbers from 0 to 7, and each representation describes one subset. Here they are :
000 -> Empty set
001 -> Only take 1st element
010 -> Only take 2nd element
011 -> Take 1st and 2nd element
100 -> Only take 3rd element
101 -> Take 1st and 3rd element
110 -> Take 2nd and 3rd element
111 -> Take all elements
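A sketch of this iterative, bitmask-based enumeration (names are illustrative):

public class SubsetsIterative {

    // Prints all 2^n subsets of nums using the bitmask scheme described above.
    static void subsets(int[] nums) {
        int n = nums.length;
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder("{");
            for (int j = 0; j < n; j++) {
                if ((mask & (1 << j)) != 0) {   // bit j set: take the jth element
                    if (sb.length() > 1) sb.append(",");
                    sb.append(nums[j]);
                }
            }
            System.out.println(sb.append("}"));
        }
    }

    public static void main(String[] args) {
        subsets(new int[]{1, 2, 3});
    }
}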



Now let us tweak the above problem statement a little. Here is the new problem statement:

Problem Statement:

Given a list of numbers from 0-9, find the greatest number that can be formed from these numbers such that the number is divisible by three.

Solution:

We can reuse the recursive approach above, together with one more property: a number is divisible by 3 if and only if the sum of its digits is divisible by 3. Since we have to find the greatest number, we should keep as many large digits as possible, so the list needs to be reverse sorted and the smallest digits removed first. Here is the code segment :
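What follows is a sketch rather than the original segment; it reuses the subset enumeration above, sorting the digits in descending order so that each subset directly spells its greatest number, and keeping the largest one whose digit sum is divisible by 3 (names are illustrative; it is exponential, like the subset enumeration itself).

import java.util.Arrays;

public class GreatestMultipleOfThree {

    // Returns the greatest number (as a string) that can be formed from a
    // subset of 'digits' and is divisible by 3, or "" if no such number exists.
    static String greatest(int[] digits) {
        int[] d = digits.clone();
        Arrays.sort(d);
        // reverse into descending order so every subset's digits already
        // spell out the greatest number that subset can form
        for (int i = 0, j = d.length - 1; i < j; i++, j--) {
            int t = d[i]; d[i] = d[j]; d[j] = t;
        }
        String best = "";
        for (int mask = 1; mask < (1 << d.length); mask++) {   // every non-empty subset
            int sum = 0;
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < d.length; i++) {
                if ((mask & (1 << i)) != 0) { sum += d[i]; sb.append(d[i]); }
            }
            if (sum % 3 != 0) continue;                        // digit-sum rule
            String num = sb.toString();
            if (num.length() > best.length()
                    || (num.length() == best.length() && num.compareTo(best) > 0)) {
                best = num;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(greatest(new int[]{3, 1, 4, 1}));   // prints 4311
    }
}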

Sunday, 6 October 2013

Finding Overlapping Rectangles in a given set of Axis-aligned Rectangles

Describe an algorithm that takes an unsorted array of axis-aligned rectangles and returns any pair of rectangles that overlaps, if there is such a pair. Axis-aligned means that all the rectangle sides are either parallel or perpendicular to the x- and y-axis. You can assume that each rectangle object has two variables in it: the x-y coordinates of the upper-left corner and the bottom-right corner.

Solution :

We can use a sweep line algorithm for this. Create a sorted array of the x-coordinates of the left and right edges of all the rectangles. Each x-coordinate should be annotated with the rectangle it belongs to and whether it is the left or the right edge of that rectangle. Now scan the sorted list from left to right, adding each rectangle to a balanced BST when you encounter its left edge and removing it when you encounter its right edge. The solution requires a one-dimensional range search tree: a tree capable of answering range search queries efficiently. The time complexity of a range search query in such a tree is O(R + log N), where N is the number of nodes in the tree and R is the number of elements in the result set.

First let's see how a search query works in a one-dimensional range search tree:
1DRangeSearchTree(k1, k2, v):
    Input : Keys k1 and k2, and a node v of a binary search tree T
    Output : The elements whose keys are greater than or equal to k1 and less than or equal to k2
    Algorithm :
        if v is null
            return {}
        if key(v) >= k1 and key(v) <= k2
            ElementsInLeftSubTree = 1DRangeSearchTree(k1, k2, T.left(v))
            ElementsInRightSubTree = 1DRangeSearchTree(k1, k2, T.right(v))
            return ElementsInLeftSubTree U ElementsInRightSubTree U {element(v)}
        else if key(v) < k1
            return 1DRangeSearchTree(k1, k2, T.right(v))
        else // key(v) > k2
            return 1DRangeSearchTree(k1, k2, T.left(v))

Now let's see the detailed algorithm for the original problem:
1) Sort all left and right rectangle edges, according to their x value, into list L.
2) Create a new empty range search tree T, for the y ordering of rectangle tops/bottoms.
3) Create a new empty result set RS of unique rectangle pairs.
4) Traverse L in ascending order. For each V in L:
   If V.isRightEdge()
      T.remove(V.rectangle.top)
      T.remove(V.rectangle.bottom)
   else
      For all U in T.getRange(V.rectangle.top, V.rectangle.bottom)
         RS.add(<V.rectangle, U.rectangle>)
      T.add(V.rectangle.top)
      T.add(V.rectangle.bottom)
5) Return RS.

Time complexity is O(R + N log N) where R is the actual number of intersections.

Please note that this algorithm does not consider two rectangles as overlapping if one rectangle is completely within the other, since neither y-endpoint of the outer rectangle then falls inside the queried range. However, this can be corrected with a minor tweak to the search in the 1DRangeSearchTree. A sketch of the sweep is given below.
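Here is a Java sketch of the sweep, using a TreeMap in place of the range search tree (its subMap view is the range query); the Rect class and all names are illustrative.

import java.util.*;

public class SweepOverlap {

    static class Rect {
        final int x1, y1, x2, y2;   // upper-left (x1, y1), bottom-right (x2, y2)
        Rect(int x1, int y1, int x2, int y2) { this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2; }
        public String toString() { return "[" + x1 + "," + y1 + " " + x2 + "," + y2 + "]"; }
    }

    static class Event {
        final int x; final boolean right; final Rect r;
        Event(int x, boolean right, Rect r) { this.x = x; this.right = right; this.r = r; }
    }

    static List<Rect[]> findOverlaps(List<Rect> rects) {
        List<Event> events = new ArrayList<>();
        for (Rect r : rects) {
            events.add(new Event(r.x1, false, r));   // left edge
            events.add(new Event(r.x2, true, r));    // right edge
        }
        events.sort(Comparator.comparingInt(e -> e.x));

        // TreeMap plays the role of the 1-D range search tree; a y-coordinate
        // may belong to several rectangles, hence the list values.
        TreeMap<Integer, List<Rect>> active = new TreeMap<>();
        List<Rect[]> result = new ArrayList<>();

        for (Event e : events) {
            if (e.right) {                           // remove both endpoints of the rectangle
                remove(active, e.r.y1, e.r);
                remove(active, e.r.y2, e.r);
            } else {
                int lo = Math.min(e.r.y1, e.r.y2), hi = Math.max(e.r.y1, e.r.y2);
                // range query on [lo, hi]: every active endpoint in this
                // y-range belongs to a rectangle that overlaps the new one
                Set<Rect> hits = new LinkedHashSet<>();   // dedupe rectangles hit twice
                for (List<Rect> bucket : active.subMap(lo, true, hi, true).values())
                    hits.addAll(bucket);
                for (Rect u : hits)
                    result.add(new Rect[]{e.r, u});
                active.computeIfAbsent(e.r.y1, k -> new ArrayList<>()).add(e.r);
                active.computeIfAbsent(e.r.y2, k -> new ArrayList<>()).add(e.r);
            }
        }
        return result;
    }

    private static void remove(TreeMap<Integer, List<Rect>> m, int key, Rect r) {
        List<Rect> l = m.get(key);
        if (l != null) { l.remove(r); if (l.isEmpty()) m.remove(key); }
    }

    public static void main(String[] args) {
        List<Rect> rects = Arrays.asList(
                new Rect(0, 4, 3, 0), new Rect(2, 5, 6, 2), new Rect(7, 3, 9, 1));
        for (Rect[] p : findOverlaps(rects))
            System.out.println(p[0] + " overlaps " + p[1]);
    }
}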

Reference : http://stackoverflow.com/questions/3324880/axisaligned-rectangles-intersection

Saturday, 27 July 2013

Interesting Programming Puzzle - Separate Engineers and Managers

There are n people, of which more than half are engineers and the rest are managers. An engineer always tells the truth; a manager may or may not speak the truth. You can ask any ith person whether the jth person is an engineer.
Find a way to separate the engineers from the managers in the most efficient way.

Solution :

Here's an (n-1)-query solution:
1. Maintain three sets of people: UNSEEN, STACK, and DISCARD.
2. Initialize the process by picking one person arbitrarily to be the STACK; everyone else is UNSEEN.
3. Repeat the following step until UNSEEN is empty:
        a. Pick an UNSEEN element x and remove it from UNSEEN.
        b. Ask the top of the STACK, y, about x. If y says "manager", pop y off the stack and DISCARD both x and y. If y says "engineer", push x onto the STACK.

After all elements have been processed in this way (n-1 questions), the top of the stack must be an engineer. Once we have an engineer, we can separate out engineers and managers by asking that engineer about everyone else, since an engineer always speaks the truth. A sketch is given below.
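This Java sketch expresses the stack phase against a hypothetical question-asking oracle (the Oracle interface and all names are assumptions for illustration):

import java.util.ArrayDeque;
import java.util.Deque;

public class FindEngineer {

    /** Hypothetical oracle: does person i claim person j is an engineer? */
    interface Oracle { boolean saysEngineer(int i, int j); }

    // Returns the index of someone guaranteed to be an engineer,
    // using at most n - 1 questions (the stack algorithm above).
    static int findEngineer(int n, Oracle ask) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);                          // person 0 starts on the stack
        for (int x = 1; x < n; x++) {           // x plays the role of the UNSEEN element
            if (stack.isEmpty()) { stack.push(x); continue; }
            int y = stack.peek();
            if (ask.saysEngineer(y, x)) stack.push(x);  // "engineer": keep x on the stack
            else stack.pop();                   // "manager": discard both x and y
        }
        return stack.peek();                    // top of the stack is an engineer
    }

    public static void main(String[] args) {
        boolean[] engineer = {true, false, true, true, false};  // hypothetical roles
        // engineers answer truthfully; managers here happen to always lie
        Oracle ask = (i, j) -> engineer[i] ? engineer[j] : !engineer[j];
        System.out.println(findEngineer(engineer.length, ask)); // prints 2, an engineer
    }
}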

Why does this work? 

First observe that whenever we discard a pair, at least one of them is a manager (check the cases: whatever y is, a "manager" answer means x and y cannot both be engineers). So among the rest (STACK and UNSEEN), engineers must still be in the majority. Also, everyone on the stack was called an engineer by the person beneath them, so if anyone on the stack is an engineer, everyone above them must be an engineer too. At the end, when UNSEEN is empty, the stack must contain an engineer, and therefore the top of the stack must be an engineer.

This can be improved to n-2 questions simply by stopping one step earlier: when there is one UNSEEN person left, if the stack is empty, that person is an engineer; otherwise, the top of the stack is an engineer. If n is even and n >= 4, as a first step we can just throw out one person and apply this algorithm to the remaining n-1 people, obtaining n-3 questions. This gives the optimal algorithm.

Saturday, 2 March 2013

Bit Manipulation : Find the 2 numbers repeated an odd number of times in an array where all other numbers are repeated an even number of times

Today I will be sharing a problem which can be solved very efficiently using bit manipulation.

Problem :
Given an array of integers in which every number is repeated an even number of times, except two numbers which are repeated an odd number of times. The range of the integers is not given. Find the two numbers which are repeated an odd number of times.

Solution :
It can be solved in O(n) time and O(1) space. The basic concept here is x ^ x = 0. Here's the algorithm :

1) XOR all the elements in the array. This nullifies all the elements which are repeated an even number of times, so we are left with the XOR of the two elements which are repeated an odd number of times.

2) Now divide the elements into 2 sets such that the two numbers repeated an odd number of times go into different sets. Dividing the elements this way is easy, given that the set bits in the XOR of two numbers are exactly the bits in which the two numbers differ. From step 1 we have the XOR of the two numbers repeated an odd number of times; pick any set bit in it, and divide all the elements of the array into 2 sets depending on whether that bit is set.

3) XOR all the elements within each set. This yields the two elements repeated an odd number of times, since within each set the elements repeated an even number of times nullify each other.

Here is the code snippet :
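What follows is a Java sketch of the three steps rather than the original snippet (the class and array names are illustrative; a set bit is located with a simple scan, which the note below improves on):

public class TwoOddOccurring {

    // Returns the two numbers that occur an odd number of times in 'arr'.
    static int[] find(int[] arr) {
        int xorAll = 0;
        for (int v : arr) xorAll ^= v;   // step 1: even-count elements cancel out

        int setBit = 1;                  // scan for any set bit in xorAll
        while ((xorAll & setBit) == 0) setBit <<= 1;

        int result1 = 0, result2 = 0;    // steps 2 and 3: partition on the bit, XOR each set
        for (int v : arr) {
            if ((v & setBit) != 0) result1 ^= v;
            else result2 ^= v;
        }
        return new int[]{result1, result2};
    }

    public static void main(String[] args) {
        int[] r = find(new int[]{4, 2, 4, 7, 2, 3, 3, 7, 7, 3});
        System.out.println(r[0] + " " + r[1]);   // prints 7 3
    }
}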
Now result1 and result2 hold the two numbers which are repeated an odd number of times.

There is a trick to isolate the lowest set bit of a number directly. Suppose x is a number and we wish to find the lowest bit position which is set. We can do so by:
x &= ~(x - 1);    /* equivalently: x &= -x; */
Subtracting 1 flips the lowest set bit and all the bits below it, so ANDing x with the complement leaves only the lowest set bit. The above code snippet can be made more efficient by employing this rule in place of the bit-scanning loop.



Saturday, 12 November 2011

Efficient way to find largest unique substring from a given string

The algorithm which I am proposing takes O(n) time and O(1) space. A unique substring in this problem is a substring in which all characters are distinct.

I am assuming that the string contains ASCII characters. ASCII defines 128 characters. If we represent each ASCII character by 1 bit, we require 128 bits, i.e. 16 bytes or 4 32-bit integers, to represent them all.

ALGORITHM
It's very simple. Take a "hash table" of 4 integers (a 128-bit bitmap), initialized to 0. Keep reading characters of the string, and check whether each character is already in the current substring by checking whether the corresponding bit in the hash table is set.
        If it is not set, advance the end pointer of the substring, then check whether the current unique substring is longer than the maximum-length unique substring found so far and update accordingly.
        If it is set, advance the start pointer of the current substring until that bit becomes unset, and continue the loop.

CODE:
Here is the C code for it :

'start' and 'end' mark the current substring being processed.
'mstart' and 'mend' mark the longest substring found so far.
str[] is the given string.
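This is a sketch along the lines described rather than the original listing; the windows [start, end) and [mstart, mend) are half-open, and the example input is illustrative.

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char *str = (argc > 1) ? argv[1] : "abcdefabcg";   /* the given string */
    unsigned int hash[4] = {0, 0, 0, 0};   /* 128 bits, one per ASCII character */
    int start = 0, end = 0;                /* current window [start, end)       */
    int mstart = 0, mend = 0;              /* best window found so far          */
    int n = (int)strlen(str);

    while (end < n) {
        unsigned char c = str[end];
        while (hash[c / 32] & (1u << (c % 32))) {     /* c already in the window */
            unsigned char s = str[start++];
            hash[s / 32] &= ~(1u << (s % 32));        /* shrink from the left */
        }
        hash[c / 32] |= 1u << (c % 32);               /* add c to the window */
        end++;
        if (end - start > mend - mstart) { mstart = start; mend = end; }
    }
    printf("largest unique substring: %.*s\n", mend - mstart, str + mstart);
    return 0;
}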


At the end, the substring between mstart and mend is the required substring.


Running partitional clustering using Hadoop MapReduce



Here I will be discussing k-means clustering, which I implemented using Hadoop MapReduce.

The k-means clustering:
We need vectors to represent our data. We also need k centers to start with. These centers are vectors too; sometimes they are just a subset of the input vectors, and sometimes they are random points or points of interest around which we are going to cluster. Since this is a MapReduce version, we have to figure out what keys and values we are using. We are just using vectors: a vector can be a cluster center as well. So we always treat our cluster-center vectors as keys, and the input vectors are simple values. The clustering itself works like this:

In the map step:
 Read the cluster centers into memory from a SequenceFile.
 Iterate over each cluster center for each input key/value pair.
 Measure the distances and save the nearest center, i.e. the one with the lowest distance to the vector.
 Write the cluster center with its vector to the filesystem.

In the reduce step:
 Iterate over each value vector and calculate the average vector (sum each vector and divide each component by the number of vectors received).
 This is the new center; save it into a SequenceFile.
 Check the convergence between the cluster center that is stored in the key object and the new center.
 If they are not equal, increment an update counter.
 Run this whole thing until nothing is updated anymore.


LIST PROCESSING
Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program will do this twice, using two different list processing idioms: map and reduce.

MAPPING LISTS
The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.

REDUCING LISTS
 Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.

MapReduce Data Flow
Now that we have seen the components that make up a basic MapReduce job, we can see how everything works together at a higher level. Note that a single node can run many mapper processes at the same time.


The distance measurement
We need a measure of the distance between two vectors, especially between a center and a vector. I've come up with the Manhattan distance because it doesn't require much computational overhead, unlike the square-rooting needed for the Euclidean distance, and it is not too complex. Let's have a look:
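A sketch of that measurement; ClusterCenter and Vector are the classes named later in this post, and the accessors used here (getCenter(), getVector() returning a double[]) are assumptions.

public class DistanceMeasurer {

    // Manhattan distance: sum of absolute component differences
    // between the center's vector and the given vector.
    public static double measureDistance(ClusterCenter center, Vector v) {
        double[] c = center.getCenter().getVector();   // assumed accessors
        double[] x = v.getVector();
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.abs(c[i] - x[i]);
        }
        return sum;
    }
}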


InputSplits:
An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.

By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, tasks can be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits.

The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident. An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can. The on-node parallelism is controlled by the mapred.tasktracker.map.tasks.maximum parameter.


The Mapper
Let's assume that there is a list (or a list-like SequenceFile-iterating interface) called centers. It contains ClusterCenter objects that represent the current centers. The DistanceMeasurer class contains the static method we defined previously.
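A sketch of what such a mapper might look like; ClusterCenter and Vector are the post's writable classes, and the setup logic for loading centers is only outlined.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper
        extends Mapper<ClusterCenter, Vector, ClusterCenter, Vector> {

    private List<ClusterCenter> centers;   // current centers, loaded in setup()

    @Override
    protected void setup(Context context) {
        centers = new ArrayList<>();
        // read the current centers from the SequenceFile into 'centers'
        // (omitted here; see the description above)
    }

    @Override
    protected void map(ClusterCenter key, Vector value, Context context)
            throws IOException, InterruptedException {
        ClusterCenter nearest = null;
        double nearestDistance = Double.MAX_VALUE;
        for (ClusterCenter c : centers) {            // measure against every center
            double d = DistanceMeasurer.measureDistance(c, value);
            if (nearest == null || d < nearestDistance) {
                nearest = c;
                nearestDistance = d;
            }
        }
        context.write(nearest, value);               // emit the nearest center with the vector
    }
}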

It's just looping and measuring, always keeping a reference to the nearest center; afterwards we write the chosen center and the vector out.


The Reducer
Once again, let's assume a list (or a list-like SequenceFile-iterating interface) called centers; here we need it for storage reasons. A sketch of such a reducer:
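As with the mapper, this is a sketch: ClusterCenter and Vector are the post's classes, and the helpers used here (a copy constructor, addVector, divideByK, equals) are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer
        extends Reducer<ClusterCenter, Vector, ClusterCenter, Vector> {

    public enum UpdateCounter { UPDATED }

    @Override
    protected void reduce(ClusterCenter key, Iterable<Vector> values, Context context)
            throws IOException, InterruptedException {
        List<Vector> vectors = new ArrayList<>();
        ClusterCenter newCenter = new ClusterCenter();
        for (Vector v : values) {
            vectors.add(new Vector(v));    // copy: Hadoop reuses the object between iterations
            newCenter.addVector(v);        // sum up each component (assumed helper)
        }
        newCenter.divideByK(vectors.size());   // average: divide each component by the count

        for (Vector v : vectors)
            context.write(newCenter, v);       // write the new center with each vector

        if (!newCenter.equals(key))            // center moved: not converged yet
            context.getCounter(UpdateCounter.UPDATED).increment(1);
    }
}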
 Let me explain:
The first loop only dumps the values in the iterable into a list and sums up each component of the vectors into a newly created center. Then we average the components in another loop, and we write the new center along with each vector we held in memory the whole time. Afterwards we just check whether the center has changed; this check simply delegates to the underlying vector's compareTo. If the centers are not equal, it returns true, and we increment an update counter.

Controlling Hadoop MapReduce job recursion

First off, you should know that in Hadoop you have counters. You may see them after a job has run, or in the web interface of the JobTracker. "Famous" counters are "Map input records" and "Reduce output records". The best of all is that we can set up our own counters, just with the use of enums.


How to set up a counter?
The simplest approach is to just define an enum like this:

public enum UpdateCounter {
    UPDATED
}

Now you can manipulate the counter using:

context.getCounter(UpdateCounter.UPDATED).increment(1);

"context" is the context object you get from your mapper or your reducer. This line will increment your update counter by 1.

How to fetch the counter?
This is as easy as setting up an enum. You submit a job like this:

Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("Graph explorer");
job.setMapperClass(DatasetImporter.class);
job.setReducerClass(ExplorationReducer.class);
// leave out the stuff with paths etc.
job.waitForCompletion(true);

You can access your counter like this:

long counter = job.getCounters()
        .findCounter(ExplorationReducer.UpdateCounter.UPDATED)
        .getValue();


How to get the recursion running?
Now we know how to fetch the counter, so setting up the recursion is quite simple. The only thing to watch out for is output paths left over from older job runs. A sketch of such a driver loop is given below.
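This driver is an illustrative sketch tied to the mapper and reducer sketches above; path handling and the remaining job configuration are omitted, as in the snippet earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        long updated = 1;
        int iteration = 0;
        while (updated > 0) {   // re-run jobs until the reducer reports no center moved
            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJobName("k-means iteration " + iteration);
            job.setMapperClass(KMeansMapper.class);
            job.setReducerClass(KMeansReducer.class);
            // leave out the stuff with paths etc.; delete output paths
            // left over from older runs before starting the job
            job.waitForCompletion(true);
            updated = job.getCounters()
                    .findCounter(KMeansReducer.UpdateCounter.UPDATED)
                    .getValue();
            iteration++;
        }
    }
}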


Quick Summary:
In my code I have taken a small data set, so the number of mapper tasks is 1: the input file does not exceed the 64 MB limit, since by default the input file is split into chunks of 64 MB (though we can very well change this limit). The recursion is controlled by counters as shown in the section above. Initially I took 2 random cluster centers (hence k = 2 in the k-means clustering). Then I implemented the WritableComparable interface for keys as well as values (the vector class and the cluster-center class in the code), and implemented k-means clustering using MapReduce. After running the task, I found that the data set converged in 3 iterations, which is quite small because the data set was very small; hence the beauty of MapReduce could not be appreciated. Had the data set been larger (of the order of 100 MB or more), there would have been a significant performance improvement from running the task using Hadoop MapReduce.