Tuesday 7 February 2012

What Hadoop is not meant for

I find a lot of people who think that Hadoop will solve all their data center problems. They would be disappointed to learn that it is not so. Hadoop solves only a specific class of problems, commonly called big data problems. There is a lot of fuss in the IT world about big data, but the idea is simple: big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. The difficulties include capture, storage, search, sharing, analysis, and visualization. One defining feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages; it requires instead massively parallel software running on tens, hundreds, or even thousands of servers.


Hadoop is not a substitute for a database. In an indexed database, issuing a SQL SELECT statement gives you the result in milliseconds, and if you want to change data, an UPDATE statement comes to your rescue. Hadoop can't do any of this. The important point is that Hadoop stores data in files and does not index them. If you want to find something, you have to run a MapReduce job that goes through all the data. This takes time, which means you can't directly use Hadoop as a replacement for a database. And writing a MapReduce job is not easy: it requires professional MapReduce programmers, and it takes time to write MapReduce programs.
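To make the contrast concrete, here is a rough sketch of what a simple "find all records containing a term" looks like as a map-only MapReduce job (using the org.apache.hadoop.mapreduce API; the class name GrepJob and the configuration key grep.term are invented for illustration, and depending on your Hadoop version you may need new Job(conf, ...) instead of Job.getInstance). Every record in every input file gets read, because there is no index to consult.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A map-only job that scans every input record and keeps the ones
// containing the search term -- roughly the Hadoop equivalent of
// "SELECT * FROM t WHERE col LIKE '%term%'", but with no index to help.
public class GrepJob {

    public static class GrepMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private String term;

        @Override
        protected void setup(Context context) {
            // The search term is passed in through the job configuration.
            term = context.getConfiguration().get("grep.term");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Every single record in the input is examined; there is no
            // way to jump straight to the matching rows.
            if (value.toString().contains(term)) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("grep.term", args[2]);         // term to look for

        Job job = Job.getInstance(conf, "grep");
        job.setJarByClass(GrepJob.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);               // map-only: just filter
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a jar, submit it with hadoop jar, and then wait while the cluster churns through the whole dataset: a lookup that an indexed database answers in milliseconds becomes a batch job.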

Hadoop works fine when the data is too big for a database (when it has exceeded the physical limits of a database, not just when you want to save money). With very large databases, the cost of generating indexes is so high that you can't easily index changing data. There is another problem when many machines are trying to write to the database: you run into trouble getting locks. This is where Hadoop comes in, writing the data as plain files in a cluster. On top of Hadoop we now have Hive, which organizes that data into tables (with columnar formats such as RCFile) and lets you query it with a SQL-like language.
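As a rough sketch of that write path (the directory /data/events/ and the class name HdfsEventWriter are invented for illustration), each machine can simply create its own file in HDFS and stream records into it, so there is no shared table or row lock for writers to contend over; a MapReduce or Hive job later reads all of those files together.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Each writer process creates its own file in HDFS, e.g.
// /data/events/host-42-000123, so hundreds of machines can write
// at the same time without locking a shared table.
public class HdfsEventWriter {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical layout: one file per writer per batch,
        // named by the first command-line argument.
        Path out = new Path("/data/events/" + args[0]);

        FSDataOutputStream stream = fs.create(out);
        try {
            // In a real collector this loop would drain a queue of
            // log lines; here we just write whatever was passed in.
            for (int i = 1; i < args.length; i++) {
                stream.writeBytes(args[i] + "\n");
            }
        } finally {
            stream.close();
        }
    }
}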
