Friday, 10 February 2012

NoSQL

I have been hearing a lot about NoSQL these days. The first important thing to know about NoSQL is that these data stores do not use SQL as the query language. They also may not require fixed table schemas. As the need to analyze unstructured web data grows, so does NoSQL. With the rise of the real-time web, there is a need to analyze huge amounts of data as it arrives. Companies realized that performance and real-time behavior mattered more to them than strict consistency, which traditional databases spend a lot of processing time to achieve. As such, NoSQL databases are highly optimized for retrieve and append operations and offer little functionality beyond record storage. The reduced run-time flexibility compared to full SQL systems is compensated by significant gains in scalability and performance.

Traditional RDBMS are characterized by the transactional ACID properties of Atomicity, Consistency, Isolation and Durability. In contrast, NoSQL is characterized by the BASE acronym:
1) Basically Available: The data is replicated among many different storage servers, so failures are partial rather than total.
2) Soft state: While a database that follows the ACID properties makes sure the data is always in a consistent state, a NoSQL store allows the data to be temporarily inconsistent; the state of the system may change over time even without new input, as writes propagate between replicas.
3) Eventually consistent: In contrast to ACID systems, which enforce consistency at transaction commit, NoSQL guarantees consistency only at some undefined time in the future. The data may not be consistent at any given moment, but it will eventually become consistent.
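The three BASE properties above can be sketched in a few lines of code. This is a minimal, hypothetical simulation (the class and method names are my own, not from any real NoSQL product): a write lands on one replica first, a read against another replica may return a stale value, and a propagation step eventually makes all replicas agree.

```python
# Hypothetical sketch of BASE behavior: writes reach one replica
# immediately and propagate to the others later.
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        # "Basically Available": many replicas, so a failure is partial
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated ("soft state")

    def put(self, key, value):
        # the write lands on one replica first
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def get(self, replica_index, key):
        # a read may hit a replica that has not seen the write yet
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # anti-entropy step: replicas converge, eventually consistent
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending = []

store = EventuallyConsistentStore()
store.put("user:1", "Alice")
print(store.get(0, "user:1"))  # "Alice" - the replica that took the write
print(store.get(2, "user:1"))  # None - stale replica, not yet consistent
store.propagate()
print(store.get(2, "user:1"))  # "Alice" - eventually consistent
```

The window between put() and propagate() is exactly the inconsistency that an ACID database would never allow a reader to observe.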

NoSQL emerged as companies struggled to deal with unprecedented volumes of data under tight latency constraints. This unstructured and semi-structured data is a rich information source which can be harnessed to add more value to the business. Organizations with massive data storage needs are turning to NoSQL. What the future holds for NoSQL remains to be seen.

Tuesday, 7 February 2012

What Hadoop is not meant for

I find a lot of people who think that Hadoop will solve all their data center problems. But they would be disappointed to know that it is not so. Hadoop only solves a specific set of problems, which are called big data problems. There is a lot of fuss going on in the IT world about big data. Well, it's very simple: big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization. One defining feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers.


Hadoop is not a substitute for a database. In an indexed database, issuing a SQL SELECT statement gives you the result in milliseconds. If you want to change data, the UPDATE statement comes to your rescue. Hadoop can't do any of this. The important point is that Hadoop stores data in files and does not index them. If you want to find something, you have to run a MapReduce job which goes through all the data. This takes time, and it means you can't directly use Hadoop as a replacement for a database. And writing a MapReduce job is not easy: it requires professional MapReduce programmers, and it takes time to write MapReduce programs.
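To see why a MapReduce "query" is slower than an indexed SELECT, here is a minimal sketch in plain Python (the function names and sample records are my own, for illustration only; a real Hadoop job would be written against the Hadoop API, typically in Java). The key point is in the mapper: it must read every record, because there is no index to jump to the matching ones.

```python
# Hypothetical sketch of finding records by key, MapReduce style:
# with no index, the map phase must scan every record.
from collections import defaultdict

def map_phase(records, wanted_key):
    # mapper: reads ALL records, emits only the matching (key, value) pairs
    for record in records:
        key, value = record.split(",", 1)
        if key == wanted_key:
            yield (key, value)

def reduce_phase(pairs):
    # reducer: groups the emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

records = ["user1,login", "user2,logout", "user1,purchase"]
result = reduce_phase(map_phase(records, "user1"))
print(result)  # {'user1': ['login', 'purchase']}
```

An indexed database answers the same question by reading only the rows for user1; the MapReduce job touched all three records, and on a real cluster it would touch billions.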

Hadoop works fine when the data is too big for a database (when it has exceeded the physical limits of a database, not just when you want to save money). With very large databases, the cost of maintaining indexes is so high that you can't easily index changing data. There is another problem when many machines are trying to write to the database: you run into contention for locks. This is where Hadoop comes in, writing data to files spread across a cluster. And now we have Hive on top of Hadoop, which can store that data in columnar form.
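The index-maintenance cost mentioned above can be made concrete with a toy comparison (entirely hypothetical names; real databases use B-trees rather than a sorted list, but the trade-off is the same): every insert into an indexed store pays to keep the index ordered, while an append-only file, which is Hadoop's model, just writes to the end.

```python
# Hypothetical sketch: per-write cost of an index vs. an append-only file.
import bisect

index = []   # sorted index a database must maintain on every write
log = []     # append-only file, HDFS style

def db_insert(key):
    bisect.insort(index, key)   # pays a sorted-insert on every write

def hdfs_append(record):
    log.append(record)          # O(1): just append; scan (or index) later

for k in [5, 3, 9, 1]:
    db_insert(k)
    hdfs_append(k)

print(index)  # [1, 3, 5, 9] - always ordered, always queryable
print(log)    # [5, 3, 9, 1] - raw arrival order; queries must scan
```

At small scale the difference is invisible; at billions of rapidly changing records, the constant re-ordering work is what makes the database the bottleneck, and append-plus-scan becomes the cheaper model.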