Monday, 10 March 2014

Hadoop Introduction




Hadoop Introduction
Big Data ... Big Data everywhere... well that's been the state of Industry as of Now. Everybody is talking about big data, hiring big data professionals, starting projects on big data. People who have knowledge of big data tools, are getting very good salary. So what exactly is this big data. For me it is a problem. Consider a scenario before smartphones or internet enabled phones, For shopping we used to go to markets(at least in India), for paying utility bills we had to stand in queues, for transferring money, for rail ticket booking etc we had to stand in queues or physically go to the place. Now with more and more usage of internet (thanks to smartphones), we are doing every thing online, shopping, bill payments, investment in share market, connecting with friends, applying for passport or Driving license etc. Now all these things have suddenly increased the amount of digital data we produce. I think big data has been there from quite some time, but now it has come in form of digital data. earlier it was in form of hard copies or some files in government offices. big data is everywhere...
       lets take an example to know that how big this data is...

       Facebook has 950 m users, Now assume on an average, each user produces 10 kb of data. Now I agree that not every body is producing this much of data and some might be producing far more higher amount than 10 KB. but I am just assuming 10KB


        1 user produces 10KB/d

        1000 users will produce ~10MB/d (10KB*1000)
        1000000(i.e. 1m) users will produce ~ 10 GB/d(10MB*1000)
        950m users will produce ~9.5TB/d

       Now thats some big number per day.


      So when we deal with this huge amount of data, you should be able to do two basic things with it.

1. Store
2. Process

       We should be able to store this huge amount of data and process this huge amount of data. When we are storing this huge data, it is obvious that we cannot store it in one machine. We need multiple machines to store data. But using multiple Systems to store a huge file brings in its own complexities. So we need a Multi machine file system which can abstract the details of dividing a file into multiple parts and storing it on different machines. 

      When it comes to processing of data, as a programmer i should not bother if file stored on one machine or multiple machines. I should not be concerned about connecting to different machines and processing a particular part of file in that machine. It should be abstracted from a developer.

Google was one of first companies who started focusing on solving this problem. Sanjay Gemawat with his team came up with two components

1. GFS (Google File System)
2. MRF (Map Reduce Framework)

     GFS is a multi-machine file system which handles storage of huge data.

MRF is a library which helps processing the data stored on GFS. MRF abstract many low level complex details from the developer and makes processing of data sitting on GFS easier with parallel processing.Papers about this was published by Google.

Doug Cutting and  Mike Cafarella implemented the papers published by google on GFS and MRF for handling huge data in a project called Nutch. Nutch is a web crawler. Then they separated that two components from Nutch and created a new project. This new project was named Hadoop. Hadoop has two basic components.

1. HDFS(Hadoop Distributed File System)
2. MRF(Map Reduce Framework)

    HDFS is implementation of paper on GFS. it handles the storage of huge amount of data.

MRF is implementation of MRF paper published by Google. It handles processing of Huge amount of data.

Now when do i start saying that i have problem of big data. People call it 3 V's

1. Volume
2. Velocity
3. Variety

     Volume, when amount of data that you handle increases, after a point you have to split and put it on multiple machines. As soon as you start using multiple machines. The system becomes more complex. More are the machines , more is complexity. we need a abstraction, so that system should be easy to scale without adding much to complexity of system. Then we have a need of HDFS and MRF at that point.


    Velocity, If new data that is coming in system is coming in very fast, then we have problem of big data. we need a solution like Hadoop. Imagine Facebook's example that we took earlier. 9.5TB/d , its huge data in very short time & that's high velocity of data coming in.

   
   Variety, Your data that you are processing can be of structured, unstructured and semi - structured form. You may be getting features out of this unstructured data and putting them into matrix to process with some ML algorithm. Now to process these huge matrices in parallel way, is difficult. we need a big data solution here.

What do i need to work on hadoop?


 Hadoop is built on java. if you know java, it is very easy for you. But if you don't know java, don't worry. There are other ways to work on it. There other languages like Hive and Pig. Hive has Sql like syntax, it is built by Facebook and donated to Apache foundation. If you are comfortable with sql. you may like to use Hive. Hive converts sql queries to Java code behind the scene. So it very useful if for operations like join, group by etc. Pig on the other hand was built by Yahoo and donated to Apache. it has more of scripting language kind of syntax. so if you are from scripting background, you may like to use pig.


That's the end of the session today. Please feel free to share your thoughts on this. Stay tuned for more posts...:)

No comments:

Post a Comment