Hadoop Distributed File System
In previous posts we covered an introduction to Hadoop and how to install it. Today we will learn about HDFS (Hadoop Distributed File System). HDFS is the component of Hadoop that handles storage, and it follows a master-slave architecture.
Let us first discuss what a master-slave architecture is.
In a master-slave architecture we have two kinds of machines: a master and a set of slaves.
The master does the following two things:
1. Plan
2. Monitor
The master is like the manager of your team. When there is work to do, the master plans which slave to assign it to.
The slaves do the following two things:
1. Work
2. Report
A slave is like a developer on your team (:P please don't feel offended, it is just an analogy). The slave does the actual work: the master assigns work to the slaves, and the slaves carry it out and complete it. It is similar to a manager who wants to develop some software: he plans who is going to develop which component, while the main development work is done by the developers on the team.
The manager also monitors and keeps track of the work; similarly, the master machine monitors the status of the work done by the slaves. In daily standup meetings developers give their status to the manager; similarly, at a specific interval the slaves send a status report to the master.
This is the master-slave architecture. It is followed by both HDFS and MapReduce. Let us see how HDFS follows it.
In HDFS the master node is called the Namenode and a slave node is called a Datanode. Before explaining what the Namenode and Datanodes do, let us see what blocks are. Any file that we store on HDFS is divided into multiple parts, and each part is called a block. By default each block is 64MB in size (this is the Hadoop 1.x default; Hadoop 2.x and later default to 128MB), but you can change it to 128MB or some other number according to your need. So a file is composed of multiple blocks of 64MB each, where only the last block may be smaller. These blocks are scattered across different datanodes in the cluster.
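To make the block arithmetic concrete, here is a small sketch in Python. It is not HDFS code; the function name `split_into_blocks` is made up for this post, and it only assumes the 64MB default described above.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 64 * MB  # the default block size discussed in this post


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be cut into.

    Hypothetical helper for illustration only; HDFS does this internally.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds whatever is left over.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    sizes = split_into_blocks(200 * MB)
    print(len(sizes), [s // MB for s in sizes])  # 4 [64, 64, 64, 8]
```

So a 200MB file needs three full 64MB blocks plus one 8MB block. With a 128MB block size the same file would need only two blocks.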
Namenode: As discussed earlier, the master node is the one that plans which slave is going to perform which task. The main task of HDFS is storing data, so the Namenode plans which slave (datanode) is going to store which part of a file. The Namenode maintains a file system table that records which blocks make up each file and which datanodes hold each block. It is like your manager keeping track of which developer is working on which module. The Namenode keeps this table in main memory and saves a snapshot of the file system namespace to disk in a file called the FSImage. Changes are not written into the FSImage directly; they are appended to a log file called the edit log (editlogs). The block locations themselves are not saved in the FSImage; the Namenode rebuilds them from the block reports that datanodes send.
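The bookkeeping described above can be pictured with a toy model. The class `ToyNamenode`, its method names, and the record format are all invented for this sketch; the real Namenode is far more involved.

```python
class ToyNamenode:
    """Hypothetical, heavily simplified model of the Namenode's bookkeeping."""

    def __init__(self):
        self.table = {}     # file path -> list of (block_id, [datanode names])
        self.edit_log = []  # every change is recorded here, in order

    def add_block(self, path, block_id, datanodes):
        # Record the change in the edit log first, then update the in-memory table.
        self.edit_log.append(("ADD_BLOCK", path, block_id))
        self.table.setdefault(path, []).append((block_id, list(datanodes)))

    def locate(self, path):
        # Answer the question "which datanodes hold each block of this file?"
        return list(self.table.get(path, []))


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.add_block("/logs/app.log", "blk_1", ["m1", "m2", "m3"])
    nn.add_block("/logs/app.log", "blk_2", ["m2", "m4", "m5"])
    for block_id, nodes in nn.locate("/logs/app.log"):
        print(block_id, nodes)
```

A client that wants to read a file would ask for this list and then fetch each block from one of the listed datanodes.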
Datanode: A datanode is the place where data is stored. As discussed earlier, blocks are stored on datanodes. Datanodes also report to the Namenode by regularly sending a signal called the heartbeat (HB) signal. By default a datanode sends this signal to the Namenode every 3 seconds. If the Namenode does not receive a heartbeat from a datanode for about 10 minutes, it assumes that the datanode is down and marks it as dead.
The next question that comes to mind is: if I lose a datanode like this, will I lose my data? HDFS takes care of this through replication. By default HDFS maintains 3 copies of each block. Suppose the first copy is on machine m1 in rack r1. Then the second copy of that block will be on machine m2 in rack r1, and the third copy will be on machine m3 in rack r2. If one machine goes down, you can recover the block from a different machine in the same rack. If the whole rack goes down, you can get the block from a machine in the other rack. The reason for keeping two copies in the same rack is that data transfer between racks is slower than transfer within a rack, so this placement limits inter-rack traffic.
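Both ideas, the heartbeat timeout and the rack-aware placement of three copies, can be sketched in a few lines of Python. The function names and data shapes below are made up for illustration, and the placement simply picks the first suitable machine, whereas a real cluster chooses among the candidates for you.

```python
HEARTBEAT_INTERVAL = 3  # seconds between heartbeats, as described above
DEAD_TIMEOUT = 10 * 60  # seconds of silence before a datanode is treated as down


def dead_datanodes(last_heartbeat, now, timeout=DEAD_TIMEOUT):
    """Return the datanodes whose last heartbeat is older than the timeout.

    last_heartbeat maps a datanode name to the time (in seconds) it last reported.
    """
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def place_replicas(first_node, racks):
    """Pick three machines for one block, following the layout in this post:
    first copy on first_node, second on another machine in the same rack,
    third on a machine in a different rack.

    racks maps a rack name to the list of machines in it.
    """
    local_rack = next((r for r, nodes in racks.items() if first_node in nodes), None)
    if local_rack is None:
        raise ValueError("first_node is not in any rack")
    same_rack = [n for n in racks[local_rack] if n != first_node]
    other_rack = [n for r, nodes in racks.items() if r != local_rack for n in nodes]
    if not same_rack or not other_rack:
        raise ValueError("need another machine in the same rack and one in a different rack")
    return [first_node, same_rack[0], other_rack[0]]


if __name__ == "__main__":
    racks = {"r1": ["m1", "m2"], "r2": ["m3"]}
    print(place_replicas("m1", racks))                     # ['m1', 'm2', 'm3']
    print(dead_datanodes({"m1": 0, "m2": 590}, now=601))   # ['m1']
```

In the second call m1 has been silent for 601 seconds, which is past the 600 second timeout, while m2 reported 11 seconds ago and is still considered alive.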
There is one more type of node in HDFS, called the Secondary Namenode. The Secondary Namenode is also a master node. It periodically merges the FSImage and the edit log so that the FSImage stays up to date. The Secondary Namenode runs on a different machine from the Namenode.
One point to note is that the Secondary Namenode never becomes the Namenode, even if the Namenode goes down. Its name is a misnomer.
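The merge done by the Secondary Namenode can be thought of as replaying the edit log on top of the last FSImage. Here is a rough sketch of that idea; the `checkpoint` function, the record format, and the two operation names are assumptions made for this example, not the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Return a new image with every logged change applied, in order.

    fsimage maps a file path to its list of block ids.
    edit_log is a list of (operation, path, block_id) records.
    Neither input is modified.
    """
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for op, path, block_id in edit_log:
        if op == "ADD_BLOCK":
            new_image.setdefault(path, []).append(block_id)
        elif op == "DELETE_FILE":
            new_image.pop(path, None)
        else:
            raise ValueError("unknown operation: " + op)
    return new_image


if __name__ == "__main__":
    old_image = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}
    edits = [
        ("ADD_BLOCK", "/a.txt", "blk_3"),
        ("DELETE_FILE", "/b.txt", None),
        ("ADD_BLOCK", "/c.txt", "blk_4"),
    ]
    print(checkpoint(old_image, edits))
    # {'/a.txt': ['blk_1', 'blk_3'], '/c.txt': ['blk_4']}
```

Once the merged image is saved, the edit log can start again from empty, which keeps it from growing forever.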