Thursday, 24 April 2014

Cluster Setup


  • Purpose
This document describes how to install and configure a Hadoop cluster of a few nodes (fewer than 10).

Pre-requisites
  1. Create a hadoop user and group on all the nodes of the cluster.
  2. Check the CPU and memory configuration on all the nodes with cat /proc/cpuinfo and cat /proc/meminfo (a command sketch follows this list).
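A minimal sketch of these steps, assuming root access on each node and a user and group both named hadoop (our convention; adjust the names to yours):

    # Create the hadoop group and user on every node
    groupadd hadoop
    useradd -m -g hadoop hadoop

    # Check memory and CPU configuration
    grep MemTotal /proc/meminfo
    grep -c ^processor /proc/cpuinfo   # number of logical cores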


Installation
Unpack the software on all the nodes. Typically, one machine is designated as the NameNode and another as the JobTracker. These are the masters. Small clusters can run both the NameNode and the JobTracker on the same node. The remaining machines act as both DataNode and TaskTracker. These are the slaves.
For us, st11p00me-irptproc008 is configured as NameNode and st11p00me-irptproc009 and st11p00me-irptproc010 as DataNodes.
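A minimal sketch of the unpacking step, assuming the Hadoop 0.20.2 tarball and the /home/hadoop/bin install location used in the configuration paths below:

    # Run as the hadoop user on every node
    cd /home/hadoop/bin
    tar -xzf hadoop-0.20.2.tar.gz
    export HADOOP_HOME=/home/hadoop/bin/hadoop-0.20.2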


Configuration
The following steps describe how to configure the Hadoop cluster:
Configuration Files
Site Specific Configuration Files:
  1. In the core-site.xml file in the conf directory of Hadoop, configure the Hadoop temp directory and the default filesystem name.
  2. In hdfs-site.xml, configure the replication factor and the DFS block size. We set the replication factor to 3 and override the default 64MB block size with 128MB.
  3. In mapred-site.xml, set the JobTracker's IP and port.
  4. The contents of all the configuration files are listed in the Resources section below.

SSH Setup

  1. Passwordless ssh localhost should work on all the nodes.
  2. Use ssh-keygen -t rsa to generate the keys.
  3. The above command generates a key pair and writes the public key to ~/.ssh/id_rsa.pub.
  4. Do not use a passphrase while generating the keys.
  5. Copy the public key of every host into the ~/.ssh/authorized_keys file on each node (see the sketch after this list).
  6. Repeat these steps on all the nodes.
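A minimal sketch of the key distribution, run as the hadoop user (the host names are ours; adjust to your cluster):

    # Generate an RSA key pair with an empty passphrase
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    # Append this node's public key to authorized_keys on every node
    for host in st11p00me-irptproc008 st11p00me-irptproc009 st11p00me-irptproc010; do
        cat ~/.ssh/id_rsa.pub | ssh $host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
    done

    # Verify: this should log in without prompting for a password
    ssh localhost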


Execution
  1. Format the HDFS filesystem with bin/hadoop namenode -format (the full command sequence is sketched after this list).
  2. Start the Hadoop daemons using bin/start-all.sh.
  3. Use jps to check the running processes. On the master we should find the NameNode, SecondaryNameNode and JobTracker; on the slaves, the DataNode and TaskTracker processes.
  4. Stop the Hadoop daemons using bin/stop-all.sh.
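The full sequence, run from the Hadoop install directory on the master (start-all.sh and stop-all.sh reach the slaves over the SSH setup above):

    # One-time: format the NameNode's metadata storage
    bin/hadoop namenode -format

    # Start the HDFS and MapReduce daemons on the master and all slaves
    bin/start-all.sh

    # List the running Java daemon processes on this node
    jps

    # Shut everything down
    bin/stop-all.sh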

Resources
Below are the contents of the configuration XMLs.

On the Master Node

  1. core-site.xml


Parameter       | Value                                              | Notes
fs.default.name | hdfs://17.172.45.71:9000                           | IP and port of the NameNode (HDFS)
hadoop.tmp.dir  | /home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name} | Directory for intermediate temporary files
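The same settings rendered as XML, a sketch of what our core-site.xml looks like (values are from the table above):

    <?xml version="1.0"?>
    <configuration>
      <!-- IP and port of the NameNode (HDFS) -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://17.172.45.71:9000</value>
      </property>
      <!-- Directory for intermediate temporary files -->
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name}</value>
      </property>
    </configuration>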



  2. hdfs-site.xml


Parameter         | Value                                   | Notes
dfs.replication   | 3                                       | Number of copies of each block kept across the DataNodes
dfs.data.dir      | /home/hadoop/bin/hadoop-0.20.2/dataNode | Directory in which the DataNode stores its blocks
dfs.block.size    | 134217728                               | Block size in bytes (128 MB) in which the data is distributed
dfs.hosts.exclude | conf/excludes                           | File listing host names excluded from the cluster
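A sketch of the corresponding hdfs-site.xml (note that dfs.block.size is given in bytes):

    <?xml version="1.0"?>
    <configuration>
      <!-- Number of copies of each block kept across the DataNodes -->
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <!-- Directory in which the DataNode stores its blocks -->
      <property>
        <name>dfs.data.dir</name>
        <value>/home/hadoop/bin/hadoop-0.20.2/dataNode</value>
      </property>
      <!-- Block size in bytes (128 MB) -->
      <property>
        <name>dfs.block.size</name>
        <value>134217728</value>
      </property>
      <!-- File listing hosts excluded from the cluster -->
      <property>
        <name>dfs.hosts.exclude</name>
        <value>conf/excludes</value>
      </property>
    </configuration>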





  3. mapred-site.xml


Parameter                               | Value             | Notes
mapred.job.tracker                      | 17.172.45.71:9001 | IP and port of the JobTracker
mapred.tasktracker.map.tasks.maximum    | 12                | Maximum number of map tasks run simultaneously on a given TaskTracker
mapred.tasktracker.reduce.tasks.maximum | 4                 | Maximum number of reduce tasks run simultaneously on a given TaskTracker
mapred.child.java.opts                  | -Xmx4G            | Heap size for the child task JVMs
mapred.reduce.parallel.copies           | 20                | Number of parallel copies a reduce runs to fetch map outputs; raised here for a large number of maps
tasktracker.http.threads                | 20                | Number of worker threads on the TaskTracker HTTP server
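A sketch of the corresponding mapred-site.xml:

    <?xml version="1.0"?>
    <configuration>
      <!-- IP and port of the JobTracker -->
      <property>
        <name>mapred.job.tracker</name>
        <value>17.172.45.71:9001</value>
      </property>
      <!-- Maximum simultaneous map / reduce tasks per TaskTracker -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>12</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
      </property>
      <!-- Heap size for the child task JVMs -->
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx4G</value>
      </property>
      <!-- Parallel copies used by reduces to fetch map outputs -->
      <property>
        <name>mapred.reduce.parallel.copies</name>
        <value>20</value>
      </property>
      <!-- Worker threads on the TaskTracker HTTP server -->
      <property>
        <name>tasktracker.http.threads</name>
        <value>20</value>
      </property>
    </configuration>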


On the Slave Nodes

  1. core-site.xml


Parameter       | Value                                              | Notes
fs.default.name | hdfs://17.172.45.71:9000                           | IP and port of the NameNode (HDFS)
hadoop.tmp.dir  | /home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name} | Directory for intermediate temporary files

  2. hdfs-site.xml


Parameter         | Value                                   | Notes
dfs.replication   | 3                                       | Number of copies of each block kept across the DataNodes
dfs.data.dir      | /home/hadoop/bin/hadoop-0.20.2/dataNode | Directory in which the DataNode stores its blocks
dfs.block.size    | 134217728                               | Block size in bytes (128 MB) in which the data is distributed
dfs.hosts.exclude | conf/excludes                           | File listing host names excluded from the cluster


  3. mapred-site.xml


Parameter                               | Value             | Notes
mapred.job.tracker                      | 17.172.45.71:9001 | IP and port of the JobTracker
mapred.tasktracker.map.tasks.maximum    | 12                | Maximum number of map tasks run simultaneously on a given TaskTracker
mapred.tasktracker.reduce.tasks.maximum | 4                 | Maximum number of reduce tasks run simultaneously on a given TaskTracker
mapred.child.java.opts                  | -Xmx4G            | Heap size for the child task JVMs
mapred.reduce.parallel.copies           | 20                | Number of parallel copies a reduce runs to fetch map outputs; raised here for a large number of maps
tasktracker.http.threads                | 20                | Number of worker threads on the TaskTracker HTTP server


Notes

  1. The NameNode is more memory-intensive than CPU-intensive.
  2. Sometimes the daemon processes on the slave nodes do not get killed by stop-all.sh on the master. In those cases, kill the corresponding Java processes manually.
  3. DataNodes are more CPU-intensive, so avoid running other CPU-intensive jobs on those nodes.
  4. The configuration of an efficient cluster depends chiefly on:
    1. the data and the jobs to be executed;
    2. network I/O capacity and latency;
    3. experimentation with the many configuration properties advocated by different schools of thought.
  5. Avoid using a huge number of small files.
