Thursday, 24 April 2014

Cluster Setup


  • Purpose
This document describes how to install and configure a Hadoop cluster of a few nodes (fewer than 10).

Pre-requisites
  1. Create a hadoop user and group on all the nodes of the cluster.
  2. Check the CPU and memory configuration on all the nodes with cat /proc/cpuinfo and cat /proc/meminfo (a command sketch follows this list).
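A minimal sketch of these steps, assuming root access on each node and a user and group both named hadoop (our convention; adjust the names to yours):

    # Create the hadoop group and user on every node
    groupadd hadoop
    useradd -m -g hadoop hadoop

    # Check memory and CPU configuration
    grep MemTotal /proc/meminfo
    grep -c ^processor /proc/cpuinfo   # number of logical cores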


Installation
Unpack the software on all the nodes. Typically, one machine is designated as the NameNode and another as the JobTracker. These are the masters. Small clusters can run both the NameNode and the JobTracker on the same node. The remaining machines act as both DataNode and TaskTracker. These are the slaves.
For us, st11p00me-irptproc008 is configured as NameNode and st11p00me-irptproc009 and st11p00me-irptproc010 as DataNodes.
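A minimal sketch of the unpacking step, assuming the Hadoop 0.20.2 tarball and the /home/hadoop/bin install location used in the configuration paths below:

    # Run as the hadoop user on every node
    cd /home/hadoop/bin
    tar -xzf hadoop-0.20.2.tar.gz
    export HADOOP_HOME=/home/hadoop/bin/hadoop-0.20.2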


Configuration
The following steps describe how to configure the Hadoop cluster:
Configuration Files
Site Specific Configuration Files:
  1. In the core-site.xml file in the conf directory of Hadoop, configure the Hadoop temp directory and the default filesystem name.
  2. In hdfs-site.xml, configure the replication factor and the DFS block size. We set the replication factor to 3 and override the default 64MB block size with 128MB.
  3. In mapred-site.xml, set the JobTracker's IP and port.
  4. The contents of all the configuration files are listed in the Resources section below.

SSH Setup

  1. Passwordless ssh localhost should work on all the nodes.
  2. Use ssh-keygen -t rsa to generate the keys.
  3. The above command generates a key pair and writes the public key to ~/.ssh/id_rsa.pub.
  4. Do not use a passphrase while generating the keys.
  5. Copy the public key of every host into the ~/.ssh/authorized_keys file on each node (see the sketch after this list).
  6. Repeat these steps on all the nodes.
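A minimal sketch of the key distribution, run as the hadoop user (the host names are ours; adjust to your cluster):

    # Generate an RSA key pair with an empty passphrase
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    # Append this node's public key to authorized_keys on every node
    for host in st11p00me-irptproc008 st11p00me-irptproc009 st11p00me-irptproc010; do
        cat ~/.ssh/id_rsa.pub | ssh $host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
    done

    # Verify: this should log in without prompting for a password
    ssh localhost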


Execution
  1. Format the HDFS filesystem with bin/hadoop namenode -format (the full command sequence is sketched after this list).
  2. Start the Hadoop daemons using bin/start-all.sh.
  3. Use jps to check the running processes. On the master we should find the NameNode, SecondaryNameNode and JobTracker; on the slaves, the DataNode and TaskTracker processes.
  4. Stop the Hadoop daemons using bin/stop-all.sh.
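The full sequence, run from the Hadoop install directory on the master (start-all.sh and stop-all.sh reach the slaves over the SSH setup above):

    # One-time: format the NameNode's metadata storage
    bin/hadoop namenode -format

    # Start the HDFS and MapReduce daemons on the master and all slaves
    bin/start-all.sh

    # List the running Java daemon processes on this node
    jps

    # Shut everything down
    bin/stop-all.sh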

Resources
Below are the contents of the configuration XMLs.

On the Master Node

  1. core-site.xml


Parameter       | Value                                              | Notes
fs.default.name | hdfs://17.172.45.71:9000                           | IP and port of the NameNode (HDFS)
hadoop.tmp.dir  | /home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name} | Directory for intermediate temporary files
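The same settings rendered as XML, a sketch of what our core-site.xml looks like (values are from the table above):

    <?xml version="1.0"?>
    <configuration>
      <!-- IP and port of the NameNode (HDFS) -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://17.172.45.71:9000</value>
      </property>
      <!-- Directory for intermediate temporary files -->
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name}</value>
      </property>
    </configuration>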



  2. hdfs-site.xml


Parameter         | Value                                   | Notes
dfs.replication   | 3                                       | Number of copies of each block kept across the DataNodes
dfs.data.dir      | /home/hadoop/bin/hadoop-0.20.2/dataNode | Directory in which the DataNode stores its blocks
dfs.block.size    | 134217728                               | Block size in bytes (128 MB) in which the data is distributed
dfs.hosts.exclude | conf/excludes                           | File listing host names excluded from the cluster
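A sketch of the corresponding hdfs-site.xml (note that dfs.block.size is given in bytes):

    <?xml version="1.0"?>
    <configuration>
      <!-- Number of copies of each block kept across the DataNodes -->
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <!-- Directory in which the DataNode stores its blocks -->
      <property>
        <name>dfs.data.dir</name>
        <value>/home/hadoop/bin/hadoop-0.20.2/dataNode</value>
      </property>
      <!-- Block size in bytes (128 MB) -->
      <property>
        <name>dfs.block.size</name>
        <value>134217728</value>
      </property>
      <!-- File listing hosts excluded from the cluster -->
      <property>
        <name>dfs.hosts.exclude</name>
        <value>conf/excludes</value>
      </property>
    </configuration>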





  3. mapred-site.xml


Parameter                               | Value             | Notes
mapred.job.tracker                      | 17.172.45.71:9001 | IP and port of the JobTracker
mapred.tasktracker.map.tasks.maximum    | 12                | Maximum number of map tasks run simultaneously on a given TaskTracker
mapred.tasktracker.reduce.tasks.maximum | 4                 | Maximum number of reduce tasks run simultaneously on a given TaskTracker
mapred.child.java.opts                  | -Xmx4G            | Heap size for the child task JVMs
mapred.reduce.parallel.copies           | 20                | Number of parallel copies a reduce runs to fetch map outputs; raised here for a large number of maps
tasktracker.http.threads                | 20                | Number of worker threads on the TaskTracker HTTP server
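A sketch of the corresponding mapred-site.xml:

    <?xml version="1.0"?>
    <configuration>
      <!-- IP and port of the JobTracker -->
      <property>
        <name>mapred.job.tracker</name>
        <value>17.172.45.71:9001</value>
      </property>
      <!-- Maximum simultaneous map / reduce tasks per TaskTracker -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>12</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
      </property>
      <!-- Heap size for the child task JVMs -->
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx4G</value>
      </property>
      <!-- Parallel copies used by reduces to fetch map outputs -->
      <property>
        <name>mapred.reduce.parallel.copies</name>
        <value>20</value>
      </property>
      <!-- Worker threads on the TaskTracker HTTP server -->
      <property>
        <name>tasktracker.http.threads</name>
        <value>20</value>
      </property>
    </configuration>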


On the Slave Nodes

  1. core-site.xml


Parameter       | Value                                              | Notes
fs.default.name | hdfs://17.172.45.71:9000                           | IP and port of the NameNode (HDFS)
hadoop.tmp.dir  | /home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name} | Directory for intermediate temporary files

  2. hdfs-site.xml


Parameter         | Value                                   | Notes
dfs.replication   | 3                                       | Number of copies of each block kept across the DataNodes
dfs.data.dir      | /home/hadoop/bin/hadoop-0.20.2/dataNode | Directory in which the DataNode stores its blocks
dfs.block.size    | 134217728                               | Block size in bytes (128 MB) in which the data is distributed
dfs.hosts.exclude | conf/excludes                           | File listing host names excluded from the cluster


  3. mapred-site.xml


Parameter                               | Value             | Notes
mapred.job.tracker                      | 17.172.45.71:9001 | IP and port of the JobTracker
mapred.tasktracker.map.tasks.maximum    | 12                | Maximum number of map tasks run simultaneously on a given TaskTracker
mapred.tasktracker.reduce.tasks.maximum | 4                 | Maximum number of reduce tasks run simultaneously on a given TaskTracker
mapred.child.java.opts                  | -Xmx4G            | Heap size for the child task JVMs
mapred.reduce.parallel.copies           | 20                | Number of parallel copies a reduce runs to fetch map outputs; raised here for a large number of maps
tasktracker.http.threads                | 20                | Number of worker threads on the TaskTracker HTTP server


Notes

  1. The NameNode is more memory-intensive than CPU-intensive.
  2. Sometimes the daemon processes on the slave nodes do not get killed by stop-all.sh on the master. In those cases, kill the corresponding Java processes manually.
  3. DataNodes are more CPU-intensive, so avoid running other CPU-intensive jobs on those nodes.
  4. The configuration of an efficient cluster depends chiefly on:
    1. the data and the jobs to be executed;
    2. network I/O capacity and latency;
    3. experimentation with the many configuration properties advocated by different schools of thought.
  5. Avoid using a huge number of small files.
