Cluster Setup

Purpose
This document describes how to install, configure, and set up a Hadoop cluster for a small number of nodes (fewer than 10).
Pre-requisites
- Create a hadoop user and group on all the nodes of the cluster.
- Check the CPU and memory configuration on all the nodes (cat /proc/cpuinfo and cat /proc/meminfo), as in the sketch below.
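A minimal sketch of these prerequisite steps, assuming a Linux node with sudo access (the user, group, and shell choices are illustrative):

# Create the hadoop group and user on every node
sudo groupadd hadoop
sudo useradd -m -g hadoop -s /bin/bash hadoop

# Check memory and CPU configuration
grep MemTotal /proc/meminfo
grep -c ^processor /proc/cpuinfo    # number of logical CPUs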
Installation
Unpack the software on all the nodes. Typically one machine is designated as the NameNode and another machine as the JobTracker; these are the masters. Small clusters can run both the NameNode and the JobTracker on the same node. The remaining machines act as both DataNode and TaskTracker; these are the slaves.
In our setup, st11p00me-irptproc008 is configured as the NameNode, and st11p00me-irptproc009 and st11p00me-irptproc010 as DataNodes. A sketch of the unpacking step and the masters/slaves files is given below.
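A minimal sketch of the installation step, assuming the Hadoop 0.20.2 tarball and the hostnames above (the install path is an assumption that matches the directories used later in this document):

# Run on every node as the hadoop user
cd /home/hadoop/bin
tar -xzf hadoop-0.20.2.tar.gz
cd hadoop-0.20.2

# On the node where start-all.sh will be run:
# conf/slaves lists the nodes that run DataNode/TaskTracker,
# conf/masters lists where the SecondaryNameNode runs.
echo st11p00me-irptproc008 > conf/masters
printf 'st11p00me-irptproc009\nst11p00me-irptproc010\n' > conf/slaves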
Configuration
The following steps describe how to configure the Hadoop cluster.
Configuration Files
Site-specific configuration files:
- In core-site.xml (in Hadoop's conf directory), configure the Hadoop temp directory and the default filesystem name.
- In hdfs-site.xml, configure the replication factor and block size. We set the replication factor to 3; the default dfs block size is 64 MB, and our configuration uses 128 MB.
- In mapred-site.xml, set the JobTracker's IP and port.
- The Resources section at the end lists the contents of all the configuration files; a sketch of creating core-site.xml follows below.
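A minimal sketch of writing core-site.xml from the shell, using the values from the tables in the Resources section (the NameNode IP and install path are the ones used in this cluster):

# Run on every node from the Hadoop install directory
# The quoted heredoc keeps ${user.name} literal so Hadoop expands it, not the shell
cat > conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://17.172.45.71:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name}</value>
  </property>
</configuration>
EOF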
SSH Setup
- SSH to localhost should work on all the nodes.
- Use ssh-keygen -t rsa to generate keys.
- The above command generates a key pair and stores the public key in ~/.ssh/id_rsa.pub.
- Do not use a passphrase when generating the keys.
- Copy the public key of every host into each node's ~/.ssh/authorized_keys file.
- Repeat these steps on all the nodes. A sketch of the key setup is given after this list.
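A minimal sketch of the SSH key setup for the hadoop user, assuming the hostnames above (ssh-copy-id is used here for convenience; appending the key to ~/.ssh/authorized_keys by hand works equally well):

# Run as the hadoop user on each node
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa    # empty passphrase

# Distribute the public key to every node in the cluster, including this one
for host in st11p00me-irptproc008 st11p00me-irptproc009 st11p00me-irptproc010; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$host
done

# Verify passwordless login works
ssh localhost true && echo "passwordless ssh OK"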
Execution
- Format the HDFS filesystem using bin/hadoop namenode -format.
- Start the Hadoop daemons using bin/start-all.sh.
- Use jps to check the running processes. On the master we should find the NameNode, SecondaryNameNode, and JobTracker; on the slaves, the DataNode and TaskTracker processes.
- Stop the Hadoop daemons using bin/stop-all.sh (the commands are collected in the sketch below).
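A minimal sketch of the execution steps, run from the Hadoop install directory on the master (formatting the NameNode is done only once, before the first start):

cd /home/hadoop/bin/hadoop-0.20.2

# One-time only: format the HDFS NameNode (this erases any existing HDFS metadata)
bin/hadoop namenode -format

# Start all daemons: NameNode/SecondaryNameNode/JobTracker here, DataNode/TaskTracker on the slaves
bin/start-all.sh

# Verify which Java processes are running on this node
jps

# Stop everything when done
bin/stop-all.sh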
Resources
Below are the contents of the configuration XML files.
In Master Node
- core-site.xml

Parameter | Value | Notes
fs.default.name | hdfs://17.172.45.71:9000 | IP and port of the HDFS (NameNode)
hadoop.tmp.dir | /home/hadoop/bin/hadoop-0.20.2/hadoop-${user.name} | Directory for creating intermediate temporary files
- hdfs-site.xml

Parameter | Value | Notes
dfs.replication | 3 | Number of copies of each data block kept across the DataNodes
dfs.data.dir | /home/hadoop/bin/hadoop-0.20.2/dataNode | Directory where the data blocks are stored
dfs.block.size | 128 MB | Block size in which the data is distributed
dfs.hosts.exclude | conf/excludes | Host names to be excluded from the cluster
- mapred-site.xml

Parameter | Value | Notes
mapred.job.tracker | 17.172.45.71:9001 | JobTracker IP and port
mapred.tasktracker.map.tasks.maximum | 12 | Maximum number of map tasks to run on a given machine
mapred.tasktracker.reduce.tasks.maximum | 4 | Maximum number of reduce tasks to run on a given machine
mapred.child.java.opts | -Xmx4G | Java memory options for task child JVMs
mapred.reduce.parallel.copies | 20 | Number of parallel copies run by reducers to fetch outputs from a very large number of maps
tasktracker.http.threads | 20 | Number of threads on the TaskTracker HTTP server
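The tables above map directly onto the XML files; a minimal sketch of writing hdfs-site.xml and mapred-site.xml with those values follows (note that dfs.block.size is specified in bytes, so 128 MB becomes 134217728):

cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.data.dir</name><value>/home/hadoop/bin/hadoop-0.20.2/dataNode</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property> <!-- 128 MB, in bytes -->
  <property><name>dfs.hosts.exclude</name><value>conf/excludes</value></property>
</configuration>
EOF

cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>mapred.job.tracker</name><value>17.172.45.71:9001</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>12</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>4</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx4G</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
  <property><name>tasktracker.http.threads</name><value>20</value></property>
</configuration>
EOF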
In Slave Node
The core-site.xml, hdfs-site.xml, and mapred-site.xml files on the slave nodes use the same parameters and values as on the master node above.
Notes
- The NameNode is more memory-intensive than CPU-intensive.
- Sometimes the daemon processes on the slave nodes do not get killed by stop-all.sh run on the master. In those cases we need to manually kill the corresponding Java processes (see the sketch after this list).
- DataNodes are more CPU-intensive, so avoid running other CPU-intensive jobs on those nodes.
- Configuration of an efficient cluster depends largely on:
  - the data and the jobs to be executed,
  - network I/O capacity and latency,
  - experimentation with the many configuration properties suggested by different schools of thought.
- Avoid using a huge number of small files.
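A minimal sketch of manually cleaning up leftover daemons on a slave node (the process names come from jps output; this simply kills whatever DataNode/TaskTracker processes are still running):

# Run on the affected slave node as the hadoop user
for proc in DataNode TaskTracker; do
    pid=$(jps | awk -v p="$proc" '$2 == p {print $1}')
    if [ -n "$pid" ]; then
        echo "Killing leftover $proc (pid $pid)"
        kill $pid
    fi
done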