Hadoop
What is Hadoop?
- The Apache Hadoop project develops open source software for reliable, scalable, distributed computing.
- Hadoop is a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.
- The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
- Imagine this scenario: you have 1 GB of data that you need to process. The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and the data grows to 10 GB, and then 100 GB, and you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10 TB, and then 100 TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!
- Hadoop is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which can be structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, Hadoop performs batch operations over massive quantities of data, so the response time is not immediate.
- Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers.
- The key distinctions of Hadoop are that it is:
- Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
- Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
- Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
- Simple: Hadoop allows users to quickly write efficient parallel code.
The Hadoop family of related projects is described briefly here:
- Common: A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
- Avro: A serialization system for efficient, cross-language RPC, and persistent data storage.
- MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
- HDFS: A distributed file system that runs on large clusters of commodity machines.
- Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
- Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
- HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
- ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
- Sqoop: A tool for efficiently moving data between relational databases and HDFS.
- Why Hadoop?
- Need to process multi-petabyte datasets.
- Data may not have a strict schema.
- It is expensive to build reliability into each application.
- Nodes fail every day.
- Failure is expected, rather than exceptional.
- The number of nodes in a cluster is not constant.
- Efficient, reliable, open source (Apache License).
What is Hadoop used for?
- Search (Yahoo!, Amazon).
- Log processing (Facebook, Yahoo!).
- Recommendation systems (Facebook).
- Data warehousing (Facebook).
- Video and image analysis.
- Difference between Hadoop and RDBMS
Hadoop | RDBMS
Data size: petabytes | Data size: gigabytes
Access: batch | Access: interactive and batch
Updates: write once, read many times | Updates: read and write many times
Structure: dynamic schema | Structure: static schema
Scaling: linear | Scaling: nonlinear
- Limitations of Hadoop
- Not suited to processing transactions (random access).
- Not good when work cannot be parallelized.
- Not good for low-latency data access.
- Not good for processing lots of small files.
- Not good for intensive calculations with little data.
- Installing Apache Hadoop
- Java version 6 or later is required on your machine.
- Hadoop runs on Unix and on Windows.
- Linux is the only supported production platform.
- Windows is only supported as a development platform, and additionally requires Cygwin to run.
- During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.
- Installation
Download a stable release, which is packaged as a gzipped tar file, from the Apache Hadoop releases page and unpack it somewhere on your file system:
% tar xzf hadoop-x.y.z.tar.gz
Before you can run Hadoop, you need to tell it where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don't have to configure anything further. (It is often set in a shell startup file, such as ~/.bash_profile or ~/.bashrc.) Otherwise, you can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable. For example, on my Mac, I changed the line to read:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
to point to version 1.6.0 of Java. On Ubuntu, the equivalent line is:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
It's very convenient to create an environment variable that points to the Hadoop installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on your command-line path. For example:
% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin
Check that Hadoop runs by typing:
% hadoop version
Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010
Configuration
Each component in Hadoop is configured using an XML file. Common properties go in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in mapred-site.xml. These files are all located in the conf subdirectory.
Modes of Running
Hadoop can be run in one of three modes:
- Standalone (or local) mode: There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
- Pseudo-distributed mode: The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
- Fully distributed mode: The Hadoop daemons run on a cluster of machines.
- Standalone Mode
In standalone mode, there is no further action to take, since the default properties are set for standalone mode, and there are no daemons to run.
- Pseudo-Distributed Mode
The configuration files should be created with the following contents and placed in the conf directory (although you can place configuration files in any directory as long as you start the daemons with the --config option):
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
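For pseudo-distributed operation, the conf/masters and conf/slaves files can usually stay at their defaults; in the stock 0.20 distribution both simply list localhost. As a quick sketch (assuming the default conf directory), you can confirm this with:
% cat conf/masters
localhost
% cat conf/slaves
localhost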
Configuring SSH
In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn't actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password.
First, make sure that SSH is installed and a server is running. On Ubuntu, for example, this is achieved with:
% sudo apt-get install ssh
On Windows with Cygwin, you can set up an SSH server (after having installed the openssh package) by running ssh-host-config -y. On Mac OS X, make sure Remote Login (under System Preferences, Sharing) is enabled for the current user (or all users).
Then to enable password-less login, generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful, you should not have to type in a password.
Formatting the HDFS filesystem
Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode's persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem's metadata, and datanodes can join or leave the cluster dynamically. For the same reason, you don't need to say how large a filesystem to create, since this is determined by the number of datanodes in the cluster, which can be increased as needed, long after the filesystem was formatted.
Formatting HDFS is quick to do. Just type the following:
% hadoop namenode -format
Starting and stopping the daemons
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh
If you have placed configuration files outside the default conf directory, start the daemons with the --config option, which takes an absolute path to the configuration directory:
% start-dfs.sh --config path-to-config-directory
% start-mapred.sh --config path-to-config-directory
Three daemons will be started on your local machine: a namenode, a secondary namenode, and a datanode. You can check whether the daemons started successfully by looking at the logfiles in the logs directory (in the Hadoop installation directory), or by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at http://localhost:50070/ for the namenode. You can also use Java's jps command to see whether they are running.
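As a quick sketch, the same check can be done from the command line (jps ships with the JDK; the class names below are what the Hadoop 0.20 daemons typically report, and the filesystem listing assumes HDFS is up and reachable):
% jps | grep -E 'NameNode|DataNode|JobTracker|TaskTracker'
% hadoop fs -ls /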
Stopping the daemons is done in the obvious way:
% stop-dfs.sh
% stop-mapred.sh
Fully Distributed Mode
Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied to expensive, proprietary offerings from a single vendor; rather, you can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster.
"Commodity" does not mean "low-end." Low-end machines often have cheap components, which have higher failure rates than more expensive (but still commodity-class) machines. When you are operating tens, hundreds, or thousands of machines, cheap components turn out to be a false economy, as the higher failure rate incurs a greater maintenance cost. On the other hand, large database-class machines are not recommended either, since they don't score well on the price/performance curve. And even though you would need fewer of them to build a cluster of comparable performance to one built of mid-range commodity hardware, when one did fail it would have a bigger impact on the cluster, since a larger proportion of the cluster hardware would be unavailable.
Hardware specifications rapidly become obsolete, but for the sake of illustration, a typical choice of machine for running a Hadoop datanode and tasktracker in mid-2010 would have the following specifications:
Processor: 2 quad-core 2-2.5 GHz CPUs
Memory: 16-24 GB ECC RAM
Storage: 4 × 1 TB SATA disks
Network: Gigabit Ethernet
While the hardware specification for your cluster will assuredly be different, Hadoop is designed to use multiple cores and disks, so it will be able to take full advantage of more powerful hardware.
The bulk of Hadoop is written in Java, and can therefore run on any platform with a JVM, although there are enough parts that harbor Unix assumptions (the control scripts, for example) to make it unwise to run on a non-Unix platform in production. In fact, Windows operating systems are not supported production platforms (although they can be used with Cygwin as a development platform).
How large should your cluster be? There isn't an exact answer to this question, but the beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow it as your storage and computational needs grow. In many ways, a better question is this: how fast does my cluster need to grow? You can get a good feel for this by considering storage capacity.
For example, if your data grows by 1 TB a week, and you have three-way HDFS replication, then you need an additional 3 TB of raw storage per week. Allow some room for intermediate files and logfiles (around 30%, say), and this works out at about one machine (2010 vintage) per week, on average. In practice, you wouldn't buy a new machine each week and add it to the cluster. The value of doing a back-of-the-envelope calculation like this is that it gives you a feel for how big your cluster should be: in this example, a cluster that holds two years of data needs 100 machines.
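The same back-of-the-envelope arithmetic can be written down as a quick shell sketch (the growth rate, replication factor, overhead, and per-node disk figures are the assumptions used in the example above):
WEEKLY_GROWTH_TB=1   # new data per week
REPLICATION=3        # HDFS replication factor
OVERHEAD_PCT=30      # room for intermediate files and logfiles
DISK_PER_NODE_TB=4   # 4 x 1 TB disks on a 2010-vintage node
WEEKS=104            # roughly two years
# Total raw storage needed, then number of nodes (rounded up).
TOTAL_RAW_TB=$(( WEEKLY_GROWTH_TB * REPLICATION * WEEKS * (100 + OVERHEAD_PCT) / 100 ))
NODES=$(( (TOTAL_RAW_TB + DISK_PER_NODE_TB - 1) / DISK_PER_NODE_TB ))
echo "~${TOTAL_RAW_TB} TB of raw storage, ~${NODES} nodes after ${WEEKS} weeks"
This prints roughly 405 TB and about 100 nodes, matching the estimate in the text.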
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the jobtracker on a single master machine (as long as at least one copy of the namenode's metadata is stored on a remote filesystem). As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
The secondary namenode can be run on the same machine as the namenode, but again for reasons of memory usage (the secondary has the same memory requirements as the primary), it is best to run it on a separate piece of hardware, especially for larger clusters.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the master daemons: the namenode, secondary namenode, and jobtracker. On a small cluster (a few tens of nodes), it is convenient to put them on a single machine; however, as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for the entire namespace in memory. The secondary namenode, while idle most of the time, has a comparable memory footprint to the primary when it creates a checkpoint.
For filesystems with a large number of files, there may not be enough physical memory on one machine to run both the primary and secondary namenode.
The secondary namenode keeps a copy of the latest checkpoint of the filesystem metadata that it creates. Keeping this (stale) backup on a different node to the namenode allows recovery in the event of loss (or corruption) of all the namenode's metadata files. Machines running the namenode should typically run on 64-bit hardware to avoid the 3 GB limit on Java heap size in 32-bit architectures.
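As a minimal sketch of giving the daemons a larger heap on a 64-bit master, the following line would go in conf/hadoop-env.sh; the value is illustrative only (HADOOP_HEAPSIZE is specified in MB and applies to the daemons started from that configuration):
# conf/hadoop-env.sh: raise the default 1000 MB daemon heap
export HADOOP_HEAPSIZE=4000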
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in the figure below. Typically there are 30 to 40 servers per rack, with a 1 Gb (gigabit) switch for the rack (only three are shown in the diagram), and an uplink to a core switch or router (which is normally 1 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
[Figure: Typical two-level network architecture for a Hadoop cluster]
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the topology of your network. If your cluster runs on a single rack, then there is nothing more to do, since this is the default. However, for multirack clusters, you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. HDFS will be able to place replicas more intelligently to trade off performance and resilience.
Network locations such as nodes and racks are represented in a tree, which reflects the network "distance" between locations. The namenode uses the network location when determining where to place block replicas. The jobtracker uses network location to determine where the closest replica is, as input for a map task that is scheduled to run on a tasktracker.
For the network in the figure above, the rack topology is described by two network locations, say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in this cluster, the locations can be simplified to /rack1 and /rack2.
The Hadoop configuration must specify a map between node addresses and network locations. The map is described by a Java interface, DNSToSwitchMapping, whose signature is:
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.
For the network in our example, we would map node1, node2, and node3 to /rack1, and node4, node5, and node6 to /rack2.
Most installations don't need to implement the interface themselves, however, since the default implementation is ScriptBasedMapping, which runs a user-defined script to determine the mapping. The script's location is controlled by the property topology.script.file.name. The script must accept a variable number of arguments that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The Hadoop wiki has an example at http://wiki.apache.org/hadoop/topology_rack_awareness_scripts.
If no script location is specified, the default behavior is to map all nodes to a single network location, called /default-rack.
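For illustration only, here is a minimal sketch of such a topology script, hard-coding the hypothetical node1-node6 mapping from the example above; a real script would usually look the mapping up in a file or derive it from the IP address:
#!/bin/bash
# Print one rack location per hostname/IP argument, separated by whitespace.
out=""
for host in "$@"; do
  case "$host" in
    node1|node2|node3) out="$out /rack1" ;;
    node4|node5|node6) out="$out /rack2" ;;
    *)                 out="$out /default-rack" ;;
  esac
done
echo $out
Make the script executable and point the topology.script.file.name property at it (typically in core-site.xml).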
Cluster Setup and Installation
Your hardware has arrived. The next steps are to get it racked up and install the software needed to run Hadoop.
There are various ways to install and configure Hadoop. This section describes how to do it from scratch using the Apache Hadoop distribution, and will give you the background to cover the things you need to think about when setting up Hadoop. Alternatively, if you would like to use RPMs or Debian packages for managing your Hadoop installation, then you might want to start with Cloudera's Distribution.
To ease the burden of installing and maintaining the same software on each node, it is normal to use an automated installation method like Red Hat Linux's Kickstart or Debian's Fully Automatic Installation. These tools allow you to automate the operating system installation by recording the answers to questions that are asked during the installation process (such as the disk partition layout), as well as which packages to install. Crucially, they also provide hooks to run scripts at the end of the process, which are invaluable for doing final system tweaks and customization that is not covered by the standard installer.
The following sections describe the customizations that are needed to run Hadoop. These should all be added to the installation script.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred option, although Java distributions from other vendors may work, too. The following command confirms that Java was installed correctly:
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
It's good practice to create a dedicated Hadoop user account to separate the Hadoop installation from other services running on the same machine.
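On a typical Linux system this can be as simple as the following sketch (the hadoop user and group names are conventions, not requirements):
% sudo groupadd hadoop
% sudo useradd -g hadoop -m hadoop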
Some cluster administrators choose to make this user's home directory an NFS-mounted drive, to aid with SSH key distribution (see the following discussion). The NFS server is typically outside the Hadoop cluster. If you use NFS, it is worth considering autofs, which allows you to mount the NFS filesystem on demand, when the system accesses it. Autofs provides some protection against the NFS server failing and allows you to use replicated filesystems for failover. There are other NFS gotchas to watch out for, such as synchronizing UIDs and GIDs. For help setting up NFS on Linux, refer to the HOWTO at http://nfs.sourceforge.net/nfs-howto/index.html.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page (http://hadoop.apache.org/core/releases.html), and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed in the hadoop user's home directory, as that may be an NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
Testing the Installation
Once you've created the installation file, you are ready to test it by installing it on the machines in your cluster. This will probably take a few iterations as you discover kinks in the install. When it's working, you can proceed to configure Hadoop and give it a test run. This process is documented in the following sections.
SSH Configuration
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the cluster. Note that the control scripts are optional; cluster-wide operations can be performed by other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair and share it across the cluster using NFS.
%
ssh-keygen -t rsa -f ~/.ssh/id_rsa
Even
though we want password-less logins, keys without passphrases are not
considered
good
practice (it’s OK to have an empty passphrase when running a local
pseudodistributed cluster), so we specify a passphrase when prompted
for one. We shall use ssh-agent to avoid the need to enter a
password for each connection.
The
private key is in the file specified by the -f option,
~/.ssh/id_rsa, and the public key is stored in a
file with the same name with .pub appended,
~/.ssh/id_rsa.pub.
Next we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the hadoop user's home directory is an NFS filesystem, as described earlier, then the keys can be shared across the cluster by typing:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, then the public keys will need to be shared by some other means.
Test that you can SSH from the master to a worker machine by making sure ssh-agent is running, and then run ssh-add to store your passphrase. You should be able to ssh to a worker without entering the passphrase again.
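As a quick sketch, assuming a worker called worker01 (a placeholder hostname), the sequence looks like this:
% eval "$(ssh-agent -s)"
% ssh-add ~/.ssh/id_rsa
% ssh worker01 hostname
The first command starts the agent, the second caches the passphrase, and the final command should print the worker's hostname without prompting for anything.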
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation; the most important ones are listed below.
Hadoop configuration files:
Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker, and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java properties | Properties for controlling how metrics are published in Hadoop.
log4j.properties | Java properties | Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process.
These files are all found in the conf directory of the Hadoop distribution. The configuration directory can be relocated to another part of the filesystem (outside the Hadoop installation, which makes upgrades marginally easier) as long as daemons are started with the --config option specifying the location of this directory on the local filesystem.