Thursday, 24 April 2014

Hadoop


    What is Hadoop?
  1. The Apache Hadoop project develops open source software for reliable, scalable, distributed computing.
  2. Hadoop is a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.
  3. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
  4. Imagine this scenario:
    You have 1GB of data that you need to process.
    The data is stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB, and then 100GB. You start to reach the limits of your current desktop computer, so you scale up by investing in a larger computer, and you are OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible.
    What should you do? Hadoop may be the answer!
  5. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google’s MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate.
  6. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers.
  7. The key distinctions of Hadoop are that it is:
    Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
    Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
    Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
    Simple—Hadoop allows users to quickly write efficient parallel code.

Hadoop and its family of related projects are described briefly here:

  • Common
A set of components and interfaces for distributed file systems and general I/O
(serialization, Java RPC, persistent data structures).
  • Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
  • MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines (a minimal word-count sketch in Java follows this list).
  • HDFS
A distributed file system that runs on large clusters of commodity machines.
  • Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
  • Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL (and which is translated by the runtime engine to
MapReduce jobs) for querying the data.
  • HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
  • ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
  • Sqoop
A tool for efficiently moving data between relational databases and HDFS.
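
To make the MapReduce model more concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. The class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself, and a small driver class using the Job API would still be needed to submit them as a job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}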

  • Why Hadoop?

  1. Need to process multi-petabyte datasets.
  2. Data may not have a strict schema.
  3. It is expensive to build reliability into each application.
  4. Nodes fail every day.
  5. Failure is expected, rather than exceptional.
  6. The number of nodes in a cluster is not constant.
  7. Efficient, reliable, and open source under the Apache License.

What is Hadoop used for?

  • Search (Yahoo!, Amazon).
  • Log processing (Facebook, Yahoo!).
  • Recommendation systems (Facebook).
  • Data warehousing (Facebook).
  • Video and image analysis.

  • Difference between Hadoop and RDBMS
  • Hadoop works with structured, semi-structured, and unstructured data; RDBMS systems mostly depend on structured data with a known schema.
  • Hadoop is read-oriented (write once, read many times); RDBMS systems both read and update data.
  • Hadoop is batch-oriented; RDBMS systems are strong in transactional processing.
  • RDBMS systems also offer strong security and sophisticated data compression, which Hadoop lacks.

  • Limitations of Hadoop
  1. Not suited to processing transactions (random access).
  2. Not good when the work cannot be parallelized.
  3. Not good for low-latency data access.
  4. Not good for processing lots of small files.
  5. Not good for intensive calculations with little data.

  • Installing Apache Hadoop

    1. Prerequisites
    • Java version 6 or later on your machine.
    • Hadoop runs on Unix and on Windows.
    • Linux is the only supported production platform.
    • Windows is only supported as a development platform, and additionally requires Cygwin to run.
    • During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.
  • Installation
    Download a stable release, which is packaged as a gzipped tar file, from the Apache Hadoop releases page and unpack it somewhere on your file system:
% tar xzf hadoop-x.y.z.tar.gz
Before you can run Hadoop, you need to tell it where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don’t have to configure anything further. (It is often set in a shell startup file, such as ~/.bash_profile or ~/.bashrc.) Otherwise, you can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable. For example, on my Mac, I changed the line to read:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

to point to version 1.6.0 of Java. On Ubuntu, the equivalent line is:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
It’s very convenient to create an environment variable that points to the Hadoop installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on
your command-line path.
For example:

% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin

Check that Hadoop runs by typing:
% hadoop version
Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010

Configuration
Each component in Hadoop is configured using an XML file. Common properties go
in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in
mapred-site.xml. These files are all located in the conf subdirectory.
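
As a quick sanity check of this configuration, the short sketch below (an illustrative class name, not part of Hadoop's tooling) uses Hadoop's org.apache.hadoop.conf.Configuration class, which loads core-site.xml from the classpath, to print the configured default filesystem:

import org.apache.hadoop.conf.Configuration;

// Prints the value of fs.default.name from core-site.xml,
// falling back to the local filesystem if the property is not set.
public class ShowDefaultFs {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // reads core-default.xml, then core-site.xml
        System.out.println(conf.get("fs.default.name", "file:///"));
    }
}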

Modes of Running

Hadoop can be run in one of three modes:

  • Standalone (or local) mode
There are no daemons running and everything runs in a single JVM. Standalone
mode is suitable for running MapReduce programs during development, since it
is easy to test and debug them.
  • Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a
small scale.
  • Fully distributed mode
The Hadoop daemons run on a cluster of machines.
  • Standalone Mode
In standalone mode, there is no further action to take, since the default properties are set for standalone mode, and there are no daemons to run.
  • Pseudo-Distributed Mode
The configuration files should be created with the following contents and placed in the conf directory (although you can place configuration files in any directory as long as you start the daemons with the --config option):
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Configuring SSH
In pseudo-distributed mode, we have to start daemons, and to do that, we need to have
SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and
fully distributed modes: it merely starts daemons on the set of hosts in the cluster
(defined by the slaves file) by SSH-ing to each host and starting a daemon process.
Pseudo-distributed mode is just a special case of fully distributed mode in which the
(single) host is localhost, so we need to make sure that we can SSH to localhost and log
in without having to enter a password.
First, make sure that SSH is installed and a server is running. On Ubuntu, for example,
this is achieved with:
% sudo apt-get install ssh

On Windows with Cygwin
you can set up an SSH server (after having installed the openssh package) by running
ssh-host-config -y.
On Mac OS X, make sure Remote Login (under System Preferences, Sharing) is enabled for the current user (or all users).
Then to enable password-less login, generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful, you should not have to type in a password.


Formatting the HDFS filesystem
Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting
process creates an empty filesystem by creating the storage directories and the
initial versions of the namenode’s persistent data structures. Datanodes are not involved
in the initial formatting process, since the namenode manages all of the filesystem’s
metadata, and datanodes can join or leave the cluster dynamically. For the same
reason, you don’t need to say how large a filesystem to create, since this is determined
by the number of datanodes in the cluster, which can be increased as needed, long after
the filesystem was formatted.
Formatting HDFS is quick to do. Just type the following:
% hadoop namenode -format
Starting and stopping the daemons
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh
If you have placed configuration files outside the default conf directory,
start the daemons with the --config option, which takes an absolute
path to the configuration directory:
% start-dfs.sh --config path-to-config-directory
% start-mapred.sh --config path-to-config-directory
Three daemons will be started on your local machine: a namenode, a secondary namenode,
and a datanode. You can check whether the daemons started successfully
by looking at the logfiles in the logs directory (in the Hadoop installation directory), or
by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at
http://localhost:50070/ for the namenode. You can also use Java’s jps command to see
whether they are running.
Stopping the daemons is done in the obvious way:
% stop-dfs.sh
% stop-mapred.sh

Fully Distributed Mode
Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied
to expensive, proprietary offerings from a single vendor; rather, you can choose standardized,
commonly available hardware from any of a large range of vendors to build
your cluster.
“Commodity” does not mean “low-end.” Low-end machines often have cheap components,
which have higher failure rates than more expensive (but still commodity-class)
machines. When you are operating tens, hundreds, or thousands of machines,
cheap components turn out to be a false economy, as the higher failure rate incurs a
greater maintenance cost. On the other hand, large database class machines are not
recommended either, since they don’t score well on the price/performance curve. And
even though you would need fewer of them to build a cluster of comparable performance
to one built of mid-range commodity hardware, when one did fail it would have
a bigger impact on the cluster, since a larger proportion of the cluster hardware would
be unavailable.
Hardware specifications rapidly become obsolete, but for the sake of illustration, a
typical choice of machine for running a Hadoop datanode and tasktracker in mid-2010
would have the following specifications:
Processor: 2 quad-core 2-2.5 GHz CPUs
Memory: 16-24 GB ECC RAM
Storage: 4 × 1 TB SATA disks
Network: Gigabit Ethernet

While the hardware specification for your cluster will assuredly be different, Hadoop
is designed to use multiple cores and disks, so it will be able to take full advantage of
more powerful hardware.
The bulk of Hadoop is written in Java, and can therefore run on any platform with a
JVM, although there are enough parts that harbor Unix assumptions (the control
scripts, for example) to make it unwise to run on a non-Unix platform in production.
In fact, Windows operating systems are not supported production platforms (although
they can be used with Cygwin as a development platform).
How large should your cluster be? There isn’t an exact answer to this question, but the
beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow it
as your storage and computational needs grow. In many ways, a better question is this:
how fast does my cluster need to grow? You can get a good feel for this by considering
storage capacity.
For example, if your data grows by 1 TB a week, and you have three-way HDFS replication,
then you need an additional 3 TB of raw storage per week. Allow some room
for intermediate files and logfiles (around 30%, say), and this works out at about one
machine (2010 vintage) per week, on average. In practice, you wouldn’t buy a new
machine each week and add it to the cluster. The value of doing a back-of-the-envelope
calculation like this is that it gives you a feel for how big your cluster should be: in this
example, a cluster that holds two years of data needs 100 machines.
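
The arithmetic behind that estimate is easy to reproduce; the sketch below simply restates the back-of-the-envelope calculation from the paragraph above, assuming roughly 4 TB of usable disk per node (the 4 × 1 TB specification given earlier):

// Back-of-the-envelope cluster sizing for the example above.
public class ClusterSizing {
    public static void main(String[] args) {
        double weeklyGrowthTB = 1.0;  // new data per week
        int replication = 3;          // HDFS replication factor
        double overhead = 0.30;       // room for intermediate files and logfiles
        double diskPerNodeTB = 4.0;   // 4 x 1 TB disks per datanode (2010 vintage)

        double rawPerWeekTB = weeklyGrowthTB * replication * (1 + overhead); // ~3.9 TB/week
        double nodesPerWeek = rawPerWeekTB / diskPerNodeTB;                  // ~1 machine/week
        double nodesForTwoYears = nodesPerWeek * 104;                        // ~100 machines

        System.out.printf("Raw storage per week: %.1f TB%n", rawPerWeekTB);
        System.out.printf("Machines needed per week: %.2f%n", nodesPerWeek);
        System.out.printf("Machines for two years of data: %.0f%n", nodesForTwoYears);
    }
}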
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the jobtracker on a single master machine (as long as at least one copy of the namenode’s metadata is stored on a remote filesystem). As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
The secondary namenode can be run on the same machine as the namenode, but again
for reasons of memory usage (the secondary has the same memory requirements as the
primary), it is best to run it on a separate piece of hardware, especially for larger clusters.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker. On a small
cluster (a few tens of nodes), it is convenient to put them on a single machine; however,
as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for
the entire namespace in memory. The secondary namenode, while idle most of the time,
has a comparable memory footprint to the primary when it creates a checkpoint.
For filesystems with a large number of files, there may not be enough physical memory on one machine to run both the primary and secondary namenode.

The secondary namenode keeps a copy of the latest checkpoint of the filesystem metadata that it creates. Keeping this (stale) backup on a different node to the namenode allows recovery in the event of loss (or corruption) of all the namenode’s metadata files. Machines running the namenode should typically run on 64-bit hardware to avoid the 3 GB limit on Java heap size in 32-bit architectures.
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in the figure below. Typically there are 30 to 40 servers per rack, with a gigabit switch for the rack (only three racks are shown in the diagram), and an uplink to a core switch or router (which is normally gigabit or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
Figure: Typical two-level network architecture for a Hadoop cluster
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop so
that it knows the topology of your network. If your cluster runs on a single rack, then
there is nothing more to do, since this is the default. However, for multirack clusters,
you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers
(where there is more bandwidth available) to off-rack transfers when placing
MapReduce tasks on nodes. HDFS will be able to place replicas more intelligently to
trade off performance and resilience.
Network locations such as nodes and racks are represented in a tree, which reflects the
network “distance” between locations. The namenode uses the network location when
determining where to place block replicas. The jobtracker uses network location to determine where the closest
replica is as input for a map task that is scheduled to run on a tasktracker.
For the network in the figure above, the rack topology is described by two network locations,
say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in this
cluster, the locations can be simplified to /rack1 and /rack2.
The Hadoop configuration must specify a map between node addresses and network
locations. The map is described by a Java interface, DNSToSwitchMapping, whose
signature is:


public interface DNSToSwitchMapping {
    public List<String> resolve(List<String> names);
}

The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.
For the network in our example, we would map node1, node2, and node3 to /rack1,
and node4, node5, and node6 to /rack2.
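
For illustration only, a hard-coded implementation of DNSToSwitchMapping for this six-node example might look like the sketch below (the class name ExampleRackMapping is hypothetical; real clusters almost always rely on the default ScriptBasedMapping described next rather than writing their own class):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Hard-coded rack mapping for the example: node1-node3 on /rack1, node4-node6 on /rack2.
public class ExampleRackMapping implements DNSToSwitchMapping {
    private static final Map<String, String> RACKS = new HashMap<String, String>();
    static {
        RACKS.put("node1", "/rack1");
        RACKS.put("node2", "/rack1");
        RACKS.put("node3", "/rack1");
        RACKS.put("node4", "/rack2");
        RACKS.put("node5", "/rack2");
        RACKS.put("node6", "/rack2");
    }

    public List<String> resolve(List<String> names) {
        List<String> locations = new ArrayList<String>(names.size());
        for (String name : names) {
            String rack = RACKS.get(name);
            locations.add(rack != null ? rack : "/default-rack"); // unknown hosts fall back
        }
        return locations;
    }
}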

Most installations don’t need to implement the interface themselves, however, since
the default implementation is ScriptBasedMapping, which runs a user-defined script to
determine the mapping. The script’s location is controlled by the property
topology.script.file.name. The script must accept a variable number of arguments
that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The Hadoop wiki has an example at http://wiki.apache.org/hadoop/topology_rack_awareness_scripts.
If no script location is specified, the default behavior is to map all nodes to a single
network location, called /default-rack.

Cluster Setup and Installation
Your hardware has arrived. The next steps are to get it racked up and install the software
needed to run Hadoop.
There are various ways to install and configure Hadoop. This chapter describes how
to do it from scratch using the Apache Hadoop distribution, and will give you the
background to cover the things you need to think about when setting up Hadoop.
Alternatively, if you would like to use RPMs or Debian packages for managing your
Hadoop installation, then you might want to start with Cloudera’s Distribution for Hadoop.
To ease the burden of installing and maintaining the same software on each node, it is
normal to use an automated installation method like Red Hat Linux’s Kickstart or
Debian’s Fully Automatic Installation. These tools allow you to automate the operating
system installation by recording the answers to questions that are asked during the
installation process (such as the disk partition layout), as well as which packages to
install. Crucially, they also provide hooks to run scripts at the end of the process, which
are invaluable for doing final system tweaks and customization that is not covered by
the standard installer.
The following sections describe the customizations that are needed to run Hadoop.
These should all be added to the installation script.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred
option, although Java distributions from other vendors may work, too. The following
command confirms that Java was installed correctly:
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
It’s good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
Some cluster administrators choose to make this user’s home directory an NFS-mounted
drive, to aid with SSH key distribution (see the following discussion). The
NFS server is typically outside the Hadoop cluster. If you use NFS, it is worth considering
autofs, which allows you to mount the NFS filesystem on demand, when the
system accesses it. Autofs provides some protection against the NFS server failing and
allows you to use replicated filesystems for failover. There are other NFS gotchas to
watch out for, such as synchronizing UIDs and GIDs. For help setting up NFS on Linux,
refer to the HOWTO at http://nfs.sourceforge.net/nfs-howto/index.html.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page (http://hadoop.apache.org/
core/releases.html), and unpack the contents of the distribution in a sensible location,
such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed
in the hadoop user’s home directory, as that may be an NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z

Testing the Installation
Once you’ve created the installation file, you are ready to test it by installing it on the
machines in your cluster. This will probably take a few iterations as you discover kinks
in the install. When it’s working, you can proceed to configure Hadoop and give it a
test run. This process is documented in the following sections.
SSH Configuration
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example,
there is a script for stopping and starting all the daemons in the cluster. Note
that the control scripts are optional—cluster-wide operations can be performed by
other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to generate
a public/private key pair and share it across the cluster using NFS.
First, generate an RSA key pair by typing the following in the hadoop user account:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa

Even though we want password-less logins, keys without passphrases are not considered
good practice (it’s OK to have an empty passphrase when running a local pseudo-distributed cluster), so we specify a passphrase when prompted for one. We shall use ssh-agent to avoid the need to enter a password for each connection.
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is stored in a file with the same name with .pub appended, ~/.ssh/id_rsa.pub.
Next we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the hadoop user’s home
directory is an NFS filesystem, as described earlier, then the keys can be shared across
the cluster by typing:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, then the public keys will need to be
shared by some other means.
Test that you can SSH from the master to a worker machine by making sure ssh-agent
is running, and then run ssh-add to store your passphrase. You should be able
to ssh to a worker without entering the passphrase again.
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed below.
Hadoop configuration files:

hadoop-env.sh (Bash script): Environment variables that are used in the scripts to run Hadoop.
core-site.xml (Hadoop configuration XML): Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml (Hadoop configuration XML): Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml (Hadoop configuration XML): Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.
masters (Plain text): A list of machines (one per line) that each run a secondary namenode.
slaves (Plain text): A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties (Java properties): Properties for controlling how metrics are published in Hadoop.
log4j.properties (Java properties): Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process.

These files are all found in the conf directory of the Hadoop distribution. The configuration
directory can be relocated to another part of the filesystem (outside the Hadoop
installation, which makes upgrades marginally easier) as long as daemons are started
with the --config option specifying the location of this directory on the local filesystem.
