Thursday, 24 April 2014

Hadoop


    What is Hadoop?
  1. The Apache Hadoop project develops open source software for reliable, scalable, distributed computing.
  2. Hadoop is a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.
  3. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
  4. Imagine this scenario:
    You have 1GB of data that you need to process.
    The data is stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB, and then 100GB. You start to reach the limits of your current desktop computer, so you scale up by investing in a larger computer, and you are OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible.
    What should you do? Hadoop may be the answer!
  5. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google’s MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate.
  6. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers.
  7. The key distinctions of Hadoop are that it is:
    Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
    Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
    Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
    Simple—Hadoop allows users to quickly write efficient parallel code.

Hadoop and its family of related projects are described briefly here:

  • Common
A set of components and interfaces for distributed file systems and general I/O
(serialization, Java RPC, persistent data structures).
  • Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
  • MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines (a minimal word-count sketch in Java follows this list).
  • HDFS
A distributed file system that runs on large clusters of commodity machines.
  • Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
  • Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL (and which is translated by the runtime engine to
MapReduce jobs) for querying the data.
  • HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
  • ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
  • Sqoop
A tool for efficiently moving data between relational databases and HDFS.
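
To make the MapReduce model more concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. The class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself, and a small driver class using the Job API would still be needed to submit them as a job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}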

  • Why Hadoop?

  1. Need to process multi-petabyte datasets.
  2. Data may not have a strict schema.
  3. It is expensive to build reliability into each application.
  4. Nodes fail every day.
  5. Failure is expected, rather than exceptional.
  6. The number of nodes in a cluster is not constant.
  7. Efficient, reliable, and open source under the Apache License.

What is Hadoop used for?

  • Search (Yahoo!, Amazon).
  • Log processing (Facebook, Yahoo!).
  • Recommendation systems (Facebook).
  • Data warehousing (Facebook).
  • Video and image analysis.

  • Difference between Hadoop and RDBMS
  • Hadoop works with structured, semi-structured, and unstructured data; RDBMS systems mostly depend on structured data with a known schema.
  • Hadoop is read-oriented (write once, read many times); RDBMS systems both read and update data.
  • Hadoop is batch-oriented; RDBMS systems are strong in transactional processing.
  • RDBMS systems also offer strong security and sophisticated data compression, which Hadoop lacks.

  • Limitations of Hadoop
  1. Not suited to processing transactions (random access).
  2. Not good when the work cannot be parallelized.
  3. Not good for low-latency data access.
  4. Not good for processing lots of small files.
  5. Not good for intensive calculations with little data.

  • Installing Apache Hadoop

    1. Prerequisites
    • Java version 6 or later on your machine.
    • Hadoop runs on Unix and on Windows.
    • Linux is the only supported production platform.
    • Windows is only supported as a development platform, and additionally requires Cygwin to run.
    • During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.
  • Installation
    Download a stable release, which is packaged as a gzipped tar file, from the Apache Hadoop releases page and unpack it somewhere on your file system:
% tar xzf hadoop-x.y.z.tar.gz
Before you can run Hadoop, you need to tell it where Java is located on your system. If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don’t have to configure anything further. (It is often set in a shell startup file, such as ~/.bash_profile or ~/.bashrc.) Otherwise, you can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable. For example, on my Mac, I changed the line to read:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

to point to version 1.6.0 of Java. On Ubuntu, the equivalent line is:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
It’s very convenient to create an environment variable that points to the Hadoop installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on
your command-line path.
For example:

% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin

Check that Hadoop runs by typing:
% hadoop version
Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010

Configuration
Each component in Hadoop is configured using an XML file. Common properties go
in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in
mapred-site.xml. These files are all located in the conf subdirectory.
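
As a quick sanity check of this configuration, the short sketch below (an illustrative class name, not part of Hadoop's tooling) uses Hadoop's org.apache.hadoop.conf.Configuration class, which loads core-site.xml from the classpath, to print the configured default filesystem:

import org.apache.hadoop.conf.Configuration;

// Prints the value of fs.default.name from core-site.xml,
// falling back to the local filesystem if the property is not set.
public class ShowDefaultFs {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // reads core-default.xml, then core-site.xml
        System.out.println(conf.get("fs.default.name", "file:///"));
    }
}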

Modes of Running

Hadoop can be run in one of three modes:

  • Standalone (or local) mode
There are no daemons running and everything runs in a single JVM. Standalone
mode is suitable for running MapReduce programs during development, since it
is easy to test and debug them.
  • Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a
small scale.
  • Fully distributed mode
The Hadoop daemons run on a cluster of machines.
  • Standalone Mode
In standalone mode, there is no further action to take, since the default properties are set for standalone mode, and there are no daemons to run.
  • Pseudo-Distributed Mode
The configuration files should be created with the following contents and placed in the conf directory (although you can place configuration files in any directory as long as you start the daemons with the --config option):
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Configuring SSH
In pseudo-distributed mode, we have to start daemons, and to do that, we need to have
SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and
fully distributed modes: it merely starts daemons on the set of hosts in the cluster
(defined by the slaves file) by SSH-ing to each host and starting a daemon process.
Pseudo-distributed mode is just a special case of fully distributed mode in which the
(single) host is localhost, so we need to make sure that we can SSH to localhost and log
in without having to enter a password.
First, make sure that SSH is installed and a server is running. On Ubuntu, for example,
this is achieved with:
% sudo apt-get install ssh

On Windows with Cygwin
you can set up an SSH server (after having installed the openssh package) by running
ssh-host-config -y.
On Mac OS X, make sure Remote Login (under System Preferences, Sharing) is enabled for the current user (or all users).
Then to enable password-less login, generate a new SSH key with an empty passphrase:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test this with:
% ssh localhost
If successful, you should not have to type in a password.


Formatting the HDFS filesystem
Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting
process creates an empty filesystem by creating the storage directories and the
initial versions of the namenode’s persistent data structures. Datanodes are not involved
in the initial formatting process, since the namenode manages all of the filesystem’s
metadata, and datanodes can join or leave the cluster dynamically. For the same
reason, you don’t need to say how large a filesystem to create, since this is determined
by the number of datanodes in the cluster, which can be increased as needed, long after
the filesystem was formatted.
Formatting HDFS is quick to do. Just type the following:
% hadoop namenode -format
Starting and stopping the daemons
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh
If you have placed configuration files outside the default conf directory,
start the daemons with the --config option, which takes an absolute
path to the configuration directory:
% start-dfs.sh --config path-to-config-directory
% start-mapred.sh --config path-to-config-directory
Three daemons will be started on your local machine: a namenode, a secondary namenode,
and a datanode. You can check whether the daemons started successfully
by looking at the logfiles in the logs directory (in the Hadoop installation directory), or
by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at
http://localhost:50070/ for the namenode. You can also use Java’s jps command to see
whether they are running.
Stopping the daemons is done in the obvious way:
% stop-dfs.sh
% stop-mapred.sh

Fully Distributed Mode
Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied
to expensive, proprietary offerings from a single vendor; rather, you can choose standardized,
commonly available hardware from any of a large range of vendors to build
your cluster.
“Commodity” does not mean “low-end.” Low-end machines often have cheap components,
which have higher failure rates than more expensive (but still commodity-class)
machines. When you are operating tens, hundreds, or thousands of machines,
cheap components turn out to be a false economy, as the higher failure rate incurs a
greater maintenance cost. On the other hand, large database class machines are not
recommended either, since they don’t score well on the price/performance curve. And
even though you would need fewer of them to build a cluster of comparable performance
to one built of mid-range commodity hardware, when one did fail it would have
a bigger impact on the cluster, since a larger proportion of the cluster hardware would
be unavailable.
Hardware specifications rapidly become obsolete, but for the sake of illustration, a
typical choice of machine for running a Hadoop datanode and tasktracker in mid-2010
would have the following specifications:
Processor: 2 quad-core 2-2.5 GHz CPUs
Memory: 16-24 GB ECC RAM
Storage: 4 × 1 TB SATA disks
Network: Gigabit Ethernet

While the hardware specification for your cluster will assuredly be different, Hadoop
is designed to use multiple cores and disks, so it will be able to take full advantage of
more powerful hardware.
The bulk of Hadoop is written in Java, and can therefore run on any platform with a
JVM, although there are enough parts that harbor Unix assumptions (the control
scripts, for example) to make it unwise to run on a non-Unix platform in production.
In fact, Windows operating systems are not supported production platforms (although
they can be used with Cygwin as a development platform).
How large should your cluster be? There isn’t an exact answer to this question, but the
beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow it
as your storage and computational needs grow. In many ways, a better question is this:
how fast does my cluster need to grow? You can get a good feel for this by considering
storage capacity.
For example, if your data grows by 1 TB a week, and you have three-way HDFS replication,
then you need an additional 3 TB of raw storage per week. Allow some room
for intermediate files and logfiles (around 30%, say), and this works out at about one
machine (2010 vintage) per week, on average. In practice, you wouldn’t buy a new
machine each week and add it to the cluster. The value of doing a back-of-the-envelope
calculation like this is that it gives you a feel for how big your cluster should be: in this
example, a cluster that holds two years of data needs 100 machines.
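
The arithmetic behind that estimate is easy to reproduce; the sketch below simply restates the back-of-the-envelope calculation from the paragraph above, assuming roughly 4 TB of usable disk per node (the 4 × 1 TB specification given earlier):

// Back-of-the-envelope cluster sizing for the example above.
public class ClusterSizing {
    public static void main(String[] args) {
        double weeklyGrowthTB = 1.0;  // new data per week
        int replication = 3;          // HDFS replication factor
        double overhead = 0.30;       // room for intermediate files and logfiles
        double diskPerNodeTB = 4.0;   // 4 x 1 TB disks per datanode (2010 vintage)

        double rawPerWeekTB = weeklyGrowthTB * replication * (1 + overhead); // ~3.9 TB/week
        double nodesPerWeek = rawPerWeekTB / diskPerNodeTB;                  // ~1 machine/week
        double nodesForTwoYears = nodesPerWeek * 104;                        // ~100 machines

        System.out.printf("Raw storage per week: %.1f TB%n", rawPerWeekTB);
        System.out.printf("Machines needed per week: %.2f%n", nodesPerWeek);
        System.out.printf("Machines for two years of data: %.0f%n", nodesForTwoYears);
    }
}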
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the jobtracker on a single master machine (as long as at least one copy of the namenode’s metadata is stored on a remote filesystem). As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
The secondary namenode can be run on the same machine as the namenode, but again
for reasons of memory usage (the secondary has the same memory requirements as the
primary), it is best to run it on a separate piece of hardware, especially for larger clusters.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker. On a small
cluster (a few tens of nodes), it is convenient to put them on a single machine; however,
as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for
the entire namespace in memory. The secondary namenode, while idle most of the time,
has a comparable memory footprint to the primary when it creates a checkpoint.
For filesystems with a large number of files, there may not be enough physical memory on one machine to run both the primary and secondary namenode.

The secondary namenode keeps a copy of the latest checkpoint of the filesystem metadata that it creates. Keeping this (stale) backup on a different node to the namenode allows recovery in the event of loss (or corruption) of all the namenode’s metadata files. Machines running the namenode should typically run on 64-bit hardware to avoid the 3 GB limit on Java heap size in 32-bit architectures.
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in the figure below. Typically there are 30 to 40 servers per rack, with a gigabit switch for the rack (only three racks are shown in the diagram), and an uplink to a core switch or router (which is normally gigabit or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
Figure: Typical two-level network architecture for a Hadoop cluster
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop so
that it knows the topology of your network. If your cluster runs on a single rack, then
there is nothing more to do, since this is the default. However, for multirack clusters,
you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers
(where there is more bandwidth available) to off-rack transfers when placing
MapReduce tasks on nodes. HDFS will be able to place replicas more intelligently to
trade off performance and resilience.
Network locations such as nodes and racks are represented in a tree, which reflects the
network “distance” between locations. The namenode uses the network location when
determining where to place block replicas. The jobtracker uses network location to determine where the closest
replica is as input for a map task that is scheduled to run on a tasktracker.
For the network in the figure above, the rack topology is described by two network locations,
say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in this
cluster, the locations can be simplified to /rack1 and /rack2.
The Hadoop configuration must specify a map between node addresses and network
locations. The map is described by a Java interface, DNSToSwitchMapping, whose
signature is:


public interface DNSToSwitchMapping {
    public List<String> resolve(List<String> names);
}

The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.
For the network in our example, we would map node1, node2, and node3 to /rack1,
and node4, node5, and node6 to /rack2.
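
For illustration only, a hard-coded implementation of DNSToSwitchMapping for this six-node example might look like the sketch below (the class name ExampleRackMapping is hypothetical; real clusters almost always rely on the default ScriptBasedMapping described next rather than writing their own class):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Hard-coded rack mapping for the example: node1-node3 on /rack1, node4-node6 on /rack2.
public class ExampleRackMapping implements DNSToSwitchMapping {
    private static final Map<String, String> RACKS = new HashMap<String, String>();
    static {
        RACKS.put("node1", "/rack1");
        RACKS.put("node2", "/rack1");
        RACKS.put("node3", "/rack1");
        RACKS.put("node4", "/rack2");
        RACKS.put("node5", "/rack2");
        RACKS.put("node6", "/rack2");
    }

    public List<String> resolve(List<String> names) {
        List<String> locations = new ArrayList<String>(names.size());
        for (String name : names) {
            String rack = RACKS.get(name);
            locations.add(rack != null ? rack : "/default-rack"); // unknown hosts fall back
        }
        return locations;
    }
}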

Most installations don’t need to implement the interface themselves, however, since
the default implementation is ScriptBasedMapping, which runs a user-defined script to
determine the mapping. The script’s location is controlled by the property
topology.script.file.name. The script must accept a variable number of arguments
that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The Hadoop wiki has an example at http://wiki.apache.org/hadoop/topology_rack_awareness_scripts.
If no script location is specified, the default behavior is to map all nodes to a single
network location, called /default-rack.

Cluster Setup and Installation
Your hardware has arrived. The next steps are to get it racked up and install the software
needed to run Hadoop.
There are various ways to install and configure Hadoop. This chapter describes how
to do it from scratch using the Apache Hadoop distribution, and will give you the
background to cover the things you need to think about when setting up Hadoop.
Alternatively, if you would like to use RPMs or Debian packages for managing your
Hadoop installation, then you might want to start with Cloudera’s Distribution for Hadoop.
To ease the burden of installing and maintaining the same software on each node, it is
normal to use an automated installation method like Red Hat Linux’s Kickstart or
Debian’s Fully Automatic Installation. These tools allow you to automate the operating
system installation by recording the answers to questions that are asked during the
installation process (such as the disk partition layout), as well as which packages to
install. Crucially, they also provide hooks to run scripts at the end of the process, which
are invaluable for doing final system tweaks and customization that is not covered by
the standard installer.
The following sections describe the customizations that are needed to run Hadoop.
These should all be added to the installation script.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred
option, although Java distributions from other vendors may work, too. The following
command confirms that Java was installed correctly:
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
It’s good practice to create a dedicated Hadoop user account to separate the Hadoop
installation from other services running on the same machine.
Some cluster administrators choose to make this user’s home directory an NFS-mounted
drive, to aid with SSH key distribution (see the following discussion). The
NFS server is typically outside the Hadoop cluster. If you use NFS, it is worth considering
autofs, which allows you to mount the NFS filesystem on demand, when the
system accesses it. Autofs provides some protection against the NFS server failing and
allows you to use replicated filesystems for failover. There are other NFS gotchas to
watch out for, such as synchronizing UIDs and GIDs. For help setting up NFS on Linux,
refer to the HOWTO at http://nfs.sourceforge.net/nfs-howto/index.html.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page (http://hadoop.apache.org/
core/releases.html), and unpack the contents of the distribution in a sensible location,
such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed
in the hadoop user’s home directory, as that may be an NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z

Testing the Installation
Once you’ve created the installation file, you are ready to test it by installing it on the
machines in your cluster. This will probably take a few iterations as you discover kinks
in the install. When it’s working, you can proceed to configure Hadoop and give it a
test run. This process is documented in the following sections.
SSH Configuration
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example,
there is a script for stopping and starting all the daemons in the cluster. Note
that the control scripts are optional—cluster-wide operations can be performed by
other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to generate
a public/private key pair and share it across the cluster using NFS.
First, generate an RSA key pair by typing the following in the hadoop user account:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa

Even though we want password-less logins, keys without passphrases are not considered
good practice (it’s OK to have an empty passphrase when running a local pseudo-distributed cluster), so we specify a passphrase when prompted for one. We shall use ssh-agent to avoid the need to enter a password for each connection.
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key is stored in a file with the same name with .pub appended, ~/.ssh/id_rsa.pub.
Next we need to make sure that the public key is in the ~/.ssh/authorized_keys file on all the machines in the cluster that we want to connect to. If the hadoop user’s home
directory is an NFS filesystem, as described earlier, then the keys can be shared across
the cluster by typing:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, then the public keys will need to be
shared by some other means.
Test that you can SSH from the master to a worker machine by making sure ssh-agent
is running, and then run ssh-add to store your passphrase. You should be able
to ssh to a worker without entering the passphrase again.
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed below.
Hadoop configuration files:

hadoop-env.sh (Bash script): Environment variables that are used in the scripts to run Hadoop.
core-site.xml (Hadoop configuration XML): Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml (Hadoop configuration XML): Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml (Hadoop configuration XML): Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.
masters (Plain text): A list of machines (one per line) that each run a secondary namenode.
slaves (Plain text): A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties (Java properties): Properties for controlling how metrics are published in Hadoop.
log4j.properties (Java properties): Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process.

These files are all found in the conf directory of the Hadoop distribution. The configuration
directory can be relocated to another part of the filesystem (outside the Hadoop
installation, which makes upgrades marginally easier) as long as daemons are started
with the --config option specifying the location of this directory on the local filesystem.
