Saturday, 22 March 2014

Hadoop interview questions



 
Name the most common InputFormats defined in Hadoop? Which one is default ?

 Following 3 are most common InputFormats defined in Hadoop
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat

TextInputFormat is the hadoop default.

What is the difference between TextInputFormat and KeyValueInputFormat class?

TextInputFormat: It reads lines of text files and provides the offset of the line as key to the Mapper and actual line as Value to the mapper

KeyValueInputFormat: Reads text file and parses lines into key, val pairs. Everything up to the first tab character is sent as key to the Mapper and the remainder of the line is sent as value to the mapper.

What is InputSplit in Hadoop?

When a hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called Input Split

How is the splitting of file invoked in Hadoop Framework ?

It is invoked by the Hadoop framework by running getInputSplit() method of the Input format class (like FileInputFormat) defined by the user

Consider case scenario: In M/R system,
    - HDFS block size is 64 MB
    - Input format is FileInputFormat
    - We have 3 files of size 64K, 65Mb and 127Mb
then how many input splits will be made by Hadoop framework?

Hadoop will make 5 splits as follows
- 1 split for 64K files
- 2  splits for 65Mb files
- 2 splits for 127Mb file

What is the purpose of RecordReader in Hadoop?

The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat

After the Map phase finishes, the hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?

- Partitioning
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same

- Shuffle
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer

If no custom partitioner is defined in the hadoop then how is data partitioned before its sent to the reducer?

The default partitioner computes a hash value for the key and assigns the partition based on this result

What is a Combiner?

The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.

What is job tracker?

Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster

What are some typical functions of Job Tracker?

The following are some typical tasks of Job Tracker
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data
- It locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker

What is task tracker?

Task Tracker is a node in the cluster that accepts tasks like Map, Reduce and Shuffle operations - from a JobTracker

Whats the relationship between Jobs and Tasks in Hadoop?

One job is broken down into one or many tasks in Hadoop.

Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will hadoop do ?

It will restart the task again on some other task tracker and only if the task fails more than 4 (default setting and can be changed) times will it kill the job

Hadoop achieves parallelism by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism Hadoop provides to combat this ?

Speculative Execution

How does speculative execution works in Hadoop  ?


Job tracker makes different task trackers process same input. When tasks complete, they announce this fact to the Job Tracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the Task Trackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.

Using command line in Linux, how will you
- see all jobs running in the hadoop cluster
- kill a job


- hadoop job -list
- hadoop job -kill jobid

What is Hadoop Streaming  ?


Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations


What is the characteristic of streaming API that makes it flexible run map reduce jobs in languages like perl, ruby, awk etc.  ?

Hadoop Streaming allows to use arbitrary programs for the Mapper and Reducer phases of a Map Reduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.


Whats is Distributed Cache in Hadoop ?

Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.


What is the benifit of Distributed cache, why can we just have the file in HDFS and have the application read it  ?

This is because distributed cache is much faster. It copies the file to all trackers at the start of the job. Now if the task tracker runs 10 or 100 mappers or reducer, it will use the same copy of distributed cache. On the other hand, if you put code in file to read it from HDFS in the MR job then every mapper will try to access it from HDFS hence if a task tracker run 100 map jobs then it will try to read this file 100 times from HDFS. Also HDFS is not very efficient when used like this.


What mechanism does Hadoop framework provides to synchronize changes made in Distribution Cache during runtime of the application  ?

This is a trick questions. There is no such mechanism. Distributed Cache by design is read only during the time of Job execution


Have you ever used Counters in Hadoop. Give us an example scenario ?

Anybody who claims to have worked on a Hadoop project is expected to use counters

Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job  ?


Yes, The input format class provides methods to add multiple directories as input to a Hadoop job


Is it possible to have Hadoop job output in multiple directories. If yes then how ?

Yes, by using Multiple Outputs class


What will a hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit


The hadoop job will throw an exception and exit.


How can you set an arbitrary number of mappers to be created for a job in Hadoop ?


This is a trick question. You cannot set it


How can you set an arbitary number of reducers to be created for a job in Hadoop ?


You can either do it progamatically by using method setNumReduceTasksin the JobConfclass or set it up as a configuration setting


How will you write a custom partitioner for a Hadoop job ?


To have hadoop use a custom partitioner you will have to do minimum the following three
- Create a new class that extends Partitioner class
- Override method getPartition
- In the wrapper that runs the Map Reducer, either
- add the custom partitioner to the job programtically using method setPartitionerClass or
- add the custom partitioner to the job as a config file (if your wrapper reads from config file or oozie)


How did you debug your Hadoop code ?

There can be several ways of doing this but most common ways are
- By using counters
- The web interface provided by Hadoop framework


Did you ever built a production process in Hadoop ? If yes then what was the process when your hadoop job fails due to any reason?


Its an open ended question but most candidates, if they have written a production job, should talk about some type of alert mechanisn like email is sent or there monitoring system sends an alert. Since Hadoop works on unstructured data, its very important to have a good alerting system for errors since unexpected data can very easily break the job.


Did you ever ran into a lop sided job that resulted in out of memory error, if yes then how did you handled it ?

This is an open ended question but a candidate who claims to be an intermediate developer and has worked on large data set (10-20GB min) should have run into this problem. There can be many ways to handle this problem but most common way is to alter your algorithm and break down the job into more map reduce phase or use a combiner if possible.


What is HDFS?

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications

What does the statement "HDFS is block structured file system" means?

It means that in HDFS individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity

What does the term "Replication factor" mean?

Replication factor is the number of times a file needs to be replicated in HDFS

What is the default replication factor in HDFS? 


3

What is the default block size of an HDFS block?  

64Mb

What is the benefit of having such big block size (when compared to block
size of linux file system like ext)?

It allows HDFS to decrease the amount of metadata storage required per file (the list of blocks per file will be smaller as the size of individual blocks increases). Furthermore, it allows for fast streaming reads of data, by keeping large amounts of data sequentially laid out on the disk

Why is it recommended to have few very large files instead of a lot of small files in HDFS?

This is because the Name node contains the meta data of each and every file in HDFS and more files means more metadata and since namenode loads all the metadata in memory for speed hence having a lot of files may make the metadata information big enough to exceed the size of the memory on the Name node

True/false question. What is the lowest granularity at which you can apply replication factor in HDFS
- You can choose replication factor per directory
- You can choose replication factor per file in a directory
- You can choose replication factor per block of a file

- True
- True
- False

What is a datanode in HDFS?
Individual machines in the HDFS cluster that hold blocks of data are called datanodes

What is a Namenode in HDFS?
The Namenode stores all the metadata for the file system

What alternate way does HDFS provides to recover data in case a Namenode, without backup, fails and cannot be recovered?
There is no way. If Namenode dies and there is no backup then there is no way to recover data

Describe how a HDFS client will read a file in HDFS, like will it talk to data node or namenode ... how will data flow etc?
To open a file, a client contacts the Name Node and retrieves a list of locations for the blocks that comprise the file. These locations identify the Data Nodes which hold each block. Clients then read file data directly from the Data Node servers, possibly in parallel. The Name Node is not directly involved in this bulk data transfer, keeping its overhead to a minimum.

Using linux command line. how will you
- List the the number of files in a HDFS directory
- Create a directory in HDFS
- Copy file from your local directory to HDFS

hadoop fs -ls
hadoop fs -mkdir
hadoop fs -put localfile hdfsfile


Advantages of Hadoop?

• Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.
• Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.
• Linear Scalability: Every parallel technology makes claims about scale up.Hadoop has genuine scalability since the latest release is expanding the limit on the number of nodes to beyond 4,000.
• Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop programming model does not solve all problems, but it is a strong solution for many tasks.

Definition of Big data?

According to Gartner, Big data can be defined as high volume, velocity
and variety information requiring innovative and cost effective forms of information processing for enhanced decision making.

How Big data differs from database ?

Datasets which are beyond the ability of the database to store, analyze and manage can be defined as Big. The technology extracts required information from large volume whereas the storage area is limited for a database.

Who are all using Hadoop? Give some examples?

• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!

Pig for Hadoop - Give some points?

Pig is Data-flow oriented language for analyzing large data sets.
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

At the present time, Pig infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

Extensibility.
Users can create their own functions to do special-purpose processing.

Features of Pig:
– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
- developed at yahoo!


Hive for Hadoop - Give some points?

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Keypoints:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY,etc
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook

Map Reduce in Hadoop?

Map reduce :
it is a framework for processing in parallel across huge datasets usning large no. of computers referred to cluster, it involves two processes namely Map and reduce.

Map Process:
In this process input is taken by the master node,which divides it into smaller tasks and distribute them to the workers nodes. The workers nodes process these sub tasks and pass them back to the master node.

Reduce Process :
In this the master node combines all the answers provided by the worker nodes to get the results of the original task. The main advantage of Map reduce is that the map and reduce are performed in distributed mode. Since each operation is independent, so each map can be performed in parallel and hence reducing the net computing time.

What is a heartbeat in HDFS?

A heartbeat is a signal indicating that it is alive. A data node sends heartbeat to Name node and task tracker will send its heart beat to job tracker. If the Name node or job tracker does not receive heart beat then they will decide that there is some problem in data node or task tracker is unable to perform the assigned task.

What is a metadata?

Metadata is the information about the data stored in data nodes such as location of the file, size of the file and so on.

Is Namenode also a commodity?

No. Namenode can never be a commodity hardware because the entire HDFS rely on it.
It is the single point of failure in HDFS. Namenode has to be a high-availability machine.

Can Hadoop be compared to NOSQL database like Cassandra?

Though NOSQL is the closet technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NOSQL. Hadoop is not a database. It’s a filesystem (HDFS) and distributed programming framework (MapReduce).

What is Key value pair in HDFS?

Key value pair is the intermediate data generated by maps and sent to reduces for generating the final output.

What is the difference between MapReduce engine and HDFS cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.

What is a rack?

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

How indexing is done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.

History of Hadoop?

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.

Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.

What is meant by Volunteer Computing?

Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed.
SETI@home is the most well-known of many volunteer computing projects.

How Hadoop differs from SETI (Volunteer computing)?

Although SETI (Search for Extra-Terrestrial Intelligence) may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world. Since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.

Compare RDBMS and MapReduce?

Data size:
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear

What is HBase?

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

What is ZooKeeper?

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

What is Chukwa?

A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports. (At the time of this writing, Chukwa had only recently graduated from a “contrib” module in Core to its own subproject.)

What is Avro?

A data serialization system for efficient, cross-language RPC, and persistent data storage. (At the time of this writing, Avro had been created only as a new subproject, and no other Hadoop subprojects were using it yet.)

core subproject in Hadoop - What is it?

A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).

What are all Hadoop subprojects?

Pig, Chukwa, Hive, HBase, MapReduce, HDFS, ZooKeeper, Core, Avro

What is a split?

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the userdefined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced.

On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of a HDFS block, 64 MB by default, although this can be changed for the cluster

Map tasks write their output to local disk, not to HDFS. Why is this?

Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to recreate the map output.

MapReduce data flow with a single reduce task- Explain?

The input to a single reduce task is normally the output from all mappers.
The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.
For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.

MapReduce data flow with multiple reduce tasks- Explain?

When there are multiple reducers, the map tasks partition their output,
each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner.

MapReduce data flow with no reduce tasks- Explain?

It’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle since the processing can be carried out entirely in parallel.
In this case, the only off-node data transfer is used when the map tasks write to HDFS

What is a block in HDFS?

Filesystems deal with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes.

Why is a Block in HDFS So Large?

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

File permissions in HDFS?

HDFS has a permissions model for files and directories.
There are three types of permission: the read permission (r), the write permission (w) and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or for a directory, to create or delete files or directories in it. The execute permission is ignored for a file since you can’t execute a file on HDFS.

What is Thrift in HDFS?

The Thrift API in the “thriftfs” contrib module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.
To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.

How Hadoop interacts with C?

Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface.
It works using the Java Native Interface (JNI) to call a Java filesystem client.
The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.

What is FUSE in HDFS Hadoop?

Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.

Explain WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it’s possible to access HDFS as a standard filesystem.

What is Sqoop in Hadoop?

It is a tool design to transfer the data between Relational database management system(RDBMS) and Hadoop HDFS.
Thus, we can sqoop the data from RDBMS like mySql or Oracle into HDFS of Hadoop as well as exporting data from HDFS file to RDBMS.
Sqoop will read the table row-by-row and the import process is performed in Parallel. Thus, the output may be in multiple files.
Example:
sqoop INTO "directory";
(SELECT * FROM database.table WHERE condition;)



Hadoop Interview Questions

1. What is Hadoop framework?
  Ans: Hadoop is a open source framework which is written in java by apchesoftware foundation. This framework is used to wirite software application whichrequires to process vast amount of data (It could handle multi tera bytes of data)
.
It works in-paralle on large clusters which could have 1000 of computers (Nodes)on the clusters. It also process data very reliably and fault-tolerant manner. Seethe below image how does it looks.
 
2. On What concept the Hadoop framework works?
  Ans : It works on MapReduce, and it is devised by the Google.
3. What is MapReduce ?
  Ans: Map reduce is an algorithm or concept to process Huge amount of data in afaster way. As per its name you can divide it Map and Reduce.
 
The main MapReduce job usually splits the input data-set into independent chunks.(Big data sets in the multiple small datasets)
 
MapTask: will process these chunks in a completely parallel manner (One node canprocess one or more chunks).
 
The framework sorts the outputs of the maps.
 
Reduce Task : And the above output will be the input for the reducetasks, producesthe final result.Your business logic would be written in the MappedTask and ReducedTask.Typically both the input and the output of the job are stored in a file-system (Notdatabase). The framework takes care of scheduling tasks, monitoring them andre-executes the failed tasks.
4. What is compute and Storage nodes?
  Ans:
Compute
 
Node
: This is the computer or machine where your actual businesslogic will be executed.
Storage Node:
This is the computer or machine where your file system reside tostore the processing data.In most of the cases compute node and storage node would be the samemachine.
5. How does master slave architecture in the Hadoop?
  Ans: The MapReduce framework consists of a single master 
JobTracker 
andmultiple slaves, each cluster-node will have one
TaskskTracker 
.The master is responsible for scheduling the jobs' component tasks on theslaves, monitoring them and re-executing the failed tasks. The slaves execute thetasks as directed by the master.
6. How does an Hadoop application look like or their basic components?
  Ans: Minimally an Hadoop application would have following components.
 
Input location of data
 
Output location of processed data.
 
 A map task.
 
 A reduced task.
 
Job configurationThe Hadoop job client then submits the job (jar/executable etc.) and configurationto the
JobTracker 
which then assumes the responsibility of distributing thesoftware/configuration to the slaves, scheduling tasks and monitoring them,providing status and diagnostic information to the job-client.
7. Explain how input and output data format of the Hadoop framework?
  Ans: The MapReduce framework operates exclusively on pairs, that is, theframework views the input to the job as a set of pairs and produces a set of 
 pairsas the output of the job,
conceivably of different types
. See the flow mentioned below
 (input) ->
 map
-> ->
combine/sorting
-> ->
reduce
-> (output) 
8. What are the restriction to the key and value class ?
 
Ans: The key and value classes have to be serialized by the framework. To make themserializable Hadoop provides a
Writable
interface. As you know from the java itself thatthe key of the Map should be comparable, hence the key has to implement one moreinterface
WritableComparable.
9. Explain the WordCount implementation via Hadoop framework ?
 
Ans:
We will count the words in all the input file flow as below
 
input
 Assume there are two files each having a sentence
Hello World Hello World (In file 1)
 
Hello World Hello World (In file 2)
 
 
Mapper : There would be each mapper for the a file
 
For the given sample input the first map output:
 
< Hello, 1>
 
< World, 1>
 
< Hello, 1>
 
< World, 1>
 
The second map output:
 
< Hello, 1>
 
< World, 1>
 
< Hello, 1>
 
< World, 1>
 
 
Combiner/Sorting (This is done for each individual map)
 So output looks like this
The output of the first map:
 
< Hello, 2>
 
< World, 2>
 
The output of the second map:
 
< Hello, 2>
 
< World, 2>
 
 
Reducer :
It sums up the above output and generates the output as below
< Hello, 4>
 
< World, 4>
 
 
Output
 Final output would look likeHello 4 timesWorld 4 times
10. Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?
  Ans:
org.apache.hadoop.mapreduce.Mapper
 
org.apache.hadoop.mapreduce.Reducer
 
11. What Mapper does?
  Ans:
Maps are the individual tasks that transform i
 
nput records into intermediate records. The transformed intermediate records do not needto be of the same type as the input records. A given input pair may map to zero or manyoutput pairs.
 
12. What is the InputSplit in map reduce software?
 
Description: http://htmlimg1.scribdassets.com/5t2vdu6ozk2dd4tu/images/6-413f82b0b1.png
Ans: An
InputSplit
is a logical representation of a unit (A chunk) of input work for amap task; e.g., a filename and a byte range within that file to process or a row set in a textfile.
13. What is the
InputFormat ?
  Ans:
The
InputFormat
is responsible for enumerate (itemise) the
InputSplits
, and producing a
RecordReader
which will turn those logical work units into actual physicalinput records.
 
14. Where do you specify the Mapper Implementation?
  Ans: Generally mapper implementation is specified in the Job itself.
15. How Mapper is instantiated in a running job?
 
Ans
: The Mapper itself is instantiated in the running job, and will be passed aMapContext object which it can use to configure itself.
 
16. Which are the methods in the Mapper interface?
 
Ans :
The Mapper contains the run() method, which call its own setup() methodonly once, it also call a map() method for each input and finally calls it cleanup()method. All above methods you can override in your code.
17. What happens if you d
on’t override the Mapper methods and keep them as
it is?
  Ans: If you do not override any methods (leaving even map as-is), it will act asthe identity function, emitting each input record as a separate output.
18. What is the use of Context object?
  Ans: The Context object allows the mapper to interact with the rest of the Hadoopsystem. ItIncludes configuration data for the job, as well as interfaces which allow it to emitoutput.
19. How can you add the arbitrary key-value pairs in your mapper?
  Ans: You can set arbitrary (key, value) pairs of configuration data in your Job,e.g. with
Job.getConfiguration().set("myKey", "myVal"),
and then retrieve this datain your mapper with
Context.getConfiguration().get("myKey")
. This kind of functionality is typically done in the Mapper's setup() method.
20.
How does Mapper’s run() method works?
  Ans:
The Mapper.run() method then calls map
(KeyInType, ValInType, Context)
for eachkey/value pair in the
InputSplit
for that task 
 
21. Which object can be used to get the progress of a particular job ?
  Ans:
Context
 
22. What is next step after Mapper or MapTask?
 
Ans : The output of the Mapper are sorted and Partitions will be created for theoutput. Number of partition depends on the number of reducer.
23. How can we control particular key should go in a specific reducer?
  Ans: Users can control which keys (and hence records) go to which Reducer byimplementing a custom Partitioner.
24. What is the use of Combiner?
  Ans: It is an optional component or class, and can be specify via
Job.setCombinerClass(ClassName)
, to perform local aggregation of theintermediate outputs, which helps to cut down the amount of data transferredfrom the Mapper to theReducer.
25. How many maps are there in a particular Job?
 
Ans:
The number of maps is usually driven by the total size of the inputs, that is,the total number of blocks of the input files.Generally it is around 10-100 maps per-node. Task setup takes awhile, so it isbest if the maps take at least a minute to execute.Suppose, if you expect 10TB of input data and have a blocksize of 128MB, you'llend up with82,000 maps, to control the number of block you can use the
mapreduce.job.maps
parameter (which only provides a hint to
the framework).
Ultimately, the number of tasks is controlled by the number of splits returned bythe
InputFormat.getSplits()
method (which you can override).
26. What is the Reducer used for?
  Ans:
Reducer 
reduces a set of intermediate values which share a key to a(usually smaller) set of v
alues.
 The number of reduces for the job is set by the user via
 Job.setNumReduceTasks(int)
.
27. Explain the core methods of the Reducer?
  Ans: The API of 
Reducer 
is very similar to that of Mapper, there's a
run()
methodthat receives a
Context
containing the job's configuration as well as interfacingmethods that return data from the reducer itself back to the framework. The
run()
 method calls
setup()
once,
reduce()
once for each key associated with thereduce task, and
cleanup()
once at the end. Each of these methods can accessthe job's configuration data by using
Context.getConfiguration()
. As in Mapper, any or all of these methods can be overridden with customimplementations. If none of these methods are overridden, the default reducer operation is the identity function; values are passed through without further processing.
The heart of Reducer is its
reduce
() method. This is called once per key; thesecond argument is an
Iterable
which returns all the values associated with thatkey.
28. What are the primary phases of the Reducer?
  Ans: Shuffle, Sort and Reduce
29. Explain the shuffle?
 
Ans: Input to the
Reducer
is the sorted output of the mappers. In this phase theframework fetches the relevant partition of the output of all the mappers, via HTTP.
 
30.
Explain the Reducer’s Sort phase?
  Ans:
The framework groups
Reducer
inputs by keys (since different mappers mayhave output the same key) in this stage. The shuffle and sort phases occur simultaneously;while map-outputs are being fetched they are merged (It is similar to merge-sort).
 
31.
Explain the Reducer’s reduce phase?
 
Ans: In this phase the
reduce(MapOutKeyType, Iterable, Context) 
method is called for each pair in the grouped inputs. The output of the reduce task is typically written to the
FileSystem
via
Context.write(ReduceOutKeyType, ReduceOutValType).
Applicationscan use the Context to report progress, set application-level status messages and updateCounters, or just indicate that they are alive. The output of the Reducer is not sorted.
 
32. How many Reducers should be configured?
  Ans:
The right number of reduces seems to be
0.95
or 
1.75
multiplied by (<
no. of nodes
> *
mapreduce.tasktracker.reduce.tasks.maximum
).
 
With 0.95 all of the reduces can launch immediately and start transfering map outputs asthe maps finish. With 1.75 the faster nodes will finish their first round of reduces andlaunch a second wave of reduces doing a much better job of load balancing. Increasingthe number of reduces increases the framework overhead, but increases load balancingand lowers the cost of failures.
 
33. It can be possible that a Job has 0 reducers?
 
Ans: It is legal to set the number of reduce-tasks to zero if no reduction is desired.
 
34. What happens if number of reducers are 0?
 
Ans: In this case the outputs of the map-tasks go directly to the FileSystem, into theoutput path set by
setOutputPath(Path)
. The framework does not sort the map-outputs before writing them out to the FileSystem.
 
35. How many instances of JobTracker can run on a Hadoop Cluser?
  Ans: Only one
36. What is the JobTracker and what it performs in a Hadoop Cluster?
  Ans: JobTracker is a daemon service which submits and tracks the MapReducetasks to the Hadoop cluster. It runs its own JVM process. And usually it run on aseparate machine, and each slave node is configured with job tracker nodelocation.
The JobTracker is single point of failure for the Hadoop MapReduce service. If itgoes down, all running jobs are halted.
JobTracker in Hadoop performs following actions
1.1.
 
Client applications submit jobs to the Job tracker.
 
 
The JobTracker talks to the NameNode to determine the location of the data
 
 
The JobTracker locates TaskTracker nodes with available slots at or near thedata
 
 
The JobTracker submits the work to the chosen TaskTracker nodes.
 
 
The TaskTracker nodes are monitored. If they do not submit heartbeat signalsoften enough, they are deemed to have failed and the work is scheduled on adifferent TaskTracker.
 
 A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark thatspecific record as something to avoid, and it may may even blacklist theTaskTracker as unreliable.
 
When the work is completed, the JobTracker updates its status.
 
 
Client applications can poll the JobTracker for information.
37. How a task is scheduled by a JobTracker?
  Ans: The TaskTrackers send out heartbeat messages to the JobTracker, usuallyevery few minutes, to reassure the JobTracker that it is still alive. Thesemessages also inform the JobTracker of the number of available slots, so theJobTracker can stay up to date with where in the cluster work can be delegated.When the JobTracker tries to find somewhere to schedule a task within theMapReduce operations, it first looks for an empty slot on the same server thathosts the DataNode containing the data, and if not, it looks for an empty slot on amachine in the same rack.
38. How many instances of Tasktracker run on a Hadoop cluster?
  Ans: There is one Daemon Tasktracker process for each slave node in theHadoop cluster.
39. What are the two main parts of the Hadoop framework?
  Ans: Hadoop consists of two main parts
 
Hadoop distributed file system
, a distributed file system with high throughput,
 
Hadoop MapReduce
, a software framework for processing large data sets.
40. Explain the use of TaskTracker in the Hadoop cluster?
  Ans: A Tasktracker is a slave node in the cluster which that accepts the tasksfrom JobTracker like Map, Reduce or shuffle operation. Tasktracker also runs inits own JVM Process.Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM processes to dothe actual work (called as Task Instance) this is to ensure that process failuredoes not take down the task tracker.The Tasktracker monitors these task instances, capturing the output and exitcodes. When the Task instances finish, successfully or not, the task tracker notifies the JobTracker.The TaskTrackers also send out heartbeat messages to the JobTracker, usuallyevery few minutes, to reassure the JobTracker that it is still alive. Thesemessages also inform the JobTracker of the number of available slots, so theJobTracker can stay up to date with where in the cluster work can be delegated
.
 
41. What do you mean by TaskInstance?
  Ans: Task instances are the actual
MapReduce jobs
which run on each slavenode. The TaskTracker starts a separate JVM processes to do the actual work(called as Task Instance) this is to ensure that process failure does not take downthe entire task tracker.Each Task Instance runs on its own JVM process. Therecan be multiple processes of task instance running on a slave node. This isbased on the number of slots configured on task tracker. By default a new taskinstance JVM process is spawned for a task.
42. How many daemon processes run on a Hadoop cluster?
  Ans: Hadoop is comprised of five separate daemons. Each of these daemonsruns in its own JVM.Following 3 Daemons run on Master nodes.
NameNode
- This daemon storesand maintains the metadata for HDFS.
Secondary NameNode
- Performs housekeeping functions for the NameNode.
 
JobTracker 
- Manages MapReduce jobs, distributes individual tasks to machinesrunning the Task Tracker. Following 2 Daemons run on each Slave nodes
DataNode
 
 –
Stores actual HDFS data blocks.
TaskTracker 
 
 –
It is Responsible for instantiating and monitoring individual Mapand Reduce tasks.
43. How many maximum JVM can run on a slave node?
  Ans: One or Multiple instances of Task Instance can run on each slave node.Each task instance is run as a separate JVM process. The number of Taskinstances can be controlled by configuration. Typically a high end machine isconfigured to run more task instances. 
44. What is NAS?
  Ans: It is one kind of file system where data can reside on one centralizedmachine and all the cluster member will read write data from that shareddatabase, which would not be as efficient as HDFS.
45. How HDFA differs with NFS?
  Ans: Following are differences between HDFS and NAS1.
o
 
In HDFS Data Blocks are distributed across local drives of all machines in acluster. Whereas in NAS data is storedon dedicated hardware.
o
 
HDFS is designed to work with MapReduce System, since computation ismoved to data. NAS is not suitable for MapReduce since data is storedseparately from the computations.
o
 
HDFS runs on a cluster of machines and provides redundancy usingreplication protocol. Whereas NAS is provided by a single machinetherefore does not provide data redundancy.
46. How does a NameNode handle the failure of the data nodes?
  Ans:HDFS has master/slave architecture. An HDFS cluster consists of a single
NameNode
, a master server that manages the file system namespace and regulatesaccess to files by clients.In addition, there are a number of DataNodes, usually one per node in the cluster,which manage storage attached to the nodes that they run on.
 
The NameNode and DataNode are pieces of software designed to run on commoditymachines.NameNode periodically receives a Heartbeat and a Block report from each of theDataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode isfunctioning properly. A Blockreport contains a list of all blocks on a DataNode. WhenNameNode notices that it has not received a heartbeat message from a data nodeafter a certain amount of time, the data node is marked as dead. Since blocks will beunder replicated the system begins replicating the blocks that were stored on thedead DataNode. The NameNode Orchestrates the replication of data blocks fromone DataNode to another. The replication data transfer happens directly betweenDataNode and the data never passes through the NameNode.
47. Can Reducer talk with each other?
  Ans: No, Reducer runs in isolation.
48.
Where the Mapper’s Intermediate data will be stored?
  Ans: The mapper output (intermediate data) is stored on the Local file system(NOT HDFS) of each individual mapper nodes. This is typically a temporarydirectory location which can be setup in config by the Hadoop administrator. Theintermediate data is cleaned up after the Hadoop Job completes.
49. What is the use of Combiners in the Hadoop framework?
  Ans: Combiners are used to increase the efficiency of a MapReduce program.They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to betransferred across to the reducers.You can use your reducer code as a combiner if the operation performed iscommutative and associative.The execution of combiner is not guaranteed; Hadoop may or may not execute acombiner. Also, if required it may execute it more than 1 times. Therefore your MapReduce
 jobs should not depend on the combiners’ execution.
 
50. What is the Hadoop MapReduce API contract for a key and value Class?
 Ans:
◦The Key must implement the
org.apache.hadoop.io.WritableComparable
 interface.
◦The value must implement the
org.apache.hadoop.io.Writable
interface.
51. What is a IdentityMapper and IdentityReducer in MapReduce?
org.apache.hadoop.mapred.lib.IdentityMapper 
: Implements the identity function,mapping inputs directly to outputs. If MapReduce programmer does not set theMapper Class using
JobConf.setMapperClass
then
IdentityMapper.class
is usedas a default value.
org.apache.hadoop.mapred.lib.IdentityReducer 
: Performs no reduction, writingall input values directly to the output. If MapReduce programmer does not set theReducer Class using
JobConf.setReducerClass
then
IdentityReducer.class
isused as a default value.
52. What is the meaning of speculative execution in Hadoop? Why is itimportant?
 Ans: Speculative execution is a way of coping with individual Machine performance.In large clusters where hundreds or thousands of machines are involved there maybe machines which are not performing as fast as others.This may result in delays in a full job due to only one machine not performaing well.To avoid this, speculative execution in hadoop can run multiple copies of same mapor reduce task on different slave nodes. The results from first node to finish are used.
53. When the reducers are are started in a MapReduce job?
 Ans: In a MapReduce job reducers do not start executing the reduce method untilthe all Map jobs have completed. Reducers start copying intermediate key-valuepairs from the mappers as soon as they are available. The programmer definedreduce method is called only after all the mappers have finished.If reducers do not start before all mappers finish then why does the progress onMapReduce job shows something like Map(50%) Reduce(10%)? Why reducersprogress percentage is displayed when mapper is not finished yet?Reducers start copying intermediate key-value pairs from the mappers as soon asthey are available. The progress calculation also takes in account the processing of data transfer which is done by reduce process, therefore the reduce progress startsshowing up as soon as any intermediate key-value pair for a mapper is available tobe transferred to reducer.Though the reducer progress is updated still the programmer defined reduce methodis called only after all the mappers have finished.
54. What is HDFS ? How it is different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge data onthe cluster. This is a distributed file system designed to run on commodity hardware.It has many similarities with existing distributed file systems. However, thedifferences from other distributed file systems are significant.
◦HDFS is highly fault
-tolerant and is designed to be deployed on low-cost hardware.
◦HDFS provides high throughput access to application data and is suitable for 
applications that have large data sets.
◦HDFS is designed to support very large files. Applications that are compatible with
HDFS are those that deal with large data sets. These applications write their dataonly once but they read it one or more times and require these reads to be satisfiedat streaming speeds. HDFS supports write-once-read-many semantics on files.
55. What is HDFS Block size? How is it different from traditional file systemblock size?
In HDFS data is split into blocks and distributed across multiple nodes in the cluster.Each block is typically 64Mb or 128Mb in size.Each block is replicated multiple times. Default is to replicate each block three times.Replicas are stored on different nodes. HDFS utilizes the local file system to storeeach HDFS block as a separate file. HDFS Block size can not be compared with thetraditional file system block size.
57.What is a NameNode? How many instances of NameNode run on a HadoopCluster? 
The NameNode is the centerpiece of an HDFS file system. It keeps the directory treeof all files in the file system, and tracks where across the cluster the file data is kept.It does not store the data of these files itself.There is only One NameNode process run on any hadoop cluster. NameNode runson its own JVM process. In a typical production cluster its run on a separatemachine.The NameNode is a Single Point of Failure for the HDFS Cluster. When theNameNode goes down, the file system goes offline.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds thesuccessful requests by returning a list of relevant DataNode servers where the datalives.
58. What is a DataNode? How many instances of DataNode run on a HadoopCluster?
 A DataNode stores data in the Hadoop File System HDFS. There is only OneDataNode process run on any hadoop slave node. DataNode runs on its own JVMprocess. On startup, a DataNode connects to the NameNode. DataNode instancescan talk to each other, this is mostly during replicating data.
59. How the Client communicates with HDFS?
The Client communication to HDFS happens using Hadoop HDFS API. Clientapplications talk to the NameNode whenever they wish to locate a file, or when theywant to add/copy/move/delete a file on HDFS. The NameNode responds thesuccessful requests by returning a list of relevant DataNode servers where the datalives. Client applications can talk directly to a DataNode, once the NameNode hasprovided the location of the data.
60. How the HDFS Blocks are replicated?
HDFS is designed to reliably store very large files across machines in a large cluster.It stores each file as a sequence of blocks; all blocks in a file except the last blockare the same size.The blocks of a file are replicated for fault tolerance. The block size and replicationfactor are configurable per file. An application can specify the number of replicas of afile. The replication factor can be specified at file creation time and can be changedlater. Files in HDFS are write-once and have strictly one writer at any time.The NameNode makes all decisions regarding replication of blocks. HDFS usesrack-aware replica placement policy. In default configuration there are total 3 copiesof a datablock on HDFS, 2 copies are stored on datanodes on same rack and 3rdcopy on a different rack.


No comments:

Post a Comment