Name the most common InputFormats
defined in Hadoop? Which one is default ?
Following 3 are most common InputFormats defined in Hadoop
- TextInputFormat
- KeyValueInputFormat
- SequenceFileInputFormat
TextInputFormat is the hadoop default.
What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: It reads lines of text files and provides the offset of the line as key to the Mapper and actual line as Value to the mapper
KeyValueInputFormat: Reads text file and parses lines into key, val pairs. Everything up to the first tab character is sent as key to the Mapper and the remainder of the line is sent as value to the mapper.
What is InputSplit in Hadoop?
When a hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called Input Split
How is the splitting of file invoked in Hadoop Framework ?
It is invoked by the Hadoop framework by running getInputSplit() method of the Input format class (like FileInputFormat) defined by the user
Consider case scenario: In M/R system,
- HDFS block size is 64 MB
- Input format is FileInputFormat
- We have 3 files of size 64K, 65Mb and 127Mb
then how many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows
- 1 split for 64K files
- 2 splits for 65Mb files
- 2 splits for 127Mb file
What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat
After the Map phase finishes, the hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?
- Partitioning
Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same
- Shuffle
After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer
If no custom partitioner is defined in the hadoop then how is data partitioned before its sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result
What is a Combiner?
The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
What is job tracker?
Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster
What are some typical functions of Job Tracker?
The following are some typical tasks of Job Tracker
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data
- It locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker
What is task tracker?
Task Tracker is a node in the cluster that accepts tasks like Map, Reduce and Shuffle operations - from a JobTracker
Whats the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.
Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will hadoop do ?
It will restart the task again on some other task tracker and only if the task fails more than 4 (default setting and can be changed) times will it kill the job
Hadoop achieves parallelism by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program and slow down the program. What mechanism Hadoop provides to combat this ?
Speculative Execution
How does speculative execution works in Hadoop ?
How does speculative execution works in Hadoop ?
Job tracker makes different task trackers process
same input. When tasks complete, they announce this fact to the Job Tracker.
Whichever copy of a task finishes first becomes the definitive copy. If other
copies were executing speculatively, Hadoop tells the Task Trackers to abandon
the tasks and discard their outputs. The Reducers then receive their inputs
from whichever Mapper completed successfully, first.
Using command line in Linux, how will you
- see all jobs running in the hadoop cluster
- kill a job
Using command line in Linux, how will you
- see all jobs running in the hadoop cluster
- kill a job
- hadoop job -list
- hadoop job -kill jobid
What is Hadoop Streaming ?
- hadoop job -kill jobid
What is Hadoop Streaming ?
Streaming is a generic API that allows programs
written in virtually any language to be used as Hadoop Mapper and Reducer
implementations
What is the characteristic of streaming API that makes it flexible run map reduce jobs in languages like perl, ruby, awk etc. ?
What is the characteristic of streaming API that makes it flexible run map reduce jobs in languages like perl, ruby, awk etc. ?
Hadoop Streaming allows to use arbitrary programs
for the Mapper and Reducer phases of a Map Reduce job by having both Mappers
and Reducers receive their input on stdin and emit output (key, value) pairs on
stdout.
Whats is Distributed Cache in Hadoop ?
Whats is Distributed Cache in Hadoop ?
Distributed Cache is a facility provided by the
Map/Reduce framework to cache files (text, archives, jars and so on) needed by
applications during execution of the job. The framework will copy the necessary
files to the slave node before any tasks for the job are executed on that node.
What is the benifit of Distributed cache, why can we just have the file in HDFS and have the application read it ?
What is the benifit of Distributed cache, why can we just have the file in HDFS and have the application read it ?
This is because distributed cache is much faster.
It copies the file to all trackers at the start of the job. Now if the task
tracker runs 10 or 100 mappers or reducer, it will use the same copy of
distributed cache. On the other hand, if you put code in file to read it from
HDFS in the MR job then every mapper will try to access it from HDFS hence if a
task tracker run 100 map jobs then it will try to read this file 100 times from
HDFS. Also HDFS is not very efficient when used like this.
What mechanism does Hadoop framework provides to synchronize changes made in Distribution Cache during runtime of the application ?
What mechanism does Hadoop framework provides to synchronize changes made in Distribution Cache during runtime of the application ?
This is a trick questions. There is no such
mechanism. Distributed Cache by design is read only during the time of Job
execution
Have you ever used Counters in Hadoop. Give us an example scenario ?
Have you ever used Counters in Hadoop. Give us an example scenario ?
Anybody who claims to have worked on a Hadoop
project is expected to use counters
Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job ?
Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job ?
Yes, The input format class provides methods to add
multiple directories as input to a Hadoop job
Is it possible to have Hadoop job output in multiple directories. If yes then how ?
Is it possible to have Hadoop job output in multiple directories. If yes then how ?
Yes, by using Multiple Outputs class
What will a hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
What will a hadoop job do if you try to run it with an output directory that is already present? Will it
- overwrite it
- warn you and continue
- throw an exception and exit
The hadoop job will throw an exception and exit.
How can you set an arbitrary number of mappers to be created for a job in Hadoop ?
How can you set an arbitrary number of mappers to be created for a job in Hadoop ?
This is a trick question. You cannot set it
How can you set an arbitary number of reducers to be created for a job in Hadoop ?
How can you set an arbitary number of reducers to be created for a job in Hadoop ?
You can either do it progamatically by using method
setNumReduceTasksin the JobConfclass or set it up as a configuration setting
How will you write a custom partitioner for a Hadoop job ?
How will you write a custom partitioner for a Hadoop job ?
To have hadoop use a custom partitioner you will
have to do minimum the following three
- Create a new class that extends Partitioner class
- Override method getPartition
- In the wrapper that runs the Map Reducer, either
- add the custom partitioner to the job programtically using method setPartitionerClass or
- add the custom partitioner to the job as a config file (if your wrapper reads from config file or oozie)
How did you debug your Hadoop code ?
- Create a new class that extends Partitioner class
- Override method getPartition
- In the wrapper that runs the Map Reducer, either
- add the custom partitioner to the job programtically using method setPartitionerClass or
- add the custom partitioner to the job as a config file (if your wrapper reads from config file or oozie)
How did you debug your Hadoop code ?
There can be several ways of doing this but most
common ways are
- By using counters
- The web interface provided by Hadoop framework
Did you ever built a production process in Hadoop ? If yes then what was the process when your hadoop job fails due to any reason?
- By using counters
- The web interface provided by Hadoop framework
Did you ever built a production process in Hadoop ? If yes then what was the process when your hadoop job fails due to any reason?
Its an open ended question but most candidates, if
they have written a production job, should talk about some type of alert
mechanisn like email is sent or there monitoring system sends an alert. Since
Hadoop works on unstructured data, its very important to have a good alerting system
for errors since unexpected data can very easily break the job.
Did you ever ran into a lop sided job that resulted in out of memory error, if yes then how did you handled it ?
Did you ever ran into a lop sided job that resulted in out of memory error, if yes then how did you handled it ?
This is an open ended question but a candidate who
claims to be an intermediate developer and has worked on large data set
(10-20GB min) should have run into this problem. There can be many ways to
handle this problem but most common way is to alter your algorithm and break
down the job into more map reduce phase or use a combiner if possible.
What is HDFS?
What is HDFS?
HDFS, the Hadoop Distributed File System, is a
distributed file system designed to hold very large amounts of data (terabytes
or even petabytes), and provide high-throughput access to this information.
Files are stored in a redundant fashion across multiple machines to ensure
their durability to failure and high availability to very parallel applications
What does the statement "HDFS is block structured file system" means?
What does the statement "HDFS is block structured file system" means?
It means that in HDFS individual files are broken
into blocks of a fixed size. These blocks are stored across a cluster of one or
more machines with data storage capacity
What does the term "Replication factor" mean?
What does the term "Replication factor" mean?
Replication factor is the number of times a file
needs to be replicated in HDFS
What is the default replication factor in HDFS?
What is the default replication factor in HDFS?
3
What is the default block size of an HDFS block?
What is the default block size of an HDFS block?
64Mb
What is the benefit of having such big block size (when compared to block
What is the benefit of having such big block size (when compared to block
size of linux file system like ext)?
It allows HDFS to decrease the amount of metadata
storage required per file (the list of blocks per file will be smaller as the
size of individual blocks increases). Furthermore, it allows for fast streaming
reads of data, by keeping large amounts of data sequentially laid out on the
disk
Why is it recommended to have few very large files instead of a lot of small files in HDFS?
Why is it recommended to have few very large files instead of a lot of small files in HDFS?
This is because the Name node contains the meta
data of each and every file in HDFS and more files means more metadata and
since namenode loads all the metadata in memory for speed hence having a lot of
files may make the metadata information big enough to exceed the size of the
memory on the Name node
True/false question. What is the lowest granularity at which you can apply replication factor in HDFS
- You can choose replication factor per directory
- You can choose replication factor per file in a directory
- You can choose replication factor per block of a file
- True
- True
- False
What is a datanode in HDFS?
Individual machines in the HDFS cluster that hold blocks of data are called datanodes
What is a Namenode in HDFS?
The Namenode stores all the metadata for the file system
What alternate way does HDFS provides to recover data in case a Namenode, without backup, fails and cannot be recovered?
There is no way. If Namenode dies and there is no backup then there is no way to recover data
Describe how a HDFS client will read a file in HDFS, like will it talk to data node or namenode ... how will data flow etc?
To open a file, a client contacts the Name Node and retrieves a list of locations for the blocks that comprise the file. These locations identify the Data Nodes which hold each block. Clients then read file data directly from the Data Node servers, possibly in parallel. The Name Node is not directly involved in this bulk data transfer, keeping its overhead to a minimum.
Using linux command line. how will you
- List the the number of files in a HDFS directory
- Create a directory in HDFS
- Copy file from your local directory to HDFS
hadoop fs -ls
hadoop fs -mkdir
hadoop fs -put localfile hdfsfile
Advantages of Hadoop?
True/false question. What is the lowest granularity at which you can apply replication factor in HDFS
- You can choose replication factor per directory
- You can choose replication factor per file in a directory
- You can choose replication factor per block of a file
- True
- True
- False
What is a datanode in HDFS?
Individual machines in the HDFS cluster that hold blocks of data are called datanodes
What is a Namenode in HDFS?
The Namenode stores all the metadata for the file system
What alternate way does HDFS provides to recover data in case a Namenode, without backup, fails and cannot be recovered?
There is no way. If Namenode dies and there is no backup then there is no way to recover data
Describe how a HDFS client will read a file in HDFS, like will it talk to data node or namenode ... how will data flow etc?
To open a file, a client contacts the Name Node and retrieves a list of locations for the blocks that comprise the file. These locations identify the Data Nodes which hold each block. Clients then read file data directly from the Data Node servers, possibly in parallel. The Name Node is not directly involved in this bulk data transfer, keeping its overhead to a minimum.
Using linux command line. how will you
- List the the number of files in a HDFS directory
- Create a directory in HDFS
- Copy file from your local directory to HDFS
hadoop fs -ls
hadoop fs -mkdir
hadoop fs -put localfile hdfsfile
Advantages of Hadoop?
• Bringing compute and storage together on
commodity hardware: The result is blazing speed at low cost.
• Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.
• Linear Scalability: Every parallel technology makes claims about scale up.Hadoop has genuine scalability since the latest release is expanding the limit on the number of nodes to beyond 4,000.
• Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop programming model does not solve all problems, but it is a strong solution for many tasks.
Definition of Big data?
• Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.
• Linear Scalability: Every parallel technology makes claims about scale up.Hadoop has genuine scalability since the latest release is expanding the limit on the number of nodes to beyond 4,000.
• Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop programming model does not solve all problems, but it is a strong solution for many tasks.
Definition of Big data?
According to Gartner, Big data can be defined as
high volume, velocity
and variety information requiring innovative and
cost effective forms of information processing for enhanced decision making.
How Big data differs from database ?
How Big data differs from database ?
Datasets which are beyond the ability of the
database to store, analyze and manage can be defined as Big. The technology
extracts required information from large volume whereas the storage area is
limited for a database.
Who are all using Hadoop? Give some examples?
Who are all using Hadoop? Give some examples?
• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!
Pig for Hadoop - Give some points?
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• NSF-Google
• IBM
• Ning
• PARC
• Rackspace
• StumbleUpon
• Yahoo!
Pig for Hadoop - Give some points?
Pig is Data-flow oriented language for analyzing
large data sets.
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
At the present time, Pig infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility.
Users can create their own functions to do special-purpose processing.
Features of Pig:
– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
- developed at yahoo!
Hive for Hadoop - Give some points?
It is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
At the present time, Pig infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility.
Users can create their own functions to do special-purpose processing.
Features of Pig:
– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
- developed at yahoo!
Hive for Hadoop - Give some points?
Hive is a data warehouse system for Hadoop that
facilitates easy data summarization, ad-hoc queries, and the analysis of large
datasets stored in Hadoop compatible file systems. Hive provides a mechanism to
project structure onto this data and query the data using a SQL-like language
called HiveQL. At the same time this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
Keypoints:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY,etc
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook
Map Reduce in Hadoop?
Keypoints:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY,etc
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook
Map Reduce in Hadoop?
Map reduce :
it is a framework for processing in parallel across huge datasets usning large no. of computers referred to cluster, it involves two processes namely Map and reduce.
Map Process:
In this process input is taken by the master node,which divides it into smaller tasks and distribute them to the workers nodes. The workers nodes process these sub tasks and pass them back to the master node.
Reduce Process :
In this the master node combines all the answers provided by the worker nodes to get the results of the original task. The main advantage of Map reduce is that the map and reduce are performed in distributed mode. Since each operation is independent, so each map can be performed in parallel and hence reducing the net computing time.
What is a heartbeat in HDFS?
it is a framework for processing in parallel across huge datasets usning large no. of computers referred to cluster, it involves two processes namely Map and reduce.
Map Process:
In this process input is taken by the master node,which divides it into smaller tasks and distribute them to the workers nodes. The workers nodes process these sub tasks and pass them back to the master node.
Reduce Process :
In this the master node combines all the answers provided by the worker nodes to get the results of the original task. The main advantage of Map reduce is that the map and reduce are performed in distributed mode. Since each operation is independent, so each map can be performed in parallel and hence reducing the net computing time.
What is a heartbeat in HDFS?
A heartbeat is a signal indicating that it is alive.
A data node sends heartbeat to Name node and task tracker will send its heart
beat to job tracker. If the Name node or job tracker does not receive heart
beat then they will decide that there is some problem in data node or task
tracker is unable to perform the assigned task.
What is a metadata?
What is a metadata?
Metadata is the information about the data stored
in data nodes such as location of the file, size of the file and so on.
Is Namenode also a commodity?
Is Namenode also a commodity?
No. Namenode can never be a commodity hardware
because the entire HDFS rely on it.
It is the single point of failure in HDFS. Namenode has to be a high-availability machine.
Can Hadoop be compared to NOSQL database like Cassandra?
It is the single point of failure in HDFS. Namenode has to be a high-availability machine.
Can Hadoop be compared to NOSQL database like Cassandra?
Though NOSQL is the closet technology that can be
compared to Hadoop, it has its own pros and cons. There is no DFS in NOSQL.
Hadoop is not a database. It’s a filesystem (HDFS) and distributed programming
framework (MapReduce).
What is Key value pair in HDFS?
What is Key value pair in HDFS?
Key value pair is the intermediate data generated
by maps and sent to reduces for generating the final output.
What is the difference between MapReduce engine and HDFS cluster?
What is the difference between MapReduce engine and HDFS cluster?
HDFS cluster is the name given to the whole
configuration of master and slaves where data is stored. Map Reduce Engine is
the programming module which is used to retrieve and analyze data.
What is a rack?
What is a rack?
Rack is a storage area with all the datanodes put
together. These datanodes can be physically located at different places. Rack
is a physical collection of datanodes which are stored at a single location.
There can be multiple racks in a single location.
How indexing is done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.
History of Hadoop?
How indexing is done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.
History of Hadoop?
Hadoop was created by Doug Cutting, the creator of
Apache Lucene, the widely used text search library. Hadoop has its origins in
Apache Nutch, an open source web search engine, itself a part of the Lucene
project.
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.
What is meant by Volunteer Computing?
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria.
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.
What is meant by Volunteer Computing?
Volunteer computing projects work by breaking the
problem they are trying to solve into chunks called work units, which are sent
to computers around the world to be analyzed.
SETI@home is the most well-known of many volunteer computing projects.
How Hadoop differs from SETI (Volunteer computing)?
SETI@home is the most well-known of many volunteer computing projects.
How Hadoop differs from SETI (Volunteer computing)?
Although SETI (Search for Extra-Terrestrial
Intelligence) may be superficially similar to MapReduce (breaking a problem
into independent pieces to be worked on in parallel), there are some significant
differences. The SETI@home problem is very CPU-intensive, which makes it
suitable for running on hundreds of thousands of computers across the world.
Since the time to transfer the work unit is dwarfed by the time to run the
computation on it. Volunteers are donating CPU cycles, not bandwidth.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
Compare RDBMS and MapReduce?
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
Compare RDBMS and MapReduce?
Data size:
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear
What is HBase?
RDBMS - Gigabytes
MapReduce - Petabytes
Access:
RDBMS - Interactive and batch
MapReduce - Batch
Updates:
RDBMS - Read and write many times
MapReduce - Write once, read many times
Structure:
RDBMS - Static schema
MapReduce - Dynamic schema
Integrity:
RDBMS - High
MapReduce - Low
Scaling:
RDBMS - Nonlinear
MapReduce - Linear
What is HBase?
A distributed, column-oriented database. HBase uses
HDFS for its underlying storage, and supports both batch-style computations
using MapReduce and point queries (random reads).
What is ZooKeeper?
What is ZooKeeper?
A distributed, highly available coordination
service. ZooKeeper provides primitives such as distributed locks that can be
used for building distributed applications.
What is Chukwa?
What is Chukwa?
A distributed data collection and analysis system.
Chukwa runs collectors that store data in HDFS, and it uses MapReduce to
produce reports. (At the time of this writing, Chukwa had only recently
graduated from a “contrib” module in Core to its own subproject.)
What is Avro?
What is Avro?
A data serialization system for efficient,
cross-language RPC, and persistent data storage. (At the time of this writing,
Avro had been created only as a new subproject, and no other Hadoop subprojects
were using it yet.)
core subproject in Hadoop - What is it?
core subproject in Hadoop - What is it?
A set of components and interfaces for distributed
filesystems and general I/O (serialization, Java RPC, persistent data
structures).
What are all Hadoop subprojects?
What are all Hadoop subprojects?
Pig, Chukwa, Hive, HBase, MapReduce, HDFS,
ZooKeeper, Core, Avro
What is a split?
What is a split?
Hadoop divides the input to a MapReduce job into
fixed-size pieces called input splits, or just splits. Hadoop creates one map
task for each split, which runs the userdefined map function for each record in
the split.
Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced.
On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of a HDFS block, 64 MB by default, although this can be changed for the cluster
Map tasks write their output to local disk, not to HDFS. Why is this?
Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced.
On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of a HDFS block, 64 MB by default, although this can be changed for the cluster
Map tasks write their output to local disk, not to HDFS. Why is this?
Map output is intermediate output: it’s processed
by reduce tasks to produce the final output, and once the job is complete the
map output can be thrown away. So storing it in HDFS, with replication, would
be overkill. If the node running the map task fails before the map output has
been consumed by the reduce task, then Hadoop will automatically rerun the map
task on another node to recreate the map output.
MapReduce data flow with a single reduce task- Explain?
MapReduce data flow with a single reduce task- Explain?
The input to a single reduce task is normally the
output from all mappers.
The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.
For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.
MapReduce data flow with multiple reduce tasks- Explain?
The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.
For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.
MapReduce data flow with multiple reduce tasks- Explain?
When there are multiple reducers, the map tasks
partition their output,
each creating one partition for each reduce task. There
can be many keys (and their associated values) in each partition, but the
records for every key are all in a single partition. The partitioning can be
controlled by a user-defined partitioning function, but normally the default
partitioner.
MapReduce data flow with no reduce tasks- Explain?
MapReduce data flow with no reduce tasks- Explain?
It’s also possible to have zero reduce tasks. This
can be appropriate when you don’t need the shuffle since the processing can be
carried out entirely in parallel.
In this case, the only off-node data transfer is used when the map tasks write to HDFS
What is a block in HDFS?
In this case, the only off-node data transfer is used when the map tasks write to HDFS
What is a block in HDFS?
Filesystems deal with data in blocks, which are an
integral multiple of the disk block size. Filesystem blocks are typically a few
kilobytes in size, while disk blocks are normally 512 bytes.
Why is a Block in HDFS So Large?
Why is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and
the reason is to minimize the cost of seeks. By making a block large enough,
the time to transfer the data from the disk can be made to be significantly
larger than the time to seek to the start of the block. Thus the time to
transfer a large file made of multiple blocks operates at the disk transfer
rate.
File permissions in HDFS?
File permissions in HDFS?
HDFS has a permissions model for files and
directories.
There are three types of permission: the read permission (r), the write permission (w) and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or for a directory, to create or delete files or directories in it. The execute permission is ignored for a file since you can’t execute a file on HDFS.
What is Thrift in HDFS?
There are three types of permission: the read permission (r), the write permission (w) and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or for a directory, to create or delete files or directories in it. The execute permission is ignored for a file since you can’t execute a file on HDFS.
What is Thrift in HDFS?
The Thrift API in the “thriftfs” contrib module
exposes Hadoop filesystems as an Apache Thrift service, making it easy for any
language that has Thrift bindings to interact with a Hadoop filesystem, such as
HDFS.
To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.
How Hadoop interacts with C?
To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.
How Hadoop interacts with C?
Hadoop provides a C library called libhdfs that
mirrors the Java FileSystem interface.
It works using the Java Native Interface (JNI) to call a Java filesystem client.
The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.
What is FUSE in HDFS Hadoop?
It works using the Java Native Interface (JNI) to call a Java filesystem client.
The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.
What is FUSE in HDFS Hadoop?
Filesystem in Userspace (FUSE) allows filesystems
that are implemented in user space to be integrated as a Unix filesystem.
Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically
HDFS) to be mounted as a standard filesystem. You can then use Unix utilities
(such as ls and cat) to interact with the filesystem.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.
Explain WebDAV in Hadoop?
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.
Explain WebDAV in Hadoop?
WebDAV is a set of extensions to HTTP to support
editing and updating files. WebDAV shares can be mounted as filesystems on most
operating systems, so by exposing HDFS (or other Hadoop filesystems) over
WebDAV, it’s possible to access HDFS as a standard filesystem.
What is Sqoop in Hadoop?
What is Sqoop in Hadoop?
It is a tool design to transfer the data between Relational database management system(RDBMS) and Hadoop HDFS.
Thus, we can sqoop the data from RDBMS like mySql or Oracle into HDFS of Hadoop as well as exporting data from HDFS file to RDBMS.
Sqoop will read the table row-by-row and the import process is performed in Parallel. Thus, the output may be in multiple files.
Example:
sqoop INTO "directory";
(SELECT * FROM database.table WHERE condition;)
Hadoop
Interview Questions
1. What is Hadoop framework?
Ans:
Hadoop is a open source framework which is written in java by apchesoftware
foundation. This framework is used to wirite software application whichrequires
to process vast amount of data (It could handle multi tera bytes of data)
.
It works
in-paralle on large clusters which could have 1000 of computers (Nodes)on the
clusters. It also process data very reliably and fault-tolerant manner. Seethe
below image how does it looks.
2.
On What concept the Hadoop framework works?
Ans
: It works on MapReduce, and it is devised by the Google.
3.
What is MapReduce ?
Ans:
Map reduce is an algorithm or concept to process Huge amount of data in afaster
way. As per its name you can divide it Map and Reduce.
The main
MapReduce job usually splits the input data-set into independent chunks.(Big
data sets in the multiple small datasets)
MapTask: will
process these chunks in a completely parallel manner (One node canprocess one
or more chunks).
The framework
sorts the outputs of the maps.
Reduce Task :
And the above output will be the input for the reducetasks, producesthe final
result.Your business logic would be written in the MappedTask and
ReducedTask.Typically both the input and the output of the job are stored in a
file-system (Notdatabase). The framework takes care of scheduling tasks,
monitoring them andre-executes the failed tasks.
4.
What is compute and Storage nodes?
Ans:
Compute
Node
: This is the
computer or machine where your actual businesslogic will be executed.
Storage Node:
This is the
computer or machine where your file system reside tostore the processing
data.In most of the cases compute node and storage node would be the samemachine.
5.
How does master slave architecture in the Hadoop?
Ans:
The MapReduce framework consists of a single master
JobTracker
andmultiple
slaves, each cluster-node will have one
TaskskTracker
.The master is
responsible for scheduling the jobs' component tasks on theslaves, monitoring
them and re-executing the failed tasks. The slaves execute thetasks as directed
by the master.
6.
How does an Hadoop application look like or their basic components?
Ans:
Minimally an Hadoop application would have following components.
Input location
of data
Output location
of processed data.
A map
task.
A reduced
task.
Job
configurationThe Hadoop job client then submits the job (jar/executable etc.)
and configurationto the
JobTracker
which then
assumes the responsibility of distributing thesoftware/configuration to the
slaves, scheduling tasks and monitoring them,providing status and diagnostic
information to the job-client.
7.
Explain how input and output data format of the Hadoop framework?
Ans:
The MapReduce framework operates exclusively on pairs, that is, theframework
views the input to the job as a set of pairs and produces a set of
pairsas
the output of the job,
conceivably of
different types
. See the flow
mentioned below
(input)
->
map
-> ->
combine/sorting
->
->
reduce
->
(output)
8.
What are the restriction to the key and value class ?
Ans: The key
and value classes have to be serialized by the framework. To make
themserializable Hadoop provides a
Writable
interface. As
you know from the java itself thatthe key of the Map should be comparable,
hence the key has to implement one moreinterface
WritableComparable.
9.
Explain the WordCount implementation via Hadoop framework ?
Ans:
We will count
the words in all the input file flow as below
input
Assume
there are two files each having a sentence
Hello World
Hello World (In file 1)
Hello World
Hello World (In file 2)
Mapper : There
would be each mapper for the a file
For the given
sample input the first map output:
< Hello,
1>
< World,
1>
< Hello,
1>
< World,
1>
The second map
output:
< Hello,
1>
< World,
1>
< Hello,
1>
< World,
1>
Combiner/Sorting
(This is done for each individual map)
So output
looks like this
The output of
the first map:
< Hello,
2>
< World,
2>
The output of
the second map:
< Hello,
2>
< World,
2>
Reducer :
It sums up the
above output and generates the output as below
< Hello,
4>
< World,
4>
Output
Final
output would look likeHello 4 timesWorld 4 times
10.
Which interface needs to be implemented to create Mapper and Reducer for
the Hadoop?
Ans:
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
11.
What Mapper does?
Ans:
Maps are the
individual tasks that transform i
nput records
into intermediate records. The transformed intermediate records do not needto
be of the same type as the input records. A given input pair may map to zero or
manyoutput pairs.
12.
What is the InputSplit in map reduce software?
Ans: An
InputSplit
is a logical
representation of a unit (A chunk) of input work for amap task; e.g., a
filename and a byte range within that file to process or a row set in a
textfile.
13.
What is the
InputFormat ?
Ans:
The
InputFormat
is responsible
for enumerate (itemise) the
InputSplits
, and producing
a
RecordReader
which will turn
those logical work units into actual physicalinput records.
14.
Where do you specify the Mapper Implementation?
Ans:
Generally mapper implementation is specified in the Job itself.
15.
How Mapper is instantiated in a running job?
Ans
: The Mapper
itself is instantiated in the running job, and will be passed aMapContext
object which it can use to configure itself.
16.
Which are the methods in the Mapper interface?
Ans :
The Mapper
contains the run() method, which call its own setup() methodonly once, it also
call a map() method for each input and finally calls it cleanup()method. All
above methods you can override in your code.
17.
What happens if you d
on’t
override the Mapper methods and keep them as
it
is?
Ans:
If you do not override any methods (leaving even map as-is), it will act asthe
identity function, emitting each input record as a separate output.
18.
What is the use of Context object?
Ans:
The Context object allows the mapper to interact with the rest of the
Hadoopsystem. ItIncludes configuration data for the job, as well as interfaces
which allow it to emitoutput.
19.
How can you add the arbitrary key-value pairs in your mapper?
Ans:
You can set arbitrary (key, value) pairs of configuration data in your Job,e.g.
with
Job.getConfiguration().set("myKey",
"myVal"),
and then retrieve
this datain your mapper with
Context.getConfiguration().get("myKey")
. This kind
of functionality is typically done in the Mapper's setup() method.
20.
How
does Mapper’s run() method works?
Ans:
The
Mapper.run() method then calls map
(KeyInType, ValInType,
Context)
for
eachkey/value pair in the
InputSplit
for that
task
21.
Which object can be used to get the progress of a particular job ?
Ans:
Context
22.
What is next step after Mapper or MapTask?
Ans : The
output of the Mapper are sorted and Partitions will be created for theoutput.
Number of partition depends on the number of reducer.
23.
How can we control particular key should go in a specific reducer?
Ans:
Users can control which keys (and hence records) go to which Reducer
byimplementing a custom Partitioner.
24.
What is the use of Combiner?
Ans:
It is an optional component or class, and can be specify via
Job.setCombinerClass(ClassName)
, to perform
local aggregation of theintermediate outputs, which helps to cut down the
amount of data transferredfrom the Mapper to theReducer.
25.
How many maps are there in a particular Job?
Ans:
The number of
maps is usually driven by the total size of the inputs, that is,the total
number of blocks of the input files.Generally it is around 10-100 maps
per-node. Task setup takes awhile, so it isbest if the maps take at least a
minute to execute.Suppose, if you expect 10TB of input data and have a
blocksize of 128MB, you'llend up with82,000 maps, to control the number of
block you can use the
mapreduce.job.maps
parameter
(which only provides a hint to
the framework).
Ultimately, the
number of tasks is controlled by the number of splits returned bythe
InputFormat.getSplits()
method (which
you can override).
26.
What is the Reducer used for?
Ans:
Reducer
reduces a set
of intermediate values which share a key to a(usually smaller) set of v
alues.
The
number of reduces for the job is set by the user via
Job.setNumReduceTasks(int)
.
27.
Explain the core methods of the Reducer?
Ans:
The API of
Reducer
is very similar
to that of Mapper, there's a
run()
methodthat
receives a
Context
containing the
job's configuration as well as interfacingmethods that return data from the
reducer itself back to the framework. The
run()
method
calls
setup()
once,
reduce()
once for each
key associated with thereduce task, and
cleanup()
once at the
end. Each of these methods can accessthe job's configuration data by using
Context.getConfiguration()
. As in
Mapper, any or all of these methods can be overridden with customimplementations.
If none of these methods are overridden, the default reducer operation is
the identity function; values are passed through without
further processing.
The heart of
Reducer is its
reduce
() method. This
is called once per key; thesecond argument is an
Iterable
which returns
all the values associated with thatkey.
28.
What are the primary phases of the Reducer?
Ans:
Shuffle, Sort and Reduce
29.
Explain the shuffle?
Ans: Input to
the
Reducer
is the sorted
output of the mappers. In this phase theframework fetches the relevant
partition of the output of all the mappers, via HTTP.
30.
Explain
the Reducer’s Sort phase?
Ans:
The framework
groups
Reducer
inputs by keys
(since different mappers mayhave output the same key) in this stage. The
shuffle and sort phases occur simultaneously;while map-outputs are being
fetched they are merged (It is similar to merge-sort).
31.
Explain
the Reducer’s reduce phase?
Ans: In this
phase the
reduce(MapOutKeyType, Iterable, Context)
method is
called for each pair in the grouped inputs. The output of the reduce task
is typically written to the
FileSystem
via
Context.write(ReduceOutKeyType, ReduceOutValType).
Applicationscan
use the Context to report progress, set application-level status messages and
updateCounters, or just indicate that they are alive. The output of the Reducer
is not sorted.
32.
How many Reducers should be configured?
Ans:
The right
number of reduces seems to be
0.95
or
1.75
multiplied by
(<
no.
of nodes
> *
mapreduce.tasktracker.reduce.tasks.maximum
).
With 0.95 all
of the reduces can launch immediately and start transfering map outputs asthe
maps finish. With 1.75 the faster nodes will finish their first round of
reduces andlaunch a second wave of reduces doing a much better job of load
balancing. Increasingthe number of reduces increases the framework overhead,
but increases load balancingand lowers the cost of failures.
33.
It can be possible that a Job has 0 reducers?
Ans: It is
legal to set the number of reduce-tasks to zero if no reduction is desired.
34.
What happens if number of reducers are 0?
Ans: In this
case the outputs of the map-tasks go directly to the FileSystem, into theoutput
path set by
setOutputPath(Path)
. The framework
does not sort the map-outputs before writing them out to the FileSystem.
35.
How many instances of JobTracker can run on a Hadoop Cluser?
Ans:
Only one
36.
What is the JobTracker and what it performs in a Hadoop Cluster?
Ans:
JobTracker is a daemon service which submits and tracks the MapReducetasks to
the Hadoop cluster. It runs its own JVM process. And usually it run on
aseparate machine, and each slave node is configured with job tracker
nodelocation.
The JobTracker
is single point of failure for the Hadoop MapReduce service. If itgoes down,
all running jobs are halted.
JobTracker
in Hadoop performs following actions
1.1.
Client
applications submit jobs to the Job tracker.
The
JobTracker talks to the NameNode to determine the location of the data
The
JobTracker locates TaskTracker nodes with available slots at or near thedata
The
JobTracker submits the work to the chosen TaskTracker nodes.
The
TaskTracker nodes are monitored. If they do not submit heartbeat signalsoften enough, they are deemed to have failed and
the work is scheduled on adifferent TaskTracker.
A
TaskTracker will notify the JobTracker when a task fails. The
JobTracker decides what to do then: it may resubmit the job elsewhere, it
may mark thatspecific record as something to avoid, and it may may even
blacklist theTaskTracker as unreliable.
When
the work is completed, the JobTracker updates its status.
Client
applications can poll the JobTracker for information.
37.
How a task is scheduled by a JobTracker?
Ans:
The TaskTrackers send out heartbeat messages to the JobTracker, usuallyevery
few minutes, to reassure the JobTracker that it is still alive. Thesemessages
also inform the JobTracker of the number of available slots, so theJobTracker
can stay up to date with where in the cluster work can be delegated.When the
JobTracker tries to find somewhere to schedule a task within theMapReduce
operations, it first looks for an empty slot on the same server thathosts the
DataNode containing the data, and if not, it looks for an empty slot on
amachine in the same rack.
38.
How many instances of Tasktracker run on a Hadoop cluster?
Ans:
There is one Daemon Tasktracker process for each slave node in theHadoop
cluster.
39.
What are the two main parts of the Hadoop framework?
Ans:
Hadoop consists of two main parts
Hadoop
distributed file system
, a distributed
file system with high throughput,
Hadoop
MapReduce
, a software
framework for processing large data sets.
40.
Explain the use of TaskTracker in the Hadoop cluster?
Ans:
A Tasktracker is a slave node in the cluster which that accepts the tasksfrom
JobTracker like Map, Reduce or shuffle operation. Tasktracker also runs inits
own JVM Process.Every TaskTracker is configured with a set of slots; these
indicate the number of tasks that it can accept. The TaskTracker starts a
separate JVM processes to dothe actual work (called as Task Instance) this is
to ensure that process failuredoes not take down the task tracker.The
Tasktracker monitors these task instances, capturing the output and exitcodes.
When the Task instances finish, successfully or not, the task
tracker notifies the JobTracker.The TaskTrackers also send out heartbeat
messages to the JobTracker, usuallyevery few minutes, to reassure the
JobTracker that it is still alive. Thesemessages also inform the JobTracker of
the number of available slots, so theJobTracker can stay up to date with where
in the cluster work can be delegated
.
41.
What do you mean by TaskInstance?
Ans:
Task instances are the actual
MapReduce jobs
which run on
each slavenode. The TaskTracker starts a separate JVM processes to do the
actual work(called as Task Instance) this is to ensure that process failure
does not take downthe entire task tracker.Each Task Instance runs on its own
JVM process. Therecan be multiple processes of task instance running on a slave
node. This isbased on the number of slots configured on task tracker. By
default a new taskinstance JVM process is spawned for a task.
42.
How many daemon processes run on a Hadoop cluster?
Ans:
Hadoop is comprised of five separate daemons. Each of these daemonsruns in its
own JVM.Following 3 Daemons run on Master nodes.
NameNode
- This daemon
storesand maintains the metadata for HDFS.
Secondary
NameNode
- Performs
housekeeping functions for the NameNode.
JobTracker
- Manages
MapReduce jobs, distributes individual tasks to machinesrunning the Task
Tracker. Following 2 Daemons run on each Slave nodes
DataNode
–
Stores actual
HDFS data blocks.
TaskTracker
–
It is
Responsible for instantiating and monitoring individual Mapand Reduce tasks.
43.
How many maximum JVM can run on a slave node?
Ans:
One or Multiple instances of Task Instance can run on each slave node.Each task
instance is run as a separate JVM process. The number of Taskinstances can be
controlled by configuration. Typically a high end machine isconfigured to run
more task instances.
44.
What is NAS?
Ans:
It is one kind of file system where data can reside on one centralizedmachine
and all the cluster member will read write data from that shareddatabase, which
would not be as efficient as HDFS.
45.
How HDFA differs with NFS?
Ans:
Following are differences between HDFS and NAS1.
o
In HDFS Data
Blocks are distributed across local drives of all machines in acluster. Whereas
in NAS data is storedon dedicated hardware.
o
HDFS is
designed to work with MapReduce System, since computation ismoved to data. NAS
is not suitable for MapReduce since data is storedseparately from the
computations.
o
HDFS runs on a
cluster of machines and provides redundancy usingreplication protocol. Whereas
NAS is provided by a single machinetherefore does not provide data redundancy.
46.
How does a NameNode handle the failure of the data nodes?
Ans:HDFS has master/slave architecture. An
HDFS cluster consists of a single
NameNode
, a master
server that manages the file system namespace and regulatesaccess to files by
clients.In addition, there are a number of DataNodes, usually one per node in
the cluster,which manage storage attached to the nodes that they run on.
The NameNode
and DataNode are pieces of software designed to run on
commoditymachines.NameNode periodically receives a Heartbeat and a Block report
from each of theDataNodes in the cluster. Receipt of a Heartbeat implies that
the DataNode isfunctioning properly. A Blockreport contains a list of all
blocks on a DataNode. WhenNameNode notices that it has not received a heartbeat
message from a data nodeafter a certain amount of time, the data node is marked
as dead. Since blocks will beunder replicated the system begins replicating the
blocks that were stored on thedead DataNode. The NameNode Orchestrates the
replication of data blocks fromone DataNode to another. The replication data
transfer happens directly betweenDataNode and the data never passes through the
NameNode.
47.
Can Reducer talk with each other?
Ans:
No, Reducer runs in isolation.
48.
Where
the Mapper’s Intermediate data will be stored?
Ans:
The mapper output (intermediate data) is stored on the Local file system(NOT
HDFS) of each individual mapper nodes. This is typically a temporarydirectory
location which can be setup in config by the Hadoop administrator.
Theintermediate data is cleaned up after the Hadoop Job completes.
49.
What is the use of Combiners in the Hadoop framework?
Ans:
Combiners are used to increase the efficiency of a MapReduce program.They are
used to aggregate intermediate map output locally on individual
mapper outputs. Combiners can help you reduce the amount of data that
needs to betransferred across to the reducers.You can use your reducer code as
a combiner if the operation performed iscommutative and associative.The
execution of combiner is not guaranteed; Hadoop may or may not execute
acombiner. Also, if required it may execute it more than 1 times. Therefore
your MapReduce
jobs
should not depend on the combiners’ execution.
50.
What is the Hadoop MapReduce API contract for a key and value Class?
Ans:
◦The Key must
implement the
org.apache.hadoop.io.WritableComparable
interface.
◦The value must
implement the
org.apache.hadoop.io.Writable
interface.
51.
What is a IdentityMapper and IdentityReducer in MapReduce?
◦
org.apache.hadoop.mapred.lib.IdentityMapper
: Implements
the identity function,mapping inputs directly to outputs. If MapReduce
programmer does not set theMapper Class using
JobConf.setMapperClass
then
IdentityMapper.class
is usedas a
default value.
◦
org.apache.hadoop.mapred.lib.IdentityReducer
: Performs no
reduction, writingall input values directly to the output. If MapReduce
programmer does not set theReducer Class using
JobConf.setReducerClass
then
IdentityReducer.class
isused as a
default value.
52.
What is the meaning of speculative execution in Hadoop? Why is itimportant?
Ans:
Speculative execution is a way of coping with individual Machine performance.In
large clusters where hundreds or thousands of machines are involved there maybe
machines which are not performing as fast as others.This may result in delays
in a full job due to only one machine not performaing well.To avoid this,
speculative execution in hadoop can run multiple copies of same mapor reduce
task on different slave nodes. The results from first node to finish are used.
53.
When the reducers are are started in a MapReduce job?
Ans: In a
MapReduce job reducers do not start executing the reduce method untilthe all
Map jobs have completed. Reducers start copying intermediate key-valuepairs
from the mappers as soon as they are available. The programmer definedreduce
method is called only after all the mappers have finished.If reducers do not
start before all mappers finish then why does the progress onMapReduce job
shows something like Map(50%) Reduce(10%)? Why reducersprogress percentage is
displayed when mapper is not finished yet?Reducers start copying intermediate
key-value pairs from the mappers as soon asthey are available. The progress
calculation also takes in account the processing of data transfer which is
done by reduce process, therefore the reduce progress startsshowing up as soon
as any intermediate key-value pair for a mapper is available tobe transferred
to reducer.Though the reducer progress is updated still the programmer defined
reduce methodis called only after all the mappers have finished.
54.
What is HDFS ? How it is different from traditional file systems?
HDFS, the
Hadoop Distributed File System, is responsible for storing huge data onthe
cluster. This is a distributed file system designed to run on commodity
hardware.It has many similarities with existing distributed file systems.
However, thedifferences from other distributed file systems are significant.
◦HDFS is highly
fault
-tolerant and
is designed to be deployed on low-cost hardware.
◦HDFS provides
high throughput access to application data and is suitable for
applications
that have large data sets.
◦HDFS is
designed to support very large files. Applications that are compatible with
HDFS are those
that deal with large data sets. These applications write their dataonly once
but they read it one or more times and require these reads to be satisfiedat
streaming speeds. HDFS supports write-once-read-many semantics on files.
55.
What is HDFS Block size? How is it different from traditional file systemblock
size?
In HDFS data is
split into blocks and distributed across multiple nodes in the cluster.Each
block is typically 64Mb or 128Mb in size.Each block is replicated multiple
times. Default is to replicate each block three times.Replicas are stored on
different nodes. HDFS utilizes the local file system to storeeach HDFS block as
a separate file. HDFS Block size can not be compared with thetraditional file
system block size.
57.What
is a NameNode? How many instances of NameNode run on a HadoopCluster?
The NameNode is
the centerpiece of an HDFS file system. It keeps the directory treeof all files
in the file system, and tracks where across the cluster the file data is
kept.It does not store the data of these files itself.There is only One
NameNode process run on any hadoop cluster. NameNode runson its own JVM
process. In a typical production cluster its run on a separatemachine.The
NameNode is a Single Point of Failure for the HDFS Cluster. When theNameNode
goes down, the file system goes offline.
Client
applications talk to the NameNode whenever they wish to locate a file,
or when they want to add/copy/move/delete a file. The NameNode responds
thesuccessful requests by returning a list of relevant DataNode servers where
the datalives.
58.
What is a DataNode? How many instances of DataNode run on a HadoopCluster?
A
DataNode stores data in the Hadoop File System HDFS. There is only OneDataNode
process run on any hadoop slave node. DataNode runs on its own JVMprocess. On
startup, a DataNode connects to the NameNode. DataNode instancescan talk to
each other, this is mostly during replicating data.
59.
How the Client communicates with HDFS?
The Client
communication to HDFS happens using Hadoop HDFS API. Clientapplications talk to
the NameNode whenever they wish to locate a file, or when theywant to
add/copy/move/delete a file on HDFS. The NameNode responds thesuccessful
requests by returning a list of relevant DataNode servers where the datalives.
Client applications can talk directly to a DataNode, once the NameNode
hasprovided the location of the data.
60.
How the HDFS Blocks are replicated?
HDFS is
designed to reliably store very large files across machines in a large
cluster.It stores each file as a sequence of blocks; all blocks in a file
except the last blockare the same size.The blocks of a file are replicated for
fault tolerance. The block size and replicationfactor are configurable per file. An application can
specify the number of replicas of afile.
The replication factor can be specified at file creation time and can be changedlater. Files in HDFS are write-once and have strictly
one writer at any time.The NameNode
makes all decisions regarding replication of blocks. HDFS usesrack-aware replica placement policy. In default
configuration there are total 3 copiesof
a datablock on HDFS, 2 copies are stored on datanodes on same rack and 3rdcopy on a different rack.
No comments:
Post a Comment