Friday, 19 December 2014

HDFS Admin Commands



While the dfs module for bin/hadoop provides common file and directory manipulation commands, they all work with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole. The operation of the commands in this module is described in this section.
Getting overall status: A brief status report for HDFS can be retrieved with bin/hadoop dfsadmin -report. This returns basic information about the overall health of the HDFS cluster, as well as some per-server metrics.
More involved status: If you need to know more details about what the state of the NameNode's metadata is, the command bin/hadoop dfsadmin -metasave filename will record this information in filename. The metasave command will enumerate lists of blocks which are under-replicated, in the process of being replicated, and scheduled for deletion. NB: The help for this command states that it "saves NameNode's primary data structures," but this is a misnomer; the NameNode's state cannot be restored from this information. However, it will provide good information about how the NameNode is managing HDFS's blocks.
Safemode: Safemode is an HDFS state in which the file system is mounted read-only; no replication is performed, nor can files be created or deleted. This is automatically entered as the NameNode starts, to allow all DataNodes time to check in with the NameNode and announce which blocks they hold, before the NameNode determines which blocks are under-replicated, etc. The NameNode waits until a specific percentage of the blocks are present and accounted for; this is controlled in the configuration by the dfs.safemode.threshold.pct parameter. After this threshold is met, safemode is automatically exited, and HDFS allows normal operations. The bin/hadoop dfsadmin -safemode what command allows the user to manipulate safemode based on the value of what, described below (a short example follows the list):
  • enter - Enters safemode
  • leave - Forces the NameNode to exit safemode
  • get - Returns a string indicating whether safemode is ON or OFF
  • wait - Waits until safemode has exited and returns
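The sketch below shows how these status and safemode commands might be combined in a routine maintenance check; the metasave file name is only an illustration.

    # Overall cluster health and per-DataNode metrics
    bin/hadoop dfsadmin -report

    # Dump the NameNode's block lists (under-replicated, being replicated,
    # scheduled for deletion) to a file under the NameNode's log directory
    bin/hadoop dfsadmin -metasave meta.log

    # Check whether the cluster is currently in safemode
    bin/hadoop dfsadmin -safemode get

    # Block until safemode has been exited before running further jobs
    bin/hadoop dfsadmin -safemode wait

    # Manually enter and leave safemode around a maintenance window
    bin/hadoop dfsadmin -safemode enter
    bin/hadoop dfsadmin -safemode leave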
Changing HDFS membership - When decommissioning nodes, it is important to disconnect nodes from HDFS gradually to ensure that data is not lost. See the section on decommissioning later in this document for an explanation of the use of the -refreshNodes dfsadmin command.
Upgrading HDFS versions - When upgrading from one version of Hadoop to the next, the file formats used by the NameNode and DataNodes may change. When you first start the new version of Hadoop on the cluster, you need to tell Hadoop to change the HDFS version (or else it will not mount), using the command: bin/start-dfs.sh -upgrade. It will then begin upgrading the HDFS version. The status of an ongoing upgrade operation can be queried with the bin/hadoop dfsadmin -upgradeProgress status command. More verbose information can be retrieved with bin/hadoop dfsadmin -upgradeProgress details. If the upgrade is blocked and you would like to force it to continue, use the command: bin/hadoop dfsadmin -upgradeProgress force. (Note: be sure you know what you are doing if you use this last command.)
When HDFS is upgraded, Hadoop retains backup information allowing you to downgrade to the original HDFS version in case you need to revert Hadoop versions. To back out the changes, stop the cluster, re-install the older version of Hadoop, and then use the command: bin/start-dfs.sh -rollback. It will restore the previous HDFS state.
Only one such archival copy can be kept at a time. Thus, after a few days of operation with the new version (when it is deemed stable), the archival copy can be removed with the command bin/hadoop dfsadmin -finalizeUpgrade. The rollback command cannot be issued after this point. This must be performed before a second Hadoop upgrade is allowed.
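Putting the upgrade-related commands together, a typical cycle might look like the sketch below (typically run as the HDFS superuser; the ordering follows the description above).

    # Stop the old version, install the new Hadoop release, then start with -upgrade
    bin/stop-dfs.sh
    bin/start-dfs.sh -upgrade

    # Watch the upgrade until it completes
    bin/hadoop dfsadmin -upgradeProgress status
    bin/hadoop dfsadmin -upgradeProgress details

    # Either roll back to the previous version (stop the cluster and
    # re-install the older Hadoop first) ...
    bin/start-dfs.sh -rollback

    # ... or, once the new version is deemed stable, discard the backup copy
    bin/hadoop dfsadmin -finalizeUpgrade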
Getting help - As with the dfs module, typing bin/hadoop dfsadmin -help cmd will provide more usage information about the particular command.
For detailed commands, see the link below.

ZOOKEEPER


Apache ZooKeeper provides operational services for a Hadoop cluster. ZooKeeper provides a distributed configuration service, a synchronization service and a naming registry for distributed systems. Distributed applications use Zookeeper to store and mediate updates to important configuration information.

What ZooKeeper Does

ZooKeeper provides a very simple interface and services. ZooKeeper brings these key benefits:
  • Fast. ZooKeeper is especially fast with workloads where reads to the data are more common than writes. The ideal read/write ratio is about 10:1.
  • Reliable. ZooKeeper is replicated over a set of hosts (called an ensemble) and the servers are aware of each other. As long as a critical mass of servers is available, the ZooKeeper service will also be available. There is no single point of failure.
  • Simple. ZooKeeper maintains a standard hierarchical name space, similar to files and directories.
  • Ordered. The service maintains a record of all transactions, which can be used for higher-level abstractions, like synchronization primitives.

How ZooKeeper Works

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers, known as znodes. Every znode is identified by a path, with path elements separated by a slash (“/”). Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.
This is much like a normal file system, but ZooKeeper provides superior reliability through redundant services. A service is replicated over a set of machines and each maintains an in-memory image of the data tree and transaction logs. Clients connect to a single ZooKeeper server and maintain a TCP connection through which they send requests and receive responses.
This architecture allows ZooKeeper to provide high throughput and availability with low latency, but the size of the database that ZooKeeper can manage is limited by memory.
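As a rough illustration of working with znodes, the commands below use the zkCli.sh client that ships with ZooKeeper; the server address and znode paths are assumptions.

    # connect to a ZooKeeper server (assumed to be running on localhost:2181)
    bin/zkCli.sh -server localhost:2181

    # the following are typed at the zkCli prompt:
    # create a small hierarchy of znodes
    create /myapp ""
    create /myapp/config "db=primary"

    # read the data of a znode and list its children
    get /myapp/config
    ls /myapp

    # children must be removed before their parent can be deleted
    delete /myapp/config
    delete /myapp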


For more details on ZooKeeper administration, see the link below.

Zookeeper Admin Guide

SQOOP & FLUME

 SQOOP

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as: Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

What Sqoop Does

Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:
  • Allows data imports from external datastores and enterprise data warehouses into Hadoop
  • Parallelizes data transfer for fast performance and optimal system utilization
  • Copies data quickly from external systems to Hadoop
  • Makes data analysis more efficient 
  • Mitigates excessive loads to external systems.

How Sqoop Works

Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
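As an example of what a typical transfer looks like, the sketch below imports a table from a relational database into HDFS and exports a result table back; the JDBC connection string, table names and directories are hypothetical.

    # import the "orders" table from MySQL into HDFS, using 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table orders \
      --target-dir /user/analyst/orders \
      --num-mappers 4

    # export a processed result set from HDFS back into the database
    sqoop export \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table order_summary \
      --export-dir /user/analyst/order_summary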

Refer to the links below for complete details on Sqoop and its commands.

 Sqoop User Guide

Hortonworks Sqoop example


FLUME

Apache™ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

What Flume Does

Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:
  • Stream data from multiple sources into Hadoop for analysis
  • Collect high-volume Web logs in real time
  • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
  • Guarantee data delivery
  • Scale horizontally to handle additional data volume

How Flume Works

Flume’s high-level architecture is focused on delivering a streamlined codebase that is easy-to-use and easy-to-extend. The project team has designed Flume with the following components:
  • Event – a singular unit of data that is transported by Flume (typically a single log entry)
  • Source – the entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
  • Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
  • Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
  • Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
  • Client – produces and transmits the Event to the Source operating within the Agent
A flow in Flume starts from the Client. The Client transmits the event to a Source operating within the Agent. The Source receiving this event then delivers it to one or more Channels. These Channels are drained by one or more Sinks operating within the same Agent. Channels allow decoupling of ingestion rate from drain rate using the familiar producer-consumer model of data exchange. When spikes in client side activity cause data to be generated faster than what the provisioned capacity on the destination can handle, the channel size increases. This allows sources to continue normal operation for the duration of the spike. Flume agents can be chained together by connecting the sink of one agent to the source of another agent. This enables the creation of complex dataflow topologies.
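To make the Source, Channel and Sink wiring concrete, here is a minimal single-agent configuration sketch: a netcat source feeding a memory channel that is drained by an HDFS sink. The agent name ("a1"), the port and the HDFS path are assumptions for illustration.

    # contents of a hypothetical flume-example.conf
    # name this agent's sources, channels and sinks
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # source: accept newline-separated events on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # sink: write the events out to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events

    # wire the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

The agent would then be started with something like flume-ng agent --conf conf --conf-file flume-example.conf --name a1, after which anything written to the netcat port flows through the channel into HDFS.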

Reliability & Scaling

Flume is designed to be highly reliable, so that no data is lost during normal operation. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for Flume agents. Flume is architected to be fully distributed with no central coordination point. Each agent runs independently of the others, with no inherent single point of failure. Flume also features built-in support for load balancing and failover. Flume's fully decentralized architecture also plays a key role in its ability to scale: since each agent runs independently, Flume can be scaled horizontally with ease.

For more details on Flume, see the links below.

Flume User Guide

Hortonworks example on Flume

Hortonworks Example 2

HIVE Vs PIG


While I was looking at Hive and Pig for processing large amounts of data without the need to write MapReduce code, I found that there is no easy way to compare them against each other without reading into both in greater detail.

In this post I am trying to give you a 10,000ft view of both and compare some of the more prominent and interesting features. The following table - which is discussed below - compares what I deemed to be such features:


Feature                           Hive                 Pig
--------------------------------  -------------------  ---------------
Language                          SQL-like             PigLatin
Schemas/Types                     Yes (explicit)       Yes (implicit)
Partitions                        Yes                  No
Server                            Optional (Thrift)    No
User Defined Functions (UDF)      Yes (Java)           Yes (Java)
Custom Serializer/Deserializer    Yes                  Yes
DFS Direct Access                 Yes (implicit)       Yes (explicit)
Join/Order/Sort                   Yes                  Yes
Shell                             Yes                  Yes
Streaming                         Yes                  Yes
Web Interface                     Yes                  No
JDBC/ODBC                         Yes (limited)        No


Let us look now into each of these with a bit more detail.

General Purpose

The question is "What does Hive or Pig solve?". Both - and I think this is lucky for us in regards to comparing them - have a very similar goal. They try to ease the complexity of writing MapReduce jobs in a programming language like Java by giving the user a set of tools that they may be more familiar with (more on this below). The raw data is stored in Hadoop's HDFS and can be in any format, although natively it is usually a TAB-separated text file; internally they may also make use of Hadoop's SequenceFile format. The idea is to be able to parse a raw data file, for example a web server log file, and use the contained information to slice and dice it into whatever the business needs. Therefore they both provide means to aggregate fields based on specific keys. In the end they both emit the result again in either text or a custom file format. Efforts are also underway to have both use other systems as a source for data, for example HBase.

The features I am comparing are chosen pretty much at random because they stood out when I read into each of these two frameworks. So keep in mind that this is a subjective list.

Language

Hive lends itself to SQL. But since it can only read existing files in HDFS, it lacks UPDATE or DELETE support, for example. It focuses primarily on the query part of SQL. But even there it has its own spin on things, to better reflect the underlying MapReduce process. Overall it seems that someone familiar with SQL can very quickly learn Hive's version of it and get results fast.

Pig, on the other hand, looks more like a very simplistic scripting language. As with those (and this is a nearly religious topic), some are more intuitive and some are less so. With PigLatin I was able to see what the samples do, but lacking full knowledge of its syntax I found myself wondering whether I would really be able to get what I needed without too many trial-and-error loops. Sure, Hive's SQL probably needs just as many iterations to fully grasp - but there is at least a greater understanding of what to expect.
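To give a feel for the two languages, here is the same toy aggregation (counting page hits per user in a tab-separated log) expressed once in Hive's SQL dialect and once in PigLatin, both launched from the command line; the table name, file paths and fields are made up for illustration.

    # Hive: SQL-like, assuming a "logs(userid, url)" table has already been
    # defined (see the Schemas/Types section below)
    hive -e "SELECT userid, COUNT(*) AS hits FROM logs GROUP BY userid;"

    # Pig: the same aggregation as a PigLatin script
    pig -e "
      logs    = LOAD '/data/logs' USING PigStorage('\t')
                AS (userid:chararray, url:chararray);
      by_user = GROUP logs BY userid;
      counts  = FOREACH by_user GENERATE group AS userid, COUNT(logs) AS hits;
      DUMP counts;
    "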

Schemas/Types

Hive uses once more a specific variation of SQL's Data Definition Language (DDL). It defines the "tables" beforehand and stores the schema in either a shared or a local database. Any JDBC offering will do, but it also comes with a built-in Derby instance to get you started quickly. If the database is local, then only you can run specific Hive commands. If you share the database then others can also run these - or they would have to set up their own local database copy. Types are also defined upfront, and supported types are INT, BIGINT, BOOLEAN, STRING and so on. There are also array types that let you handle specific fields in the raw data files as a group.

Pig has no such metadata database. Datatypes and schemas are defined within each script. Furthermore, types are usually determined automatically by how they are used. So if you use a field as an integer, it is handled that way by Pig. You do have the option, though, to override this with explicit type definitions, again within the script that needs them. Pig has a similar set of types compared to Hive. For example, it also has an array type, called a "bag".
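The difference in how the schema is declared might look roughly like this; the layout is the same hypothetical log data as above.

    # Hive: the schema is declared up front with DDL and stored in the metastore
    hive -e "
      CREATE TABLE logs (userid STRING, url STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    "

    # Pig: the schema (if any) lives inside the script that uses it
    pig -e "
      logs = LOAD '/data/logs' USING PigStorage('\t')
             AS (userid:chararray, url:chararray);
      DESCRIBE logs;
    "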

Partitions

Hive has a notion of partitions. They are basically subdirectories in HDFS. This allows, for example, processing a subset of the data by alphabet or date. It is up to the user to create these "partitions", as they are neither enforced nor required.
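A sketch of how Hive's partitions could be used for the date case just mentioned; the table and partition names are hypothetical.

    # each distinct value of the "dt" partition column becomes its own
    # subdirectory under the table's directory in HDFS
    hive -e "CREATE TABLE logs_by_day (userid STRING, url STRING)
             PARTITIONED BY (dt STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"

    # a query restricted to one partition only has to read that subdirectory
    hive -e "SELECT COUNT(*) FROM logs_by_day WHERE dt = '2014-12-19';"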

Pig does not seem to have such a feature. It may be that filters can achieve the same but it is not immediately obvious to me.

Server

Hive can start an optional server, which is allegedly Thrift based. With the server I presume you can send queries from anywhere to the Hive server which in turn executes them.

Pig does not seem to have such a facility yet.

User Defined Functions

Hive and Pig allow for user functionality by supplying Java code to the query process. These functions can add any additional feature that is required to crunch the numbers as required.
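Hooking a custom Java function into either system is mostly a matter of registering the jar that contains it; the jar path and class names below are hypothetical.

    # Hive: add the jar to the session and expose the class as a function
    hive -e "
      ADD JAR /opt/udfs/my-udfs.jar;
      CREATE TEMPORARY FUNCTION clean_url AS 'com.example.hive.CleanUrl';
      SELECT clean_url(url) FROM logs LIMIT 10;
    "

    # Pig: register the jar and give the class a short alias with DEFINE
    pig -e "
      REGISTER /opt/udfs/my-udfs.jar;
      DEFINE CleanUrl com.example.pig.CleanUrl();
      logs    = LOAD '/data/logs' AS (userid:chararray, url:chararray);
      cleaned = FOREACH logs GENERATE userid, CleanUrl(url);
      DUMP cleaned;
    "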

Custom Serializer/Deserializer

Again, both Hive and Pig allow for custom Java classes that can read or write any file format required. I also assume that is how it connects to HBase eventually (just a guess). You can write a parser for Apache log files or, for example, the binary Tokyo Tyrant Ulog format. The same goes for the output, write a database output class and you can write the results back into a database.

DFS Direct Access

Hive is smart about how to access the raw data. A "select * from table limit 10" for example does a direct read from the file. If the query is too complicated it will fall back to use a full MapReduce run to determine the outcome, just as expected.

With Pig I am not sure if it does the same to speed up simple PigLatin scripts. At least it does not seem to be mentioned anywhere as an important feature.

Join/Order/Sort

Hive and Pig have support for joining, ordering or sorting data dynamically. These features serve the same purpose in both, allowing you to aggregate and sort the result as needed. Pig also has a COGROUP feature that allows you to do OUTER JOINs and so on. I think this is where you will spend most of your time with either package - especially when you start out. But from a cursory look it seems both can do pretty much the same.

Shell

Both Hive and Pig have a shell that allows you to query specific things or run the actual queries. Pig also passes on DFS commands such as "cat" to allow you to quickly check what the outcome of a specific PigLatin script was.
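For example, from the two interactive shells a quick look at the file system and the metadata might go like this (paths and table names are hypothetical):

    # at the Pig "grunt>" prompt: inspect HDFS and print a result file
    fs -ls /data
    cat /data/log_counts/part-r-00000

    # at the "hive>" prompt: a similar HDFS listing, plus metadata commands
    dfs -ls /user/hive/warehouse;
    SHOW TABLES;
    DESCRIBE logs;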

Streaming

Once more, both frameworks seem to provide streaming interfaces so that you can process data with external tools or languages, such as Ruby or Python. How the streaming performs, and whether the two differ in that respect, I do not know. This is for you to tell me :)

Web Interface

Only Hive has a web interface or UI that can be used to visualize the various schemas and issue queries. This is different from the above-mentioned Server, as it is an interactive web UI for a human operator. The Hive Server, by contrast, is meant to be used from another programming or scripting language, for example.

JDBC/ODBC

Another Hive-only feature is the availability of a - again, limited-functionality - JDBC/ODBC driver. It is another way for programmers to use Hive without having to bother with its shell or web interface, or even the Hive Server. Since only a subset of features is available, it will require small adjustments on the programmer's side of things, but otherwise it seems like a nice-to-have feature.

Conclusion

Well, it seems to me that both can help you achieve the same goals, while Hive comes more naturally to database developers and Pig to "script kiddies" (just kidding). Hive has more features as far as access choices are concerned. Both projects also reportedly have roughly the same number of committers, and development on each is going strong.

This is it from me. Do you have a different opinion or comment on the above then please feel free to reply below. Over and out!

HIVE Vs HBASE


What They Do

Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows for querying data stored on HDFS for analysis via HQL, an SQL-like language that gets translated to MapReduce jobs. Despite providing SQL functionality, Hive does not provide interactive querying yet - it only runs batch processes on Hadoop.
Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real-time on its database rather than as MapReduce jobs. HBase is partitioned into tables, and tables are further split into column families. Column families, which must be declared in the schema, group together a certain set of columns (columns don't require a schema definition). For example, the "message" column family may include the columns: "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column family, column, and timestamp. A row in HBase is a grouping of key/value mappings identified by the row-key. HBase enjoys Hadoop's infrastructure and scales horizontally using off-the-shelf servers.

Features

Hive can help the SQL savvy to run MapReduce jobs. Since it's JDBC compliant, it also integrates with existing SQL-based tools. Running Hive queries could take a while since they go over all of the data in the table by default. Nonetheless, the amount of data can be limited via Hive's partitioning feature. Partitioning allows running a filter query over data that is stored in separate folders, so that only the data which matches the query is read. It could be used, for example, to only process files created between certain dates, if the files include the date format as part of their name.
HBase works by storing data as key/value. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be deleted every now and then to clear space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, but not for columns, and it includes increment/counter functionality.
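Continuing the "message" example above, the four basic operations look roughly like this in the HBase shell; the table and values are made up.

    # at the hbase shell prompt: create a table with one column family
    create 'messages', 'message'

    # put: add or update individual cells (row-key, family:column, value)
    put 'messages', 'row1', 'message:to', 'bob@example.com'
    put 'messages', 'row1', 'message:subject', 'hello'

    # get: return the cells of a single row
    get 'messages', 'row1'

    # scan: retrieve a range of rows/cells
    scan 'messages'

    # delete: remove a cell, or deleteall for a whole row
    delete 'messages', 'row1', 'message:subject'
    deleteall 'messages', 'row1'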

Limitations

Hive does not currently support update statements. Additionally, since it runs batch processing on Hadoop, it can take minutes or even hours to get back results for queries. Hive must also be provided with a predefined schema to map files and directories into columns and it is not ACID compliant.
HBase queries are written in a custom language that needs to be learned. SQL-like functionality can be achieved via Apache Phoenix, though it comes at the price of maintaining a schema. Furthermore, HBase isn’t fully ACID compliant, although it does support certain properties. Last but not least - in order to run HBase, ZooKeeper is required - a server for distributed coordination such as configuration, maintenance, and naming.

Use Cases

Hive should be used for analytical querying of data collected over a period of time - for instance, to calculate trends or website logs. Hive should not be used for real-time querying since it could take a while before any results are returned.
HBase is perfect for real-time querying of Big Data. Facebook uses it for messaging and real-time analytics. They may even be using it to count Facebook likes.

Summary

Hive and HBase are two different Hadoop based technologies - Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. But hey, why not use them both? Just like Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from Hive to HBase and back again.

Strictly speaking you would not compare the two, yet Hive vs HBase comes up often - commonly because of the SQL-like layer on Hive. HBase is a database, but Hive is never a database.

Hive is a MapReduce-based analysis/summarisation tool running on top of Hadoop. Hive depends on MapReduce (batch processing) + HDFS.

HBase is a database (NoSQL) which is used to store and retrieve data.
To query (scan) HBase, MapReduce is not required - HBase depends only on HDFS, not on MapReduce - so HBase is an online processing system.