Friday, 19 December 2014

SQOOP & FLUME

 SQOOP

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop imports data from external structured datastores into HDFS or related systems such as Hive and HBase. Sqoop can also be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, PostgreSQL, and HSQLDB.

What Sqoop Does

Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:
  • Allows data imports from external datastores and enterprise data warehouses into Hadoop (see the sketch after this list)
  • Parallelizes data transfer for fast performance and optimal system utilization
  • Copies data quickly from external systems to Hadoop
  • Makes data analysis more efficient
  • Mitigates excessive loads on external systems
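
These transfers can also be driven programmatically. The sketch below is a minimal, illustrative Java example that invokes Sqoop 1's tools through Sqoop.runTool; the MySQL connection string, credentials, table names and HDFS paths are hypothetical placeholders, not part of any real deployment.

import org.apache.sqoop.Sqoop;

public class SqoopTransferSketch {
    public static void main(String[] args) {
        // Import a relational table into HDFS, splitting the work across 4 map tasks.
        // All connection details below are hypothetical placeholders.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl_user",
            "--password", "etl_password",
            "--table", "orders",
            "--target-dir", "/data/raw/orders",
            "--num-mappers", "4"
        };
        int importStatus = Sqoop.runTool(importArgs);

        // Export processed results from HDFS back into a relational table.
        String[] exportArgs = {
            "export",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl_user",
            "--password", "etl_password",
            "--table", "order_summary",
            "--export-dir", "/data/processed/order_summary"
        };
        int exportStatus = Sqoop.runTool(exportArgs);

        System.exit(importStatus != 0 ? importStatus : exportStatus);
    }
}

Each argument maps one-to-one to the options of the equivalent sqoop import and sqoop export commands described in the Sqoop User Guide linked below.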

How Sqoop Works

Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
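
As an illustration of the connector mechanism, the hedged sketch below imports from a database for which no specific connector is bundled; supplying --driver makes Sqoop fall back to its generic JDBC code path, whereas a recognised JDBC URL (for example jdbc:mysql://...) would select the bundled MySQL connector automatically. The vendor URL, driver class, credentials and table name are hypothetical.

import org.apache.sqoop.Sqoop;

public class GenericJdbcSketch {
    public static void main(String[] args) {
        // Hypothetical database with no bundled Sqoop connector:
        // naming the JDBC driver class explicitly routes the job through
        // Sqoop's generic JDBC connection manager.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:somevendor://warehouse.example.com/inventory_db",
            "--driver", "com.somevendor.jdbc.Driver",
            "--username", "report_user",
            "--password", "report_password",
            "--table", "inventory",
            "--target-dir", "/data/raw/inventory"
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}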

Refer to the links below for complete details on Sqoop and its commands.

 Sqoop User Guide

Hortonworks Sqoop example


FLUME

Apache™ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.

What Flume Does

Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:
  • Stream data from multiple sources into Hadoop for analysis
  • Collect high-volume Web logs in real time
  • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
  • Guarantee data delivery
  • Scale horizontally to handle additional data volume

How Flume Works

Flume’s high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend. The project team has designed Flume with the following components:
  • Event – a singular unit of data that is transported by Flume (typically a single log entry)
  • Source – the entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
  • Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
  • Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
  • Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
  • Client – produces and transmits the Event to the Source operating within the Agent
A flow in Flume starts from the Client. The Client transmits the event to a Source operating within the Agent. The Source receiving this event then delivers it to one or more Channels. These Channels are drained by one or more Sinks operating within the same Agent. Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange. When spikes in client-side activity cause data to be generated faster than the provisioned capacity on the destination can handle, the channel absorbs the backlog, which allows sources to continue normal operation for the duration of the spike. Flume agents can be chained together by connecting the sink of one agent to the source of another agent, which enables the creation of complex dataflow topologies.
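
To make the Client-to-Source hop concrete, here is a minimal Java sketch using Flume's client SDK (RpcClient). It assumes an agent whose Avro source is listening on localhost:41414; the host, port and event body are placeholders.

import java.nio.charset.Charset;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an agent whose Avro source is assumed to listen on localhost:41414.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // One Event, typically a single log entry, handed to the agent's source.
            Event event = EventBuilder.withBody("sample log line", Charset.forName("UTF-8"));
            // append() returns once the source has committed the event to its channel,
            // from where a sink will later drain it (for example, to HDFS).
            client.append(event);
        } finally {
            client.close();
        }
    }
}

On the agent side, the source, channel and sink that receive this event are declared in the agent's configuration file, as covered in the Flume User Guide linked below.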

Reliability & Scaling

Flume is designed to be highly reliable, so that no data is lost during normal operation. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for Flume agents. Flume is architected to be fully distributed with no central coordination point. Each agent runs independently of the others, with no inherent single point of failure. Flume also features built-in support for load balancing and failover. Flume’s fully decentralized architecture also plays a key role in its ability to scale: since each agent runs independently, Flume can be scaled horizontally with ease.
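
Failover is normally configured on the agent side, but the same idea also surfaces in the client SDK. The sketch below is based on the failover RpcClient described in the Flume Developer Guide: it sends events to a primary agent and falls over to a backup if the primary is unreachable. The agent host names and ports are placeholders.

import java.nio.charset.Charset;
import java.util.Properties;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FailoverClientSketch {
    public static void main(String[] args) throws Exception {
        // Two agents, each assumed to expose an Avro source on port 41414.
        Properties props = new Properties();
        props.setProperty("client.type", "default_failover");
        props.setProperty("hosts", "h1 h2");
        props.setProperty("hosts.h1", "agent1.example.com:41414");
        props.setProperty("hosts.h2", "agent2.example.com:41414");
        props.setProperty("max-attempts", "2");  // try the backup if the primary is down

        RpcClient client = RpcClientFactory.getInstance(props);
        try {
            Event event = EventBuilder.withBody("event delivered with failover", Charset.forName("UTF-8"));
            client.append(event);
        } finally {
            client.close();
        }
    }
}

Within an agent, the equivalent behaviour is configured with sink groups and a failover or load_balance sink processor in the agent's configuration file.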

For more details on Flume, refer to the links below.

Flume User Guide

Hortonworks example on Flume

Hortonworks Example 2
