Hive Tutorial
Concepts
What Is Hive
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
What Hive Is NOT
Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally high (minutes) even when the data sets involved are very small (say a few hundred megabytes). Hive therefore cannot be compared with systems such as Oracle, where analyses are conducted on significantly smaller amounts of data but proceed much more iteratively, with response times between iterations of less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries and row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs).
In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.
Data Units
In the order of granularity, Hive data is organized into:
- Databases: Namespaces that separate tables and other data units from naming conflicts.
- Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views table, where each row could comprise the following columns (schema):
- timestamp - which is of INT type that corresponds to a unix timestamp of when the page was viewed.
- userid - which is of BIGINT type that identifies the user who viewed the page.
- page_url - which is of STRING type that captures the location of the page.
- referer_url - which is of STRING type that captures the location of the page from where the user arrived at the current page.
- IP - which is of STRING type that captures the IP address from where the page request was made.
- Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy certain criteria; for example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly (see the sketch after this list). Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience, but it is the user's job to guarantee the relationship between partition name and data content. Partition columns are virtual columns; they are not part of the data itself but are derived on load.
- Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. Buckets can be used to efficiently sample the data.
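As a brief sketch of the benefit of partitioning (assuming the page_views data is partitioned by dt and country keys, as in the CREATE TABLE examples later in this tutorial), restricting the partition columns in the WHERE clause lets Hive scan only the matching partition instead of the whole table:

-- Only the (dt='2009-12-23', country='US') partition is read;
-- dt and country are virtual partition columns, not stored data columns.
SELECT page_url, userid
FROM page_views
WHERE dt = '2009-12-23' AND country = 'US';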
Type System
Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.

Primitive Types
- Types are associated with the columns in the tables. The following Primitive types are supported:
- Integers
- TINYINT - 1 byte integer
- SMALLINT - 2 byte integer
- INT - 4 byte integer
- BIGINT - 8 byte integer
- Boolean type
- BOOLEAN - TRUE/FALSE
- Floating point numbers
- FLOAT - single precision
- DOUBLE - double precision
- String type
- STRING - sequence of characters in a specified character set
The types are organized in the following hierarchy, where the parent is a supertype of all its children:

- Type
  - Primitive Type
    - Number
      - DOUBLE
        - FLOAT
          - BIGINT
            - INT
              - SMALLINT
                - TINYINT
    - STRING
    - BOOLEAN

This hierarchy defines how types are implicitly converted in the query language: implicit conversion is allowed from a type to any of its ancestors.
Explicit type conversion can be done using the cast operator, as shown in the Built In Functions section below.
Complex Types
Complex types can be built up from primitive types and other composite types using:
- Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
- Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
- Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation, where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:
- gender - which is a STRING.
- active - which is a BOOLEAN.
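To make the nesting concrete, here is a minimal sketch (the table and column names are hypothetical, not part of the original examples) showing how such fields are declared and accessed:

-- Hypothetical table with struct, map, and array columns.
CREATE TABLE user_profile(
    userid BIGINT,
    name STRUCT<first: STRING, last: STRING>,
    group_ids MAP<STRING, INT>,
    emails ARRAY<STRING>);

-- DOT notation for structs, ['key'] for maps, [n] (zero-based) for arrays.
SELECT name.first, group_ids['admin'], emails[0]
FROM user_profile;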
Built In Operators and Functions
The operators and functions listed below are not necessarily up-to-date. (Hive Operators and UDFs has more current information.) In the CLI, use these commands to show the latest documentation:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
Built In Operators
- Relational Operators - The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds or not.
| Relational Operator | Operand types | Description |
|---|---|---|
| A = B | all primitive types | TRUE if expression A is equivalent to expression B, otherwise FALSE |
| A != B | all primitive types | TRUE if expression A is not equivalent to expression B, otherwise FALSE |
| A < B | all primitive types | TRUE if expression A is less than expression B, otherwise FALSE |
| A <= B | all primitive types | TRUE if expression A is less than or equal to expression B, otherwise FALSE |
| A > B | all primitive types | TRUE if expression A is greater than expression B, otherwise FALSE |
| A >= B | all primitive types | TRUE if expression A is greater than or equal to expression B, otherwise FALSE |
| A IS NULL | all types | TRUE if expression A evaluates to NULL, otherwise FALSE |
| A IS NOT NULL | all types | FALSE if expression A evaluates to NULL, otherwise TRUE |
| A LIKE B | strings | TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To match a literal %, escape it with \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b' |
| A RLIKE B | strings | NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE and so does 'foobar' RLIKE '^f.*r$'. |
| A REGEXP B | strings | Same as RLIKE |
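For example, a brief sketch using the page_view table defined later in this tutorial (the URL patterns are illustrative):

-- LIKE uses SQL wildcards (_ and %); RLIKE uses Java regular expressions.
SELECT page_url
FROM page_view
WHERE page_url LIKE '%.html'
   OR page_url RLIKE '^http://.*\\.example\\.com/.*';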
- Arithmetic Operators - The following operators support various common arithmetic operations on the operands. All of them return number types.
| Arithmetic Operator | Operand types | Description |
|---|---|---|
| A + B | all number types | Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float. |
| A - B | all number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A * B | all number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy. |
| A / B | all number types | Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division. |
| A % B | all number types | Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A & B | all number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A \| B | all number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A ^ B | all number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| ~A | all number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
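As a small sketch of the type-promotion rule described above (using the page_view table's BIGINT userid column; the aliases are illustrative):

-- userid % 32 stays BIGINT (common parent of BIGINT and INT);
-- casting to DOUBLE before dividing avoids the integer quotient behavior.
SELECT userid % 32 AS hash_bucket,
       CAST(userid AS DOUBLE) / 100 AS scaled_id
FROM page_view;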
- Logical Operators - The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.
| Logical Operator | Operand types | Description |
|---|---|---|
| A AND B | boolean | TRUE if both A and B are TRUE, otherwise FALSE |
| A && B | boolean | Same as A AND B |
| A OR B | boolean | TRUE if either A or B or both are TRUE, otherwise FALSE |
| A \|\| B | boolean | Same as A OR B |
| NOT A | boolean | TRUE if A is FALSE, otherwise FALSE |
| !A | boolean | Same as NOT A |
- Operators on Complex Types - The following operators provide mechanisms to access elements in Complex Types
| Operator | Operand types | Description |
|---|---|---|
| A[n] | A is an Array and n is an int | Returns the nth element in the array A. The first element has index 0; e.g., if A is an array comprising ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'. |
| M[key] | M is a Map<K, V> and key has type K | Returns the value corresponding to the key in the map; e.g., if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'. |
| S.x | S is a struct | Returns the x field of S; e.g., for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |
Built In Functions
- Hive supports the following built in functions:
(Function list in source code: FunctionRegistry.java)
| Return Type | Function Name (Signature) | Description |
|---|---|---|
BIGINT | round(double a) | returns the rounded BIGINT value of the double |
BIGINT | floor(double a) | returns the maximum BIGINT value that is equal or less than the double |
BIGINT | ceil(double a) | returns the minimum BIGINT value that is equal or greater than the double |
double | rand(), rand(int seed) | returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic. |
string | concat(string A, string B,...) | returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them. |
string | substr(string A, int start) | returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar' |
string | substr(string A, int start, int length) | returns the substring of A starting from start position with the given length e.g. substr('foobar', 4, 2) results in 'ba' |
string | upper(string A) | returns the string resulting from converting all characters of A to upper case e.g. upper('fOoBaR') results in 'FOOBAR' |
string | ucase(string A) | Same as upper |
string | lower(string A) | returns the string resulting from converting all characters of A to lower case e.g. lower('fOoBaR') results in 'foobar' |
string | lcase(string A) | Same as lower |
string | trim(string A) | returns the string resulting from trimming spaces from both ends of A e.g. trim(' foobar ') results in 'foobar' |
string | ltrim(string A) | returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar ' |
string | rtrim(string A) | returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar' |
string | regexp_replace(string A, string B, string C) | returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo\|ar', '') returns 'fb'. |
int | size(Map<K.V>) | returns the number of elements in the map type |
int | size(Array<T>) | returns the number of elements in the array type |
value of <type> | cast(<expr> as <type>) | converts the result of the expression expr to <type>; e.g., cast('1' as BIGINT) will convert the string '1' to its integral representation. NULL is returned if the conversion does not succeed. |
string | from_unixtime(int unixtime) | converts the number of seconds from the unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00" |
string | to_date(string timestamp) | Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
int | year(string date) | Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
int | month(string date) | Return the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 |
int | day(string date) | Return the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 |
string | get_json_object(string json_string, string path) | Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid |
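To combine a few of the functions above, a small sketch over the page_view table (column names as defined in the CREATE TABLE examples later in this tutorial; the aliases are illustrative):

-- Normalize the URL and turn the unix timestamp into a calendar date.
SELECT lower(trim(page_url)) AS url,
       substr(page_url, 1, 7) AS url_prefix,
       to_date(from_unixtime(viewTime)) AS view_date
FROM page_view;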
- The following built in aggregate functions are supported in Hive:
| Return Type | Aggregation Function Name (Signature) | Description |
|---|---|---|
BIGINT | count(*), count(expr), count(DISTINCT expr[, expr...]) | count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr...]) returns the number of rows for which the supplied expression(s) are unique and non-NULL. |
DOUBLE | sum(col), sum(DISTINCT col) | returns the sum of the elements in the group or the sum of the distinct values of the column in the group |
DOUBLE | avg(col), avg(DISTINCT col) | returns the average of the elements in the group or the average of the distinct values of the column in the group |
DOUBLE | min(col) | returns the minimum value of the column in the group |
DOUBLE | max(col) | returns the maximum value of the column in the group |
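For example, a sketch combining these aggregates with grouping on the page_view table (the partition value and aliases are illustrative):

-- Views, distinct viewers, and latest view time per page for one day's partition.
SELECT page_url,
       count(*) AS views,
       count(DISTINCT userid) AS unique_users,
       max(viewTime) AS last_view_time
FROM page_view
WHERE dt = '2008-06-08'
GROUP BY page_url;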
Language Capabilities
Hive query language provides the basic SQL-like operations. These operations work on tables or partitions. These operations are:
- Ability to filter rows from a table using a where clause.
- Ability to select certain columns from the table using a select clause.
- Ability to do equi-joins between two tables.
- Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
- Ability to store the results of a query into another table.
- Ability to download the contents of a table to a local (e.g., nfs) directory.
- Ability to store the results of a query in a hadoop dfs directory.
- Ability to manage tables and partitions (create, drop and alter).
- Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.
Usage and Examples
The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results.

Creating, Showing, Altering, and Dropping Tables
See Hive Data Definition Language for detailed information about creating, showing, altering, and dropping tables.

Creating Tables
An example statement that would create the page_view table mentioned above is:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
The field delimiter can be parametrized if the data is not in the above format as illustrated in the following example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;
It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
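Since the table above is clustered into 32 buckets by userid, sampling queries can read just a subset of buckets. A sketch (the bucket number chosen here is arbitrary):

-- Reads roughly 1/32 of the data: only bucket 3 of the 32 userid buckets.
SELECT *
FROM page_view TABLESAMPLE(BUCKET 3 OUT OF 32 ON userid) s;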
Columns with complex types can also be declared, as in the following variant that adds a friends array and a properties map:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
Table names and column names are case insensitive.
Browsing Tables and Partitions
To list all existing tables:

SHOW TABLES;

To list tables whose names match a pattern:

SHOW TABLES 'page.*';

To list the partitions of a table:

SHOW PARTITIONS page_view;

To list the columns and column types of a table:

DESCRIBE page_view;

To list the columns along with other table properties:

DESCRIBE EXTENDED page_view;

To show extended information for a particular partition:

DESCRIBE EXTENDED page_view PARTITION (dt='2008-08-08');
Altering Tables
To rename an existing table to a new name (if a table with the new name already exists, an error is returned):

ALTER TABLE old_table_name RENAME TO new_table_name;
To replace the columns of an existing table:

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:
ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
Note that a schema change preserves the old schema for existing partitions, and queries over those partitions return NULL (or the specified default) for columns they lack. In later versions, the behavior of assuming certain values, as opposed to throwing an error when the column is not found in a particular partition, can be made configurable.
Dropping Tables and Partitions
Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future feature) that would have been built on the table. The associated command is:

DROP TABLE pv_users;

To drop a partition:
ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08');
Note that any data for this table or its partitions will be dropped and may not be recoverable.
Loading Data
There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert them into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.
Additionally, the system also supports syntax that can load the data from a file in the local file system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS - the following syntax can be used to load the data into a Hive table:
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
See Hive Data Manipulation Language for more information about loading data into Hive tables.