HIVE INTERVIEW QUESTIONS
1. What is Hive Metastore?
Ans: The Hive metastore is a database that stores metadata about your Hive tables (e.g.
table name, column names and types, table location, storage handler being used,
number of buckets in the table, sorting columns if any, partition columns if
any, etc.). When you create a table, the metastore is updated with the
information about the new table, and that information is looked up whenever you
issue queries on that table.
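As an illustration (assuming a table such as the tbl_employee used later in this post), DESCRIBE FORMATTED shows the information that comes from the metastore:
DESCRIBE FORMATTED tbl_employee;
-- lists the columns and types, table location, input/output formats, SerDe,
-- number of buckets, partition columns, etc., all read from the metastore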
2. Whenever I run a Hive query from a different directory, it creates a new metastore_db. What is the reason for this?
Ans: Whenever you run Hive in embedded mode, it creates a local metastore in the current
working directory, after first checking whether a metastore already exists there. The
location is controlled by the property "javax.jdo.option.ConnectionURL" in the
configuration file hive-site.xml, whose default value is
"jdbc:derby:;databaseName=metastore_db;create=true". To change this behavior, set the
database name to an absolute path, so that the same metastore is used from every
location.
3. Is it possible for multiple users to share the same metastore when Hive runs in embedded mode?
Ans: No, the embedded (Derby) metastore cannot be shared. It is recommended to
use a standalone "real" database like MySQL or PostgreSQL.
4. Are multiline comments supported in Hive scripts?
Ans: No.
5. If you run Hive as a server, what are the available mechanisms for connecting to it from an application?
Ans: You can connect to the Hive server in the following ways:
1. Thrift Client: Using Thrift you can call Hive commands from various programming languages, e.g. C++, Java, PHP, Python and Ruby.
2. JDBC Driver: Hive supports a Type 4 (pure Java) JDBC driver.
3. ODBC Driver: Hive supports the ODBC protocol.
6. What is a SerDe in Apache Hive?
Ans: SerDe is short for Serializer/Deserializer. Hive uses a SerDe (together with a
FileFormat) to read and write table data. An important concept behind
Hive is that it DOES NOT own the Hadoop File System (HDFS) format that the data is
stored in. Users are able to write files to HDFS with whatever tools/mechanism
they prefer and attach them to Hive ("CREATE EXTERNAL TABLE" or "LOAD DATA
INPATH"), and Hive then uses the SerDe to correctly "parse" that file format. A SerDe
is therefore a powerful (and customizable) mechanism that Hive uses to "parse" data
stored in HDFS so that it can be queried by Hive.
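As a hedged sketch of how a SerDe is declared (the table name, columns, regex and HDFS location are hypothetical; the built-in RegexSerDe is used purely as an illustration):
CREATE EXTERNAL TABLE access_logs (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE
LOCATION '/user/training/access_logs';
-- the SerDe parses each line into the two columns using the regex capture groups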
7. Which classes are used by Hive to read and write HDFS files?
Ans: The following classes are used by Hive to read and write HDFS files:
•TextInputFormat/HiveIgnoreKeyTextOutputFormat: these two classes read/write data in plain text file format.
•SequenceFileInputFormat/SequenceFileOutputFormat: these two classes read/write data in Hadoop SequenceFile format.
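These classes are selected implicitly by the STORED AS clause, as in this small sketch (table names are hypothetical):
CREATE TABLE wiki_text (pagename STRING) STORED AS TEXTFILE;
-- reads/writes via TextInputFormat / HiveIgnoreKeyTextOutputFormat
CREATE TABLE wiki_seq (pagename STRING) STORED AS SEQUENCEFILE;
-- reads/writes via SequenceFileInputFormat / SequenceFileOutputFormat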
8. Give examples of the SerDe classes which Hive uses to serialize and deserialize data.
Ans: Hive currently uses these SerDe classes to serialize and deserialize data:
• MetadataTypedColumnsetSerDe: used to read/write delimited records like CSV, tab-separated or Control-A-separated records (quoting is not supported yet).
• ThriftSerDe: used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.
• DynamicSerDe: also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports a number of different protocols, including TBinaryProtocol, TJSONProtocol and TCTLSeparatedProtocol (which writes data as delimited records).
9. How do you write your own custom SerDe?
Ans:
• In most cases, users want to write a Deserializer instead of a full SerDe, because they only need to read their own data format, not write it.
• For example, the RegexDeserializer deserializes data using the configuration parameter 'regex', and possibly a list of column names.
• If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch, because the framework passes DDL to the SerDe in "thrift DDL" format, and it is non-trivial to write a "thrift DDL" parser.
10. What is the ObjectInspector functionality?
Ans: Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
• An instance of a Java class (Thrift or native Java).
• A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map).
• A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field).
A complex object can be represented by a pair of an ObjectInspector and a Java object. The ObjectInspector not only tells us the structure of the object, but also gives us ways to access its internal fields.
11. What is the functionality of the Query Processor in Apache Hive?
Ans: This component implements the processing framework for converting SQL into a graph of map/reduce jobs, and the execution-time framework to run those jobs in the order of their dependencies.
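As an illustration, EXPLAIN shows the plan the query processor produces (reusing the tbl_employee example from the next section):
EXPLAIN SELECT employee_name FROM tbl_employee WHERE salary > 100;
-- prints the stage graph: the map/reduce stages and the dependencies between them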
What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
Hive was originally developed at Facebook. It's now a Hadoop subproject with many contributors. Users need to concentrate only on the top-level Hive language rather than on Java MapReduce programs. One of the main advantages of Hive is its SQL-like nature, which greatly improves usability.
A Hive program is automatically compiled into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries.
Hive example:
Selecting the employee names whose salary is more than 100 dollars from a Hive table called tbl_employee:
SELECT employee_name FROM tbl_employee WHERE salary > 100;
Users are excited to use Hive since it is very similar to SQL.
What are the types of tables in Hive?
There are two types of tables.
1. Managed tables.
2. External tables.
Only the DROP TABLE command differentiates managed and external tables: dropping a managed table deletes both the data and the metadata, while dropping an external table removes only the metadata. Otherwise, both types of tables are very similar, as the sketch below shows.
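A minimal sketch (table names and the HDFS location are hypothetical):
CREATE TABLE managed_logs (line STRING);
-- DROP TABLE managed_logs; removes the metadata AND deletes the data in the warehouse directory
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/user/training/external_logs';
-- DROP TABLE external_logs; removes only the metadata; the files under LOCATION are left in place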
Does Hive support record-level insert, delete or update?
Hive does not provide record-level update, insert, or delete, and hence it does not provide transactions either. However, users can use CASE statements and the built-in functions of Hive to achieve the effect of these DML operations, as sketched below. Thus, a complex update query in an RDBMS may need many lines of code in Hive.
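For example, a hedged sketch of emulating an UPDATE by rewriting the table with INSERT OVERWRITE (the tbl_employee table and the 10% raise condition are hypothetical):
INSERT OVERWRITE TABLE tbl_employee
SELECT employee_name,
       CASE WHEN employee_name = 'John' THEN salary * 1.1 ELSE salary END AS salary
FROM tbl_employee;
-- every row is rewritten; only the rows matching the CASE condition get the "updated" value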
What kind of data warehouse application is suitable for Hive?
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where
1) Relatively static data is analyzed,
2) Fast response times are not required, and
3) The data is not changing rapidly.
Hive doesn't provide the crucial features required for OLTP (Online Transaction Processing). It's closer to being an OLAP tool (Online Analytical Processing). So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
How can the columns of a table in Hive be written to a file?
By using the awk command in the shell, the output of a HiveQL DESCRIBE can be written to a file:
hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output
CONCAT function in Hive with Example?
The CONCAT function concatenates the input strings. You can specify any number of strings, separated by commas.
Example:
CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');
Output:
Hive-performs-good-in-Hadoop
Notice that you have to repeat the '-' delimiter between every pair of strings. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first.
CONCAT_WS('-','Hive','performs','good','in','Hadoop');
Output: Hive-performs-good-in-Hadoop
REPEAT function in Hive with example?
The REPEAT function repeats the input string the number of times specified in the call.
Example:
REPEAT('Hadoop',3);
Output:
HadoopHadoopHadoop
Note: You can include a space in the input string as well.
TRIM function in Hive with example?
The TRIM function removes the leading and trailing spaces from a string.
Example:
TRIM(' Hadoop ');
Output:
Hadoop
Note: If you want to remove only leading or trailing spaces, you can use the commands below respectively.
LTRIM(' Hadoop');
RTRIM('Hadoop ');
REVERSE function in Hive with example?
The REVERSE function reverses the characters in a string.
Example:
REVERSE('Hadoop');
Output:
poodaH
LOWER or LCASE function in Hive with example?
The LOWER (or LCASE) function converts the input string to lower-case characters.
Example:
LOWER('Hadoop');
LCASE('Hadoop');
Output:
hadoop
Note:
If the characters are already in lower case then they will be preserved.
UPPER or UCASE function in Hive with example?
The UPPER (or UCASE) function converts the input string to upper-case characters.
Example:
UPPER('Hadoop');
UCASE('Hadoop');
Output:
HADOOP
Note:
If the characters are already in upper case then they will be preserved.
Double type in Hive – Important points?
It is important to know about the double type in Hive. The double type in Hive may present data differently from what you are used to in an RDBMS.
See the double type data below:
24624.0
32556.0
3.99893E5
4366.0
E5 represents 10^5 here, so the value 3.99893E5 represents 399893. All calculations are performed in double precision. The maximum value of an IEEE 754 double is about 1.8E308.
Care is needed when exporting double-type data to an RDBMS, since the type may be interpreted incorrectly. So it is advised to cast the double type to an appropriate type before exporting.
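A minimal sketch of such a cast (the table and column names are hypothetical, and DECIMAL is just one possible target type):
SELECT employee_name, CAST(salary AS DECIMAL(12,2)) AS salary FROM tbl_employee;
-- the DECIMAL value avoids the E-notation of double when the data is moved to an RDBMS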
Rename a table in Hive – How to do it?
Using the ALTER TABLE command, we can rename a table in Hive.
ALTER TABLE hive_table_name RENAME TO new_name;
There is another way to rename a table in Hive. Sometimes ALTER may take more time if the underlying table has many partitions. In that case, the export and import options can be utilized: you save the Hive data into HDFS and import it back into a new table, like below.
EXPORT TABLE tbl_name TO 'HDFS_location';
IMPORT TABLE new_tbl_name FROM 'HDFS_location';
If you prefer to just preserve the data, you can create a new table from the old table like below.
CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name;
DROP TABLE old_tbl_name;
How to change a column data type in Hive?
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: to change the data type of the id column from int to bigint in a table called employee:
ALTER TABLE employee CHANGE id id BIGINT;
Difference between ORDER BY and SORT BY in Hive?
SORT BY will sort the data within each reducer. You can use any number of reducers for a SORT BY operation.
ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer.
ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give partially ordered final results.
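A short illustration, assuming the tbl_employee table used earlier:
SELECT employee_name, salary FROM tbl_employee ORDER BY salary DESC;
-- total order over the whole result set, funneled through a single reducer
SELECT employee_name, salary FROM tbl_employee SORT BY salary DESC;
-- each reducer's output is sorted, but the overall result may be only partially ordered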
RLIKE in Hive?
RLIKE is a relational operator in Hive (equivalent to REGEXP): A RLIKE B evaluates to true if any substring of A matches the Java regular expression B. Users don't need to put a % symbol for a simple substring match, as they would with LIKE.
Examples:
'Express' RLIKE 'Exp' --> True
'Express' RLIKE '^E.*' --> True (regular expression)
Moreover, RLIKE comes in handy when the string has extra spaces. Without using the TRIM function, RLIKE still satisfies the required scenario. Suppose A has the value 'Express  ' (with two extra trailing spaces) and B has the value 'Express'; RLIKE will match without needing TRIM.
'Express  ' RLIKE 'Express' --> True
Note:
RLIKE evaluates to NULL if A or B is NULL.
Difference between partitioning and Bucketing?
What is Schema on Read and Schema on Write?
Schema-on-Read vs. Schema-on-Write:
In traditional databases, the table's schema is imposed at data load time; if the data being loaded does not conform to the schema, the load is rejected. This process is known as Schema-on-Write: the data is checked against the schema when it is written into the database (during the data load).
In Hive, by contrast, the data schema is not verified during load; it is verified while processing the query. Hence this process in Hive is called Schema-on-Read.
Now, which way is better? Schema-on-Read or Schema-on-Write?
Schema-on-Read:
Schema-on-Read allows a very fast initial data load, since the data does not have to be read, parsed or serialized into any internal database format; the load is just a copy or move of a file.
This way of handling data is also more flexible in the case of huge data volumes, or when two different schemas are needed over the same underlying data.
Schema-on-Write:
Schema-on-Write gives faster query performance, because the data is already stored in a particular format and it is easy to locate a column index or compress the data. However, it takes longer to load data into the database.
So, for large data loads, or when the schema is not known at load time and there are no indexes to apply because the queries have not yet been formulated, Schema-on-Read (the Hive approach) is more efficient than Schema-on-Write.
What performance optimizations can be done in Hive?
1. Tez execution engine
2. Partitioning
3. Bucketing
4. File formats like ORC and Parquet
5. Vectorization
6. Hive indexing
7. Cost-based optimization
When do you go for internal tables and when for external tables in Hive?
Can we do bucketing without partitioning a table?
Yes, we can do bucketing without partitioning, as sketched below.
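A minimal sketch (hypothetical table and column names) of a bucketed table that has no partitions:
CREATE TABLE emp_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
-- rows are hashed on id into 4 bucket files, with no partition directories involved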
What is Vectorization in Hive?
Difference between rank(), dense_rank() and row_number() in Hive?
What are CLUSTER BY and DISTRIBUTE BY?
DISTRIBUTE BY simply ensures that rows with the same key go to the same reducer, but it does not do any sorting.
CLUSTER BY is a shortcut for DISTRIBUTE BY followed by SORT BY on the same column. So CLUSTER BY ensures that rows with the same key go to the same reducer and that, within each reducer, the data is sorted.
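A short sketch, again assuming the tbl_employee table (distributing and sorting on employee_name):
SELECT employee_name, salary FROM tbl_employee DISTRIBUTE BY employee_name SORT BY employee_name;
-- equivalent shorthand:
SELECT employee_name, salary FROM tbl_employee CLUSTER BY employee_name;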
How is metadata maintained for Parquet files?
SOME BASIC HIVE QUERIES
DROP TABLE IF EXISTS wiki;
CREATE EXTERNAL TABLE wiki(projectname STRING, pagename STRING, pageview INT, pagesize INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/training/wikidata';

DROP TABLE IF EXISTS wiki1;
CREATE TABLE wiki1(pagename STRING, sum INT);

INSERT OVERWRITE TABLE wiki1
SELECT pagename, sum(pageview)
FROM wiki
WHERE projectname='en'
GROUP BY pagename;

SELECT * FROM wiki1 ORDER BY sum DESC LIMIT 20;
////////////////WORD COUNT PROBLEM
DROP DATABASE IF EXISTS documents CASCADE;
CREATE DATABASE documents;
USE documents;
CREATE TABLE docs(words STRING);
LOAD DATA LOCAL INPATH '/home/training/training_material/data/shakespeare' INTO TABLE docs;
CREATE TABLE word_count AS
SELECT word, count(*) AS count
FROM (SELECT explode(split(words, '\\W+')) AS word FROM docs) w
GROUP BY word;
SELECT * FROM word_count LIMIT 100;
///////////CREATING EXTERNAL TABLES
DROP TABLE IF EXISTS location_table;
CREATE EXTERNAL TABLE location_table(location_code STRING, location_name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/home/training/training_material/data/ReduceAndMapSideJoin/us_states.csv';
////////////////USING PARTITIONING
DROP TABLE IF EXISTS wikisample;
CREATE TABLE wikisample (
project_name STRING COMMENT 'Represents project name of the record',
page_name STRING COMMENT 'Represents page name of the record',
page_count INT COMMENT 'Number of times the page has been visited',
page_size INT COMMENT 'Size of the page')
PARTITIONED BY (projectname STRING);
LOAD DATA LOCAL INPATH '/home/training/training_material/data/wikisample'
INTO TABLE wikisample PARTITION (projectname='en');
///////////////DYNAMIC PARTITIONING INSERTS
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
DROP TABLE IF EXISTS wikisample;
CREATE TABLE wikisample (
project_name STRING COMMENT 'Represents project name of the record',
page_name STRING COMMENT 'Represents page name of the record',
page_count INT COMMENT 'Number of times the page has been visited',
page_size INT COMMENT 'Size of the page');
LOAD DATA LOCAL INPATH '/home/training/training_material/data/wikisample'
INTO TABLE wikisample;
DROP TABLE IF EXISTS wiki;
CREATE TABLE wiki(
project_name STRING COMMENT 'Represents project name of the record',
page_name STRING COMMENT 'Represents page name of the record',
page_count INT COMMENT 'Number of times the page has been visited'
)
PARTITIONED BY (projectname STRING);
INSERT INTO TABLE wiki
PARTITION (projectname)
SELECT w.project_name, w.page_name, w.page_count, w.project_name
FROM wikisample w;
-- the last SELECT column supplies the value for the dynamic partition column projectname
Hive vs Pig
Feature                        | Hive              | Pig
Language                       | SQL-like          | Pig Latin
Schemas/Types                  | Yes (explicit)    | Yes (implicit)
Partitions                     | Yes               | No
Server                         | Optional (Thrift) | No
User Defined Functions (UDF)   | Yes (Java)        | Yes (Java)
Custom Serializer/Deserializer | Yes               | Yes
DFS Direct Access              | Yes (implicit)    | Yes (explicit)
Join/Order/Sort                | Yes               | Yes
Shell                          | Yes               | Yes
Streaming                      | Yes               | Yes
Web Interface                  | Yes               | No
JDBC/ODBC                      | Yes (limited)     | No
Apache Pig and Hive are two
projects that layer on top of Hadoop, and provide a higher-level language for
using Hadoop's MapReduce library. Apache Pig provides a scripting language for
describing operations like reading, filtering, transforming, joining, and
writing data -- exactly the operations that MapReduce was originally designed
for. Rather than expressing these operations in thousands
of lines of Java code that uses MapReduce directly, Pig lets users express them
in a language not unlike a bash or perl script. Pig is excellent for
prototyping and rapidly developing MapReduce-based jobs, as opposed to coding
MapReduce jobs in Java itself.
If Pig is "Scripting for
Hadoop", then Hive is "SQL queries for Hadoop".
Apache Hive offers an even more specific and higher-level language, for
querying data by running Hadoop jobs, rather than directly scripting
step-by-step the operation of several MapReduce jobs on Hadoop. The language
is, by design, extremely SQL-like. Hive is still intended as a tool for long-running
batch-oriented queries over massive data; it's not "real-time" in any
sense. Hive is an excellent tool for analysts and business development types
who are accustomed to SQL-like queries and Business Intelligence systems; it
will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc
queries or generate report data across data stored in storage systems mentioned
above.
Thanks for reading... Please check the other Spark questions...