Introduction to Hive | Apache Hive

Hive was initially developed by Facebook in 2007 to help the company handle massive amounts of new data. At the time Hive was created, Facebook had a 15TB dataset they needed to work with. A few short years later, that data had grown to 700TB. Their RDBMS data warehouse was taking too long to process daily jobs, so the company decided to move their data into Hadoop, the scalable open-source framework.

The company found that creating MapReduce programs was not easy and was time consuming for many users.

When they created Hive, their vision was to bring familiar database concepts to Hadoop, making it easier for all users to work with.

In 2008 Hive was open sourced. Facebook has since used Hive for reporting dashboards and ad-hoc analysis.

The motivation behind Hive is to let users with different backgrounds and roles, including those who are not familiar with coding, leverage the power of Hadoop by analyzing data with SQL.

What is Hive?

Hive is a data warehouse system built on top of Hadoop. Hive facilitates easy data summarization, ad-hoc queries, and the analysis of very large datasets that are stored in Hadoop.

Hive provides a SQL-like interface, better known as HiveQL or HQL for short, which allows for easy querying of data in Hadoop. HQL has its own Data Definition Language (DDL) and Data Manipulation Language (DML), which are very similar to the DDL and DML many of us already have experience with.

In Hive, the HQL queries are implicitly translated into one or more MapReduce jobs, shielding the user from much more advanced and time consuming programming.
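To make the translation more concrete, here is a minimal Python sketch of how a grouped aggregate such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could map onto map, shuffle, and reduce phases. It is a conceptual illustration only, not the plan Hive actually generates; the `employees` rows and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an `employees` table: (name, dept).
ROWS = [
    ("ann", "sales"),
    ("bob", "eng"),
    ("cat", "eng"),
    ("dan", "sales"),
    ("eve", "ops"),
]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, as a mapper might for COUNT(*)."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to produce the final counts."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_dept(rows):
    """Chain the three phases, mimicking a single MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

A user writing HQL only states the query; the three phases above are the kind of work that would otherwise have to be written by hand.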

Hive provides a mechanism to project structure (like tables and partitions) onto the data in Hadoop and uses a metastore to map file structure to tabular form.

Hive Architecture

The major components of Apache Hive are:

  • Hive Client
  • Hive Metastore
  • Hive Driver

Hive Client

There are a variety of different ways that you can interface with Hive. You can use a web browser to access Hive via the Hive Web Interface.

You could also access Hive using an application over JDBC, ODBC, or the Thrift API, each made possible by Hive’s Thrift Server referred to as HiveServer.

HiveServer2 was released in Hive 0.11 as a replacement for HiveServer1, though in releases that ship both you can still choose which HiveServer to run, or even run them concurrently. HiveServer2 brings many enhancements, including the ability to handle concurrent clients.

Hive also comes with some powerful command-line interfaces (often referred to as the “CLI”). The introduction of HiveServer2 brings with it a new Hive CLI called Beeline, which can be run in embedded mode or thin-client mode. In thin-client mode, the Beeline CLI connects to Hive via JDBC and HiveServer2. The original CLI is also included with Hive and can run in embedded mode or as a client to HiveServer1.
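As a rough illustration of the two Beeline modes, the Python sketch below builds the JDBC URL and the command line you might use to start Beeline. The `-u` (JDBC URL) and `-n` (user name) options and the `jdbc:hive2://` scheme are Beeline's own, and 10000 is the usual HiveServer2 port; the host name, user name, and helper function names are placeholders invented for this example.

```python
def hive2_jdbc_url(host=None, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL.

    With no host, return the bare scheme, which Beeline treats as
    embedded mode; otherwise return a thin-client URL.
    """
    if host is None:
        return "jdbc:hive2://"
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(url, user=None):
    """Return the argument list for launching Beeline against `url`."""
    args = ["beeline", "-u", url]
    if user is not None:
        args += ["-n", user]
    return args


if __name__ == "__main__":
    # Thin-client mode against a hypothetical HiveServer2 host.
    url = hive2_jdbc_url("hive.example.com")
    print(" ".join(beeline_command(url, user="analyst")))
    # Embedded mode: no host, Hive runs inside the Beeline process.
    print(" ".join(beeline_command(hive2_jdbc_url())))
```

The resulting list can be passed to `subprocess.run` on a machine where Beeline is installed.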

Hive Metastore

Hive comes with a catalog known as the Metastore. The Metastore stores the system catalog and metadata about tables, columns, partitions, and so on. The metastore makes mapping file structure to a tabular form possible in Hive.

The Hive metastore is simply a relational database that stores the metadata for Hive tables along with other Hive metadata. By default, Hive comes with a built-in Derby database, but you can configure Hive to keep the metastore in other relational databases such as MySQL or Oracle.
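The idea of a relational catalog that maps raw files to tables can be sketched in a few lines of Python. The example below uses an in-memory SQLite database as a toy metastore and applies its schema to delimited text at read time. The `tbls` and `cols` tables, the `page_views` table, and the `read_table` helper are simplifications invented for this sketch; they are not Hive’s real metastore schema or API.

```python
import csv
import io
import sqlite3

# Toy metastore: a relational database holding table and column metadata.
meta = sqlite3.connect(":memory:")
meta.executescript("""
    CREATE TABLE tbls (tbl_name TEXT PRIMARY KEY, location TEXT, field_delim TEXT);
    CREATE TABLE cols (tbl_name TEXT, idx INTEGER, col_name TEXT, col_type TEXT);
""")
meta.execute("INSERT INTO tbls VALUES ('page_views', '/warehouse/page_views', ',')")
meta.executemany(
    "INSERT INTO cols VALUES ('page_views', ?, ?, ?)",
    [(0, "user_id", "int"), (1, "url", "string")],
)

# Stand-in for a raw file stored in HDFS at the table's location.
RAW_FILE = "1,/home\n2,/about\n"

# How each declared column type is parsed from text.
CASTS = {"int": int, "string": str}


def read_table(name, raw_text):
    """Apply the metastore's schema to raw delimited text (schema on read)."""
    (delim,) = meta.execute(
        "SELECT field_delim FROM tbls WHERE tbl_name = ?", (name,)
    ).fetchone()
    columns = meta.execute(
        "SELECT col_name, col_type FROM cols WHERE tbl_name = ? ORDER BY idx",
        (name,),
    ).fetchall()
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter=delim):
        rows.append(
            {col: CASTS[typ](val) for (col, typ), val in zip(columns, fields)}
        )
    return rows


if __name__ == "__main__":
    print(read_table("page_views", RAW_FILE))
```

The raw file itself carries no column names or types; everything tabular about it comes from the catalog, which is the role the metastore plays for Hive.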

An important component of Hive is called HCatalog. HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command-line interface for issuing data definition and metadata exploration commands.

Essentially, HCatalog makes it easier for users of Pig, MapReduce, and Hive to read and write data on the grid. Put simply, HCatalog lets users work with data in HDFS without thinking about how or in which format the data is stored. You interact with the data as if it were purely relational, the same as you would in MySQL, Oracle, or any other relational database.

Hive Driver

The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a set of Hadoop jobs. The Driver manages the lifecycle of a HiveQL statement as it moves through Hive.

The Driver also maintains a session handle and any session statistics. The Query Compiler compiles HiveQL queries into a DAG of MapReduce tasks.

The Execution Engine executes the tasks produced by the compiler in proper dependency order. The Execution Engine interacts with the underlying Hadoop instance, working with the NameNode, JobTracker, and so on.
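Running tasks "in proper dependency order" amounts to a topological traversal of the DAG. The following Python sketch shows the idea using the standard library's `graphlib` module (Python 3.9 or later). The task names and the `run_plan` helper are hypothetical and do not reflect Hive's internal classes.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def run_plan(plan, run_task):
    """Run every task only after all of its dependencies have finished."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    executed = []
    while sorter.is_active():
        # Tasks returned together are independent of each other and could
        # run in parallel; sorting just makes this sketch deterministic.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            executed.append(task)
            sorter.done(task)
    return executed


if __name__ == "__main__":
    run_plan(PLAN, lambda task: print("running", task))
```

In this plan the two scans have no dependencies, so they become ready first; the join, aggregation, and write then follow one after another.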

Hive vs. RDBMS

| Hive | RDBMS |
| --- | --- |
| Based on the notion of write once, read many times. | Designed for reading and writing many times. |
| Handles data sizes in the petabyte range. | Data size typically tops out in the terabyte range. |
| Does not support OLTP (Online Transaction Processing) but does support OLAP (Online Analytical Processing). | Primarily designed for OLTP. |
| Suited to static (non-real-time) data analysis, for example text files. | Best suited to dynamic (real-time) data analysis, for example data from sensors and web feeds. |
| Record-level updates are not possible in Hive’s original design. | Record-level updates, insertions, deletes, and transactions are possible. |
| Scales very easily at low cost. | Not scalable at low cost. |
| Resembles a traditional database by supporting SQL, but is not itself a database. | Is a database. |
| Does not rely on indexes, because data is always scanned. | Supports indexes, which are very important for performance. |
| Focuses only on analytics. | Focuses on analytics or online processing. |
| Distributed processing is done via MapReduce. | Distributed processing varies by vendor. |
| Scales up to hundreds of nodes. | Rarely scales beyond 20 nodes. |

Comparison between Hive and RDBMS

Mohamed Tarek

Mohamed Tarek is a Data Engineer with a great passion for data and all the technologies around the data ecosystem. He specializes in data integration and data streaming technologies and is part of a development team that delivers end-to-end solutions, including data integration, data model design, and data analytics.
