What is Apache Hive?
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. A data warehouse provides a central store of information that can easily be analyzed to make informed, data-driven decisions. Hive allows users to read, write, and manage petabytes of data using SQL.
Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets. As a result, Hive is closely integrated with Hadoop, and is designed to work quickly on petabytes of data. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or MapReduce, with a SQL-like interface.
How does Hive work?
Hive was created to allow non-programmers familiar with SQL to work with petabytes of data, using a SQL-like interface called HiveQL. Traditional relational databases are designed for interactive queries on small to medium datasets and do not process huge datasets well. Hive instead uses batch processing, which lets it work efficiently across very large, distributed datasets. Hive transforms HiveQL queries into MapReduce or Tez jobs that run on Apache Hadoop’s distributed job scheduling framework, Yet Another Resource Negotiator (YARN). It queries data stored in a distributed storage solution, like the Hadoop Distributed File System (HDFS) or Amazon S3. Hive stores its database and table metadata in a metastore, which is a database-backed or file-backed store that enables easy data abstraction and discovery.
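To make the translation step concrete, here is a minimal sketch in plain Python of how an aggregate query such as `SELECT symbol, COUNT(*) FROM trades GROUP BY symbol` could be broken into map, shuffle, and reduce phases. It illustrates the batch model only and is not Hive's actual query planner; the `trades` table, its columns, and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a "trades" table, split across two input blocks
# as they might be in distributed storage.
BLOCKS = [
    [{"symbol": "AAA", "qty": 10}, {"symbol": "BBB", "qty": 5}],
    [{"symbol": "AAA", "qty": 7}, {"symbol": "CCC", "qty": 1}, {"symbol": "AAA", "qty": 2}],
]


def map_phase(block):
    """Emit one (key, 1) pair per row; each block can be mapped independently."""
    return [(row["symbol"], 1) for row in block]


def shuffle(mapped_blocks):
    """Group emitted values by key so that each key ends up at one reducer."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_symbol(blocks):
    """Run the three phases in order over all input blocks."""
    return reduce_phase(shuffle(map_phase(block) for block in blocks))


if __name__ == "__main__":
    print(count_by_symbol(BLOCKS))
```

In a real cluster the map calls would run in parallel on the nodes that hold each block, which is why this style of processing scales with data volume but adds startup latency to every query.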
Hive includes HCatalog, which is a table and storage management layer that reads data from the Hive metastore to facilitate seamless integration between Hive, Apache Pig, and MapReduce. By using the metastore, HCatalog allows Pig and MapReduce to use the same data structures as Hive, so that the metadata doesn’t have to be redefined for each engine. Custom applications or third-party integrations can use WebHCat, a RESTful API for HCatalog, to access and reuse Hive metadata.
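The idea of defining metadata once and reusing it across engines can be sketched with a toy in-memory catalog. This is an assumption-laden illustration, not the Hive metastore or HCatalog API: the `Metastore` and `TableInfo` classes, the `engine_plan` helper, and the S3 path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableInfo:
    """Metadata for one table: where the data lives and how to read it."""
    location: str
    file_format: str
    columns: tuple  # (name, type) pairs


@dataclass
class Metastore:
    """Toy stand-in for a shared metastore; not the real Hive metastore API."""
    tables: dict = field(default_factory=dict)

    def register(self, name, info):
        self.tables[name] = info

    def describe(self, name):
        return self.tables[name]


def engine_plan(engine, metastore, table):
    """Each engine looks the table up instead of redefining its schema."""
    info = metastore.describe(table)
    names = ", ".join(col for col, _ in info.columns)
    return f"{engine}: read {info.file_format} at {info.location} as ({names})"


metastore = Metastore()
metastore.register(
    "trades",
    TableInfo(
        location="s3://example-bucket/trades/",  # hypothetical path
        file_format="ORC",
        columns=(("symbol", "STRING"), ("qty", "INT")),
    ),
)

if __name__ == "__main__":
    for engine in ("hive", "pig", "mapreduce"):
        print(engine_plan(engine, metastore, "trades"))
```

The table is registered once, and all three engines resolve the same location, format, and columns from that single definition.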
What are the benefits of Hive?
Fast
Hive is designed to quickly handle petabytes of data using batch processing.
Familiar
Hive provides a familiar, SQL-like interface that is accessible to non-programmers.
Scalable
Hive is easy to distribute and scale based on your needs.
What are the differences between Apache Hive and Apache HBase?
Apache HBase is a NoSQL distributed database that enables random, strictly consistent, real-time access to petabytes of data. Apache Hive is a distributed data warehouse system that provides SQL-like querying capabilities.
Characteristics | Apache Hive | Apache HBase |
Function | SQL-like query engine designed for high-volume data stores. Multiple file formats are supported. | Low-latency distributed key-value store with custom query capabilities. Data is stored in a column-oriented format. |
Processing Type | Batch processing using Apache Tez or MapReduce compute frameworks. | Real-time processing. |
Latency | Medium to high, depending on the responsiveness of the compute engine. The distributed execution model provides superior performance compared to monolithic query systems, like RDBMS, for the same data volumes. | Low, but it can be inconsistent. Structural limitations of the HBase architecture can result in latency spikes under intense write loads. |
Hadoop Integration | Runs on top of Hadoop, with Apache Tez or MapReduce for processing and HDFS or Amazon S3 for storage. | Runs on top of HDFS or Amazon S3. |
SQL Support | Provides SQL-like querying capabilities with HiveQL. | No SQL support on its own. You can use Apache Phoenix for SQL capabilities. |
Schema | Defined schema for all tables. | Schema-free. |
Data Types | Supports structured and unstructured data. Provides native support for common SQL data types, like INT, FLOAT, and VARCHAR. | Supports unstructured data only. The user defines mappings of data fields to Java-supported data types. |
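The difference in access patterns can be sketched in a few lines of Python. This is a conceptual illustration only and uses neither the Hive nor the HBase client API; the `EVENTS` data and both function names are made up for the example.

```python
# Hypothetical event records keyed by row key; values are made up for illustration.
EVENTS = {
    "evt-001": {"user": "ana", "amount": 40},
    "evt-002": {"user": "ben", "amount": 15},
    "evt-003": {"user": "ana", "amount": 25},
}


def point_lookup(store, row_key):
    """HBase-style access: fetch a single row directly by its key."""
    return store.get(row_key)


def batch_total_by_user(store):
    """Hive-style access: scan every row and aggregate, as a
    SELECT user, SUM(amount) ... GROUP BY user query would."""
    totals = {}
    for row in store.values():
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(point_lookup(EVENTS, "evt-002"))
    print(batch_total_by_user(EVENTS))
```

The lookup touches one row regardless of table size, while the aggregate reads the whole dataset, which is the trade-off behind the latency row in the table above.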
What are use cases for Hive?
Guardian gives 27 million members the security they deserve through insurance and wealth management products and services. Guardian uses Amazon EMR to run Apache Hive on an S3 data lake, where Hive is used for batch processing. The S3 data lake fuels Guardian Direct, a digital platform that allows consumers to research and purchase both Guardian products and third-party products in the insurance sector.
FINRA, the Financial Industry Regulatory Authority, is the largest independent securities regulator in the United States, and monitors and regulates financial trading practices. FINRA uses Amazon EMR to run Apache Hive on an S3 data lake. Running Hive on the EMR clusters enables FINRA to process and analyze trade data of up to 90 billion events using SQL. The cloud data lake resulted in cost savings of up to $20 million compared to FINRA’s on-premises solution, and drastically reduced the time needed for recovery and upgrades.
Vanguard, an American registered investment advisor, is the largest provider of mutual funds and the second largest provider of exchange-traded funds. Vanguard uses Amazon EMR to run Apache Hive on an S3 data lake. Data is stored in S3 and EMR builds a Hive metastore on top of that data. The Hive metastore contains all the metadata about the data and tables in the EMR cluster, which allows for easy data analysis. Hive also enables analysts to perform ad hoc SQL queries on data stored in the S3 data lake. Migrating to an S3 data lake with Amazon EMR has enabled 150+ data analysts to realize operational efficiency and has reduced EC2 and EMR costs by $600k.
How can AWS support Hive?
Amazon EMR provides the easiest, fastest, and most cost-effective managed Hadoop framework, enabling customers to process vast amounts of data across dynamically scalable EC2 instances. Customers can also run other popular distributed frameworks such as Apache Hive, Spark, HBase, Presto, and Flink in EMR.
Get started with Apache Hive on AWS by creating a free AWS account today.