Hive – Introduction

The term 'Big Data' is used for collections of large datasets characterized by huge volume, high velocity, and a wide variety of data that grows day by day. It is difficult to process Big Data using traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.

Hadoop

Hadoop is an open-source framework for storing and processing Big Data in a distributed environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).

MapReduce: A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to store and process the datasets. It provides a fault-tolerant file system that runs on commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that support the Hadoop modules:

Sqoop: Used to import and export data between HDFS and an RDBMS.

Pig: A procedural language platform used to develop scripts for MapReduce operations.

Hive: A platform used to develop SQL-type scripts to perform MapReduce operations.

Note: There are various ways to execute MapReduce operations:

The traditional approach, using a Java MapReduce program for structured, semi-structured, and unstructured data.
The scripting approach, using Pig to process structured and semi-structured data.
The Hive Query Language (HiveQL or HQL), using Hive to process structured data.

What is Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.

Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not:

A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates

Features of Hive

It stores schema in a database and processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying, called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive. The diagram contains different units, described below.

User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce.
It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage technique used to store the data in the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop. The steps below define how Hive interacts with the Hadoop framework:

Step 1 (Execute Query): The Hive interface, such as the command line or the Web UI, sends the query to the driver (any database driver such as JDBC, ODBC, etc.) for execution.

Step 2 (Get Plan): The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, i.e. the requirement of the query.

Step 3 (Get Metadata): The compiler sends a metadata request to the Metastore (any database).

Step 4 (Send Metadata): The Metastore sends the metadata as a response to the compiler.

Step 5 (Send Plan): The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

Step 6 (Execute Plan): The driver sends the execute plan to the execution engine.

Step 7 (Execute Job): Internally, the execution process is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the Name node, and the JobTracker assigns the job to the TaskTracker, which resides in the Data node. Here, the query executes the MapReduce job.

Step 7.1 (Metadata Ops): Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.

Step 8 (Fetch Result): The execution engine receives the results from the Data nodes.

Step 9 (Send Results): The execution engine sends those resultant values to the driver.

Step 10 (Send Results): The driver sends the results to the Hive interfaces.
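Because HiveQL statements compile down to MapReduce jobs, even a one-line query exercises the whole workflow described above. As a minimal sketch, assuming the employee table that the later chapters create, a grouped count such as the following is parsed by the compiler, planned through the driver, and executed as a MapReduce job whose map phase scans the table and whose reduce phase aggregates the counts:

hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;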

HiveQL – Select Where

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with a WHERE clause.

The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression that fulfils the condition.

Syntax

Given below is the syntax of the SELECT query:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example

Let us take an example for the SELECT...WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of the employees who earn a salary of more than Rs 30000.

+------+-------------+--------+-------------------+-------+
|  ID  | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Kiran       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

On successful execution of the query, you get to see the following response:

+------+-------------+--------+-------------------+------+
|  ID  | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Kiran       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+

JDBC Program

The JDBC program to apply the WHERE clause for the given example is as follows.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");

      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " +
            res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}

Save the program in a file named HiveQLWhere.java. Use the following commands to compile and execute this program.

$ javac HiveQLWhere.java
$ java HiveQLWhere

Output:

ID    Name          Salary   Designation         Dept
1201  Gopal         45000    Technical manager   TP
1202  Manisha       45000    Proofreader         PR
1203  Masthanvali   40000    Technical writer    TP
1204  Kiran         40000    Hr Admin            HR
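A WHERE condition is not limited to a single comparison; it can combine several predicates with logical operators and be paired with the other optional clauses listed in the syntax above. The following is an illustrative sketch against the same employee table (the AND predicate and the LIMIT value are arbitrary choices, not part of the original example):

hive> SELECT Id, Name, Salary FROM employee
      WHERE Salary > 30000 AND Dept = 'TP'
      LIMIT 10;

For the sample data shown above, this returns the two TP employees, Gopal and Masthanvali.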

HiveQL – Select Joins

JOIN is a clause used to combine specific fields from two tables by using values common to each. It is used to combine records from two or more tables in the database.

Syntax

join_table:
   table_reference JOIN table_factor [join_condition]
   | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
   | table_reference LEFT SEMI JOIN table_reference join_condition
   | table_reference CROSS JOIN table_reference [join_condition]

Example

We will use the following two tables in this chapter. Consider the following table named CUSTOMERS.

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |
| 2  | Khilan   | 25  | Delhi     | 1500.00  |
| 3  | kaushik  | 23  | Kota      | 2000.00  |
| 4  | Chaitali | 25  | Mumbai    | 6500.00  |
| 5  | Hardik   | 27  | Bhopal    | 8500.00  |
| 6  | Komal    | 22  | MP        | 4500.00  |
| 7  | Muffy    | 24  | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+

Consider another table ORDERS as follows:

+-----+---------------------+-------------+--------+
| OID | DATE                | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3           | 3000   |
| 100 | 2009-10-08 00:00:00 | 3           | 1500   |
| 101 | 2009-11-20 00:00:00 | 2           | 1560   |
| 103 | 2008-05-20 00:00:00 | 4           | 2060   |
+-----+---------------------+-------------+--------+

There are different types of joins given as follows:

JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN

JOIN

The JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the same as INNER JOIN in SQL. A JOIN condition is usually raised using the primary keys and foreign keys of the tables.

The following query executes JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
      FROM CUSTOMERS c JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
| 3  | kaushik  | 23  | 3000   |
| 3  | kaushik  | 23  | 1500   |
| 2  | Khilan   | 25  | 1560   |
| 4  | Chaitali | 25  | 2060   |
+----+----------+-----+--------+

LEFT OUTER JOIN

The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table. This means that if the ON clause matches 0 (zero) records in the right table, the JOIN still returns a row in the result, but with NULL in each column from the right table. A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching JOIN predicate.

The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c
      LEFT OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+----+----------+--------+---------------------+
| ID | NAME     | AMOUNT | DATE                |
+----+----------+--------+---------------------+
| 1  | Ramesh   | NULL   | NULL                |
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5  | Hardik   | NULL   | NULL                |
| 6  | Komal    | NULL   | NULL                |
| 7  | Muffy    | NULL   | NULL                |
+----+----------+--------+---------------------+

RIGHT OUTER JOIN

The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.

The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c
      RIGHT OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID   | NAME     | AMOUNT | DATE                |
+------+----------+--------+---------------------+
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

FULL OUTER JOIN

The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables that fulfil the JOIN condition. The joined table contains either all the records from both tables, or fills in NULL values for missing matches on either side.

The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c
      FULL OUTER JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:

+------+----------+--------+---------------------+
| ID   | NAME     | AMOUNT | DATE                |
+------+----------+--------+---------------------+
| 1    | Ramesh   | NULL   | NULL                |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5    | Hardik   | NULL   | NULL                |
| 6    | Komal    | NULL   | NULL                |
| 7    | Muffy    | NULL   | NULL                |
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
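The join syntax above also permits LEFT SEMI JOIN, which this chapter does not demonstrate. As a hedged sketch against the same two tables: a LEFT SEMI JOIN returns each left-table row at most once when a match exists in the right table, and only columns from the left table may appear in the SELECT list:

hive> SELECT c.ID, c.NAME
      FROM CUSTOMERS c
      LEFT SEMI JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);

For the sample data, this would list Khilan, kaushik, and Chaitali once each, since they are the only customers with orders.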

Hive – Create Table

This chapter explains how to create a table and how to insert data into it. The conventions for creating a table in Hive are quite similar to creating a table using SQL.

Create Table Statement

CREATE TABLE is the statement used to create a table in Hive. The syntax and example are as follows:

Syntax

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Example

Let us assume you need to create a table named employee using the CREATE TABLE statement. The following table lists the fields and their data types in the employee table:

Sr.No   Field Name    Data Type
1       Eid           int
2       Name          String
3       Salary        Float
4       Designation   String

The following data specifies the comment and the row-format fields, such as the field terminator, line terminator, and stored file type:

COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE

The following query creates a table named employee using the above data:

hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
      salary Float, designation String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;

If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists. On successful creation of the table, you get to see the following response:

OK
Time taken: 5.905 seconds
hive>

JDBC Program

The JDBC program to create a table for the given example is as follows.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement (note the doubled backslashes so Hive receives '\t' and '\n')
      stmt.execute("CREATE TABLE IF NOT EXISTS "
         + " employee (eid int, name String, "
         + " salary Float, designation String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\\t'"
         + " LINES TERMINATED BY '\\n'"
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}

Save the program in a file named HiveCreateTable.java. The following commands are used to compile and execute this program.

$ javac HiveCreateTable.java
$ java HiveCreateTable

Output

Table employee created.

Load Data Statement

Generally, after creating a table in SQL, we insert data using the INSERT statement. In Hive, however, we can insert data using the LOAD DATA statement. When inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are two ways to load data: one is from the local file system, and the second is from the Hadoop file system.

Syntax

The syntax for LOAD DATA is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

LOCAL is an identifier to specify the local path. It is optional.
OVERWRITE is optional; it overwrites the existing data in the table.
PARTITION is optional.

Example

We will insert the following data into the table. It is a text file named sample.txt in the /home/user directory.
1201  Gopal        45000  Technical manager
1202  Manisha      45000  Proofreader
1203  Masthanvali  40000  Technical writer
1204  Kiran        40000  Hr Admin
1205  Kranthi      30000  Op Admin

The following query loads the given text file into the table:

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
      OVERWRITE INTO TABLE employee;

On successful load, you get to see the following response:

OK
Time taken: 15.905 seconds
hive>

JDBC Program

Given below is the JDBC program to load the given data into the table.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveLoadData {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      stmt.execute("LOAD DATA LOCAL INPATH '/home/user/sample.txt' "
         + "OVERWRITE INTO TABLE employee");
      System.out.println("Load Data into employee successful");
      con.close();
   }
}

Save the program in a file named HiveLoadData.java. Use the following commands to compile and execute this program.

$ javac HiveLoadData.java
$ java HiveLoadData

Output:

Load Data into employee successful
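The CREATE TABLE syntax shown earlier also accepts the EXTERNAL keyword, which this chapter does not exercise. The following is a minimal sketch of an external table; the table name employee_ext and the HDFS path are assumptions for illustration. An external table points at data that already resides at the given location, and dropping the table removes only the metadata, not the underlying files:

hive> CREATE EXTERNAL TABLE IF NOT EXISTS employee_ext (
         eid int, name String, salary Float, designation String)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/user/data/employee_ext';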

HiveQL – Select Group By

This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP BY clause is used to group all the records in a result set by a particular collection column. It is used to query a group of records.

Syntax

The syntax of the GROUP BY clause is as follows:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Example

Let us take an example of the SELECT...GROUP BY clause. Assume the employee table as given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the number of employees in each department.

+------+-------------+--------+-------------------+-------+
|  ID  | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Kiran       | 45000  | Proofreader       | PR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following query retrieves the employee counts using the above scenario:

hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;

On successful execution of the query, you get to see the following response:

+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| PR    | 2        |
| TP    | 2        |
+-------+----------+

JDBC Program

Given below is the JDBC program to apply the GROUP BY clause for the given example.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) FROM employee GROUP BY Dept");
      System.out.println(" Dept \t count(*)");

      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2));
      }
      con.close();
   }
}

Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.

$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy

Output:

Dept    Count(*)
Admin   1
PR      2
TP      2
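The syntax above also lists a HAVING clause, which filters groups after aggregation rather than rows before it. A small illustrative sketch against the same employee table (the threshold of 1 is an arbitrary value chosen for the example):

hive> SELECT Dept, count(*)
      FROM employee
      GROUP BY Dept
      HAVING count(*) > 1;

For the sample data, this keeps only the PR and TP groups and drops Admin, whose count is 1.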

Hive – Quick Guide

Hive – Introduction

The term 'Big Data' is used for collections of large datasets characterized by huge volume, high velocity, and a wide variety of data that grows day by day. It is difficult to process Big Data using traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.

Hadoop

Hadoop is an open-source framework for storing and processing Big Data in a distributed environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).

MapReduce: A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to store and process the datasets. It provides a fault-tolerant file system that runs on commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that support the Hadoop modules:

Sqoop: Used to import and export data between HDFS and an RDBMS.

Pig: A procedural language platform used to develop scripts for MapReduce operations.

Hive: A platform used to develop SQL-type scripts to perform MapReduce operations.

Note: There are various ways to execute MapReduce operations:

The traditional approach, using a Java MapReduce program for structured, semi-structured, and unstructured data.
The scripting approach, using Pig to process structured and semi-structured data.
The Hive Query Language (HiveQL or HQL), using Hive to process structured data.

What is Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.

Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not:

A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates

Features of Hive

It stores schema in a database and processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying, called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive. The diagram contains different units, described below.

User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements for the traditional MapReduce approach: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine.
The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage technique used to store the data in the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop. The steps below define how Hive interacts with the Hadoop framework:

Step 1 (Execute Query): The Hive interface, such as the command line or the Web UI, sends the query to the driver (any database driver such as JDBC, ODBC, etc.) for execution.

Step 2 (Get Plan): The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, i.e. the requirement of the query.

Step 3 (Get Metadata): The compiler sends a metadata request to the Metastore (any database).

Step 4 (Send Metadata): The Metastore sends the metadata as a response to the compiler.

Step 5 (Send Plan): The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

Step 6 (Execute Plan): The driver sends the execute plan to the execution engine.

Step 7 (Execute Job): Internally, the execution process is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the Name node, and the JobTracker assigns the job to the TaskTracker, which resides in the Data node. Here, the query executes the MapReduce job.

Step 7.1 (Metadata Ops): Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.

Step 8 (Fetch Result): The execution engine receives the results from the Data nodes.

Step 9 (Send Results): The execution engine sends those resultant values to the driver.

Step 10 (Send Results): The driver sends the results to the Hive interfaces.

Hive – Installation

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux-flavored OS. The following simple steps are executed for Hive installation:

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. Let us verify the Java installation using the following command:

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, then follow the steps given below for installing Java.

Installing Java

Step I:

Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:

Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands:

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz

Hive – Questions and Answers

Hive Questions and Answers has been designed to help students and professionals prepare for various certification exams and job interviews. This section provides a useful collection of sample interview questions and Multiple Choice Questions (MCQs) with their answers and appropriate explanations.

1. Hive Interview Questions: This section provides a huge collection of Hive interview questions with their answers hidden in a box, challenging you to attempt them before revealing the correct answer.

2. Hive Online Quiz: This section provides a great collection of Hive Multiple Choice Questions (MCQs) on a single page, along with their correct answers and explanations. If you select the right option, it turns green; otherwise red.

3. Hive Online Test: If you are preparing for a Java and Hive framework related certification exam, then this section is a must for you. It simulates a real online test with a timer, challenging you to complete the test within the given time frame. Finally, you can check your overall test score and see how you fared among the other candidates who attempted this online test.

4. Hive Mock Test: This section provides various mock tests that you can download to your local machine and solve offline. Every mock test comes with a mock test key, so you can verify your final score and grade yourself.

Hive – Useful Resources

The following resources contain additional information on Hive. Please use them to get more in-depth knowledge on this topic.

Useful Video Courses

Big Data Analytics Using Hive In Hadoop (21 lectures, 2 hours) by Mukund Kumar Mishra
Advance Big Data Analytics using Hive & Sqoop (51 lectures, 4 hours) by Navdeep Kaur
Apache Hive for Data Engineers (Hands On) (92 lectures, 6 hours) by Bigdata Engineer
Apache Hive Interview Question and Answer (100+ FAQ) (109 lectures, 2 hours) by Bigdata Engineer
Learn Hive – Course for Beginners (22 lectures, 2.5 hours) by Corporate Bridge Consultancy Private Limited

Hive – Views And Indexes

This chapter describes how to create and manage views. Views are generated based on user requirements. You can save any result-set data as a view. The usage of a view in Hive is the same as that of a view in SQL. It is a standard RDBMS concept. We can execute all DML operations on a view.

Creating a View

You can create a view at the time of executing a SELECT statement. The syntax is as follows:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment]
AS SELECT ...

Example

Let us take an example for a view. Assume the employee table as given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of the employees who earn a salary of more than Rs 30000. We store the result in a view named emp_30000.

+------+-------------+--------+-------------------+-------+
|  ID  | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Kiran       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following query creates the view using the above scenario:

hive> CREATE VIEW emp_30000 AS
      SELECT * FROM employee
      WHERE salary>30000;

Dropping a View

Use the following syntax to drop a view:

DROP VIEW view_name

The following query drops the view named emp_30000:

hive> DROP VIEW emp_30000;

Creating an Index

An index is nothing but a pointer to a particular column of a table. Creating an index means creating a pointer to a particular column of a table. Its syntax is as follows:

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ ROW FORMAT ...] STORED AS ... | STORED BY ... ]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example

Let us take an example for an index. Use the same employee table that we used earlier, with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.

The following query creates the index:

hive> CREATE INDEX index_salary ON TABLE employee(salary)
      AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

It is a pointer to the salary column. If the column is modified, the changes are stored using the index value.

Dropping an Index

The following syntax is used to drop an index:

DROP INDEX <index_name> ON <table_name>

The following query drops the index named index_salary:

hive> DROP INDEX index_salary ON employee;
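Two follow-ups that the chapter implies but does not show. First, a view is queried exactly like a table, so the emp_30000 view can itself take a WHERE clause (the Dept filter here is an arbitrary illustrative predicate):

hive> SELECT * FROM emp_30000 WHERE Dept='TP';

Second, in the Hive versions that supported indexes, an index created WITH DEFERRED REBUILD had to be built explicitly before use, with ALTER INDEX ... REBUILD:

hive> ALTER INDEX index_salary ON employee REBUILD;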

Hive – Built-In Operators

This chapter explains the built-in operators of Hive. There are four types of operators in Hive:

Relational Operators
Arithmetic Operators
Logical Operators
Complex Operators

Relational Operators

These operators are used to compare two operands. The following table describes the relational operators available in Hive:

Operator        Operand               Description
A = B           all primitive types   TRUE if expression A is equivalent to expression B, otherwise FALSE.
A != B          all primitive types   TRUE if expression A is not equivalent to expression B, otherwise FALSE.
A < B           all primitive types   TRUE if expression A is less than expression B, otherwise FALSE.
A <= B          all primitive types   TRUE if expression A is less than or equal to expression B, otherwise FALSE.
A > B           all primitive types   TRUE if expression A is greater than expression B, otherwise FALSE.
A >= B          all primitive types   TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
A IS NULL       all types             TRUE if expression A evaluates to NULL, otherwise FALSE.
A IS NOT NULL   all types             FALSE if expression A evaluates to NULL, otherwise TRUE.
A LIKE B        Strings               TRUE if string pattern A matches B, otherwise FALSE.
A RLIKE B       Strings               NULL if A or B is NULL; TRUE if any substring of A matches the Java regular expression B; otherwise FALSE.
A REGEXP B      Strings               Same as RLIKE.

Example

Let us assume the employee table is composed of fields named Id, Name, Salary, Designation, and Dept as shown below. Generate a query to retrieve the details of the employee whose Id is 1205.

+------+-------------+--------+-------------------+-------+
|  Id  | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Kiran       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following query is executed to retrieve the employee details using the above table:

hive> SELECT * FROM employee WHERE Id=1205;

On successful execution of the query, you get to see the following response:

+------+---------+--------+-------------+-------+
|  ID  | Name    | Salary | Designation | Dept  |
+------+---------+--------+-------------+-------+
| 1205 | Kranthi | 30000  | Op Admin    | Admin |
+------+---------+--------+-------------+-------+

The following query is executed to retrieve the details of the employees whose salary is more than or equal to Rs 40000:

hive> SELECT * FROM employee WHERE Salary>=40000;

On successful execution of the query, you get to see the following response:

+------+-------------+--------+-------------------+------+
|  ID  | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Kiran       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+

Arithmetic Operators

These operators support various common arithmetic operations on the operands. All of them return number types. The following table describes the arithmetic operators available in Hive:

Operator   Operand            Description
A + B      all number types   Gives the result of adding A and B.
A - B      all number types   Gives the result of subtracting B from A.
A * B      all number types   Gives the result of multiplying A and B.
A / B      all number types   Gives the result of dividing A by B.
A % B      all number types   Gives the remainder of dividing A by B.
A & B      all number types   Gives the result of bitwise AND of A and B.
A | B      all number types   Gives the result of bitwise OR of A and B.
A ^ B      all number types   Gives the result of bitwise XOR of A and B.
~A         all number types   Gives the result of bitwise NOT of A.

Example

The following query adds the two numbers 20 and 30 (temp can be any existing table):

hive> SELECT 20+30 ADD FROM temp;

On successful execution of the query, you get to see the following response:

+------+
| ADD  |
+------+
| 50   |
+------+

Logical Operators

The operators are logical expressions. All of them return either TRUE or FALSE.

Operator   Operands   Description
A AND B    boolean    TRUE if both A and B are TRUE, otherwise FALSE.
A && B     boolean    Same as A AND B.
A OR B     boolean    TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B     boolean    Same as A OR B.
NOT A      boolean    TRUE if A is FALSE, otherwise FALSE.
!A         boolean    Same as NOT A.

Example

The following query is used to retrieve the details of the employees whose department is TP and whose salary is more than Rs 40000:

hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';

On successful execution of the query, you get to see the following response:

+------+-------+--------+-------------------+------+
|  ID  | Name  | Salary | Designation       | Dept |
+------+-------+--------+-------------------+------+
| 1201 | Gopal | 45000  | Technical manager | TP   |
+------+-------+--------+-------------------+------+

Complex Operators

These operators provide an expression to access the elements of complex types.

Operator   Operand                               Description
A[n]       A is an Array and n is an int         Returns the nth element of array A. The first element has index 0.
M[key]     M is a Map<K, V> and key has type K   Returns the value corresponding to the key in the map.
S.x        S is a struct                         Returns the x field of S.
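The complex operators are easiest to see against a table that actually has complex-typed columns. The following sketch is purely illustrative; the emp_complex table and its columns are hypothetical and not part of the tutorial's employee examples:

hive> CREATE TABLE emp_complex (
         name String,
         skills ARRAY<String>,
         scores MAP<String, int>,
         addr STRUCT<city:String, pin:int>);

hive> SELECT name,
             skills[0],       -- first element of the skills array (index 0)
             scores['hive'],  -- map value stored under the key 'hive'
             addr.city        -- the city field of the addr struct
      FROM emp_complex;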