HiveQL – Select-Group By

This chapter explains the GROUP BY clause of the SELECT statement. GROUP BY collects the rows of a result set into groups that share the same value in one or more columns, so that an aggregate such as count(*) is computed once per group.

Syntax

The syntax of a SELECT statement with a GROUP BY clause is as follows:

    SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM table_reference
    [WHERE where_condition]
    [GROUP BY col_list]
    [HAVING having_condition]
    [ORDER BY col_list]
    [LIMIT number];

Example

Let us take an example of the SELECT ... GROUP BY clause. Assume the employee table given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the number of employees in each department.

    +------+-------------+--------+-------------------+-------+
    | ID   | Name        | Salary | Designation       | Dept  |
    +------+-------------+--------+-------------------+-------+
    | 1201 | Gopal       | 45000  | Technical manager | TP    |
    | 1202 | Manisha     | 45000  | Proofreader       | PR    |
    | 1203 | Masthanvali | 40000  | Technical writer  | TP    |
    | 1204 | Krian       | 45000  | Proofreader       | PR    |
    | 1205 | Kranthi     | 30000  | Op Admin          | Admin |
    +------+-------------+--------+-------------------+-------+

The following query counts the employees per department:

    hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;

On successful execution of the query, you get to see the following response:

    +-------+----------+
    | Dept  | count(*) |
    +-------+----------+
    | Admin | 1        |
    | PR    | 2        |
    | TP    | 2        |
    +-------+----------+

JDBC Program

Given below is the JDBC program that applies the GROUP BY clause to the given example. The driver class and the jdbc:hive:// URL used here belong to the original HiveServer; HiveServer2 uses org.apache.hive.jdbc.HiveDriver with jdbc:hive2:// URLs instead.

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveQLGroupBy {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            // Register the driver
            Class.forName(driverName);

            // Get a connection
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

            // Create a statement
            Statement stmt = con.createStatement();

            // Execute the query; the trailing semicolon is only needed at the hive> prompt
            ResultSet res = stmt.executeQuery("SELECT Dept, count(*) FROM employee GROUP BY Dept");

            System.out.println("Dept \t count(*)");
            while (res.next()) {
                System.out.println(res.getString(1) + " " + res.getInt(2));
            }
            con.close();
        }
    }

Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.

    $ javac HiveQLGroupBy.java
    $ java HiveQLGroupBy

Output:

    Dept     count(*)
    Admin 1
    PR 2
    TP 2
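To make the grouping step concrete, the following is a minimal sketch that reproduces the same count-per-department logic in plain Java, without a Hive server. The Employee class, the sampleRows() helper, and the class name GroupByDemo are hypothetical names introduced only for this illustration; the rows are the five sample rows of the employee table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupByDemo {

    // Hypothetical in-memory stand-in for one row of the employee table.
    static final class Employee {
        final int id;
        final String name;
        final int salary;
        final String designation;
        final String dept;

        Employee(int id, String name, int salary, String designation, String dept) {
            this.id = id;
            this.name = name;
            this.salary = salary;
            this.designation = designation;
            this.dept = dept;
        }
    }

    // The five sample rows from the example table.
    static List<Employee> sampleRows() {
        return Arrays.asList(
            new Employee(1201, "Gopal", 45000, "Technical manager", "TP"),
            new Employee(1202, "Manisha", 45000, "Proofreader", "PR"),
            new Employee(1203, "Masthanvali", 40000, "Technical writer", "TP"),
            new Employee(1204, "Krian", 45000, "Proofreader", "PR"),
            new Employee(1205, "Kranthi", 30000, "Op Admin", "Admin"));
    }

    // Same grouping as: SELECT Dept, count(*) FROM employee GROUP BY Dept
    // A TreeMap is used only to make the printed order deterministic.
    static Map<String, Long> countByDept(List<Employee> rows) {
        return rows.stream()
                   .collect(Collectors.groupingBy(e -> e.dept, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : countByDept(sampleRows()).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With the five sample rows this prints Admin 1, PR 2, and TP 2: each department appears once, paired with the number of rows that carry it.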
Hive – Quick Guide
Hive – Introduction

The term 'Big Data' is used for collections of large datasets characterized by huge volume, high velocity, and a variety of data that is increasing day by day. Big Data is difficult to process with traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.

Hadoop

Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).

MapReduce: A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

HDFS: The Hadoop Distributed File System is the part of the Hadoop framework that stores the datasets. It provides a fault-tolerant file system that runs on commodity hardware.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that support the Hadoop modules.

Sqoop: Used to import and export data between HDFS and an RDBMS.

Pig: A procedural language platform used to develop scripts for MapReduce operations.

Hive: A platform used to develop SQL-type scripts for MapReduce operations.

Note: There are various ways to execute MapReduce operations:

- The traditional approach, using a Java MapReduce program, for structured, semi-structured, and unstructured data.
- The scripting approach, using Pig, to process structured and semi-structured data.
- The Hive Query Language (HiveQL or HQL), using Hive, to process structured data.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook. Later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

- A relational database
- A design for OnLine Transaction Processing (OLTP)
- A language for real-time queries and row-level updates

Features of Hive

- It stores the schema in a database and the processed data in HDFS.
- It is designed for OLAP.
- It provides an SQL-type query language called HiveQL or HQL.
- It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

[Figure: Hive architecture component diagram]

This component diagram contains different units. The following table describes each unit:

Unit Name: Operation

User Interface: Hive is data warehouse infrastructure software that creates the interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive uses a database server of your choice to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and queries against the schema information in the Metastore. It is one of the replacements for the traditional MapReduce program: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The Hive execution engine connects the HiveQL process engine and MapReduce. It processes the query and generates the same results a MapReduce job would, because it runs the query as MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase is the storage layer that holds the data.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

[Figure: workflow between Hive and Hadoop]

The following table defines how Hive interacts with the Hadoop framework:

Step No.: Operation

1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver asks the query compiler to parse the query, check its syntax, and work out the query plan, that is, what the query requires.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirements and sends the plan back to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which runs on the Name node, and the JobTracker assigns the job to a TaskTracker, which runs on a Data node. Here, the query runs as a MapReduce job.
7.1 Metadata Ops: During execution, the execution engine can perform metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those results to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.

Hive – Installation

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need a Linux-flavored OS installed. The following steps are executed for Hive installation:

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. Verify the Java installation using the following command:

    $ java -version

If Java is already installed on your system, you get to see the following response:

    java version "1.7.0_71"
    Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
    Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, follow the steps given below to install it.
Installing Java

Step I: Download Java (JDK <latest version> – X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. The file jdk-7u71-linux-x64.tar.gz will then be downloaded onto your system.

Step II: Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

    $ cd Downloads/
    $ ls
    jdk-7u71-linux-x64.tar.gz
    $ tar
Hive – Drop Table
This chapter describes how to drop a table in Hive. Dropping a table removes its metadata from the Hive Metastore. For a normal (managed) table, Hive also deletes the table data. For an external table, only the metadata is removed and the data files stay where they are, outside Hive's control.

Drop Table Statement

The syntax is as follows:

    DROP TABLE [IF EXISTS] table_name;

The following query drops a table named employee:

    hive> DROP TABLE IF EXISTS employee;

On successful execution of the query, you get to see the following response:

    OK
    Time taken: 5.3 seconds
    hive>

JDBC Program

The following JDBC program drops the employee table.

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveDropTable {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            // Register the driver
            Class.forName(driverName);

            // Get a connection
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

            // Create a statement
            Statement stmt = con.createStatement();

            // Execute the DDL statement; it returns no result set,
            // and the trailing semicolon is only needed at the hive> prompt
            stmt.execute("DROP TABLE IF EXISTS employee");
            System.out.println("Drop table successful.");
            con.close();
        }
    }

Save the program in a file named HiveDropTable.java. Use the following commands to compile and execute this program.

    $ javac HiveDropTable.java
    $ java HiveDropTable

Output:

    Drop table successful.

The following query is used to verify the list of tables:

    hive> SHOW TABLES;
    emp
    OK
    Time taken: 2.1 seconds
    hive>
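When the table name comes from a variable rather than a literal, it helps to build the statement in one place and to close the JDBC resources reliably. The following is a minimal sketch under stated assumptions: the class DropTableSketch, the dropTableSql() and dropTable() helpers, and the name-checking pattern are hypothetical and are not part of Hive or of its JDBC driver, and the JDBC URL is expected as the first command-line argument.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.regex.Pattern;

class DropTableSketch {

    // Conservative, hypothetical rule for table names accepted by this helper;
    // it is not Hive's own identifier grammar.
    private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Builds the statement text. The semicolon is a terminator for the
    // hive> prompt, so it is left out here.
    static String dropTableSql(String table) {
        if (table == null || !SAFE_NAME.matcher(table).matches()) {
            throw new IllegalArgumentException("Unsafe table name: " + table);
        }
        return "DROP TABLE IF EXISTS " + table;
    }

    // Runs the DROP on an already opened connection and always closes the statement.
    static void dropTable(Connection con, String table) throws SQLException {
        try (Statement stmt = con.createStatement()) {
            stmt.execute(dropTableSql(table)); // DDL: no result set expected
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            // Without a JDBC URL, only show the statement that would be sent.
            System.out.println(dropTableSql("employee"));
            return;
        }
        // args[0] is assumed to be a JDBC URL for a reachable Hive server,
        // with a matching driver on the classpath.
        try (Connection con = DriverManager.getConnection(args[0], "", "")) {
            dropTable(con, "employee");
            System.out.println("Drop table successful.");
        }
    }
}
```

Run without arguments, the sketch only prints the statement text, which makes it easy to check what would be sent before pointing it at a real server.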
Hive – Alter Table
This chapter explains how to alter the attributes of a table, such as changing its table name, changing column names, adding columns, and deleting or replacing columns.

Alter Table Statement

It is used to alter a table in Hive.

Syntax

The statement takes any of the following forms, depending on which attributes we wish to modify in a table.

    ALTER TABLE name RENAME TO new_name
    ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
    ALTER TABLE name DROP [COLUMN] column_name
    ALTER TABLE name CHANGE column_name new_name new_type
    ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Of these, the DROP [COLUMN] form is not part of stock HiveQL. To remove columns, use REPLACE COLUMNS with the list of columns you want to keep.

Rename To... Statement

The following query renames the table from employee to emp.

    hive> ALTER TABLE employee RENAME TO emp;

JDBC Program

The JDBC program to rename a table is as follows.

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveAlterRenameTo {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            // Register the driver
            Class.forName(driverName);

            // Get a connection
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

            // Create a statement
            Statement stmt = con.createStatement();

            // Execute the DDL statement; it returns no result set
            stmt.execute("ALTER TABLE employee RENAME TO emp");
            System.out.println("Table renamed successfully.");
            con.close();
        }
    }

Save the program in a file named HiveAlterRenameTo.java. Use the following commands to compile and execute this program.

    $ javac HiveAlterRenameTo.java
    $ java HiveAlterRenameTo

Output:

    Table renamed successfully.

Change Statement

The following table contains the fields of the employee table and shows the fields to be changed (name and salary).
    +-------------+------------------------+-------------------+----------------------+
    | Field Name  | Convert from Data Type | Change Field Name | Convert to Data Type |
    +-------------+------------------------+-------------------+----------------------+
    | eid         | int                    | eid               | int                  |
    | name        | String                 | ename             | String               |
    | salary      | Float                  | salary            | Double               |
    | designation | String                 | designation       | String               |
    +-------------+------------------------+-------------------+----------------------+

The following queries change the column name and the column data type using the above data:

    hive> ALTER TABLE employee CHANGE name ename String;
    hive> ALTER TABLE employee CHANGE salary salary Double;

JDBC Program

Given below is the JDBC program to change a column.

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveAlterChangeColumn {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            // Register the driver
            Class.forName(driverName);

            // Get a connection
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

            // Create a statement
            Statement stmt = con.createStatement();

            // Execute the DDL statements; they return no result set
            stmt.execute("ALTER TABLE employee CHANGE name ename String");
            stmt.execute("ALTER TABLE employee CHANGE salary salary Double");
            System.out.println("Change column successful.");
            con.close();
        }
    }

Save the program in a file named HiveAlterChangeColumn.java. Use the following commands to compile and execute this program.

    $ javac HiveAlterChangeColumn.java
    $ java HiveAlterChangeColumn

Output:

    Change column successful.

Add Columns Statement

The following query adds a column named dept to the employee table.

    hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');

JDBC Program

The JDBC program to add a column to a table is given below.
    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveAlterAddColumn {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            // Register the driver
            Class.forName(driverName);

            // Get a connection
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

            // Create a statement
            Statement stmt = con.createStatement();

            // Execute the DDL statement; it returns no result set
            stmt.execute("ALTER TABLE employee ADD COLUMNS "
                + "(dept STRING COMMENT 'Department name')");
            System.out.println("Add column successful.");
            con.close();
        }
    }

Save the program in a file named HiveAlterAddColumn.java. Use the following commands to compile and execute this program.

    $ javac HiveAlterAddColumn.java
    $ java HiveAlterAddColumn

Output:

    Add column successful.

Replace Statement

The following query removes all the existing columns from the employee table and replaces them with the empid and name columns:

    hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, name String);

JDBC Program

Given below is the JDBC program to replace the eid column with empid and the ename column with name.
    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveAlterReplaceColumn {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException, ClassNotFoundException {
            // Register the driver
            Class.forName(driverName);

            // Get a connection
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

            // Create a statement
            Statement stmt = con.createStatement();

            // Execute the DDL statement; it returns no result set
            stmt.execute("ALTER TABLE employee REPLACE COLUMNS "
                + "(empid Int, name String)");
            System.out.println("Replace column successful.");
            con.close();
        }
    }

Save the program in a file named HiveAlterReplaceColumn.java. Use the following commands to compile and execute this program.

    $ javac HiveAlterReplaceColumn.java
    $ java HiveAlterReplaceColumn

Output:

    Replace column successful.
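Column changes such as REPLACE COLUMNS modify only the table metadata; the stored data is not rewritten. The following is a minimal sketch of what that means for a reader, assuming a comma-delimited text table in which fields are matched to columns by position. The class ReplaceColumnsSketch, the readRow() helper, and the sample line are hypothetical and only illustrate the idea; this is not how Hive itself is implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReplaceColumnsSketch {

    // Reads one delimited line under a given column list, matching fields by position.
    // Fields beyond the end of the schema are simply not exposed.
    static Map<String, String> readRow(List<String> schema, String line) {
        String[] fields = line.split(",", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            row.put(schema.get(i), i < fields.length ? fields[i] : null);
        }
        return row;
    }

    public static void main(String[] args) {
        // Hypothetical content of the data file; it is never rewritten.
        String storedLine = "1201,Gopal,45000,Technical manager,TP";

        List<String> before = Arrays.asList("eid", "ename", "salary", "designation", "dept");
        List<String> after = Arrays.asList("empid", "name"); // schema after REPLACE COLUMNS

        System.out.println(readRow(before, storedLine));
        System.out.println(readRow(after, storedLine)); // prints {empid=1201, name=Gopal}
    }
}
```

Under these assumptions the same stored line is simply read through a shorter column list, so it is up to the user to make sure the new column definitions still match the layout of the data.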
Hive – Installation
All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need a Linux-flavored OS installed. The following steps are executed for Hive installation:

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. Verify the Java installation using the following command:

    $ java -version

If Java is already installed on your system, you get to see the following response:

    java version "1.7.0_71"
    Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
    Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, follow the steps given below to install it.

Installing Java

Step I: Download Java (JDK <latest version> – X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. The file jdk-7u71-linux-x64.tar.gz will then be downloaded onto your system.

Step II: Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

    $ cd Downloads/
    $ ls
    jdk-7u71-linux-x64.tar.gz
    $ tar zxf jdk-7u71-linux-x64.tar.gz
    $ ls
    jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz

Step III: To make Java available to all users, move it to the location "/usr/local/". Switch to root and type the following commands.

    $ su
    password:
    # mv jdk1.7.0_71 /usr/local/
    # exit

Step IV: To set up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

    export JAVA_HOME=/usr/local/jdk1.7.0_71
    export PATH=$PATH:$JAVA_HOME/bin

Now apply all the changes to the current running system.
    $ source ~/.bashrc

Step V: Use the following commands to configure Java alternatives:

    # alternatives --install /usr/bin/java java /usr/local/jdk1.7.0_71/bin/java 2
    # alternatives --install /usr/bin/javac javac /usr/local/jdk1.7.0_71/bin/javac 2
    # alternatives --install /usr/bin/jar jar /usr/local/jdk1.7.0_71/bin/jar 2
    # alternatives --set java /usr/local/jdk1.7.0_71/bin/java
    # alternatives --set javac /usr/local/jdk1.7.0_71/bin/javac
    # alternatives --set jar /usr/local/jdk1.7.0_71/bin/jar

Now verify the installation using the command java -version from the terminal as explained above.

Step 2: Verifying Hadoop Installation

Hadoop must be installed on your system before installing Hive. Verify the Hadoop installation using the following command:

    $ hadoop version

If Hadoop is already installed on your system, you will get the following response:

    Hadoop 2.4.1
    Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
    Compiled by hortonmu on 2013-10-07T06:28Z
    Compiled with protoc 2.5.0
    From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, proceed with the following steps:

Downloading Hadoop

Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.

    $ su
    password:
    # cd /usr/local
    # wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
    # tar xzf hadoop-2.4.1.tar.gz
    # mv hadoop-2.4.1 hadoop
    # exit

Installing Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step I: Setting up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes to the current running system.
    $ source ~/.bashrc

Step II: Hadoop Configuration

You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

    $ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using Java, you have to set the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

    export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the read/write buffers.

Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.

    <configuration>
       <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:9000</value>
       </property>
    </configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the replication factor, the namenode path, and the datanode path on your local file system, that is, the place where you want to store the Hadoop infrastructure data.

Let us assume the following data.

    dfs.replication (data replication value) = 1

    (In the following path, hadoop is the user name.
    hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
    namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

    (hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
    datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties between the <configuration> and </configuration> tags.
    <configuration>
       <property>
          <name>dfs.replication</name>
          <value>1</value>
       </property>
       <property>
          <name>dfs.name.dir</name>
          <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
       </property>
       <property>
          <name>dfs.data.dir</name>
          <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
       </property>
    </configuration>

Note: In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure YARN for Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.

    <configuration>
       <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
       </property>
    </configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template for this file, named mapred-site.xml.template. First, copy mapred-site.xml.template to mapred-site.xml using the following command.

    $ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.

    <configuration>
       <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
       </property>
    </configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step I: Name Node Setup

Set up the namenode using the command "hdfs namenode -format" as follows.

    $ cd ~
    $ hdfs namenode -format

The expected result is as follows.

    10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG: host = localhost/192.168.1.11
    STARTUP_MSG: args = [-format]
    STARTUP_MSG: version = 2.4.1
    ...
    ...
    10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
    10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
    10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
    10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
    ************************************************************/

Step II: Verifying Hadoop dfs

The following command is used to start dfs. Executing this command will start your Hadoop file system.

    $ start-dfs.sh

The expected output is as follows:

    10/24/14 21:37:56 Starting namenodes on
HCatalog – Reader Writer
HCatalog contains a data transfer API for parallel input and output without using MapReduce. This API uses a basic storage abstraction of tables and rows to read data from a Hadoop cluster and write data into it.

The Data Transfer API contains mainly three classes:

- HCatReader: Reads data from a Hadoop cluster.
- HCatWriter: Writes data into a Hadoop cluster.
- DataTransferFactory: Generates reader and writer instances.

This API is suitable for a master-slave node setup. Let us discuss HCatReader and HCatWriter in more detail.

HCatReader

HCatReader is an abstract class internal to HCatalog. It abstracts away the complexities of the underlying system from which the records are retrieved.

Sr.No. Method Name & Description

1. public abstract ReaderContext prepareRead() throws HCatException
   This should be called at the master node to obtain a ReaderContext, which should then be serialized and sent to the slave nodes.

2. public abstract Iterator<HCatRecord> read() throws HCatException
   This should be called at the slave nodes to read HCatRecords.

3. public Configuration getConf()
   It returns the configuration class object.

The HCatReader class is used to read data from HDFS. Reading is a two-step process in which the first step occurs on the master node of an external system. The second step is carried out in parallel on multiple slave nodes.

Reads are done on a ReadEntity. Before you start to read, you need to define a ReadEntity from which to read. This can be done through ReadEntity.Builder. You can specify a database name, table name, partition, and filter string. For example:

    ReadEntity.Builder builder = new ReadEntity.Builder();
    ReadEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();

The above code snippet defines a ReadEntity object ("entity"), comprising a table named mytbl in a database named mydb, which can be used to read all the rows of this table.
Note that this table must exist in HCatalog prior to the start of this operation.

After defining a ReadEntity, you obtain an instance of HCatReader using the ReadEntity and the cluster configuration:

    HCatReader reader = DataTransferFactory.getHCatReader(entity, config);

The next step is to obtain a ReaderContext from the reader as follows:

    ReaderContext cntxt = reader.prepareRead();

HCatWriter

This abstraction is internal to HCatalog. It facilitates writing to HCatalog from external systems. Don't try to instantiate it directly. Instead, use DataTransferFactory.

Sr.No. Method Name & Description

1. public abstract WriterContext prepareWrite() throws HCatException
   The external system should invoke this method exactly once from a master node. It returns a WriterContext. This should be serialized and sent to the slave nodes to construct an HCatWriter there.

2. public abstract void write(Iterator<HCatRecord> recordItr) throws HCatException
   This method should be used at the slave nodes to perform writes. The recordItr is an iterator object that contains the collection of records to be written into HCatalog.

3. public abstract void abort(WriterContext cntxt) throws HCatException
   This method should be called at the master node. Its primary purpose is to clean up in case of failures.

4. public abstract void commit(WriterContext cntxt) throws HCatException
   This method should be called at the master node. Its purpose is to commit the metadata.

Similar to reading, writing is also a two-step process in which the first step occurs on the master node. Subsequently, the second step occurs in parallel on the slave nodes.

Writes are done on a WriteEntity, which can be constructed in a fashion similar to reads:

    WriteEntity.Builder builder = new WriteEntity.Builder();
    WriteEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();

The above code creates a WriteEntity object ("entity") which can be used to write into a table named mytbl in the database mydb.
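Both the reader and the writer rely on the same handoff: the master node obtains a context object, serializes it, and ships it to the slave nodes, which deserialize it before doing their share of the work. The following is a minimal sketch of that round trip using standard Java serialization. DemoContext is a hypothetical stand-in for ReaderContext or WriterContext, and the in-memory byte array stands in for whatever transport an external system would actually use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ContextHandoffSketch {

    // Hypothetical stand-in for ReaderContext / WriterContext; the real classes
    // carry HCatalog-specific state that is not modelled here.
    static final class DemoContext implements Serializable {
        private static final long serialVersionUID = 1L;
        final String database;
        final String table;

        DemoContext(String database, String table) {
            this.database = database;
            this.table = table;
        }
    }

    // Master side: turn the context into bytes that can be shipped to the slaves.
    static byte[] serialize(DemoContext ctx) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(ctx);
        }
        return bytes.toByteArray();
    }

    // Slave side: rebuild the context from the received bytes.
    static DemoContext deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (DemoContext) in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        byte[] shipped = serialize(new DemoContext("mydb", "mytbl"));
        DemoContext received = deserialize(shipped);
        System.out.println(received.database + "." + received.table); // prints mydb.mytbl
    }
}
```

The only requirement this places on the context is that it is serializable, which is why the master can prepare it once and every slave can reconstruct an identical copy.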
After creating a WriteEntity, the next step is to obtain a WriterContext:

    HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
    WriterContext info = writer.prepareWrite();

All of the above steps occur on the master node. The master node then serializes the WriterContext object and makes it available to all the slaves.

On the slave nodes, you need to obtain an HCatWriter using the WriterContext as follows:

    HCatWriter writer = DataTransferFactory.getHCatWriter(context);

Then, the writer takes an iterator as the argument for the write method:

    writer.write(hCatRecordItr);

The writer then calls next() on this iterator in a loop and writes out all the records attached to the iterator.

The TestReaderWriter.java file is used to test the HCatReader and HCatWriter classes. The following program demonstrates how to use the HCatReader and HCatWriter API to read data from a source file and subsequently write it onto a destination file.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.metastore.api.MetaException;
    import org.apache.hadoop.hive.ql.CommandNeedRetryException;
    import org.apache.hadoop.mapreduce.InputSplit;

    import org.apache.hive.hcatalog.common.HCatException;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hive.hcatalog.data.transfer.HCatReader;
    import org.apache.hive.hcatalog.data.transfer.HCatWriter;
    import org.apache.hive.hcatalog.data.transfer.ReadEntity;
    import org.apache.hive.hcatalog.data.transfer.ReaderContext;
    import org.apache.hive.hcatalog.data.transfer.WriteEntity;
    import org.apache.hive.hcatalog.data.transfer.WriterContext;
    import org.apache.hive.hcatalog.mapreduce.HCatBaseTest;

    import org.junit.Assert;
    import org.junit.Test;

    public class TestReaderWriter extends HCatBaseTest {
       @Test
       public void test() throws MetaException, CommandNeedRetryException,
          IOException, ClassNotFoundException {

          driver.run("drop table mytbl");
          driver.run("create table mytbl (a string, b int)");

          // Copy the Hive configuration into a plain map
          Iterator<Entry<String, String>> itr = hiveConf.iterator();
          Map<String, String> map = new HashMap<String, String>();

          while (itr.hasNext()) {
             Entry<String, String> kv = itr.next();
             map.put(kv.getKey(), kv.getValue());
          }

          WriterContext cntxt = runsInMaster(map);
          File writeCntxtFile = File.createTempFile("hcat-write", "temp");
          writeCntxtFile.deleteOnExit();

          // Serialize context.
          ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(writeCntxtFile));
          oos.writeObject(cntxt);
          oos.flush();
          oos.close();

          // Now, deserialize it.
          ObjectInputStream ois = new ObjectInputStream(new FileInputStream(writeCntxtFile));
          cntxt = (WriterContext) ois.readObject();
          ois.close();
          runsInSlave(cntxt);
          commit(map, true, cntxt);

          // Repeat the same handoff for the read side
          ReaderContext readCntxt = runsInMaster(map, false);
          File readCntxtFile = File.createTempFile("hcat-read", "temp");
          readCntxtFile.deleteOnExit();
          oos = new ObjectOutputStream(new FileOutputStream(readCntxtFile));
          oos.writeObject(readCntxt);
          oos.flush();
          oos.close();

          ois = new ObjectInputStream(new FileInputStream(readCntxtFile));
          readCntxt = (ReaderContext) ois.readObject();
          ois.close();

          for (int i = 0; i < readCntxt.numSplits(); i++) {
             runsInSlave(readCntxt, i);
          }
       }

       private WriterContext runsInMaster(Map<String, String> config) throws HCatException {
          WriteEntity.Builder builder = new WriteEntity.Builder();
          WriteEntity entity = builder.withTable("mytbl").build();

          HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
          WriterContext info = writer.prepareWrite();
          return info;
       }

       private ReaderContext runsInMaster(Map<String, String> config,
          boolean bogus) throws HCatException {
          ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
          HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
          ReaderContext cntxt = reader.prepareRead();
          return cntxt;
       }

       private void runsInSlave(ReaderContext cntxt, int slaveNum) throws HCatException {
          HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
          Iterator<HCatRecord> itr
Hive – Views And Indexes
This chapter describes how to create and manage views. Views are created based on user requirements. A view stores a query definition under a name, so any result set can be exposed as a view. Views in Hive are used in the same way as views in SQL; they are a standard RDBMS concept. Hive views are read-only: they can be queried like tables, but they cannot be the target of LOAD, INSERT, or ALTER statements.

Creating a View

You can create a view from a SELECT statement. The syntax is as follows:

    CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
    [COMMENT table_comment]
    AS SELECT ...

Example

Let us take an example for view. Assume the employee table given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000. We store the result in a view named emp_30000.

    +------+-------------+--------+-------------------+-------+
    | ID   | Name        | Salary | Designation       | Dept  |
    +------+-------------+--------+-------------------+-------+
    | 1201 | Gopal       | 45000  | Technical manager | TP    |
    | 1202 | Manisha     | 45000  | Proofreader       | PR    |
    | 1203 | Masthanvali | 40000  | Technical writer  | TP    |
    | 1204 | Krian       | 40000  | Hr Admin          | HR    |
    | 1205 | Kranthi     | 30000  | Op Admin          | Admin |
    +------+-------------+--------+-------------------+-------+

The following query creates the view for the above scenario:

    hive> CREATE VIEW emp_30000 AS
          SELECT * FROM employee
          WHERE salary>30000;

Dropping a View

Use the following syntax to drop a view:

    DROP VIEW view_name

The following query drops the view named emp_30000:

    hive> DROP VIEW emp_30000;

Creating an Index

An index is a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table.
Its syntax is as follows:

    CREATE INDEX index_name
    ON TABLE base_table_name (col_name, ...)
    AS 'index.handler.class.name'
    [WITH DEFERRED REBUILD]
    [IDXPROPERTIES (property_name=property_value, ...)]
    [IN TABLE index_table_name]
    [PARTITIONED BY (col_name, ...)]
    [
       [ ROW FORMAT ...] STORED AS ...
       | STORED BY ...
    ]
    [LOCATION hdfs_path]
    [TBLPROPERTIES (...)]

Example

Let us take an example for index. Use the same employee table that we have used earlier, with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.

The following query creates the index:

    hive> CREATE INDEX index_salary ON TABLE employee(salary)
          AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';

The index is a pointer to the salary column. When the column is modified, the changes are tracked through the index values.

Dropping an Index

The following syntax is used to drop an index:

    DROP INDEX <index_name> ON <table_name>

The following query drops the index named index_salary:

    hive> DROP INDEX index_salary ON employee;
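To make the filter behind emp_30000 concrete, the following sketch applies the same salary > 30000 predicate to the sample rows in plain Java. It is only an illustration under stated assumptions: the class ViewSketch and its methods are hypothetical names, nothing here talks to Hive, and the rows are the ones from the employee table in the view example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for the employee table; not a Hive API.
final class ViewSketch {

    static final class Employee {
        final int id;
        final String name;
        final int salary;

        Employee(int id, String name, int salary) {
            this.id = id;
            this.name = name;
            this.salary = salary;
        }
    }

    // The sample rows from the view example.
    static List<Employee> sampleTable() {
        return Arrays.asList(
            new Employee(1201, "Gopal", 45000),
            new Employee(1202, "Manisha", 45000),
            new Employee(1203, "Masthanvali", 40000),
            new Employee(1204, "Krian", 40000),
            new Employee(1205, "Kranthi", 30000));
    }

    // Same predicate as the view definition: WHERE salary > 30000.
    static List<Integer> emp30000Ids(List<Employee> table) {
        List<Integer> ids = new ArrayList<>();
        for (Employee e : table) {
            if (e.salary > 30000) {
                ids.add(e.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints [1201, 1202, 1203, 1204]
        System.out.println(emp30000Ids(sampleTable()));
    }
}
```

Employee 1205 is left out because the comparison is strict: a salary of exactly 30000 does not satisfy salary > 30000.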
Hive – Built-In Operators
This chapter explains the built-in operators of Hive. There are four types of operators in Hive:

    Relational Operators
    Arithmetic Operators
    Logical Operators
    Complex Operators

Relational Operators

These operators are used to compare two operands. The following table describes the relational operators available in Hive:

    Operator        Operand               Description
    A = B           all primitive types   TRUE if expression A is equivalent to expression B, otherwise FALSE.
    A != B          all primitive types   TRUE if expression A is not equivalent to expression B, otherwise FALSE.
    A < B           all primitive types   TRUE if expression A is less than expression B, otherwise FALSE.
    A <= B          all primitive types   TRUE if expression A is less than or equal to expression B, otherwise FALSE.
    A > B           all primitive types   TRUE if expression A is greater than expression B, otherwise FALSE.
    A >= B          all primitive types   TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
    A IS NULL       all types             TRUE if expression A evaluates to NULL, otherwise FALSE.
    A IS NOT NULL   all types             FALSE if expression A evaluates to NULL, otherwise TRUE.
    A LIKE B        Strings               TRUE if string A matches the SQL simple pattern B, otherwise FALSE.
    A RLIKE B       Strings               NULL if A or B is NULL, TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.
    A REGEXP B      Strings               Same as RLIKE.

Example

Let us assume the employee table is composed of fields named Id, Name, Salary, Designation, and Dept as shown below. Generate a query to retrieve the details of the employee whose Id is 1205.
    +------+-------------+--------+-------------------+-------+
    | Id   | Name        | Salary | Designation       | Dept  |
    +------+-------------+--------+-------------------+-------+
    | 1201 | Gopal       | 45000  | Technical manager | TP    |
    | 1202 | Manisha     | 45000  | Proofreader       | PR    |
    | 1203 | Masthanvali | 40000  | Technical writer  | TP    |
    | 1204 | Krian       | 40000  | Hr Admin          | HR    |
    | 1205 | Kranthi     | 30000  | Op Admin          | Admin |
    +------+-------------+--------+-------------------+-------+

The following query retrieves the employee details from the above table:

    hive> SELECT * FROM employee WHERE Id=1205;

On successful execution of the query, you get to see the following response:

    +------+---------+--------+-------------+-------+
    | ID   | Name    | Salary | Designation | Dept  |
    +------+---------+--------+-------------+-------+
    | 1205 | Kranthi | 30000  | Op Admin    | Admin |
    +------+---------+--------+-------------+-------+

The following query retrieves the details of employees whose salary is more than or equal to Rs 40000:

    hive> SELECT * FROM employee WHERE Salary>=40000;

On successful execution of the query, you get to see the following response:

    +------+-------------+--------+-------------------+------+
    | ID   | Name        | Salary | Designation       | Dept |
    +------+-------------+--------+-------------------+------+
    | 1201 | Gopal       | 45000  | Technical manager | TP   |
    | 1202 | Manisha     | 45000  | Proofreader       | PR   |
    | 1203 | Masthanvali | 40000  | Technical writer  | TP   |
    | 1204 | Krian       | 40000  | Hr Admin          | HR   |
    +------+-------------+--------+-------------------+------+

Arithmetic Operators

These operators support various common arithmetic operations on the operands. All of them return number types. The following table describes the arithmetic operators available in Hive:

    Operator   Operand            Description
    A + B      all number types   Gives the result of adding A and B.
    A - B      all number types   Gives the result of subtracting B from A.
    A * B      all number types   Gives the result of multiplying A and B.
    A / B      all number types   Gives the result of dividing A by B.
    A % B      all number types   Gives the remainder resulting from dividing A by B.
    A & B      all number types   Gives the result of bitwise AND of A and B.
    A | B      all number types   Gives the result of bitwise OR of A and B.
    A ^ B      all number types   Gives the result of bitwise XOR of A and B.
    ~A         all number types   Gives the result of bitwise NOT of A.

Example

The following query adds two numbers, 20 and 30.

    hive> SELECT 20+30 ADD FROM temp;

On successful execution of the query, you get to see the following response:

    +-----+
    | ADD |
    +-----+
    | 50  |
    +-----+

Logical Operators

These operators combine boolean expressions. All of them return either TRUE or FALSE.

    Operator   Operand   Description
    A AND B    boolean   TRUE if both A and B are TRUE, otherwise FALSE.
    A && B     boolean   Same as A AND B.
    A OR B     boolean   TRUE if either A or B or both are TRUE, otherwise FALSE.
    A || B     boolean   Same as A OR B.
    NOT A      boolean   TRUE if A is FALSE, otherwise FALSE.
    !A         boolean   Same as NOT A.

Example

The following query retrieves the details of employees whose Department is TP and whose Salary is more than Rs 40000. Note that the string literal 'TP' must be quoted.

    hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';

On successful execution of the query, you get to see the following response:

    +------+-------+--------+-------------------+------+
    | ID   | Name  | Salary | Designation       | Dept |
    +------+-------+--------+-------------------+------+
    | 1201 | Gopal | 45000  | Technical manager | TP   |
    +------+-------+--------+-------------------+------+

Complex Operators

These operators provide an expression to access the elements of Complex Types.

    Operator   Operand                               Description
    A[n]       A is an Array and n is an int         Returns the nth element in the array A. The first element has index 0.
    M[key]     M is a Map<K, V> and key has type K   Returns the value corresponding to the key in the map.
    S.x        S is a struct                         Returns the x field of S.
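The RLIKE row in the relational operators table is easiest to read next to the Java regex API it refers to. The sketch below is a rough emulation, not Hive's implementation: the class RlikeSketch and the method rlike are hypothetical names, and a Java null stands in for SQL NULL. "Any substring matches" corresponds to Matcher.find(), as opposed to Matcher.matches(), which would require the whole string to match.

```java
import java.util.regex.Pattern;

// Hypothetical emulation of the RLIKE semantics described in the table.
final class RlikeSketch {

    // Returns null if either argument is null (standing in for SQL NULL),
    // TRUE if any substring of a matches the Java regular expression b,
    // otherwise FALSE.
    static Boolean rlike(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return Pattern.compile(b).matcher(a).find();
    }

    public static void main(String[] args) {
        System.out.println(rlike("Technical manager", "man"));  // true: substring match
        System.out.println(rlike("Proofreader", "^reader"));    // false: anchored at the start
        System.out.println(rlike(null, "man"));                 // null
    }
}
```

An unanchored pattern such as man therefore matches anywhere in the value; anchors like ^ and $ are needed to constrain where the match may occur.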
Hive – Partitioning
Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.

Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of a table.

For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. Without partitions, the query has to search the whole table for the required information. However, if you partition the employee data by year and store each year in a separate file, the query reads only the file for 2012, which reduces the query processing time. The following example shows how to partition a file and its data.

The following file contains the employeedata table.

    /tab1/employeedata/file1

    id, name, dept, yoj
    1, gopal, TP, 2012
    2, kiran, HR, 2012
    3, kaleel, SC, 2013
    4, Prasanth, SC, 2013

The above data is partitioned into two files using year.

    /tab1/employeedata/2012/file2

    1, gopal, TP, 2012
    2, kiran, HR, 2012

    /tab1/employeedata/2013/file3

    3, kaleel, SC, 2013
    4, Prasanth, SC, 2013

Adding a Partition

We can add partitions to a table by altering the table. Let us assume we have a table called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.

Syntax:

    ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
    [LOCATION 'location1'] PARTITION partition_spec [LOCATION 'location2'] ...;

    partition_spec:
    : (p_column = p_col_value, p_column = p_col_value, ...)

The following query is used to add a partition to the employee table.

    hive> ALTER TABLE employee
        > ADD PARTITION (year='2012')
        > location '/2012/part2012';

Renaming a Partition

The syntax of this command is as follows.
    ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

Both partition specs must name the same partition column; renaming changes the partition value, not the column. The following query renames the partition year='2012' to year='2013':

    hive> ALTER TABLE employee PARTITION (year='2012')
        > RENAME TO PARTITION (year='2013');

Dropping a Partition

The following syntax is used to drop a partition:

    ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec, ...;

The following query is used to drop a partition. The square brackets in the syntax only mark IF EXISTS as optional and are not typed in the actual command:

    hive> ALTER TABLE employee DROP IF EXISTS
        > PARTITION (year='1203');
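The file-splitting example from the start of this chapter can be sketched in a few lines of Java. This is only an in-memory illustration of the idea, assuming the four sample rows and the /tab1/employeedata/<year> directory layout shown earlier; the class PartitionSketch and the method partitionByYear are hypothetical names, and no HDFS or Hive API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of partitioning rows by the yoj column.
final class PartitionSketch {

    // Groups comma-separated rows (id, name, dept, yoj) under one
    // directory-style key per year of joining.
    static Map<String, List<String>> partitionByYear(List<String> rows) {
        Map<String, List<String>> partitions = new TreeMap<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            String yoj = fields[3].trim();
            partitions
                .computeIfAbsent("/tab1/employeedata/" + yoj, k -> new ArrayList<>())
                .add(row);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "1, gopal, TP, 2012",
            "2, kiran, HR, 2012",
            "3, kaleel, SC, 2013",
            "4, Prasanth, SC, 2013");

        Map<String, List<String>> partitions = partitionByYear(rows);

        // A query for employees who joined in 2012 only needs this one group.
        for (String row : partitions.get("/tab1/employeedata/2012")) {
            System.out.println(row);
        }
    }
}
```

The point of the sketch is the lookup at the end: once rows are grouped by year, a query restricted to 2012 touches two rows instead of scanning all four.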
Hive – Create Database
Hive is a database technology that can define databases and tables to analyze structured data. The idea behind structured data analysis is to store the data in a tabular manner and pass queries to analyze it. This chapter explains how to create a Hive database. Hive contains a default database named default.

Create Database Statement

Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:

    CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Here, IF NOT EXISTS is an optional clause. With it, the statement does nothing if a database with the same name already exists; without it, the statement fails with an error in that case. We can use SCHEMA in place of DATABASE in this command. The following query is executed to create a database named userdb:

    hive> CREATE DATABASE IF NOT EXISTS userdb;

or

    hive> CREATE SCHEMA userdb;

The following query is used to list the databases and verify the result:

    hive> SHOW DATABASES;
    default
    userdb

JDBC Program

The JDBC program to create a database is given below.

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveCreateDb {
       private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

       public static void main(String[] args) throws SQLException, ClassNotFoundException {
          // Register driver and create driver instance
          Class.forName(driverName);

          // get connection
          Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
          Statement stmt = con.createStatement();

          // CREATE DATABASE is a DDL statement and returns no result set
          stmt.execute("CREATE DATABASE userdb");
          System.out.println("Database userdb created successfully.");

          con.close();
       }
    }

Save the program in a file named HiveCreateDb.java. The following commands are used to compile and execute this program.

    $ javac HiveCreateDb.java
    $ java HiveCreateDb

Output:

    Database userdb created successfully.
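The effect of the optional IF NOT EXISTS clause can be modelled with a small in-memory catalog. This is a sketch under stated assumptions, not Hive's behaviour in code form: the class CatalogSketch and its createDatabase method are hypothetical, and an IllegalStateException stands in for the error Hive reports when a database already exists.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of CREATE DATABASE [IF NOT EXISTS].
final class CatalogSketch {

    private final Set<String> databases = new HashSet<>();

    CatalogSketch() {
        // Hive ships with a database named default.
        databases.add("default");
    }

    // Returns true if the database was created, false if it already existed
    // and ifNotExists was set. Without ifNotExists, an existing name is an error.
    boolean createDatabase(String name, boolean ifNotExists) {
        if (databases.contains(name)) {
            if (ifNotExists) {
                return false;
            }
            throw new IllegalStateException("Database " + name + " already exists");
        }
        databases.add(name);
        return true;
    }

    public static void main(String[] args) {
        CatalogSketch catalog = new CatalogSketch();
        System.out.println(catalog.createDatabase("userdb", false)); // true: created
        System.out.println(catalog.createDatabase("userdb", true));  // false: already there, no error
    }
}
```

Calling createDatabase("userdb", false) a second time would throw, which mirrors why scripts that may run more than once usually include IF NOT EXISTS.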