Apache Tajo – Database Creation

This chapter explains the Tajo DDL commands for databases. Tajo has a built-in database named default.

Create Database Statement

Create Database is a statement used to create a database in Tajo. The syntax for this statement is as follows −

CREATE DATABASE [IF NOT EXISTS] <database_name>

Query
default> create database if not exists test;

Result
The above query will generate the following result.
OK

A database is a namespace in Tajo. A database can contain multiple tables, each with a unique name.

Show Current Database

To check the current database name, issue the following command −

Query
default> \c

Result
The above query will generate the following result.
You are now connected to database "default" as user "user1".
default>

Connect to Database

As of now, you have created a database named "test". The following syntax is used to connect to the "test" database.

\c <database_name>

Query
default> \c test

Result
The above query will generate the following result.
You are now connected to database "test" as user "user1".
test>

You can now see the prompt change from the default database to the test database.

Drop Database

To drop a database, use the following syntax −

DROP DATABASE <database_name>

Query
test> \c default
You are now connected to database "default" as user "user1".
default> drop database test;

Result
The above query will generate the following result.
OK
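To tie these statements together, here is a minimal tsql session sketch. The table name staff and its columns are hypothetical and are used only to show that objects created inside the test database live in that namespace; the \d meta-command is assumed to list the tables of the current database.

default> create database if not exists test;
OK
default> \c test
You are now connected to database "test" as user "user1".
test> create table staff (id int, name text);
OK
test> \d
staff
test> drop table staff;
OK
test> \c default
You are now connected to database "default" as user "user1".
default> drop database test;
OK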

Apache Tajo – Installation

To install Apache Tajo, you must have the following software on your system −

Hadoop version 2.3 or greater
Java version 1.7 or higher
Linux or Mac OS

Let us now continue with the following steps to install Tajo.

Verifying Java Installation

Hopefully, you have already installed Java version 8 on your machine. Now, you just need to verify it using the following command −

$ java -version

If Java is successfully installed on your machine, you will see the version of the installed Java. If Java is not installed, follow these steps to install Java 8 on your machine.

Download JDK

Download the latest version of the JDK by visiting the following link −

https://www.oracle.com

The latest version is JDK 8u92 and the file is "jdk-8u92-linux-x64.tar.gz". Download the file to your machine, extract the files, and move them to a specific directory. Then set the Java alternatives. With that, Java is installed on your machine.

Verifying Hadoop Installation

You have already installed Hadoop on your system. Now, verify it using the following command −

$ hadoop version

If everything is fine with your setup, you will see the version of Hadoop. If Hadoop is not installed, download and install it by visiting the following link −

https://www.apache.org

Apache Tajo Installation

Apache Tajo provides two execution modes, local mode and fully distributed mode. After verifying the Java and Hadoop installations, proceed with the following steps to install a Tajo cluster on your machine. A local-mode Tajo instance requires only very simple configuration.

Download the latest version of Tajo by visiting the following link −

https://www.apache.org/dyn/closer.cgi/tajo

Now download the file "tajo-0.11.3.tar.gz" to your machine.

Extract Tar File

Extract the tar file by using the following commands −

$ cd opt/
$ tar -xzvf tajo-0.11.3.tar.gz
$ cd tajo-0.11.3

Set Environment Variable

Add the following changes to the "conf/tajo-env.sh" file −

$ cd tajo-0.11.3
$ vi conf/tajo-env.sh

# Hadoop home. Required
export HADOOP_HOME=/Users/path/to/Hadoop/hadoop-2.6.2

# The java implementation to use. Required.
export JAVA_HOME=/path/to/jdk1.8.0_92.jdk/

Here, you must specify the Hadoop and Java paths in the "tajo-env.sh" file. After the changes are made, save the file and quit the terminal.

Start Tajo Server

To launch the Tajo server, execute the following command −

$ bin/start-tajo.sh

You will receive a response similar to the following −

Starting single TajoMaster
starting master, logging to /Users/path/to/Tajo/tajo-0.11.3/bin/../logs/
localhost: starting worker, logging to /Users/path/to/Tajo/tajo-0.11.3/bin/../logs/

Tajo master web UI: http://localhost:26080
Tajo Client Service: localhost:26002

Now, type the command "jps" to see the running daemons.

$ jps
1010 TajoWorker
1140 Jps
933 TajoMaster

Launch Tajo Shell (Tsql)

To launch the Tajo shell client, use the following command −

$ bin/tsql

You will see the tsql welcome banner, which shows the Tajo version (0.11.3) and ends with the line "Try \? for help."

Quit Tajo Shell

Execute the following command to quit tsql −

default> \q
bye!

Here, default refers to the built-in default database in Tajo.

Web UI

Type the following URL to launch the Tajo web UI −

http://localhost:26080/

You will see the web UI home screen, which includes an ExecuteQuery option among others.
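Before stopping the server, you can run a quick sanity query in tsql to confirm that the installation works end to end. This is a minimal sketch; the query evaluates a constant expression, so it needs no tables, and the exact result formatting of your tsql version may differ slightly.

default> select 2 + 3 as result;
result
-------------------------------
5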
Stop Tajo

To stop the Tajo server, use the following command −

$ bin/stop-tajo.sh

You will get the following response −

localhost: stopping worker
stopping master

Apache Tajo – Configuration Settings

Tajo’s configuration is based on Hadoop’s configuration system. This chapter explains the Tajo configuration settings in detail.

Basic Settings

Tajo uses the following two config files −

catalog-site.xml − configuration for the catalog server.
tajo-site.xml − configuration for other Tajo modules.

Distributed Mode Configuration

A distributed mode setup runs on the Hadoop Distributed File System (HDFS). Let’s follow the steps to configure a Tajo distributed mode setup.

tajo-site.xml

This file is available in the /path/to/tajo/conf directory and acts as the configuration for other Tajo modules. To access Tajo in distributed mode, apply the following changes to "tajo-site.xml" −

<property>
   <name>tajo.rootdir</name>
   <value>hdfs://hostname:port/tajo</value>
</property>

<property>
   <name>tajo.master.umbilical-rpc.address</name>
   <value>hostname:26001</value>
</property>

<property>
   <name>tajo.master.client-rpc.address</name>
   <value>hostname:26002</value>
</property>

<property>
   <name>tajo.catalog.client-rpc.address</name>
   <value>hostname:26005</value>
</property>

Master Node Configuration

Tajo uses HDFS as the primary storage type. The configuration is as follows and should be added to "tajo-site.xml" −

<property>
   <name>tajo.rootdir</name>
   <value>hdfs://namenode_hostname:port/path</value>
</property>

Catalog Configuration

If you want to customize the catalog service, copy $path/to/Tajo/conf/catalog-site.xml.template to $path/to/Tajo/conf/catalog-site.xml and add any of the following configuration as needed. For example, if you use the "Hive catalog store" to access Tajo, then the configuration should be as follows −

<property>
   <name>tajo.catalog.store.class</name>
   <value>org.apache.tajo.catalog.store.HCatalogStore</value>
</property>

If you want to store the catalog in MySQL, then apply the following changes −

<property>
   <name>tajo.catalog.store.class</name>
   <value>org.apache.tajo.catalog.store.MySQLStore</value>
</property>

<property>
   <name>tajo.catalog.jdbc.connection.id</name>
   <value><mysql user name></value>
</property>

<property>
   <name>tajo.catalog.jdbc.connection.password</name>
   <value><mysql user password></value>
</property>

<property>
   <name>tajo.catalog.jdbc.uri</name>
   <value>jdbc:mysql://<mysql host name>:<mysql port>/<database name for tajo>?createDatabaseIfNotExist=true</value>
</property>

Similarly, you can register the other catalogs supported by Tajo in the configuration file.

Worker Configuration

By default, the TajoWorker stores temporary data on the local file system. This is defined in the "tajo-site.xml" file as follows −

<property>
   <name>tajo.worker.tmpdir.locations</name>
   <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value>
</property>

To increase the capacity of the tasks run by each worker, choose the following configuration −

<property>
   <name>tajo.worker.resource.cpu-cores</name>
   <value>12</value>
</property>

<property>
   <name>tajo.task.resource.min.memory-mb</name>
   <value>2000</value>
</property>

<property>
   <name>tajo.worker.resource.disks</name>
   <value>4</value>
</property>

To make the Tajo worker run in a dedicated mode, choose the following configuration −

<property>
   <name>tajo.worker.resource.dedicated</name>
   <value>true</value>
</property>
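After editing the configuration files and restarting the cluster, a quick way to check that the configured catalog store is reachable is a round trip through tsql. This is only a sketch using statements from the other chapters; the \l meta-command is assumed to list the databases registered in the catalog, and the exact listing depends on what already exists.

$ bin/tsql
default> create database if not exists config_check;
OK
default> \l
default
config_check
default> drop database config_check;
OK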

Apache Tajo – Custom Functions

Apache Tajo supports custom / user-defined functions (UDFs). The custom functions can be written in Python. A custom function is just a plain Python function with the decorator "@output_type(<tajo sql datatype>)", as follows −

@output_type("integer")
def sum_py(a, b):
    return a + b

Python scripts containing UDFs can be registered by adding the configuration below to "tajo-site.xml".

<property>
   <name>tajo.function.python.code-dir</name>
   <value>file:///path/to/script1.py,file:///path/to/script2.py</value>
</property>

Once the scripts are registered, restart the cluster and the UDFs will be available in SQL queries, as follows −

select sum_py(10, 10) as pyfn;

Apache Tajo supports user-defined aggregate functions as well, but does not support user-defined window functions.
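Once registered, a UDF can be used anywhere a built-in function can appear, including in WHERE clauses. The query below is only a sketch; it assumes a hypothetical table named sales with integer columns id, price, and tax.

select id, sum_py(price, tax) as total
from sales
where sum_py(price, tax) > 100;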

Apache Tajo – DateTime Functions

Apache Tajo supports the following DateTime functions.

1. add_days(date date or timestamp, int day) − Returns the date increased by the given day value.
2. add_months(date date or timestamp, int month) − Returns the date increased by the given month value.
3. current_date() − Returns today’s date.
4. current_time() − Returns the current time.
5. extract(century from date/timestamp) − Extracts the century from the given parameter.
6. extract(day from date/timestamp) − Extracts the day from the given parameter.
7. extract(decade from date/timestamp) − Extracts the decade from the given parameter.
8. extract(dow from date/timestamp) − Extracts the day of week from the given parameter.
9. extract(doy from date/timestamp) − Extracts the day of year from the given parameter.
10. extract(hour from timestamp) − Extracts the hour from the given parameter.
11. extract(isodow from timestamp) − Extracts the day of week from the given parameter. This is identical to dow except for Sunday, and matches the ISO 8601 day-of-the-week numbering.
12. extract(isoyear from date) − Extracts the ISO year from the specified date. The ISO year may be different from the Gregorian year.
13. extract(microseconds from time) − Extracts the microseconds from the given parameter; the seconds field, including fractional parts, multiplied by 1,000,000.
14. extract(millennium from timestamp) − Extracts the millennium from the given parameter. One millennium corresponds to 1000 years; hence, the third millennium started January 1, 2001.
15. extract(milliseconds from time) − Extracts the milliseconds from the given parameter.
16. extract(minute from timestamp) − Extracts the minute from the given parameter.
17. extract(quarter from timestamp) − Extracts the quarter of the year (1 to 4) from the given parameter.
18. date_part(field text, source date or timestamp or time) − Extracts the given field (passed as text) from the source date, timestamp, or time.
19. now() − Returns the current timestamp.
20. to_char(timestamp, format text) − Converts a timestamp to text.
21. to_date(src text, format text) − Converts text to a date.
22. to_timestamp(src text, format text) − Converts text to a timestamp.
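Several of these functions can be combined in a single query. The example below is illustrative only; it assumes the 'YYYY-MM-DD' pattern is accepted by to_date, and it uses constant inputs so no table is required.

default> select to_date('2016-08-15', 'YYYY-MM-DD') as d,
         add_days(to_date('2016-08-15', 'YYYY-MM-DD'), 10) as plus_ten_days,
         extract(quarter from now()) as current_quarter;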

Apache Tajo – Architecture

This chapter describes the architecture of Apache Tajo. The following list describes each of its components in detail.

1. Client − Submits SQL statements to the Tajo Master to get the result.
2. Master − The main daemon. It is responsible for query planning and is the coordinator for workers.
3. Catalog server − Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via a JDBC client.
4. Worker − The master node assigns tasks to worker nodes. A TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly.
5. Query Master − The Tajo master assigns a query to a Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches TaskRunners and schedules tasks to them. The main role of the Query Master is to monitor the running tasks and report them to the Master node.
6. Node Managers − Manage the resources of the worker nodes. A node manager decides on allocating requests to its node.
7. TaskRunner − Acts as a local query execution engine. It is used to run and monitor the query process. A TaskRunner processes one task at a time. Each task has the following three main attributes − a logical plan (the execution block which created the task), a fragment (an input path, an offset range, and a schema), and fetch URIs.
8. Query Executor − Used to execute a query.
9. Storage service − Connects the underlying data storage to Tajo.

Workflow

Tajo uses the Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across the cluster nodes. The master is mainly responsible for query planning and is the coordinator for workers. The master divides a query into small tasks and assigns them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques.

The web-based interface of Tajo has the following capabilities −

Option to find how the submitted queries are planned
Option to find how the queries are distributed across nodes
Option to check the status of the cluster and nodes

Apache Tajo – Integration with Hive

Tajo supports the HiveCatalogStore to integrate with Apache Hive. This integration allows Tajo to access tables in Apache Hive.

Set Environment Variable

Add the following changes to the "conf/tajo-env.sh" file −

$ vi conf/tajo-env.sh
export HIVE_HOME=/path/to/hive

After you have included the Hive path, Tajo will add the Hive library files to the classpath.

Catalog Configuration

Add the following changes to the "conf/catalog-site.xml" file −

$ vi conf/catalog-site.xml

<property>
   <name>tajo.catalog.store.class</name>
   <value>org.apache.tajo.catalog.store.HiveCatalogStore</value>
</property>

Once the HiveCatalogStore is configured, you can access Hive’s tables in Tajo.
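With the HiveCatalogStore in place, Hive databases and tables are expected to appear in Tajo’s catalog and can be queried with ordinary SQL. The session below is only a sketch; sales_db and orders are hypothetical Hive objects, and \c and \d are the connect and list-tables meta-commands shown in the earlier chapters.

default> \c sales_db
You are now connected to database "sales_db" as user "user1".
sales_db> \d
orders
sales_db> select count(*) from orders;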

Apache Tajo – String Functions

The following list describes the string functions in Tajo.

1. concat(string1, ..., stringN) − Concatenates the given strings.
2. length(string) − Returns the length of the given string.
3. lower(string) − Returns the given string in lowercase.
4. upper(string) − Returns the given string in uppercase.
5. ascii(string text) − Returns the ASCII code of the first character of the text.
6. bit_length(string text) − Returns the number of bits in a string.
7. char_length(string text) − Returns the number of characters in a string.
8. octet_length(string text) − Returns the number of bytes in a string.
9. digest(input text, method text) − Calculates the digest hash of the string. Here, the second argument, method, refers to the hash method.
10. initcap(string text) − Converts the first letter of each word to upper case.
11. md5(string text) − Calculates the MD5 hash of the string.
12. left(string text, int size) − Returns the first n characters of the string.
13. right(string text, int size) − Returns the last n characters of the string.
14. locate(source text, target text, start_index) − Returns the location of the specified substring.
15. strposb(source text, target text) − Returns the binary location of the specified substring.
16. substr(source text, start index, length) − Returns the substring of the specified length.
17. trim(string text[, characters text]) − Removes the given characters (a space by default) from the start, the end, or both ends of the string.
18. split_part(string text, delimiter text, field int) − Splits a string on the delimiter and returns the given field (counting from one).
19. regexp_replace(string text, pattern text, replacement text) − Replaces substrings that match a given regular expression pattern.
20. reverse(string) − Returns the reversed string.
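The following query shows a few of these functions together. It is illustrative only and evaluates constant expressions, so it does not require any table.

default> select concat('apache', ' ', 'tajo') as name,
         upper('tajo') as shouted,
         length('tajo') as len,
         split_part('a,b,c', ',', 2) as second_field;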