Apache Tajo – Configuration Settings

Tajo’s configuration is based on Hadoop’s configuration system. This chapter explains the Tajo configuration settings in detail.

Basic Settings

Tajo uses the following two config files:

  - catalog-site.xml: configuration for the catalog server.
  - tajo-site.xml: configuration for the other Tajo modules.

Distributed Mode Configuration

A distributed mode setup runs on the Hadoop Distributed File System (HDFS). Follow the steps below to configure Tajo in distributed mode.

tajo-site.xml

This file is located in the /path/to/tajo/conf directory and holds the configuration for the other Tajo modules. To run Tajo in distributed mode, apply the following changes to “tajo-site.xml”.

    <property>
       <name>tajo.rootdir</name>
       <value>hdfs://hostname:port/tajo</value>
    </property>

    <property>
       <name>tajo.master.umbilical-rpc.address</name>
       <value>hostname:26001</value>
    </property>

    <property>
       <name>tajo.master.client-rpc.address</name>
       <value>hostname:26002</value>
    </property>

    <property>
       <name>tajo.catalog.client-rpc.address</name>
       <value>hostname:26005</value>
    </property>

Master Node Configuration

Tajo uses HDFS as its primary storage type. Add the following configuration to “tajo-site.xml”.

    <property>
       <name>tajo.rootdir</name>
       <value>hdfs://namenode_hostname:port/path</value>
    </property>

Catalog Configuration

To customize the catalog service, copy $path/to/Tajo/conf/catalog-site.xml.template to $path/to/Tajo/conf/catalog-site.xml and add any of the following configuration as needed.
For example, if you use the “Hive catalog store” to access Tajo, the configuration should look like the following:

    <property>
       <name>tajo.catalog.store.class</name>
       <value>org.apache.tajo.catalog.store.HCatalogStore</value>
    </property>

To store the catalog in MySQL, apply the following changes:

    <property>
       <name>tajo.catalog.store.class</name>
       <value>org.apache.tajo.catalog.store.MySQLStore</value>
    </property>

    <property>
       <name>tajo.catalog.jdbc.connection.id</name>
       <value><mysql user name></value>
    </property>

    <property>
       <name>tajo.catalog.jdbc.connection.password</name>
       <value><mysql user password></value>
    </property>

    <property>
       <name>tajo.catalog.jdbc.uri</name>
       <value>jdbc:mysql://<mysql host name>:<mysql port>/<database name for tajo>?createDatabaseIfNotExist=true</value>
    </property>

Similarly, you can register the other catalogs that Tajo supports in the configuration file.

Worker Configuration

By default, the TajoWorker stores temporary data on the local file system. The locations are defined in the “tajo-site.xml” file as follows:

    <property>
       <name>tajo.worker.tmpdir.locations</name>
       <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value>
    </property>

To increase the number of tasks each worker can run, adjust its resource settings as follows:

    <property>
       <name>tajo.worker.resource.cpu-cores</name>
       <value>12</value>
    </property>

    <property>
       <name>tajo.task.resource.min.memory-mb</name>
       <value>2000</value>
    </property>

    <property>
       <name>tajo.worker.resource.disks</name>
       <value>4</value>
    </property>

To make the Tajo worker run in dedicated mode, use the following configuration:

    <property>
       <name>tajo.worker.resource.dedicated</name>
       <value>true</value>
    </property>
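The property blocks above all share the same name/value shape, so they are easy to generate from a script when many nodes need the same settings. The following is a minimal sketch, not part of Tajo itself: the helper build_site_xml is a hypothetical name, port 9000 is a placeholder, and the sketch assumes the file uses the usual Hadoop-style <configuration> root element.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of settings as Hadoop-style <configuration> XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder host, port, and paths; substitute the values for your own cluster.
worker_settings = {
    "tajo.rootdir": "hdfs://namenode_hostname:9000/tajo",
    "tajo.worker.tmpdir.locations": "/disk1/tmpdir,/disk2/tmpdir",
    "tajo.worker.resource.cpu-cores": 12,
    "tajo.worker.resource.dedicated": "true",
}

site_xml = build_site_xml(worker_settings)
print(site_xml)
```

The printed text can be saved as tajo-site.xml in the conf directory. Keeping the settings in one dictionary makes it easier to review what differs between nodes.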

Apache Tajo – Custom Functions

Apache Tajo supports custom, or user defined, functions (UDFs). Custom functions can be written in Python. A custom function is a plain Python function with the decorator “@output_type(<tajo sql datatype>)”, as follows:

    @output_type("integer")
    def sum_py(a, b):
        return a + b

Python scripts that contain UDFs are registered by adding the following configuration to “tajo-site.xml”.

    <property>
       <name>tajo.function.python.code-dir</name>
       <value>file:///path/to/script1.py,file:///path/to/script2.py</value>
    </property>

Once the scripts are registered, restart the cluster. The UDFs are then available directly in SQL queries, as follows:

    select sum_py(10, 10) as pyfn;

Apache Tajo also supports user defined aggregate functions, but it does not support user defined window functions.
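Before registering a script, it can help to exercise the functions locally. The sketch below assumes nothing about Tajo's runtime: it defines a stand-in output_type decorator (in a real deployment the decorator is supplied by Tajo) so that the script runs in a plain Python interpreter. The second function, full_name, is a hypothetical UDF added only for illustration.

```python
def output_type(sql_type):
    """Stand-in for Tajo's decorator: records the declared SQL return type."""
    def decorate(func):
        func.output_type = sql_type
        return func
    return decorate


@output_type("integer")
def sum_py(a, b):
    return a + b


@output_type("text")
def full_name(first, last):
    # Hypothetical UDF, shown only to illustrate a text-returning function.
    return first + " " + last


print(sum_py(10, 10))
print(sum_py.output_type)
```

Remove the stand-in decorator before deploying the script, so that the function picks up the decorator provided by Tajo instead.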

Apache Tajo – DateTime Functions

Apache Tajo supports the following DateTime functions.

S.No.  Function & Description

1. add_days(date date or timestamp, int day)
   Returns the date with the given number of days added.
2. add_months(date date or timestamp, int month)
   Returns the date with the given number of months added.
3. current_date()
   Returns today’s date.
4. current_time()
   Returns the current time.
5. extract(century from date/timestamp)
   Extracts the century from the given parameter.
6. extract(day from date/timestamp)
   Extracts the day from the given parameter.
7. extract(decade from date/timestamp)
   Extracts the decade from the given parameter.
8. extract(dow from date/timestamp)
   Extracts the day of the week from the given parameter.
9. extract(doy from date/timestamp)
   Extracts the day of the year from the given parameter.
10. extract(hour from timestamp)
    Extracts the hour from the given parameter.
11. extract(isodow from timestamp)
    Extracts the day of the week from the given parameter using the ISO 8601 numbering, Monday (1) to Sunday (7). This is identical to dow except for Sunday.
12. extract(isoyear from date)
    Extracts the ISO year from the specified date. The ISO year may differ from the Gregorian year.
13. extract(microseconds from time)
    Extracts microseconds from the given parameter: the seconds field, including fractional parts, multiplied by 1,000,000.
14. extract(millennium from timestamp)
    Extracts the millennium from the given parameter. One millennium corresponds to 1000 years, so the third millennium started on January 1, 2001.
15. extract(milliseconds from time)
    Extracts milliseconds from the given parameter.
16. extract(minute from timestamp)
    Extracts the minute from the given parameter.
17. extract(quarter from timestamp)
    Extracts the quarter of the year (1 to 4) from the given parameter.
18. date_part(field text, source date or timestamp or time)
    Extracts the field named by the text argument from the source value.
19. now()
    Returns the current timestamp.
20. to_char(timestamp, format text)
    Converts a timestamp to text.
21. to_date(src text, format text)
    Converts text to a date.
22. to_timestamp(src text, format text)
    Converts text to a timestamp.
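Several of the extract fields above follow calendar rules that are easy to get wrong by one, such as isoyear, isodow, century, and millennium. The sketch below reproduces those rules with Python's standard datetime module so the expected values can be checked by hand. It is an illustration of the arithmetic, not Tajo's implementation, and it assumes that dow numbers Sunday as 0 (the PostgreSQL convention).

```python
from datetime import date


def dow(d):
    """Day of week, assuming Sunday = 0 through Saturday = 6."""
    return d.isoweekday() % 7


def isodow(d):
    """ISO 8601 day of week: Monday = 1 through Sunday = 7."""
    return d.isoweekday()


def isoyear(d):
    """ISO 8601 week-numbering year, which can differ from d.year."""
    return d.isocalendar()[0]


def century(d):
    """Centuries are counted from year 1, so 2000 is still the 20th."""
    return (d.year - 1) // 100 + 1


def millennium(d):
    """Millennia are counted from year 1, so the third begins in 2001."""
    return (d.year - 1) // 1000 + 1


def quarter(d):
    """Quarter of the year, 1 to 4."""
    return (d.month - 1) // 3 + 1


sunday = date(2006, 1, 1)
print(dow(sunday), isodow(sunday), isoyear(sunday))
print(millennium(date(2000, 12, 31)), millennium(date(2001, 1, 1)))
```

January 1, 2006 fell on a Sunday that belongs to the last ISO week of 2005, which is why its ISO year differs from its Gregorian year.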

Apache Tajo – Architecture

The following illustration depicts the architecture of Apache Tajo.

The following table describes each of the components in detail.

S.No.  Component & Description

1. Client
   The client submits SQL statements to the Tajo Master to get the result.
2. Master
   The Master is the main daemon. It is responsible for query planning and coordinates the workers.
3. Catalog server
   Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via a JDBC client.
4. Worker
   The master node assigns tasks to worker nodes. The TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly.
5. Query Master
   The Tajo master assigns a query to the Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches the TaskRunner and schedules tasks to it. The main role of the Query Master is to monitor the running tasks and report them to the Master node.
6. Node Managers
   Manage the resources of the worker node and decide on allocating requests to the node.
7. TaskRunner
   Acts as a local query execution engine. It is used to run and monitor the query process. The TaskRunner processes one task at a time. A task has the following three main attributes:
     - Logical plan: the execution block that created the task.
     - A fragment: an input path, an offset range, and a schema.
     - Fetch URIs.
8. Query Executor
   Executes a query.
9. Storage service
   Connects the underlying data storage to Tajo.

Workflow

Tajo uses the Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes.

The master is mainly responsible for query planning and for coordinating the workers. The master divides a query into small tasks and assigns them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators.

In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques.

The web-based interface of Tajo has the following capabilities:

  - Option to find how the submitted queries are planned
  - Option to find how the queries are distributed across nodes
  - Option to check the status of the cluster and nodes
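The idea of dividing a query's input into fragments (an input path plus an offset range) and handing them to workers can be pictured with a small sketch. Everything below is hypothetical and is not Tajo's scheduler or API: the names Fragment, split_into_fragments, and assign_round_robin are invented for illustration, the schema attribute is omitted, and a real scheduler would also weigh data locality and worker load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    path: str    # input file
    offset: int  # first byte of the range
    length: int  # number of bytes in the range


def split_into_fragments(path, file_size, fragment_size):
    """Cut one input file into contiguous byte ranges, one per task."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [
        Fragment(path, offset, min(fragment_size, file_size - offset))
        for offset in range(0, file_size, fragment_size)
    ]


def assign_round_robin(fragments, workers):
    """Hand fragments to workers in turn (a deliberately naive policy)."""
    plan = {worker: [] for worker in workers}
    for i, fragment in enumerate(fragments):
        plan[workers[i % len(workers)]].append(fragment)
    return plan


fragments = split_into_fragments("/tajo/warehouse/t1/part-0", 250, 100)
plan = assign_round_robin(fragments, ["worker1", "worker2"])
for worker, assigned in plan.items():
    print(worker, [(f.offset, f.length) for f in assigned])
```

Because each fragment is independent, adding workers spreads the same fragments over more nodes, which is the intuition behind capacity growing with the number of TajoWorkers.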

Apache Tajo – Integration with Hive

Tajo supports the HiveCatalogStore to integrate with Apache Hive. This integration allows Tajo to access tables in Apache Hive.

Set Environment Variable

Add the following changes to the “conf/tajo-env.sh” file.

    $ vi conf/tajo-env.sh
    export HIVE_HOME=/path/to/hive

After you have included the Hive path, Tajo adds the Hive library files to the classpath.

Catalog Configuration

Add the following changes to the “conf/catalog-site.xml” file.

    $ vi conf/catalog-site.xml

    <property>
       <name>tajo.catalog.store.class</name>
       <value>org.apache.tajo.catalog.store.HiveCatalogStore</value>
    </property>

Once HiveCatalogStore is configured, you can access Hive’s tables in Tajo.

Apache Tajo – Operators

The following operators are used in Tajo to perform the desired operations.

S.No.  Operator & Description

1. Arithmetic operators
   Tajo supports arithmetic operators such as +, -, *, /, %.
2. Relational operators
   <, >, <=, >=, =, <>
3. Logical operators
   AND, OR, NOT
4. String operators
   The ‘||’ operator performs string concatenation.
5. Range operators
   A range operator tests whether a value falls in a specific range. Tajo supports the BETWEEN, IS NULL, and IS NOT NULL operators.

Apache Tajo – Database Creation

This section explains the Tajo DDL commands. Tajo has a built-in database named default.

Create Database Statement

CREATE DATABASE is the statement used to create a database in Tajo. The syntax for this statement is as follows:

    CREATE DATABASE [IF NOT EXISTS] <database_name>

Query

    default> create database if not exists test;

Result

The above query will generate the following result.

    OK

A database is a namespace in Tajo. A database can contain multiple tables, each with a unique name.

Show Current Database

To check the current database name, issue the following command:

Query

    default> \c

Result

The above query will generate the following result.

    You are now connected to database "default" as user "user1".
    default>

Connect to Database

You have now created a database named “test”. The following syntax is used to connect to the “test” database.

    \c <database name>

Query

    default> \c test

Result

The above query will generate the following result.

    You are now connected to database "test" as user "user1".
    test>

The prompt changes from the default database to the test database.

Drop Database

To drop a database, use the following syntax:

    DROP DATABASE <database-name>

Query

    test> \c default
    You are now connected to database "default" as user "user1".
    default> drop database test;

Result

The above query will generate the following result.

    OK

Apache Tajo – String Functions

The following table lists the string functions in Tajo.

S.No.  Function & Description

1. concat(string1, …, stringN)
   Concatenates the given strings.
2. length(string)
   Returns the length of the given string.
3. lower(string)
   Returns the given string in lowercase.
4. upper(string)
   Returns the given string in uppercase.
5. ascii(string text)
   Returns the ASCII code of the first character of the text.
6. bit_length(string text)
   Returns the number of bits in a string.
7. char_length(string text)
   Returns the number of characters in a string.
8. octet_length(string text)
   Returns the number of bytes in a string.
9. digest(input text, method text)
   Calculates the digest hash of the string. The second argument, method, names the hash method.
10. initcap(string text)
    Converts the first letter of each word to upper case.
11. md5(string text)
    Calculates the MD5 hash of the string.
12. left(string text, int size)
    Returns the first n characters in the string.
13. right(string text, int size)
    Returns the last n characters in the string.
14. locate(source text, target text, start_index)
    Returns the location of the specified substring.
15. strposb(source text, target text)
    Returns the binary location of the specified substring.
16. substr(source text, start index, length)
    Returns the substring of the specified length.
17. trim(string text[, characters text])
    Removes the characters (a space by default) from the start, the end, or both ends of the string.
18. split_part(string text, delimiter text, field int)
    Splits a string on the delimiter and returns the given field (counting from one).
19. regexp_replace(string text, pattern text, replacement text)
    Replaces substrings that match a given regular expression pattern.
20. reverse(string)
    Reverses the string.
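A few of the functions above differ in ways that are easiest to see with concrete values: split_part counts fields from one, and char_length, octet_length, and bit_length diverge as soon as the text contains non-ASCII characters. The sketch below gives rough Python analogues for comparison. They are not Tajo's implementations, and they rest on assumptions: text is encoded as UTF-8, an out-of-range field yields an empty string, and size is never negative.

```python
def split_part(text, delimiter, field):
    """Return the field-th piece of text, counting from one."""
    parts = text.split(delimiter)
    # Assumption: a field outside the available range yields an empty string.
    return parts[field - 1] if 1 <= field <= len(parts) else ""


def left(text, size):
    """First `size` characters (size is assumed to be non-negative)."""
    return text[:size]


def right(text, size):
    """Last `size` characters (size is assumed to be non-negative)."""
    return text[-size:] if size > 0 else ""


def char_length(text):
    """Number of characters."""
    return len(text)


def octet_length(text, encoding="utf-8"):
    """Number of bytes once the text is encoded (UTF-8 assumed)."""
    return len(text.encode(encoding))


def bit_length(text, encoding="utf-8"):
    """Number of bits, which is eight per byte."""
    return 8 * octet_length(text, encoding)


print(split_part("a,b,c", ",", 2))
print(char_length("café"), octet_length("café"), bit_length("café"))
```

The accented character occupies two bytes in UTF-8, so the byte count of "café" is one more than its character count.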