Apache Tajo – Quick Guide


Table of Contents

Apache Tajo – Quick Guide


”;


Apache Tajo – Introduction

Distributed Data Warehouse System

Data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization but relational data volumes are increased day by day.

To overcome the challenges, distributed data warehouse system shares data across multiple data repositories for the purpose of Online Analytical Processing(OLAP). Each data warehouse may belong to one or more organizations. It performs load balancing and scalability. Metadata is replicated and centrally distributed.

Apache Tajo is a distributed data warehouse system which uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of MapReduce framework.

Overview of SQL on Hadoop

Hadoop is an open-source framework that allows to store and process big data in a distributed environment. It is extremely fast and powerful. However, Hadoop has limited querying capabilities so its performance can be made even better with the help of SQL on Hadoop. This allows users to interact with Hadoop through easy SQL commands.

Some of the examples of SQL on Hadoop applications are Hive, Impala, Drill, Presto, Spark, HAWQ and Apache Tajo.

What is Apache Tajo

Apache Tajo is a relational and distributed data processing framework. It is designed for low latency and scalable ad-hoc query analysis.

  • Tajo supports standard SQL and various data formats. Most of the Tajo queries can be executed without any modification.

  • Tajo has fault-tolerance through a restart mechanism for failed tasks and extensible query rewrite engine.

  • Tajo performs the necessary ETL (Extract Transform and Load process) operations to summarize large datasets stored on HDFS. It is an alternative choice to Hive/Pig.

The latest version of Tajo has greater connectivity to Java programs and third-party databases such as Oracle and PostGreSQL.

Features of Apache Tajo

Apache Tajo has the following features −

  • Superior scalability and optimized performance
  • Low latency
  • User-defined functions
  • Row/columnar storage processing framework.
  • Compatibility with HiveQL and Hive MetaStore
  • Simple data flow and easy maintenance.

Benefits of Apache Tajo

Apache Tajo offers the following benefits −

  • Easy to use
  • Simplified architecture
  • Cost-based query optimization
  • Vectorized query execution plan
  • Fast delivery
  • Simple I/O mechanism and supports various type of storage.
  • Fault tolerance

Use Cases of Apache Tajo

The following are some of the use cases of Apache Tajo −

Data warehousing and analysis

Korea’s SK Telecom firm ran Tajo against 1.7 terabytes worth of data and found it could complete queries with greater speed than either Hive or Impala.

Data discovery

The Korean music streaming service Melon uses Tajo for analytical processing. Tajo executes ETL (extract-transform-load process) jobs 1.5 to 10 times faster than Hive.

Log analysis

Bluehole Studio, a Korean based company developed TERA — a fantasy multiplayer online game. The company uses Tajo for game log analysis and finding principal causes of service quality interrupts.

Storage and Data Formats

Apache Tajo supports the following data formats −

  • JSON
  • Text file(CSV)
  • Parquet
  • Sequence File
  • AVRO
  • Protocol Buffer
  • Apache Orc

Tajo supports the following storage formats −

  • HDFS
  • JDBC
  • Amazon S3
  • Apache HBase
  • Elasticsearch

Apache Tajo – Architecture

The following illustration depicts the architecture of Apache Tajo.

Architecture

The following table describes each of the components in detail.

S.No. Component & Description
1

Client

Client submits the SQL statements to the Tajo Master to get the result.

2

Master

Master is the main daemon. It is responsible for query planning and is the coordinator for workers.

3

Catalog server

Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via JDBC client.

4

Worker

Master node assigns task to worker nodes. TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly.

5

Query Master

Tajo master assigns query to the Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches the TaskRunner and schedules tasks to TaskRunner. The main role of the Query Master is to monitor the running tasks and report them to the Master node.

6

Node Managers

Manages the resource of the worker node. It decides on allocating requests to the node.

7

TaskRunner

Acts as a local query execution engine. It is used to run and monitor query process. The TaskRunner processes one task at a time.

It has the following three main attributes −

  • Logical plan − An execution block which created the task.
  • A fragment − an input path, an offset range, and schema.
  • Fetches URIs
8

Query Executor

It is used to execute a query.

9

Storage service

Connects the underlying data storage to Tajo.

Workflow

Tajo uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes.

The master is mainly responsible for query planning and the coordinator for workers. The master divides a query into small tasks and assigns to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators.

In addition, Tajo can control distributed data flow more flexible than that of MapReduce and supports indexing techniques.

The web-based interface of Tajo has the following capabilities −

  • Option to find how the submitted queries are planned
  • Option to find how the queries are distributed across nodes
  • Option to check the status of the cluster and nodes

Apache Tajo – Installation

To install Apache Tajo, you must have the following software on your system −

  • Hadoop version 2.3 or greater
  • Java version 1.7 or higher
  • Linux or Mac OS

Let us now continue with the following steps to install Tajo.

Verifying Java installation

Hopefully, you have already installed Java version 8 on your machine. Now, you just need to proceed by verifying it.

To verify, use the following command −

$ java -version

If Java is successfully installed on your machine, you could see the present version of the installed Java. If Java is not installed follow these steps to install Java 8 on your machine.

Download JDK

Download the latest version of JDK by visiting the following link and then, download the latest version.

https://www.oracle.com

The latest version is JDK 8u 92 and the file is “jdk-8u92-linux-x64.tar.gz”. Please download the file on your machine. Following this, extract the files and move them to a specific directory. Now, set the Java alternatives. Finally, Java is installed on your machine.

Verifying Hadoop Installation

You have already installed Hadoop on your system. Now, verify it using the following command −

$ hadoop version 

If everything is fine with your setup, then you could see the version of Hadoop. If Hadoop is not installed, download and install Hadoop by visiting the following link − https://www.apache.org

Apache Tajo Installation

Apache Tajo provides two execution modes — local mode and fully distributed mode. After verifying Java and Hadoop installation proceed with the following steps to install Tajo cluster on your machine. A local mode Tajo instance requires very easy configurations.

Download the latest version of Tajo by visiting the following link − https://www.apache.org/dyn/closer.cgi/tajo

Now you can download the file “tajo-0.11.3.tar.gz” from your machine.

Extract Tar File

Extract the tar file by using the following command −

$ cd opt/ 
$ tar tajo-0.11.3.tar.gz 
$ cd tajo-0.11.3

Set Environment Variable

Add the following changes to “conf/tajo-env.sh” file

$ cd tajo-0.11.3 
$ vi conf/tajo-env.sh
 
# Hadoop home. Required 
export HADOOP_HOME = /Users/path/to/Hadoop/hadoop-2.6.2
 
# The java implementation to use.  Required. 
export JAVA_HOME = /path/to/jdk1.8.0_92.jdk/ 

Here, you must specify Hadoop and Java path to “tajo-env.sh” file. After the changes are made, save the file and quit the terminal.

Start Tajo Server

To launch the Tajo server, execute the following command −

$ bin/start-tajo.sh 

You will receive a response similar to the following −

Starting single TajoMaster 
starting master, logging to /Users/path/to/Tajo/tajo-0.11.3/bin/../ 
localhost: starting worker, logging to /Users/path/toe/Tajo/tajo-0.11.3/bin/../logs/
 
Tajo master web UI: http://local:26080 
Tajo Client Service: local:26002

Now, type the command “jps” to see the running daemons.

$ jps  
1010 TajoWorker 
1140 Jps 
933 TajoMaster 

Launch Tajo Shell (Tsql)

To launch the Tajo shell client, use the following command −

$ bin/tsql

You will receive the following output −

welcome to
   _____ ___  _____ ___ 
  /_  _/ _  |/_  _/   / 
   / // /_| |_/ // / / 
  /_//_/ /_/___/ __/  0.11.3
  
Try ? for help. 

Quit Tajo Shell

Execute the following command to quit Tsql −

default> q 
bye! 

Here, the default refers to the catalog in Tajo.

Web UI

Type the following URL to launch the Tajo web UI − http://localhost:26080/

You will now see the following screen which is similar to the ExecuteQuery option.

Web UI

Stop Tajo

To stop the Tajo server, use the following command −

$ bin/stop-tajo.sh 

You will get the following response −

localhost: stopping worker 
stopping master

Apache Tajo – Configuration Settings

Tajo’s configuration is based on Hadoop’s configuration system. This chapter explains Tajo configuration settings in detail.

Basic Settings

Tajo uses the following two config files −

  • catalog-site.xml − configuration for the catalog server.
  • tajo-site.xml − configuration for other Tajo modules.

Distributed Mode Configuration

Distributed mode setup runs on Hadoop Distributed File System (HDFS). Let’s follow the steps to configure Tajo distributed mode setup.

tajo-site.xml

This file is available @ /path/to/tajo/conf directory and acts as configuration for other Tajo modules. To access Tajo in a distributed mode, apply the following changes to “tajo-site.xml”.

<property> 
   <name>tajo.rootdir</name> 
   <value>hdfs://hostname:port/tajo</value> 
</property>
  
<property> 
   <name>tajo.master.umbilical-rpc.address</name> 
   <value>hostname:26001</value> 
</property> 
 
<property> 
   <name>tajo.master.client-rpc.address</name> 
   <value>hostname:26002</value> 
</property>
  
<property> 
   <name>tajo.catalog.client-rpc.address</name> 
   <value>hostname:26005</value> 
</property>   

Master Node Configuration

Tajo uses HDFS as a primary storage type. The configuration is as follows and should be added to “tajo-site.xml”.

<property> 
   <name>tajo.rootdir</name> 
   <value>hdfs://namenode_hostname:port/path</value> 
</property> 

Catalog Configuration

If you want to customize the catalog service, copy $path/to/Tajo/conf/catalogsite.xml.template to $path/to/Tajo/conf/catalog-site.xml and add any of the following configuration as needed.

For example, if you use “Hive catalog store” to access Tajo, then the configuration should be like the following −

<property> 
   <name>tajo.catalog.store.class</name> 
   <value>org.apache.tajo.catalog.store.HCatalogStore</value> 
</property> 

If you need to store MySQL catalog, then apply the following changes −

<property> 
   <name>tajo.catalog.store.class</name> 
   <value>org.apache.tajo.catalog.store.MySQLStore</value> 
</property> 

<property> 
   <name>tajo.catalog.jdbc.connection.id</name> 
   <value><mysql user name></value> 
</property>
 
<property> 
   <name>tajo.catalog.jdbc.connection.password</name> 
   <value><mysql user password></value> 
</property>
 
<property> 
   <name>tajo.catalog.jdbc.uri</name> 
   <value>jdbc:mysql://<mysql host name>:<mysql port>/<database name for tajo>
      ?createDatabaseIfNotExist = true</value> 
</property> 

Similarly, you can register the other Tajo supported catalogs in the configuration file.

Worker Configuration

By default, the TajoWorker stores temporary data on the local file system. It is defined in the “tajo-site.xml” file as follows −

<property> 
   <name>tajo.worker.tmpdir.locations</name> 
   <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value> 
</property> 

To increase the capacity of running tasks of each worker resource, choose the following configuration −

<property> 
   <name>tajo.worker.resource.cpu-cores</name> 
   <value>12</value> 
</property>
 
<property> 
   <name>tajo.task.resource.min.memory-mb</name> 
   <value>2000</value> 
</property>
  
<property> 
   <name>tajo.worker.resource.disks</name> 
   <value>4</value> 
</property> 

To make the Tajo worker run in a dedicated mode, choose the following configuration −

<property> 
   <name>tajo.worker.resource.dedicated</name> 
   <value>true</value> 
</property> 

Apache Tajo – Shell Commands

In this chapter, we will understand the Tajo Shell commands in detail.

To execute the Tajo shell commands, you need to start the Tajo server and the Tajo shell using the following commands −

Start server

$ bin/start-tajo.sh 

Start Shell

$ bin/tsql 

The above commands are now ready for execution.

Meta Commands

Let us now discuss the Meta Commands. Tsql meta commands start with a backslash (‘’).

Help Command

“?” Command is used to show the help option.

Query

default> ? 

Result

The above ? Command list out all the basic usage options in Tajo. You will receive the following output −

Command

List Database

To list out all the databases in Tajo, use the following command −

Query

default> l 

Result

You will receive the following output −

information_schema 
default

At present, we have not created any database so it shows two built in Tajo databases.

Current Database

c option is used to display the current database name.

Query

default> c

Result

You are now connected to database “default” as user “username”.

List out Built-in Functions

To list out all the built-in function, type the query as follows −

Query

default> df 

Result

You will receive the following output −

List out Builtin Functions

Describe Function

df function name − This query returns the complete description of the given function.

Query

default> df sqrt 

Result

You will receive the following output −

Describe Function

Quit Terminal

To quit the terminal, type the following query −

Query

default> q 

Result

You will receive the following output −

bye! 

Admin Commands

Tajo shell provides admin option to list out all the admin features.

Query

default> admin 

Result

You will receive the following output −

Admin Commands

Cluster Info

To display the cluster information in Tajo, use the following query

Query

default> admin -cluster 

Result

You will receive the following output −

Cluster Info

Show master

The following query displays the current master information.

Query

default> admin -showmasters 

Result

localhost 

Similarly, you can try other admin commands.

Session Variables

The Tajo client connects to the Master via a unique session id. The session is live until the client is disconnected or expires.

The following command is used to list out all session variables.

Query

default> set 

Result

''SESSION_LAST_ACCESS_TIME'' = ''1470206387146'' 
''CURRENT_DATABASE'' = ''default'' 
‘USERNAME’ = ''user'' 
''SESSION_ID'' = ''c60c9b20-dfba-404a-822f-182bc95d6c7c'' 
''TIMEZONE'' = ''Asia/Kolkata'' 
''FETCH_ROWNUM'' = ''200'' 
‘COMPRESSED_RESULT_TRANSFER'' = ''false'' 

The set key val will set the session variable named key with the value val. For example,

Query

default> set ‘current_database’=''default'' 

Result

usage: set [[NAME] VALUE] 

Here, you can assign the key and value in the set command. If you need to revert the changes then use the unset command.

Apache Tajo – Data Types

To execute a query in a Tajo shell, open your terminal and move to the Tajo installed directory and then type the following command −

$ bin/tsql 

You will now see the response as shown in the following program −

default> 

You can now execute your queries. Otherwise you can run your queries through web console application to the following URL − http://localhost:26080/

Primitive Data Types

Apache Tajo supports the following list of primitive data types −

S.No. Data type & Description
1

integer

Used for storing integer value with 4 bytes storage.

2

tinyint

Tiny integer value is 1 byte

3

smallint

Used for storing small size integer 2 bytes value.

4

bigint

Big range integer value has 8 bytes storage.

5

boolean

Returns true/false.

6

real

Used for storing real value. Size is 4 bytes.

7

float

Floating point precision value which has 4 or 8 bytes storage space.

8

double

Double point precision value stored in 8 bytes.

9

char[(n)]

Character value.

10

varchar[(n)]

Variable-length non-Unicode data.

11

number

Decimal values.

12

binary

Binary values.

13

date

Calendar date (year, month, day).

Example − DATE ”2016-08-22”

14

time

Time of day (hour, minute, second, millisecond) without a time zone. Values of this type are parsed and rendered in the session time zone.

15

timezone

Time of day (hour, minute, second, millisecond) with a time zone. Values of this type are rendered using the time zone from the value.

Example − TIME ”01:02:03.456 Asia/kolkata”

16

timestamp

Instant in time that includes the date and time of day without a time zone.

Example − TIMESTAMP ”2016-08-22 03:04:05.321”

17

text

Variable-length Unicode text.

Apache Tajo – Operators

The following operators are used in Tajo to perform desired operations.

S.No. Operator & Description
1 Arithmetic operators

Presto supports arithmetic operators such as &plus;, −, &ast;, /, %.

2 Relational operators

<, >, <=, >=, =, <>

3 Logical operators

AND, OR, NOT

4 String operators

The ‘||’ operator performs string concatenation.

5 Range operators

Range operator is used to test the value in a specific range. Tajo supports BETWEEN, IS NULL, IS NOT NULL operators.

Apache Tajo – SQL Functions

As of now, you were aware of running simple basic queries on Tajo. In the next subsequent few chapters, we will discuss the following SQL functions −

Apache Tajo – Math Functions

Math functions operate on mathematical formulae. The following table describes the list of functions in detail.

S.No. Function & Description
1 abs(x)

Returns the absolute value of x.

2 cbrt(x)

Returns the cube root of x.

3 ceil(x)

Returns x value rounded up to the nearest integer.

4 floor(x)

Returns x rounded down to the nearest integer.

5 pi()

Returns pi value. Result will be returned as double value.

6 radians(x)

converts the angle x in degree radians.

7 degrees(x)

Returns degree value for x.

8 pow(x,p)

Returns power of value‘p’ to the x value.

9 div(x,y)

Returns the division result for the given two x,y integer values.

10 exp(x)

Returns Euler’s number e raised to the power of a number.

11 sqrt(x)

Returns the square root of x.

12 sign(x)

Returns the signum function of x, that is −

  • 0 if the argument is 0

  • 1 if the argument is greater than 0

  • 1 if the argument is less than 0

13 mod(n,m)

Returns the modulus (remainder) of n divided by m.

14 round(x)

Returns the rounded value for x.

15 cos(x)

Returns the cosine value(x).

16 asin(x)

Returns the inverse sine value(x).

17 acos(x)

Returns the inverse cosine value(x).

18 atan(x)

Returns the inverse tangent value(x).

19 atan2(y,x)

Returns the inverse tangent value(y/x).

Data Type Functions

The following table lists out the data type functions available in Apache Tajo.

S.No. Function & Description
1 to_bin(x)

Returns the binary representation of integer.

2 to_char(int,text)

Converts integer to string.

3 to_hex(x)

Converts the x value into hexadecimal.

Apache Tajo – String Functions

The following table lists out the string functions in Tajo.

S.No. Function & Description
1 concat(string1, …, stringN)

Concatenate the given strings.

2 length(string)

Returns the length of the given string.

3 lower(string)

Returns the lowercase format for the string.

4 upper(string)

Returns the uppercase format for the given string.

5 ascii(string text)

Returns the ASCII code of the first character of the text.

6 bit_length(string text)

Returns the number of bits in a string.

7 char_length(string text)

Returns the number of characters in a string.

8 octet_length(string text)

Returns the number of bytes in a string.

9 digest(input text, method text)

Calculates the Digest hash of string. Here, the second arg method refers to the hash method.

10 initcap(string text)

Converts the first letter of each word to upper case.

11 md5(string text)

Calculates the MD5 hash of string.

12 left(string text, int size)

Returns the first n characters in the string.

13 right(string text, int size)

Returns the last n characters in the string.

14 locate(source text, target text, start_index)

Returns the location of specified substring.

15 strposb(source text, target text)

Returns the binary location of specified substring.

16 substr(source text, start index, length)

Returns the substring for the specified length.

17 trim(string text[, characters text])

Removes the characters (a space by default) from the start/end/both ends of the string.

18 split_part(string text, delimiter text, field int)

Splits a string on delimiter and returns the given field (counting from one).

19 regexp_replace(string text, pattern text, replacement text)

Replaces substrings matched to a given regular expression pattern.

20 reverse(string)

Reverse operation performed for the string.

Apache Tajo – DateTime Functions

Apache Tajo supports the following DateTime functions.

S.No. Function & Description
1 add_days(date date or timestamp, int day

Returns date added by the given day value.

2 add_months(date date or timestamp, int month)

Returns date added by the given month value.

3 current_date()

Returns today’s date.

4 current_time()

Returns today’s time.

5 extract(century from date/timestamp)

Extracts century from the given parameter.

6 extract(day from date/timestamp)

Extracts day from the given parameter.

7 extract(decade from date/timestamp)

Extracts decade from the given parameter.

8 extract(day dow date/timestamp)

Extracts day of week from the given parameter.

9 extract(doy from date/timestamp)

Extracts day of year from the given parameter.

10 select extract(hour from timestamp)

Extracts hour from the given parameter.

11 select extract(isodow from timestamp)

Extracts day of week from the given parameter. This is identical to dow except for Sunday. This matches the ISO 8601 day of the week numbering.

12 select extract(isoyear from date)

Extracts ISO year from the specified date. ISO year may be different from the Gregorian year.

13 extract(microseconds from time)

Extracts microseconds from the given parameter. The seconds field, including fractional parts, multiplied by 1 000 000;

14 extract(millennium from timestamp )

Extracts millennium from the given parameter.one millennium corresponds to 1000 years. Hence, the third millennium started January 1, 2001.

15 extract(milliseconds from time)

Extracts milliseconds from the given parameter.

16 extract(minute from timestamp )

Extracts minute from the given parameter.

17 extract(quarter from timestamp)

Extracts quarter of the year(1 – 4) from the given parameter.

18 date_part(field text, source date or timestamp or time)

Extracts date field from text.

19 now()

Returns current timestamp.

20 to_char(timestamp, format text)

Converts timestamp to text.

21 to_date(src text, format text)

Converts text to date.

22 to_timestamp(src text, format text)

Converts text to timestamp.

Apache Tajo – JSON Functions

The JSON functions are listed in the following table −

S.No. Function & Description
1 json_extract_path_text(js on text, json_path text)

Extracts JSON string from a JSON string based on json path specified.

2 json_array_get(json_array text, index int4)

Returns the element at the specified index into the JSON array.

3 json_array_contains(json_ array text, value any)

Determine if the given value exists in the JSON array.

4 json_array_length(json_ar ray text)

Returns the length of json array.

Apache Tajo – Database Creation

This section explains the Tajo DDL commands. Tajo has a built-in database named default.

Create Database Statement

Create Database is a statement used to create a database in Tajo. The syntax for this statement is as follows −

CREATE DATABASE [IF NOT EXISTS] <database_name> 

Query

default> default> create database if not exists test; 

Result

The above query will generate the following result.

OK

Database is the namespace in Tajo. A database can contain multiple tables with a unique name.

Show Current Database

To check the current database name, issue the following command −

Query

default> c 

Result

The above query will generate the following result.

You are now connected to database "default" as user “user1". 
default>  

Connect to Database

As of now, you have created a database named “test”. The following syntax is used to connect the “test” database.

c <database name> 

Query

default> c test 

Result

The above query will generate the following result.

You are now connected to database "test" as user “user1”. 
test>

You can now see the prompt changes from default database to test database.

Drop Database

To drop a database, use the following syntax −

DROP DATABASE <database-name> 

Query

test> c default 
You are now connected to database "default" as user “user1". 
default> drop database test; 

Result

The above query will generate the following result.

OK 

Apache Tajo – Table Management

A table is a logical view of one data source. It consists of a logical schema, partitions, URL, and various properties. A Tajo table can be a directory in HDFS, a single file, one HBase table, or a RDBMS table.

Tajo supports the following two types of tables −

  • external table
  • internal table

External Table

External table needs the location property when the table is created. For example, if your data is already there as Text/JSON files or HBase table, you can register it as Tajo external table.

The following query is an example of external table creation.

create external table sample(col1 int,col2 text,col3 int) location ‘hdfs://path/to/table'';

Here,

  • External keyword − This is used to create an external table. This helps to create a table in the specified location.

  • Sample refers to the table name.

  • Location − It is a directory for HDFS,Amazon S3, HBase or local file system. To assign a location property for directories, use the below URI examples −

    • HDFS − hdfs://localhost:port/path/to/table

    • Amazon S3 − s3://bucket-name/table

    • local file system − file:///path/to/table

    • Openstack Swift − swift://bucket-name/table

Table Properties

An external table has the following properties −

  • TimeZone − Users can specify a time zone for reading or writing a table.

  • Compression format − Used to make data size compact. For example, the text/json file uses compression.codec property.

Internal Table

A Internal table is also called an Managed Table. It is created in a pre-defined physical location called the Tablespace.

Syntax

create table table1(col1 int,col2 text);

By default, Tajo uses “tajo.warehouse.directory” located in “conf/tajo-site.xml” . To assign new location for the table, you can use Tablespace configuration.

Tablespace

Tablespace is used to define locations in the storage system. It is supported for only internal tables. You can access the tablespaces by their names. Each tablespace can use a different storage type. If you don’t specify tablespaces then, Tajo uses the default tablespace in the root directory.

Tablespace Configuration

You have “conf/tajo-site.xml.template” in Tajo. Copy the file and rename it to “storagesite.json”. This file will act as a configuration for Tablespaces. Tajo data formats uses the following configuration −

HDFS Configuration

$ vi conf/storage-site.json { 
   "spaces": {  
      "${tablespace_name}": {  
         "uri": “hdfs://localhost:9000/path/to/Tajo"  
      } 
   } 
}

HBase Configuration

$ vi conf/storage-site.json { 
   "spaces": {  
      "${tablespace_name}": {  
         "uri": “hbase:zk://quorum1:port,quorum2:port/"  
      } 
   } 
}

Text File Configuration

$ vi conf/storage-site.json { 
   "spaces": {  
      "${tablespace_name}": {  
         “uri”: “hdfs://localhost:9000/path/to/Tajo” 
      } 
   } 
}

Tablespace Creation

Tajo’s internal table records can be accessed from another table only. You can configure it with tablespace.

Syntax

CREATE TABLE [IF NOT EXISTS] <table_name> [(column_list)] [TABLESPACE tablespace_name] 
[using <storage_type> [with (<key> = <value>, ...)]] [AS <select_statement>]

Here,

  • IF NOT EXISTS − This avoids an error if the same table has not been created already.

  • TABLESPACE − This clause is used to assign the tablespace name.

  • Storage type − Tajo data supports formats like text,JSON,HBase,Parquet,Sequencefile and ORC.

  • AS select statement − Select records from another table.

Configure Tablespace

Start your Hadoop services and open the file “conf/storage-site.json”, then add the following changes −

$ vi conf/storage-site.json { 
   "spaces": {  
      “space1”: {  
         "uri": “hdfs://localhost:9000/path/to/Tajo" 
      } 
   } 
} 

Here, Tajo will refer to the data from HDFS location and space1 is the tablespace name. If you do not start Hadoop services, you can’t register tablespace.

Query

default> create table table1(num1 int,num2 text,num3 float) tablespace space1;

The above query creates a table named “table1” and “space1” refers to the tablespace name.

Data formats

Tajo supports data formats. Let’s go through each of the formats one by one in detail.

Text

A character-separated values’ plain text file represents a tabular data set consisting of rows and columns. Each row is a plain text line.

Creating Table

default> create external table customer(id int,name text,address text,age int) 
   using text with(''text.delimiter''='','') location ‘file:/Users/workspace/Tajo/customers.csv’;

Here, “customers.csv” file refers to a comma separated value file located in the Tajo installation directory.

To create internal table using text format, use the following query −

default> create table customer(id int,name text,address text,age int) using text; 

In the above query, you have not assigned any tablespace so it will take Tajo’s default tablespace.

Properties

A text file format has the following properties −

  • text.delimiter − This is a delimiter character. Default is ‘|’.

  • compression.codec − This is a compression format. By default, it is disabled. you can change the settings using specified algorithm.

  • timezone − The table used for reading or writing.

  • text.error-tolerance.max-num − The maximum number of tolerance levels.

  • text.skip.headerlines − The number of header lines per skipped.

  • text.serde − This is serialization property.

JSON

Apache Tajo supports JSON format for querying data. Tajo treats a JSON object as SQL record. One object equals one row in a Tajo table. Let’s consider “array.json” as follows −

$ hdfs dfs -cat /json/array.json { 
   "num1" : 10, 
   "num2" : "simple json array", 
   "num3" : 50.5 
}

After you create this file, switch to the Tajo shell and type the following query to create a table using the JSON format.

Query

default> create external table sample (num1 int,num2 text,num3 float) 
   using json location ‘json/array.json’;

Always remember that the file data must match with the table schema. Otherwise, you can omit the column names and use * which doesn’t require columns list.

To create an internal table, use the following query −

default> create table sample (num1 int,num2 text,num3 float) using json;

Parquet

Parquet is a columnar storage format. Tajo uses Parquet format for easy, fast and efficient access.

Table creation

The following query is an example for table creation −

CREATE TABLE parquet (num1 int,num2 text,num3 float) USING PARQUET; 

Parquet file format has the following properties −

  • parquet.block.size − size of a row group being buffered in memory.

  • parquet.page.size − The page size is for compression.

  • parquet.compression − The compression algorithm used to compress pages.

  • parquet.enable.dictionary − The boolean value is to enable/disable dictionary encoding.

RCFile

RCFile is the Record Columnar File. It consists of binary key/value pairs.

Table creation

The following query is an example for table creation −

CREATE TABLE Record(num1 int,num2 text,num3 float) USING RCFILE; 

RCFile has the following properties −

  • rcfile.serde − custom deserializer class.

  • compression.codec − compression algorithm.

  • rcfile.null − NULL character.

SequenceFile

SequenceFile is a basic file format in Hadoop which consists of key/value pairs.

Table creation

The following query is an example for table creation −

CREATE TABLE seq(num1 int,num2 text,num3 float) USING sequencefile; 

This sequence file has Hive compatibility. This can be written in Hive as,

CREATE TABLE table1 (id int, name string, score float, type string) 
STORED AS sequencefile; 

ORC

ORC (Optimized Row Columnar) is a columnar storage format from Hive.

Table creation

The following query is an example for table creation −

CREATE TABLE optimized(num1 int,num2 text,num3 float) USING ORC; 

The ORC format has the following properties −

  • orc.max.merge.distance − ORC file is read, it merges when the distance is lower.

  • orc.stripe.size − This is the size of each stripe.

  • orc.buffer.size − The default is 256KB.

  • orc.rowindex.stride − This is the ORC index stride in number of rows.

Apache Tajo – SQL Statements

In the previous chapter, you have understood how to create tables in Tajo. This chapter explains about the SQL statement in Tajo.

Create Table Statement

Before moving to create a table, create a text file “students.csv” in Tajo installation directory path as follows −

students.csv

Id Name Address Age Marks
1 Adam 23 New Street 21 90
2 Amit 12 Old Street 13 95
3 Bob 10 Cross Street 12 80
4 David 15 Express Avenue 12 85
5 Esha 20 Garden Street 13 50
6 Ganga 25 North Street 12 55
7 Jack 2 Park Street 12 60
8 Leena 24 South Street 12 70
9 Mary 5 West Street 12 75
10 Peter 16 Park Avenue 12 95

After the file has been created, move to the terminal and start the Tajo server and shell one by one.

Create Database

Create a new database using the following command −

Query

default> create database sampledb; 
OK

Connect to the database “sampledb” which is now created.

default> c sampledb 
You are now connected to database "sampledb" as user “user1”.

Then, create a table in “sampledb” as follows −

Query

sampledb>  create external table mytable(id int,name text,address text,age int,mark int) 
   using text with(''text.delimiter'' = '','') location ‘file:/Users/workspace/Tajo/students.csv’;

Result

The above query will generate the following result.

OK

Here, the external table is created. Now, you just have to enter the file location. If you have to assign the table from hdfs then use hdfs instead of file.

Next, the “students.csv” file contains comma separated values. The text.delimiter field is assigned with ‘,’.

You have now created “mytable” successfully in “sampledb”.

Show Table

To show tables in Tajo, use the following query.

Query

sampledb> d  
mytable  
sampledb> d mytable

Result

The above query will generate the following result.

table name: sampledb.mytable 
table uri: file:/Users/workspace/Tajo/students.csv 
store type: TEXT 
number of rows: unknown 
volume: 261 B
Options: 
''timezone'' = ''Asia/Kolkata'' 
''text.null'' = ''\N'' 
''text.delimiter'' = '',''  
schema: 
id INT4 
name TEXT 
address TEXT 
age INT4 
mark INT4 

List table

To fetch all the records in the table, type the following query −

Query

sampledb> select * from mytable;

Result

The above query will generate the following result.

List Table

Insert Table Statement

Tajo uses the following syntax to insert records in table.

Syntax

create table table1 (col1 int8, col2 text, col3 text);   
--schema should be same for target table schema  
Insert overwrite into table1 select * from table2;  
                     (or)  
Insert overwrite into LOCATION ''/dir/subdir'' select * from table; 

Tajo’s insert statement is similar to the INSERT INTO SELECT statement of SQL.

Query

Let’s create a table to overwrite table data of an existing table.

sampledb> create table test(sno int,name text,addr text,age int,mark int);  
OK 
sampledb> d

Result

The above query will generate the following result.

mytable 
test

Insert Records

To insert records in the “test” table, type the following query.

Query

sampledb> insert overwrite into test select * from mytable; 

Result

The above query will generate the following result.

Progress: 100%, response time: 0.518 sec

Here, “mytable” records overwrite the “test” table. If you don’t want to create the “test” table, then straight away assign the physical path location as mentioned in an alternative option for insert query.

Fetch records

Use the following query to list out all the records in the “test” table −

Query

sampledb> select * from test;

Result

The above query will generate the following result.

Fetch Records

This statement is used to add, remove or modify columns of an existing table.

To rename the table use the following syntax −

Alter table table1 RENAME TO table2;

Query

sampledb> alter table test rename to students;

Result

The above query will generate the following result.

OK

To check the changed table name, use the following query.

sampledb> d  
mytable 
students

Now the table “test” is changed to “students” table.

Add Column

To insert new column in the “students” table, type the following syntax −

Alter table <table_name> ADD COLUMN <column_name> <data_type>

Query

sampledb> alter table students add column grade text; 

Result

The above query will generate the following result.

OK

Set Property

This property is used to change the table’s property.

Query

sampledb> ALTER TABLE students SET PROPERTY ''compression.type'' = ''RECORD'',
   ''compression.codec'' = ''org.apache.hadoop.io.compress.Snappy Codec'' ; 
OK

Here, compression type and codec properties are assigned.

To change the text delimiter property, use the following −

Query

ALTER TABLE students  SET PROPERTY ‘text.delimiter''='','';  
OK 

Result

The above query will generate the following result.

sampledb> d students  
table name: sampledb.students 
table uri: file:/tmp/tajo-user1/warehouse/sampledb/students 
store type: TEXT 
number of rows: 10 
volume: 228 B 
Options: 
''compression.type'' = ''RECORD'' 
''timezone'' = ''Asia/Kolkata'' 
''text.null'' = ''\N'' 
''compression.codec'' = ''org.apache.hadoop.io.compress.SnappyCodec'' 
''text.delimiter'' = '',''  
schema: 
id INT4 
name TEXT 
addr TEXT 
age INT4 
mark INT4 
grade TEXT

The above result shows that the table’s properties are changed using the “SET” property.

Select Statement

The SELECT statement is used to select data from a database.

The syntax for the Select statement is as follows −

SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...] 
   [FROM <table reference> [[AS] <table alias name>] [, ...]] 
   [WHERE <condition>] 
   [GROUP BY <expression> [, ...]] 
   [HAVING <condition>] 
   [ORDER BY <expression> [ASC|DESC] [NULLS (FIRST|LAST)] [, …]]

Where Clause

The Where clause is used to filter records from the table.

Query

sampledb> select * from mytable where id > 5;

Result

The above query will generate the following result.

Where Clause

The query returns the records of those students whose id is greater than 5.

Query

sampledb> select * from mytable where name = ‘Peter’; 

Result

The above query will generate the following result.

Progress: 100%, response time: 0.117 sec 
 
  id,  name,  address,   age  
------------------------------- 
10,  Peter,  16 park avenue , 12 

The result filters Peter’s records only.

Distinct Clause

A table column may contain duplicate values. The DISTINCT keyword can be used to return only distinct (different) values.

Syntax

SELECT DISTINCT column1,column2 FROM table_name;

Query

sampledb> select distinct age from mytable; 

Result

The above query will generate the following result.

Progress: 100%, response time: 0.216 sec  
age 
------------------------------- 
13 
12 

The query returns the distinct age of students from mytable.

Group By Clause

The GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.

Syntax

SELECT column1, column2 FROM table_name WHERE [ conditions ] GROUP BY column1, column2;

Query

select age,sum(mark) as sumofmarks from mytable group by age;

Result

The above query will generate the following result.

age,  sumofmarks 
------------------------------- 
13,  145 
12,  610

Here, the “mytable” column has two types of ages — 12 and 13. Now the query groups the records by age and produces the sum of marks for the corresponding ages of students.

Having Clause

The HAVING clause enables you to specify conditions that filter which group results appear in the final results. The WHERE clause places conditions on the selected columns, whereas the HAVING clause places conditions on the groups created by the GROUP BY clause.

Syntax

SELECT column1, column2 FROM table1 GROUP BY column HAVING [ conditions ]

Query

sampledb> select age from mytable group by age  having  sum(mark) > 200; 

Result

The above query will generate the following result.

age 
------------------------------- 
12 

The query groups the records by age and returns the age when the condition result sum(mark) > 200.

Order By Clause

The ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. The Tajo database sorts query results in ascending order by default.

Syntax

SELECT column-list FROM table_name  
[WHERE condition]  
[ORDER BY column1, column2, .. columnN] [ASC | DESC]; 

Query

sampledb> select * from mytable where mark > 60 order by name desc; 

Result

The above query will generate the following result.

Order by Clause

The query returns the names of those students in descending order whose marks are greater than 60.

Create Index Statement

The CREATE INDEX statement is used to create indexes in tables. Index is used for fast retrieval of data. Current version supports index for only plain TEXT formats stored on HDFS.

Syntax

CREATE INDEX [ name ] ON table_name ( { column_name | ( expression ) }

Query

create index student_index on mytable(id);

Result

The above query will generate the following result.

id 
——————————————— 

To view assigned index for the column, type the following query.

default> d mytable  
table name: default.mytable 
table uri: file:/Users/deiva/workspace/Tajo/students.csv 
store type: TEXT 
number of rows: unknown  
volume: 307 B 
Options: 
   ''timezone'' = ''Asia/Kolkata'' 
   ''text.null'' = ''\N'' 
   ''text.delimiter'' = '',''  
schema: 
id INT4 
name TEXT 
address TEXT 
age INT4 
mark INT4   
Indexes:  
"student_index" TWO_LEVEL_BIN_TREE (id ASC NULLS LAST ) 

Here, TWO_LEVEL_BIN_TREE method is used by default in Tajo.

Drop Table Statement

The Drop Table Statement is used to drop a table from the database.

Syntax

drop table table name;

Query

sampledb> drop table mytable;

To check if the table has been dropped from the table, type the following query.

sampledb> d mytable;

Result

The above query will generate the following result.

ERROR: relation ''mytable'' does not exist 

You can also check the query using “d” command to list out the available Tajo tables.

Aggregate & Window Functions

This chapter explains the aggregate and window functions in detail.

Aggregation Functions

Aggregate functions produce a single result from a set of input values. The following table describes the list of aggregate functions in detail.

S.No. Function & Description
1 AVG(exp)

Averages a column of all records in a data source.

2 CORR(expression1, expression2)

Returns the coefficient of correlation between a set of number pairs.

3 COUNT()

Returns the number rows.

4 MAX(expression)

Returns the largest value of the selected column.

5 MIN(expression)

Returns the smallest value of the selected column.

6 SUM(expression)

Returns the sum of the given column.

7 LAST_VALUE(expression)

Returns the last value of the given column.

Window Function

The Window functions execute on a set of rows and return a single value for each row from the query. The term window has the meaning of set of row for the function.

The Window function in a query, defines the window using the OVER() clause.

The OVER() clause has the following capabilities −

  • Defines window partitions to form groups of rows. (PARTITION BY clause)
  • Orders rows within a partition. (ORDER BY clause)

The following table describes the window functions in detail.

Function Return type Description
rank() int Returns rank of the current row with gaps.
row_num() int Returns the current row within its partition, counting from 1.
lead(value[, offset integer[, default any]]) Same as input type Returns value evaluated at the row that is offset rows after the current row within the partition. If there is no such row, default value will be returned.
lag(value[, offset integer[, default any]]) Same as input type Returns value evaluated at the row that is offset rows before the current row within the partition.
first_value(value) Same as input type Returns the first value of input rows.
last_value(value) Same as input type Returns the last value of input rows.

Apache Tajo – SQL Queries

This chapter explains about the following significant Queries.

  • Predicates
  • Explain
  • Join

Let us proceed and perform the queries.

Predicates

Predicate is an expression which is used to evaluate true/false values and UNKNOWN. Predicates are used in the search condition of WHERE clauses and HAVING clauses and other constructs where a Boolean value is required.

IN predicate

Determines whether the value of expression to test matches any value in the subquery or the list. Subquery is an ordinary SELECT statement that has a result set of one column and one or more rows. This column or all expressions in the list must have the same data type as the expression to test.

Syntax

IN::= 
<expression to test> [NOT] IN (<subquery>) 
| (<expression1>,...)

Query

select id,name,address from mytable where id in(2,3,4);

Result

The above query will generate the following result.

id,  name,   address 
------------------------------- 
2,  Amit,  12 old street 
3,  Bob,   10 cross street 
4,  David, 15 express avenue 

The query returns records from mytable for the students id 2,3 and 4.

Query

select id,name,address from mytable where id not in(2,3,4);

Result

The above query will generate the following result.

id,  name,  address 
------------------------------- 
1,  Adam,   23 new street 
5,  Esha,   20 garden street 
6,  Ganga,  25 north street 
7,  Jack,   2 park street 
8,  Leena,  24 south street 
9,  Mary,   5 west street 
10, Peter,  16 park avenue 

The above query returns records from mytable where students is not in 2,3 and 4.

Like Predicate

The LIKE predicate compares the string specified in the first expression for calculating the string value, which is refered to as a value to test, with the pattern that is defined in the second expression for calculating the string value.

The pattern may contain any combination of wildcards such as −

  • Underline symbol (_), which can be used instead of any single character in the value to test.

  • Percent sign (%), which replaces any string of zero or more characters in the value to test.

Syntax

LIKE::= 
<expression for calculating the string value> 
[NOT] LIKE 
<expression for calculating the string value> 
[ESCAPE <symbol>] 

Query

select * from mytable where name like ‘A%''; 

Result

The above query will generate the following result.

id,  name,  address,     age,  mark 
------------------------------- 
1,  Adam,  23 new street,  12,  90 
2,  Amit,  12 old street,  13,  95

The query returns records from mytable of those students whose names are starting with ‘A’.

Query

select * from mytable where name like ‘_a%''; 

Result

The above query will generate the following result.

id,  name,  address,    age,  mark 
——————————————————————————————————————- 
4,  David,  15 express avenue,  12,  85 
6,  Ganga,  25 north street,    12,  55 
7,  Jack,  2 park street,       12,  60 
9,  Mary,  5 west street,       12,  75 

The query returns records from mytable of those students whose names are starting with ‘a’ as the second char.

Using NULL Value in Search Conditions

Let us now understand how to use NULL Value in the search conditions.

Syntax

Predicate  
IS [NOT] NULL 

Query

select name from mytable where name is not null; 

Result

The above query will generate the following result.

name 
------------------------------- 
Adam 
Amit 
Bob 
David 
Esha 
Ganga 
Jack 
Leena 
Mary 
Peter  
(10 rows, 0.076 sec, 163 B selected) 

Here, the result is true so it returns all the names from table.

Query

Let us now check the query with NULL condition.

default> select name from mytable where name is null; 

Result

The above query will generate the following result.

name 
------------------------------- 
(0 rows, 0.068 sec, 0 B selected) 

Explain

Explain is used to obtain a query execution plan. It shows a logical and global plan execution of a statement.

Logical Plan Query

explain select * from mytable;  
explain 
-------------------------------  
   => target list: default.mytable.id (INT4), default.mytable.name (TEXT), 
      default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) 
   
   => out schema: {
   (5) default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), 
      default.mytable.age (INT4), default.mytable.mark (INT4)
   } 
   
   => in schema: {
	(5) default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), 
      default.mytable.age (INT4), default.mytable.mark (INT4)
   }

Result

The above query will generate the following result.

Explain

The query result shows a logical plan format for the given table. The Logical plan returns the following three results −

  • Target list
  • Out schema
  • In schema

Global Plan Query

explain global select * from mytable;  
explain 
------------------------------- 
------------------------------------------------------------------------------- 
Execution Block Graph (TERMINAL - eb_0000000000000_0000_000002) 
------------------------------------------------------------------------------- 
|-eb_0000000000000_0000_000002 
   |-eb_0000000000000_0000_000001 
------------------------------------------------------------------------------- 
Order of Execution 
------------------------------------------------------------------------------- 
1: eb_0000000000000_0000_000001 
2: eb_0000000000000_0000_000002 
-------------------------------------------------------------------------------  
======================================================= 
Block Id: eb_0000000000000_0000_000001 [ROOT] 
=======================================================  
SCAN(0) on default.mytable 
   
   => target list: default.mytable.id (INT4), default.mytable.name (TEXT), 
      default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4) 
   
   => out schema: {
	(5) default.mytable.id (INT4), default.mytable.name (TEXT),default.mytable.address (TEXT), 
      default.mytable.age (INT4), default.mytable.mark (INT4)
   } 
   
   => in schema: {
	(5) default.mytable.id (INT4), default.mytable.name (TEXT), default.mytable.address (TEXT), 
      default.mytable.age (INT4), default.mytable.mark (INT4)
   }  
======================================================= 
Block Id: eb_0000000000000_0000_000002 [TERMINAL] 
======================================================= 
(24 rows, 0.065 sec, 0 B selected) 

Result

The above query will generate the following result.

Global Plan

Here, Global plan shows execution block ID, order of execution and its information.

Joins

SQL joins are used to combine rows from two or more tables. The following are the different types of SQL Joins −

  • Inner join
  • { LEFT | RIGHT | FULL } OUTER JOIN
  • Cross join
  • Self join
  • Natural join

Consider the following two tables to perform joins operations.

Table1 − Customers

Id Name Address Age
1 Customer 1 23 Old Street 21
2 Customer 2 12 New Street 23
3 Customer 3 10 Express Avenue 22
4 Customer 4 15 Express Avenue 22
5 Customer 5 20 Garden Street 33
6 Customer 6 21 North Street 25

Table2 − customer_order

Id Order Id Emp Id
1 1 101
2 2 102
3 3 103
4 4 104
5 5 105

Let us now proceed and perform the SQL joins operations on the above two tables.

Inner Join

The Inner join selects all rows from both the tables when there is a match between the columns in both tables.

Syntax

SELECT column_name(s) FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name;

Query

default> select c.age,c1.empid from customers c inner join customer_order c1 on c.id = c1.id; 

Result

The above query will generate the following result.

age,  empid 
------------------------------- 
21,  101 
23,  102 
22,  103 
22,  104 
33,  105 

The query matches five rows from both the tables. Hence, it returns the matched rows age from the first table.

Left Outer Join

A left outer join retains all of the rows of the “left” table, regardless of whether there is a row that matches on the “right” table or not.

Query

select c.name,c1.empid from customers c left outer join customer_order c1 on c.id = c1.id;

Result

The above query will generate the following result.

name,       empid 
------------------------------- 
customer1,  101 
customer2,  102 
customer3,  103 
customer4,  104 
customer5,  105 
customer6,

Here, the left outer join returns name column rows from the customers(left) table and empid column matched rows from the customer_order(right) table.

Right Outer Join

A right outer join retains all of the rows of the “right” table, regardless of whether there is a row that matches on the “left” table.

Query

select c.name,c1.empid from customers c right outer join customer_order c1 on c.id = c1.id;

Result

The above query will generate the following result.

name,      empid 
------------------------------- 
customer1,  101 
customer2,  102 
customer3,  103 
customer4,  104 
customer5,  105

Here, the Right Outer Join returns the empid rows from the customer_order(right) table and the name column matched rows from customers table.

Full Outer Join

The Full Outer Join retains all rows from both the left and the right table.

Query

select * from customers c full outer join customer_order c1 on c.id = c1.id;

Result

The above query will generate the following result.

Full Outer Join

The query returns all the matching and non-matching rows from both the customers and the customer_order tables.

Cross Join

This returns the Cartesian product of the sets of records from the two or more joined tables.

Syntax

SELECT *  FROM table1  CROSS JOIN table2;

Query

select orderid,name,address from customers,customer_order;

Result

The above query will generate the following result.

Cross Join

The above query returns the Cartesian product of the table.

Natural Join

A Natural Join does not use any comparison operator. It does not concatenate the way a Cartesian product does. We can perform a Natural Join only if there is at least one common attribute that exists between the two relations.

Syntax

SELECT * FROM table1 NATURAL JOIN table2;

Query

select * from customers natural join customer_order; 

Result

The above query will generate the following result.

Natural Join

Here, there is one common column id that exists between two tables. Using that common column, the Natural Join joins both the tables.

Self Join

The SQL SELF JOIN is used to join a table to itself as if the table were two tables, temporarily renaming at least one table in the SQL statement.

Syntax

SELECT a.column_name, b.column_name...  
FROM table1 a, table1 b  
WHERE a.common_filed = b.common_field  

Query

default> select c.id,c1.name from customers c, customers c1 where c.id = c1.id; 

Result

The above query will generate the following result.

id,   name 
------------------------------- 
1,   customer1 
2,   customer2 
3,   customer3 
4,   customer4 
5,   customer5 
6,   customer6 

The query joins a customer table to itself.

Apache Tajo – Storage Plugins

Tajo supports various storage formats. To register storage plugin configuration, you should add the changes to the configuration file “storage-site.json”.

storage-site.json

The structure is defined as follows −

{ 
   "storages": { 
      “storage plugin name“: { 
         "handler": "${class name}”, "default-format": “plugin name" 
      } 
   } 
}

Each storage instance is identified by URI.

PostgreSQL Storage Handler

Tajo supports PostgreSQL storage handler. It enables user queries to access database objects in PostgreSQL. It is the default storage handler in Tajo so you can easily configure it.

configuration

{ 
   "spaces": {  
      "postgre": {  
         "uri": "jdbc:postgresql://hostname:port/database1"  
         "configs": {  
            "mapped_database": “sampledb”  
            "connection_properties": { 
               "user":“tajo", "password": "pwd" 
            } 
         } 
      } 
   } 
} 		  

Here, “database1” refers to the postgreSQL database which is mapped to the database “sampledb” in Tajo.

Apache Tajo – Integration with HBase

Apache Tajo supports HBase integration. This enables us to access HBase tables in Tajo. HBase is a distributed column-oriented database built on top of the Hadoop file system. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. The following steps are required to configure HBase integration.

Set Environment Variable

Add the following changes to “conf/tajo-env.sh” file.

$ vi conf/tajo-env.sh  
# HBase home directory. It is opitional but is required mandatorily to use HBase. 
# export HBASE_HOME = path/to/HBase

After you have included the HBase path, Tajo will set the HBase library file to the classpath.

Create an External Table

Create an external table using the following syntax −

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <table_name> [(<column_name> <data_type>, ... )] 
USING hbase WITH (''table'' = ''<hbase_table_name>'' 
, ''columns'' = '':key,<column_family_name>:<qualifier_name>, ...'' 
, ''hbase.zookeeper.quorum'' = ''<zookeeper_address>'' 
, ''hbase.zookeeper.property.clientPort'' = ''<zookeeper_client_port>'') 
[LOCATION ''hbase:zk://<hostname>:<port>/''] ;

To access HBase tables, you must configure the tablespace location.

Here,

  • Table − Set hbase origin table name. If you want to create an external table, the table must exists on HBase.

  • Columns − Key refers to the HBase row key. The number of columns entry need to be equal to the number of Tajo table columns.

  • hbase.zookeeper.quorum − Set zookeeper quorum address.

  • hbase.zookeeper.property.clientPort − Set zookeeper client port.

Query

CREATE EXTERNAL TABLE students (rowkey text,id int,name text) 
USING hbase WITH (''table'' = ''students'', ''columns'' = '':key,info:id,content:name'') 
LOCATION ''hbase:zk://<hostname>:<port>/'';

Here, the Location path field sets the zookeeper client port id. If you don’t set the port, Tajo will refer the property of hbase-site.xml file.

Create Table in HBase

You can start the HBase interactive shell using the “hbase shell” command as shown in the following query.

Query

/bin/hbase shell 

Result

The above query will generate the following result.

hbase(main):001:0>

Steps to Query HBase

To query HBase, you should complete the following steps −

Step 1 − Pipe the following commands to the HBase shell to create a “tutorial” table.

Query

hbase(main):001:0> create ‘students’,{NAME => ’info’},{NAME => ’content’} 
put ''students'', ‘row-01'', ''content:name'', ''Adam'' 
put ''students'', ‘row-01'', ''info:id'', ''001'' 
put ''students'', ‘row-02'', ''content:name'', ''Amit'' 
put ''students'', ‘row-02'', ''info:id'', ''002'' 
put ''students'', ‘row-03'', ''content:name'', ''Bob'' 
put ''students'', ‘row-03'', ''info:id'', ‘003'' 

Step 2 − Now, issue the following command in hbase shell to load the data into a table.

main):001:0> cat ../hbase/hbase-students.txt | bin/hbase shell

Step 3 − Now, return to the Tajo shell and execute the following command to view the metadata of the table −

default> d students;  

table name: default.students 
table path: 
store type: HBASE 
number of rows: unknown 
volume: 0 B 
Options: 
   ''columns'' = '':key,info:id,content:name'' 
   ''table'' = ''students''  

schema: 
rowkey  TEXT 
id  INT4 
name TEXT

Step 4 − To fetch the results from the table, use the following query −

Query

default> select * from students

Result

The above query will fetch the following result −

rowkey,  id,  name 
------------------------------- 
row-01,  001,  Adam 
row-02,  002,  Amit 
row-03   003,  Bob 

Apache Tajo – Integration with Hive

Tajo supports the HiveCatalogStore to integrate with Apache Hive. This integration allows Tajo to access tables in Apache Hive.

Set Environment Variable

Add the following changes to “conf/tajo-env.sh” file.

$ vi conf/tajo-env.sh  
export HIVE_HOME = /path/to/hive

After you have included the Hive path, Tajo will set the Hive library file to the classpath.

Catalog Configuration

Add the following changes to the “conf/catalog-site.xml” file.

$ vi conf/catalog-site.xml  
<property> 
   <name>tajo.catalog.store.class</name> 
   <value>org.apache.tajo.catalog.store.HiveCatalogStore</value> 
</property>

Once HiveCatalogStore is configured, you can access Hive’s table in Tajo.

Apache Tajo – OpenStack Swift Integration

Swift is a distributed and consistent object/blob store. Swift offers cloud storage software so that you can store and retrieve lots of data with a simple API. Tajo supports Swift integration.

The following are the prerequisites of Swift Integration −

  • Swift
  • Hadoop

Core-site.xml

Add the following changes to the hadoop “core-site.xml” file −

<property> 
   <name>fs.swift.impl</name> 
   <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value> 
   <description>File system implementation for Swift</description> 
</property>  

<property> 
   <name>fs.swift.blocksize</name> 
   <value>131072</value> 
   <description>Split size in KB</description> 
</property> 

This will be used for Hadoop to access the Swift objects. After you made all the changes move to the Tajo directory to set Swift environment variable.

conf/tajo-env.h

Open the Tajo configuration file and add set the environment variable as follows −

$ vi conf/tajo-env.h  
export TAJO_CLASSPATH = $HADOOP_HOME/share/hadoop/tools/lib/hadoop-openstack-x.x.x.jar 

Now, Tajo will be able to query the data using Swift.

Create Table

Let’s create an external table to access Swift objects in Tajo as follows −

default> create external table swift(num1 int, num2 text, num3 float) 
   using text with (''text.delimiter'' = ''|'') location ''swift://bucket-name/table1'';

After the table has been created, you can run the SQL queries.

Apache Tajo – JDBC Interface

Apache Tajo provides JDBC interface to connect and execute queries. We can use the same JDBC interface to connect Tajo from our Java based application. Let us now understand how to connect Tajo and execute the commands in our sample Java application using JDBC interface in this section.

Download JDBC Driver

Download the JDBC driver by visiting the following link − http://apache.org/dyn/closer.cgi/tajo/tajo-0.11.3/tajo-jdbc-0.11.3.jar.

Now, “tajo-jdbc-0.11.3.jar” file has been downloaded on your machine.

Set Class Path

To make use of the JDBC driver in your program, set the class path as follows −

CLASSPATH = path/to/tajo-jdbc-0.11.3.jar:$CLASSPATH 

Connect to Tajo

Apache Tajo provides a JDBC driver as a single jar file and it is available @ /path/to/tajo/share/jdbc-dist/tajo-jdbc-0.11.3.jar.

The connection string to connect the Apache Tajo is of the following format −

jdbc:tajo://host/
jdbc:tajo://host/database
jdbc:tajo://host:port/  
jdbc:tajo://host:port/database 

Here,

  • host − The hostname of the TajoMaster.

  • port − The port number that server is listening. Default port number is 26002.

  • database − The database name. The default database name is default.

Java Application

Let us now understand Java application.

Coding

import java.sql.*; 
import org.apache.tajo.jdbc.TajoDriver;  

public class TajoJdbcSample {  
   public static void main(String[] args) {  
      Connection connection = null; 
      Statement statement = null;  
      try {
         Class.forName("org.apache.tajo.jdbc.TajoDriver");  
         connection = DriverManager.getConnection(“jdbc:tajo://localhost/default");
         statement = connection.createStatement(); 
         String sql;  
         sql = "select * from mytable”; 
         // fetch records from mytable.  
         ResultSet resultSet = statement.executeQuery(sql);  
         while(resultSet.next()){  
            int id  = resultSet.getInt("id"); 
            String name = resultSet.getString("name");  
            System.out.print("ID: " + id + ";nName: " + name + "n"); 
         }  
         resultSet.close();
         statement.close(); 
         connection.close(); 
      }catch(SQLException sqlException){ 
         sqlException.printStackTrace(); 
      }catch(Exception exception){ 
         exception.printStackTrace(); 
      } 
   } 
}

The application can be compiled and run using the following commands.

Compilation

javac -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample.java 

Execution

java -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample 

Result

The above commands will generate the following result −

ID: 1; 
Name: Adam  

ID: 2; 
Name: Amit  

ID: 3; 
Name: Bob  

ID: 4; 
Name: David  

ID: 5; 
Name: Esha  

ID: 6; 
Name: Ganga 

ID: 7; 
Name: Jack  

ID: 8; 
Name: Leena  

ID: 9; 
Name: Mary  

ID: 10; 
Name: Peter

Apache Tajo – Custom Functions

Apache Tajo supports the custom / user defined functions (UDFs). The custom functions can be created in python.

The custom functions are just plain python functions with decorator “@output_type(<tajo sql datatype>)” as follows −

@ouput_type(“integer”) 
def sum_py(a, b): 
   return a + b; 

The python scripts with UDFs can be registered by adding the below configuration in “tajosite.xml”.

<property> 
   <name>tajo.function.python.code-dir</name> 
   <value>file:///path/to/script1.py,file:///path/to/script2.py</value> 
</property>

Once the scripts are registered, restart the cluster and the UDFs will be available right in the SQL query as follows −

select sum_py(10, 10) as pyfn; 

Apache Tajo supports user defined aggregate functions as well but does not support user defined window functions.

Advertisements

”;

Leave a Reply

Your email address will not be published. Required fields are marked *