Apache Tajo – Introduction

Distributed Data Warehouse System

A data warehouse is a relational database designed for query and analysis rather than for transaction processing. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data that helps analysts take informed decisions in an organization. However, relational data volumes keep growing day by day. To overcome this challenge, a distributed data warehouse system shares data across multiple data repositories for the purpose of Online Analytical Processing (OLAP). Each data warehouse may belong to one or more organizations. The system performs load balancing and scales out, and its metadata is replicated and centrally distributed.

Apache Tajo is a distributed data warehouse system that uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine instead of the MapReduce framework.

Overview of SQL on Hadoop

Hadoop is an open-source framework for storing and processing big data in a distributed environment. However, Hadoop by itself has limited querying capabilities; SQL-on-Hadoop engines address this gap by letting users interact with data in Hadoop through familiar SQL commands. Examples of SQL-on-Hadoop systems are Hive, Impala, Drill, Presto, Spark, HAWQ, and Apache Tajo.

What is Apache Tajo

Apache Tajo is a relational and distributed data processing framework designed for low-latency, scalable ad-hoc query analysis. Tajo supports standard SQL and various data formats, so most existing SQL queries can be executed on it without modification. Tajo achieves fault tolerance through a restart mechanism for failed tasks and provides an extensible query rewrite engine. It performs the ETL (Extract, Transform, Load) operations needed to summarize large datasets stored on HDFS and is an alternative to Hive/Pig. The latest version of Tajo offers better connectivity to Java programs and to third-party databases such as Oracle and PostgreSQL.

Features of Apache Tajo

Apache Tajo has the following features −

Superior scalability and optimized performance
Low latency
User-defined functions
Row/columnar storage processing framework
Compatibility with HiveQL and the Hive MetaStore
Simple data flow and easy maintenance

Benefits of Apache Tajo

Apache Tajo offers the following benefits −

Easy to use
Simplified architecture
Cost-based query optimization
Vectorized query execution plans
Fast delivery
Simple I/O mechanism and support for various types of storage
Fault tolerance

Use Cases of Apache Tajo

The following are some of the use cases of Apache Tajo −

Data warehousing and analysis
The Korean firm SK Telecom ran Tajo against 1.7 terabytes of data and found it could complete queries faster than either Hive or Impala.

Data discovery
The Korean music streaming service Melon uses Tajo for analytical processing. Tajo executes ETL (extract-transform-load) jobs 1.5 to 10 times faster than Hive.

Log analysis
Bluehole Studio, a Korea-based company, developed TERA, a fantasy multiplayer online game. The company uses Tajo for game log analysis and for finding the principal causes of service quality interruptions.

Storage and Data Formats

Apache Tajo supports the following data formats −

JSON
Text file (CSV)
Parquet
Sequence file
AVRO
Protocol Buffer
Apache ORC

Tajo supports the following storage formats −

HDFS
JDBC
Amazon S3
Apache HBase
Elasticsearch
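Since Tajo speaks standard SQL, a typical ad-hoc query submitted through its interactive shell is plain SQL. A minimal sketch (the table and column names here are hypothetical, for illustration only) −

-- 'web_logs' and its columns are hypothetical
select count(*), avg(response_time) from web_logs where status = 500;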

Apache Tajo – Storage Plugins

Tajo supports various storage formats. To register a storage plugin, add its configuration to the configuration file "storage-site.json".

storage-site.json

The structure is defined as follows −

{
   "storages": {
      "storage plugin name": {
         "handler": "${class name}",
         "default-format": "plugin name"
      }
   }
}

Each storage instance is identified by a URI.

PostgreSQL Storage Handler

Tajo supports the PostgreSQL storage handler, which enables user queries to access database objects in PostgreSQL. It ships with Tajo, so you only need to configure it.

Configuration

{
   "spaces": {
      "postgre": {
         "uri": "jdbc:postgresql://hostname:port/database1",
         "configs": {
            "mapped_database": "sampledb",
            "connection_properties": {
               "user": "tajo",
               "password": "pwd"
            }
         }
      }
   }
}

Here, "database1" refers to the PostgreSQL database, which is mapped to the database "sampledb" in Tajo.
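Once this tablespace is registered, tables in the mapped PostgreSQL database can be queried from Tajo like ordinary tables. A minimal sketch, assuming the mapping above (the table name "employees" is hypothetical) −

default> \c sampledb
sampledb> select * from employees;   -- 'employees' is a hypothetical table in the mapped database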

Apache Tajo – Installation

To install Apache Tajo, you must have the following software on your system −

Hadoop version 2.3 or greater
Java version 1.7 or higher
Linux or Mac OS

Let us now continue with the following steps to install Tajo.

Verifying Java Installation

Hopefully, you have already installed Java version 8 on your machine. Now, you just need to verify it before proceeding with the Tajo installation.

Aggregate & Window Functions

This chapter explains the aggregate and window functions in detail.

Aggregation Functions

Aggregate functions produce a single result from a set of input values. The following table describes the aggregate functions in detail.

S.No. Function & Description

1 AVG(exp)
Averages the values of a column across all records in the data source.

2 CORR(expression1, expression2)
Returns the coefficient of correlation between a set of number pairs.

3 COUNT()
Returns the number of rows.

4 MAX(expression)
Returns the largest value of the selected column.

5 MIN(expression)
Returns the smallest value of the selected column.

6 SUM(expression)
Returns the sum of the given column.

7 LAST_VALUE(expression)
Returns the last value of the given column.

Window Functions

Window functions execute on a set of rows and return a single value for each row of the query. The term "window" refers to the set of rows the function operates on. A window function in a query defines its window using the OVER() clause.

The OVER() clause has the following capabilities −

Defines window partitions to form groups of rows (PARTITION BY clause)
Orders rows within a partition (ORDER BY clause)

The following table describes the window functions in detail.

Function / Return type / Description

rank() / int / Returns the rank of the current row, with gaps.
row_number() / int / Returns the number of the current row within its partition, counting from 1.
lead(value[, offset integer[, default any]]) / same as input type / Returns the value evaluated at the row that is offset rows after the current row within the partition. If there is no such row, the default value is returned.
lag(value[, offset integer[, default any]]) / same as input type / Returns the value evaluated at the row that is offset rows before the current row within the partition.
first_value(value) / same as input type / Returns the first value of the input rows.
last_value(value) / same as input type / Returns the last value of the input rows.
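For example, both kinds of functions can be tried against the "mytable" relation created in the SQL Statements chapter (a sketch that assumes that chapter's schema) −

-- aggregate: one summary row over all records
select avg(mark), max(mark), count(*) from mytable;

-- window: one rank value per input row, highest mark first
select name, mark, rank() over (order by mark desc) as mark_rank from mytable;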

Apache Tajo – JDBC Interface

Apache Tajo provides a JDBC interface to connect and execute queries. We can use the same JDBC interface to connect to Tajo from a Java-based application. In this section, let us understand how to connect to Tajo and execute commands in a sample Java application using the JDBC interface.

Download JDBC Driver

Download the JDBC driver by visiting the following link − http://apache.org/dyn/closer.cgi/tajo/tajo-0.11.3/tajo-jdbc-0.11.3.jar. The file "tajo-jdbc-0.11.3.jar" is now downloaded on your machine.

Set Class Path

To make use of the JDBC driver in your program, set the class path as follows −

CLASSPATH = path/to/tajo-jdbc-0.11.3.jar:$CLASSPATH

Connect to Tajo

Apache Tajo provides the JDBC driver as a single jar file, available at /path/to/tajo/share/jdbc-dist/tajo-jdbc-0.11.3.jar.

The connection string to connect to Apache Tajo has one of the following formats −

jdbc:tajo://host/
jdbc:tajo://host/database
jdbc:tajo://host:port/
jdbc:tajo://host:port/database

Here,

host − The hostname of the TajoMaster.
port − The port number that the server is listening on. The default port number is 26002.
database − The database name. The default database name is "default".

Java Application

Let us now walk through the sample Java application.

Coding

import java.sql.*;
import org.apache.tajo.jdbc.TajoDriver;

public class TajoJdbcSample {
   public static void main(String[] args) {
      Connection connection = null;
      Statement statement = null;
      try {
         // register the Tajo JDBC driver and open a connection to the default database
         Class.forName("org.apache.tajo.jdbc.TajoDriver");
         connection = DriverManager.getConnection("jdbc:tajo://localhost/default");
         statement = connection.createStatement();

         // fetch records from mytable
         String sql = "select * from mytable";
         ResultSet resultSet = statement.executeQuery(sql);

         while (resultSet.next()) {
            int id = resultSet.getInt("id");
            String name = resultSet.getString("name");
            System.out.print("ID: " + id + "; Name: " + name + "\n");
         }
         resultSet.close();
         statement.close();
         connection.close();
      } catch (SQLException sqlException) {
         sqlException.printStackTrace();
      } catch (Exception exception) {
         exception.printStackTrace();
      }
   }
}

The application can be compiled and run using the following commands.

Compilation

javac -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample.java

Execution

java -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample

Result

The above commands will generate the following result −

ID: 1; Name: Adam
ID: 2; Name: Amit
ID: 3; Name: Bob
ID: 4; Name: David
ID: 5; Name: Esha
ID: 6; Name: Ganga
ID: 7; Name: Jack
ID: 8; Name: Leena
ID: 9; Name: Mary
ID: 10; Name: Peter

Apache Tajo – SQL Statements

In the previous chapter, you learned how to create tables in Tajo. This chapter explains the SQL statements in Tajo.

Create Table Statement

Before creating a table, create a text file "students.csv" in the Tajo installation directory. The file contains one comma-separated record per student (id, name, address, age, mark) −

students.csv

1,Adam,23 New Street,21,90
2,Amit,12 Old Street,13,95
3,Bob,10 Cross Street,12,80
4,David,15 Express Avenue,12,85
5,Esha,20 Garden Street,13,50
6,Ganga,25 North Street,12,55
7,Jack,2 Park Street,12,60
8,Leena,24 South Street,12,70
9,Mary,5 West Street,12,75
10,Peter,16 Park Avenue,12,95

After the file has been created, move to the terminal and start the Tajo server and shell one by one.

Create Database

Create a new database using the following command −

Query

default> create database sampledb;
OK

Connect to the database "sampledb", which has now been created.

default> \c sampledb
You are now connected to database "sampledb" as user "user1".

Then, create a table in "sampledb" as follows −

Query

sampledb> create external table mytable(id int, name text, address text, age int, mark int)
   using text with('text.delimiter' = ',') location 'file:/Users/workspace/Tajo/students.csv';

Result

The above query will generate the following result.

OK

Here, an external table is created, so you only have to specify the file location. If you want to load the table from HDFS, use hdfs instead of file. Next, since the "students.csv" file contains comma-separated values, the text.delimiter option is set to ','.

You have now successfully created "mytable" in "sampledb".

Show Table

To show tables in Tajo, use the following query.

Query

sampledb> \d mytable

Result

The above query will generate the following result.

table name: sampledb.mytable
table uri: file:/Users/workspace/Tajo/students.csv
store type: TEXT
number of rows: unknown
volume: 261 B
Options:
   'timezone' = 'Asia/Kolkata'
   'text.null' = '\N'
   'text.delimiter' = ','
schema:
   id INT4
   name TEXT
   address TEXT
   age INT4
   mark INT4

List Table

To fetch all the records in the table, type the following query −

Query

sampledb> select * from mytable;

Result

The above query lists all ten student records.

Insert Table Statement

Tajo uses the following syntax to insert records into a table.

Syntax

create table table1 (col1 int8, col2 text, col3 text);   -- the target table schema should match the source
insert overwrite into table1 select * from table2;
(or)
insert overwrite into location '/dir/subdir' select * from table;

Tajo's insert statement is similar to the INSERT INTO SELECT statement of SQL.

Query

Let's create a table whose data will be overwritten with that of an existing table.

sampledb> create table test(sno int, name text, addr text, age int, mark int);
OK
sampledb> \d

Result

The above query will generate the following result.

mytable
test

Insert Records

To insert records into the "test" table, type the following query.

Query

sampledb> insert overwrite into test select * from mytable;

Result

The above query will generate the following result.

Progress: 100%, response time: 0.518 sec

Here, the records of "mytable" overwrite the "test" table. If you don't want to create the "test" table, you can instead assign a physical path location directly, using the alternative form of the insert syntax, as shown in the sketch after this section.

Fetch Records

Use the following query to list all the records in the "test" table −

Query

sampledb> select * from test;

Result

The above query returns the same ten records that were copied from "mytable".
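As noted above, the location-based variant of INSERT OVERWRITE writes the query result directly to a directory instead of a catalog table. A minimal sketch (the output path is hypothetical) −

-- the output directory path is hypothetical
sampledb> insert overwrite into location '/tmp/tajo-output/students' select * from mytable;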
Alter Table Statement

The ALTER TABLE statement is used to rename a table, add columns, or modify the properties of an existing table.

To rename a table, use the following syntax −

Alter table table1 RENAME TO table2;

Query

sampledb> alter table test rename to students;

Result

The above query will generate the following result.

OK

To verify the changed table name, use the following query.

sampledb> \d

mytable
students

Now the table "test" has been renamed to "students".

Add Column

To insert a new column into the "students" table, type the following syntax −

Alter table <table_name> ADD COLUMN <column_name> <data_type>

Query

sampledb> alter table students add column grade text;

Result

The above query will generate the following result.

OK

Set Property

This property is used to change the table's properties.

Query

sampledb> ALTER TABLE students SET PROPERTY 'compression.type' = 'RECORD',
   'compression.codec' = 'org.apache.hadoop.io.compress.SnappyCodec';
OK

Here, the compression type and codec properties are assigned.

To change the text delimiter property, use the following −

Query

sampledb> ALTER TABLE students SET PROPERTY 'text.delimiter' = ',';
OK

Result

The above query will generate the following result.

sampledb> \d students

table name: sampledb.students
table uri: file:/tmp/tajo-user1/warehouse/sampledb/students
store type: TEXT
number of rows: 10
volume: 228 B
Options:
   'compression.type' = 'RECORD'
   'timezone' = 'Asia/Kolkata'
   'text.null' = '\N'
   'compression.codec' = 'org.apache.hadoop.io.compress.SnappyCodec'
   'text.delimiter' = ','
schema:
   id INT4
   name TEXT
   addr TEXT
   age INT4
   mark INT4
   grade TEXT

The above result shows that the table's properties were changed using SET PROPERTY.

Select Statement

The SELECT statement is used to select data from a database. The syntax for the SELECT statement is as follows −

SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...]
   [FROM <table reference> [[AS] <table alias name>] [, ...]]
   [WHERE <condition>]
   [GROUP BY <expression> [, ...]]
   [HAVING <condition>]
   [ORDER BY <expression> [ASC|DESC] [NULLS (FIRST|LAST)] [, ...]]

Where Clause

The WHERE clause is used to filter records from the table.

Query

sampledb> select * from mytable where id > 5;

Result

The query returns the records of those students whose id is greater than 5.

Query

sampledb> select * from mytable where name = 'Peter';

Result

The above query will generate the following result.

Progress: 100%, response time: 0.117 sec

id, name, address, age
-------------------------------
10, Peter, 16 park avenue, 12

The result filters Peter's record only.

Distinct Clause

A table column may contain duplicate values. The DISTINCT keyword can be used to return only distinct (different) values.

Syntax

SELECT DISTINCT column1, column2 FROM table_name;

Query

sampledb> select distinct age from mytable;

Result

The above query will generate the following result.
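Based on the sample data loaded earlier in this chapter, the distinct ages are 12, 13, and 21, so the output would look roughly as follows (a sketch; row order may differ) −

age
-------------------------------
12
13
21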

OpenStack Swift Integration

Swift is a distributed and consistent object/blob store. Swift offers cloud storage software so that you can store and retrieve lots of data with a simple API. Tajo supports Swift integration.

The following are the prerequisites of Swift integration −

Swift
Hadoop

core-site.xml

Add the following changes to the Hadoop "core-site.xml" file −

<property>
   <name>fs.swift.impl</name>
   <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
   <description>File system implementation for Swift</description>
</property>

<property>
   <name>fs.swift.blocksize</name>
   <value>131072</value>
   <description>Split size in KB</description>
</property>

This enables Hadoop to access Swift objects. After you have made all the changes, move to the Tajo directory to set the Swift environment variable.

conf/tajo-env.sh

Open the Tajo configuration file and set the environment variable as follows −

$ vi conf/tajo-env.sh
export TAJO_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-openstack-x.x.x.jar

Now, Tajo will be able to query the data using Swift.

Create Table

Let's create an external table to access Swift objects in Tajo as follows −

default> create external table swift(num1 int, num2 text, num3 float)
   using text with ('text.delimiter' = '|')
   location 'swift://bucket-name/table1';

After the table has been created, you can run SQL queries against it.
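For instance, once the table above is registered, the Swift-backed data can be queried like any local table. A minimal sketch using the table just defined −

default> select num1, num2 from swift where num3 > 0.5;   -- filter on the float column of the table defined above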

Apache Tajo – SQL Queries

This chapter explains the following significant queries −

Predicates
Explain
Join

Let us proceed and perform the queries.

Predicates

A predicate is an expression that evaluates to true, false, or UNKNOWN. Predicates are used in the search condition of WHERE clauses and HAVING clauses and in other constructs where a Boolean value is required.

IN Predicate

Determines whether the value of the expression to test matches any value in the subquery or the list. A subquery is an ordinary SELECT statement that has a result set of one column and one or more rows. This column, or all expressions in the list, must have the same data type as the expression to test.

Syntax

IN::=
<expression to test> [NOT] IN (<subquery>) | (<expression1>, ...)

Query

select id, name, address from mytable where id in (2, 3, 4);

Result

The above query will generate the following result.

id, name, address
-------------------------------
2, Amit, 12 old street
3, Bob, 10 cross street
4, David, 15 express avenue

The query returns the records from mytable for the students with id 2, 3, and 4.

Query

select id, name, address from mytable where id not in (2, 3, 4);

Result

The above query will generate the following result.

id, name, address
-------------------------------
1, Adam, 23 new street
5, Esha, 20 garden street
6, Ganga, 25 north street
7, Jack, 2 park street
8, Leena, 24 south street
9, Mary, 5 west street
10, Peter, 16 park avenue

The above query returns the records from mytable whose id is not 2, 3, or 4.

LIKE Predicate

The LIKE predicate compares the string specified in the first expression, referred to as the value to test, with the pattern defined in the second expression. The pattern may contain any combination of wildcards such as −

Underscore (_), which matches any single character in the value to test.
Percent sign (%), which matches any string of zero or more characters in the value to test.

Syntax

LIKE::=
<expression for calculating the string value> [NOT] LIKE
<expression for calculating the string value> [ESCAPE <symbol>]

Query

select * from mytable where name like 'A%';

Result

The above query will generate the following result.

id, name, address, age, mark
-------------------------------
1, Adam, 23 new street, 12, 90
2, Amit, 12 old street, 13, 95

The query returns the records of those students whose names start with 'A'.

Query

select * from mytable where name like '_a%';

Result

The above query will generate the following result.

id, name, address, age, mark
-------------------------------
4, David, 15 express avenue, 12, 85
6, Ganga, 25 north street, 12, 55
7, Jack, 2 park street, 12, 60
9, Mary, 5 west street, 12, 75

The query returns the records of those students whose names have 'a' as the second character.

Using NULL Values in Search Conditions

Let us now understand how to use NULL values in search conditions.

Syntax

Predicate
IS [NOT] NULL

Query

select name from mytable where name is not null;

Result

The above query will generate the following result.

name
-------------------------------
Adam
Amit
Bob
David
Esha
Ganga
Jack
Leena
Mary
Peter
(10 rows, 0.076 sec, 163 B selected)

Here, the condition is true for every row, so the query returns all the names from the table.

Query

Let us now check the query with the NULL condition.

default> select name from mytable where name is null;

Result

The above query will generate the following result.
name
-------------------------------
(0 rows, 0.068 sec, 0 B selected)

Since no name is null, the query returns an empty result.

Explain

EXPLAIN is used to obtain a query execution plan. It shows the logical and global plan execution of a statement.

Logical Plan

Query

explain select * from mytable;

Result

The above query will generate the following result.

explain
-------------------------------
=> target list: default.mytable.id (INT4), default.mytable.name (TEXT),
   default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4)
=> out schema: {
   (5) default.mytable.id (INT4), default.mytable.name (TEXT),
   default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4)
}
=> in schema: {
   (5) default.mytable.id (INT4), default.mytable.name (TEXT),
   default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4)
}

The query result shows the logical plan for the given table. The logical plan returns the following three results −

Target list
Out schema
In schema

Global Plan

Query

explain global select * from mytable;

Result

The above query will generate the following result.

explain
-------------------------------------------------------------------------------
Execution Block Graph (TERMINAL - eb_0000000000000_0000_000002)
-------------------------------------------------------------------------------
|-eb_0000000000000_0000_000002
   |-eb_0000000000000_0000_000001
-------------------------------------------------------------------------------
Order of Execution
-------------------------------------------------------------------------------
1: eb_0000000000000_0000_000001
2: eb_0000000000000_0000_000002
-------------------------------------------------------------------------------

=======================================================
Block Id: eb_0000000000000_0000_000001 [ROOT]
=======================================================
SCAN(0) on default.mytable
=> target list: default.mytable.id (INT4), default.mytable.name (TEXT),
   default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4)
=> out schema: {
   (5) default.mytable.id (INT4), default.mytable.name (TEXT),
   default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4)
}
=> in schema: {
   (5) default.mytable.id (INT4), default.mytable.name (TEXT),
   default.mytable.address (TEXT), default.mytable.age (INT4), default.mytable.mark (INT4)
}

=======================================================
Block Id: eb_0000000000000_0000_000002 [TERMINAL]
=======================================================
(24 rows, 0.065 sec, 0 B selected)

Here, the global plan shows the execution block IDs, the order of execution, and related information.

Joins

SQL joins are used to combine rows from two or more tables. The following are the different types of SQL joins −

Inner join
{ LEFT | RIGHT | FULL } OUTER JOIN
Cross join
Self join
Natural join

Consider the following two tables to perform the join operations.

Table1 − Customers

Id   Name         Address             Age
1    Customer 1   23 Old Street       21
2    Customer 2   12 New Street       23
3    Customer 3   10 Express Avenue   22
4    Customer 4   15 Express Avenue   22
5    Customer 5   20 Garden Street    33
6    Customer 6   21 North Street     25

Table2 − customer_order

Id   Order Id   Emp Id
1    1          101
2    2          102
3    3          103
4    4          104
5    5          105

Let us now proceed and perform the SQL join operations on the above two tables, starting with the inner join shown below.

Inner Join

The inner join selects all rows from both tables when there is a match between the columns in the two tables.

Syntax

SELECT column_name(s) FROM table1 INNER JOIN table2 ON table1.column_name = table2.column_name;
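Query

A minimal sketch of an inner join between the two tables above (the identifiers order_id and the join key are assumptions based on the table listings) −

-- 'order_id' and the join on matching id columns are assumptions
default> select c.name, o.order_id from customers c inner join customer_order o on c.id = o.id;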

Apache Tajo – Discussion

Apache Tajo is an open-source distributed data warehouse framework for Hadoop. Tajo was initially started by Gruter, a Hadoop-based infrastructure company in South Korea. Later, experts from Intel, Etsy, NASA, Cloudera, and Hortonworks also contributed to the project. "Tajo" means "ostrich" in Korean. In March 2014, Tajo was granted top-level Apache project status. This tutorial explores the basics of Tajo and, moving on, explains cluster setup, the Tajo shell, SQL queries, and integration with other big data technologies, and finally concludes with some examples.

Apache Tajo – Architecture

The following illustration depicts the architecture of Apache Tajo. The following table describes each of the components in detail.

S.No. Component & Description

1 Client
The client submits SQL statements to the Tajo Master to get the result.

2 Master
The master is the main daemon. It is responsible for query planning and is the coordinator for workers.

3 Catalog server
Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via a JDBC client.

4 Worker
The master node assigns tasks to worker nodes. A TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly.

5 Query Master
The Tajo master assigns a query to a Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches TaskRunners and schedules tasks to them. The main role of the Query Master is to monitor the running tasks and report them to the master node.

6 Node Managers
Manages the resources of the worker node. It decides on allocating requests to the node.

7 TaskRunner
Acts as a local query execution engine. It is used to run and monitor the query process. The TaskRunner processes one task at a time. A task has the following three main attributes −

Logical plan − the execution block that created the task
A fragment − an input path, an offset range, and a schema
Fetch URIs

8 Query Executor
It is used to execute a query.

9 Storage service
Connects the underlying data storage to Tajo.

Workflow

Tajo uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across the cluster nodes. The master is mainly responsible for query planning and acts as the coordinator for workers. The master divides a query into small tasks and assigns them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators. In addition, Tajo can control the distributed data flow more flexibly than MapReduce and supports indexing techniques.

The web-based interface of Tajo has the following capabilities −

Option to find how the submitted queries are planned
Option to find how the queries are distributed across the nodes
Option to check the status of the cluster and nodes