AVRO – Home

Apache Avro is a language-neutral data serialization system developed by Doug Cutting, the father of Hadoop. This brief tutorial gives an overview of how to set up Avro and how to serialize and deserialize data with it.

Audience

This tutorial is intended for professionals who want to learn the basics of Big Data analytics using the Hadoop framework and become Hadoop developers. It is also a handy resource for anyone who wants to use Avro for data serialization and deserialization.

Prerequisites

Before you proceed with this tutorial, we assume that you are already familiar with Hadoop's architecture and APIs, and that you have experience writing basic applications, preferably in Java.

AWS Quicksight – Data Source Limit

When you use different data sources in Quicksight, certain limits apply depending on the data source.

File

The files specified in a manifest file can total up to 25 GB. This limit applies to the file size after import into SPICE. A manifest file can list up to 1,000 files, and there is also a limit on the number of columns in each file.

Table and Query

When you query a large table, use a WHERE or HAVING condition to reduce the amount of data imported into SPICE. A query result imported into SPICE must not exceed 25 GB. You can also deselect some of the columns when importing data into SPICE.

If your data source contains values of a data type that Quicksight does not support, Quicksight skips those values.

    Person ID   Sales Date   Amount
    001         10/14/2017   12.43
    002         5/3/2017     25.00
    003         Unknown      18.17
    004         3/8/2019     86.02

Given the values above, Quicksight drops the row that has no valid date value (Person ID 003) when it imports the data into the dataset.
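The row-dropping behavior described above can be illustrated with a small Java sketch. This is not how Quicksight is implemented; the class name, the helper methods, and the M/d/yyyy date pattern are assumptions made only to show the idea of skipping rows whose date column cannot be parsed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DropInvalidDates {
    // Assumed date layout of the sample table (month/day/year).
    private static final DateTimeFormatter US_DATE = DateTimeFormatter.ofPattern("M/d/yyyy");

    // The sample rows from the table above: Person ID, Sales Date, Amount.
    static final List<String[]> SAMPLE = Arrays.asList(
            new String[] {"001", "10/14/2017", "12.43"},
            new String[] {"002", "5/3/2017", "25.00"},
            new String[] {"003", "Unknown", "18.17"},
            new String[] {"004", "3/8/2019", "86.02"});

    // Returns true when the Sales Date column (index 1) parses as a date.
    static boolean hasValidDate(String[] row) {
        try {
            LocalDate.parse(row[1], US_DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Keeps only the rows whose date value can be interpreted.
    static List<String[]> keepImportableRows(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (hasValidDate(row)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (String[] row : keepImportableRows(SAMPLE)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Running the sketch prints the rows for Person IDs 001, 002, and 004; the row with the date value Unknown is left out.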
The following data types are supported in Quicksight, listed by database source:

Amazon Athena, Presto
    Number:    bigint, decimal, double, integer, real, smallint, tinyint
    String:    char, varchar
    Date time: date, timestamp
    Boolean:   Boolean

Amazon Aurora, MariaDB, and MySQL
    Number:    bigint, decimal, double, int, integer, mediumint, numeric, smallint, tinyint
    String:    char, enum, set, text, varchar
    Date time: date, datetime, timestamp, year
    Boolean:   (none listed)

PostgreSQL
    Number:    bigint, decimal, double, integer, numeric, precision, real, smallint
    String:    char, character, text, varchar, varying character
    Date time: date, timestamp
    Boolean:   Boolean

Apache Spark
    Number:    bigint, decimal, double, integer, real, smallint, tinyint
    String:    varchar
    Date time: date, timestamp
    Boolean:   Boolean

Snowflake
    Number:    bigint, byteint, decimal, double, doubleprecision, float, float4, float8, int, integer, number, numeric, real, smallint, tinyint
    String:    char, character, string, text, varchar
    Date time: date, datetime, time, timestamp, timestamp_*
    Boolean:   Boolean

Microsoft SQL Server
    Number:    bigint, bit, decimal, int, money, numeric, real, smallint, smallmoney, tinyint
    String:    char, nchar, nvarchar, text, varchar
    Date time: date, datetime, datetime2, datetimeoffset, smalldatetime
    Boolean:   bit

Apache Tajo – Storage Plugins

Tajo supports various storage formats. To register a storage plugin configuration, add the changes to the configuration file storage-site.json.

storage-site.json

The structure is defined as follows:

    {
      "storages": {
        "storage plugin name": {
          "handler": "${class name}",
          "default-format": "plugin name"
        }
      }
    }

Each storage instance is identified by a URI.

PostgreSQL Storage Handler

Tajo supports a PostgreSQL storage handler, which lets user queries access database objects in PostgreSQL. It is the default storage handler in Tajo, so you can configure it easily.

Configuration

    {
      "spaces": {
        "postgre": {
          "uri": "jdbc:postgresql://hostname:port/database1",
          "configs": {
            "mapped_database": "sampledb",
            "connection_properties": {
              "user": "tajo",
              "password": "pwd"
            }
          }
        }
      }
    }

Here, database1 refers to the PostgreSQL database that is mapped to the database sampledb in Tajo.

AWS Quicksight – Using Data Sources

AWS Quicksight accepts data from various sources. When you click "New Dataset" on the home page, it shows all the data sources that can be used, both internal and external.

Let us go through connecting Quicksight with some of the most commonly used data sources.

Uploading a file from system

You can upload files in .csv, .tsv, .clf, .elf, .xlsx, and JSON formats only. When you click the Upload a File button, you provide the location of the file that you want to use to create the dataset. Once you select the file, Quicksight automatically recognizes it and displays the data.

Using a file from S3

Under Data source name, enter the name to be displayed for the dataset that will be created. You also need to either upload a manifest file from your local system or provide the S3 location of the manifest file.

A manifest file is a JSON file that specifies the URL/location of the input files and their format. You can list more than one input file, provided they all have the same format. Here is an example of a manifest file. The URIs parameter passes the S3 location of each input file.

    {
      "fileLocations": [
        {
          "URIs": [
            "url of first file",
            "url of second file",
            "url of 3rd file and so on"
          ]
        }
      ],
      "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "textqualifier": "\"",
        "containsHeader": "true"
      }
    }

The parameters passed in globalUploadSettings are the default ones. You can change them as per your requirements.

MySQL

Enter the database information in the fields to connect to your database. Once Quicksight is connected to your database, you can import data from it.
The following information is required when you connect to any RDBMS database:

    DSN name
    Type of connection
    Database server name
    Port
    Database name
    User name
    Password

The following RDBMS-based data sources are supported in Quicksight:

    Amazon Athena
    Amazon Aurora
    Amazon Redshift
    Amazon Redshift Spectrum
    Amazon S3
    Amazon S3 Analytics
    Apache Spark 2.0 or later
    MariaDB 10.0 or later
    Microsoft SQL Server 2012 or later
    MySQL 5.1 or later
    PostgreSQL 9.3.1 or later
    Presto 0.167 or later
    Snowflake
    Teradata 14.0 or later

Athena

Athena is the AWS tool for running queries on tables. You can choose any table from Athena, or run a custom query on those tables and use the output of that query in Quicksight.

Choosing Athena as a data source takes a couple of steps:

    1. Choose Athena and enter the name you want to give the data source in Quicksight. Click "Validate Connection". Once the connection is validated, click the "Create new source" button.
    2. Choose the table name from the dropdown. The dropdown shows the databases present in Athena, and each database in turn shows its tables. Alternatively, click "Use custom SQL" to run a query on Athena tables.

Once done, click "Edit/Preview data" or "Visualize" to either edit your data or visualize it directly, as per your requirement.

Deleting a data source

Deleting a data source that is in use in any Quicksight dashboard can make the associated dataset unusable. This usually happens when the dataset queries a SQL-based data source.

When a dataset is based on S3, Salesforce, or SPICE, deleting the data source does not affect your ability to use the dataset, because the data is stored in SPICE; however, the refresh option is not available in this case.

To delete a data source, navigate to the From Existing Data Source tab on the dataset creation page and select the data source. Before deletion, you can also check the estimated table size and other details of the data source.
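The S3 manifest file described earlier in this chapter can also be generated programmatically. The following Java sketch assembles a manifest with the same fileLocations and globalUploadSettings keys shown in the example; the class name, the helper methods, and the s3://example-bucket/... locations are hypothetical and serve only as an illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ManifestSketch {

    // Wraps a value in double quotes, escaping backslashes and double quotes for JSON.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds manifest text for a list of input file locations that share one format.
    static String buildManifest(List<String> uris) {
        String uriList = uris.stream()
                .map(ManifestSketch::quote)
                .collect(Collectors.joining(", "));
        return "{\n"
                + "  \"fileLocations\": [ { \"URIs\": [ " + uriList + " ] } ],\n"
                + "  \"globalUploadSettings\": {\n"
                + "    \"format\": \"CSV\",\n"
                + "    \"delimiter\": \",\",\n"
                + "    \"textqualifier\": " + quote("\"") + ",\n"
                + "    \"containsHeader\": \"true\"\n"
                + "  }\n"
                + "}";
    }

    public static void main(String[] args) {
        // Hypothetical S3 locations used only for illustration.
        System.out.println(buildManifest(Arrays.asList(
                "s3://example-bucket/sales-2017.csv",
                "s3://example-bucket/sales-2018.csv")));
    }
}
```

Keeping globalUploadSettings inside the top-level object, next to fileLocations, is what makes the generated text a single valid JSON document.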

AVRO – Environment Setup

The Apache Software Foundation provides Avro in various releases. You can download the required release from the Apache mirrors. Let us see how to set up the environment to work with Avro.

Downloading Avro

To download Apache Avro, proceed as follows:

    1. Open the Apache Avro homepage on Apache.org.
    2. Click on project → releases. You will get a list of releases.
    3. Select the latest release, which leads you to a download link.
    4. mirror.nexcess is one of the links where you can find the libraries for all the languages that Avro supports.

You can select and download the library for any of the languages provided. In this tutorial, we use Java, so download the jar files avro-1.7.7.jar and avro-tools-1.7.7.jar.

Avro with Eclipse

To use Avro in the Eclipse environment, follow the steps given below:

    Step 1. Open Eclipse.
    Step 2. Create a project.
    Step 3. Right-click on the project name. You will get a shortcut menu.
    Step 4. Click on Build Path. It leads you to another shortcut menu.
    Step 5. Click on Configure Build Path… The Properties window of your project opens.
    Step 6. Under the Libraries tab, click on the Add External JARs… button.
    Step 7. Select the jar file avro-1.7.7.jar that you downloaded.
    Step 8. Click on OK.

Avro with Maven

You can also get the Avro library into your project using Maven. Given below is the pom.xml file for Avro.
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

      <modelVersion>4.0.0</modelVersion>
      <groupId>Test</groupId>
      <artifactId>Test</artifactId>
      <version>0.0.1-SNAPSHOT</version>

      <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
          <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
        </plugins>
      </build>

      <dependencies>
        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro</artifactId>
          <version>1.7.7</version>
        </dependency>
        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-tools</artifactId>
          <version>1.7.7</version>
        </dependency>
        <dependency>
          <groupId>org.apache.logging.log4j</groupId>
          <artifactId>log4j-api</artifactId>
          <version>2.0-beta9</version>
        </dependency>
        <dependency>
          <groupId>org.apache.logging.log4j</groupId>
          <artifactId>log4j-core</artifactId>
          <version>2.0-beta9</version>
        </dependency>
      </dependencies>

    </project>

Setting Classpath

To work with Avro in a Linux environment, download the following jar files:

    avro-1.7.7.jar
    avro-tools-1.7.7.jar
    log4j-api-2.0-beta9.jar
    log4j-core-2.0-beta9.jar

Copy these files into a folder and add that folder to the classpath in the ~/.bashrc file, as shown below.

    # class path for Avro
    export CLASSPATH=$CLASSPATH:/home/Hadoop/Avro_Work/jars/*

AVRO – Reference API

In the previous chapter, we described the input type of Avro, i.e., Avro schemas. In this chapter, we explain the classes and methods used to serialize and deserialize data with Avro schemas.

SpecificDatumWriter Class

This class belongs to the package org.apache.avro.specific. It implements the DatumWriter interface, which converts Java objects into an in-memory serialized format.

Constructor

    1. SpecificDatumWriter(Schema schema)

Method

    1. SpecificData getSpecificData()
       Returns the SpecificData implementation used by this writer.

SpecificDatumReader Class

This class belongs to the package org.apache.avro.specific. It implements the DatumReader interface, which reads the data of a schema and determines the in-memory data representation. SpecificDatumReader is the class that supports generated Java classes.

Constructor

    1. SpecificDatumReader(Schema schema)
       Constructs a reader where the writer's and reader's schemas are the same.

Methods

    1. SpecificData getSpecificData()
       Returns the contained SpecificData.
    2. void setSchema(Schema actual)
       Sets the writer's schema.

DataFileWriter

This class writes a sequence of serialized records of data conforming to a schema, along with the schema, to a file. In this tutorial, DataFileWriter is instantiated for the emp class.

Constructor

    1. DataFileWriter(DatumWriter<D> dout)

Methods

    1. void append(D datum)
       Appends a datum to a file.
    2. DataFileWriter<D> appendTo(File file)
       Opens a writer that appends to an existing file.

DataFileReader

This class provides random access to files written with DataFileWriter. It inherits from the class DataFileStream.

Constructor

    1. DataFileReader(File file, DatumReader<D> reader)

Methods

    1. next()
       Reads the next datum in the file.
    2. boolean hasNext()
       Returns true if more entries remain in this file.
Class Schema.Parser

This class is a parser for JSON-format schemas. It contains methods to parse a schema. It belongs to the org.apache.avro package.

Constructor

    1. Schema.Parser()

Methods

    1. parse(File file)
       Parses the schema provided in the given file.
    2. parse(InputStream in)
       Parses the schema provided in the given InputStream.
    3. parse(String s)
       Parses the schema provided in the given String.

Interface GenericRecord

This interface provides methods to access fields by name as well as by index.

Methods

    1. Object get(String key)
       Returns the value of the field with the given name.
    2. void put(String key, Object v)
       Sets the value of a field given its name.

Class GenericData.Record

Constructor

    1. GenericData.Record(Schema schema)

Methods

    1. Object get(String key)
       Returns the value of the field with the given name.
    2. Schema getSchema()
       Returns the schema of this instance.
    3. void put(int i, Object v)
       Sets the value of a field given its position in the schema.
    4. void put(String key, Object value)
       Sets the value of a field given its name.

AVRO – Quick Guide

AVRO – Overview

To transfer data over a network or to store it persistently, you need to serialize the data. In addition to the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.

This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages. In this tutorial, we demonstrate the examples using the Java library.

What is Avro?

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop Writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool for serializing data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data together with its built-in schema, into a compact binary format that can be deserialized by any application.

Avro uses JSON to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. Because the schema is always present when data is read, each datum can be written without per-value overhead. Serialization is fast and the resulting serialized data is small. The schema is stored along with the Avro data in a file for any further processing.

In RPC, the client and the server exchange schemas during the connection. This exchange helps resolve differences such as same-named fields, missing fields, and extra fields.

Avro schemas are defined in JSON, which simplifies implementation in languages that already have JSON libraries.

Besides Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.
Comparison with Thrift and Protocol Buffers

Thrift and Protocol Buffers are the libraries that compete most closely with Avro. Avro differs from these frameworks in the following ways:

    Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.
    Avro is built into the Hadoop ecosystem. Thrift and Protocol Buffers are not.
    Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.

    Property                 Avro   Thrift & Protocol Buffers
    Dynamic schema           Yes    No
    Built into Hadoop        Yes    No
    Schema in JSON           Yes    No
    No need to compile       Yes    No
    No need to declare IDs   Yes    No
    Bleeding edge            Yes    No

Features of Avro

Listed below are some of the prominent features of Avro:

    Avro is a language-neutral data serialization system.
    It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
    Avro creates a binary structured format that is both compressible and splittable. Hence it can be used efficiently as the input to Hadoop MapReduce jobs.
    Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, processed in Hadoop, and the results can be fed to a third language.
    Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
    Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section.
    Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

General Working of Avro

To use Avro, you need to follow the given workflow:

Step 1 − Create schemas. Here you need to design an Avro schema according to your data.
Step 2 − Read the schemas into your program. This is done in one of two ways:

    By Generating a Class Corresponding to the Schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
    By Using the Parsers Library − You can read the schema directly using the parsers library.

Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.

Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.

AVRO – Serialization

Data is serialized for two objectives:

    For persistent storage
    To transport the data over a network

What is Serialization?

Serialization is the process of translating data structures or object state into binary or textual form, in order to transport the data over a network or to store it on persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is also called marshalling, and deserialization is also called unmarshalling.

Serialization in Java

Java provides a mechanism called object serialization, in which an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and the bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively, in Java.

Serialization in Hadoop

Generally, in distributed systems like Hadoop, serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

To establish interprocess communication between the nodes connected in a network, the RPC technique was used.
RPC used internal serialization to convert the message into binary format before sending it to the remote node over the network. At the other end, the remote system deserializes the binary stream into the original message.

The RPC serialization format is required to be as follows:

    Compact − To make the best use
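As a sketch of the Java object serialization mechanism described earlier in this chapter, the following example writes an object to a byte array with ObjectOutputStream and recreates it with ObjectInputStream. The Employee class, its fields, and the helper method names are hypothetical and exist only for this illustration; an in-memory byte array stands in for a file or a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class JavaSerializationSketch {

    // A hypothetical serializable type used only for this example.
    public static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final int id;

        public Employee(String name, int id) {
            this.name = name;
            this.id = id;
        }
    }

    // Serializes (marshals) an object into a sequence of bytes.
    static byte[] serialize(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.toByteArray();
    }

    // Deserializes (unmarshals) the bytes back into an object in memory.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Convenience wrapper: serialize, then deserialize, rethrowing failures unchecked.
    static Employee roundTrip(Employee employee) {
        try {
            return (Employee) deserialize(serialize(employee));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Employee copy = roundTrip(new Employee("Asha", 7));
        System.out.println(copy.id + " " + copy.name);
    }
}
```

The byte stream produced this way carries Java class information, which is one reason the format is tied to Java, whereas Avro describes the data with a language-independent schema.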

Aggregate & Window Functions

This chapter explains the aggregate and window functions in detail.

Aggregation Functions

Aggregate functions produce a single result from a set of input values. The following table describes the aggregate functions.

    S.No.  Function                         Description
    1      AVG(exp)                         Averages a column over all records in a data source.
    2      CORR(expression1, expression2)   Returns the coefficient of correlation between a set of number pairs.
    3      COUNT()                          Returns the number of rows.
    4      MAX(expression)                  Returns the largest value of the selected column.
    5      MIN(expression)                  Returns the smallest value of the selected column.
    6      SUM(expression)                  Returns the sum of the given column.
    7      LAST_VALUE(expression)           Returns the last value of the given column.

Window Function

Window functions execute on a set of rows and return a single value for each row of the query. The term window means the set of rows that the function operates on.

A window function in a query defines the window using the OVER() clause. The OVER() clause has the following capabilities:

    Defines window partitions to form groups of rows (PARTITION BY clause).
    Orders rows within a partition (ORDER BY clause).

The following table describes the window functions.

    Function                                       Return type          Description
    rank()                                         int                  Returns the rank of the current row, with gaps.
    row_num()                                      int                  Returns the number of the current row within its partition, counting from 1.
    lead(value[, offset integer[, default any]])   Same as input type   Returns value evaluated at the row that is offset rows after the current row within the partition. If there is no such row, the default value is returned.
    lag(value[, offset integer[, default any]])    Same as input type   Returns value evaluated at the row that is offset rows before the current row within the partition.
    first_value(value)                             Same as input type   Returns the first value of the input rows.
    last_value(value)                              Same as input type   Returns the last value of the input rows.
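To make the semantics of rank(), lead(), and lag() concrete, here is a minimal Java sketch that computes them over one partition held in memory. It is not the query engine's implementation; the class and method names are hypothetical, and the input list is assumed to be already ordered the way the ORDER BY clause would order it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowFunctionSketch {

    // rank(): equal values share a rank, and the next distinct value skips ahead (gaps).
    static List<Integer> rank(List<Integer> orderedValues) {
        List<Integer> ranks = new ArrayList<>();
        for (int i = 0; i < orderedValues.size(); i++) {
            boolean tie = i > 0 && orderedValues.get(i).equals(orderedValues.get(i - 1));
            ranks.add(tie ? ranks.get(i - 1) : i + 1);
        }
        return ranks;
    }

    // lead(): value at the row `offset` rows after the current one, else the default.
    static <T> List<T> lead(List<T> rows, int offset, T defaultValue) {
        List<T> result = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            int target = i + offset;
            boolean inPartition = target >= 0 && target < rows.size();
            result.add(inPartition ? rows.get(target) : defaultValue);
        }
        return result;
    }

    // lag(): value at the row `offset` rows before the current one, else the default.
    static <T> List<T> lag(List<T> rows, int offset, T defaultValue) {
        return lead(rows, -offset, defaultValue);
    }

    public static void main(String[] args) {
        List<Integer> scores = Arrays.asList(10, 20, 20, 30);
        System.out.println(rank(scores));          // [1, 2, 2, 4]
        System.out.println(lead(scores, 1, null)); // [20, 20, 30, null]
        System.out.println(lag(scores, 1, null));  // [null, 10, 20, 20]
    }
}
```

The two rows with the value 20 share rank 2, and the next row gets rank 4 rather than 3, which is what "with gaps" means in the table above.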

Apache Tajo – JDBC Interface

Apache Tajo provides a JDBC interface to connect and execute queries. We can use the same JDBC interface to connect to Tajo from a Java-based application. This section shows how to connect to Tajo and execute commands from a sample Java application using the JDBC interface.

Download JDBC Driver

Download the JDBC driver by visiting the following link: http://apache.org/dyn/closer.cgi/tajo/tajo-0.11.3/tajo-jdbc-0.11.3.jar.

The file tajo-jdbc-0.11.3.jar is now downloaded on your machine.

Set Class Path

To make use of the JDBC driver in your program, set the class path as follows:

    export CLASSPATH=path/to/tajo-jdbc-0.11.3.jar:$CLASSPATH

Connect to Tajo

Apache Tajo provides the JDBC driver as a single jar file, available at /path/to/tajo/share/jdbc-dist/tajo-jdbc-0.11.3.jar.

The connection string for Apache Tajo has one of the following formats:

    jdbc:tajo://host/
    jdbc:tajo://host/database
    jdbc:tajo://host:port/
    jdbc:tajo://host:port/database

Here,

    host − The hostname of the TajoMaster.
    port − The port number that the server is listening on. The default port number is 26002.
    database − The database name. The default database name is default.

Java Application

Let us now look at the Java application.

Coding

    import java.sql.*;
    import org.apache.tajo.jdbc.TajoDriver;

    public class TajoJdbcSample {
        public static void main(String[] args) {
            Connection connection = null;
            Statement statement = null;
            try {
                // Load the Tajo JDBC driver and open a connection.
                Class.forName("org.apache.tajo.jdbc.TajoDriver");
                connection = DriverManager.getConnection("jdbc:tajo://localhost/default");
                statement = connection.createStatement();

                // Fetch records from mytable.
                String sql = "select * from mytable";
                ResultSet resultSet = statement.executeQuery(sql);

                while (resultSet.next()) {
                    int id = resultSet.getInt("id");
                    String name = resultSet.getString("name");
                    System.out.print("ID: " + id + ";\nName: " + name + "\n");
                }

                resultSet.close();
                statement.close();
                connection.close();
            } catch (SQLException sqlException) {
                sqlException.printStackTrace();
            } catch (Exception exception) {
                exception.printStackTrace();
            }
        }
    }

The application can be compiled and run using the following commands.

Compilation

    javac -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample.java

Execution

    java -cp /path/to/tajo-jdbc-0.11.3.jar:. TajoJdbcSample

Result

The above commands will generate the following result:

    ID: 1;
    Name: Adam
    ID: 2;
    Name: Amit
    ID: 3;
    Name: Bob
    ID: 4;
    Name: David
    ID: 5;
    Name: Esha
    ID: 6;
    Name: Ganga
    ID: 7;
    Name: Jack
    ID: 8;
    Name: Leena
    ID: 9;
    Name: Mary
    ID: 10;
    Name: Peter
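The four connection-string formats listed earlier in this chapter differ only in whether the port and the database are given. A small helper can assemble them; the sketch below is a hypothetical convenience class, not part of the Tajo JDBC driver, and it passes null to mean "leave this part out so the default applies".

```java
public class TajoUrlSketch {

    // Builds a Tajo JDBC URL. A null port or database is omitted from the URL,
    // which leaves the defaults described above (port 26002, database "default") in effect.
    static String buildUrl(String host, Integer port, String database) {
        StringBuilder url = new StringBuilder("jdbc:tajo://").append(host);
        if (port != null) {
            url.append(':').append(port);
        }
        url.append('/');
        if (database != null) {
            url.append(database);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("localhost", null, null));       // jdbc:tajo://localhost/
        System.out.println(buildUrl("localhost", null, "default"));  // jdbc:tajo://localhost/default
        System.out.println(buildUrl("localhost", 26002, null));      // jdbc:tajo://localhost:26002/
        System.out.println(buildUrl("localhost", 26002, "default")); // jdbc:tajo://localhost:26002/default
    }
}
```

The resulting string can be passed to DriverManager.getConnection in the same way as the literal URL in the sample application.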