Developer Responsibilities

An AWS Quicksight developer performs the following job responsibilities −

The person should have relevant work experience in analytics, reporting and business intelligence tools.
Understanding customer requirements and designing solutions in AWS to set up the ETL and Business Intelligence environment.
Understanding different AWS services, their use and configuration.
Proficiency in using SQL, ETL, Data Warehouse solutions and databases in a business environment with large-scale, disparate datasets.
Complex quantitative and data analysis skills.
Understanding AWS IAM policies and roles, and administration of AWS services.

AWS Quicksight – Managing Quicksight

The Manage Quicksight option is used to manage your current account. You can add users with their respective roles, manage your subscription, check SPICE capacity, or whitelist domains for embedding. You require admin access to perform any activity on this page. The option to manage Quicksight is available under the user profile.

On clicking Manage subscription, the following screen appears. It shows the users in this account and their respective roles. A search option is also available in case you want to look up a particular existing user in Quicksight. You can invite users with a valid email address, or you can add users with a valid IAM account. The users with an IAM role can then log in to their Quicksight account and view the dashboards to which they have access.

Your Subscriptions shows the edition of Quicksight you are subscribed to.

SPICE capacity shows the capacity of the in-memory calculation engine that has been opted for and the amount used so far. There is an option to purchase more capacity if required.

Account settings shows the details of the Quicksight account − the notification email address, AWS resource permissions granted to Quicksight, and an option to close the account. When you close a Quicksight account, it deletes all the data related to the following objects −

Data Sources
Data Sets
Analyses
Published Dashboards

Manage VPC connections allows you to manage and add VPC connections to Quicksight. To add a new VPC connection, you need to provide the connection details, such as the VPC ID, subnet and security group.

Domains and embedding allows you to whitelist the domains on which you want to embed Quicksight dashboards for your users. Only HTTPS domains can be whitelisted in Quicksight −

https://example.com

You can also include any subdomains, if you want to use them, by selecting the checkbox shown below. When you click the Add button, the domain is added to the list of domain names allowed in Quicksight for embedding. To edit an allowed domain, click the Edit button located beside the domain name, make the changes, and click Update.

Deserialization by Generating Class

As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and deserialize the data using Avro.

Deserialization by Generating a Class

The serialized data is stored in the file emp.avro. You can deserialize and read it using Avro. Follow the procedure given below to deserialize the serialized data from a file.

Step 1

Create an object of the DatumReader interface using the SpecificDatumReader class.

DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

Step 2

Instantiate DataFileReader for the emp class. This class reads serialized data from a file. It requires the DatumReader object, and the path of the file where the serialized data exists, as parameters to the constructor.

DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new File("/path/to/emp.avro"), empDatumReader);

Step 3

Print the deserialized data using the methods of DataFileReader. The hasNext() method returns true if there are any more elements in the reader. The next() method of DataFileReader returns the data in the reader.

while(dataFileReader.hasNext()){
   em = dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization by Generating a Class

The following complete program shows how to deserialize the data in a file using Avro.

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class Deserialize {
   public static void main(String args[]) throws IOException {

      // Deserializing the objects
      DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

      // Instantiating DataFileReader
      DataFileReader<emp> dataFileReader = new DataFileReader<emp>(
         new File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"), empDatumReader);

      emp em = null;
      while(dataFileReader.hasNext()){
         em = dataFileReader.next(em);
         System.out.println(em);
      }
   }
}

Browse into the directory where the generated code is placed. In this case, it is at home/Hadoop/Avro_work/with_code_gen.

$ cd home/Hadoop/Avro_work/with_code_gen/

Now, copy and save the above program in the file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "omar", "id": 1, "salary": 30000, "age": 21, "address": "Hyderabad"}
{"name": "ram", "id": 2, "salary": 40000, "age": 30, "address": "Hyderabad"}
{"name": "robbin", "id": 3, "salary": 35000, "age": 25, "address": "Hyderabad"}
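For reference, the generated emp class used above corresponds to an Avro schema along the following lines. This is only an assumed sketch of emp.avsc, reconstructed from the fields visible in the output; the namespace shown here is a placeholder, not necessarily the one used to generate the class.

{
   "namespace": "tutorialspoint.com",
   "type": "record",
   "name": "emp",
   "fields": [
      { "name": "name", "type": "string" },
      { "name": "id", "type": "int" },
      { "name": "salary", "type": "int" },
      { "name": "age", "type": "int" },
      { "name": "address", "type": "string" }
   ]
}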

AVRO – Schemas

Avro, being a schema-based serialization utility, accepts schemas as input. In spite of various schemas being available, Avro follows its own standards of defining schemas. These schemas describe the following details −

type of file (record by default)
location of record
name of the record
fields in the record with their corresponding data types

Using these schemas, you can store serialized values in binary format using less space. These values are stored without any metadata.

Creating Avro Schemas

An Avro schema is created in the JavaScript Object Notation (JSON) document format, which is a lightweight text-based data interchange format. It is created in one of the following ways −

A JSON string
A JSON object
A JSON array

Example − The following example shows a schema which defines a document, under the namespace Tutorialspoint, with the name Employee, having the fields name and age.

{
   "type" : "record",
   "namespace" : "Tutorialspoint",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "Age" , "type" : "int" }
   ]
}

In this example, you can observe that there are four fields for each record −

type − This field comes under the document as well as under the field named fields. In case of the document, it shows the type of the document, generally a record because there are multiple fields. When it is a field, the type describes the data type.

namespace − This field describes the name of the namespace in which the object resides.

name − This field comes under the document as well as under the field named fields. In case of the document, it describes the schema name. This schema name, together with the namespace, uniquely identifies the schema within the store (Namespace.schema name). In the above example, the full name of the schema will be Tutorialspoint.Employee. In case of fields, it describes the name of the field.

Primitive Data Types of Avro

An Avro schema can have primitive data types as well as complex data types. The following are the primitive data types of Avro −

null − a type having no value.
int − 32-bit signed integer.
long − 64-bit signed integer.
float − single precision (32-bit) IEEE 754 floating-point number.
double − double precision (64-bit) IEEE 754 floating-point number.
bytes − sequence of 8-bit unsigned bytes.
string − Unicode character sequence.

Complex Data Types of Avro

Along with primitive data types, Avro provides six complex data types, namely Records, Enums, Arrays, Maps, Unions, and Fixed.

Record

A record data type in Avro is a collection of multiple attributes. It supports the following attributes −

name − The value of this field holds the name of the record.
namespace − The value of this field holds the name of the namespace where the object is stored.
type − The value of this attribute holds either the type of the document (record) or the datatype of the field in the schema.
fields − This field holds a JSON array, which has the list of all the fields in the schema, each having name and type attributes.

Example

Given below is the example of a record.

{
   "type" : "record",
   "namespace" : "Tutorialspoint",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "age" , "type" : "int" }
   ]
}

Enum

An enumeration is a list of items in a collection. Avro enumeration supports the following attributes −

name − The value of this field holds the name of the enumeration.
namespace − The value of this field contains the string that qualifies the name of the enumeration.
symbols − The value of this field holds the enum's symbols as an array of names.

Example

Given below is the example of an enumeration.

{
   "type" : "enum",
   "name" : "Numbers",
   "namespace" : "data",
   "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]
}

Arrays

This data type defines an array field having a single attribute items. This items attribute specifies the type of items in the array.

Example

{ "type" : "array", "items" : "int" }

Maps

The map data type is an array of key-value pairs; it organizes data as key-value pairs. The key for an Avro map must be a string. The values of a map hold the data type of the content of the map.

Example

{ "type" : "map", "values" : "int" }

Unions

A union datatype is used whenever the field has one or more datatypes. Unions are represented as JSON arrays. For example, if a field could be either an int or null, then the union is represented as ["int", "null"].

Example

Given below is an example document using unions −

{
   "type" : "record",
   "namespace" : "tutorialspoint",
   "name" : "empdetails",
   "fields" : [
      { "name" : "experience", "type" : ["int", "null"] },
      { "name" : "age", "type" : "int" }
   ]
}

Fixed

This data type is used to declare a fixed-sized field that can be used for storing binary data. It has name and size as attributes. Name holds the name of the field, and size holds the size of the field.

Example

{ "type" : "fixed" , "name" : "bdata", "size" : 1048576 }
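The following short Java sketch, offered as an illustrative example rather than part of the original tutorial, shows how such a schema definition can be parsed and inspected programmatically with Avro's Schema.Parser class. The embedded JSON string is the Employee schema from the example above.

import org.apache.avro.Schema;

public class InspectSchema {
   public static void main(String[] args) {
      // The Employee schema from the example above, as a JSON string
      String schemaJson =
         "{\"type\":\"record\",\"namespace\":\"Tutorialspoint\",\"name\":\"Employee\","
         + "\"fields\":[{\"name\":\"Name\",\"type\":\"string\"},"
         + "{\"name\":\"Age\",\"type\":\"int\"}]}";

      // Parse the JSON definition into an Avro Schema object
      Schema schema = new Schema.Parser().parse(schemaJson);

      // The full name combines namespace and name, i.e. Tutorialspoint.Employee
      System.out.println(schema.getFullName());

      // List every field along with its declared Avro type
      for (Schema.Field field : schema.getFields()) {
         System.out.println(field.name() + " : " + field.schema().getType());
      }
   }
}

Running this would print Tutorialspoint.Employee followed by the two fields with their types (STRING and INT).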

Deserialization Using Parsers

As mentioned earlier, one can read an Avro schema into a program either by generating a class corresponding to a schema or by using the parsers library. In Avro, data is always stored with its corresponding schema. Therefore, we can always read a serialized item without code generation. This chapter describes how to read the schema using the parsers library and deserialize the data using Avro.

Deserialization Using Parsers Library

The serialized data is stored in the file mydata.txt. You can deserialize and read it using Avro. Follow the procedure given below to deserialize the serialized data from a file.

Step 1

First of all, read the schema from the file. To do so, use the Schema.Parser class. This class provides methods to parse the schema in different formats. Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create an object of the DatumReader interface using the GenericDatumReader class and the schema read above.

DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);

Step 3

Instantiate the DataFileReader class. This class reads serialized data from a file. It requires the DatumReader object, and the path of the file where the serialized data exists, as parameters to the constructor.

DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/path/to/mydata.txt"), datumReader);

Step 4

Print the deserialized data using the methods of DataFileReader. The hasNext() method returns true if there are any more elements in the reader. The next() method of DataFileReader returns the data in the reader.

while(dataFileReader.hasNext()){
   emp = dataFileReader.next(emp);
   System.out.println(emp);
}

Example – Deserialization Using Parsers Library

The following complete program shows how to deserialize the serialized data using the parsers library −

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class Deserialize {
   public static void main(String args[]) throws Exception {

      // Instantiating the Schema.Parser class and reading the schema
      Schema schema = new Schema.Parser().parse(
         new File("/home/Hadoop/Avro/schema/emp.avsc"));

      DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
      DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(
         new File("/home/Hadoop/Avro_Work/without_code_gen/mydata.txt"), datumReader);

      GenericRecord emp = null;
      while (dataFileReader.hasNext()) {
         emp = dataFileReader.next(emp);
         System.out.println(emp);
      }
   }
}

Browse into the directory where the program is placed. In this case, it is at home/Hadoop/Avro_work/without_code_gen.

$ cd home/Hadoop/Avro_work/without_code_gen/

Now copy and save the above program in the file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "ramu", "id": 1, "salary": 30000, "age": 25, "address": "chennai"}
{"name": "rahman", "id": 2, "salary": 35000, "age": 30, "address": "Delhi"}
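Because the parsers approach yields GenericRecord objects, individual fields can also be read by name rather than printing the whole record. The sketch below is a hypothetical variation of the loop in the program above, assuming the same emp schema with name and age fields; Avro returns string values as Utf8 objects, hence the explicit toString() call.

while (dataFileReader.hasNext()) {
   emp = dataFileReader.next(emp);

   // Look up fields by their schema names; get() returns Object
   String name = emp.get("name").toString();   // Avro strings are Utf8, convert explicitly
   int age = (Integer) emp.get("age");         // primitive fields come back as boxed values

   System.out.println(name + " is " + age + " years old");
}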

AWS Quicksight – Data Source Limit

When you use different data sources in the Quicksight tool, certain limits apply based on the data source.

File

You can use up to 25 GB of total size specified in the manifest file. This limit depends on the file size after it is imported into SPICE. The number of files supported in a manifest file is 1000, and there is also a limit on the number of columns in each file.

Table and Query

When you are querying a large table, it is recommended that you use a WHERE or HAVING condition to reduce the amount of data imported into SPICE. The query result imported into SPICE should not exceed 25 GB. You can deselect some of the columns while importing the data into SPICE. In case your data source contains data types which are not supported in Quicksight, AWS Quicksight skips those values.

Person ID   Sales Date   Amount
001         10/14/2017   12.43
002         5/3/2017     25.00
003         Unknown      18.17
004         3/8/2019     86.02

From the above values, Quicksight will drop the row with the non-date value while importing this data into a dataset.

The following data types are supported in Quicksight −

Amazon Athena, Presto
Number data types − bigint, decimal, double, integer, real, smallint, tinyint
String data types − char, varchar
Date time data types − date, timestamp
Boolean data types − Boolean

Amazon Aurora, MariaDB and MySQL
Number data types − bigint, decimal, double, int, integer, mediumint, numeric, smallint, tinyint
String data types − char, enum, set, text, varchar
Date time data types − date, datetime, timestamp, year

PostgreSQL
Number data types − bigint, decimal, double, integer, numeric, precision, real, smallint
String data types − char, character, text, varchar, varying character
Date time data types − date, timestamp
Boolean data types − Boolean

Apache Spark
Number data types − bigint, decimal, double, integer, real, smallint, tinyint
String data types − varchar
Date time data types − date, timestamp
Boolean data types − Boolean

Snowflake
Number data types − bigint, byteint, decimal, double, doubleprecision, float, float4, float8, int, integer, number, numeric, real, smallint, tinyint
String data types − char, character, string, text, varchar
Date time data types − date, datetime, time, timestamp, timestamp_*
Boolean data types − Boolean

Microsoft SQL Server
Number data types − bigint, bit, decimal, int, money, numeric, real, smallint, smallmoney, tinyint
String data types − char, nchar, nvarchar, text, varchar
Date time data types − date, datetime, datetime2, datetimeoffset, smalldatetime
Boolean data types − bit

Apache Tajo – Storage Plugins

Tajo supports various storage formats. To register a storage plugin configuration, you should add the changes to the configuration file storage-site.json.

storage-site.json

The structure is defined as follows −

{
   "storages": {
      "storage plugin name": {
         "handler": "${class name}",
         "default-format": "plugin name"
      }
   }
}

Each storage instance is identified by a URI.

PostgreSQL Storage Handler

Tajo supports a PostgreSQL storage handler. It enables user queries to access database objects in PostgreSQL and is easy to configure.

Configuration

{
   "spaces": {
      "postgre": {
         "uri": "jdbc:postgresql://hostname:port/database1",
         "configs": {
            "mapped_database": "sampledb",
            "connection_properties": {
               "user": "tajo",
               "password": "pwd"
            }
         }
      }
   }
}

Here, "database1" refers to the PostgreSQL database which is mapped to the database "sampledb" in Tajo.
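Once this tablespace is registered, the mapped database can be queried from the Tajo shell like any other Tajo database. The session below is only an illustrative sketch − the table name employee is assumed and is not part of the configuration above −

$ bin/tsql
default> \c sampledb
sampledb> select * from employee;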

AWS Quicksight – Using Data Sources

AWS Quicksight accepts data from various sources. Once you click on "New Dataset" on the home page, it shows all the data sources that can be used, both internal and external.

Let us go through connecting Quicksight with some of the most commonly used data sources −

Uploading a file from the system

It allows you to input files in .csv, .tsv, .clf, .elf, .xlsx and JSON formats only. Once you select the file, Quicksight automatically recognizes the file and displays the data. When you click on the Upload a File button, you need to provide the location of the file which you want to use to create the dataset.

Using a file from S3

The screen will appear as below. Under Data source name, you can enter the name to be displayed for the data set that will be created. You also need to either upload a manifest file from your local system or provide the S3 location of the manifest file.

The manifest file is a JSON-format file which specifies the URL/location of the input files and their format. You can enter more than one input file, provided the format is the same. Here is an example of a manifest file. The URIs parameter is used to pass the S3 location of the input files.

{
   "fileLocations": [
      {
         "URIs": [
            "url of first file",
            "url of second file",
            "url of 3rd file and so on"
         ]
      }
   ],
   "globalUploadSettings": {
      "format": "CSV",
      "delimiter": ",",
      "textqualifier": "\"",
      "containsHeader": "true"
   }
}

The parameters passed in globalUploadSettings are the default ones. You can change these parameters as per your requirements.

MySQL

You need to enter the database information in the fields to connect to your database. Once it is connected to your database, you can import the data from it. The following information is required when you connect to any RDBMS database −

DSN name
Type of connection
Database server name
Port
Database name
User name
Password

The following RDBMS-based data sources are supported in Quicksight −

Amazon Athena
Amazon Aurora
Amazon Redshift
Amazon Redshift Spectrum
Amazon S3
Amazon S3 Analytics
Apache Spark 2.0 or later
MariaDB 10.0 or later
Microsoft SQL Server 2012 or later
MySQL 5.1 or later
PostgreSQL 9.3.1 or later
Presto 0.167 or later
Snowflake
Teradata 14.0 or later

Athena

Athena is the AWS tool used to run queries on tables. You can choose any table from Athena or run a custom query on those tables and use the output of those queries in Quicksight. There are a couple of steps to choose the data source.

When you choose Athena, the following screen appears. You can input any data source name which you want to give to your data source in Quicksight. Click on "Validate Connection". Once the connection is validated, click on the "Create new source" button.

Now choose the table name from the dropdown. The dropdown shows the databases present in Athena, which in turn show the tables in that database. Alternatively, you can click on "Use custom SQL" to run a query on the Athena tables.

Once done, you can click on "Edit/Preview data" or "Visualize" to either edit your data or directly visualize the data as per your requirement.

Deleting a data source

When you delete a data source which is in use in any of the Quicksight dashboards, it can make the associated data set unusable. This usually happens when you query a SQL-based data source. When you create a dataset based on S3, Salesforce or SPICE, deleting the data source does not affect your ability to use the dataset, as the data is stored in SPICE; however, the refresh option is not available in this case.

To delete a data source, select the data source. Navigate to the From Existing Data Source tab on the create-a-dataset page. Before deletion, you can also confirm the estimated table size and other details of the data source.

Apache Tajo – Quick Guide

Apache Tajo – Introduction

Distributed Data Warehouse System

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts to take informed decisions in an organization, but relational data volumes are increasing day by day.

To overcome the challenges, a distributed data warehouse system shares data across multiple data repositories for the purpose of Online Analytical Processing (OLAP). Each data warehouse may belong to one or more organizations. It performs load balancing and provides scalability. Metadata is replicated and centrally distributed.

Apache Tajo is a distributed data warehouse system which uses Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of the MapReduce framework.

Overview of SQL on Hadoop

Hadoop is an open-source framework that allows storing and processing big data in a distributed environment. It is extremely fast and powerful. However, Hadoop has limited querying capabilities, so its performance can be made even better with the help of SQL on Hadoop. This allows users to interact with Hadoop through easy SQL commands. Some examples of SQL-on-Hadoop applications are Hive, Impala, Drill, Presto, Spark, HAWQ and Apache Tajo.

What is Apache Tajo

Apache Tajo is a relational and distributed data processing framework. It is designed for low latency and scalable ad-hoc query analysis. Tajo supports standard SQL and various data formats. Most Tajo queries can be executed without any modification. Tajo has fault tolerance through a restart mechanism for failed tasks and an extensible query rewrite engine. Tajo performs the necessary ETL (Extract, Transform and Load) operations to summarize large datasets stored on HDFS. It is an alternative choice to Hive/Pig. The latest version of Tajo has greater connectivity to Java programs and third-party databases such as Oracle and PostgreSQL.

Features of Apache Tajo

Apache Tajo has the following features −

Superior scalability and optimized performance
Low latency
User-defined functions
Row/columnar storage processing framework
Compatibility with HiveQL and Hive MetaStore
Simple data flow and easy maintenance

Benefits of Apache Tajo

Apache Tajo offers the following benefits −

Easy to use
Simplified architecture
Cost-based query optimization
Vectorized query execution plan
Fast delivery
Simple I/O mechanism and support for various types of storage
Fault tolerance

Use Cases of Apache Tajo

The following are some of the use cases of Apache Tajo −

Data warehousing and analysis − Korea's SK Telecom firm ran Tajo against 1.7 terabytes worth of data and found it could complete queries with greater speed than either Hive or Impala.

Data discovery − The Korean music streaming service Melon uses Tajo for analytical processing. Tajo executes ETL (extract-transform-load) jobs 1.5 to 10 times faster than Hive.

Log analysis − Bluehole Studio, a Korea-based company, developed TERA − a fantasy multiplayer online game. The company uses Tajo for game log analysis and for finding the principal causes of service quality interrupts.
Storage and Data Formats

Apache Tajo supports the following data formats −

JSON
Text file (CSV)
Parquet
Sequence File
AVRO
Protocol Buffer
Apache Orc

Tajo supports the following storage formats −

HDFS
JDBC
Amazon S3
Apache HBase
Elasticsearch

Apache Tajo – Architecture

The following illustration depicts the architecture of Apache Tajo. Each of the components is described in detail below.

1. Client − Submits the SQL statements to the Tajo Master to get the result.

2. Master − The Master is the main daemon. It is responsible for query planning and is the coordinator for workers.

3. Catalog server − Maintains the table and index descriptions. It is embedded in the Master daemon. The catalog server uses Apache Derby as the storage layer and connects via a JDBC client.

4. Worker − The Master node assigns tasks to worker nodes. A TajoWorker processes data. As the number of TajoWorkers increases, the processing capacity also increases linearly.

5. Query Master − The Tajo Master assigns a query to the Query Master. The Query Master is responsible for controlling a distributed execution plan. It launches the TaskRunner and schedules tasks to the TaskRunner. The main role of the Query Master is to monitor the running tasks and report them to the Master node.

6. Node Managers − Manage the resources of the worker node and decide on allocating requests to the node.

7. TaskRunner − Acts as a local query execution engine. It is used to run and monitor the query process. The TaskRunner processes one task at a time. It has the following three main attributes −
Logical plan − an execution block which created the task.
A fragment − an input path, an offset range, and a schema.
Fetches URIs.

8. Query Executor − It is used to execute a query.

9. Storage service − Connects the underlying data storage to Tajo.

Workflow

Tajo uses the Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine instead of the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and is the coordinator for the workers. The master divides a query into small tasks and assigns them to the workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators.

In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. The web-based interface of Tajo has the following capabilities −

Option to find how the submitted queries are planned
Option to find how the queries are distributed across nodes
Option to check the status of the cluster and nodes

Apache Tajo – Installation

To install Apache Tajo, you must have the following software on your system −

Hadoop version 2.3 or greater
Java version 1.7 or higher
Linux or Mac OS

Let us now continue with the following steps to install Tajo.

Verifying Java installation

Hopefully, you have already installed Java version 8 on your machine. Now, you just need to verify it, for example as shown below.
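A minimal check, assuming Java is already available on the PATH −

$ java -version

If the command prints a Java version of 1.7 or higher, the Java requirement for Tajo is satisfied; otherwise, install a suitable JDK before proceeding.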

AWS Quicksight – Landing Page

To access the AWS Quicksight tool, you can open it directly by entering the following URL in a web browser, or by navigating to AWS Console → Services −

https://aws.amazon.com/quicksight/

Once you open this URL, click on "Sign in to the Console" in the top right corner. You need to provide the following details to log in to the Quicksight tool −

Account ID or alias
IAM User name
Password

Once you log in to Quicksight, you will see the following screen −

As marked in the above image −

Section A − The "New Analysis" icon is used to create a new analysis. When you click on this, it asks you to select a data set. You can also create a new data set as shown below.

Section B − The "Manage data" icon shows all the data sets that have already been input to Quicksight. This option can be used to manage the datasets without creating any analysis.

Section C − It shows the various data sources you have already connected to. You can also connect to a new data source or upload a file.

Section D − This section contains icons for already created analyses, published dashboards and tutorial videos explaining Quicksight in detail. You can click on each tab to view them as below −

All analyses − Here, you can see all the existing analyses in the AWS Quicksight account, including reports and dashboards.

All dashboards − This option shows only the existing dashboards in the AWS Quicksight account.

Tutorial videos

Another option to open the Quicksight console is by navigating to the AWS console using the URL below −

https://aws.amazon.com/console/

Once you log in, you need to navigate to the Services tab and search for Quicksight in the search bar. If you have recently used Quicksight services in the AWS account, it will appear under the History tab.