AWS Quicksight – Creating New Analysis
An analysis is a combination of one or more visuals. A visual is a representation of data in a graphical, chart, or tabular format. A number of formats are available to create a visual, including pie charts, horizontal bar charts, vertical bar charts, and pivot tables.
Once the input dataset has been modified as per the business requirement, double-click the dataset and click on "Visualize" to start creating a new analysis. The workspace screen shown below will appear.
Once you select a field, Quicksight automatically chooses the type of visual depending upon the field. If you want to change the visual type, select one of the other visual types. For example, we start by selecting the horizontal bar chart under visual types.
First, drag any one field into the visual in the centre. At the top, you will see "Field wells" containing the fields used in the visual and the corresponding axes. You can click on the down arrow at the far right, just under the user name, to get an expanded view.
In this example, Gender is selected on the Y axis and Job family under Group/Color. You can modify the fields from the dropdowns. Under Value, you can add any numeric field and apply an aggregate function to it. By default, the visual shows a count of rows. The visual will appear as follows −
There are options to change the heading/title of the visual and a number of other formatting options. Click the dropdown at the far right of the visual and the options will expand. Choose "Format visual". You will then see various options in the left pane under "Format visual" −
X-Axis/Y-Axis − Choose whether to show the label or field name on the respective axis. You can also rename these labels.
Group/Color − Change the default colors used in the visual.
Legend − Change the title of the legend and its position in the visual. You can also rename the title by simply clicking on it.
Data labels − Show the exact value of each bar and choose the position where the values are displayed.
The below screen shows the visual with all these options turned on.
AVRO – Overview
To transfer data over a network or for persistent storage, you need to serialize the data. Prior to the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a schema-based serialization technique. This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages. In this tutorial, we demonstrate the examples using the Java library.

What is Avro?
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.
Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data along with its built-in schema into a compact binary format, which can be deserialized by any application. Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas
Avro depends heavily on its schema. Because the schema always accompanies the data, data can be written and later read without generating code in advance. Serialization is fast and the resulting serialized data is compact. The schema is stored along with the Avro data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection. This exchange helps in the communication between same-named fields, missing fields, extra fields, etc.
Avro schemas are defined with JSON, which simplifies implementation in languages with JSON libraries.
Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.

Comparison with Thrift and Protocol Buffers
Thrift and Protocol Buffers are the libraries that compete most closely with Avro. Avro differs from these frameworks in the following ways −
Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.
Avro is built into the Hadoop ecosystem; Thrift and Protocol Buffers are not.
Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in a proprietary IDL.

Property                  Avro    Thrift & Protocol Buffers
Dynamic schema            Yes     No
Built into Hadoop         Yes     No
Schema in JSON            Yes     No
No need to compile        Yes     No
No need to declare IDs    Yes     No
Bleeding edge             Yes     No

Features of Avro
Listed below are some of the prominent features of Avro −
Avro is a language-neutral data serialization system. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, processed in Hadoop, and the results fed to a third language.
Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
Avro creates a self-describing file called an Avro data file, in which it stores the data along with its schema in the metadata section.
Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas in the connection handshake.

General Working of Avro
To use Avro, you need to follow the given workflow −
Step 1 − Create schemas. Here you need to design an Avro schema according to your data.
Step 2 − Read the schemas into your program. This is done in two ways −
By generating a class corresponding to the schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
By using the parsers library − You can directly read the schema using the parsers library.
Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.
Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.
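As a rough end-to-end sketch of Steps 1 to 3 using the parsers library (no code generation), the example below defines a small, hypothetical employee schema inline and writes one record to an Avro data file. It uses the generic API in org.apache.avro.generic rather than the generated-class API; the schema, field names, and file name are illustrative only and are not part of the tutorial's dataset.

   import java.io.File;
   import java.io.IOException;

   import org.apache.avro.Schema;
   import org.apache.avro.file.DataFileWriter;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.DatumWriter;

   public class AvroWorkflowSketch {
      public static void main(String[] args) throws IOException {
         // Step 1: define a schema in JSON (here, a hypothetical employee record).
         String schemaJson =
            "{\"type\":\"record\",\"name\":\"employee\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"id\",\"type\":\"int\"}]}";

         // Step 2: read the schema into the program using the parsers library.
         Schema schema = new Schema.Parser().parse(schemaJson);

         // Step 3: serialize a record that conforms to the schema into an Avro data file.
         GenericRecord employee = new GenericData.Record(schema);
         employee.put("name", "omar");
         employee.put("id", 1);

         DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
         DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
         fileWriter.create(schema, new File("employee.avro")); // the schema is stored in the file metadata
         fileWriter.append(employee);
         fileWriter.close();
      }
   }

Deserialization mirrors this flow with GenericDatumReader and DataFileReader; the later chapters walk through the generated-class approach in detail.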
Cassandra – Architecture
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to the other nodes.
Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
When a node goes down, read/write requests can be served from other nodes in the network.

Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra returns the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data replication among the nodes in a cluster to ensure no single point of failure.
Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with each other and detect any faulty nodes in the cluster.

Components of Cassandra
The key components of Cassandra are as follows −
Node − It is the place where data is stored.
Data center − It is a collection of related nodes.
Cluster − A cluster is a component that contains one or more data centers.
Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Bloom filter − These are quick, nondeterministic algorithms for testing whether an element is a member of a set. A Bloom filter is a special kind of cache and is consulted on every query.

Cassandra Query Language
Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, a prompt to work with CQL, or separate application-language drivers.
Clients approach any of the nodes for their read/write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.

Write Operations
Every write activity on a node is captured in the commit log written on that node. Later the data is captured and stored in the mem-table. Whenever the mem-table is full, data is written to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.

Read Operations
During read operations, Cassandra gets values from the mem-table and checks the Bloom filter to find the appropriate SSTable that holds the required data.
Cassandra – Referenced API
This chapter covers all the important classes in Cassandra.

Cluster
This class is the main entry point of the driver. It belongs to the com.datastax.driver.core package.
Methods
1. Session connect() − Creates a new session on the current cluster and initializes it.
2. void close() − Closes the cluster instance.
3. static Cluster.Builder builder() − Creates a new Cluster.Builder instance.

Cluster.Builder
This class is used to build Cluster instances.
Methods
1. Cluster.Builder addContactPoint(String address) − Adds a contact point to the cluster.
2. Cluster build() − Builds the cluster with the given contact points.

Session
This interface holds the connections to a Cassandra cluster. Using this interface, you can execute CQL queries. It belongs to the com.datastax.driver.core package.
Methods
1. void close() − Closes the current session instance.
2. ResultSet execute(Statement statement) − Executes a query. It requires a Statement object.
3. ResultSet execute(String query) − Executes a query. It requires the query in the form of a String object.
4. PreparedStatement prepare(RegularStatement statement) − Prepares the provided query. The query is to be provided in the form of a Statement.
5. PreparedStatement prepare(String query) − Prepares the provided query. The query is to be provided in the form of a String.
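As a minimal sketch of how these classes fit together, the snippet below builds a Cluster from a contact point, opens a Session, and runs a CQL query. It assumes a Cassandra node reachable at 127.0.0.1 and an existing keyspace and table; my_keyspace and my_table are placeholder names, not part of this tutorial.

   import com.datastax.driver.core.Cluster;
   import com.datastax.driver.core.ResultSet;
   import com.datastax.driver.core.Row;
   import com.datastax.driver.core.Session;

   public class DriverSketch {
      public static void main(String[] args) {
         // Build a Cluster from a contact point, then open a Session on it.
         Cluster cluster = Cluster.builder()
               .addContactPoint("127.0.0.1")   // illustrative address of one node
               .build();
         Session session = cluster.connect();

         // Execute a CQL query; the contacted node acts as the coordinator.
         ResultSet results = session.execute("SELECT * FROM my_keyspace.my_table"); // illustrative names
         for (Row row : results) {
            System.out.println(row);
         }

         // Close the session and the cluster when done.
         session.close();
         cluster.close();
      }
   }

For parameterized queries, prepare(String) returns a PreparedStatement whose bound form can be passed to execute(Statement) in the same way.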
Cassandra – Home
Cassandra Tutorial
Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure.
The tutorial starts off with a basic introduction of Cassandra followed by its architecture, installation, and important classes and interfaces. Thereafter, it proceeds to cover how to perform operations such as create, alter, update, and delete on keyspaces, tables, and indexes using CQLSH as well as the Java API. The tutorial also has dedicated chapters to explain the data types and collections available in CQL and how to make use of user-defined data types.

Audience
This tutorial will be extremely useful for software professionals in particular who aspire to learn the ropes of Cassandra and implement it in practice.

Prerequisites
It is an elementary tutorial and you can easily understand the concepts explained here with a basic knowledge of Java programming. However, it will help if you have some prior exposure to database concepts and any of the Linux flavors.
AWS Quicksight – Dashboards
A dashboard is a published snapshot of an analysis. Unlike an analysis, a dashboard is read-only: viewers can use the parameters and filters that were created in the visuals, but they cannot create new visuals or change the charts.
To generate a dashboard from the analysis you have created, click on "Publish dashboard" under the share icon. Provide a name for the dashboard and click the "Publish dashboard" button. You can opt to share the dashboard with all users in the account or only with particular users.
The dashboard is now visible under the "All dashboards" tab on the home screen. Here is a sample dashboard with a filter attached to it. You can see that there is no option to edit the dashboard except applying the filters added while creating the visuals.
AWS Quicksight – Embedding Dashboard
You can embed your Quicksight dashboards into external applications/web pages and control user access using the AWS Cognito service. To perform user control, you create a user pool and an identity pool in Cognito and assign the embed-dashboard policies to the identity pool.
AWS Cognito is an identity service that allows administrators to create and manage temporary users and give them access to applications. With the use of an identity pool, you can manage permissions for these user pools.
Let us see how we can generate a secure dashboard URL and perform user control −

Step 1 – Creating user pools and users
Create a user pool in AWS Cognito and create users. Go to Amazon Cognito → Manage User Pools → Create a User Pool.

Step 2 – Creating an identity pool
Once the user pool is created, the next step is to create an identity pool. Go to https://console.aws.amazon.com/cognito/home?region=us-east-1
Click on "Create New Identity Pool" and enter an appropriate name for the identity pool. Go to the Authentication Providers section and select the "Cognito" option.

Step 3 – Creating Cognito roles
Enter the User Pool ID (your user pool ID) and the App Client ID (go to App Clients in the user pool and copy the ID). Next, click on "Create Pool", and then click on "Allow" to create the roles of the identity pool in IAM. This creates two Cognito roles.

Step 4 – Assigning a custom policy
The next step is to assign the following custom policy to the identity roles created in the above step −

{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Action": "quicksight:RegisterUser",
         "Resource": "*",
         "Effect": "Allow"
      },
      {
         "Action": "quicksight:GetDashboardEmbedUrl",
         "Resource": "*",
         "Effect": "Allow"
      },
      {
         "Action": "sts:AssumeRole",
         "Resource": "*",
         "Effect": "Allow"
      }
   ]
}

You can pass the dashboard Amazon Resource Name (ARN) under "quicksight:GetDashboardEmbedUrl" instead of "*" to restrict the user to a single dashboard.

Step 5 – Logging into the Cognito application
The next step is to log in to the Cognito application with user credentials from the user pool. When the user logs in to the application, Cognito generates 3 tokens −

IdToken
AccessToken
RefreshToken

To create a temporary IAM user, configure the credentials as shown below −

AWS.config.region = 'us-east-1';
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
   IdentityPoolId: 'Identity pool ID',
   Logins: {
      'cognito-idp.us-east-1.amazonaws.com/UserPoolID': IdToken
   }
});

To generate temporary IAM credentials, call the sts.assumeRole method with the below parameters −

var sts = new AWS.STS();
var params = {
   RoleArn: 'Cognito Identity role arn',
   RoleSessionName: 'Session name'
};
sts.assumeRole(params, function (err, data) {
   if (err) {
      console.log(err, err.stack); // an error occurred
   } else {
      console.log(data);
   }
});

Step 6 – Registering the user in Quicksight
The next step is to register the user in Quicksight using quicksight.registerUser, with the credentials generated in step 5 and the below parameters −

var quicksight = new AWS.QuickSight();
var params = {
   AwsAccountId: 'account id',
   Email: 'email',
   IdentityType: 'IAM',
   Namespace: 'default',
   UserRole: ADMIN | AUTHOR | READER | RESTRICTED_AUTHOR | RESTRICTED_READER,
   IamArn: 'Cognito Identity role arn',
   SessionName: 'session name given in the assume role creation'
};
quicksight.registerUser(params, function (err, data1) {
   if (err) {
      console.log('err register user'); // an error occurred
   } else {
      // console.log('Register User1');
   }
});

Step 7 – Updating the AWS configuration
Next, update the AWS configuration with the temporary credentials generated in step 5.
AWS.config.update({
   accessKeyId: AccessKeyId,
   secretAccessKey: SecretAccessKey,
   sessionToken: SessionToken,
   region: Region
});

Step 8 – Generating the embed URL for the Quicksight dashboard
With the credentials created in step 5, call quicksight.getDashboardEmbedUrl with the below parameters to generate the URL.

var params = {
   AwsAccountId: 'Enter AWS account ID',
   DashboardId: 'Enter dashboard Id',
   IdentityType: 'IAM',
   ResetDisabled: true,
   SessionLifetimeInMinutes: between 15 to 600 minutes,
   UndoRedoDisabled: True | False
};
quicksight.getDashboardEmbedUrl(params, function (err, data) {
   if (!err) {
      console.log(data);
   } else {
      console.log(err);
   }
});

You then call QuickSightEmbedding.embedDashboard from your application using the above generated URL.
Like Amazon Quicksight itself, the embedded dashboard supports the following features −

Drill-down option
Custom actions (link to a new tab)
On-screen filters
Download to CSV
Sorting on visuals
Email report opt-in
Reset dashboard to defaults option
Undo/redo actions on the dashboard
AWS Quicksight – Developer Responsibilities
The following job responsibilities are performed by an AWS Quicksight developer −
Relevant work experience in analytics, reporting, and business intelligence tools.
Understanding customer requirements and designing solutions in AWS to set up the ETL and Business Intelligence environment.
Understanding different AWS services, their use, and their configuration.
Proficiency in using SQL, ETL, data warehouse solutions, and databases in a business environment with large-scale, disparate datasets.
Complex quantitative and data analysis skills.
Understanding AWS IAM policies and roles, and administration of AWS services.
AWS Quicksight – Managing Quicksight
The Manage Quicksight section is where you manage your current account. You can add users with their respective roles, manage your subscription, check SPICE capacity, or whitelist domains for embedding. You require admin access to perform any activity on this page.
Under the user profile, you will find the option to manage Quicksight.
On clicking "Manage subscription", the below screen appears. It shows the users in this account and their respective roles. There is also a search option, in case you want to search for a particular existing user in Quicksight.
You can invite users with a valid email address, or you can add users with a valid IAM account. The users with an IAM role can then log in to their Quicksight account and view the dashboards to which they have access.
"Your Subscriptions" shows the edition of Quicksight you are subscribed to.
"SPICE capacity" shows the capacity of the in-memory calculation engine that has been opted for and the amount used so far. There is an option to purchase more capacity if required.
"Account settings" shows the details of the Quicksight account − the notification email address and the AWS resource permissions granted to Quicksight. You also have an option to close the account. When you close a Quicksight account, it deletes all the data related to the following objects −
Data Sources
Data Sets
Analyses
Published Dashboards
"Manage VPC connections" allows you to manage and add VPC connections to Quicksight. To add a new VPC connection, you need to provide the required connection details.
"Domains and embedding" allows you to whitelist the domains on which you want to embed Quicksight dashboards for users. Only https:// domains can be whitelisted in Quicksight −
https://example.com
You can also include any subdomains by selecting the checkbox shown below. When you click on the Add button, the domain is added to the list of domains allowed in Quicksight for embedding.
To edit an allowed domain, click the Edit button located beside the domain name, make the changes, and click on Update.
AVRO – Deserialization By Generating Class
As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and deserialize the data using Avro.

Deserialization by Generating a Class
The serialized data is stored in the file emp.avro. You can deserialize and read it using Avro. Follow the procedure given below to deserialize the serialized data from a file.

Step 1
Create an object of the DatumReader interface using the SpecificDatumReader class.

DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

Step 2
Instantiate DataFileReader for the emp class. This class reads serialized data from a file. It requires the DatumReader object and the path of the file where the serialized data exists as parameters to the constructor.

DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new File("/path/to/emp.avro"), empDatumReader);

Step 3
Print the deserialized data using the methods of DataFileReader. The hasNext() method returns true if there are more elements in the reader. The next() method of DataFileReader returns the data in the reader.

while(dataFileReader.hasNext()){
   em = dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization by Generating a Class
The following complete program shows how to deserialize the data in a file using Avro.

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class Deserialize {
   public static void main(String args[]) throws IOException {

      //Deserializing the objects
      DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

      //Instantiating DataFileReader
      DataFileReader<emp> dataFileReader = new DataFileReader<emp>(
         new File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"), empDatumReader);
      emp em = null;

      while(dataFileReader.hasNext()){
         em = dataFileReader.next(em);
         System.out.println(em);
      }
   }
}

Browse into the directory where the generated code is placed, in this case at home/Hadoop/Avro_work/with_code_gen.

$ cd home/Hadoop/Avro_work/with_code_gen/

Now, copy and save the above program in a file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output
{"name": "omar", "id": 1, "salary": 30000, "age": 21, "address": "Hyderabad"}
{"name": "ram", "id": 2, "salary": 40000, "age": 30, "address": "Hyderabad"}
{"name": "robbin", "id": 3, "salary": 35000, "age": 25, "address": "Hyderabad"}