Cognos – Custom Calculations

You can add custom calculations to a report to meet a business requirement. Calculations are built with operators; for example, you can add a new value, salary * 0.2, as a Bonus.

To create a calculation in a report:

1. Select the items in the report.
2. Click the Insert Calculation button and select the calculation to perform.

Note: Calculations that are not applicable to the items you selected are greyed out.

To change the order of the operands or the name of the calculated item added to the report, click Custom. The calculation appears as a new row or column in your report.

Drilling

Drilling up and down lets you analyze data by moving between levels of information. Drill down to see more detailed information, down to the lowest level; drill up to compare results.

To drill down or up in a single row or column, pause the pointer over the label text until the drill icon (a plus sign (+) with a caret) appears and the text is underlined, and then click.

To drill down or up in both a row and a column simultaneously, click the value at the intersection of the row and the column, and then click again.
Cognos – Save an Analysis
To save an analysis, click the Save button at the top of the window. Enter a name and a location for the analysis, and then click OK.
Cognos – Report Templates
In Report Studio, you can create different types of reports. Each type presents data in a different format; for example, a list report can be used to show customer information.

The following reports can be created in Report Studio:

List Report

A list report shows data in a detailed format. Data is shown in rows and columns, and each column contains all the values of a data item.

    Quarter  Order number  Quantity  Revenue
    Q4       101035        105       $4,200.00
             101037        90        $8,470.80
             101044        124       $11,479.92
             101052        193       $15,952.42
             101064        58        $5,458.96
             101065        78        $7,341.36
             101081        145       $5,800.00
             101092        81        $7,623.72
             101093        50        $4,706.00
             101103        139       $5,560.00

Crosstab

Like a list report, a crosstab report shows data in rows and columns, but the data is compact rather than detailed. The intersection of each row and column shows summarized data.

Chart

You can use Report Studio to create many chart types, including column, bar, area, and line charts. You can also create custom charts that combine these chart types.

Map

You can also use maps in Report Studio to present data for a particular region, country, or location. A map report consists of three parts:

- Region Layer
- Point Layer
- Display Layer

Repeater

Repeaters are used to add repeated items to a report when the report runs. To add a repeater, drag a Repeater from the toolbox to the work area.
Cognos – Filters
Filters limit the data that appears in your report. You can apply one or more filters to a Cognos report, and the report returns only the data that meets the filter conditions. You can create custom filters in a report as required.

To create a custom filter:

1. Select the column to filter by.
2. Click the drop-down list on the Filter button.
3. Choose Create Custom Filter. The Filter Condition dialog is displayed.
4. In the next window, define the filter's parameters:
   - Condition: click the list arrow to see your choices (Show or Don't show the following values).
   - Values: click the list arrow to see your choices.
   - Keywords: lets you search for specific values within the values list.
   - Values List: shows the field values that you can use as filter values. You can select one or many. Select a value and click the right-pointing arrow to move it into the selected column. You can use the Ctrl key to select multiple values at one time.
5. Click OK when the filter is defined.

Note: Filters are visible in the Query Explorer, not in the Page Explorer. Go to the Query Explorer to view the filters.

Deleting a Filter

A filter can be deleted by using the following steps:

1. Go to the Query Explorer.
2. Click Query and locate the Detail Filters pane in the upper right of the window.
3. Select the filter that you want to delete and press the Delete button.

You can also cut or copy a filter.
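The Show / Don't show condition described above can be pictured as an include-or-exclude test on a column. The following is a minimal Python sketch of that idea only; it is not how Cognos implements filters, and the sample rows, column names, and the helper `apply_custom_filter` are all hypothetical.

```python
# Hypothetical report rows; the column names are assumptions for illustration.
rows = [
    {"product": "Tent", "region": "Americas", "revenue": 4200.00},
    {"product": "Stove", "region": "Europe", "revenue": 8470.80},
    {"product": "Lantern", "region": "Americas", "revenue": 5800.00},
    {"product": "Rope", "region": "Asia Pacific", "revenue": 4706.00},
]


def apply_custom_filter(rows, column, values, show=True):
    """Mimic a custom filter on one column.

    show=True  keeps rows whose value is in `values` ("Show").
    show=False keeps rows whose value is NOT in `values` ("Don't show").
    """
    selected = set(values)
    return [row for row in rows if (row[column] in selected) == show]


shown = apply_custom_filter(rows, "region", ["Americas"], show=True)
hidden = apply_custom_filter(rows, "region", ["Americas"], show=False)
print([row["product"] for row in shown])
print([row["product"] for row in hidden])
```

Selecting several values in the values list corresponds to passing more than one entry in `values`.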
DAX – Calculated Columns
A calculated column is a column that you add to an existing table in the Data Model of your workbook by means of a DAX formula that defines the column values. Instead of importing the values in the column, you create the calculated column.

You can use the calculated column in a PivotTable, PivotChart, Power PivotTable, Power PivotChart, or Power View report just like any other table column.

Understanding Calculated Columns

The DAX formula used to create a calculated column is like an Excel formula. However, in a DAX formula you cannot create different formulas for different rows in a table. The DAX formula is automatically applied to the entire column.

For example, you can create a calculated column that extracts the year from the existing Date column with the DAX formula:

    =YEAR([Date])

YEAR is a DAX function and Date is an existing column in the table. As seen, the column name is enclosed in brackets. You will learn more about this in the chapter DAX Syntax.

When you add a column to a table with this DAX formula, the column values are computed as soon as you create the formula. A new column with the header CalculatedColumn1, filled with year values, is created.

Column values are recalculated as necessary, such as when the underlying data is refreshed. You can create calculated columns based on existing columns, calculated fields (measures), and other calculated columns.

Creating a Calculated Column

Consider the Data Model with the Olympics Results.

1. Click the Data View.
2. Click the Results tab. You will be viewing the Results table. The rightmost column has the header Add Column.
3. Click the Design tab on the Ribbon.
4. Click Add in the Columns group. The pointer appears in the formula bar, which means you are adding a column with a DAX formula.
5. Type =YEAR([Date]) in the formula bar.
The rightmost column, with the header Add Column, is highlighted. Press Enter. The calculations take a few seconds to complete. The new calculated column is inserted to the left of the rightmost Add Column and is highlighted. Values in the entire column are computed by the DAX formula. The column header is CalculatedColumn1.

Renaming the Calculated Column

To rename the calculated column to a meaningful name, do the following:

1. Double-click the column header. The column name is highlighted.
2. Select the column name.
3. Type Year (the new name).

The name of the calculated column is changed.

You can also rename a calculated column by right-clicking the column and then clicking Rename in the dropdown list. Make sure that the new name does not conflict with an existing name in the table.

Checking the Data Type of the Calculated Column

You can check the data type of the calculated column as follows:

1. Click the Home tab on the Ribbon.
2. Click Data Type.

The dropdown list shows the possible data types for columns. In this example, the default (Auto) data type, Whole Number, is selected.

Errors in Calculated Columns

Errors can occur in calculated columns for the following reasons:

- Relationships between the tables are changed or deleted. The formulas that use columns in those tables become invalid.
- The formula contains a circular or self-referencing dependency.

Performance Issues

In the Olympics results example, the Results table has about 35,000 rows of data. When you created a column with a DAX formula, all 35,000+ values in the column were calculated at once, which took a little while. The Data Model and its tables are meant to handle millions of rows of data.
Hence, performance can suffer when a DAX formula has too many references. You can avoid performance issues by doing the following:

- If your DAX formula contains many complex dependencies, create it in steps, saving the intermediate results in new calculated columns, instead of creating a single big formula at once. This lets you validate the results and assess the performance.
- Calculated columns need to be recalculated when data modifications occur. You can set the recalculation mode to manual to avoid frequent recalculations. However, if any values in the calculated column are incorrect, the column is grayed out until you refresh and recalculate the data.
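The point that one formula fills an entire calculated column, rather than being written row by row, can be illustrated outside Excel. The following is a minimal Python sketch of that behaviour, not of how the Data Model evaluates DAX; the sample rows and the helper `add_calculated_column` are hypothetical.

```python
from datetime import date

# Hypothetical stand-in for a few rows of the Results table.
results = [
    {"Sport": "Aquatics", "Date": date(2008, 8, 9)},
    {"Sport": "Archery", "Date": date(2004, 8, 15)},
    {"Sport": "Athletics", "Date": date(2000, 9, 22)},
]


def add_calculated_column(table, name, formula):
    """Apply one formula to every row, like a calculated column."""
    for row in table:
        row[name] = formula(row)
    return table


# Rough analogue of =YEAR([Date]): the same expression is used for all rows.
add_calculated_column(results, "Year", lambda row: row["Date"].year)
print([row["Year"] for row in results])
```

Because the formula runs once per row, the cost of a calculated column grows with the number of rows, which is why the 35,000-row example takes a moment to compute.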
Cognos – Report Run with Options

You can run a report with different options. To set the report options, go to Run options.

The following options are available:

- Format: select the output format.
- Paper size: select the paper size and orientation.
- Data mode: select All data, Limited data, or No data.
- Language: select the language in which you want to run the report.
- Rows per page, prompt option, and so on.
Cognos – Add Data to a Report

You can add objects from a data source to a report. Each object has a representative icon.
Big Data Analytics – Charts & Graphs

The first approach to analyzing data is to analyze it visually. The usual objectives are to find relationships between variables and to obtain univariate descriptions of the variables. We can divide these strategies into:

- Univariate analysis
- Multivariate analysis

Univariate Graphical Methods

Univariate is a statistical term. In practice, it means we want to analyze a variable independently of the rest of the data. The plots that do this efficiently are:

Box-Plots

Box-plots are normally used to compare distributions. They are a great way to visually inspect whether distributions differ. Here we check whether the price of diamonds differs across cuts.

    # We will be using the ggplot2 library for plotting
    library(ggplot2)
    data("diamonds")

    # We will be using the diamonds dataset to analyze distributions of numeric variables
    head(diamonds)
    #   carat       cut color clarity depth table price    x    y    z
    # 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    # 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    # 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    # 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    # 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
    # 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

    ### Box-Plots
    p = ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
       geom_boxplot() +
       theme_bw()
    print(p)

The plot shows that the distribution of diamond prices differs across the types of cut.

Histograms

    source("01_box_plots.R")

    # We can plot histograms for each level of the cut factor variable using facet_grid
    p = ggplot(diamonds, aes(x = price, fill = cut)) +
       geom_histogram() +
       facet_grid(cut ~ .) +
       theme_bw()
    p

    # The previous plot does not show the data well because of the differences in scale.
    # We can let each panel use its own scale with the scales argument of facet_grid.
    p = ggplot(diamonds, aes(x = price, fill = cut)) +
       geom_histogram() +
       facet_grid(cut ~ ., scales = "free") +
       theme_bw()
    p

    png("02_histogram_diamonds_cut.png")
    print(p)
    dev.off()

The code above saves the faceted histogram to 02_histogram_diamonds_cut.png.

Multivariate Graphical Methods

Multivariate graphical methods in exploratory data analysis aim to find relationships among different variables. Two approaches are commonly used: plotting a correlation matrix of the numeric variables, or simply plotting the raw data as a matrix of scatter plots.

To demonstrate this, we will use the diamonds dataset. To follow the code, open the script bda/part2/charts/03_multivariate_analysis.R.

    library(ggplot2)
    data(diamonds)

    # Correlation matrix plots
    keep_vars = c("carat", "depth", "price", "table")
    df = diamonds[, keep_vars]

    # compute the correlation matrix
    M_cor = cor(df)
    #            carat       depth      price      table
    # carat 1.00000000  0.02822431  0.9215913  0.1816175
    # depth 0.02822431  1.00000000 -0.0106474 -0.2957785
    # price 0.92159130 -0.01064740  1.0000000  0.1271339
    # table 0.18161755 -0.29577852  0.1271339  1.0000000

    # plots
    heatmap(M_cor)

The code produces a heat-map of the correlation matrix. This is a summary: it tells us that there is a strong correlation between price and carat, and not much among the other variables.

A correlation matrix is useful when we have a large number of variables, in which case plotting the raw data would not be practical. As mentioned, it is also possible to show the raw data:

    library(GGally)
    ggpairs(df)

The plot confirms the results displayed in the heat-map: there is a 0.922 correlation between the price and carat variables.
This relationship can be seen in the price-carat scatterplot located at index (3, 1) of the scatterplot matrix.
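To make the entries of a correlation matrix less of a black box, the following Python sketch computes Pearson correlations by hand. It is an illustration only, not a replacement for R's cor(); it uses just the six rows printed by head(diamonds) above, so its values will not match the full-dataset matrix. The function names `pearson` and `correlation_matrix` are hypothetical.

```python
import math

# The six rows shown by head(diamonds) earlier in this section.
columns = {
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31, 0.24],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3, 62.8],
    "price": [326, 326, 327, 334, 335, 336],
    "table": [55, 61, 65, 58, 58, 57],
}


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


m_cor = correlation_matrix(columns)
for name, row in m_cor.items():
    print(name, [round(value, 3) for value in row.values()])
```

The matrix is symmetric with ones on the diagonal, which is why only one triangle of a correlation heat-map carries information.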
Apache Tajo – Introduction
Distributed Data Warehouse System

A data warehouse is a relational database designed for query and analysis rather than for transaction processing. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization, but relational data volumes grow day by day.

To meet this challenge, a distributed data warehouse system shares data across multiple data repositories for the purpose of Online Analytical Processing (OLAP). Each data warehouse may belong to one or more organizations. The system provides load balancing and scalability. Metadata is replicated and centrally distributed.

Apache Tajo is a distributed data warehouse system that uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine instead of the MapReduce framework.

Overview of SQL on Hadoop

Hadoop is an open-source framework for storing and processing big data in a distributed environment. It is powerful and scales out across clusters of machines. However, Hadoop has limited querying capabilities, so SQL-on-Hadoop tools add a SQL layer that lets users interact with Hadoop through familiar SQL commands. Examples of SQL-on-Hadoop applications are Hive, Impala, Drill, Presto, Spark, HAWQ, and Apache Tajo.

What is Apache Tajo

Apache Tajo is a relational and distributed data processing framework. It is designed for low-latency and scalable ad-hoc query analysis.

- Tajo supports standard SQL and various data formats. Most queries can be executed on Tajo without any modification.
- Tajo provides fault tolerance through a restart mechanism for failed tasks, and it has an extensible query rewrite engine.
- Tajo performs the necessary ETL (Extract, Transform, and Load) operations to summarize large datasets stored on HDFS. It is an alternative to Hive and Pig.
The latest version of Tajo has greater connectivity to Java programs and to third-party databases such as Oracle and PostgreSQL.

Features of Apache Tajo

Apache Tajo has the following features:

- Superior scalability and optimized performance
- Low latency
- User-defined functions
- Row/columnar storage processing framework
- Compatibility with HiveQL and Hive MetaStore
- Simple data flow and easy maintenance

Benefits of Apache Tajo

Apache Tajo offers the following benefits:

- Easy to use
- Simplified architecture
- Cost-based query optimization
- Vectorized query execution plan
- Fast delivery
- Simple I/O mechanism and support for various types of storage
- Fault tolerance

Use Cases of Apache Tajo

The following are some of the use cases of Apache Tajo:

Data warehousing and analysis

Korea's SK Telecom ran Tajo against 1.7 terabytes of data and found that it completed queries faster than either Hive or Impala.

Data discovery

The Korean music streaming service Melon uses Tajo for analytical processing. Tajo executes ETL (extract-transform-load) jobs 1.5 to 10 times faster than Hive.

Log analysis

Bluehole Studio, a Korea-based company, developed TERA, a fantasy multiplayer online game. The company uses Tajo for game log analysis and for finding the principal causes of service quality interruptions.

Storage and Data Formats

Apache Tajo supports the following data formats:

- JSON
- Text file (CSV)
- Parquet
- Sequence File
- Avro
- Protocol Buffer
- Apache ORC

Tajo supports the following storage systems:

- HDFS
- JDBC
- Amazon S3
- Apache HBase
- Elasticsearch
Association Rules
Let I = {i1, i2, …, in} be a set of n binary attributes called items. Let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅.

The sets of items (item-sets for short) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and a small database containing the items is shown in Table 1.

Table 1

    Transaction ID  Items
    1               milk, bread
    2               bread, butter
    3               beer
    4               milk, bread, butter
    5               bread, butter

An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support supp(X) of an item-set X is defined as the proportion of transactions in the data set that contain the item-set. In the example database in Table 1, the item-set {milk, bread} has a support of 2/5 = 0.4, since it occurs in 40% of all transactions (2 out of 5 transactions). Finding frequent item-sets can be seen as a simplification of the unsupervised learning problem.

The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2/0.4 = 0.5 in the database in Table 1: the item-set {milk, bread, butter} occurs in 1 of 5 transactions (support 0.2), so the rule is correct for 50% of the transactions containing milk and bread.
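The support and confidence figures above can be checked with a few lines of code. The following Python sketch is a direct transcription of the two definitions applied to the five supermarket transactions; it is not the apriori algorithm, and the function names `support` and `confidence` are chosen here for illustration.

```python
# The five transactions from the supermarket example.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """conf(X => Y) = supp(X union Y) / supp(X)."""
    lhs, rhs = set(lhs), set(rhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)


print(support({"milk", "bread"}, transactions))                 # 0.4
print(confidence({"milk", "bread"}, {"butter"}, transactions))  # 0.5
```

Scanning every transaction for every candidate item-set is fine for five rows, but it does not scale; algorithms such as apriori exist to prune the candidates.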
Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

The code that runs the apriori algorithm can be found in the script bda/part3/apriori.R.

    # Load the library for doing association rules
    # install.packages('arules')
    library(arules)

    # Data preprocessing
    data("AdultUCI")
    AdultUCI[1:2,]

    # Drop attributes that are not used
    AdultUCI[["fnlwgt"]] <- NULL
    AdultUCI[["education-num"]] <- NULL

    # Convert the numeric attributes into ordered categories
    AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15,25,45,65,100)),
       labels = c("Young", "Middle-aged", "Senior", "Old"))
    AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
       c(0,25,40,60,168)),
       labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
    AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
       c(-Inf,0,median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]]>0]),Inf)),
       labels = c("None", "Low", "High"))
    AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
       c(-Inf,0,median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]]>0]),Inf)),
       labels = c("none", "low", "high"))

To generate rules using the apriori algorithm, we need to create a transaction matrix. The following code shows how to do this in R.
    # Convert the data into a transactions format
    Adult <- as(AdultUCI, "transactions")
    Adult
    # transactions in sparse format with
    # 48842 transactions (rows) and
    # 115 items (columns)

    summary(Adult)

    # Plot frequent item-sets
    itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)

    # generate rules
    min_support = 0.01
    confidence = 0.6
    rules <- apriori(Adult, parameter = list(support = min_support, confidence = confidence))

    rules
    inspect(rules[100:110, ])
    # lhs                               rhs                                support    confidence lift
    # {occupation = Farming-fishing} => {sex = Male}                       0.02856148 0.9362416  1.4005486
    # {occupation = Farming-fishing} => {race = White}                     0.02831579 0.9281879  1.0855456
    # {occupation = Farming-fishing} => {native-country = United-States}   0.02671881 0.8758389  0.9759474