Cognos – List Report

A list report shows data in rows and columns; each cell displays a value from the database, and you can also add custom calculations to a list report.

To create a new list report, go to New → Blank as shown in the following screenshot. When you select a list report, you get the following structure of the report in Report Studio.

You have to drag objects from the package on the left side into the report structure. You can also edit the title of the report, which will appear once you run the report. You can use the different tools at the top for report formatting.

To save a report, click the Save button. To run a report, click Run Report. Once you save the report, you have the option to save it in the Public folder or My folder.

When you click the Run option, you can select different formats in which to run the report.
Cognos – Home
Cognos Tutorial

IBM Cognos Business Intelligence is a web-based reporting and analytics tool. It is used to perform data aggregation and create user-friendly, detailed reports. IBM Cognos provides a wide range of features and can be considered an enterprise solution that offers a flexible reporting environment and can be used by large and medium enterprises. Cognos also provides an option to export reports to XML or PDF format, or to view reports in XML format.

Audience

This tutorial meets the needs of power users, analysts, business managers and company executives. Power users and analysts want to create ad-hoc reports and multiple views of the same data. Business executives want to see summarized data in dashboards, cross tabs and visualizations. Cognos provides both options for all sets of users.

Prerequisites

IBM Cognos Business Intelligence is an advanced topic. Even though the content has been prepared keeping in mind the requirements of a beginner, the reader should be familiar with the fundamentals of running and viewing reports, managing schedules, portal layouts, and other users' permissions before starting with this tutorial.
Apache Tajo – Introduction
Distributed Data Warehouse System

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts take informed decisions in an organization, but relational data volumes keep increasing day by day. To overcome these challenges, a distributed data warehouse system shares data across multiple data repositories for the purpose of Online Analytical Processing (OLAP). Each data warehouse may belong to one or more organizations. It performs load balancing and provides scalability. Metadata is replicated and centrally distributed.

Apache Tajo is a distributed data warehouse system which uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine instead of the MapReduce framework.

Overview of SQL on Hadoop

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment. It is extremely fast and powerful. However, Hadoop has limited querying capabilities, so its performance can be made even better with the help of SQL on Hadoop, which allows users to interact with Hadoop through simple SQL commands. Some examples of SQL-on-Hadoop applications are Hive, Impala, Drill, Presto, Spark, HAWQ and Apache Tajo.

What is Apache Tajo

Apache Tajo is a relational and distributed data processing framework. It is designed for low-latency and scalable ad-hoc query analysis. Tajo supports standard SQL and various data formats. Most Tajo queries can be executed without any modification. Tajo provides fault tolerance through a restart mechanism for failed tasks and an extensible query rewrite engine. Tajo performs the necessary ETL (Extract, Transform and Load) operations to summarize large datasets stored on HDFS. It is an alternative to Hive/Pig. The latest version of Tajo has greater connectivity to Java programs and third-party databases such as Oracle and PostgreSQL.

Features of Apache Tajo

Apache Tajo has the following features −

Superior scalability and optimized performance
Low latency
User-defined functions
Row/columnar storage processing framework
Compatibility with HiveQL and Hive MetaStore
Simple data flow and easy maintenance

Benefits of Apache Tajo

Apache Tajo offers the following benefits −

Easy to use
Simplified architecture
Cost-based query optimization
Vectorized query execution plan
Fast delivery
Simple I/O mechanism and support for various types of storage
Fault tolerance

Use Cases of Apache Tajo

The following are some of the use cases of Apache Tajo −

Data warehousing and analysis − Korea's SK Telecom ran Tajo against 1.7 terabytes of data and found it could complete queries faster than either Hive or Impala.

Data discovery − The Korean music streaming service Melon uses Tajo for analytical processing. Tajo executes ETL (extract-transform-load) jobs 1.5 to 10 times faster than Hive.

Log analysis − Bluehole Studio, a Korea-based company, developed TERA, a fantasy multiplayer online game. The company uses Tajo for game log analysis and for finding the principal causes of service quality interruptions.

Storage and Data Formats

Apache Tajo supports the following data formats −

JSON
Text file (CSV)
Parquet
Sequence File
AVRO
Protocol Buffer
Apache ORC

Tajo supports the following storage types −

HDFS
JDBC
Amazon S3
Apache HBase
Elasticsearch
Big Data Analytics – Association Rules
Let I = {i1, i2, …, in} be a set of n binary attributes called items. Let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, item-sets) X and Y are called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of the rule.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items is shown in the following table.

Transaction ID   Items
1                milk, bread
2                bread, butter
3                beer
4                milk, bread, butter
5                bread, butter

An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter.

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support supp(X) of an item-set X is defined as the proportion of transactions in the data set which contain the item-set. In the example database above, the item-set {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions). Finding frequent item-sets can be seen as a simplification of the unsupervised learning problem.

The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2 / 0.4 = 0.5 in the example database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

The script located in bda/part3/apriori.R contains the code to implement the apriori algorithm.

# Load the library for doing association rules
# install.packages('arules')
library(arules)

# Data preprocessing
data("AdultUCI")
AdultUCI[1:2, ]

AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL

AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
   labels = c("Young", "Middle-aged", "Senior", "Old"))

AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]], c(0, 25, 40, 60, 168)),
   labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
   c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
   labels = c("None", "Low", "High"))

AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
   c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
   labels = c("None", "Low", "High"))

In order to generate rules using the apriori algorithm, we need to create a transaction matrix. The following code shows how to do this in R.
# Convert the data into a transactions format
Adult <- as(AdultUCI, "transactions")
Adult
# transactions in sparse format with
# 48842 transactions (rows) and
# 115 items (columns)

summary(Adult)

# Plot frequent item-sets
itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)

# Generate rules
min_support = 0.01
confidence = 0.6
rules <- apriori(Adult, parameter = list(support = min_support, confidence = confidence))

rules
inspect(rules[100:110, ])
#   lhs                               rhs                                support    confidence lift
#   {occupation = Farming-fishing} => {sex = Male}                       0.02856148 0.9362416  1.4005486
#   {occupation = Farming-fishing} => {race = White}                     0.02831579 0.9281879  1.0855456
#   {occupation = Farming-fishing} => {native-country = United-States}   0.02671881 0.8758389  0.9759474
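To connect the support and confidence definitions to the small supermarket database shown earlier, the following short sketch can help. It is an illustrative addition, not part of the bda/part3/apriori.R script; the transactions list and the supp helper function are introduced here purely to reproduce the hand-computed values of 0.4 and 0.5.

# Illustrative sketch: reproduce the hand-computed support and confidence
# for the toy supermarket database from the table above.
transactions <- list(
   c("milk", "bread"),
   c("bread", "butter"),
   c("beer"),
   c("milk", "bread", "butter"),
   c("bread", "butter")
)

# Support of an item-set: proportion of transactions containing all of its items
supp <- function(items, db) mean(sapply(db, function(t) all(items %in% t)))

supp(c("milk", "bread"), transactions)
# [1] 0.4

# Confidence of {milk, bread} => {butter} is supp(X union Y) / supp(X)
supp(c("milk", "bread", "butter"), transactions) / supp(c("milk", "bread"), transactions)
# [1] 0.5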
Cognos – Creating a Report
You can create a new report by inserting objects from the data source in Query Studio. You can also change an existing report and save it with a different name.

You can open Query Studio by going to the Query my data option on the home page, or by going to Launch → Query Studio.

In the next screen, you will be prompted to select a package to add objects to the report. You can select a recently used package or any other package created in Framework Manager.

You can see the Query items listed on the left side. You can add data and save the report.
Cognos – Ad-hoc Reports
Using ad-hoc reporting, a user can create queries or reports for ad-hoc analysis. The ad-hoc reporting feature allows business users to create simple queries and reports on top of the fact and dimension tables in a data warehouse.

Query Studio in Cognos BI provides the following features −

View data and perform ad-hoc data analysis.
Save the report for future use.
Work with data in the report by applying filters, summaries and calculations.

To create an ad-hoc report using Query Studio, log in to the IBM Cognos software and click on Query my data. Select the report package. The next time you visit this page, you will see your selection under the recently used packages. Click on the package name.

In the next screen, you can add dimension elements, filters and prompts, facts and calculations, etc. You should insert the objects in the following order; to insert an object into the report, you can use the Insert button at the bottom.

Insert and filter dimension elements
Insert filters and prompts
Insert facts and calculations
Apply finishing touches
Save, run, collaborate, and share

At the top, you have the toolbar, where you can create a new report, save an existing report, cut, paste, insert charts, drill up and down, etc.

When you have inserted all the objects into the report, you can click on the Run option at the top.
Cognos – Connections
You can create interactive user reports in Cognos Studio on top of various data sources by creating relational and OLAP connections in the web administration interface; these connections are later used for data modeling in Framework Manager and published as packages. All the reports and dashboards created in Cognos Studio are published to Cognos Connection and the portal for distribution. Report Studio can be used to run complex reports and to view Business Intelligence information, or the reports can be accessed from the different portals where they are published.

Cognos Connection is used to access reports, queries, analyses, and packages. It can also be used to create report shortcuts, URLs and pages, to organize entries, and it can be customized for other uses.

Connecting Different Data Sources

A data source defines the physical connection to a database and the different connection parameters, such as the connection timeout, the location of the database, etc. A data source connection contains credentials and sign-on information. You can create a new database connection or edit an existing data source connection. You can also combine one or more data source connections into packages and publish them using Framework Manager.

Dynamic Query Mode

The dynamic query mode is used to communicate with data sources using XMLA/Java connections. To connect to a relational database, you can use a type 4 JDBC connection, which converts JDBC calls into the vendor-specific format. It provides improved performance over type 2 drivers because there is no need to convert calls to ODBC or a database API. (An illustrative sketch of what a type 4 JDBC connection looks like outside of Cognos is given at the end of this chapter.)

The dynamic query mode in Cognos Connection can support the following types of relational databases −

Microsoft SQL Server
Oracle
IBM DB2
Teradata
Netezza

To support OLAP data sources, Java/XMLA connectivity provides optimized and enhanced MDX for different OLAP versions and technologies. The dynamic query mode in Cognos can be used with the following OLAP data sources −

SAP Business Information Warehouse (SAP BW)
Oracle Essbase
Microsoft Analysis Services
IBM Cognos TM1
IBM Cognos Real-time Monitoring

DB2 Data Sources

The DB2 connection type is used to connect to DB2 for Windows, UNIX and Linux, DB2 for z/OS, etc. The common connection parameters used in a DB2 data source include −

Database Name
Timeouts
Signon
DB2 connect string
Collation Sequence

Creating a Data Source Connection in IBM Cognos

To create models in IBM Cognos Framework Manager, you first need to create a data source connection. When defining the data source connection, you need to enter the connection parameters: the location of the database, the timeout interval, the sign-on, etc.

In IBM Cognos Connection, click Launch → IBM Cognos Administration. In the Configuration tab, click Data Source Connections. In this window, navigate to the New Data Source button.

Enter a unique connection name and description. You can add a description related to the data source to uniquely identify the connection, and click the Next button. Select the type of connection from the drop-down list and click the Next button as shown in the following screenshot.

In the next screen that appears, enter the connection details as shown in the following screenshot. You can use the Test the connection option to verify connectivity to the data source using the connection parameters that you have defined. Click the Finish button once done.

Data Source Security Setup

Data source security can be defined using IBM Cognos authentication.
Depending on the data source, different types of authentication can be configured in Cognos Connection −

No Authentication − This allows login to the data source without using any sign-on credentials. This type of connection does not provide data source security.

IBM Cognos Software Service Credential − In this type of sign-on, you log in to the data source using a logon specified for the IBM Cognos Service, and the user does not require a separate database sign-on. In a live environment, it is advisable to use individual database sign-ons.

External Namespace − This requires the same BI logon credentials that are used to authenticate against the external authentication namespace. The user must be logged in to the namespace before logging in to the data source, and the namespace must be active.

All data sources also support data source sign-ons defined for everyone in a group or for individual users, groups or roles. If the data source requires a data source sign-on but you do not have access to a sign-on for this data source, you will be prompted to log on each time you access the data source.

IBM Cognos also supports security at the cube level. If you are using cubes, security may be set at the cube level. For Microsoft Analysis Services, security is defined using cube-level roles.
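The Cognos data source connection itself is configured entirely through the administration screens described above. For readers unfamiliar with what the type 4 JDBC connection mentioned under Dynamic Query Mode looks like in practice, the following sketch opens a direct DB2 connection from R using the RJDBC package. This is an illustrative addition only: the host, port, database name, credentials and jar location are placeholders, and the snippet is independent of Cognos.

# Illustrative only: a direct type 4 JDBC connection to DB2 from R.
# Host, port, database, user, password and jar path below are placeholders.
library(RJDBC)

# IBM's type 4 JDBC driver ships as db2jcc4.jar
drv <- JDBC(driverClass = "com.ibm.db2.jcc.DB2Driver",
            classPath = "/path/to/db2jcc4.jar")

# A type 4 URL encodes host, port and database directly; no ODBC layer is involved
conn <- dbConnect(drv, "jdbc:db2://dbhost:50000/SAMPLE",
                  user = "db2user", password = "db2password")

dbGetQuery(conn, "SELECT COUNT(*) FROM SYSCAT.TABLES")   # simple sanity check
dbDisconnect(conn)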
Big Data Analytics – Time Series Analysis

A time series is a sequence of observations of categorical or numeric variables indexed by a date or timestamp. A clear example of time series data is the time series of a stock price. In the following table, we can see the basic structure of time series data; in this case the observations are recorded every hour.

Timestamp              Stock Price
2015-10-11 09:00:00    100
2015-10-11 10:00:00    110
2015-10-11 11:00:00    105
2015-10-11 12:00:00     90
2015-10-11 13:00:00    120

Normally, the first step in time series analysis is to plot the series; this is usually done with a line chart.

The most common application of time series analysis is forecasting future values of a numeric variable using the temporal structure of the data. This means the available observations are used to predict values in the future. The temporal ordering of the data implies that traditional regression methods are not useful. In order to build robust forecasts, we need models that take into account the temporal ordering of the data.

The most widely used model for time series analysis is the Autoregressive Moving Average (ARMA) model. The model consists of two parts, an autoregressive (AR) part and a moving average (MA) part. The model is usually referred to as the ARMA(p, q) model, where p is the order of the autoregressive part and q is the order of the moving average part.

Autoregressive Model

AR(p) is read as an autoregressive model of order p. Mathematically it is written as −

$$X_t = c + \sum_{i = 1}^{p} \phi_i X_{t - i} + \varepsilon_t$$

where {φ1, …, φp} are parameters to be estimated, c is a constant, and the random variable εt represents white noise. Some constraints on the values of the parameters are necessary so that the model remains stationary.

Moving Average

The notation MA(q) refers to the moving average model of order q −

$$X_t = \mu + \varepsilon_t + \sum_{i = 1}^{q} \theta_i \varepsilon_{t - i}$$

where θ1, …, θq are the parameters of the model, μ is the expectation of Xt, and εt, εt − 1, … are white noise error terms.

Autoregressive Moving Average

The ARMA(p, q) model combines p autoregressive terms and q moving-average terms. Mathematically the model is expressed with the following formula −

$$X_t = c + \varepsilon_t + \sum_{i = 1}^{p} \phi_i X_{t - i} + \sum_{i = 1}^{q} \theta_i \varepsilon_{t - i}$$

We can see that the ARMA(p, q) model is a combination of the AR(p) and MA(q) models. To give some intuition for the model, consider that the AR part of the equation seeks to estimate parameters for the past observations Xt − i in order to predict the value of the variable Xt; in the end it is a weighted average of the past values. The MA part uses the same approach but with the errors of previous observations, εt − i. So in the end, the result of the model is a weighted average.

The following code snippet demonstrates how to implement an ARMA(p, q) model in R.

# install.packages("forecast")
library("forecast")

# Read the data
data = scan("fancy.dat")
ts_data <- ts(data, frequency = 12, start = c(1987, 1))
ts_data
plot.ts(ts_data)

Plotting the data is normally the first step to find out if there is a temporal structure in the data. We can see from the plot that there are strong spikes at the end of each year.

The following code fits an ARMA model to the data. It runs several combinations of models and selects the one with the smallest error.
# Fit the ARMA model
fit = auto.arima(ts_data)
summary(fit)

# Series: ts_data
# ARIMA(1,1,1)(0,1,1)[12]
# Coefficients:
#          ar1      ma1     sma1
#       0.2401  -0.9013   0.7499
# s.e.  0.1427   0.0709   0.1790
#
# sigma^2 estimated as 15464184: log likelihood = -693.69
# AIC = 1395.38   AICc = 1395.98   BIC = 1404.43
#
# Training set error measures:
#                    ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
# Training set  328.301 3615.374 2171.002 -2.481166 15.97302 0.4905797 -0.02521172
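Once the model has been fitted, it can be used for the forecasting task described at the beginning of this chapter. As a minimal sketch (this block is an addition; the 12-month horizon is an arbitrary choice, not part of the original script), the forecast function of the same forecast package produces point forecasts and prediction intervals from the fitted object −

# Forecast the next 12 months from the fitted model and plot the result
fc <- forecast(fit, h = 12)
fc         # point forecasts with 80% and 95% prediction intervals
plot(fc)   # original series with the forecasts appended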
Big Data Analytics – Data Visualization

In order to understand data, it is often useful to visualize it. Normally in Big Data applications, the interest lies in finding insight rather than just making beautiful plots. The following are examples of different approaches to understanding data using plots.

To start analyzing the flights data, we can begin by checking whether there are correlations between the numeric variables. This code is also available in the bda/part1/data_visualization/data_visualization.R file.

# Install the package corrplot by running install.packages("corrplot"),
# then load the library
library(corrplot)

# Load the following libraries
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# We will continue working with the flights data
DT <- as.data.table(flights)
head(DT)   # take a look

# We select the numeric variables after inspecting the first rows.
numeric_variables = c("dep_time", "dep_delay", "arr_time", "arr_delay", "air_time", "distance")

# Select numeric variables from the DT data.table
dt_num = DT[, numeric_variables, with = FALSE]

# Compute the correlation matrix of dt_num
cor_mat = cor(dt_num, use = "complete.obs")
print(cor_mat)
### Here is the correlation matrix
#               dep_time   dep_delay    arr_time   arr_delay    air_time    distance
# dep_time    1.00000000  0.25961272  0.66250900  0.23230573 -0.01461948 -0.01413373
# dep_delay   0.25961272  1.00000000  0.02942101  0.91480276 -0.02240508 -0.02168090
# arr_time    0.66250900  0.02942101  1.00000000  0.02448214  0.05429603  0.04718917
# arr_delay   0.23230573  0.91480276  0.02448214  1.00000000 -0.03529709 -0.06186776
# air_time   -0.01461948 -0.02240508  0.05429603 -0.03529709  1.00000000  0.99064965
# distance   -0.01413373 -0.02168090  0.04718917 -0.06186776  0.99064965  1.00000000

# We can display it visually to get a better understanding of the data
corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse")

# Save it to disk
png("corrplot.png")
print(corrplot.mixed(cor_mat, lower = "circle", upper = "ellipse"))
dev.off()

This code generates the following correlation matrix visualization −

We can see in the plot that there is a strong correlation between some of the variables in the dataset. For example, arrival delay and departure delay seem to be highly correlated. We can see this because the ellipse shows an almost linear relationship between the two variables; however, it is not simple to infer causation from this result. We cannot say that because two variables are correlated, one has an effect on the other. We also find in the plot a strong correlation between air time and distance, which is fairly reasonable to expect: with more distance, the flight time should grow.

We can also do univariate analysis of the data. A simple and effective way to visualize distributions is with box-plots. The following code demonstrates how to produce box-plots and trellis charts using the ggplot2 library. This code is also available in the bda/part1/data_visualization/boxplots.R file.
source("data_visualization.R")

### Analyzing distributions using box-plots

# The following shows the distance as a function of the carrier
p = ggplot(DT, aes(x = carrier, y = distance, fill = carrier)) +  # Define carrier on the x axis and distance on the y axis
   geom_boxplot() +        # Use the box-plot geom
   theme_bw() +            # Leave a white background - more in line with Tufte's principles than the default
   guides(fill = FALSE) +  # Remove the legend
   labs(title = "Distance as a function of carrier",  # Add labels
      x = "Carrier", y = "Distance")
p

# Save to disk
png("boxplot_carrier.png")
print(p)
dev.off()

# Let's now add another variable, the month of each flight
# We will be using facet_wrap for this
p = ggplot(DT, aes(carrier, distance, fill = carrier)) +
   geom_boxplot() +
   theme_bw() +
   guides(fill = FALSE) +
   facet_wrap(~month) +    # This creates the trellis plot with the by-month variable
   labs(title = "Distance as a function of carrier by month",
      x = "Carrier", y = "Distance")
p
# The plot shows there aren't clear differences between distance in different months

# Save to disk
png("boxplot_carrier_by_month.png")
print(p)
dev.off()
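Box-plots are not the only option for looking at a single variable. As a small additional sketch (this block is an addition to the original scripts; it reuses the DT data.table built earlier, and the 15-minute bin width is an arbitrary choice), a histogram of departure delays shows the shape of one variable's distribution with the same ggplot2 library −

# Histogram of departure delays, reusing the DT data.table defined earlier
p_hist = ggplot(DT[!is.na(dep_delay)], aes(x = dep_delay)) +
   geom_histogram(binwidth = 15) +   # 15-minute bins; adjust to taste
   theme_bw() +
   labs(title = "Distribution of departure delays",
      x = "Departure delay (minutes)", y = "Number of flights")
p_hist

# Save to disk
png("histogram_dep_delay.png")
print(p_hist)
dev.off()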