Big Data Analytics – Introduction to R

This section introduces users to the R programming language. R can be downloaded from the CRAN website. For Windows users, it is useful to install Rtools and the RStudio IDE. The general concept behind R is to serve as an interface to software developed in compiled languages such as C, C++, and Fortran, and to give the user an interactive tool to analyze data.

Navigate to the folder of the book zip file bda/part2/R_introduction and open the R_introduction.Rproj file. This will open an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful way to learn is to type the code yourself; this will help you get used to R syntax. In R, comments are written with the # symbol.

In order to display the results of running R code in the book, the results R returns are shown as comments after the code is evaluated. This way, you can copy and paste the code from the book and try sections of it directly in R.

# Create a vector of numbers
numbers = c(1, 2, 3, 4, 5)
print(numbers)
# [1] 1 2 3 4 5

# Create a vector of letters
ltrs = c("a", "b", "c", "d", "e")
# [1] "a" "b" "c" "d" "e"

# Concatenate both
mixed_vec = c(numbers, ltrs)
print(mixed_vec)
# [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"

Let's analyze what happened in the previous code. We can see it is possible to create vectors with numbers and with letters; we did not need to tell R what data type we wanted beforehand. Finally, we were able to create a vector with both numbers and letters. The vector mixed_vec has coerced the numbers to character; we can see this because the values are printed inside quotes.

The following code shows the data types of different vectors as returned by the function class. It is common to use the class function to "interrogate" an object, asking it what its class is.

### Evaluate the data types using class

### One dimensional objects

# Integer vector
num = 1:10
class(num)
# [1] "integer"

# Numeric vector, it has a float, 10.5
num = c(1:10, 10.5)
class(num)
# [1] "numeric"

# Character vector
ltrs = letters[1:10]
class(ltrs)
# [1] "character"

# Factor vector
fac = as.factor(ltrs)
class(fac)
# [1] "factor"

R supports two-dimensional objects as well. The following code shows examples of the two most popular data structures used in R: the matrix and the data.frame.

# Matrix
M = matrix(1:12, ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

lM = matrix(letters[1:12], ncol = 4)
#      [,1] [,2] [,3] [,4]
# [1,] "a"  "d"  "g"  "j"
# [2,] "b"  "e"  "h"  "k"
# [3,] "c"  "f"  "i"  "l"

# cbind concatenates two matrices (or vectors) in one matrix
# and coerces the numbers to character
cbind(M, lM)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,] "1"  "4"  "7"  "10" "a"  "d"  "g"  "j"
# [2,] "2"  "5"  "8"  "11" "b"  "e"  "h"  "k"
# [3,] "3"  "6"  "9"  "12" "c"  "f"  "i"  "l"

class(M)
# [1] "matrix"
class(lM)
# [1] "matrix"

# data.frame
# One of the main objects of R, it handles different data types in the same object.
# It is possible to have numeric, character and factor vectors in the same data.frame
df = data.frame(n = 1:5, l = letters[1:5])
df
#   n l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e

As demonstrated in the previous example, it is possible to use different data types in the same object. In general, this is how data is presented in databases and APIs: part of the data is text or character vectors and other parts are numeric.
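A quick way to confirm the class of each column in such a mixed object is shown below. This small check is not part of the original 01_vectors.R script, but it uses only base R.

# Inspect the class of each column of the data.frame
sapply(df, class)
# With R versions before 4.0 (stringsAsFactors = TRUE by default) the character
# column l is stored as a factor; with R >= 4.0 it stays as character.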
It is the analyst's job to determine which statistical data type to assign and then use the correct R data type for it. In statistics, we normally consider variables of the following types −

Numeric
Nominal or categorical
Ordinal

In R, a vector can be of the following classes −

Numeric – Integer
Factor
Ordered Factor

R provides a data type for each statistical type of variable. The ordered factor is rarely used, but can be created with the function factor, or ordered.

The following section treats the concept of indexing. This is a quite common operation, and deals with the problem of selecting sections of an object and making transformations to them.

# Let's create a data.frame
df = data.frame(numbers = 1:26, letters)
head(df)
#   numbers letters
# 1       1       a
# 2       2       b
# 3       3       c
# 4       4       d
# 5       5       e
# 6       6       f

# str gives the structure of a data.frame, it's a good summary to inspect an object
str(df)
# 'data.frame': 26 obs. of 2 variables:
#  $ numbers: int 1 2 3 4 5 6 7 8 9 10 ...
#  $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...

# The latter shows the letters character vector was coerced to a factor.
# This can be explained by the stringsAsFactors = TRUE argument in data.frame
# read ?data.frame for more information

class(df)
# [1] "data.frame"

### Indexing

# Get the first row
df[1, ]
#   numbers letters
# 1       1       a

# Used for programming normally – returns the output as a list
df[1, , drop = TRUE]
# $numbers
# [1] 1
#
# $letters
# [1] a
# Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

# Get several rows
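The original script continues with more indexing examples. A minimal sketch of how selecting several rows and columns works, using only base R (not the exact code from 01_vectors.R), is shown below.

# Get several rows by position
df[1:3, ]
#   numbers letters
# 1       1       a
# 2       2       b
# 3       3       c

# Get a single column by name; the $ operator returns the vector itself
df$numbers[1:3]
# [1] 1 2 3

# Logical indexing: keep only the rows where numbers is greater than 20
df[df$numbers > 20, ]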

Big Data Analytics – Summarizing Data

Reporting is very important in big data analytics. Every organization must have a regular provision of information to support its decision making process. This task is normally handled by data analysts with SQL and ETL (extract, transform, and load) experience. The team in charge of this task has the responsibility of spreading the information produced in the big data analytics department to the different areas of the organization.

The following example demonstrates what summarization of data means. Navigate to the folder bda/part1/summarize_data and inside the folder, open the summarize_data.Rproj file by double clicking it. Then, open the summarize_data.R script, take a look at the code, and follow the explanations presented.

# Install the following packages by running the following code in R.
pkgs = c("data.table", "ggplot2", "nycflights13", "reshape2")
install.packages(pkgs)

The ggplot2 package is great for data visualization. The data.table package is a great option for fast and memory efficient summarization in R. A recent benchmark shows it is even faster than pandas, the Python library used for similar tasks.

Take a look at the data using the following code. This code is also available in the bda/part1/summarize_data/summarize_data.R file.

library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# Convert the flights data.frame to a data.table object and call it DT
DT <- as.data.table(flights)

# The data has 336776 rows and 16 columns
dim(DT)

# Take a look at the first rows
head(DT)

#    year month day dep_time dep_delay arr_time arr_delay carrier
# 1: 2013     1   1      517         2      830        11      UA
# 2: 2013     1   1      533         4      850        20      UA
# 3: 2013     1   1      542         2      923        33      AA
# 4: 2013     1   1      544        -1     1004       -18      B6
# 5: 2013     1   1      554        -6      812       -25      DL
# 6: 2013     1   1      554        -4      740        12      UA

#    tailnum flight origin dest air_time distance hour minute
# 1:  N14228   1545    EWR  IAH      227     1400    5     17
# 2:  N24211   1714    LGA  IAH      227     1416    5     33
# 3:  N619AA   1141    JFK  MIA      160     1089    5     42
# 4:  N804JB    725    JFK  BQN      183     1576    5     44
# 5:  N668DN    461    LGA  ATL      116      762    5     54
# 6:  N39463   1696    EWR  ORD      150      719    5     54

The following code has an example of data summarization.
### Data Summarization

# Compute the mean arrival delay
DT[, list(mean_arrival_delay = mean(arr_delay, na.rm = TRUE))]
#    mean_arrival_delay
# 1:           6.895377

# Now, we compute the same value but for each carrier
mean1 = DT[, list(mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]
print(mean1)
#     carrier mean_arrival_delay
#  1:      UA          3.5580111
#  2:      AA          0.3642909
#  3:      B6          9.4579733
#  4:      DL          1.6443409
#  5:      EV         15.7964311
#  6:      MQ         10.7747334
#  7:      US          2.1295951
#  8:      WN          9.6491199
#  9:      VX          1.7644644
# 10:      FL         20.1159055
# 11:      AS         -9.9308886
# 12:      9E          7.3796692
# 13:      F9         21.9207048
# 14:      HA         -6.9152047
# 15:      YV         15.5569853
# 16:      OO         11.9310345

# Now let's compute two means in the same line of code
mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
   mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]
print(mean2)
#     carrier mean_departure_delay mean_arrival_delay
#  1:      UA            12.106073          3.5580111
#  2:      AA             8.586016          0.3642909
#  3:      B6            13.022522          9.4579733
#  4:      DL             9.264505          1.6443409
#  5:      EV            19.955390         15.7964311
#  6:      MQ            10.552041         10.7747334
#  7:      US             3.782418          2.1295951
#  8:      WN            17.711744          9.6491199
#  9:      VX            12.869421          1.7644644
# 10:      FL            18.726075         20.1159055
# 11:      AS             5.804775         -9.9308886
# 12:      9E            16.725769          7.3796692
# 13:      F9            20.215543         21.9207048
# 14:      HA             4.900585         -6.9152047
# 15:      YV            18.996330         15.5569853
# 16:      OO            12.586207         11.9310345

### Create a new variable called gain
# this is the difference between arrival delay and departure delay
DT[, gain := arr_delay - dep_delay]

# Compute the median gain per carrier
median_gain = DT[, median(gain, na.rm = TRUE), by = carrier]
print(median_gain)
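data.table also makes it easy to combine several aggregations in one call. The short sketch below is not part of the original summarize_data.R script; it counts the flights and averages the new gain variable per carrier using data.table's special .N symbol.

# Count flights and average gain per carrier (illustrative addition, not in the book's script)
summary_by_carrier = DT[, list(n_flights = .N,
   mean_gain = mean(gain, na.rm = TRUE)), by = carrier]
print(summary_by_carrier)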

Cassandra – Batch

Using Batch Statements

Using BATCH, you can execute multiple modification statements (insert, update, delete) simultaneously. Its syntax is as follows −

BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH

Example

Assume there is a table in Cassandra called emp having the following data −

emp_id   emp_name   emp_city    emp_phone    emp_sal
1        ram        Hyderabad   9848022338   50000
2        robin      Delhi       9848022339   50000
3        rahman     Chennai     9848022330   45000

In this example, we will perform the following operations −

Insert a new row with the following details (4, rajeev, pune, 9848022331, 30000).
Update the salary of the employee with row id 3 to 50000.
Delete the city of the employee with row id 2.

To perform the above operations in one go, use the following BATCH command −

cqlsh:tutorialspoint> BEGIN BATCH
... INSERT INTO emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) values(4, 'Pune', 'rajeev', 9848022331, 30000);
... UPDATE emp SET emp_sal = 50000 WHERE emp_id = 3;
... DELETE emp_city FROM emp WHERE emp_id = 2;
... APPLY BATCH;

Verification

After making changes, verify the table using the SELECT statement. It should produce the following output −

cqlsh:tutorialspoint> select * from emp;

 emp_id | emp_city  | emp_name | emp_phone  | emp_sal
--------+-----------+----------+------------+---------
      1 | Hyderabad |      ram | 9848022338 |   50000
      2 |      null |    robin | 9848022339 |   50000
      3 |   Chennai |   rahman | 9848022330 |   50000
      4 |      Pune |   rajeev | 9848022331 |   30000

(4 rows)

Here you can observe the table with the modified data.

Batch Statements using Java API

Batch statements can be executed programmatically on a table using the execute() method of the Session class. Follow the steps given below to execute multiple statements using a batch statement with the help of the Java API.

Step 1: Create a Cluster Object

Create an instance of the Cluster.builder class of the com.datastax.driver.core package as shown below.

//Creating Cluster.Builder object
Cluster.Builder builder1 = Cluster.builder();

Add a contact point (IP address of the node) using the addContactPoint() method of the Cluster.Builder object. This method returns Cluster.Builder.

//Adding contact point to the Cluster.Builder object
Cluster.Builder builder2 = builder1.addContactPoint("127.0.0.1");

Using the new builder object, create a cluster object. To do so, you have a method called build() in the Cluster.Builder class. Use the following code to create the cluster object −

//Building a cluster
Cluster cluster = builder2.build();

You can build the cluster object using a single line of code as shown below.

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Step 2: Create a Session Object

Create an instance of the Session object using the connect() method of the Cluster class as shown below.

Session session = cluster.connect();

This method creates a new session and initializes it. If you already have a keyspace, then you can set it to the existing one by passing the keyspace name in string format to this method as shown below.

Session session = cluster.connect("Your keyspace name");

Here we are using the keyspace named tp. Therefore, create the session object as shown below.

Session session = cluster.connect("tp");

Step 3: Execute Query

You can execute CQL queries using the execute() method of the Session class. Pass the query either in string format or as a Statement class object to the execute() method. Whatever you pass to this method in string format will be executed on cqlsh.
In this example, we will perform the following operations −

Insert a new row with the following details (4, rajeev, pune, 9848022331, 30000).
Update the salary of the employee with row id 3 to 50000.
Delete the city of the employee with row id 2.

You have to store the query in a string variable and pass it to the execute() method as shown below.

String query1 = "BEGIN BATCH INSERT INTO emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) values(4, 'Pune', 'rajeev', 9848022331, 30000); "
   + "UPDATE emp SET emp_sal = 50000 WHERE emp_id = 3; "
   + "DELETE emp_city FROM emp WHERE emp_id = 2; "
   + "APPLY BATCH;";

Given below is the complete program to execute multiple statements simultaneously on a table in Cassandra using the Java API.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Batch {

   public static void main(String args[]){

      //query
      String query = "BEGIN BATCH INSERT INTO emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) values(4, 'Pune', 'rajeev', 9848022331, 30000); "
         + "UPDATE emp SET emp_sal = 50000 WHERE emp_id = 3; "
         + "DELETE emp_city FROM emp WHERE emp_id = 2; "
         + "APPLY BATCH;";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the query
      session.execute(query);

      System.out.println("Changes done");
   }
}

Save the above program in a file named after the class with a .java extension, and browse to the location where it is saved. Compile and execute the program as shown below.

$javac Batch.java
$java Batch

Under normal conditions, it should produce the following output −

Changes done

Big Data Analytics – Text Analytics

In this chapter, we will be using the data scraped in Part 1 of the book. The data has text that describes profiles of freelancers, and the hourly rate they are charging in USD. The idea of the following section is to fit a model that, given the skills of a freelancer, is able to predict his or her hourly rate.

The following code shows how to convert the raw text, which in this case contains the skills of a user, into a bag-of-words matrix. For this we use an R library called tm. This means that for each word in the corpus we create a variable with the number of occurrences of that word.

library(tm)
library(data.table)

source("text_analytics/text_analytics_functions.R")
data = fread("text_analytics/data/profiles.txt")
rate = as.numeric(data$rate)
keep = !is.na(rate)
rate = rate[keep]

### Make bag of words of title and body
X_all = bag_words(data$user_skills[keep])
X_all = removeSparseTerms(X_all, 0.999)

X_all
# <<DocumentTermMatrix (documents: 389, terms: 1422)>>
#   Non-/sparse entries: 4057/549101
#   Sparsity           : 99%
#   Maximal term length: 80
#   Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

### Make a sparse matrix with all the data
X_all <- as_sparseMatrix(X_all)

Now that we have the text represented as a sparse matrix, we can fit a model that will give a sparse solution. A good alternative for this case is the LASSO (least absolute shrinkage and selection operator). This is a regression model that is able to select the most relevant features to predict the target.

train_inx = 1:200
X_train = X_all[train_inx, ]
y_train = rate[train_inx]

X_test = X_all[-train_inx, ]
y_test = rate[-train_inx]

# Train a regression model
library(glmnet)
fit <- cv.glmnet(x = X_train, y = y_train,
   family = "gaussian", alpha = 1, nfolds = 3, type.measure = "mae")
plot(fit)

# Make predictions
predictions = predict(fit, newx = X_test)
predictions = as.vector(predictions[,1])
head(predictions)
# 36.23598 36.43046 51.69786 26.06811 35.13185 37.66367

# We can compute the mean absolute error for the test data
mean(abs(y_test - predictions))
# 15.02175

Now we have a model that, given a set of skills, is able to predict the hourly rate of a freelancer. If more data is collected, the performance of the model will improve, but the code to implement this pipeline would remain the same.
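The helper functions bag_words and as_sparseMatrix come from the sourced file text_analytics/text_analytics_functions.R, which is not reproduced here. Below is a minimal sketch of what bag_words might look like using tm, consistent with the normalized tf-idf weighting shown in the output above; the actual implementation in the book's code may differ.

# Illustrative sketch only; the real helper lives in text_analytics_functions.R
library(tm)

bag_words <- function(text_vector) {
   # Build a corpus from the character vector of skills
   corpus <- VCorpus(VectorSource(text_vector))
   # Basic cleaning: lower case, drop punctuation, collapse whitespace
   corpus <- tm_map(corpus, content_transformer(tolower))
   corpus <- tm_map(corpus, removePunctuation)
   corpus <- tm_map(corpus, stripWhitespace)
   # Document-term matrix weighted with normalized tf-idf
   DocumentTermMatrix(corpus, control = list(
      weighting = function(x) weightTfIdf(x, normalize = TRUE)))
}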

Cassandra – Quick Guide

Cassandra – Introduction

Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL database does.

NoSQL Database

A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data. The primary objective of a NoSQL database is to have simplicity of design, horizontal scaling, and finer control over availability.

NoSQL databases use different data structures compared to relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.

NoSQL vs. Relational Database

The following points differentiate a relational database from a NoSQL database −

Query language − A relational database supports a powerful query language, whereas a NoSQL database supports a very simple query language.
Schema − A relational database has a fixed schema, whereas a NoSQL database has no fixed schema.
Consistency − A relational database follows ACID (Atomicity, Consistency, Isolation, and Durability), whereas a NoSQL database is only "eventually consistent".
Transactions − A relational database supports transactions, whereas a NoSQL database does not.

Besides Cassandra, we have the following NoSQL databases that are quite popular −

Apache HBase − HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as a part of the Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.

MongoDB − MongoDB is a cross-platform document-oriented database system that avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.

What is Apache Cassandra?

Apache Cassandra is an open source, distributed and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure. Listed below are some of the notable points of Apache Cassandra −

It is scalable, fault-tolerant, and consistent.
It is a column-oriented database.
Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful "column family" data model.
Cassandra is being used by some of the biggest companies such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra

Cassandra has become so popular because of its outstanding technical features. Given below are some of the features of Cassandra −

Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.

Always on architecture − Cassandra has no single point of failure and it is continuously available for business-critical applications that cannot afford a failure.
Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.

Flexible data storage − Cassandra accommodates all possible data formats including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.

Easy data distribution − Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.

Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).

Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing read efficiency.

History of Cassandra

Cassandra was developed at Facebook for inbox search.
It was open-sourced by Facebook in July 2008.
Cassandra was accepted into the Apache Incubator in March 2009.
It has been an Apache top-level project since February 2010.

Cassandra – Architecture

The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.

All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
When a node goes down, read/write requests can be served from other nodes in the network.

Data Replication in Cassandra

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If it is detected that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values. The following figure shows a schematic view of how Cassandra uses data replication among the nodes in a cluster to ensure no single point of failure.

Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with each other and detect any faulty nodes in the cluster.

Components of Cassandra

The key components of Cassandra are as follows −

Node − It is the place where data is stored.
Data center − It is a collection of related nodes.
Cluster − A cluster is a component that contains one or more data centers.
Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-tables.
SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

Cognos – Report Validation

Report validation is used to ensure that your report doesn't contain any errors. When a report created in an older version of Cognos is upgraded, it is automatically validated.

To validate a report, go to the Tools menu and click on the Validate button as shown in the following screenshot.

There are different validation levels −

Error − To retrieve all errors returned from the query.
Warning − To retrieve all errors and warnings returned from the query.
Key Transformation − To retrieve important transformation steps.
Information − To retrieve other information related to query planning and execution.

Big Data Analytics – Core Deliverables

Big data analytics entails processing and analysing large and diverse datasets to discover hidden patterns, correlations, insights, and other valuable information. As mentioned in the big data life cycle, the core deliverables of big data analytics are shown in the image below −

Machine Learning Implementation
This could be a classification algorithm, a regression model, or a segmentation model.

Recommending System
The objective is to develop a system that can recommend options based on user behaviour. For example, on Netflix, based on users' ratings for a particular movie, web series, or show, related movies, web series, and shows are recommended.

Dashboard
Businesses normally need tools to visualize aggregated data. A dashboard is a graphical representation of data which can be filtered as per users' needs, with the results reflected on screen. For example, a sales dashboard of a company may contain filter options to visualise sales nation-wise, state-wise, district-wise, zone-wise, or by product.

Insights and Patterns Identification
Big data analytics identifies trends, patterns, and correlations in data that can be used to make more informed decisions. These insights could be about customer behaviour, market trends, or operational inefficiencies.

Ad-Hoc Analysis
Ad-hoc analysis in big data analytics is the process of analysing data on the fly to answer specific, immediate queries. Unlike traditional analysis, which relies on predefined queries or structured reporting, ad-hoc analysis allows users to explore data interactively, without the need for predefined queries or reports.

Predictive Analytics
Big data analytics can forecast future trends, behaviours, and occurrences by analysing previous data. Predictive analytics helps organisations anticipate customer needs, estimate demand, optimise resources, and manage risks.

Data Visualization
Big data analytics entails presenting complex data in visual forms like charts, graphs, and dashboards. Data visualisation allows stakeholders to better grasp and analyse data insights graphically.

Optimization and Efficiency Improvement
Big data analytics enables organisations to optimise processes, operations, and resources by identifying inefficiencies and areas for improvement. This could include optimising supply chain logistics, streamlining manufacturing processes, or improving marketing strategies.

Personalization and Targeting
Big data analytics allows organisations to personalise their products, services, and marketing activities based on individual preferences and behaviour by analysing massive amounts of customer data. This personalised strategy increases customer satisfaction and marketing ROI.

Risk Management and Fraud Detection
Big data analytics can detect abnormalities and patterns that indicate fraudulent activity or possible threats. This is especially crucial in industries like finance, insurance, and cybersecurity, where early detection can prevent large losses.

Real-time Decision Making
Big data analytics can deliver insights in real or near real-time, enabling businesses to make data-driven decisions. This capability is critical in dynamic contexts where quick decisions are required to capitalise on opportunities or manage risks.

Scalability and Flexibility
Big data analytics solutions are built to manage large amounts of data from different sources and formats. They provide scalability to support growing data volumes, as well as flexibility to react to changing business requirements and data sources.

Competitive Advantage
Leveraging big data analytics efficiently can give firms a competitive advantage by allowing them to innovate, optimise processes, and better understand their consumers and market trends.

Compliance and Regulatory Requirements
Big data analytics can help firms ensure compliance with relevant regulations and standards by analysing and monitoring data for legal and ethical requirements, particularly in the healthcare and finance industries.

Overall, the core deliverables of big data analytics are focused on using data to drive strategic decision-making, increase operational efficiency, improve consumer experiences, and gain a competitive advantage in the marketplace.

Big Data Analytics – Statistical Methods

When analyzing data, it is possible to have a statistical approach. The basic tools that are needed to perform basic analysis are −

Correlation analysis
Analysis of Variance
Hypothesis Testing

When working with large datasets, this is not a problem because these methods are not computationally intensive, with the exception of Correlation Analysis. In that case, it is always possible to take a sample and the results should be robust.

Correlation Analysis

Correlation Analysis seeks to find linear relationships between numeric variables. This can be of use in different circumstances. One common use is exploratory data analysis; section 16.0.2 of the book contains a basic example of this approach. First of all, the correlation metric used in the mentioned example is based on the Pearson coefficient. There is, however, another interesting metric of correlation that is not affected by outliers. This metric is called the Spearman correlation.

The Spearman correlation metric is more robust to the presence of outliers than the Pearson method and gives better estimates of linear relations between numeric variables when the data is not normally distributed.

library(ggplot2)

# Select variables that are interesting to compare pearson and spearman correlation methods.
x = diamonds[, c("x", "y", "z", "price")]

# From the histograms we can expect differences in the correlations of both metrics.
# In this case as the variables are clearly not normally distributed, the spearman correlation
# is a better estimate of the linear relation among numeric variables.
par(mfrow = c(2, 2))
colnm = names(x)
for(i in 1:4) {
   hist(x[[i]], col = "deepskyblue3", main = sprintf("Histogram of %s", colnm[i]))
}
par(mfrow = c(1, 1))

From the histograms in the following figure, we can expect differences in the correlations of both metrics. In this case, as the variables are clearly not normally distributed, the Spearman correlation is a better estimate of the linear relation among numeric variables.

In order to compute the correlation in R, open the file bda/part2/statistical_methods/correlation/correlation.R that has this code section.

## Correlation Matrix – Pearson and spearman
cor_pearson <- cor(x, method = "pearson")
cor_spearman <- cor(x, method = "spearman")

### Pearson Correlation
print(cor_pearson)
#               x         y         z     price
# x     1.0000000 0.9747015 0.9707718 0.8844352
# y     0.9747015 1.0000000 0.9520057 0.8654209
# z     0.9707718 0.9520057 1.0000000 0.8612494
# price 0.8844352 0.8654209 0.8612494 1.0000000

### Spearman Correlation
print(cor_spearman)
#               x         y         z     price
# x     1.0000000 0.9978949 0.9873553 0.9631961
# y     0.9978949 1.0000000 0.9870675 0.9627188
# z     0.9873553 0.9870675 1.0000000 0.9572323
# price 0.9631961 0.9627188 0.9572323 1.0000000

Chi-squared Test

The chi-squared test allows us to test if two random variables are independent. This means that the probability distribution of each variable doesn't influence the other. In order to evaluate the test in R, we first need to create a contingency table, and then pass the table to the chisq.test R function.

For example, let's check if there is an association between the variables cut and color from the diamonds dataset. The test is formally defined as −

H0: The variables cut and color are independent
H1: The variables cut and color are not independent

We would assume there is a relationship between these two variables by their name, but the test can give an objective "rule" saying how significant this result is or not.
In the following code snippet, we find that the p-value of the test is 2.2e-16, which is almost zero in practical terms. After running the test with a Monte Carlo simulation, we find that the p-value is 0.0004998, which is still well below the 0.05 threshold. This result means that we reject the null hypothesis (H0), so we believe the variables cut and color are not independent.

library(ggplot2)

# Use the table function to compute the contingency table
tbl = table(diamonds$cut, diamonds$color)
tbl

#              D    E    F    G    H    I    J
# Fair       163  224  312  314  303  175  119
# Good       662  933  909  871  702  522  307
# Very Good 1513 2400 2164 2299 1824 1204  678
# Premium   1603 2337 2331 2924 2360 1428  808
# Ideal     2834 3903 3826 4884 3115 2093  896

# In order to run the test we just use the chisq.test function.
chisq.test(tbl)

# Pearson's Chi-squared test
# data: tbl
# X-squared = 310.32, df = 24, p-value < 2.2e-16

# It is also possible to compute the p-values using a monte-carlo simulation
# It's needed to add the simulate.p.value = TRUE flag and the amount of simulations
chisq.test(tbl, simulate.p.value = TRUE, B = 2000)

# Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
# data: tbl
# X-squared = 310.32, df = NA, p-value = 0.0004998

T-test

The idea of the t-test is to evaluate if there are differences in the distribution of a numeric variable between different groups of a nominal variable. In order to demonstrate this, we select the Fair and Ideal levels of the factor variable cut, and then compare the values of a numeric variable between those two groups.

data = diamonds[diamonds$cut %in% c("Fair", "Ideal"), ]
data$cut = droplevels.factor(data$cut) # Drop levels that aren't used from the cut variable
df1 = data[, c("cut", "price")]

# We can see the price means are different for each group
tapply(df1$price, df1$cut, mean)
#     Fair    Ideal
# 4358.758 3457.542

The t-tests are implemented in R with the t.test function. The formula interface to t.test is the simplest way to use it; the idea is that a numeric variable is explained by a group variable. For example: t.test(numeric_variable ~ group_variable, data = data). In the previous example, the numeric_variable is price and the group_variable is cut.

From a statistical perspective, we are testing if there are differences in the distributions of the numeric variable among the two groups. Formally, the hypothesis test is described with a null hypothesis (H0) and an alternative hypothesis (H1).

H0: There are no differences in the distributions of the price variable among the Fair and Ideal groups
H1: There are differences in the distributions of the price variable among the Fair and Ideal groups
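Following the formula interface just described, the test for this example can be run as shown below. This is a minimal sketch; the exact code in the book's script may differ, so no output is reproduced here.

# Two-sample t-test of price between the Fair and Ideal groups
t.test(price ~ cut, data = df1)
# The output reports the t statistic, degrees of freedom, p-value and the two
# group means, which can be compared against the usual 0.05 threshold.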

Big Data Adoption & Planning Considerations

Adopting big data comes with its own set of challenges and considerations, but with careful planning, organizations can maximize its benefits. Big Data initiatives should be strategic and business-driven, and the adoption of big data can facilitate this change. The use of Big Data can be transformative, but it is more often innovative; transformation activities tend to be low-risk and aim to improve efficiency and effectiveness.

The nature of Big Data and its analytic power bring issues and challenges that need to be planned for from the beginning. For example, when new technology is adopted, concerns about securing it in a way that conforms to existing corporate standards need to be addressed. Issues related to tracking the provenance of a dataset from its procurement to its utilization are often new requirements for organizations. It is also necessary to plan for the management of the privacy of constituents whose data is being processed or whose identity is revealed by analytical processes.

All of the aforementioned factors require that an organisation recognise and implement a set of distinct governance processes and decision frameworks to ensure that all parties involved understand the nature, consequences, and management requirements of Big Data. The approach to performing business analysis is also changing with the adoption of Big Data; the Big Data analytics lifecycle is an effective solution. There are different factors to consider when implementing Big Data. The following image depicts the big data adoption and planning considerations.

The primary big data adoption and planning considerations are −

Organization Prerequisites
Big Data frameworks are not turnkey solutions. Enterprises require data management and Big Data governance frameworks for data analysis and analytics to be useful. Effective processes are required for implementing, customising, filling, and utilising Big Data solutions.

Define Objectives
Outline your aims and objectives for implementing big data. Whether it's improving the customer experience, optimising processes, or improving decision-making, defined objectives give decision-makers a clear direction for framing strategy.

Data Procurement
The acquisition of Big Data solutions can be cost-effective, due to the availability of open-source platforms and tools, as well as the potential to leverage commodity hardware. A substantial budget may still be required to obtain external data. Most commercially relevant data will have to be purchased, which may necessitate continuing subscription expenses to ensure the delivery of updates to the obtained datasets.

Infrastructure
Evaluate your current infrastructure to see if it can handle big data processing and analytics. Consider whether you need to invest in new hardware, software, or cloud-based solutions to manage the volume, velocity, and variety of data.

Data Strategy
Create a comprehensive data strategy that is aligned with your business objectives. This includes determining what sorts of data are required, where to obtain them, how to store and manage them, and how to ensure their quality and security.

Data Privacy and Security
Analytics on datasets may reveal confidential data about organisations or individuals. Even datasets containing seemingly benign data can reveal private information when they are reviewed collectively.
Addressing these privacy concerns necessitates an awareness of the nature of the data being collected, as well as relevant data privacy rules and particular procedures for data tagging and anonymization. Telemetry data, such as a car's GPS record or smart meter readings, accumulated over a long period, might expose an individual's location and behavior.

Security
Ensuring the security of data networks and repositories using authentication and authorization mechanisms is an essential element of securing big data.

Provenance
Provenance refers to information about the data's origins and processing. Provenance information is used to determine the validity and quality of data and can also be used for auditing. It can be difficult to maintain provenance as large volumes of data are collected, integrated, and processed through different phases.

Limited Realtime Support
Dashboards and other applications that require streaming data and alerts frequently need real-time or near-realtime data transmissions. Many open-source Big Data solutions and tools are batch-oriented; however, a new wave of real-time open-source technologies supports streaming data processing.

Distinct Performance Challenges
With the large amounts of data that Big Data solutions must handle, performance is frequently an issue. For example, massive datasets combined with advanced search algorithms can lead to long query times.

Distinct Governance Requirements
Big Data solutions access and generate data, which becomes a corporate asset. A governance structure is essential to ensure that both the data and the solution environment are regulated, standardized, and evolved in a controlled way. Establish strong data governance policies to assure data quality, integrity, privacy, and compliance with legislation like GDPR and CCPA. Define data management roles and responsibilities, as well as processes for data access, usage, and security.

Distinct Methodology
A mechanism will be necessary to govern the flow of data into and out of Big Data systems. It will need to consider how to construct feedback loops so that processed data can be revised again.

Continuous Improvement
Big data initiatives are iterative and require ongoing development over time. Monitor performance indicators, get feedback, and fine-tune your strategy to ensure that you're getting the most out of your data investments.

By carefully examining and planning for these factors, organisations can successfully adopt and exploit big data to drive innovation, enhance efficiency, and gain a competitive advantage in today's data-driven world.

Big Data Analytics – Data Analyst

A Data Analyst is a person who collects, analyses, and interprets data to solve a particular problem. A data analyst devotes a lot of time to examining the data and presents insights in the form of graphical reports and dashboards. Hence, a data analyst has a reporting-oriented profile and has experience in extracting and analyzing data from traditional data warehouses using SQL.

Working as a data analyst in big data analytics is a dynamic role. Big data analytics involves analysing large and varied datasets to discover hidden patterns, unknown relationships, market trends, customer needs, and related valuable business insights. In today's scenario, organizations struggle to find competent data scientists in the market. It is therefore a good idea to select prospective data analysts and train them in the relevant skills to become data scientists. A competent data analyst has skills like business understanding, SQL programming, report design, and dashboard creation.

Role and Responsibilities of a Data Analyst

The image below incorporates the major roles and responsibilities of a data analyst −

Data Collection
This refers to the process of collecting data from different sources like databases, data warehouses, APIs, and IoT devices. It could include conducting surveys, tracking visitor behaviour on a company's website, or buying relevant datasets from data collection specialists.

Data Cleaning and Pre-processing
There may be duplicates, errors, or outliers in the raw data. Cleaning raw data eliminates errors, inconsistencies, and duplicates; pre-processing converts data into an analytically useful format. Cleaning data entails maintaining data quality in a spreadsheet or using a programming language so that your interpretations are correct and unbiased.

Exploratory Data Analysis (EDA)
Using statistical methods and visualization tools, data is analysed to identify trends, patterns, or relationships.

Model Data
This includes creating and designing database structures: selecting the types of data to be stored and collected, and defining how data categories are related and how the data appears.

Statistical Analysis
Applying statistical techniques to interpret data, validate hypotheses, and make predictions.

Machine Learning
Building predictive models using machine learning algorithms to predict future trends, classify data, or detect anomalies.

Data Visualization
To communicate data insights effectively to stakeholders, it is necessary to create visual representations such as charts, graphs, and dashboards.

Data Interpretation and Reporting
Communicating findings and recommendations to decision-makers through the interpretation of analysis results and the preparation of reports or presentations.

Continuous Learning
Keeping up to date with the latest developments in data analysis, big data technologies, and business trends.

A data analyst builds their proficiency on a foundation of statistics, programming languages like Python or R, database fundamentals, SQL, and big data technologies such as Hadoop, Spark, and NoSQL databases.

What Tools Does a Data Analyst Use?

A data analyst often uses the following tools to process assigned work more accurately and efficiently during data analysis.
Some common tools used by data analysts are mentioned in the image below −

Types of Data Analysts

As technology rapidly advances, the types and amounts of data that can be collected keep growing, and the ability to classify and analyse data has become an essential skill in almost every business. In the current scenario, every domain has data analyst experts, such as data analysts in the criminal justice, fashion, food, technology, business, environment, and public sectors, amongst many others. People who perform data analysis might be known as −

Medical and health care analyst
Market research analyst
Business analyst
Business intelligence analyst
Operations research analyst

Data Analyst Skills

Generally, the skills of data analysts are divided into two major groups, i.e. technical skills and behavioural skills.

Data Analyst Technical Skills

Data Cleaning − A data analyst is proficient in identifying and handling missing data, outliers, and errors in datasets.

Database Tools − Microsoft Excel and SQL are essential tools for any data analyst. Excel is the most widely used in industry, while SQL is capable of handling larger datasets, using SQL queries to manipulate and manage data as per the user's needs.

Programming Languages − Data analysts are proficient in languages such as Python, R, SQL, or others used for data manipulation, analysis, and visualization. Learning Python or R makes an analyst proficient in working with large datasets and complex equations. Python and R are popular choices for data analysis.

Data Visualisation − A competent data analyst must present their findings clearly and compellingly. Knowing how to show data in charts and graphs helps coworkers, employers, and stakeholders understand your work. Some popular data visualization tools are Tableau, Jupyter Notebook, and Excel.

Data Storytelling − Data analysts find and communicate insights effectively through storytelling, using data visualization and narrative techniques.

Statistics and Maths − Statistical methods and tools are used to analyse data distributions, correlations, and trends. Knowledge of statistics and maths helps determine which tools are best for solving a particular problem, identify errors in data, and better understand the results.

Big Data Tools − Data analysts are familiar with big data processing tools and frameworks like Hadoop, Spark, or Apache Kafka.

Data Warehousing − Data analysts also understand data warehousing concepts and work with tools such as Amazon Redshift, Google BigQuery, or Snowflake.

Data Governance and Compliance − Data analysts are aware of data governance principles, data privacy laws, and regulations (like GDPR and HIPAA).

APIs and Web Scraping − Data analysts have expertise in pulling data from web APIs and scraping data from websites using libraries like requests (Python) or BeautifulSoup.

Behavioural Skills

Problem-solving − A data analyst can understand the problem that needs to be solved and break it down into smaller questions that can be answered with data.