Tableau – Line Chart
In a line chart, a measure and a dimension are taken along the two axes of the chart area. The pair of values for each observation becomes a point, and joining all these points creates a line that shows the variation or relationship between the chosen dimension and measure.
Simple Line Chart
Choose one dimension and one measure to create a simple line chart. Drag the dimension Ship Mode to the Columns shelf and the measure Sales to the Rows shelf. Choose Line as the mark type in the Marks card. You get a line chart that shows the variation of Sales for the different ship modes.
Multiple Measure Line Chart
You can use one dimension with two or more measures in a line chart. This produces multiple line charts, one in each pane. Each pane shows the variation of one of the measures across the dimension.
Line Chart with Label
Each of the points making up the line chart can be labeled to make the values of the measure visible. In this case, drop another measure, Profit Ratio, onto Label in the Marks card. Choose Average as the aggregation to get a chart showing the labels.
Tableau – Formatting
Tableau offers a wide variety of formatting options to change the appearance of the visualizations you create. You can modify nearly every aspect, such as font, color, size, and layout. You can format both the content and the containers, such as tables, axis labels, and the workbook theme. The Format menu lists these options. This chapter touches upon some of the frequently used formatting options.
Formatting the Axes
Create a simple bar chart by dragging the dimension Sub-Category to the Columns shelf and the measure Profit to the Rows shelf. Click the vertical axis to highlight it, then right-click and choose Format.
Change the Font
Click the font drop-down in the Format pane, which appears on the left. Choose the font type as Arial and the size as 8 pt.
Change the Shade and Alignment
You can also change the orientation of the values on the axes as well as the shading color.
Format Borders
Consider a crosstab chart with Sub-Category in the Columns shelf and State in the Rows shelf. You can change the borders of this crosstab table using the formatting options. Right-click the crosstab chart and choose Format. The Format Borders options appear in the left pane; choose the border style, width, and color you need.
Statistics – Data Collection – Observation
Observation is a popular method of data collection in the behavioral sciences. The power of observation has been summed up by W.L. Prosser as follows: "there is still no man that would not accept dog tracks in the mud against the sworn testimony of a hundred eye witnesses that no dog had passed by."
Observation refers to the monitoring and recording of behavioral and non-behavioral activities and conditions in a systematic manner to obtain information about the phenomena of interest. Behavioral observation includes:
Non-verbal analysis, such as body movement and eye movement.
Linguistic analysis, which includes observing sounds such as "ohs" and "ahs".
Extra-linguistic analysis, which observes the pitch, timbre, rate of speaking, and so on.
Spatial analysis, which studies how people relate to each other.
Non-behavioral observation is an analysis of records (e.g. newspaper archives), of physical conditions (such as checking the quality of grains in gunny bags), and of processes (observing how a process is carried out).
Observation can be classified into various categories.
Types of Observation
Structured vs. Unstructured Observation – In structured observation the problem has been clearly defined, hence the behavior to be observed and the method by which it will be measured are specified beforehand in detail. This reduces the chance of the observer introducing bias into the research, e.g. a study of plant safety compliance can be observed in a structured manner. Unstructured observation is used in situations where the problem has not been clearly defined, so what is to be observed cannot be pre-specified. The researcher therefore monitors all relevant phenomena and is allowed a great deal of flexibility in what to note and record, e.g. studying students' behavior in a class requires monitoring their total behavior in the class environment. Data collected through unstructured observation should be analyzed carefully so that no bias is introduced.
Disguised vs. Undisguised Observation – This classification is based on whether the subjects know that they are being observed. In disguised observation, the subjects are unaware that they are being observed. Their behavior is observed using hidden cameras, one-way mirrors, or other devices. Since the subjects are unaware that they are being observed, they behave in a natural way. The drawback is that it may take long hours of observation before the subjects display the phenomenon of interest. Disguised observation may be direct, where the behavior is observed by the researcher personally, or indirect, where the effect or the result of the behavior is observed. In undisguised observation, the subjects are aware that they are being observed. Here there is a fear that the subjects might show atypical behavior. The entry of the observer may upset the subjects, but how long this disruption lasts cannot be said conclusively. Studies have shown that such disruptions are short-lived and the subjects soon resume normal behavior.
Participant vs. Non-Participant Observation – If the observer participates in the situation while observing, it is termed participant observation, e.g. a researcher studying the lifestyle of slum dwellers through participant observation will himself stay in the slums. His role as an observer may be concealed or revealed. By becoming a part of the setting he is able to observe in an insightful manner. A problem with this method is that the observer may become sympathetic to the subjects and then have trouble viewing his research objectively. In non-participant observation, the observer remains outside the setting and does not involve himself or participate in the situation.
Natural vs. Contrived Observation – In natural observation the behavior is observed as it takes place in the actual setting, e.g. consumer preferences observed directly at a Pizza Hut where consumers are ordering pizza. The advantage of this method is that true results are obtained, but it is an expensive and time-consuming method. In contrived observation, the phenomenon is observed in an artificial or simulated setting, e.g. the consumers, instead of being observed in a restaurant, are made to order in a setting that looks like a restaurant but is not an actual one. This type of observation has the advantage of being over in a short time, and recording of behavior is easily done. However, since the consumers are conscious of their setting, they may not show their actual behavior.
Classification on the Basis of Mode of Administration – This includes:
Personal Observation – A human observer monitors and records the behavior as it occurs. The recording is done on an observation schedule. Personal observation not only records what has been specified but also identifies and records unexpected behaviors that defy pre-established response categories.
Mechanical Observation – Mechanical devices, instead of humans, record the behavior. The devices record the behavior as it occurs, and the data is sorted and analyzed later. Apart from cameras, other devices include the galvanometer, which measures the emotional arousal induced by exposure to a specific stimulus; the audiometer and the people meter, which record which TV channel is being viewed, with the latter also recording who is viewing it; and the oculometer, which records eye movement.
Audit – This is the process of obtaining information by physical examination of data. The audit, which is a count of physical objects, is generally done by the researcher himself. An audit can be a store audit or a pantry audit. Store audits are performed by distributors or manufacturers in order to analyse market share, purchase patterns, etc.; e.g. the researcher may check the store records or do an analysis of the inventory on hand to record the data. A pantry audit involves the researcher developing an inventory of the brands, quantities, and package sizes of products in a consumer's home, generally in the course of a personal interview. Such an audit is used to supplement or test the truthfulness of information provided in a direct questionnaire.
Statistics – Geometric Probability Distribution
The geometric distribution is a special case of the negative binomial distribution. It deals with the number of trials required for a single success. Thus, the geometric distribution is a negative binomial distribution where the number of successes (r) is equal to 1.
Formula
${P(X=x) = p \times q^{x-1}}$
Where −
${p}$ = probability of success on a single trial.
${q}$ = probability of failure on a single trial, i.e. ${1-p}$.
${x}$ = the trial on which the first success occurs (so the number of failures before the success is ${x-1}$).
${P(X=x)}$ = probability that the first success occurs on trial ${x}$.
Example
Problem Statement: In an amusement fair, a competitor is entitled to a prize if he throws a ring onto a peg from a certain distance. It is observed that only 30% of the competitors are able to do this. If someone is given 5 chances, what is the probability of his winning the prize when he has already missed 4 chances?
Solution: If someone has already missed four chances and has to win on the fifth chance, this is a probability experiment of getting the first success on the 5th trial. The problem statement also suggests that the probability distribution is geometric. The probability is given by the geometric distribution formula:
${P(X=x) = p \times q^{x-1}}$
Where −
${p = 30\% = 0.3}$
${x = 5}$ = the trial on which the first success occurs.
Therefore, the required probability:
${P(X=5) = 0.3 \times (1-0.3)^{5-1} = 0.3 \times (0.7)^4 \approx 0.072 \approx 7.2\%}$
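As a quick cross-check of the arithmetic above, here is a minimal Python sketch. It uses SciPy's geom distribution, whose pmf takes the trial on which the first success occurs, matching the formula; the library choice is only an illustration and not part of the original tutorial.

from scipy.stats import geom

p = 0.3   # probability of success on a single throw
x = 5     # trial on which the first success occurs

manual = p * (1 - p) ** (x - 1)   # direct formula: p * q^(x-1)
library = geom.pmf(x, p)          # SciPy's geometric pmf (support starts at 1)

print(round(manual, 4), round(library, 4))   # both print 0.072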
Statistics – Factorial
Factorial is a function applied to natural numbers greater than zero. The symbol for the factorial function is an exclamation mark after a number, like this: 2!
Formula
${n! = 1 \times 2 \times 3 \times \dots \times n}$
Where −
${n!}$ = the factorial of ${n}$.
${n}$ = a natural number.
Example
Problem Statement: Calculate the factorial of 5, i.e. 5!.
Solution: Multiply all the whole numbers from 1 up to the number considered.
${5! = 5 \times 4 \times 3 \times 2 \times 1 = 120}$
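The same computation can be checked in Python; math.factorial is part of the standard library, and the loop is only there to mirror the definition above.

import math

result = 1
for k in range(1, 6):   # multiply 1 x 2 x 3 x 4 x 5
    result *= k

print(result)                # 120
print(math.factorial(5))     # 120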
Statistics – Notations
The following list shows the usage of various symbols used in statistics.
Capitalization
Generally, lower-case letters represent sample attributes and capital letters represent population attributes.
$ P $ – population proportion.
$ p $ – sample proportion.
$ X $ – set of population elements.
$ x $ – set of sample elements.
$ N $ – population size.
$ n $ – sample size.
Greek vs. Roman Letters
Roman letters represent sample attributes and Greek letters represent population attributes.
$ \mu $ – population mean.
$ \bar{x} $ – sample mean.
$ \sigma $ – standard deviation of a population.
$ s $ – standard deviation of a sample.
Population-specific Parameters
The following symbols represent population-specific attributes.
$ \mu $ – population mean.
$ \sigma $ – standard deviation of a population.
$ \sigma^2 $ – variance of a population.
$ P $ – proportion of population elements having a particular attribute.
$ Q $ – proportion of population elements not having that attribute.
$ \rho $ – population correlation coefficient based on all of the elements from a population.
$ N $ – number of elements in a population.
Sample-specific Parameters
The following symbols represent sample-specific attributes.
$ \bar{x} $ – sample mean.
$ s $ – standard deviation of a sample.
$ s^2 $ – variance of a sample.
$ p $ – proportion of sample elements having a particular attribute.
$ q $ – proportion of sample elements not having that attribute.
$ r $ – sample correlation coefficient based on all of the elements from a sample.
$ n $ – number of elements in a sample.
Linear Regression
$ B_0 $ – intercept constant in a population regression line.
$ B_1 $ – regression coefficient in a population regression line.
$ R^2 $ – coefficient of determination.
$ b_0 $ – intercept constant in a sample regression line.
$ b_1 $ – regression coefficient in a sample regression line.
$ s_{b_1} $ – standard error of the slope of a regression line.
Probability
$ P(A) $ – probability that event A will occur.
$ P(A|B) $ – conditional probability that event A occurs, given that event B has occurred.
$ P(A') $ – probability of the complement of event A.
$ P(A \cap B) $ – probability of the intersection of events A and B.
$ P(A \cup B) $ – probability of the union of events A and B.
$ E(X) $ – expected value of random variable X.
$ b(x; n, P) $ – binomial probability.
$ b^*(x; n, P) $ – negative binomial probability.
$ g(x; P) $ – geometric probability.
$ h(x; N, n, k) $ – hypergeometric probability.
Permutation/Combination
$ n! $ – factorial value of n.
$ ^{n}P_r $ – number of permutations of n things taken r at a time.
$ ^{n}C_r $ – number of combinations of n things taken r at a time.
Set
$ A \cap B $ – intersection of sets A and B.
$ A \cup B $ – union of sets A and B.
$ \{ A, B, C \} $ – set of elements consisting of A, B, and C.
$ \emptyset $ – null or empty set.
Hypothesis Testing
$ H_0 $ – null hypothesis.
$ H_1 $ – alternative hypothesis.
$ \alpha $ – significance level.
$ \beta $ – probability of committing a Type II error.
Random Variables
$ Z $ or $ z $ – standardized score, also known as a z-score.
$ z_{\alpha} $ – standardized score that has a cumulative probability equal to $ 1 - \alpha $.
$ t_{\alpha} $ – t statistic that has a cumulative probability equal to $ 1 - \alpha $.
$ f_{\alpha} $ – f statistic that has a cumulative probability equal to $ 1 - \alpha $.
$ f_{\alpha}(v_1, v_2) $ – f statistic that has a cumulative probability equal to $ 1 - \alpha $, with $ v_1 $ and $ v_2 $ degrees of freedom.
$ \chi^2 $ – chi-square statistic.
Summation Symbols
$ \sum $ – summation symbol, used to compute sums over a range of values.
$ \sum x $ or $ \sum x_i $ – sum of a set of n observations. Thus, $ \sum x = x_1 + x_2 + \dots + x_n $.
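A few of these quantities can be evaluated directly in Python. The sketch below uses only the standard library and a small made-up sample; the data values are illustrative and not from the tutorial.

import math
from statistics import mean, stdev

sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean(sample))       # sample mean x-bar = 5
print(stdev(sample))      # sample standard deviation s (about 2.14)
print(math.perm(5, 2))    # 5P2 = 20 permutations
print(math.comb(5, 2))    # 5C2 = 10 combinations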
Hive – Introduction
The term 'Big Data' is used for collections of large datasets that include huge volume, high velocity, and a variety of data that is increasing day by day. Using traditional data management systems, it is difficult to process Big Data. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules: MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce: A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
HDFS: The Hadoop Distributed File System is a part of the Hadoop framework, used to store and process the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help the Hadoop modules.
Sqoop: Used to import and export data between HDFS and RDBMS.
Pig: A procedural language platform used to develop scripts for MapReduce operations.
Hive: A platform used to develop SQL-type scripts to do MapReduce operations.
Note: There are various ways to execute MapReduce operations:
The traditional approach using a Java MapReduce program for structured, semi-structured, and unstructured data.
The scripting approach for MapReduce to process structured and semi-structured data using Pig.
The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data using Hive.
What is Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores schema in a database and processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
The architecture of Hive contains the following units:
User Interface – Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Meta Store – Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine – HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a HiveQL query for the MapReduce job and process it.
Execution Engine – The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
HDFS or HBase – The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.
Working of Hive
The following steps define how Hive interacts with the Hadoop framework:
1. Execute Query – The Hive interface, such as the command line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan – The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirements of the query.
3. Get Metadata – The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata – The Metastore sends the metadata as a response to the compiler.
5. Send Plan – The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan – The driver sends the execute plan to the execution engine.
7. Execute Job – Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is on the Name node, and it assigns this job to the TaskTracker, which is on the Data node. Here, the query executes the MapReduce job.
7.1 Metadata Ops – Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result – The execution engine receives the results from the Data nodes.
9. Send Results – The execution engine sends those result values to the driver.
10. Send Results – The driver sends the results to the Hive interfaces.
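As a client-side illustration of this workflow, the sketch below submits a HiveQL query in Python through the third-party PyHive client. The client itself, the HiveServer2 endpoint at localhost:10000, and the employee table are assumptions for the example and are not part of this tutorial; steps 1 to 10 above happen inside Hive once the query is sent.

from pyhive import hive   # assumed third-party client for HiveServer2

# Connection details are placeholders; adjust them to your cluster.
conn = hive.Connection(host='localhost', port=10000, database='default')
cursor = conn.cursor()

# The statement below is parsed, compiled, planned, and run by Hive
# (typically as a MapReduce job); the results are fetched back to the driver.
cursor.execute('SELECT deptname, COUNT(*) FROM employee GROUP BY deptname')
for row in cursor.fetchall():
    print(row)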
Tableau – String Calculations
In this chapter, you will learn about calculations in Tableau involving strings. Tableau has many inbuilt string functions, which can be used for string manipulations such as comparing, concatenating, or replacing some of the characters in a string. The following are the steps to create a calculated field and use string functions in it.
Create Calculated Field
While connected to the Sample – Superstore data source, go to the Analysis menu and click 'Create Calculated Field'.
Calculation Editor
The above step opens a calculation editor, which lists all the functions that are available in Tableau. You can change the drop-down value to see only the functions related to strings.
Create a Formula
Consider that you want to find out the sales in the cities whose names contain the letter "o". For this, create a formula that tests each city name for that letter (an example formula is given at the end of this chapter).
Using the Calculated Field
To see the created field in action, drag it to the Rows shelf and drag the Sales field to the Columns shelf. The resulting view shows the Sales values.
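The original screenshot with the formula is not reproduced here, but one calculation that matches the description uses Tableau's built-in CONTAINS string function; the [City] field name is taken from the Sample – Superstore source, so adjust it to your own data:

CONTAINS([City], "o")

Dragging this Boolean calculated field to the Rows shelf splits the Sales values into cities whose names contain the letter "o" (True) and those that do not (False); it can also be placed on the Filters shelf to keep only the True rows.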
Sqoop – List Tables
This chapter describes how to list the tables of a particular database in a MySQL database server using Sqoop. The Sqoop list-tables tool parses and executes the 'SHOW TABLES' query against a particular database and then lists the tables present in that database.
Syntax
The following syntax is used for the Sqoop list-tables command.
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
Sample Query
The following command is used to list all the tables in the userdb database of the MySQL database server.
$ sqoop list-tables --connect jdbc:mysql://localhost/userdb --username root
If the command executes successfully, it displays the list of tables in the userdb database as follows.
...
13/05/31 16:45:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
emp
emp_add
emp_contact
Tableau – Basic Filters
Filtering is the process of removing certain values or ranges of values from a result set. Tableau's filtering feature allows both simple scenarios using field values and advanced calculation- or context-based filters. In this chapter, you will learn about the basic filters available in Tableau. There are three types of basic filters, as follows −
Filter Dimensions are the filters applied on the dimension fields.
Filter Measures are the filters applied on the measure fields.
Filter Dates are the filters applied on the date fields.
Filter Dimensions
These filters are applied on the dimension fields. Typical examples include filtering based on categories of text or numeric values, using logical expressions such as greater-than or less-than conditions.
Example
We use the Sample – Superstore data source to apply a dimension filter on the sub-category of products. We create a view showing the profit for each sub-category of products according to their shipping mode. For this, drag the dimension field Sub-Category to the Rows shelf and the measure field Profit to the Columns shelf. Next, drag the Sub-Category dimension to the Filters shelf to open the Filter dialog box. Click the None button at the bottom of the list to deselect all segments. Then, select the Exclude option in the lower right corner of the dialog box. Finally, select Labels and Storage and click OK. The resulting view shows the data with these two categories excluded.
Filter Measures
These filters are applied on the measure fields. Filtering is based on calculations applied to the measure fields. Hence, while in dimension filters you use only values to filter, in measure filters you use calculations based on fields.
Example
You can use the Sample – Superstore data source to apply a measure filter on the average value of profit. First, create a view with Ship Mode and Sub-Category as dimensions and the average of Profit as the measure. Next, drag the AVG(Profit) pill to the Filters shelf. Choose Average as the filter mode. Then choose "At least" and give a value to filter the rows that meet this criterion. After completing the above steps, the final view shows only the sub-categories whose average profit is at least the chosen value (here, 20).
Filter Dates
Tableau treats the date field in three different ways when applying a date filter. It can apply a filter by taking a relative date as compared to today, an absolute date, or a range of dates. Each of these options is presented when a date field is dragged to the Filters shelf.
Example
We choose the Sample – Superstore data source and create a view with Order Date in the Columns shelf and Profit in the Rows shelf. Next, drag the Order Date field to the Filters shelf and choose Range of Dates in the filter dialog box. Choose the start and end dates. On clicking OK, the final view shows the result for the chosen range of dates.