Tableau – Bar Chart

A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the variable. Tableau automatically produces a bar chart when you drag a dimension to the Rows shelf and a measure to the Columns shelf. You can also use the bar chart option in the Show Me button. If the data is not appropriate for a bar chart, this option is automatically greyed out.

In Tableau, various types of bar charts can be created by using a dimension and a measure.

Simple Bar Chart

From the Sample-Superstore data source, drag the measure Profit to the Columns shelf and the dimension Sub-Category to the Rows shelf. Tableau automatically produces a horizontal bar chart, as shown in the following screenshot. If it does not, you can choose the chart type from the Show Me tool to get the following result.

Bar Chart with Color Range

You can apply colors to the bars based on their ranges. The longer bars get darker shades and the shorter bars get lighter shades. To do this, drag the Profit field to Color under the Marks pane. Note that this also produces a different color for negative bars.

Stacked Bar Chart

You can add another dimension to the above bar chart to produce a stacked bar chart, which shows different colors in each bar. Drag the dimension field named Segment to the Marks pane and drop it on Color. The following chart appears, showing the distribution of each segment in each bar.
Sample planning
Statistics – Sample Planning

Sample planning refers to a detailed outline of the measurements to be taken:

At what time – Decide when the survey is to be conducted. For example, collecting people's views on newspaper outreach before the launch of a new newspaper in the area.
On which material – Decide the medium on which the survey is to be conducted. It could be an online poll or a paper-based checklist.
In what manner – Decide the sampling methods that will be used to choose the people to be surveyed.
By whom – Decide who will collect the observations.

Sampling plans should be prepared in such a way that the result correctly represents the population of interest and allows all questions to be answered.

Steps

Following are the steps involved in sample planning.

Identification of parameters – Identify the attributes or parameters to be measured. Identify the ranges, possible values and required resolution.
Choose sampling method – Choose a sampling method, with details such as how and when samples are to be identified.
Select sample size – Select an appropriate sample size to represent the population correctly. A sample that is larger than needed adds cost and does not by itself guard against invalid conclusions.
Select storage formats – Choose the data storage format in which the sampled data is to be kept.
Assign roles – Assign roles and responsibilities to each person involved in the collecting, processing and statistical testing steps.
Verify and execute – The sampling plan should be verifiable. Once verified, pass it to the related parties to execute it.
Tableau – Pie Chart
A pie chart represents data as slices of a circle with different sizes and colors. The slices are labeled, and the numbers corresponding to each slice are also shown in the chart. You can select the pie chart option from the Marks card to create a pie chart.

Simple Pie Chart

Choose one dimension and one measure to create a simple pie chart. For example, take the dimension named Region with the measure named Profit. Drop the Region dimension on the Color and Label marks. Drop the Profit measure on the Size mark. Choose the chart type as Pie. The following chart appears, showing the 4 regions in different colors.

Drill-Down Pie Chart

You can choose a dimension with a hierarchy; as you go deeper into the hierarchy, the chart changes to reflect the level of the dimension chosen. In the following example, we take the dimension Sub-Category, which has two more levels: Manufacturer and Product Name. Take the measure Profit and drop it on the Label mark. The following pie chart appears, showing the values for each slice.

Going one more level into the hierarchy, we get the manufacturer as the label and the above pie chart changes to the following one.
Statistical Significance
Statistics – Statistical Significance

Statistical significance signifies that the result of a statistical experiment or test is unlikely to have occurred by chance alone and can be attributed to a specific cause. The statistical significance of a result can be strong or weak, and it is very important for sectors that depend heavily on research, such as insurance, pharma, finance and physics. Statistical significance helps in choosing the sample data so that one can judge the outcome of a test to be realistic and not produced by random variation.

Statisticians generally express the degree of statistical significance through a significance level. A significance level of 5% is generally considered acceptable. Sample size is also important: the sample should be representative of the population rather than merely large, because a large but unrepresentative sample still leads to errors.

Significance Level

The level at which an event is considered statistically significant is termed the significance level. Statisticians use a quantity called the p-value to assess statistical significance. If the p-value of an event falls below the chosen level, the event is considered statistically significant.

The p-value is a function of the means and standard deviations of the data samples. It is the probability of obtaining a result at least as extreme as the one observed if only chance or sampling error were at work. In other words, a small p-value means that chance alone is an unlikely explanation for the result.

The complement of the significance level is the confidence level, which is 1 – significance level. If the significance level used for a result is 5%, the confidence level of the result is 95%.
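The comparison of a p-value against a significance level can be sketched in Python. This is a minimal illustration, assuming a one-sample z-test with a known population standard deviation; the function name `two_sided_p_value` and all the numbers are hypothetical and do not come from the text above.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, population_mean, population_sd, n):
    """Two-sided p-value of a one-sample z-test (population sd assumed known)."""
    z = (sample_mean - population_mean) / (population_sd / sqrt(n))
    # Probability of a z value at least this far from 0 in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # significance level of 5%

# Hypothetical data: 36 observations with mean 52,
# tested against a population mean of 50 with standard deviation 6.
p = two_sided_p_value(52.0, 50.0, 6.0, 36)
significant = p < ALPHA
print(f"p-value = {p:.4f}, significant at 5%: {significant}")
```

Here the z value is 2, so the p-value falls just below 5% and the result would be called statistically significant at that level.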
Residual analysis
Statistics – Residual analysis

Residual analysis is used to assess the appropriateness of a linear regression model by defining residuals and examining residual plots.

Residual

The residual ($ e $) is the difference between the observed value ($ y $) and the predicted value ($ \hat y $). Every data point has one residual.

${ \text{residual} = \text{observed value} - \text{predicted value} }$
${ e = y - \hat y }$

Residual Plot

A residual plot is a graph in which the residuals are on the vertical axis and the independent variable is on the horizontal axis. If the dots are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, choose a non-linear model.

Types of Residual Plot

The following example shows a few patterns in residual plots.

In the first case, the dots are randomly dispersed, so a linear regression model is preferred. In the second and third cases, the dots are non-randomly dispersed, which suggests that a non-linear regression method is preferred.

Example

Problem Statement:

Check whether a linear regression model is appropriate for the following data.

$ x $                           60      70      80      85      95
$ y $ (Actual Value)            70      65      70      95      85
$ \hat y $ (Predicted Value)    65.411  71.849  78.288  81.507  87.945

Solution:

Step 1: Compute the residual for each data point.

$ x $                           60      70      80      85      95
$ y $ (Actual Value)            70      65      70      95      85
$ \hat y $ (Predicted Value)    65.411  71.849  78.288  81.507  87.945
$ e $ (Residual)                4.589   -6.849  -8.288  13.493  -2.945

Step 2: Draw the residual plot.

Step 3: Check the randomness of the residuals.

Here the residual plot exhibits a random pattern: the first residual is positive, the following two are negative, the fourth is positive, and the last is negative. As the pattern is quite random, a linear regression model is appropriate for the above data.
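Step 1 of the example can be reproduced with a short Python sketch. It assumes the predicted values in the table come from an ordinary least-squares line fitted to the five points; the text does not say so, but a line fitted this way agrees with the table to three decimals. The variable names are illustrative.

```python
# Data from the example above
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least-squares slope and intercept (assumed fitting method)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

# Predicted value and residual e = y - y_hat for every data point
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ei in zip(x, y, predicted, residuals):
    print(f"x={xi}  y={yi}  predicted={pi:.3f}  residual={ei:.3f}")
```

Plotting `residuals` against `x` would give the residual plot of Step 2; the signs of the residuals are the pattern checked in Step 3.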
Standard normal table
Statistics – Standard normal table

Standard Normal Table

Z is the standard normal random variable. The table value for Z is the value of the cumulative normal distribution at z. This is the left-tailed normal table. As the z-value increases, the normal table value also increases. For example, the value for Z = 1.96 is P(Z < 1.96) = .9750.

z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
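Individual table entries can be checked in Python. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `table_value` is an illustrative choice, not part of the table's source.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def table_value(z):
    """Left-tail probability P(Z < z), rounded to four decimals like the table."""
    return round(standard_normal.cdf(z), 4)


for z in (0.0, 1.0, 1.96):
    print(f"P(Z < {z}) = {table_value(z):.4f}")
```

The row label gives the first decimal of z and the column label the second, so z = 1.96 is read from row 1.9 and column .06.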
Probability Bayes Theorem
Statistics – Probability Bayes Theorem

One of the most significant developments in the field of probability has been Bayesian decision theory, which has proved to be of immense help in making decisions under uncertain conditions. Bayes' theorem was developed by the British mathematician Rev. Thomas Bayes. The probability given by Bayes' theorem is also known as inverse probability, posterior probability or revised probability. The theorem finds the probability of an event by taking the given sample information into account; hence the name posterior probability. Bayes' theorem is based on the formula for conditional probability.

The conditional probability of event ${A_1}$ given event ${B}$ is

${P(A_1/B) = \frac{P(A_1 \text{ and } B)}{P(B)}}$

Similarly, the probability of event ${A_2}$ given event ${B}$ is

${P(A_2/B) = \frac{P(A_2 \text{ and } B)}{P(B)}}$

Where

${P(B) = P(A_1 \text{ and } B) + P(A_2 \text{ and } B)}$
${P(B) = P(A_1) \times P(B/A_1) + P(A_2) \times P(B/A_2)}$

${P(A_1/B)}$ can be rewritten as

${P(A_1/B) = \frac{P(A_1) \times P(B/A_1)}{P(A_1) \times P(B/A_1) + P(A_2) \times P(B/A_2)}}$

Hence the general form of Bayes' theorem is

${P(A_i/B) = \frac{P(A_i) \times P(B/A_i)}{\sum_{j=1}^{n} P(A_j) \times P(B/A_j)}}$

Where ${A_1}$, ${A_2}$ … ${A_i}$ … ${A_n}$ are a set of n mutually exclusive and exhaustive events.
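The general form can be sketched in a few lines of Python. The function name `posterior` and the prior and likelihood values are hypothetical, chosen only to show the arithmetic for two mutually exclusive and exhaustive events.

```python
def posterior(priors, likelihoods):
    """Return P(A_i/B) for every i, given P(A_i) and P(B/A_i)."""
    # Numerators: P(A_i) x P(B/A_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: P(B), the sum of all numerators
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical values for two events A_1 and A_2
priors = [0.6, 0.4]          # P(A_1), P(A_2)
likelihoods = [0.02, 0.05]   # P(B/A_1), P(B/A_2)

posteriors = posterior(priors, likelihoods)
print([round(p, 3) for p in posteriors])
```

With these numbers P(B) = 0.6 × 0.02 + 0.4 × 0.05 = 0.032, so P(A_1/B) = 0.012 / 0.032 = 0.375 and P(A_2/B) = 0.020 / 0.032 = 0.625.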
Tableau – Tree Map
The tree map displays data in nested rectangles. The dimensions define the structure of the tree map, and the measures define the size or color of the individual rectangles. The rectangles are easy to read because both the size and the shade of color of each rectangle reflect the value of the measure.

A tree map is created using one or more dimensions with one or two measures.

Creating a Tree Map

Using the Sample-Superstore data source, plan to find the size of profits for each Ship Mode value. To achieve this objective, follow these steps.

Step 1 − Drag and drop the measure Profit two times onto the Marks card: once onto the Size shelf and again onto the Color shelf.

Step 2 − Drag and drop the dimension Ship Mode onto the Label shelf. Choose the chart type Tree Map from Show Me. The following chart appears.

Tree Map with Two Dimensions

You can add the dimension Region to the above tree map. Drag and drop it twice: once onto the Color shelf and again onto the Label shelf. The chart that appears shows four outer boxes for the four regions, with the boxes for the ship modes nested inside them. Each region now has a different color.
Statistics – Qualitative Data Vs Quantitative Data

Qualitative Data

Qualitative data is a set of information that cannot be measured using numbers. It generally consists of words and subjective narratives. The result of a qualitative data analysis can take the form of highlighted key words, extracted information and elaborated concepts. For example, consider a study on parents' perception of the current education system for their kids. The information collected from them might be in narrative form, and you need to deduce from it whether they are satisfied, unsatisfied, or see a need for improvement in certain areas, and so on.

Strengths

Better understanding – Qualitative data gives a better understanding of the perspectives and needs of participants.
Provides explanation – Qualitative data along with quantitative data can explain the result of the survey and can measure the correctness of the quantitative data.
Better identification of behavior patterns – Qualitative data can provide detailed information that is useful in identifying behavioral patterns.

Weaknesses

Lesser reachability – Being subjective in nature, it generally covers a small population to represent the large population.
Time consuming – Qualitative data is time consuming, as a large amount of data has to be understood.
Possibility of bias – Because the analysis is subjective, evaluator bias is quite possible.

Quantitative Data

Quantitative data is a set of numbers collected from a group of people and involves statistical analysis. For example, suppose you conduct a satisfaction survey and ask participants to rate their experience on a scale of 1 to 5. You can collect the ratings and, since they are numerical in nature, use statistical techniques to draw conclusions about participants' satisfaction.

Strengths

Specific – Quantitative data is clear and specific to the survey conducted.
High reliability – If collected properly, quantitative data is normally accurate and hence highly reliable.
Easy communication – Quantitative data is easy to communicate and elaborate using charts, graphs etc.
Existing support – Many large datasets may already be present that can be analyzed to check the relevance of the survey.

Weaknesses

Limited options – Respondents are required to choose from limited options.
High complexity – Quantitative data may need complex procedures to get a correct sample.
Requires expertise – Analysis of quantitative data requires certain expertise in statistical analysis.
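The satisfaction-survey example under Quantitative Data can be illustrated with a small Python sketch. The ratings below are made-up values on the 1 to 5 scale, used only to show the kind of summary statistics one might compute.

```python
from statistics import mean, median, mode

# Hypothetical ratings from ten participants on a scale of 1 to 5
ratings = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]

average_rating = mean(ratings)    # arithmetic mean of the ratings
middle_rating = median(ratings)   # middle value of the sorted ratings
common_rating = mode(ratings)     # most frequently chosen rating

print(f"mean={average_rating}, median={middle_rating}, mode={common_rating}")
```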
Statistics – Process Capability (Cp) & Process Performance (Pp)

Process Capability

Process capability can be defined as a measurable property of a process relative to its specification. It is expressed as a process capability index ${C_p}$. The process capability index is used to check the variability of the output generated by the process and to compare that variability with the product tolerance. ${C_p}$ is governed by the following formula:

Formula

${ C_p = \min\left[\frac{USL - \mu}{3 \times \sigma}, \frac{\mu - LSL}{3 \times \sigma}\right] }$

Where −

${USL}$ = Upper Specification Limit.
${LSL}$ = Lower Specification Limit.
${\mu}$ = estimated mean of the process.
${\sigma}$ = estimated variability of the process, standard deviation.

The higher the value of the process capability index ${C_p}$, the better the process. This form takes the distance from the mean to the nearer specification limit and is often denoted ${C_{pk}}$ elsewhere; when the process mean sits midway between the limits, it reduces to ${\frac{USL - LSL}{6 \times \sigma}}$.

Example

Consider the case of a car and its parking garage. The garage size states the specification limits and the car defines the process output. Here process capability describes the relationship between car size, garage size and how far from the middle of the garage you can park the car. If the car is a little smaller than the garage, you can easily fit your car into it. If the car is very small compared to the garage, it can fit at any distance from the center. In terms of process control, such a process with little variation allows you to park the car easily in the garage and meets the customer's requirement. Let's see the above example in terms of the process capability index ${C_p}$.

${C_p = \frac{1}{2}}$ – the garage is smaller than the car and cannot accommodate it.
${C_p = 1}$ – the garage is just large enough for the car and can accommodate your car only.
${C_p = 2}$ – the garage is two times the size of your car and can accommodate two cars at a time.
${C_p = 3}$ – the garage is three times the size of your car and can accommodate three cars at a time.

Process Performance

Process performance checks the conformance of the sample generated by the process. It is expressed as a process performance index ${P_p}$. It checks whether the process is meeting the customer requirement or not. It differs from process capability in that process performance applies to a particular batch of material. Sampling may need to be quite substantial to capture the variation in the batch. Process performance is to be used only when process control cannot be evaluated. ${P_p}$ is governed by the following formula:

Formula

${ P_p = \frac{USL - LSL}{6 \times \sigma} }$

Where −

${USL}$ = Upper Specification Limit.
${LSL}$ = Lower Specification Limit.
${\sigma}$ = estimated variability of the process, standard deviation.

The higher the value of the process performance index ${P_p}$, the better the process.
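Both indices can be computed directly from the formulas above. The following Python sketch assumes that the process mean and standard deviation are estimated from a sample with `statistics.mean` and `statistics.stdev`; the function names, measurements and specification limits are hypothetical.

```python
from statistics import mean, stdev


def capability_index(values, lsl, usl):
    """C_p as defined above: distance from the mean to the nearer limit over 3 sigma."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumed estimator)
    return min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))


def performance_index(values, lsl, usl):
    """P_p as defined above: specification width over 6 sigma."""
    return (usl - lsl) / (6 * stdev(values))


# Hypothetical measurements of a part with specification limits 9.4 and 10.6
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
cp = capability_index(measurements, lsl=9.4, usl=10.6)
pp = performance_index(measurements, lsl=9.4, usl=10.6)
print(f"Cp = {cp:.3f}, Pp = {pp:.3f}")
```

Because the sample mean here lies midway between the two limits, both formulas give the same value. Moving either limit so that the mean is off-center lowers the first index while the second depends only on the width of the specification.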