Outlier Function

An outlier in a data set is a value that lies more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile. Specifically, a value is an outlier if it is less than $Q_1 - 1.5 \times IQR$ or greater than $Q_3 + 1.5 \times IQR$.

Formula

$x \lt Q_1 - 1.5 \times IQR \quad \text{or} \quad x \gt Q_3 + 1.5 \times IQR$

Where:
$x$ = the value being tested
$Q_1$ = first quartile
$Q_3$ = third quartile
$IQR$ = interquartile range, $Q_3 - Q_1$

Example

Problem Statement: A data set records the periodic task counts of 8 students: 11, 13, 15, 3, 16, 25, 12 and 14. Find the outliers in the task counts.

Solution:

Given data set: 11 13 15 3 16 25 12 14

Arranged in ascending order: 3 11 12 13 14 15 16 25

First quartile ($Q_1$), the median of the lower half 3, 11, 12, 13:
$Q_1 = \frac{11 + 12}{2} = 11.5$

Third quartile ($Q_3$), the median of the upper half 14, 15, 16, 25:
$Q_3 = \frac{15 + 16}{2} = 15.5$

Interquartile range:
$IQR = Q_3 - Q_1 = 15.5 - 11.5 = 4$

Lower outlier limit (L):
$Q_1 - 1.5 \times IQR = 11.5 - (1.5 \times 4) = 11.5 - 6 = 5.5$

Upper outlier limit (U):
$Q_3 + 1.5 \times IQR = 15.5 + (1.5 \times 4) = 15.5 + 6 = 21.5$

Every value in the data set lies between 5.5 and 21.5 except 3 and 25: 3 is less than 5.5, and 25 is greater than 21.5. The outliers are therefore 3 and 25.
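The same rule can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `iqr_outliers` is hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is the convention the worked example uses. Other quartile conventions can give slightly different limits.

```python
from statistics import median


def iqr_outliers(values):
    """Return (lower_limit, upper_limit, outliers) using the 1.5 * IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    # Quartiles as medians of the lower and upper halves of the sorted data
    # (the middle value is left out of both halves when the count is odd).
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers


tasks = [11, 13, 15, 3, 16, 25, 12, 14]
lower, upper, outliers = iqr_outliers(tasks)
print(lower, upper, outliers)  # 5.5 21.5 [3, 25]
```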
Continuous Series Arithmetic Median

In a continuous series, data is given as ranges (classes) along with their frequencies. Following is an example of a continuous series:

Items       0-5   5-10   10-20   20-30   30-40
Frequency   2     5      1       3       12

Formula

$Median = L + \frac{\frac{n}{2} - c.f.}{f} \times i$

Where:
$L$ = lower limit of the median class, i.e. the class in which the $\frac{n}{2}^{th}$ item lies.
$c.f.$ = cumulative frequency of the class preceding the median class.
$f$ = frequency of the median class.
$i$ = class interval of the median class.

The arithmetic median is a useful measure of central tendency when the data is ordinal. Since it is a positional average, it is not affected by extreme values.

Example

Problem Statement: In a study conducted in an organization, the distribution of income across the workers is observed. Find the median wage of the workers of the organization.

6 men get less than Rs. 500
13 men get less than Rs. 1000
22 men get less than Rs. 1500
30 men get less than Rs. 2000
34 men get less than Rs. 2500
40 men get less than Rs. 3000

Solution: The figures given are cumulative frequencies. Hence we first find the simple frequencies and present the data in tabular form.

Income (Rs.)   M.P. m   Frequency f   d = (m - 1250)/500   fd    c.f.
0 - 500        250      6             -2                   -12   6
500 - 1000     750      7             -1                   -7    13
1000 - 1500    1250     9             0                    0     22
1500 - 2000    1750     8             1                    8     30
2000 - 2500    2250     4             2                    8     34
2500 - 3000    2750     6             3                    18    40
Total                   N = 40                             ∑fd = 15

To simplify the calculation, a common factor i = 500 has been taken. Since $\frac{n}{2} = 20$ and the cumulative frequency first reaches 20 in the class 1000 - 1500, that class is the median class. The median wage follows from the formula:

$Median = L + \frac{\frac{n}{2} - c.f.}{f} \times i$

Where:
$L$ = 1000
$\frac{n}{2}$ = 20
$c.f.$ = 13
$f$ = 9
$i$ = 500

Thus
$Median = 1000 + \frac{20 - 13}{9} \times 500$
$= 1000 + 388.9$
$= 1388.9$

As 1388.9 ≃ 1389, the median wage is Rs. 1389.
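The calculation above can be sketched in Python as follows. The helper names `frequencies_from_cumulative` and `grouped_median` are hypothetical, and the sketch assumes that all classes share the same width, as they do in the wage example.

```python
def frequencies_from_cumulative(cumulative_counts):
    """Turn 'less than' cumulative counts into simple class frequencies."""
    previous = [0] + list(cumulative_counts[:-1])
    return [c - p for c, p in zip(cumulative_counts, previous)]


def grouped_median(lower_limits, width, frequencies):
    """Median = L + ((n/2 - c.f.) / f) * i for equal-width classes."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cumulative = 0  # c.f. of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= n / 2:
            # This is the median class: interpolate inside it.
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no median class found")


freqs = frequencies_from_cumulative([6, 13, 22, 30, 34, 40])
print(freqs)  # [6, 7, 9, 8, 4, 6]
median_wage = grouped_median([0, 500, 1000, 1500, 2000, 2500], 500, freqs)
print(round(median_wage, 1))  # 1388.9
```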
Hypergeometric Distribution
A hypergeometric random variable is the number of successes that result from a hypergeometric experiment. The probability distribution of a hypergeometric random variable is called a hypergeometric distribution. It is given by the following probability function:

Formula

$h(x; N, n, k) = \frac{[C(k,x)][C(N-k,n-x)]}{C(N,n)}$

Where:
$N$ = items in the population.
$k$ = successes in the population.
$n$ = items in the random sample drawn from that population.
$x$ = successes in the random sample.

Example

Problem Statement: Suppose we randomly select 5 cards without replacement from an ordinary deck of playing cards. What is the probability of getting exactly 2 red cards (i.e., hearts or diamonds)?

Solution: This is a hypergeometric experiment in which we know the following:

N = 52, since there are 52 cards in a deck.
k = 26, since there are 26 red cards in a deck.
n = 5, since we randomly select 5 cards from the deck.
x = 2, since 2 of the cards we select are red.

We plug these values into the hypergeometric formula as follows:

$h(x; N, n, k) = \frac{[C(k,x)][C(N-k,n-x)]}{C(N,n)}$
$h(2; 52, 5, 26) = \frac{[C(26,2)][C(52-26,5-2)]}{C(52,5)}$
$= \frac{[325][2600]}{2598960}$
$= 0.32513$

Thus, the probability of randomly selecting exactly 2 red cards is 0.32513.
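As a quick check of the arithmetic, the formula can be evaluated with Python's standard `math.comb` (available from Python 3.8). The function name `hypergeom_pmf` is an illustrative choice, not part of any library.

```python
from math import comb


def hypergeom_pmf(x, N, n, k):
    """h(x; N, n, k): probability of x successes in n draws without replacement."""
    # Impossible outcomes have probability zero.
    if x < 0 or x > k or n - x < 0 or n - x > N - k:
        return 0.0
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


# Exactly 2 red cards in 5 cards drawn from a 52-card deck with 26 red cards.
print(round(hypergeom_pmf(2, 52, 5, 26), 5))  # 0.32513
```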
Central limit theorem
If the population from which a sample is drawn is normal, then the sampling distribution of the sample mean is also normal, and its mean equals the population mean. When the population is skewed, as in the case illustrated in the figure, the sampling distribution of the mean still tends toward the normal distribution, provided the sample is large (i.e. greater than 30).

According to the central limit theorem, for sufficiently large samples (a size greater than 30 is the usual rule of thumb), the shape of the sampling distribution becomes more and more like a normal distribution, irrespective of the shape of the parent population. The theorem explains the relationship between the population distribution and the sampling distribution: if the samples are large enough, the sampling distribution of the mean approaches the normal distribution. The importance of the central limit theorem has been summed up by Richard I. Levin in the following words:

The significance of the central limit theorem lies in the fact that it permits us to use sample statistics to make inferences about population parameters without knowing anything about the shape of the frequency distribution of that population other than what we can get from the sample.
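A small simulation can make the theorem concrete. The sketch below is only an illustration under stated assumptions: it draws from an exponential population with mean 1 and standard deviation 1 (a strongly skewed distribution chosen here for the demonstration, not taken from the text), uses samples of size 50, and the helper name `sample_means` is hypothetical. The mean of the sample means should land near the population mean, and their spread near $\sigma / \sqrt{n}$.

```python
import random
from statistics import fmean, pstdev


def sample_means(draw, sample_size, num_samples, seed=0):
    """Draw num_samples samples of size sample_size and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


# Skewed parent population: exponential with mean 1 and standard deviation 1.
means = sample_means(lambda rng: rng.expovariate(1.0), sample_size=50, num_samples=2000)

print("mean of sample means:", round(fmean(means), 3))     # close to 1
print("spread of sample means:", round(pstdev(means), 3))  # close to 1 / sqrt(50)
print("theoretical spread:", round(1 / 50 ** 0.5, 3))
```

Plotting a histogram of `means` would show the roughly bell-shaped curve, even though the parent population is far from normal.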
Deciles Statistics
Deciles are the nine values that divide a distribution of data, arranged in order, into ten groups of equal frequency.

Formula

$D_i = l + \frac{h}{f}\left(\frac{iN}{10} - c\right); \quad i = 1, 2, 3, \dots, 9$

Where:
$l$ = lower boundary of the decile group.
$h$ = width of the decile group.
$f$ = frequency of the decile group.
$N$ = total number of observations.
$c$ = cumulative frequency preceding the decile group.

Example

Problem Statement: Calculate the deciles of the distribution in the following table:

Class        fi    Fi
50 - 60      8     8
60 - 70      10    18
70 - 80      16    34
80 - 90      14    48
90 - 100     10    58
100 - 110    5     63
110 - 120    2     65
Total        65

Solution:

Calculation of First Decile
$\frac{65 \times 1}{10} = 6.5$
$D_1 = 50 + \frac{6.5 - 0}{8} \times 10 = 58.125$

Calculation of Second Decile
$\frac{65 \times 2}{10} = 13$
$D_2 = 60 + \frac{13 - 8}{10} \times 10 = 65$

Calculation of Third Decile
$\frac{65 \times 3}{10} = 19.5$
$D_3 = 70 + \frac{19.5 - 18}{16} \times 10 = 70.94$

Calculation of Fourth Decile
$\frac{65 \times 4}{10} = 26$
$D_4 = 70 + \frac{26 - 18}{16} \times 10 = 75$

Calculation of Fifth Decile
$\frac{65 \times 5}{10} = 32.5$
$D_5 = 70 + \frac{32.5 - 18}{16} \times 10 = 79.06$

Calculation of Sixth Decile
$\frac{65 \times 6}{10} = 39$
$D_6 = 80 + \frac{39 - 34}{14} \times 10 = 83.57$

Calculation of Seventh Decile
$\frac{65 \times 7}{10} = 45.5$
$D_7 = 80 + \frac{45.5 - 34}{14} \times 10 = 88.21$

Calculation of Eighth Decile
$\frac{65 \times 8}{10} = 52$
$D_8 = 90 + \frac{52 - 48}{10} \times 10 = 94$

Calculation of Ninth Decile
$\frac{65 \times 9}{10} = 58.5$
$D_9 = 100 + \frac{58.5 - 58}{5} \times 10 = 101$
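The nine calculations follow one pattern, which can be sketched as a small Python function. The name `grouped_decile` is hypothetical, and the sketch assumes equal-width classes with lower boundaries 50, 60, ..., 110, as used in the decile calculations above.

```python
def grouped_decile(i, lower_limits, width, frequencies):
    """D_i = l + (h / f) * (i * N / 10 - c) for equal-width classes."""
    if not 1 <= i <= 9:
        raise ValueError("i must be between 1 and 9")
    total = sum(frequencies)
    target = i * total / 10  # position of the i-th decile
    cumulative = 0           # c: cumulative frequency before the current class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= target:
            return lower + width / f * (target - cumulative)
        cumulative += f
    raise ValueError("no decile class found")


lower_limits = [50, 60, 70, 80, 90, 100, 110]
frequencies = [8, 10, 16, 14, 10, 5, 2]
for i in range(1, 10):
    print(f"D{i} = {grouped_decile(i, lower_limits, 10, frequencies):.2f}")
```

For the sixth decile the position 39 falls in the class starting at 80 (cumulative frequency 34 before it), so the function returns about 83.57.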
Harmonic Number
A harmonic number is the sum of the reciprocals of the first n natural numbers. In the power-system setting used on this page, it represents the condition in which the inductive reactance and the capacitive reactance of the power system become equal.

Formula

$H = \frac{W_r}{W}$, where $W_r = \sqrt{\frac{1}{LC}}$ and $W = 2 \pi f$

Where:
$f$ = harmonic resonance frequency.
$L$ = inductance of the load.
$C$ = capacitance of the load.

Example

Calculate the harmonic number of a power system with capacitance 5 F, inductance 6 H and frequency 200 Hz.

Solution: Here the capacitance C is 5 F, the inductance L is 6 H, and the frequency f is 200 Hz. Using the harmonic number formula, let's compute the number as:

$H = \frac{\sqrt{\frac{1}{LC}}}{2 \pi f}$
$\implies H = \frac{\sqrt{\frac{1}{6 \times 5}}}{2 \times 3.14 \times 200}$
$= \frac{0.18257}{1256}$
$\approx 0.0001$

Thus the harmonic number is approximately $0.0001$.
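The formula as stated on this page translates directly into a few lines of Python. This is a sketch of that formula only; the function name `harmonic_number` is hypothetical, and `math.pi` is used in place of the rounded value 3.14, which changes the result only beyond the fourth decimal place.

```python
import math


def harmonic_number(inductance, capacitance, frequency):
    """H = W_r / W with W_r = sqrt(1 / (L * C)) and W = 2 * pi * f."""
    w_r = math.sqrt(1.0 / (inductance * capacitance))
    w = 2 * math.pi * frequency
    return w_r / w


# L = 6 H, C = 5 F, f = 200 Hz, as in the example.
print(round(harmonic_number(6, 5, 200), 4))  # 0.0001
```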
Beta Distribution
The beta distribution is a continuous probability distribution parametrized by two positive shape parameters, $\alpha$ and $\beta$, which appear as exponents of the random variable x and control the shape of the distribution.

Probability density function

The probability density function of the beta distribution is given as:

Formula

$f(x) = \frac{(x-a)^{\alpha-1}(b-x)^{\beta-1}}{B(\alpha,\beta)\,(b-a)^{\alpha+\beta-1}} \hspace{.3in} a \le x \le b; \ \alpha, \beta > 0$

where $B(\alpha,\beta) = \int_{0}^{1} t^{\alpha-1}(1-t)^{\beta-1}\,dt$

Where:
$\alpha, \beta$ = shape parameters.
$a, b$ = lower and upper bounds.
$B(\alpha,\beta)$ = beta function.

Standard Beta Distribution

When the lower and upper bounds are 0 and 1, the beta distribution is called the standard beta distribution. It is given by the following formula:

Formula

$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} \hspace{.3in} 0 \le x \le 1; \ \alpha, \beta > 0$

Cumulative distribution function

The cumulative distribution function of the standard beta distribution is given as:

Formula

$F(x) = I_{x}(\alpha,\beta) = \frac{\int_{0}^{x} t^{\alpha-1}(1-t)^{\beta-1}\,dt}{B(\alpha,\beta)} \hspace{.2in} 0 \le x \le 1; \ \alpha, \beta > 0$

Where:
$\alpha, \beta$ = shape parameters.
$B(\alpha,\beta)$ = beta function.

$I_{x}(\alpha,\beta)$ is also called the incomplete beta function ratio.
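The formulas above can be evaluated numerically with the standard library alone. The sketch below rests on two assumptions that go beyond the text: the beta function is computed through the identity $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$, and the CDF integral is approximated with composite Simpson's rule, which is only appropriate here when $\alpha, \beta \ge 1$ so that the integrand stays finite at the endpoints. All function names are illustrative.

```python
import math


def beta_function(alpha, beta):
    """B(alpha, beta) via the gamma-function identity."""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)


def beta_pdf(x, alpha, beta, a=0.0, b=1.0):
    """Density of the beta distribution on [a, b]; standard when a=0, b=1."""
    if not a <= x <= b:
        return 0.0
    numerator = (x - a) ** (alpha - 1) * (b - x) ** (beta - 1)
    return numerator / (beta_function(alpha, beta) * (b - a) ** (alpha + beta - 1))


def beta_cdf(x, alpha, beta, steps=1000):
    """Standard beta CDF by Simpson's rule (assumes alpha, beta >= 1, steps even)."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, alpha, beta) + beta_pdf(x, alpha, beta)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2  # Simpson weights 4, 2, 4, ...
        total += weight * beta_pdf(k * h, alpha, beta)
    return total * h / 3


print(round(beta_pdf(0.5, 2, 3), 4))  # 1.5
print(round(beta_cdf(0.5, 2, 3), 4))  # 0.6875
```

For $\alpha = 2, \beta = 3$ the density is $12x(1-x)^2$, so both printed values can be confirmed by hand.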
Cohen's kappa coefficient
Cohen's kappa coefficient is a statistic that measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent agreement calculation, since k takes into account the agreement occurring by chance. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.

Cohen's kappa coefficient is defined by the following function:

Formula

$k = \frac{p_0 - p_e}{1 - p_e} = 1 - \frac{1 - p_0}{1 - p_e}$

Where:
$p_0$ = relative observed agreement among raters.
$p_e$ = the hypothetical probability of chance agreement.

Both are computed from the observed data; $p_e$ uses the observed proportions to estimate the probability of each rater choosing each category at random. If the raters are in complete agreement then $k$ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by $p_e$), $k$ ≤ 0.

Example

Problem Statement: Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said “Yes” or “No” to the proposal. Suppose the counts were as follows, where A and B are the readers, the main diagonal (top left to bottom right) shows the counts of agreements, and the other diagonal shows the disagreements:

              B: Yes   B: No
A: Yes        20       5
A: No         10       15

Calculate Cohen's kappa coefficient.

Solution: Note that there were 20 proposals that were granted by both reader A and reader B and 15 proposals that were rejected by both readers. Thus, the observed proportionate agreement is

$p_0 = \frac{20 + 15}{50} = 0.70$

To calculate $p_e$ (the probability of random agreement) we note that:

Reader A said “Yes” to 25 applicants and “No” to 25 applicants. Thus reader A said “Yes” 50% of the time.
Reader B said “Yes” to 30 applicants and “No” to 20 applicants. Thus reader B said “Yes” 60% of the time.

For independent events, P(A and B) = P(A) x P(B), where P is the probability of an event occurring. The probability that both readers would say “Yes” at random is 0.50 x 0.60 = 0.30, and the probability that both would say “No” is 0.50 x 0.40 = 0.20. Thus the overall probability of random agreement is $p_e$ = 0.3 + 0.2 = 0.5.

Applying the formula for Cohen's kappa we get:

$k = \frac{p_0 - p_e}{1 - p_e} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40$
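The steps of the solution can be sketched in Python for a square agreement table of any size. The function name `cohens_kappa` and the nested-list layout (rows for reader A, columns for reader B) are assumptions made for this illustration.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square table: rows = rater A, columns = rater B."""
    total = sum(sum(row) for row in table)
    # Observed agreement: share of items on the main diagonal.
    p_0 = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over categories.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_0 - p_e) / (1 - p_e)


grants = [
    [20, 5],   # A said Yes: B said Yes / No
    [10, 15],  # A said No:  B said Yes / No
]
print(round(cohens_kappa(grants), 2))  # 0.4
```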
Individual Series Arithmetic Mean

In an individual series, data is given on an individual basis, one observation at a time. Following is an example of an individual series:

Items   5   10   20   30   40   50   60   70

For an individual series, the arithmetic mean can be calculated using the following formula.

Formula

$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} X_{i}$

Alternatively, we can write the same formula as follows:

$\bar{x} = \frac{\sum x}{N}$

Where:
$X_{1}, X_{2}, X_{3}, \dots, X_{N}$ = individual observations of the variable.
$\sum x$ = sum of all observations of the variable.
$N$ = number of observations.

Example

Problem Statement: Calculate the arithmetic mean for the following individual data:

Items   14   36   45   70   105

Solution: Based on the formula above, the arithmetic mean $\bar{x}$ is:

$\bar{x} = \frac{14 + 36 + 45 + 70 + 105}{5}$
$= \frac{270}{5}$
$= 54$

The arithmetic mean of the given numbers is 54.
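In code the formula is a one-line division. The sketch below uses a hypothetical helper `arithmetic_mean`; Python's standard `statistics.mean` gives the same result for this data.

```python
from statistics import mean


def arithmetic_mean(observations):
    """Sum of the observations divided by their count N."""
    if not observations:
        raise ValueError("at least one observation is required")
    return sum(observations) / len(observations)


items = [14, 36, 45, 70, 105]
print(arithmetic_mean(items))  # 54.0
print(mean(items))             # 54
```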
Adjusted R-Squared
R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model. Adjusted R-squared adjusts the statistic based on the number of independent variables in the model. ${R^2}$ shows how well terms (data points) fit a curve or line. Adjusted ${R^2}$ also indicates how well terms fit a curve or line, but adjusts for the number of terms in a model. If you add more and more useless variables to a model, adjusted R-squared will decrease. If you add more useful variables, adjusted R-squared will increase.

${R_{adj}^2}$ will always be less than or equal to ${R^2}$. You only need adjusted ${R^2}$ when working with samples. In other words, the adjustment isn't necessary when you have data from an entire population.

Formula

$R_{adj}^2 = 1 - \left[\frac{(1 - R^2)(n - 1)}{n - k - 1}\right]$

Where:
$n$ = the number of points in your data sample.
$k$ = the number of independent regressors, i.e. the number of variables in your model, excluding the constant.

Example

Problem Statement: A fund has a sample R-squared value of 0.5, based on a sample size of 50 and 5 predictors. Find the adjusted R-squared value.

Solution:

Sample size = 50
Number of predictors = 5
Sample R-squared = 0.5

Substituting these values into the equation, with $R^2 = 0.5$:

$R_{adj}^2 = 1 - \left[\frac{(1 - 0.5)(50 - 1)}{50 - 5 - 1}\right]$
$= 1 - 0.5 \times \frac{49}{44}$
$= 1 - 0.5568$
$= 0.4432$
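The formula can be checked with a short Python function; the name `adjusted_r_squared` is hypothetical. The argument is ${R^2}$ itself, so a sample R-squared of 0.5 is passed as 0.5. Passing 0.25 instead corresponds to treating 0.5 as R rather than ${R^2}$, which yields a much smaller value.

```python
def adjusted_r_squared(r_squared, n, k):
    """R_adj^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# R^2 = 0.5, n = 50 observations, k = 5 predictors.
print(round(adjusted_r_squared(0.5, 50, 5), 4))   # 0.4432

# For comparison: R = 0.5, i.e. R^2 = 0.25.
print(round(adjusted_r_squared(0.25, 50, 5), 4))  # 0.1648
```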