Statistics – Combination with replacement

Each of the possible ways in which a set of things can be ordered or arranged is called a permutation; a combination, by contrast, disregards order. Combination with replacement in probability is selecting an object from an unordered list multiple times. The number of combinations with replacement is given by the following function −

Formula

${^nC_r = \frac{(n+r-1)!}{r!(n-1)!}}$

Where −

${n}$ = number of items which can be selected.

${r}$ = number of items which are selected.

${^nC_r}$ = number of unordered selections (combinations with replacement).

Example

Problem Statement − There are five kinds of frozen yogurt: banana, chocolate, lemon, strawberry and vanilla. You can have three scoops. How many variations are there?

Solution − Here n = 5 and r = 3. Substituting the values in the formula,

${^nC_r = \frac{(n+r-1)!}{r!(n-1)!} \\[7pt] = \frac{(5+3-1)!}{3!(5-1)!} \\[7pt] = \frac{7!}{3!\,4!} \\[7pt] = \frac{5040}{6 \times 24} \\[7pt] = 35}$
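The formula can be checked with a minimal Python sketch (the function name is illustrative, not from any library):

```python
from math import factorial

def combinations_with_replacement_count(n, r):
    """Number of ways to choose r items from n kinds, repetition allowed:
    (n + r - 1)! / (r! * (n - 1)!)"""
    return factorial(n + r - 1) // (factorial(r) * factorial(n - 1))

# Five yogurt flavours, three scoops:
print(combinations_with_replacement_count(5, 3))  # 35
```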
Central limit theorem
Statistics – Central limit theorem

If the population from which the sample has been drawn is a normal population, then the sampling distribution of the mean is itself normal and centred on the population mean. When the population is skewed, as in the case illustrated in the figure, the sampling distribution still tends to move closer to the normal distribution, provided the sample is large (i.e. greater than 30).

According to the Central Limit Theorem, for sufficiently large samples (size greater than 30), the shape of the sampling distribution becomes more and more like a normal distribution, irrespective of the shape of the parent population. The theorem thus explains the relationship between the population distribution and the sampling distribution: given a large enough set of samples, the sampling distribution of the mean approaches a normal distribution.

The importance of the central limit theorem has been summed up by Richard I. Levin in the following words:

The significance of the central limit theorem lies in the fact that it permits us to use sample statistics to make inferences about population parameters without knowing anything about the shape of the frequency distribution of that population other than what we can get from the sample.
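The theorem is easy to see empirically. The sketch below (standard library only) draws many samples of size 40 from a heavily skewed exponential population and shows that their means cluster around the population mean of 1:

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

# 2000 samples, each of size 40, from Exp(1) -- a skewed parent population
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(40))
    for _ in range(2000)
]

# The population mean of Exp(1) is 1; the sample means cluster around it,
# and a histogram of sample_means would look approximately normal.
print(round(statistics.mean(sample_means), 2))
```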
Deciles Statistics
Statistics – Deciles Statistics

A system of dividing a given ordered distribution of data or values into ten groups of equal frequency is known as deciles. For grouped data, the i-th decile is computed as:

Formula

${D_i = l + \frac{h}{f}\left(\frac{iN}{10} - c\right); \ i = 1,2,3 \dots ,9}$

Where −

${l}$ = lower boundary of the decile group.

${h}$ = width of the decile group.

${f}$ = frequency of the decile group.

${N}$ = total number of observations.

${c}$ = cumulative frequency preceding the decile group.

Example

Problem Statement:

Calculate the deciles of the distribution for the following table:

Class       fi    Fi
[50-60)      8     8
[60-70)     10    18
[70-80)     16    34
[80-90)     14    48
[90-100)    10    58
[100-110)    5    63
[110-120)    2    65
Total       65

Solution:

Calculation of First Decile

${\frac{65 \times 1}{10} = 6.5 \\[7pt] D_1 = 50 + \frac{6.5 - 0}{8} \times 10 = 58.12}$

Calculation of Second Decile

${\frac{65 \times 2}{10} = 13 \\[7pt] D_2 = 60 + \frac{13 - 8}{10} \times 10 = 65}$

Calculation of Third Decile

${\frac{65 \times 3}{10} = 19.5 \\[7pt] D_3 = 70 + \frac{19.5 - 18}{16} \times 10 = 70.94}$

Calculation of Fourth Decile

${\frac{65 \times 4}{10} = 26 \\[7pt] D_4 = 70 + \frac{26 - 18}{16} \times 10 = 75}$

Calculation of Fifth Decile

${\frac{65 \times 5}{10} = 32.5 \\[7pt] D_5 = 70 + \frac{32.5 - 18}{16} \times 10 = 79.06}$

Calculation of Sixth Decile

${\frac{65 \times 6}{10} = 39 \\[7pt] D_6 = 80 + \frac{39 - 34}{14} \times 10 = 83.57}$

Calculation of Seventh Decile

${\frac{65 \times 7}{10} = 45.5 \\[7pt] D_7 = 80 + \frac{45.5 - 34}{14} \times 10 = 88.21}$

Calculation of Eighth Decile

${\frac{65 \times 8}{10} = 52 \\[7pt] D_8 = 90 + \frac{52 - 48}{10} \times 10 = 94}$

Calculation of Ninth Decile

${\frac{65 \times 9}{10} = 58.5 \\[7pt] D_9 = 100 + \frac{58.5 - 58}{5} \times 10 = 101}$
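The same interpolation can be automated with a short Python sketch (function and variable names are illustrative):

```python
def decile(i, boundaries, freqs):
    """i-th decile of grouped data: D_i = l + (h/f) * (i*N/10 - c).
    boundaries: class lower bounds plus the final upper bound.
    freqs: frequency of each class."""
    N = sum(freqs)
    target = i * N / 10          # iN/10
    c = 0                        # cumulative frequency before current class
    for k, f in enumerate(freqs):
        if c + f >= target:      # decile falls in this class
            l = boundaries[k]
            h = boundaries[k + 1] - boundaries[k]
            return l + (h / f) * (target - c)
        c += f
    return boundaries[-1]

bounds = [50, 60, 70, 80, 90, 100, 110, 120]
freqs = [8, 10, 16, 14, 10, 5, 2]
print(round(decile(1, bounds, freqs), 2))  # 58.12
print(round(decile(9, bounds, freqs), 2))  # 101.0
```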
Harmonic Number
Statistics – Harmonic Number

In mathematics, a harmonic number is the sum of the reciprocals of the first n natural numbers. In power-system analysis, which is the sense used here, the harmonic number describes the point at which the inductive reactance and the capacitive reactance of the power system become equal, i.e. harmonic resonance.

Formula

${H = \frac{W_r}{W}}$, where ${W_r = \sqrt{\frac{1}{LC}}}$ and ${W = 2\pi f}$

Where −

${f}$ = frequency of the power system.

${L}$ = inductance of the load.

${C}$ = capacitance of the load.

Example

Calculate the harmonic number of a power system with capacitance 5 F, inductance 6 H and frequency 200 Hz.

Solution:

Here capacitance C is 5 F, inductance L is 6 H and frequency f is 200 Hz. Using the harmonic number formula, let's compute the number as:

${H = \frac{\sqrt{\frac{1}{LC}}}{2\pi f} \\[7pt] \implies H = \frac{\sqrt{\frac{1}{6 \times 5}}}{2 \times 3.14 \times 200} \\[7pt] = \frac{0.18257}{1256} \\[7pt] \approx 0.0001}$

Thus the harmonic number is approximately ${0.0001}$.
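The calculation can be reproduced with this small Python sketch (the function name is illustrative):

```python
import math

def harmonic_number(L, C, f):
    """H = W_r / W, where W_r = sqrt(1/(L*C)) is the resonance angular
    frequency and W = 2*pi*f is the system angular frequency."""
    w_r = math.sqrt(1.0 / (L * C))
    w = 2 * math.pi * f
    return w_r / w

# L = 6 H, C = 5 F, f = 200 Hz:
print(round(harmonic_number(6, 5, 200), 4))  # 0.0001
```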
Beta Distribution
Statistics – Beta Distribution

The beta distribution represents a continuous probability distribution parametrized by two positive shape parameters, $\alpha$ and $\beta$, which appear as exponents of the random variable x and control the shape of the distribution.

Probability density function

The probability density function of the Beta distribution is given as:

Formula

${f(x) = \frac{(x-a)^{\alpha-1}(b-x)^{\beta-1}}{B(\alpha,\beta)(b-a)^{\alpha+\beta-1}} \hspace{.3in} a \le x \le b; \ \alpha, \beta > 0 \\[7pt] \text{where } B(\alpha,\beta) = \int_{0}^{1} t^{\alpha-1}(1-t)^{\beta-1}\,dt}$

Where −

${\alpha, \beta}$ = shape parameters.

${a, b}$ = lower and upper bounds.

${B(\alpha,\beta)}$ = Beta function.

Standard Beta Distribution

When the lower and upper bounds are 0 and 1, the beta distribution is called the standard beta distribution. It is given by the following formula:

Formula

${f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} \hspace{.3in} 0 \le x \le 1; \ \alpha, \beta > 0}$

Cumulative distribution function

The cumulative distribution function of the standard Beta distribution is given as:

Formula

${F(x) = I_{x}(\alpha,\beta) = \frac{\int_{0}^{x} t^{\alpha-1}(1-t)^{\beta-1}\,dt}{B(\alpha,\beta)} \hspace{.2in} 0 \le x \le 1; \ \alpha, \beta > 0}$

Where −

${\alpha, \beta}$ = shape parameters.

${a, b}$ = lower and upper bounds.

${B(\alpha,\beta)}$ = Beta function. ${I_x(\alpha,\beta)}$ is also called the incomplete beta function ratio.
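A minimal sketch of the standard beta density, using only the standard library (the Beta function is computed via the identity B(α, β) = Γ(α)Γ(β)/Γ(α+β)):

```python
from math import gamma

def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(x, a, b):
    """Standard beta density on [0, 1] with shape parameters a, b > 0."""
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_function(a, b)

# Beta(2, 2) is symmetric with its mode at x = 0.5, where the density is 1.5:
print(round(beta_pdf(0.5, 2, 2), 2))  # 1.5
```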
Cohen's kappa coefficient
Statistics – Cohen's kappa coefficient

Cohen's kappa coefficient is a statistic which measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent-agreement calculation, since ${k}$ takes into account the agreement occurring by chance. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. Cohen's kappa coefficient is defined and given by the following function −

Formula

${k = \frac{p_0 - p_e}{1 - p_e} = 1 - \frac{1 - p_0}{1 - p_e}}$

Where −

${p_0}$ = relative observed agreement among raters.

${p_e}$ = the hypothetical probability of chance agreement.

${p_0}$ and ${p_e}$ are computed from the observed data, using each rater's marginal frequencies as the probability of randomly assigning each category. If the raters are in complete agreement then ${k}$ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by ${p_e}$), ${k}$ ≤ 0.

Example

Problem Statement −

Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said "Yes" or "No" to the proposal. Suppose the count data were as follows, where A and B are readers, the diagonal slanting left shows the count of agreements and the diagonal slanting right, disagreements −

            B: Yes   B: No
A: Yes        20       5
A: No         10      15

Calculate Cohen's kappa coefficient.

Solution −

Note that there were 20 proposals that were granted by both reader A and reader B and 15 proposals that were rejected by both readers. Thus, the observed proportionate agreement is

${p_0 = \frac{20+15}{50} = 0.70}$

To calculate ${p_e}$ (the probability of random agreement) we note that −

Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.

Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.

Using the formula P(A and B) = P(A) × P(B) for independent events, the probability that both of them would say "Yes" randomly is 0.50 × 0.60 = 0.30 and the probability that both of them would say "No" is 0.50 × 0.40 = 0.20. Thus the overall probability of random agreement is ${p_e = 0.3 + 0.2 = 0.5}$.

So now applying our formula for Cohen's kappa we get:

${k = \frac{p_0 - p_e}{1 - p_e} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40}$
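The worked example can be checked with a short Python sketch that computes kappa directly from a confusion matrix (the function name is illustrative):

```python
def cohens_kappa(table):
    """Cohen's kappa from a C x C contingency table, where table[i][j]
    is the count of items rater A put in category i and rater B in j."""
    n = sum(sum(row) for row in table)
    # observed agreement: proportion of counts on the main diagonal
    p0 = sum(table[i][i] for i in range(len(table))) / n
    # chance agreement: product of the two raters' marginal proportions,
    # summed over categories
    pe = sum(
        (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
        for i in range(len(table))
    )
    return (p0 - pe) / (1 - pe)

# Grant-proposal example: rows are reader A (Yes/No), columns are reader B.
print(round(cohens_kappa([[20, 5], [10, 15]]), 2))  # 0.4
```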
Splunk – Sort Command
Splunk – Sort Command

The sort command sorts all the results by the specified fields. Missing fields are treated as having the smallest or largest possible value of that field if the order is descending or ascending, respectively.

If the first argument to the sort command is a number, then at most that many results are returned, in order. If no number is specified, the default limit of 10000 is used. If the number 0 is specified, all of the results are returned.

Sorting By Field Types

We can assign a specific data type to the fields being sorted. The existing data type in the Splunk dataset may be different from the data type we enforce in the search query. In the below example, we sort the status field as numeric in ascending order, while the field named url is sorted as a string, with the negative sign indicating descending order.

Sorting up to a Limit

We can also specify the number of results to be sorted instead of sorting the entire search result. The below search result shows the sorting of only 50 events with status ascending and url descending.

Using Reverse

We can toggle the order of an entire search result by using the reverse command. It is useful for reversing the sort result as and when needed, without altering the existing query.
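The three variants described above can be sketched in SPL as follows (the status and url field names come from the text; the enclosing search is omitted):

```spl
... | sort num(status), -str(url)
... | sort 50 num(status), -str(url)
... | sort num(status), -str(url) | reverse
```

Here num() and str() force the comparison type, the leading minus sign sorts that field in descending order, and the count 50 limits the sorted output.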
Splunk – Basic Chart
Splunk – Basic Chart

Splunk has great visualization features which show a variety of charts. These charts are created from the results of a search query where appropriate functions are used to give numerical outputs.

For example, if we look for the average file size in bytes from the data set named web_applications, we can see the result in the Statistics tab as shown below −

Creating Charts

In order to create a basic chart, we first ensure that the data is visible in the Statistics tab as shown above. Then we click on the Visualization tab to get the corresponding chart. The above data produces a pie chart by default, as shown below.

Changing the Chart Type

We can change the chart type by selecting a different chart option from the chart name. Clicking on one of these options will produce the chart for that type of graph.

Formatting a Chart

The charts can also be formatted by using the Format option. This option allows us to set the values for the axes, set the legends or show the data values in the chart. In the below example, we have chosen the horizontal chart and selected the option to show the data values as a Format option.
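A search producing a statistics table like the one described might look like the following sketch; the source name and the bytes field are assumptions about the sample dataset:

```spl
source="web_applications" | stats avg(bytes)
```

Any search whose result is a table of aggregated values can be turned into a chart from the Visualization tab in the same way.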
Splunk – Lookups
Splunk – Lookups

In the result of a search query, we sometimes get values which may not clearly convey the meaning of the field. For example, we may get a field which lists the value of product id as a numeric result. These numbers will not give us any idea of what kind of product it is. But if we list the product name along with the product id, that gives us a good report where we understand the meaning of the search result. Such linking of values of one field to a field with the same name in another dataset, using equal values from both data sets, is called a lookup process. The advantage is that we retrieve related values from two different data sets.

Steps to Create and Use Lookup File

In order to successfully create a lookup field in a dataset, we need to follow the below steps −

Create Lookup File

We consider the dataset with host as web_application, and look at the productid field. This field is just a number, but we want product names to be reflected in our query result set. We create a lookup file with the following details. Here, we have kept the name of the first field as productid, which is the same as the field we are going to use from the dataset.

productid,productdescription
WC-SH-G04,Tablets
DB-SG-G01,PCs
DC-SG-G02,MobilePhones
SC-MG-G10,Wearables
WSC-MG-G10,Usb Light
GT-SC-G01,Battery
SF-BVS-G01,Hard Drive

Add the Lookup File

Next, we add the lookup file to the Splunk environment by using the Settings screens as shown below −

After selecting Lookups, we are presented with a screen to create and configure a lookup. We select lookup table files as shown below. We browse to select the file productidvals.csv as our lookup file to be uploaded and select search as our destination app. We also keep the same destination file name. On clicking the save button, the file gets saved to the Splunk repository as a lookup file.
Create Lookup Definitions

For a search query to be able to look up values from the lookup file we just uploaded above, we need to create a lookup definition. We do this by again going to Settings → Lookups → Lookup Definition → Add New. Next, we check the availability of the lookup definition we added by going to Settings → Lookups → Lookup Definition.

Selecting Lookup Field

Next, we need to select the lookup field for our search query. This is done by going to New search → All Fields. Then we check the box for productid, which automatically adds the productdescription field from the lookup file as well.

Using the Lookup Field

Now we use the lookup field in the search query as shown below. The visualization shows the result with the productdescription field instead of productid.
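Once the lookup definition exists, it can also be invoked explicitly with the lookup command. In this sketch, the definition name productidvals is an assumption matching the uploaded file name:

```spl
host="web_application" | lookup productidvals productid OUTPUT productdescription
```

The OUTPUT clause names the field to pull in from the lookup table for every event whose productid matches.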
Splunk – Knowledge Management

Splunk knowledge management is about the maintenance of knowledge objects for a Splunk Enterprise implementation. Below are the main features of knowledge management −

Ensure that knowledge objects are being shared and used by the right groups of people in the organization.

Normalize event data by implementing knowledge object naming conventions and retiring duplicate or obsolete objects.

Oversee strategies for improved search and pivot performance (report acceleration, data model acceleration, summary indexing, batch mode search).

Build data models for Pivot users.

Knowledge Object

A knowledge object is a Splunk object that captures specific information about your data. When you create a knowledge object, you can keep it private or you can share it with other users. Examples of knowledge objects are: saved searches, tags, field extractions, lookups, etc.

Uses of Knowledge Objects

As the Splunk software is used, knowledge objects are created and saved. But they may contain duplicate information, or they may not be used effectively by all of the intended audience. To address such issues, we need to manage these objects. This is done by classifying them properly and then using proper permission management to handle them. Below are the uses and classification of various knowledge objects −

Fields and field extractions

Fields and field extractions are the first layer of Splunk software knowledge. The fields automatically extracted by the Splunk software from the IT data help bring meaning to the raw data. Manually extracted fields expand and improve upon this layer of meaning.

Event types and transactions

Use event types and transactions to group together interesting sets of similar events. Event types group together sets of events discovered through searches. Transactions are collections of conceptually related events that span time.
Lookups and workflow actions

Lookups and workflow actions are categories of knowledge objects that extend the usefulness of your data in various ways. Field lookups enable you to add fields to your data from external data sources such as static tables (CSV files) or Python-based commands. Workflow actions enable interactions between fields in your data and other applications or web resources, such as a WHOIS lookup on a field containing an IP address.

Tags and aliases

Tags and aliases are used to manage and normalize sets of field information. You can use tags and aliases to group sets of related field values together, and to give extracted fields tags that reflect different aspects of their identity. For example, you can group events from a set of hosts in a particular location (such as a building or city) together by giving the same tag to each host. If you have two different sources using different field names to refer to the same data, then you can normalize your data by using aliases (by aliasing clientip to ipaddress, for example).

Data models

Data models are representations of one or more datasets, and they drive the Pivot tool, enabling Pivot users to quickly generate useful tables, complex visualizations, and robust reports without needing to interact with the Splunk software search language. Data models are designed by knowledge managers who fully understand the format and semantics of their indexed data. A typical data model makes use of other knowledge object types. We will discuss some of the examples of these knowledge objects in the subsequent chapters.
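As a sketch, the clientip → ipaddress alias mentioned above is configured as a FIELDALIAS setting in props.conf; the stanza name access_combined and the class name client are assumptions for illustration:

```ini
[access_combined]
FIELDALIAS-client = clientip AS ipaddress
```

After this alias is in place, searches can refer to ipaddress and match events that originally carried only clientip.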