Python Chi-square Test

The chi-square test is a statistical method for determining whether two categorical variables have a significant association. Both variables should come from the same population and should be categorical, such as Yes/No, Male/Female or Red/Green. For example, we can build a data set with observations on people's ice-cream buying patterns and try to correlate the gender of a person with the flavour of ice-cream they prefer. If an association is found, we can plan the stock of each flavour according to the number of people of each gender who visit.

We use the scipy and matplotlib libraries to plot the chi-square probability density function for several degrees of freedom.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(1, 1)

linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]

for df, ls in zip(deg_of_freedom, linestyles):
    # Label each curve so that plt.legend() has entries to show
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df = {}'.format(df))

plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()

Its output is a plot of the chi-square probability density function for each of the chosen degrees of freedom.
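The code above only visualises the distribution itself. As a complement, here is a minimal sketch of the test the chapter describes, using scipy.stats.chi2_contingency on a small contingency table of gender versus preferred flavour; the counts are purely illustrative, not real survey data.

from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are genders, columns are flavours
observed = [[30, 10, 20],   # e.g. male:   chocolate, vanilla, strawberry
            [20, 25, 15]]   # e.g. female: chocolate, vanilla, strawberry

chi2, p_value, dof, expected = chi2_contingency(observed)

print("chi-square statistic:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts if independent:")
print(expected)

A small p-value (for example below 0.05) would suggest that gender and flavour preference are associated.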

Python Normal Distribution

The normal distribution is a way of presenting data in which the probability of each value is arranged symmetrically around the mean; most values lie close to the mean. We use functions in the numpy library to generate normally distributed values, create a histogram of them, and then plot the probability distribution curve over it.

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)

# Create the bins and histogram (density=True normalises the bar heights)
count, bins, ignored = plt.hist(s, 20, density=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=3, color='y')
plt.show()

Its output is a histogram of the samples with the bell-shaped normal curve drawn over it.
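As a complement, here is a minimal sketch that uses scipy.stats.norm to compute probabilities for the same mean and standard deviation; the cut-off value of 0.6 is just an illustrative choice.

from scipy.stats import norm

mu, sigma = 0.5, 0.1

# Density of the distribution at the mean
print(norm.pdf(0.5, loc=mu, scale=sigma))

# Probability of observing a value below 0.6
print(norm.cdf(0.6, loc=mu, scale=sigma))

# Value below which 95% of observations fall
print(norm.ppf(0.95, loc=mu, scale=sigma))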

Python Correlation

Correlation refers to statistical relationships involving dependence between two data sets. Simple examples of dependent phenomena include the correlation between the physical appearance of parents and their offspring, and the correlation between the price of a product and the quantity supplied. We take the example of the iris data set available in the seaborn python library. In it we try to establish the correlation between the lengths and widths of the sepals and petals of three species of iris flower. Based on the correlations found, a strong model could be created that easily distinguishes one species from another.

import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('iris')

# Pairwise scatter plots, without regression lines
sns.pairplot(df, kind='scatter')
plt.show()

Its output is a grid of pairwise scatter plots of the four measurement columns.
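The pairplot gives a visual impression; to quantify the relationships, here is a minimal sketch that computes the pairwise Pearson correlation matrix of the numeric measurement columns (the columns are selected explicitly because the species column is non-numeric).

import seaborn as sns

df = sns.load_dataset('iris')

# Correlation matrix of the four numeric measurement columns
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
print(df[numeric_cols].corr())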

Python Reading HTML Pages

We can read and parse HTML pages in Python using a library known as BeautifulSoup. Using this library, we can search for the values of html tags and get specific data, such as the title of the page and the list of headers in the page.

Install Beautifulsoup

Use the Anaconda package manager to install the required package and its dependent packages.

conda install beautifulsoup4

Reading the HTML file

In the below example we make a request to a URL to be loaded into the python environment. Then we use the html parser parameter to read the entire html file. Next, we print the first few lines of the html page.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the html file
response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])

When we execute the above code, it produces the following result.

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!--> <html> <!--<![endif]-->
<head>
<!-- Basic -->
<meta charset="utf-8"/>
<title>

Extracting Tag Value

We can extract the value from the first instance of a tag using the following code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)

When we execute the above code, it produces the following result.

Python Overview
Python Overview
None
Python is Interpreted

Extracting All Tags

We can extract the values from all instances of a tag using the following code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

for x in soup.find_all('b'):
    print(x.string)

When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
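The chapter mentions getting the list of headers in a page; here is a minimal sketch along those lines, collecting every h1 and h2 tag. What it actually prints depends on the live page being fetched.

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
soup = BeautifulSoup(response.read(), 'html.parser')

# Collect and print the text of every h1 and h2 tag in the page
for tag in soup.find_all(['h1', 'h2']):
    print(tag.get_text(strip=True))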

Python Poisson Distribution

A Poisson distribution shows the likely number of times that an event will occur within a pre-determined period of time. It is used for independent events which occur at a constant rate within a given interval of time. The Poisson distribution is a discrete function, meaning that the event can only be measured as occurring or not occurring, so the variable can only take whole-number values. We use the seaborn python library, which has in-built functions to create such probability distribution graphs; the scipy package is used to generate the Poisson-distributed values.

from scipy.stats import poisson
import seaborn as sb

data_poisson = poisson.rvs(mu=4, size=10000)

ax = sb.distplot(data_poisson,
                 kde=True,
                 color='green',
                 hist_kws={"linewidth": 25, "alpha": 1})
ax.set(xlabel='Poisson', ylabel='Frequency')

Its output is a histogram of the sampled counts with a kernel density curve drawn over it.
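To complement the plot, here is a minimal sketch that computes point probabilities directly with scipy.stats.poisson; the rate mu=4 matches the example above and the event counts are illustrative.

from scipy.stats import poisson

mu = 4  # average number of events per interval

# Probability of observing exactly 2 events
print(poisson.pmf(2, mu))

# Probability of observing at most 6 events
print(poisson.cdf(6, mu))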

Python Data Aggregation

Python offers several methods for performing aggregations on data, using the pandas and numpy libraries. The data must be available as, or converted to, a DataFrame before the aggregation functions can be applied.

Applying Aggregations on DataFrame

Let us create a DataFrame and apply aggregations on it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2000', periods=10),
      columns = ['A', 'B', 'C', 'D'])

print(df)
r = df.rolling(window=3, min_periods=1)
print(r)

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.790670 -0.387854 -0.668132  0.267283
2000-01-03 -0.575523 -0.965025  0.060427 -2.179780
2000-01-04  1.669653  1.211759 -0.254695  1.429166
2000-01-05  0.100568 -0.236184  0.491646 -0.466081
2000-01-06  0.155172  0.992975 -1.205134  0.320958
2000-01-07  0.309468 -0.724053 -1.412446  0.627919
2000-01-08  0.099489 -1.028040  0.163206 -1.274331
2000-01-09  1.639500 -0.068443  0.714008 -0.565969
2000-01-10  0.326761  1.479841  0.664282 -1.361169

Rolling [window=3,min_periods=1,center=False,axis=0]

We can aggregate by passing a function to the entire DataFrame, or select a column via the standard get-item method.

Apply Aggregation on a Whole Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2000', periods=10),
      columns = ['A', 'B', 'C', 'D'])

print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  1.879182 -1.038796 -3.215581 -0.299575
2000-01-03  1.303660 -2.003821 -3.155154 -2.479355
2000-01-04  1.884801 -0.141119 -0.862400 -0.483331
2000-01-05  1.194699  0.010551  0.297378 -1.216695
2000-01-06  1.925393  1.968551 -0.968183  1.284044
2000-01-07  0.565208  0.032738 -2.125934  0.482797
2000-01-08  0.564129 -0.759118 -2.454374 -0.325454
2000-01-09  2.048458 -1.820537 -0.535232 -1.212381
2000-01-10  2.065750  0.383357  1.541496 -3.201469

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  1.879182 -1.038796 -3.215581 -0.299575
2000-01-03  1.303660 -2.003821 -3.155154 -2.479355
2000-01-04  1.884801 -0.141119 -0.862400 -0.483331
2000-01-05  1.194699  0.010551  0.297378 -1.216695
2000-01-06  1.925393  1.968551 -0.968183  1.284044
2000-01-07  0.565208  0.032738 -2.125934  0.482797
2000-01-08  0.564129 -0.759118 -2.454374 -0.325454
2000-01-09  2.048458 -1.820537 -0.535232 -1.212381
2000-01-10  2.065750  0.383357  1.541496 -3.201469

Apply Aggregation on a Single Column of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2000', periods=10),
      columns = ['A', 'B', 'C', 'D'])

print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  1.879182 -1.038796 -3.215581 -0.299575
2000-01-03  1.303660 -2.003821 -3.155154 -2.479355
2000-01-04  1.884801 -0.141119 -0.862400 -0.483331
2000-01-05  1.194699  0.010551  0.297378 -1.216695
2000-01-06  1.925393  1.968551 -0.968183  1.284044
2000-01-07  0.565208  0.032738 -2.125934  0.482797
2000-01-08  0.564129 -0.759118 -2.454374 -0.325454
2000-01-09  2.048458 -1.820537 -0.535232 -1.212381
2000-01-10  2.065750  0.383357  1.541496 -3.201469

2000-01-01    1.088512
2000-01-02    1.879182
2000-01-03    1.303660
2000-01-04    1.884801
2000-01-05    1.194699
2000-01-06    1.925393
2000-01-07    0.565208
2000-01-08    0.564129
2000-01-09    2.048458
2000-01-10    2.065750
Freq: D, Name: A, dtype: float64

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2000', periods=10),
      columns = ['A', 'B', 'C', 'D'])

print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))

Its output is as follows −

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  1.879182 -1.038796 -3.215581 -0.299575
2000-01-03  1.303660 -2.003821 -3.155154 -2.479355
2000-01-04  1.884801 -0.141119 -0.862400 -0.483331
2000-01-05  1.194699  0.010551  0.297378 -1.216695
2000-01-06  1.925393  1.968551 -0.968183  1.284044
2000-01-07  0.565208  0.032738 -2.125934  0.482797
2000-01-08  0.564129 -0.759118 -2.454374 -0.325454
2000-01-09  2.048458 -1.820537 -0.535232 -1.212381
2000-01-10  2.065750  0.383357  1.541496 -3.201469

                   A         B
2000-01-01  1.088512 -0.650942
2000-01-02  1.879182 -1.038796
2000-01-03  1.303660 -2.003821
2000-01-04  1.884801 -0.141119
2000-01-05  1.194699  0.010551
2000-01-06  1.925393  1.968551
2000-01-07  0.565208  0.032738
2000-01-08  0.564129 -0.759118
2000-01-09  2.048458 -1.820537
2000-01-10  2.065750  0.383357
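As a complement, here is a minimal sketch applying more than one aggregation to the same rolling window at once, using np.sum and np.mean; the combination is illustrative and any reducing function works.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
      index = pd.date_range('1/1/2000', periods=10),
      columns = ['A', 'B', 'C', 'D'])

r = df.rolling(window=3, min_periods=1)

# Apply several aggregations to every column at once;
# the result gets one column level per function
print(r.aggregate([np.sum, np.mean]))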

Python Time Series

Time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year. In the below example we take the value of stock prices every day for a quarter for a particular stock symbol. We capture these values in a csv file and then organize them into a dataframe using the pandas library. We then make the date field the index of the dataframe by converting the ValueDate column to datetime, assigning it as the index and deleting the original ValueDate column.

Sample Data

Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a file named stock.csv.

ValueDate,Price
01-01-2018, 1042.05
02-01-2018, 1033.55
03-01-2018, 1029.7
04-01-2018, 1021.3
05-01-2018, 1015.4
...
23-03-2018, 1161.3
26-03-2018, 1167.6
27-03-2018, 1155.25
28-03-2018, 1154

Creating Time Series

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])

# Set the Date as Index
df['ValueDate'] = pd.to_datetime(df['ValueDate'])
df.index = df['ValueDate']
del df['ValueDate']

df.plot(figsize=(15, 6))
plt.show()

Its output is a line plot of the price against the date.
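Once the dates form the index, pandas can change the frequency of the series with resample. Here is a minimal sketch, using a small synthetic daily price series rather than stock.csv, that averages daily prices into monthly values.

import pandas as pd
import numpy as np

# Illustrative daily price series (synthetic values, not the stock.csv data)
dates = pd.date_range('2018-01-01', periods=90, freq='D')
df = pd.DataFrame({'Price': np.linspace(1000, 1160, 90)}, index=dates)

# Downsample from daily to monthly frequency, averaging the prices within each month
monthly = df.resample('M').mean()
print(monthly)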

Python Data Wrangling

Data wrangling involves processing data in various ways, such as merging, grouping and concatenating, for the purpose of analysing it or getting it ready to be used with another set of data. Python has built-in features to apply these wrangling methods to various data sets to achieve the analytical goal. In this chapter we will look at a few examples describing these methods.

Merging Data

The Pandas library in python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True)

Let us now create two different DataFrames and perform the merging operations on them; the merge call itself is sketched at the end of this chapter.

# import the pandas library
import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

print(left)
print(right)

Its output is as follows −

     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5

    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5

Grouping Data

Grouping data sets is a frequent need in data analysis, where we need the result in terms of the various groups present in the data set. Pandas has in-built methods which can roll the data into various groups. In the below example we group the data by year and then get the result for a specific year.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped.get_group(2014))

Its output is as follows −

   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014

Concatenating Data

Pandas provides various facilities for easily combining Series and DataFrame objects. In the below example the concat function performs concatenation operations along an axis. Let us create different objects and do the concatenation.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])

print(pd.concat([one, two]))

Its output is as follows −

   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
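The merging example earlier in this chapter builds the two DataFrames but stops before the join itself. Here is a minimal sketch of the merge, joining on subject_id with the default inner join, which keeps only the subject_ids present in both frames.

import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})

right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join on subject_id; only sub2, sub4, sub5 and sub6 appear in both frames,
# so four rows are returned, with the overlapping id and Name columns suffixed _x and _y
print(pd.merge(left, right, on='subject_id'))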

Python Relational databases

We can connect to relational databases to analyse data using the pandas library together with an additional library for implementing database connectivity. This package is named sqlalchemy and it provides full SQL language functionality to be used in python.

Installing SQLAlchemy

The installation is very straightforward using Anaconda, which we have discussed in the chapter Data Science Environment. Assuming you have installed Anaconda as described in that chapter, run the following command in the Anaconda Prompt window to install the SQLAlchemy package.

conda install sqlalchemy

Reading Relational Tables

We will use SQLite as our relational database as it is very lightweight and easy to use, though the SQLAlchemy library can connect to a variety of relational sources including MySQL, Oracle, PostgreSQL and MSSQL. We first create a database engine and then store a dataframe, already created by reading a csv file, as a relational table using the to_sql function of pandas. Then we use the read_sql_query function from pandas to execute and capture the results of various SQL queries.

from sqlalchemy import create_engine
import pandas as pd

data = pd.read_csv('/path/input.csv')

# Create the db engine
engine = create_engine('sqlite:///:memory:')

# Store the dataframe as a table
data.to_sql('data_table', engine)

# Query 1 on the relational table
res1 = pd.read_sql_query('SELECT * FROM data_table', engine)
print('Result 1')
print(res1)
print('')

# Query 2 on the relational table
res2 = pd.read_sql_query('SELECT dept,sum(salary) FROM data_table group by dept', engine)
print('Result 2')
print(res2)

When we execute the above code, it produces the following result.

Result 1
   index  id    name  salary  start_date        dept
0      0   1    Rick  623.30  2012-01-01          IT
1      1   2     Dan  515.20  2013-09-23  Operations
2      2   3   Tusar  611.00  2014-11-15          IT
3      3   4    Ryan  729.00  2014-05-11          HR
4      4   5    Gary  843.25  2015-03-27     Finance
5      5   6   Rasmi  578.00  2013-05-21          IT
6      6   7  Pranab  632.80  2013-07-30  Operations
7      7   8    Guru  722.50  2014-06-17     Finance

Result 2
         dept  sum(salary)
0     Finance      1565.75
1          HR       729.00
2          IT      1812.30
3  Operations      1148.00

Inserting Data into Relational Tables

We can also insert data into relational tables using the sql.execute function available in older versions of pandas. In the below code we use the previous csv file as the input data set, store it in a relational table and then insert another record using sql.execute.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')

# Store the data in a relational table
data.to_sql('data_table', engine)

# Insert another row
sql.execute('INSERT INTO data_table VALUES(?,?,?,?,?,?)', engine,
            params=[('id', 9, 'Ruby', 711.20, '2015-03-27', 'IT')])

# Read from the relational table
res = pd.read_sql_query('SELECT ID,Dept,Name,Salary,start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

   id        dept    name  salary  start_date
0   1          IT    Rick  623.30  2012-01-01
1   2  Operations     Dan  515.20  2013-09-23
2   3          IT   Tusar  611.00  2014-11-15
3   4          HR    Ryan  729.00  2014-05-11
4   5     Finance    Gary  843.25  2015-03-27
5   6          IT   Rasmi  578.00  2013-05-21
6   7  Operations  Pranab  632.80  2013-07-30
7   8     Finance    Guru  722.50  2014-06-17
8   9          IT    Ruby  711.20  2015-03-27

Deleting Data from Relational Tables

We can also delete data from relational tables using the same sql.execute function. The below code deletes a row based on the given input condition.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine)

sql.execute('DELETE FROM data_table WHERE name = (?)', engine, params=[('Gary')])

res = pd.read_sql_query('SELECT ID,Dept,Name,Salary,start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

   id        dept    name  salary  start_date
0   1          IT    Rick   623.3  2012-01-01
1   2  Operations     Dan   515.2  2013-09-23
2   3          IT   Tusar   611.0  2014-11-15
3   4          HR    Ryan   729.0  2014-05-11
4   6          IT   Rasmi   578.0  2013-05-21
5   7  Operations  Pranab   632.8  2013-07-30
6   8     Finance    Guru   722.5  2014-06-17
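Because the pandas.io.sql.execute helper shown above is only present in older pandas releases, here is a minimal sketch of the same kind of insert done directly through SQLAlchemy instead. It assumes a recent SQLAlchemy version (2.0-style execution with text() and an explicit commit) and uses a small illustrative DataFrame in place of input.csv.

from sqlalchemy import create_engine, text
import pandas as pd

# Illustrative data standing in for input.csv
data = pd.DataFrame({'id': [1, 2],
                     'name': ['Rick', 'Dan'],
                     'salary': [623.30, 515.20],
                     'start_date': ['2012-01-01', '2013-09-23'],
                     'dept': ['IT', 'Operations']})

engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine, index=False)

# Insert another row through SQLAlchemy itself
with engine.connect() as conn:
    conn.execute(text(
        'INSERT INTO data_table (id, name, salary, start_date, dept) '
        'VALUES (:id, :name, :salary, :start_date, :dept)'),
        {'id': 9, 'name': 'Ruby', 'salary': 711.20,
         'start_date': '2015-03-27', 'dept': 'IT'})
    conn.commit()

print(pd.read_sql_query('SELECT id, dept, name, salary, start_date FROM data_table', engine))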

Python Processing Unstructured Data

Data that is already present in a row-and-column format, or that can easily be converted to rows and columns so that it fits nicely into a database, is known as structured data. Examples are CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, where missing values are represented as blanks between the delimiters. But sometimes we get data whose lines are not of fixed width, or which is just an HTML, image or pdf file. Such data is known as unstructured data. While an HTML file can be handled by processing its tags, a feed from Twitter or a plain text document from a news feed has neither a delimiter nor tags to handle. In such a scenario we use different in-built functions from various python libraries to process the file.

Reading Data

In the below example we take a text file and read it, segregating each of the lines in it. Next we can divide the output further into lines and words. The original file is a text file containing some paragraphs describing the python language.

filename = 'pathinput.txt'

with open(filename) as fn:
    # Read each line
    ln = fn.readline()
    # Keep count of lines
    lncnt = 1
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        ln = fn.readline()
        lncnt += 1

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.

Counting Word Frequency

We can count the frequency of the words in the file using the Counter class as follows.

from collections import Counter

with open(r'pathinput2.txt') as f:
    p = Counter(f.read().split())

print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
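Counter also makes it easy to pull out just the most frequent words. Here is a minimal sketch, assuming the same input file as above, that lowercases the text before counting and prints only the five most common words; the cut-off of five is an illustrative choice.

from collections import Counter

with open(r'pathinput2.txt') as f:
    # Lowercase so that 'Python' and 'python' are counted together
    words = f.read().lower().split()

p = Counter(words)

# The five most frequent words and their counts
print(p.most_common(5))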