Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page provides examples of how various SQL operations can be performed using pandas.

```python
import pandas as pd

url = "https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
```

Its output is as follows −

```
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```

SELECT

In SQL, selection is done using a comma-separated list of the columns you select (or a * to select all columns) −

```sql
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
```

With pandas, column selection is done by passing a list of column names to the DataFrame −

```python
tips[["total_bill", "tip", "smoker", "time"]].head(5)
```

Let us check the full program −

```python
import pandas as pd

url = "https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv"
tips = pd.read_csv(url)
print(tips[["total_bill", "tip", "smoker", "time"]].head(5))
```

Its output is as follows −

```
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
3       23.68  3.31     No  Dinner
4       24.59  3.61     No  Dinner
```

Calling the DataFrame without a list of column names displays all columns (akin to SQL's *).

WHERE

Filtering in SQL is done via a WHERE clause −

```sql
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
```

DataFrames can be filtered in multiple ways; the most intuitive is Boolean indexing.
```python
tips[tips["time"] == "Dinner"].head(5)
```

Let us check the full program −

```python
import pandas as pd

url = "https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv"
tips = pd.read_csv(url)
print(tips[tips["time"] == "Dinner"].head(5))
```

Its output is as follows −

```
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
```

The statement above passes a Series of True/False values to the DataFrame, returning all rows marked True.

GroupBy

This operation fetches the count of records in each group of a dataset. For instance, a query fetching the number of tips left by sex −

```sql
SELECT sex, count(*) FROM tips GROUP BY sex;
```

The pandas equivalent would be −

```python
tips.groupby("sex").size()
```

Let us check the full program −

```python
import pandas as pd

url = "https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv"
tips = pd.read_csv(url)
print(tips.groupby("sex").size())
```

Its output is as follows −

```
sex
Female     87
Male      157
dtype: int64
```

Top N Rows

SQL returns the top n rows using LIMIT −

```sql
SELECT * FROM tips LIMIT 5;
```

The pandas equivalent would be −

```python
tips.head(5)
```

Let us check the full example −

```python
import pandas as pd

url = "https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv"
tips = pd.read_csv(url)
tips = tips[["smoker", "day", "time"]].head(5)
print(tips)
```

Its output is as follows −

```
  smoker  day    time
0     No  Sun  Dinner
1     No  Sun  Dinner
2     No  Sun  Dinner
3     No  Sun  Dinner
4     No  Sun  Dinner
```

These are a few basic operations, compared here using the pandas features we learnt in the previous chapters.
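SQL WHERE clauses usually combine several predicates, and GROUP BY queries often compute more than a count. Below is a minimal sketch of both in pandas; since it should run without a network connection, it uses a small in-memory stand-in built from the five tips rows shown above rather than downloading the CSV.

```python
import pandas as pd

# In-memory stand-in for tips.csv (the five rows shown above).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# SQL: SELECT * FROM tips WHERE time = 'Dinner' AND tip > 3.00;
# Combine conditions with & (AND) and | (OR); parenthesise each one.
filtered = tips[(tips["time"] == "Dinner") & (tips["tip"] > 3.00)]
print(filtered)

# SQL: SELECT sex, COUNT(*), AVG(tip) FROM tips GROUP BY sex;
summary = tips.groupby("sex")["tip"].agg(["count", "mean"])
print(summary)
```

The parentheses around each condition are required because `&` and `|` bind more tightly than comparison operators in Python.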

Python Pandas – Discussion

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics. In this tutorial, we will learn the various features of Python pandas and how to use them in practice.

Python Pandas – Timedelta

Timedeltas are differences in times, expressed in different units, for example days, hours, minutes, and seconds. They can be both positive and negative. We can create Timedelta objects using various arguments, as shown below.

String

By passing a string literal, we can create a Timedelta object.

```python
import pandas as pd

print(pd.Timedelta("2 days 2 hours 15 minutes 30 seconds"))
```

Its output is as follows −

```
2 days 02:15:30
```

Integer

Passing an integer value with a unit argument creates a Timedelta object.

```python
import pandas as pd

print(pd.Timedelta(6, unit="h"))
```

Its output is as follows −

```
0 days 06:00:00
```

Data Offsets

Data offsets such as weeks, days, hours, minutes, seconds, milliseconds, microseconds, and nanoseconds can also be used in construction.

```python
import pandas as pd

print(pd.Timedelta(days=2))
```

Its output is as follows −

```
2 days 00:00:00
```

to_timedelta()

Using the top-level pd.to_timedelta, you can convert a scalar, array, list, or Series from a recognized timedelta format/value into a Timedelta type. It constructs a Series if the input is a Series, a scalar if the input is scalar-like, and otherwise a TimedeltaIndex.

```python
import pandas as pd

print(pd.to_timedelta(["1 days", "2 hours", "3 min"]))
```

Its output is as follows −

```
TimedeltaIndex(['1 days 00:00:00', '0 days 02:00:00', '0 days 00:03:00'],
               dtype='timedelta64[ns]', freq=None)
```

Operations

You can operate on Series/DataFrames and construct timedelta64[ns] Series through subtraction operations on datetime64[ns] Series or Timestamps.
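For instance (a minimal sketch), subtracting one Timestamp from another directly yields a Timedelta:

```python
import pandas as pd

# Subtracting one Timestamp from another yields a Timedelta.
delta = pd.Timestamp("2012-01-05") - pd.Timestamp("2012-01-01")
print(delta)       # 4 days 00:00:00
print(delta.days)  # 4
```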
Let us now create a DataFrame with Timedelta and datetime objects and perform some arithmetic operations on it −

```python
import pandas as pd

s = pd.Series(pd.date_range("2012-1-1", periods=3, freq="D"))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame(dict(A=s, B=td))
print(df)
```

Its output is as follows −

```
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
```

Addition Operations

```python
import pandas as pd

s = pd.Series(pd.date_range("2012-1-1", periods=3, freq="D"))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame(dict(A=s, B=td))
df["C"] = df["A"] + df["B"]
print(df)
```

Its output is as follows −

```
           A      B          C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
```

Subtraction Operation

```python
import pandas as pd

s = pd.Series(pd.date_range("2012-1-1", periods=3, freq="D"))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame(dict(A=s, B=td))
df["C"] = df["A"] + df["B"]
df["D"] = df["C"] - df["B"]
print(df)
```

Subtracting the Timedelta column back out recovers the original dates. Its output is as follows −

```
           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-02
2 2012-01-03 2 days 2012-01-05 2012-01-03
```
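A Timedelta can also be converted back into plain numbers; a small sketch using total_seconds() and the components attribute:

```python
import pandas as pd

td = pd.Timedelta("2 days 2 hours 15 minutes 30 seconds")

# total_seconds() converts the whole Timedelta to a float of seconds.
print(td.total_seconds())

# components breaks the Timedelta into named fields (days, hours, ...).
print(td.components.hours, td.components.minutes)
```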

Statistical Functions

Statistical methods help in understanding and analyzing the behavior of data. We will now learn a few statistical functions, which we can apply on pandas objects.

Percent Change

Series and DataFrames both have the function pct_change(). This function compares every element with its prior element and computes the change as a percentage.

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3, 4, 5, 4])
print(s.pct_change())

df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())
```

Its output is as follows −

```
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64

           0         1
0        NaN       NaN
1 -15.151902  0.174730
2  -0.746374 -1.449088
3  -3.582229 -3.165836
4  15.601150 -1.860434
```

By default, pct_change() operates on columns; to apply it row-wise, pass the axis=1 argument.

Covariance

Covariance is applied on series data. The Series object has a method cov to compute covariance between Series objects. NA values are excluded automatically.

Cov Series

```python
import pandas as pd
import numpy as np

s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))
```

Its output is as follows −

```
-0.12978405324
```

The covariance method, when applied on a DataFrame, computes cov between all the columns.

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(10, 5), columns=["a", "b", "c", "d", "e"])
print(frame["a"].cov(frame["b"]))
print(frame.cov())
```

Its output is as follows −

```
-0.58312921152741437

          a         b         c         d         e
a  1.780628 -0.583129 -0.185575  0.003679 -0.136558
b -0.583129  1.297011  0.136530 -0.523719  0.251064
c -0.185575  0.136530  0.915227 -0.053881 -0.058926
d  0.003679 -0.523719 -0.053881  1.521426 -0.487694
e -0.136558  0.251064 -0.058926 -0.487694  0.960761
```

Note − Observe the cov between the a and b columns in the first statement; the same value is returned by cov on the DataFrame.
Correlation

Correlation shows the linear relationship between any two arrays of values (Series). There are multiple methods to compute the correlation, like pearson (default), spearman, and kendall.

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(10, 5), columns=["a", "b", "c", "d", "e"])
print(frame["a"].corr(frame["b"]))
print(frame.corr())
```

Its output is as follows −

```
-0.383712785514

          a         b         c         d         e
a  1.000000 -0.383713 -0.145368  0.002235 -0.104405
b -0.383713  1.000000  0.125311 -0.372821  0.224908
c -0.145368  0.125311  1.000000 -0.045661 -0.062840
d  0.002235 -0.372821 -0.045661  1.000000 -0.403380
e -0.104405  0.224908 -0.062840 -0.403380  1.000000
```

If any non-numeric column is present in the DataFrame, it is excluded automatically.

Data Ranking

Data ranking produces a rank for each element in an array of elements. In case of ties, it assigns the mean rank.

```python
import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=list("abcde"))
s["d"] = s["b"]  # so there's a tie
print(s.rank())
```

Its output is as follows −

```
a    1.0
b    3.5
c    2.0
d    3.5
e    5.0
dtype: float64
```

Rank optionally takes a parameter ascending, which by default is True; when False, data is reverse-ranked, with larger values assigned a smaller rank.

Rank supports different tie-breaking methods, specified with the method parameter −

- average − average rank of tied group
- min − lowest rank in the group
- max − highest rank in the group
- first − ranks assigned in the order they appear in the array
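The tie-breaking methods listed above can be compared side by side; a small sketch:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 3.0, 2.0])

# method="average" (default): tied values share the mean of their ranks.
avg = s.rank().tolist()            # [3.5, 1.0, 3.5, 2.0]

# method="min": tied values all receive the lowest rank in the group.
low = s.rank(method="min").tolist()    # [3.0, 1.0, 3.0, 2.0]

# method="first": ties broken by order of appearance.
first = s.rank(method="first").tolist()  # [3.0, 1.0, 4.0, 2.0]

print(avg, low, first)
```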

Python Pandas – Visualization

Basic Plotting: plot

This functionality on Series and DataFrame is just a simple wrapper around the matplotlib library's plot() method.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range("1/1/2000", periods=10), columns=list("ABCD"))
df.plot()
```

The output is a line chart of the four columns. If the index consists of dates, pandas calls gcf().autofmt_xdate() to format the x-axis, as shown in the illustration. We can plot one column versus another using the x and y keywords.

Plotting methods allow a handful of plot styles other than the default line plot. These can be provided as the kind keyword argument to plot(). They include −

- bar or barh for bar plots
- hist for histogram
- box for boxplot
- area for area plots
- scatter for scatter plots

Bar Plot

Let us now see what a bar plot is by creating one. A bar plot can be created in the following way −

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
df.plot.bar()
```

The output is a bar plot. To produce a stacked bar plot, pass stacked=True −

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
df.plot.bar(stacked=True)
```

The output is a stacked bar plot. To get horizontal bar plots, use the barh method −

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
df.plot.barh(stacked=True)
```

The output is a horizontal stacked bar plot.

Histograms

Histograms can be plotted using the plot.hist() method. We can specify the number of bins.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.random.randn(1000) + 1,
                   "b": np.random.randn(1000),
                   "c": np.random.randn(1000) - 1},
                  columns=["a", "b", "c"])
df.plot.hist(bins=20)
```

The output is a single set of axes with the three histograms overlaid. To plot a separate histogram for each column, use the following code −

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": np.random.randn(1000) + 1,
                   "b": np.random.randn(1000),
                   "c": np.random.randn(1000) - 1},
                  columns=["a", "b", "c"])
df.hist(bins=20)
```

The output is one subplot per column.

Box Plots

A boxplot can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column. For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0,1).

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
df.plot.box()
```

The output is a box plot, one box per column.

Area Plot

An area plot can be created using the Series.plot.area() or DataFrame.plot.area() methods.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
df.plot.area()
```

The output is a stacked area plot.

Scatter Plot

A scatter plot can be created using the DataFrame.plot.scatter() method.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(50, 4), columns=["a", "b", "c", "d"])
df.plot.scatter(x="a", y="b")
```

The output is a scatter plot of column a against column b.

Pie Chart

A pie chart can be created using the DataFrame.plot.pie() method.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(3 * np.random.rand(4), index=["a", "b", "c", "d"], columns=["x"])
df.plot.pie(subplots=True)
```

The output is a pie chart of the four values.

Python Pandas – GroupBy

Any groupby operation involves one of the following operations on the original object −

- Splitting the object
- Applying a function
- Combining the results

In many situations, we split the data into sets and apply some functionality to each subset. With the apply functionality, we can perform the following operations −

- Aggregation − computing a summary statistic
- Transformation − performing some group-specific operation
- Filtration − discarding the data on some condition

Let us now create a DataFrame object and perform all these operations on it −

```python
# import the pandas library
import pandas as pd

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df)
```

Its output is as follows −

```
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
```

Split Data into Groups

A pandas object can be split into groups by any of its keys.
There are multiple ways to split an object, like −

- obj.groupby("key")
- obj.groupby(["key1", "key2"])
- obj.groupby(key, axis=1)

Let us now see how the grouping objects can be applied to the DataFrame object.

Example

```python
# import the pandas library
import pandas as pd

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df.groupby("Team"))
```

Its output is as follows −

```
<pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>
```

View Groups

```python
# import the pandas library
import pandas as pd

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df.groupby("Team").groups)
```

Its output is as follows −

```
{'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Devils': Int64Index([2, 3], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
```

Example

Group by with multiple columns −

```python
# import the pandas library
import pandas as pd

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df.groupby(["Team", "Year"]).groups)
```

Its output is as follows −

```
{('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64')}
```

Iterating through Groups

With the groupby object in hand, we can iterate through it, much as we would iterate over any iterable.

```python
# import the pandas library
import pandas as pd

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby("Year")
for name, group in grouped:
   print(name)
   print(group)
```

Its output is as follows −

```
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
```

By default, the groupby object uses the group keys as labels.

Select a Group

Using the get_group() method, we can select a single group.
```python
# import the pandas library
import pandas as pd

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby("Year")
print(grouped.get_group(2014))
```

Its output is as follows −

```
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
```

Aggregations

An aggregation function returns a single aggregated value for each group. Once the group by object is created, several aggregation operations can be performed on the grouped data. An obvious one is aggregation via the aggregate or equivalent agg method −

```python
# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby("Year")
print(grouped["Points"].agg(np.mean))
```

Its output is as follows −

```
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
```

Another way to see the size of each group is by applying the size() function −

```python
import pandas as pd
import numpy as np

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby("Team")
print(grouped.agg(np.size))
```

Its output is as follows −

```
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
```

Applying Multiple Aggregation Functions at Once

With grouped Series, you can also pass a list or dict of functions to aggregate with, generating a DataFrame as output −

```python
# import the pandas library
import pandas as pd
import numpy as np

ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
   "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
   "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby("Team")
print(grouped["Points"].agg([np.sum, np.mean, np.std]))
```

Its output is as follows −

```
         sum        mean        std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
```
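Aggregation was shown above; the transformation and filtration operations listed at the start of the chapter can be sketched on the same kind of data (shortened here to six rows):

```python
import pandas as pd

df = pd.DataFrame({
    "Team":   ["Riders", "Riders", "Devils", "Kings", "Kings", "Kings"],
    "Points": [876, 789, 863, 741, 756, 788],
})
grouped = df.groupby("Team")

# Transformation: returns an object the same size as each group,
# e.g. centre Points around each group's mean.
centred = grouped["Points"].transform(lambda x: x - x.mean())
print(centred)

# Filtration: keep only the groups satisfying a condition,
# e.g. teams appearing at least three times.
filt = grouped.filter(lambda x: len(x) >= 3)
print(filt)
```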

Python Pandas – Sparse Data

Sparse objects are "compressed" when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified". This will make much more sense in an example. All of the standard pandas data structures apply the to_sparse method −

```python
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print(sts)
```

Its output is as follows −

```
0   -0.810497
1   -1.419954
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.439240
9   -1.095910
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
```

Sparse objects exist for memory-efficiency reasons. Let us now assume you had a large, mostly-NA DataFrame and execute the following code −

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9999] = np.nan
sdf = df.to_sparse()
print(sdf.density)
```

Its output is as follows −

```
0.0001
```

Any sparse object can be converted back to the standard dense form by calling to_dense −

```python
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
print(sts.to_dense())
```

Its output is as follows −

```
0   -0.810497
1   -1.419954
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.439240
9   -1.095910
dtype: float64
```

Sparse Dtypes

Sparse data has the same dtype as its dense representation. Currently, float64, int64, and bool dtypes are supported. Depending on the original dtype, the fill_value default changes −

- float64 − np.nan
- int64 − 0
- bool − False

Let us execute the following code to understand the same −

```python
import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, np.nan])
print(s)
print(s.to_sparse())
```

Its output is as follows − notice the dtype is float64 in both the dense and the sparse form:

```
0    1.0
1    NaN
2    NaN
dtype: float64
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
```
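Note that to_sparse() was removed in pandas 1.0; in recent releases the same idea is expressed with SparseDtype and the .sparse accessor. A rough equivalent of the first example above, assuming a modern pandas:

```python
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan

# astype with a SparseDtype replaces the removed to_sparse() method;
# NaN is the fill value being "compressed" away.
sts = ts.astype(pd.SparseDtype("float64", np.nan))

# density is the fraction of values not matching the fill value: 4/10 here.
print(sts.sparse.density)
```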

Python Pandas – Date Functionality

Extending on time series, date functionalities play a major role in financial data analysis. While working with date data, we will frequently come across the following −

- Generating a sequence of dates
- Converting the date series to different frequencies

Create a Range of Dates

Using the date_range() function and specifying the periods and the frequency, we can create a date series. By default, the frequency of the range is days.

```python
import pandas as pd

print(pd.date_range("1/1/2011", periods=5))
```

Its output is as follows −

```
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05'],
              dtype='datetime64[ns]', freq='D')
```

Change the Date Frequency

```python
import pandas as pd

print(pd.date_range("1/1/2011", periods=5, freq="M"))
```

Its output is as follows −

```
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='M')
```

bdate_range

bdate_range() stands for business date range. Unlike date_range(), it excludes Saturday and Sunday.

```python
import pandas as pd

print(pd.bdate_range("1/1/2011", periods=5))
```

Its output is as follows −

```
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07'],
              dtype='datetime64[ns]', freq='B')
```

Observe that the range starts on Monday, 3rd January; 1st and 2nd January (a weekend) are excluded. Just check your calendar for the days.

Convenience functions like date_range and bdate_range utilize a variety of frequency aliases. The default frequency for date_range is a calendar day, while the default for bdate_range is a business day.

```python
import pandas as pd
from datetime import datetime

start = datetime(2011, 1, 1)
end = datetime(2011, 1, 5)
print(pd.date_range(start, end))
```

Its output is as follows −

```
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05'],
              dtype='datetime64[ns]', freq='D')
```

Offset Aliases

A number of string aliases are given to useful common time series frequencies.
We will refer to these aliases as offset aliases.

| Alias | Description |
| --- | --- |
| B | business day frequency |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency |
| BM | business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency |
| BMS | business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | annual (year) end frequency |
| BA | business year end frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |
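A couple of the aliases in action (a minimal sketch): "B" skips weekends, and "QS" marks quarter starts.

```python
import pandas as pd

# "B" (business day) starts on Monday 3rd January, skipping the weekend.
b = pd.date_range("2011-01-01", periods=5, freq="B")
print(b)

# "QS" (quarter start) yields the first day of each quarter.
q = pd.date_range("2011-01-01", periods=4, freq="QS")
print(q)
```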

Python Pandas – Missing Data

Missing data is always a problem in real-life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus for making models more accurate and valid.

When and Why Is Data Missed?

Let us consider an online survey for a product. Many times, people do not share all the information related to them. A few people share their experience, but not how long they have been using the product; a few people share how long they have been using the product and their experience, but not their contact information. Thus, in some way or another, a part of the data is always missing, and this is very common in real life.

Let us now see how we can handle missing values (say NA or NaN) using pandas.

```python
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df)
```

Its output is as follows −

```
        one       two     three
a  0.077988  0.476149  0.965836
b       NaN       NaN       NaN
c -0.390208 -0.551605 -2.301950
d       NaN       NaN       NaN
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g       NaN       NaN       NaN
h  0.085100  0.532791  0.887415
```

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.
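Before treating missing values, it is often useful to count them per column; a quick sketch on a small hand-made frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                   "two": [np.nan, np.nan, 6.0]})

# isnull() gives a Boolean frame; summing it counts the NaNs per column.
counts = df.isnull().sum()
print(counts)
```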
Check for Missing Values

To make detecting missing values easier (and across different array dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df["one"].isnull())
```

Its output is as follows −

```
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
```

Example 2

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df["one"].notnull())
```

Its output is as follows −

```
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
```

Calculations with Missing Data

- When summing data, NA is treated as zero.
- If the data are all NA, then the result is NA.

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df["one"].sum())
```

Its output is as follows −

```
2.02357685917
```

Example 2

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=["one", "two"])
print(df["one"].sum())
```

Its output is as follows −

```
nan
```

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which we illustrate in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace NaN with 0.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=["a", "c", "e"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c"])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
```

Its output is as follows −

```
        one       two     three
a -0.576991 -0.741695  0.553172
b       NaN       NaN       NaN
c  0.744328 -1.735166  1.749580

NaN replaced with '0':
        one       two     three
a -0.576991 -0.741695  0.553172
b  0.000000  0.000000  0.000000
c  0.744328 -1.735166  1.749580
```

Here, we are filling with the value zero; we could instead fill with any other value.

Fill NA Forward and Backward

Using the concepts of filling discussed in the Reindexing chapter, we will fill the missing values −

| Sr.No | Method & Action |
| --- | --- |
| 1 | pad/ffill − fill values forward |
| 2 | bfill/backfill − fill values backward |

(In recent pandas versions, fillna(method=...) is deprecated in favour of the equivalent df.ffill() and df.bfill() methods.)

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df.fillna(method="pad"))
```

Its output is as follows −

```
        one       two     three
a  0.077988  0.476149  0.965836
b  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415
```

Example 2

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df.fillna(method="backfill"))
```

Its output is as follows −

```
        one       two     three
a  0.077988  0.476149  0.965836
b -0.390208 -0.551605 -2.301950
c -0.390208 -0.551605 -2.301950
d -2.000303 -0.788201  1.510072
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g  0.085100  0.532791  0.887415
h  0.085100  0.532791  0.887415
```

Drop Missing Values

If you want to simply exclude the missing values, use the dropna function along with the axis argument.
By default, axis=0, i.e., along rows, which means that if any value within a row is NA, the whole row is excluded.

Example 1

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df.dropna())
```

Its output is as follows −

```
        one       two     three
a  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415
```

Example 2

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"],
   columns=["one", "two", "three"])
df = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
print(df.dropna(axis=1))
```

Since every column contains at least one NA, dropping along axis=1 removes all columns. Its output is as follows −

```
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
```
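dropna also accepts a thresh argument (keep rows having at least that many non-NA values) and a subset argument (check only the given columns). A small sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                   "two": [np.nan, np.nan, 6.0],
                   "three": [7.0, 8.0, 9.0]})

# Keep rows with at least 2 non-NA values: drops only the middle row.
kept = df.dropna(thresh=2)
print(kept)

# Check for NA only in column "one": also drops only the middle row.
by_col = df.dropna(subset=["one"])
print(by_col)
```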

Python Pandas – Window Functions

For working with numerical data, pandas provides a few variants, such as rolling, expanding, and exponentially weighted moving windows, for window statistics. Among these are sum, mean, median, variance, covariance, correlation, etc. We will now learn how each of these can be applied on DataFrame objects.

.rolling() Function

This function can be applied on a series of data. Specify the window=n argument and apply the appropriate statistical function on top of it.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range("1/1/2000", periods=10),
   columns=["A", "B", "C", "D"])
print(df.rolling(window=3).mean())
```

Its output is as follows −

```
                   A         B         C         D
2000-01-01       NaN       NaN       NaN       NaN
2000-01-02       NaN       NaN       NaN       NaN
2000-01-03  0.434553 -0.667940 -1.051718 -0.826452
2000-01-04  0.628267 -0.047040 -0.287467 -0.161110
2000-01-05  0.398233  0.003517  0.099126 -0.405565
2000-01-06  0.641798  0.656184 -0.322728  0.428015
2000-01-07  0.188403  0.010913 -0.708645  0.160932
2000-01-08  0.188043 -0.253039 -0.818125 -0.108485
2000-01-09  0.682819 -0.606846 -0.178411 -0.404127
2000-01-10  0.688583  0.127786  0.513832 -1.067156
```

Note − Since the window size is 3, the first two elements are null, and from the third onwards each value is the average of the current and the two preceding elements. We can apply any of the functions mentioned above in the same way.

.expanding() Function

This function can be applied on a series of data. Specify the min_periods=n argument and apply the appropriate statistical function on top of it.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range("1/1/2000", periods=10),
   columns=["A", "B", "C", "D"])
print(df.expanding(min_periods=3).mean())
```

Its output is as follows −

```
                   A         B         C         D
2000-01-01       NaN       NaN       NaN       NaN
2000-01-02       NaN       NaN       NaN       NaN
2000-01-03  0.434553 -0.667940 -1.051718 -0.826452
2000-01-04  0.743328 -0.198015 -0.852462 -0.262547
2000-01-05  0.614776 -0.205649 -0.583641 -0.303254
2000-01-06  0.538175 -0.005878 -0.687223 -0.199219
2000-01-07  0.505503 -0.108475 -0.790826 -0.081056
2000-01-08  0.454751 -0.223420 -0.671572 -0.230215
2000-01-09  0.586390 -0.206201 -0.517619 -0.267521
2000-01-10  0.560427 -0.037597 -0.399429 -0.376886
```

.ewm() Function

ewm is applied on a series of data. Specify any one of the com, span, or halflife arguments and apply the appropriate statistical function on top of it. It assigns the weights exponentially.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range("1/1/2000", periods=10),
   columns=["A", "B", "C", "D"])
print(df.ewm(com=0.5).mean())
```

Its output is as follows −

```
                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.865131 -0.453626 -1.137961  0.058747
2000-01-03 -0.132245 -0.807671 -0.308308 -1.491002
2000-01-04  1.084036  0.555444 -0.272119  0.480111
2000-01-05  0.425682  0.025511  0.239162 -0.153290
2000-01-06  0.245094  0.671373 -0.725025  0.163310
2000-01-07  0.288030 -0.259337 -1.183515  0.473191
2000-01-08  0.162317 -0.771884 -0.285564 -0.692001
2000-01-09  1.147156 -0.302900  0.380851 -0.607976
2000-01-10  0.600216  0.885614  0.569808 -1.110113
```

Window functions are mainly used for finding trends within the data graphically by smoothing the curve. If there is a lot of variation in everyday data and many data points are available, taking samples and plotting is one method; applying the window computations and plotting a graph on the results is another.
By these methods, we can smooth the curve or the trend.
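Two further knobs worth knowing (a minimal sketch): min_periods lets a rolling window emit output before it is full, and .agg() applies several statistics over the same window at once.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# min_periods=1: the window produces results as soon as it has one value,
# so the leading NaNs seen earlier disappear.
partial = s.rolling(window=3, min_periods=1).mean()
print(partial.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]

# .agg with a list computes several window statistics in one pass.
stats = s.rolling(window=3).agg(["sum", "mean"])
print(stats)
```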