Function Application

To apply your own or another library's functions to pandas objects, you should be aware of three important methods. Which one to use depends on whether your function expects to operate on an entire DataFrame, row- or column-wise, or element-wise.

Table-wise function application: pipe()
Row- or column-wise function application: apply()
Element-wise function application: applymap() (named DataFrame.map() in pandas 2.1 and later)

Table-wise Function Application

Custom operations can be performed by passing the function and the appropriate number of parameters as pipe arguments. The DataFrame itself is passed to the function as its first argument, so the operation is performed on the whole DataFrame.

For example, add the value 2 to all the elements in the DataFrame.

The adder function

The adder function takes two numeric values as parameters and returns their sum.

    def adder(ele1, ele2):
        return ele1 + ele2

We will now use the custom function on the DataFrame.

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    df.pipe(adder, 2)

Let's see the full program:

    import pandas as pd
    import numpy as np

    def adder(ele1, ele2):
        return ele1 + ele2

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    # pipe() returns a new DataFrame, so print its result
    print(df.pipe(adder, 2))

Its output is as follows (the input is random, so the values differ on every run):

           col1      col2      col3
    0  2.176704  2.219691  1.509360
    1  2.222378  2.422167  3.953921
    2  2.241096  1.135424  2.696432
    3  2.355763  0.376672  1.182570
    4  2.308743  2.714767  2.130288

Row or Column Wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame (or, in old pandas versions, a Panel) using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument. By default, the operation is performed column-wise, taking each column as an array-like.
Example 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df.apply(np.mean))

Its output is as follows:

    col1   -0.288022
    col2    1.044839
    col3   -0.187009
    dtype: float64

By passing the axis parameter, operations can be performed row-wise.

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    row_means = df.apply(np.mean, axis=1)   # one mean per row (not printed)
    print(df.apply(np.mean))                # column means, shown below

Its output is as follows (the column means; row_means holds one value for each of the five rows):

    col1    0.034093
    col2   -0.152672
    col3   -0.229728
    dtype: float64

Example 3

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    col_ranges = df.apply(lambda x: x.max() - x.min())   # range of each column (not printed)
    print(df.apply(np.mean))                             # column means, shown below

Its output is as follows:

    col1   -0.167413
    col2   -0.370495
    col3   -0.707631
    dtype: float64

Element Wise Function Application

Not all functions can be vectorized, that is, accept a NumPy array and return another array or a value. For such cases, the methods applymap() on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value.

Example 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    # My custom function, applied to every element of one column (result not printed)
    scaled_col = df['col1'].map(lambda x: x * 100)
    print(df.apply(np.mean))

Its output is as follows (the column means of the unchanged df):

    col1    0.480742
    col2    0.454185
    col3    0.266563
    dtype: float64

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    # My custom function, applied to every element of the DataFrame (result not printed)
    scaled = df.applymap(lambda x: x * 100)
    print(df.apply(np.mean))

Its output is as follows (the column means of the unchanged df):

    col1    0.395263
    col2    0.204418
    col3   -0.795188
    dtype: float64
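The three kinds of function application can be compared side by side on a small fixed DataFrame. This is a minimal sketch in Python 3, not the tutorial's own listing: the data values and the variable names are made up for illustration, and the `hasattr` check is an assumption about the installed version (`DataFrame.map` exists from pandas 2.1, older releases only have `applymap`).

```python
import numpy as np
import pandas as pd

def adder(frame, amount):
    """Return the frame with `amount` added to every element."""
    return frame + amount

# Hypothetical, deterministic data so the results are easy to check by hand
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})

# Table-wise: the whole DataFrame is passed as the first argument
shifted = df.pipe(adder, 2)

# Column-wise (default axis=0) and row-wise (axis=1)
col_means = df.apply(np.mean)
row_means = df.apply(np.mean, axis=1)
col_ranges = df.apply(lambda col: col.max() - col.min())

# Element-wise on a Series
scaled_col = df["col1"].map(lambda x: x * 100)

# Element-wise on a DataFrame: pick whichever method this pandas version offers
elementwise = df.map if hasattr(df, "map") else df.applymap
scaled = elementwise(lambda x: x * 100)

print(shifted)
print(col_means)
print(row_means)
print(scaled)
```

Each call returns a new object and leaves `df` unchanged, which is why the results are assigned to names before being printed.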

Python Pandas – Sorting

There are two kinds of sorting available in pandas:

By label
By actual value

Let us consider an example with an output.

    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    print(unsorted_df)

Its output is as follows:

           col2      col1
    1 -2.063177  0.537527
    4  0.142932 -0.684884
    6  0.012667 -0.389340
    2 -0.548797  1.848743
    3 -1.044160  0.837381
    5  0.385605  1.300185
    9  1.031425 -1.002967
    8 -0.407374 -0.435142
    0  2.237453 -1.067139
    7 -1.445831 -1.701035

In unsorted_df, neither the labels nor the values are sorted. Let us see how these can be sorted.

By Label

The sort_index() method sorts a DataFrame by its labels; the axis argument selects row or column labels, and the ascending argument sets the order. By default, sorting is done on row labels in ascending order.

    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    sorted_df = unsorted_df.sort_index()
    print(sorted_df)

Its output is as follows:

           col2      col1
    0  0.208464  0.627037
    1  0.641004  0.331352
    2 -0.038067 -0.464730
    3 -0.638456 -0.021466
    4  0.014646 -0.737438
    5 -0.290761 -1.669827
    6 -0.797303 -0.018737
    7  0.525753  1.628921
    8 -0.567031  0.775951
    9  0.060724 -0.322425

Order of Sorting

By passing a Boolean value to the ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.
    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    sorted_df = unsorted_df.sort_index(ascending=False)
    print(sorted_df)

Its output is as follows:

           col2      col1
    9  0.825697  0.374463
    8 -1.699509  0.510373
    7 -0.581378  0.622958
    6 -0.202951  0.954300
    5 -1.289321 -1.551250
    4  1.302561  0.851385
    3 -0.157915 -0.388659
    2 -1.222295  0.166609
    1  0.584890 -0.291048
    0  0.668444 -0.061294

Sort the Columns

By passing the axis argument with the value 1, the sorting is done on the column labels. By default, axis=0, which sorts by row labels. Let us consider the following example to understand the same.

    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                               index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                               columns=['col2', 'col1'])
    sorted_df = unsorted_df.sort_index(axis=1)
    print(sorted_df)

Its output is as follows:

           col1      col2
    1 -0.291048  0.584890
    4  0.851385  1.302561
    6  0.954300 -0.202951
    2  0.166609 -1.222295
    3 -0.388659 -0.157915
    5 -1.551250 -1.289321
    9  0.374463  0.825697
    8  0.510373 -1.699509
    0 -0.061294  0.668444
    7  0.622958 -0.581378

By Value

Like index sorting, sort_values() is the method for sorting by values. It accepts a by argument, which names the DataFrame column whose values are used for sorting.

    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
    sorted_df = unsorted_df.sort_values(by='col1')
    print(sorted_df)

Its output is as follows:

       col1  col2
    1     1     3
    2     1     2
    3     1     4
    0     2     1

Observe that the col1 values are sorted, and that each row's col2 value and index label move along with its col1 value. This is why col2 and the index look unsorted.

The by argument also accepts a list of column names.
    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
    sorted_df = unsorted_df.sort_values(by=['col1', 'col2'])
    print(sorted_df)

Its output is as follows:

       col1  col2
    2     1     2
    1     1     3
    3     1     4
    0     2     1

Sorting Algorithm

sort_values() lets you choose the sorting algorithm through the kind parameter: mergesort, heapsort or quicksort. Of these three, mergesort is the only stable algorithm, meaning that rows with equal keys keep their original relative order.

    import pandas as pd
    import numpy as np

    unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
    sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort')
    print(sorted_df)

Its output is as follows:

       col1  col2
    1     1     3
    2     1     2
    3     1     4
    0     2     1
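The sorting options described in this chapter can be checked on a small DataFrame whose values are fixed rather than random. The following is a sketch in Python 3; the data, the index labels and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: row labels and column labels are both out of order
df = pd.DataFrame(
    {"col2": [1, 3, 2, 4], "col1": [2, 1, 1, 1]},
    index=[3, 0, 2, 1],
)

by_row_label = df.sort_index()                       # rows 0, 1, 2, 3
by_row_label_desc = df.sort_index(ascending=False)   # rows 3, 2, 1, 0
by_col_label = df.sort_index(axis=1)                 # columns col1, col2

# Stable sort: rows with equal col1 keep their original relative order
by_value = df.sort_values(by="col1", kind="mergesort")

# Ties in col1 are broken by col2
by_two_columns = df.sort_values(by=["col1", "col2"])

print(by_value)
print(by_two_columns)
```

With `kind="mergesort"` the three rows where `col1` equals 1 stay in the order 0, 2, 1 in which they appear in `df`; sorting by both columns reorders them by `col2` instead.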

Indexing & Selecting Data

In this chapter, we will discuss how to slice and dice the data and, more generally, how to get a subset of a pandas object.

The Python and NumPy indexing operator "[ ]" and the attribute operator "." provide quick and easy access to pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn't known in advance, directly using the standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.

Pandas supports three types of multi-axes indexing, listed in the following table:

    Sr.No  Indexing  Description
    1      .loc()    Label based
    2      .iloc()   Integer based
    3      .ix()     Both label and integer based

Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; it is described here for older code. All three are indexers used with square brackets, for example df.loc[...].

.loc()

.loc provides purely label-based indexing. When slicing by label, both the start and the stop bound are included. Integers are valid labels, but they refer to the label and not to the position.

.loc accepts several kinds of input:

A single scalar label
A list of labels
A slice object
A Boolean array

loc takes two single/list/range selectors separated by ','. The first one selects the rows and the second one selects the columns.
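The difference between label-based and position-based selection is easiest to see on a small DataFrame with string labels. The sketch below uses Python 3 and made-up data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data with string row labels
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

one_column = df.loc[:, "A"]               # all rows, one column -> Series
label_slice = df.loc["a":"c"]             # label slice: stop label "c" IS included
some_cells = df.loc[["a", "d"], ["B"]]    # list of row labels, list of column labels
mask_rows = df.loc[df["A"] > 2]           # Boolean array selects rows c and d

position_slice = df.iloc[0:2]             # position slice: stop position 2 is NOT included

print(label_slice)
print(position_slice)
```

`df.loc["a":"c"]` returns three rows while `df.iloc[0:2]` returns two, which is the inclusive-versus-exclusive slicing difference between the two indexers.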
Example 1

    # import the pandas library and alias it as pd
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
                      index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                      columns=['A', 'B', 'C', 'D'])

    # select all rows for a specific column
    print(df.loc[:, 'A'])

Its output is as follows:

    a    0.391548
    b   -0.070649
    c   -0.317212
    d   -2.162406
    e    2.202797
    f    0.613709
    g    1.050559
    h    1.122680
    Name: A, dtype: float64

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
                      index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                      columns=['A', 'B', 'C', 'D'])

    # select all rows for multiple columns, given as a list
    print(df.loc[:, ['A', 'C']])

Its output is as follows:

              A         C
    a  0.391548  0.745623
    b -0.070649  1.620406
    c -0.317212  1.448365
    d -2.162406 -0.873557
    e  2.202797  0.528067
    f  0.613709  0.286414
    g  1.050559  0.216526
    h  1.122680 -1.621420

Example 3

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
                      index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                      columns=['A', 'B', 'C', 'D'])

    # select a few rows for multiple columns, both given as lists
    print(df.loc[['a', 'b', 'f', 'h'], ['A', 'C']])

Its output is as follows:

              A         C
    a  0.391548  0.745623
    b -0.070649  1.620406
    f  0.613709  0.286414
    h  1.122680 -1.621420

Example 4

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
                      index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                      columns=['A', 'B', 'C', 'D'])

    # select a range of rows for all columns
    print(df.loc['a':'h'])

Its output is as follows:

              A         B         C         D
    a  0.391548 -0.224297  0.745623  0.054301
    b -0.070649 -0.880130  1.620406  1.419743
    c -0.317212 -1.929698  1.448365  0.616899
    d -2.162406  0.614256 -0.873557  1.093958
    e  2.202797 -2.315915  0.528067  0.612482
    f  0.613709 -0.157674  0.286414 -0.500517
    g  1.050559 -2.272099  0.216526  0.928449
    h  1.122680  0.324368 -1.621420 -0.741470

Example 5
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
                      index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                      columns=['A', 'B', 'C', 'D'])

    # compare the values of one row with 0 to get a Boolean Series
    print(df.loc['a'] > 0)

Its output is as follows:

    A    False
    B     True
    C    False
    D    False
    Name: a, dtype: bool

.iloc()

.iloc provides purely integer-based indexing. As in Python and NumPy, positions are 0-based. The various access methods are as follows:

An integer
A list of integers
A range of values

Example 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

    # select the first four rows, all columns
    print(df.iloc[:4])

Its output is as follows:

              A         B         C         D
    0  0.699435  0.256239 -1.270702 -0.645195
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806
    3  0.539042 -1.284314  0.826977 -0.026251

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

    # integer slicing
    print(df.iloc[:4])
    print(df.iloc[1:5, 2:4])

Its output is as follows:

              A         B         C         D
    0  0.699435  0.256239 -1.270702 -0.645195
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806
    3  0.539042 -1.284314  0.826977 -0.026251

              C         D
    1 -0.813012  0.631615
    2  0.025070  0.230806
    3  0.826977 -0.026251
    4  1.423332  1.130568

Example 3

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

    # selecting with lists of positions and with slices
    print(df.iloc[[1, 3, 5], [1, 3]])
    print(df.iloc[1:3, :])
    print(df.iloc[:, 1:3])

Its output is as follows:

              B         D
    1  0.890791  0.631615
    3 -1.284314 -0.026251
    5 -0.512888 -0.518930

              A         B         C         D
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806

              B         C
    0  0.256239 -1.270702
    1  0.890791 -0.813012
    2 -0.531378  0.025070
    3 -1.284314  0.826977
    4 -0.460729  1.423332
    5 -0.512888  0.581409
    6 -1.204853  0.098060
    7 -0.947857  0.641358

.ix()

Besides pure label-based and integer-based indexing, older pandas versions provide a hybrid indexer, .ix, for selecting and subsetting an object. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the following examples run only on old versions; use .loc or .iloc in current code.

Example 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

    # integer slicing
    print(df.ix[:4])

Its output is as follows:

              A         B         C         D
    0  0.699435  0.256239 -1.270702 -0.645195
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806
    3  0.539042 -1.284314  0.826977 -0.026251

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

    # index slicing
    print(df.ix[:, 'A'])

Its output is as follows:

    0    0.699435
    1   -0.685354
    2   -0.783192
    3    0.539042
    4   -1.044209
    5   -1.415411
    6    1.062095
    7    0.994204
    Name: A, dtype: float64

Use of Notations

Getting values

Python Pandas – Merging/Joining

Pandas has full-featured, high-performance in-memory join operations that are idiomatically very similar to those of relational databases like SQL.

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=False)

Here, we have used the following parameters:

left: A DataFrame object.

right: Another DataFrame object.

on: Columns (names) to join on. Must be found in both the left and right DataFrame objects.

left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

right_index: Same usage as left_index for the right DataFrame.

how: One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method is described below.

sort: Sort the result DataFrame by the join keys in lexicographical order. In current pandas releases the default is False; leaving sorting off is substantially faster in many cases.

Let us now create two different DataFrames and perform the merging operations on them.
    # import the pandas library
    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(left)
    print(right)

Its output is as follows (the column order reflects the old pandas version used to produce it; current versions keep the order in which the dict keys were written):

         Name  id subject_id
    0    Alex   1       sub1
    1     Amy   2       sub2
    2   Allen   3       sub4
    3   Alice   4       sub6
    4  Ayoung   5       sub5

        Name  id subject_id
    0  Billy   1       sub2
    1  Brian   2       sub4
    2   Bran   3       sub3
    3  Bryce   4       sub6
    4  Betty   5       sub5

Merge Two DataFrames on a Key

    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(pd.merge(left, right, on='id'))

Its output is as follows:

       Name_x  id subject_id_x Name_y subject_id_y
    0    Alex   1         sub1  Billy         sub2
    1     Amy   2         sub2  Brian         sub4
    2   Allen   3         sub4   Bran         sub3
    3   Alice   4         sub6  Bryce         sub6
    4  Ayoung   5         sub5  Betty         sub5

Merge Two DataFrames on Multiple Keys

    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(pd.merge(left, right, on=['id', 'subject_id']))

Its output is as follows:

       Name_x  id subject_id Name_y
    0   Alice   4       sub6  Bryce
    1  Ayoung   5       sub5  Betty

Merge Using 'how' Argument

The how argument to merge specifies which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right table, the values in the joined table will be NA.
Here is a summary of the how options and their SQL equivalent names:

    Merge Method  SQL Equivalent    Description
    left          LEFT OUTER JOIN   Use keys from left object
    right         RIGHT OUTER JOIN  Use keys from right object
    outer         FULL OUTER JOIN   Use union of keys
    inner         INNER JOIN        Use intersection of keys

Left Join

    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(pd.merge(left, right, on='subject_id', how='left'))

Its output is as follows:

       Name_x  id_x subject_id Name_y  id_y
    0    Alex     1       sub1    NaN   NaN
    1     Amy     2       sub2  Billy   1.0
    2   Allen     3       sub4  Brian   2.0
    3   Alice     4       sub6  Bryce   4.0
    4  Ayoung     5       sub5  Betty   5.0

Right Join

    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(pd.merge(left, right, on='subject_id', how='right'))

Its output is as follows:

       Name_x  id_x subject_id Name_y  id_y
    0     Amy   2.0       sub2  Billy     1
    1   Allen   3.0       sub4  Brian     2
    2   Alice   4.0       sub6  Bryce     4
    3  Ayoung   5.0       sub5  Betty     5
    4     NaN   NaN       sub3   Bran     3

Outer Join

    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(pd.merge(left, right, how='outer', on='subject_id'))

Its output is as follows:

       Name_x  id_x subject_id Name_y  id_y
    0    Alex   1.0       sub1    NaN   NaN
    1     Amy   2.0       sub2  Billy   1.0
    2   Allen   3.0       sub4  Brian   2.0
    3   Alice   4.0       sub6  Bryce   4.0
    4  Ayoung   5.0       sub5  Betty   5.0
    5     NaN   NaN       sub3   Bran   3.0

Inner Join

An inner join keeps only the keys that appear in both DataFrames. (By contrast, the DataFrame.join() method joins on the index and honors the object on which it is called, so a.join(b) is not equal to b.join(a).)

    import pandas as pd

    left = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
    right = pd.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
    print(pd.merge(left, right, on='subject_id', how='inner'))

Its output is as follows:

       Name_x  id_x subject_id Name_y  id_y
    0     Amy     2       sub2  Billy     1
    1   Allen     3       sub4  Brian     2
    2   Alice     4       sub6  Bryce     4
    3  Ayoung     5       sub5  Betty     5
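The four how options can be compared in one short Python 3 script that reuses the left and right DataFrames of this chapter. This is a sketch rather than part of the original tutorial: the `indicator=True` argument is a merge option that the chapter does not cover, and the variable names are chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"],
})
right = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"],
})

# Number of result rows for each join type on the same key
row_counts = {
    how: len(pd.merge(left, right, on="subject_id", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)

# indicator=True adds a _merge column saying where each key was found
outer = pd.merge(left, right, on="subject_id", how="outer", indicator=True)
left_only = outer.loc[outer["_merge"] == "left_only", "subject_id"].tolist()
right_only = outer.loc[outer["_merge"] == "right_only", "subject_id"].tolist()
print(left_only, right_only)

# Joining on two keys keeps only rows where both id and subject_id agree
both_keys = pd.merge(left, right, on=["id", "subject_id"])
print(both_keys)
```

The row counts make the key sets visible: the inner join keeps the four shared subject ids, the left and right joins keep five each, and the outer join keeps all six.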

Python Pandas – Panel

A panel is a 3D container of data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the examples in this chapter run only on older versions.

The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are:

items: axis 0, each item corresponds to a DataFrame contained inside.

major_axis: axis 1, it is the index (rows) of each of the DataFrames.

minor_axis: axis 2, it is the columns of each of the DataFrames.

pandas.Panel()

A Panel can be created using the following constructor:

    pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

The parameters of the constructor are as follows:

    Parameter   Description
    data        Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame
    items       axis=0
    major_axis  axis=1
    minor_axis  axis=2
    dtype       Data type of each column
    copy        Copy data. Default, false

Create Panel

A Panel can be created in multiple ways, for example:

From ndarrays
From dict of DataFrames

From 3D ndarray

    # creating a panel from a 3D ndarray
    import pandas as pd
    import numpy as np

    data = np.random.rand(2, 4, 5)
    p = pd.Panel(data)
    print(p)

Its output is as follows:

    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
    Items axis: 0 to 1
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 4

Note: Compare the dimensions of this panel with those of the empty panel created later in this chapter; they differ.
From dict of DataFrame Objects

    # creating a panel from a dict of DataFrames
    import pandas as pd
    import numpy as np

    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p)

Its output is as follows:

    Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
    Items axis: Item1 to Item2
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 2

Create an Empty Panel

An empty panel can be created using the Panel constructor as follows:

    # creating an empty panel
    import pandas as pd

    p = pd.Panel()
    print(p)

Its output is as follows:

    <class 'pandas.core.panel.Panel'>
    Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
    Items axis: None
    Major_axis axis: None
    Minor_axis axis: None

Selecting the Data from Panel

Select the data from the panel using:

Items
Major_axis
Minor_axis

Using Items

    # selecting one item from a panel
    import pandas as pd
    import numpy as np

    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p['Item1'])

Its output is as follows:

              0         1         2
    0  0.488224 -0.128637  0.930817
    1  0.417497  0.896681  0.576657
    2 -2.775266  0.571668  0.290082
    3 -0.400538 -0.144234  1.110535

We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the Major_axis and Minor_axis dimensions.

Using major_axis

Data can be accessed using the method panel.major_xs(index).

    # selecting one major_axis label across all items
    import pandas as pd
    import numpy as np

    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.major_xs(1))

Its output is as follows:

          Item1     Item2
    0  0.417497  0.748412
    1  0.896681 -0.557322
    2  0.576657       NaN

Using minor_axis

Data can be accessed using the method panel.minor_xs(index).
    # selecting one minor_axis label across all items
    import pandas as pd
    import numpy as np

    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.minor_xs(1))

Its output is as follows:

          Item1     Item2
    0 -0.128637 -1.047032
    1  0.896681 -0.557322
    2  0.571668  0.431953
    3 -0.144234  1.302466

Note: Observe the changes in the dimensions.
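Because Panel no longer exists in current pandas, the same three selections can be approximated with a DataFrame that has a two-level row index. The following Python 3 sketch is one possible substitute, not the Panel API itself: it stacks a dict of DataFrames with pd.concat, and the level names "item" and "major" as well as the data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two "items" with the same rows but different column counts
items = {
    "Item1": pd.DataFrame(np.arange(12).reshape(4, 3)),
    "Item2": pd.DataFrame(np.arange(8).reshape(4, 2)),
}

# Stack the DataFrames; the dict keys become the outer index level
stacked = pd.concat(items, names=["item", "major"])

# Counterpart of p['Item1']: one item as an ordinary DataFrame
item1 = stacked.loc["Item1"]

# Counterpart of p.major_xs(1): row label 1 of every item, items as columns
major_1 = stacked.xs(1, level="major").T

# Counterpart of p.minor_xs(1): column 1 of every item, items as columns
minor_1 = stacked[1].unstack("item")

print(item1)
print(major_1)
print(minor_1)
```

As in the Panel output above, the missing third column of Item2 shows up as NaN in the major-axis cross-section.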

Python Pandas – Caveats & Gotchas

A caveat is a warning, and a gotcha is an unexpected problem.

Using If/Truth Statement with Pandas

Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, and not. It is not clear what the result should be. Should it be True because it is not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:

    import pandas as pd

    if pd.Series([False, True, False]):
        print('I am True')

Its output is as follows:

    ValueError: The truth value of a Series is ambiguous.
    Use a.empty, a.bool(), a.item(), a.any() or a.all().

In an if condition, it is unclear what to do with a whole Series. The error message lists the alternatives: state explicitly what you mean, for example by reducing the Series with any() or all(), or by testing whether it is empty or None.

    import pandas as pd

    if pd.Series([False, True, False]).any():
        print('I am any')

Its output is as follows:

    I am any

To evaluate single-element pandas objects in a Boolean context, use the method .bool() (deprecated since pandas 2.1):

    import pandas as pd

    print(pd.Series([True]).bool())

Its output is as follows:

    True

Bitwise Boolean

Element-wise comparison operators like == and != return a Boolean Series, which is almost always what is required anyway.

    import pandas as pd

    s = pd.Series(range(5))
    print(s == 4)

Its output is as follows:

    0    False
    1    False
    2    False
    3    False
    4     True
    dtype: bool

isin Operation

This returns a Boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.
    import pandas as pd

    s = pd.Series(list('abc'))
    s = s.isin(['a', 'c', 'e'])
    print(s)

Its output is as follows:

    0     True
    1    False
    2     True
    dtype: bool

Reindexing vs ix Gotcha

Many users will find themselves using the ix indexing capabilities as a concise means of selecting data from a pandas object (.ix was removed in pandas 1.0; on current versions, .loc gives the same label-based selection):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 4),
                      columns=['one', 'two', 'three', 'four'],
                      index=list('abcdef'))
    print(df)
    print(df.ix[['b', 'c', 'e']])

Its output is as follows:

            one       two     three      four
    a -1.582025  1.335773  0.961417 -1.272084
    b  1.461512  0.111372 -0.072225  0.553058
    c -1.240671  0.762185  1.511936 -0.630920
    d -2.380648 -0.029981  0.196489  0.531714
    e  1.846746  0.148149  0.275398 -0.244559
    f -1.842662 -0.933195  2.303949  0.677641

            one       two     three      four
    b  1.461512  0.111372 -0.072225  0.553058
    c -1.240671  0.762185  1.511936 -0.630920
    e  1.846746  0.148149  0.275398 -0.244559

This is, of course, completely equivalent in this case to using the reindex method:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 4),
                      columns=['one', 'two', 'three', 'four'],
                      index=list('abcdef'))
    print(df)
    print(df.reindex(['b', 'c', 'e']))

Its output is as follows:

            one       two     three      four
    a  1.639081  1.369838  0.261287 -1.662003
    b -0.173359  0.242447 -0.494384  0.346882
    c -0.106411  0.623568  0.282401 -0.916361
    d -1.078791 -0.612607 -0.897289 -1.146893
    e  0.465215  1.552873 -1.841959  0.329404
    f  0.966022 -0.190077  1.324247  0.678064

            one       two     three      four
    b -0.173359  0.242447 -0.494384  0.346882
    c -0.106411  0.623568  0.282401 -0.916361
    e  0.465215  1.552873 -1.841959  0.329404

Some might conclude that ix and reindex are 100% equivalent based on this. This is true except in the case of integer indexing.
For example, the above selection can also be attempted with integers, and the two methods then give different results:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 4),
                      columns=['one', 'two', 'three', 'four'],
                      index=list('abcdef'))
    print(df)
    print(df.ix[[1, 2, 4]])
    print(df.reindex([1, 2, 4]))

Its output is as follows:

            one       two     three      four
    a -1.015695 -0.553847  1.106235 -0.784460
    b -0.527398 -0.518198 -0.710546 -0.512036
    c -0.842803 -1.050374  0.787146  0.205147
    d -1.238016 -0.749554 -0.547470 -0.029045
    e -0.056788  1.063999 -0.767220  0.212476
    f  1.139714  0.036159  0.201912  0.710119

            one       two     three      four
    b -0.527398 -0.518198 -0.710546 -0.512036
    c -0.842803 -1.050374  0.787146  0.205147
    e -0.056788  1.063999 -0.767220  0.212476

       one  two  three  four
    1  NaN  NaN    NaN   NaN
    2  NaN  NaN    NaN   NaN
    4  NaN  NaN    NaN   NaN

Here ix treats the integers as positions, because the index holds no integer labels, whereas reindex looks for the labels 1, 2 and 4 and finds none. It is important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings.
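The caveats of this chapter can be reproduced in Python 3 without the removed .ix indexer. The sketch below uses small made-up data; `.loc` and `.iloc` stand in for the label-based and position-based halves of `.ix`, and the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([False, True, False])

# Using a Series directly in a Boolean context raises ValueError
try:
    if s:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# Say explicitly what is meant instead
any_true = bool(s.any())     # at least one True value
all_true = bool(s.all())     # every value True
is_empty = s.empty           # no elements at all

# isin: element-wise membership test
membership = pd.Series(list("abc")).isin(["a", "c", "e"]).tolist()

# Label-based versus position-based selection, and strict-label reindex
df = pd.DataFrame({"one": [1, 2, 3]}, index=list("abc"))
by_label = df.loc[["b", "c"]]
by_position = df.iloc[[1, 2]]
reindexed = df.reindex([1, 2])   # labels 1 and 2 do not exist -> NaN rows

print(ambiguous, any_true, all_true, is_empty)
print(membership)
print(reindexed)
```

`by_label` and `by_position` hold the same rows, while `reindexed` contains only NaN because reindex never falls back to positions.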

Python Pandas – Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

pandas.Series

A pandas Series can be created using the following constructor:

    pandas.Series(data, index, dtype, copy)

The parameters of the constructor are as follows:

    Sr.No  Parameter  Description
    1      data       data takes various forms like ndarray, list, constants
    2      index      Index values must be unique and hashable, same length as data. Default np.arange(n) if no index is passed.
    3      dtype      dtype is for data type. If None, data type will be inferred
    4      copy       Copy data. Default False

A series can be created using various inputs like:

Array
Dict
Scalar value or constant

Create an Empty Series

The most basic series that can be created is an empty Series.

Example

    # import the pandas library and alias it as pd
    import pandas as pd

    s = pd.Series()
    print(s)

Its output is as follows:

    Series([], dtype: float64)

Create a Series from ndarray

If data is an ndarray, then the index passed must be of the same length. If no index is passed, then by default the index will be range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array)-1].

Example 1

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    print(s)

Its output is as follows:

    0    a
    1    b
    2    c
    3    d
    dtype: object

We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3.

Example 2

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data, index=[100, 101, 102, 103])
    print(s)

Its output is as follows:

    100    a
    101    b
    102    c
    103    d
    dtype: object

We passed the index values here. Now we can see the customized index values in the output.
Create a Series from dict

A dict can be passed as input. If no index is specified, the dictionary keys are used to construct the index (in sorted order in old pandas versions, in insertion order in current ones). If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

Example 1

    import pandas as pd
    import numpy as np

    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data)
    print(s)

Its output is as follows:

    a    0.0
    b    1.0
    c    2.0
    dtype: float64

Observe: Dictionary keys are used to construct the index.

Example 2

    import pandas as pd
    import numpy as np

    data = {'a': 0., 'b': 1., 'c': 2.}
    s = pd.Series(data, index=['b', 'c', 'd', 'a'])
    print(s)

Its output is as follows:

    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64

Observe: The index order is preserved and the missing element is filled with NaN (Not a Number).

Create a Series from Scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

    import pandas as pd
    import numpy as np

    s = pd.Series(5, index=[0, 1, 2, 3])
    print(s)

Its output is as follows:

    0    5
    1    5
    2    5
    3    5
    dtype: int64

Accessing Data from Series with Position

Data in the series can be accessed similarly to that in an ndarray.

Example 1

Retrieve the first element. As we already know, counting starts from zero for an array, which means the first element is stored at the zeroth position, and so on.

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # retrieve the first element (s.iloc[0] is the explicit positional form)
    print(s[0])

Its output is as follows:

    1

Example 2

Retrieve the first three elements in the Series. If a : is placed before an index, as in s[:3], all items up to but not including that index are extracted; if it is placed after an index, as in s[3:], all items from that index onwards are extracted.
If two indexes are given with a : between them, the items between the two indexes (not including the stop index) are extracted.

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # retrieve the first three elements
    print(s[:3])

Its output is as follows:

    a    1
    b    2
    c    3
    dtype: int64

Example 3

Retrieve the last three elements.

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # retrieve the last three elements
    print(s[-3:])

Its output is as follows:

    c    3
    d    4
    e    5
    dtype: int64

Retrieve Data Using Label (Index)

A Series is like a fixed-size dict in that you can get and set values by index label.

Example 1

Retrieve a single element using an index label value.

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # retrieve a single element
    print(s['a'])

Its output is as follows:

    1

Example 2

Retrieve multiple elements using a list of index label values.

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # retrieve multiple elements
    print(s[['a', 'c', 'd']])

Its output is as follows:

    a    1
    c    3
    d    4
    dtype: int64

Example 3

If a label is not contained in the index, an exception is raised.

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

    # try to retrieve a label that does not exist
    print(s['f'])

Its output is as follows:

    …
    KeyError: 'f'
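The ways of creating and reading a Series described in this chapter can be collected into one Python 3 sketch. It uses the same small inputs as the examples above; the variable names are illustrative, and `.iloc` is used for positional access because it is unambiguous when the index holds string labels.

```python
import numpy as np
import pandas as pd

# Creation from an ndarray, a dict and a scalar
from_array = pd.Series(np.array(["a", "b", "c", "d"]), index=[100, 101, 102, 103])
from_dict = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])
from_scalar = pd.Series(5, index=[0, 1, 2, 3])

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Access by position
first = s.iloc[0]
first_three = s.iloc[:3]      # positions 0, 1, 2 (stop position excluded)
last_three = s.iloc[-3:]

# Access by label
by_label = s["a"]
several = s[["a", "c", "d"]]

# A missing label raises KeyError
try:
    s["f"]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(from_dict)
print(first_three)
print(several)
```

In `from_dict` the label "d" has no entry in the dict, so its value is NaN, while the other labels appear in the order given by the index argument.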

Python Pandas – DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame

Columns can be of different types
Size-mutable
Labeled axes (rows and columns)
Can perform arithmetic operations on rows and columns

Structure

Let us assume that we are creating a data frame with student's data. You can think of it as an SQL table or a spreadsheet data representation.

pandas.DataFrame

A pandas DataFrame can be created using the following constructor −

    pandas.DataFrame(data, index, columns, dtype, copy)

The parameters of the constructor are as follows −

    Sr.No  Parameter  Description
    1      data       Takes various forms such as ndarray, Series, map, lists, dict, constants and also another DataFrame.
    2      index      Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
    3      columns    Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
    4      dtype      Data type to force on the columns. Only a single dtype is allowed.
    5      copy       Used for copying the input data; the default is False.

Create DataFrame

A pandas DataFrame can be created using various inputs like −

Lists
dict
Series
NumPy ndarrays
Another DataFrame

In the subsequent sections of this chapter, we will see how to create a DataFrame using these inputs.

Create an Empty DataFrame

The most basic DataFrame that can be created is an empty DataFrame.

Example

    # import the pandas library, aliased as pd
    import pandas as pd
    df = pd.DataFrame()
    print(df)

Its output is as follows −

    Empty DataFrame
    Columns: []
    Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.
Example 1

    import pandas as pd
    data = [1, 2, 3, 4, 5]
    df = pd.DataFrame(data)
    print(df)

Its output is as follows −

       0
    0  1
    1  2
    2  3
    3  4
    4  5

Example 2

    import pandas as pd
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)

Its output is as follows −

         Name  Age
    0    Alex   10
    1     Bob   12
    2  Clarke   13

Example 3

    import pandas as pd
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])

    # convert only the numeric column to floating point
    df['Age'] = df['Age'].astype(float)
    print(df)

Its output is as follows −

         Name   Age
    0    Alex  10.0
    1     Bob  12.0
    2  Clarke  13.0

Note − Observe, astype changes the type of the Age column to floating point. Passing dtype=float to the constructor would apply to every column, and recent pandas versions may reject it because the Name strings cannot be converted.

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

If no index is passed, then by default the index will be range(n), where n is the array length.

Example 1

    import pandas as pd
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data)
    print(df)

Its output is as follows (current pandas versions keep the dictionary's key order for the columns; older versions sorted them) −

        Name  Age
    0    Tom   28
    1   Jack   34
    2  Steve   29
    3  Ricky   42

Note − Observe the values 0, 1, 2, 3. They are the default index assigned to the rows using range(n).

Example 2

Let us now create an indexed DataFrame using arrays.

    import pandas as pd
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
    print(df)

Its output is as follows −

            Name  Age
    rank1    Tom   28
    rank2   Jack   34
    rank3  Steve   29
    rank4  Ricky   42

Note − Observe, the index parameter assigns a label to each row.

Create a DataFrame from List of Dicts

A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default.

Example 1

The following example shows how to create a DataFrame by passing a list of dictionaries.
    import pandas as pd
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)

Its output is as follows −

       a   b     c
    0  1   2   NaN
    1  5  10  20.0

Note − Observe, NaN (Not a Number) is filled in where a value is missing.

Example 2

The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.

    import pandas as pd
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data, index=['first', 'second'])
    print(df)

Its output is as follows −

            a   b     c
    first   1   2   NaN
    second  5  10  20.0

Example 3

The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

    import pandas as pd
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

    # two column labels, both matching dictionary keys
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

    # two column labels, one of which (b1) is not a dictionary key
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
    print(df1)
    print(df2)

Its output is as follows −

    # df1 output
            a   b
    first   1   2
    second  5  10

    # df2 output
            a  b1
    first   1 NaN
    second  5 NaN

Note − Observe, df2 is created with a column label (b1) that is not a dictionary key, so that column is filled with NaN. df1 is created with column labels that match the dictionary keys, so no NaN values appear.

Create a DataFrame from Dict of Series

A dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the Series indexes passed.

Example

    import pandas as pd
    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)

Its output is as follows −

       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4

Note − Observe, the series one has no label 'd', so in the result the d row holds NaN in column one.

Let us now understand

Introduction to Data Structures

Pandas deals with the following three data structures −

Series
DataFrame
Panel

These data structures are built on top of NumPy arrays, which means they are fast.

Dimension & Description

The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame is a container of Series, and Panel is a container of DataFrame.

    Data Structure  Dimensions  Description
    Series          1           1D labeled homogeneous array, size-immutable.
    DataFrame       2           General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
    Panel           3           General 3D labeled, size-mutable array.

Building and handling arrays of two or more dimensions is a tedious task, because the burden is placed on the user to consider the orientation of the data set when writing functions. Pandas data structures reduce this mental effort.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.

Mutability

All Pandas data structures are value-mutable (their values can be changed), and all except Series are size-mutable. Series is size-immutable.

Note − DataFrame is widely used and is one of the most important data structures. Panel is used much less; it was deprecated in pandas 0.20 and removed in pandas 0.25, so it applies only to older versions.

Series

Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, …

    10  23  56  17  52  61  73  90  26  72

Key Points

Homogeneous data
Size-immutable
Values of data mutable

DataFrame

DataFrame is a two-dimensional array with heterogeneous data. For example,

    Name   Age  Gender  Rating
    Steve  32   Male    3.45
    Lia    28   Female  4.6
    Vin    45   Male    3.9
    Katie  38   Female  2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns.
Each column represents an attribute and each row represents a person.

Data Type of Columns

The data types of the four columns are as follows −

    Column  Type
    Name    String
    Age     Integer
    Gender  String
    Rating  Float

Key Points

Heterogeneous data
Size-mutable
Data mutable

Panel

Panel is a three-dimensional data structure with heterogeneous data. It is hard to show a panel graphically, but it can be illustrated as a container of DataFrames.

Key Points

Heterogeneous data
Size-mutable
Data mutable
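The idea that a DataFrame is a container of Series with per-column types can be sketched with the sales-team table from this chapter. This is an illustrative sketch only; the extra column name Senior and the age threshold of 40 are assumptions made up for the demonstration.

```python
import pandas as pd

# Sales-team table from the text, built column by column
df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78],
})

# Each column carries its own data type
print(df.dtypes)

# Selecting a single column gives back a Series
ages = df['Age']
print(type(ages).__name__)

# Size-mutable: a new column can be added to an existing DataFrame
# ('Senior' and the threshold 40 are hypothetical, for illustration only)
df['Senior'] = df['Age'] > 40
print(df.shape)
```

Printing df.dtypes is a quick way to confirm which columns pandas treated as integers, floats, or strings.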

Python Pandas – Environment Setup

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip.

    pip install pandas

If you install the Anaconda Python package, Pandas is installed by default. The following distributions are available −

Windows

Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.

Canopy (https://www.enthought.com/products/canopy/) is available as a free as well as a commercial distribution with the full SciPy stack for Windows, Linux and Mac.

Python (x,y) is a free Python distribution with the SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)

Linux

The package managers of the respective Linux distributions are used to install one or more packages of the SciPy stack.

For Ubuntu Users

    sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

For Fedora Users

    sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
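After installing, it can help to confirm that the interpreter you are using actually sees the package. The following is a minimal sketch using only the standard library; the helper name pandas_available is hypothetical and not part of pandas.

```python
import importlib.util


def pandas_available():
    """Return True if the pandas package can be found by the current interpreter."""
    return importlib.util.find_spec("pandas") is not None


if pandas_available():
    import pandas as pd
    print("pandas version:", pd.__version__)
else:
    print("pandas is not installed; try: pip install pandas")
```

If several Python installations exist on the machine, run this with the same interpreter that pip installed into.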