Python Pandas – Indexing and Selecting Data

In this chapter, we will discuss how to slice and dice the data and generally get a subset of a pandas object.

The Python and NumPy indexing operators "[ ]" and the attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.

Pandas supports three types of multi-axes indexing, listed in the following table −

    Sr.No   Indexing   Description
    1       .loc[]     Label based
    2       .iloc[]    Integer based
    3       .ix[]      Both label and integer based (deprecated in pandas 0.20, removed in pandas 1.0)

.loc[]

Pandas provides various methods for purely label based indexing. When slicing by label, both the start bound and the stop bound are included. Integers are valid labels, but they refer to the label and not the position.

.loc[] accepts several kinds of input −

A single scalar label
A list of labels
A slice object
A Boolean array

.loc[] takes two single/list/range selectors separated by ",". The first one selects the rows and the second one selects the columns.
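The difference between label based and position based selection is easiest to see when the index holds integers that are not the positions. The following is a minimal sketch; the Series and its labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: integer labels that deliberately differ from the positions
s = pd.Series(['x', 'y', 'z', 'w'], index=[10, 20, 30, 40])

by_label = s.loc[10:30]      # label slice: both 10 and 30 are included
by_position = s.iloc[0:2]    # position slice: stop position 2 is excluded
single = s.loc[20]           # the element labelled 20, not the 21st element

print(by_label.tolist())
print(by_position.tolist())
print(single)
```

Here the label slice returns three elements while the position slice returns two, which is the inclusive/exclusive difference described above.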
The examples below use random data, so the values you see will differ from the outputs shown.

Example 1

    # import the pandas library, aliased as pd
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
       index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

    # select all rows for a specific column
    print(df.loc[:, 'A'])

Its output is as follows −

    a    0.391548
    b   -0.070649
    c   -0.317212
    d   -2.162406
    e    2.202797
    f    0.613709
    g    1.050559
    h    1.122680
    Name: A, dtype: float64

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
       index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

    # select all rows for multiple columns, given as a list
    print(df.loc[:, ['A', 'C']])

Its output is as follows −

              A         C
    a  0.391548  0.745623
    b -0.070649  1.620406
    c -0.317212  1.448365
    d -2.162406 -0.873557
    e  2.202797  0.528067
    f  0.613709  0.286414
    g  1.050559  0.216526
    h  1.122680 -1.621420

Example 3

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
       index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

    # select a few rows for multiple columns, both given as lists
    print(df.loc[['a', 'b', 'f', 'h'], ['A', 'C']])

Its output is as follows −

              A         C
    a  0.391548  0.745623
    b -0.070649  1.620406
    f  0.613709  0.286414
    h  1.122680 -1.621420

Example 4

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
       index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

    # select a range of rows for all columns
    print(df.loc['a':'h'])

Its output is as follows −

              A         B         C         D
    a  0.391548 -0.224297  0.745623  0.054301
    b -0.070649 -0.880130  1.620406  1.419743
    c -0.317212 -1.929698  1.448365  0.616899
    d -2.162406  0.614256 -0.873557  1.093958
    e  2.202797 -2.315915  0.528067  0.612482
    f  0.613709 -0.157674  0.286414 -0.500517
    g  1.050559 -2.272099  0.216526  0.928449
    h  1.122680  0.324368 -1.621420 -0.741470

Example 5

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4),
       index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

    # build a Boolean array from one row
    print(df.loc['a'] > 0)

Its output is as follows −

    A    False
    B     True
    C    False
    D    False
    Name: a, dtype: bool

.iloc[]

Pandas provides various methods for purely integer based indexing. As in Python and NumPy, positions are 0-based.

The various access methods are as follows −

An integer
A list of integers
A range of values

Example 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

    # select the first four rows for all columns
    print(df.iloc[:4])

Its output is as follows −

              A         B         C         D
    0  0.699435  0.256239 -1.270702 -0.645195
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806
    3  0.539042 -1.284314  0.826977 -0.026251

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

    # integer slicing
    print(df.iloc[:4])
    print(df.iloc[1:5, 2:4])

Its output is as follows −

              A         B         C         D
    0  0.699435  0.256239 -1.270702 -0.645195
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806
    3  0.539042 -1.284314  0.826977 -0.026251

              C         D
    1 -0.813012  0.631615
    2  0.025070  0.230806
    3  0.826977 -0.026251
    4  1.423332  1.130568

Example 3

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

    # selecting with lists of positions and with slices
    print(df.iloc[[1, 3, 5], [1, 3]])
    print(df.iloc[1:3, :])
    print(df.iloc[:, 1:3])

Its output is as follows −

              B         D
    1  0.890791  0.631615
    3 -1.284314 -0.026251
    5 -0.512888 -0.518930

              A         B         C         D
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806

              B         C
    0  0.256239 -1.270702
    1  0.890791 -0.813012
    2 -0.531378  0.025070
    3 -1.284314  0.826977
    4 -0.460729  1.423332
    5 -0.512888  0.581409
    6 -1.204853  0.098060
    7 -0.947857  0.641358

.ix[]

Besides pure label based and integer based indexing, older versions of Pandas provided a hybrid method for selecting and subsetting an object using the .ix[] operator. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the two examples below run only on older releases; new code should use .loc[] or .iloc[].

Example 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

    # integer slicing (requires pandas < 1.0)
    print(df.ix[:4])

Its output is as follows −

              A         B         C         D
    0  0.699435  0.256239 -1.270702 -0.645195
    1 -0.685354  0.890791 -0.813012  0.631615
    2 -0.783192 -0.531378  0.025070  0.230806
    3  0.539042 -1.284314  0.826977 -0.026251

Example 2

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

    # label based column selection (requires pandas < 1.0)
    print(df.ix[:, 'A'])

Its output is as follows −

    0    0.699435
    1   -0.685354
    2   -0.783192
    3    0.539042
    4   -1.044209
    5   -1.415411
    6    1.062095
    7    0.994204
    Name: A, dtype: float64

Use of Notations

Getting values
Python Pandas – Merging/Joining

Pandas has full-featured, high performance in-memory join operations that are idiomatically very similar to relational databases like SQL.

Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
       left_index=False, right_index=False, sort=False)

Here, we have used the following parameters −

left − A DataFrame object.

right − Another DataFrame object.

on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.

left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

right_index − Same usage as left_index for the right DataFrame.

how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method is described below.

sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False in the function signature; leaving it off improves performance substantially in many cases.

Let us now create two different DataFrames and perform the merging operations on them.

    # import the pandas library
    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(left)
    print(right)

Its output is as follows −

         Name  id subject_id
    0    Alex   1       sub1
    1     Amy   2       sub2
    2   Allen   3       sub4
    3   Alice   4       sub6
    4  Ayoung   5       sub5

        Name  id subject_id
    0  Billy   1       sub2
    1  Brian   2       sub4
    2   Bran   3       sub3
    3  Bryce   4       sub6
    4  Betty   5       sub5

Merge Two DataFrames on a Key

    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(pd.merge(left, right, on='id'))

Its output is as follows −

       Name_x  id subject_id_x Name_y subject_id_y
    0    Alex   1         sub1  Billy         sub2
    1     Amy   2         sub2  Brian         sub4
    2   Allen   3         sub4   Bran         sub3
    3   Alice   4         sub6  Bryce         sub6
    4  Ayoung   5         sub5  Betty         sub5

Merge Two DataFrames on Multiple Keys

    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(pd.merge(left, right, on=['id', 'subject_id']))

Its output is as follows −

       Name_x  id subject_id Name_y
    0   Alice   4       sub6  Bryce
    1  Ayoung   5       sub5  Betty

Merge Using 'how' Argument

The how argument to merge specifies which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right table, the corresponding values in the joined table will be NA.

Here is a summary of the how options and their SQL equivalent names −

    Merge Method   SQL Equivalent     Description
    left           LEFT OUTER JOIN    Use keys from left object
    right          RIGHT OUTER JOIN   Use keys from right object
    outer          FULL OUTER JOIN    Use union of keys
    inner          INNER JOIN         Use intersection of keys

Left Join

    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(pd.merge(left, right, on='subject_id', how='left'))

Its output is as follows −

       Name_x  id_x subject_id Name_y  id_y
    0    Alex     1       sub1    NaN   NaN
    1     Amy     2       sub2  Billy   1.0
    2   Allen     3       sub4  Brian   2.0
    3   Alice     4       sub6  Bryce   4.0
    4  Ayoung     5       sub5  Betty   5.0

Right Join

    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(pd.merge(left, right, on='subject_id', how='right'))

Its output is as follows −

       Name_x  id_x subject_id Name_y  id_y
    0     Amy   2.0       sub2  Billy     1
    1   Allen   3.0       sub4  Brian     2
    2   Alice   4.0       sub6  Bryce     4
    3  Ayoung   5.0       sub5  Betty     5
    4     NaN   NaN       sub3   Bran     3

Outer Join

    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(pd.merge(left, right, how='outer', on='subject_id'))

Its output is as follows −

       Name_x  id_x subject_id Name_y  id_y
    0    Alex   1.0       sub1    NaN   NaN
    1     Amy   2.0       sub2  Billy   1.0
    2   Allen   3.0       sub4  Brian   2.0
    3   Alice   4.0       sub6  Bryce   4.0
    4  Ayoung   5.0       sub5  Betty   5.0
    5     NaN   NaN       sub3   Bran   3.0

Inner Join

An inner join keeps only the keys that appear in both DataFrames. Note that the DataFrame.join() method, by contrast, joins on the index and honors the object on which it is called, so a.join(b) is not equal to b.join(a).

    import pandas as pd

    left = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
       'subject_id':['sub1','sub2','sub4','sub6','sub5']})
    right = pd.DataFrame({
       'id':[1,2,3,4,5],
       'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
       'subject_id':['sub2','sub4','sub3','sub6','sub5']})
    print(pd.merge(left, right, on='subject_id', how='inner'))

Its output is as follows −

       Name_x  id_x subject_id Name_y  id_y
    0     Amy     2       sub2  Billy     1
    1   Allen     3       sub4  Brian     2
    2   Alice     4       sub6  Bryce     4
    3  Ayoung     5       sub5  Betty     5
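The remark that a.join(b) is not equal to b.join(a) can be illustrated with a short sketch. The two frames below are hypothetical (they reuse a few names from this chapter, indexed by subject), and the suffixes are chosen only for readability.

```python
import pandas as pd

# Hypothetical frames indexed by subject; join() aligns on the index
a = pd.DataFrame({'Name': ['Alex', 'Amy']}, index=['sub1', 'sub2'])
b = pd.DataFrame({'Name': ['Billy', 'Brian']}, index=['sub2', 'sub4'])

# join() defaults to a left join, so the caller's index decides which rows survive
ab = a.join(b, lsuffix='_a', rsuffix='_b')
ba = b.join(a, lsuffix='_b', rsuffix='_a')

print(ab)
print(ba)
```

The first result keeps the rows sub1 and sub2 (the index of a), while the second keeps sub2 and sub4 (the index of b); labels missing on the other side are filled with NaN.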
Python Pandas – Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the objects.

In short, basic iteration (for i in object) produces −

Series − values
DataFrame − column labels
Panel − item labels

Iterating a DataFrame

Iterating a DataFrame gives column names. Let us consider the following example to understand the same.

    import pandas as pd
    import numpy as np

    N = 20
    df = pd.DataFrame({
       'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
       'x': np.linspace(0, stop=N-1, num=N),
       'y': np.random.rand(N),
       'C': np.random.choice(['Low','Medium','High'], N).tolist(),
       'D': np.random.normal(100, 10, size=(N)).tolist()
       })

    for col in df:
       print(col)

Its output is as follows (on Python 3.7+ with recent pandas, the columns keep their insertion order instead: A, x, y, C, D) −

    A
    C
    D
    x
    y

To iterate over the columns or the rows of the DataFrame, we can use the following functions −

items() − to iterate over the (key, value) pairs, one per column; this method was called iteritems() before pandas 2.0

iterrows() − iterate over the rows as (index, Series) pairs

itertuples() − iterate over the rows as namedtuples

items()

Iterates over each column as a key, value pair with the label as key and the column values as a Series object.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(4,3), columns=['col1','col2','col3'])
    for key, value in df.items():
       print(key, value)

Its output is as follows −

    col1 0    0.802390
    1    0.324060
    2    0.256811
    3    0.839186
    Name: col1, dtype: float64

    col2 0    1.624313
    1   -1.033582
    2    1.796663
    3    1.856277
    Name: col2, dtype: float64

    col3 0   -0.022142
    1   -0.230820
    2    1.160691
    3   -0.830279
    Name: col3, dtype: float64

Observe that each column is iterated separately as a key-value pair, with the value being a Series.

iterrows()

iterrows() returns an iterator yielding each index value along with a Series containing the data in each row.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(4,3), columns = ['col1','col2','col3'])
    for row_index, row in df.iterrows():
       print(row_index, row)

Its output is as follows −

    0 col1    1.529759
    col2    0.762811
    col3   -0.634691
    Name: 0, dtype: float64

    1 col1   -0.944087
    col2    1.420919
    col3   -0.507895
    Name: 1, dtype: float64

    2 col1   -0.077287
    col2   -0.858556
    col3   -0.663385
    Name: 2, dtype: float64

    3 col1   -1.638578
    col2    0.059866
    col3    0.493482
    Name: 3, dtype: float64

Note − Because iterrows() returns each row as a single Series, it doesn't preserve the data types across the row. 0, 1, 2, 3 are the row indices and col1, col2, col3 are the column indices.

itertuples()

The itertuples() method returns an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple is the row's corresponding index value, while the remaining values are the row values.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(4,3), columns = ['col1','col2','col3'])
    for row in df.itertuples():
       print(row)

Its output is as follows −

    Pandas(Index=0, col1=1.5297586201375899, col2=0.76281127433814944, col3=-0.6346908238310438)
    Pandas(Index=1, col1=-0.94408735763808649, col2=1.4209186418359423, col3=-0.50789517967096232)
    Pandas(Index=2, col1=-0.07728664756791935, col2=-0.85855574139699076, col3=-0.6633852507207626)
    Pandas(Index=3, col1=0.65734942534106289, col2=-0.95057710432604969, col3=0.80344487462316527)

Note − Do not try to modify any object while iterating. Iterating is meant for reading, and the iterator generally returns a copy of the original data rather than a view, so the changes will not be reflected in the original object.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(4,3), columns = ['col1','col2','col3'])

    for index, row in df.iterrows():
       row['a'] = 10
    print(df)

Its output is as follows −

           col1      col2      col3
    0 -1.739815  0.735595 -0.295589
    1  0.635485  0.106803  1.527922
    2 -0.939064  0.547095  0.038585
    3 -1.016509 -0.116580 -0.523158

Observe that no changes are reflected.
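The two notes above (rows lose their dtypes under iterrows(), and writes made while iterating are lost) can be checked with a small sketch. The frame and the column names qty, price and total are hypothetical, and the column-wise assignment at the end is one common alternative rather than the only one.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one integer column, one float column
df = pd.DataFrame({'qty': [1, 2, 3], 'price': [1.5, 2.5, 3.5]})

# iterrows() packs each row into one Series, so the integers are upcast to float
_, first_row = next(df.iterrows())
row_dtype = str(first_row.dtype)

# itertuples() keeps each value in its own type
first_tuple = next(df.itertuples())

# Writing to the row object changes only a temporary copy
for idx, row in df.iterrows():
    row['price'] = 0.0
prices_after_loop = df['price'].tolist()

# A column-wise assignment is what actually changes the DataFrame
df['total'] = df['qty'] * df['price']

print(row_dtype, first_tuple, prices_after_loop)
print(df)
```

When a change has to be written back, assigning to whole columns (or to a .loc selection) avoids the copy issue entirely.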
Python Pandas – Working with Text Data

In this chapter, we will discuss the string operations with our basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions on the DataFrame.

Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.

Almost all of these methods mirror the Python string methods (refer: https://docs.python.org/3/library/stdtypes.html#string-methods). They are reached through the .str accessor of a Series or Index, which treats each element as a string and applies the operation to it.

Let us now see how each operation performs.

1. lower() − Converts strings in the Series/Index to lower case.
2. upper() − Converts strings in the Series/Index to upper case.
3. len() − Computes the length of each string.
4. strip() − Strips whitespace (including newlines) from both sides of each string in the Series/Index.
5. split(' ') − Splits each string with the given pattern.
6. cat(sep=' ') − Concatenates the Series/Index elements with the given separator.
7. get_dummies() − Returns a DataFrame with one-hot encoded values.
8. contains(pattern) − Returns True for each element that contains the pattern, else False.
9. replace(a,b) − Replaces the value a with the value b.
10. repeat(value) − Repeats each element the specified number of times.
11. count(pattern) − Returns the number of times the pattern appears in each element.
12. startswith(pattern) − Returns True if the element in the Series/Index starts with the pattern.
13. endswith(pattern) − Returns True if the element in the Series/Index ends with the pattern.
14. find(pattern) − Returns the position of the first occurrence of the pattern.
15. findall(pattern) − Returns a list of all occurrences of the pattern.
16. swapcase() − Swaps the case lower/upper.
17. islower() − Checks whether all characters in each string in the Series/Index are lower case. Returns Boolean.
18. isupper() − Checks whether all characters in each string in the Series/Index are upper case. Returns Boolean.
19. isnumeric() − Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

Let us now create a Series and see how all the above functions work.

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

    print(s)

Its output is as follows −

    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    4             NaN
    5            1234
    6      SteveSmith
    dtype: object

lower()

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

    print(s.str.lower())

Its output is as follows −

    0             tom
    1    william rick
    2            john
    3         alber@t
    4             NaN
    5            1234
    6      stevesmith
    dtype: object

upper()

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

    print(s.str.upper())

Its output is as follows −

    0             TOM
    1    WILLIAM RICK
    2            JOHN
    3         ALBER@T
    4             NaN
    5            1234
    6      STEVESMITH
    dtype: object

len()

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
    print(s.str.len())

Its output is as follows −

    0     3.0
    1    12.0
    2     4.0
    3     7.0
    4     NaN
    5     4.0
    6    10.0
    dtype: float64

strip()

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("After Stripping:")
    print(s.str.strip())

Its output is as follows −

    0             Tom 
    1     William Rick
    2             John
    3          Alber@t
    dtype: object

    After Stripping:
    0             Tom
    1    William Rick
    2            John
    3         Alber@t
    dtype: object

split(pattern)

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("Split Pattern:")
    print(s.str.split(' '))

Its output is as follows −

    0             Tom 
    1     William Rick
    2             John
    3          Alber@t
    dtype: object

    Split Pattern:
    0              [Tom, ]
    1    [, William, Rick]
    2               [John]
    3            [Alber@t]
    dtype: object

cat(sep=pattern)

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

    print(s.str.cat(sep='_'))

Its output is as follows −

    Tom _ William Rick_John_Alber@t

get_dummies()

    import pandas as pd
    import numpy as np

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

    print(s.str.get_dummies())

Its output is as follows −

        William Rick  Alber@t  John  Tom 
    0              0        0     0     1
    1              1        0     0     0
    2              0        0     1     0
    3              0        1     0     0

contains()

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

    print(s.str.contains(' '))

Its output is as follows −

    0     True
    1     True
    2    False
    3    False
    dtype: bool

replace(a,b)

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print(s)
    print("After replacing @ with $:")
    print(s.str.replace('@', '$'))

Its output is as follows −

    0             Tom 
    1     William Rick
    2             John
    3          Alber@t
    dtype: object

    After replacing @ with $:
    0             Tom 
    1     William Rick
    2             John
    3          Alber$t
    dtype: object

repeat(value)

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

    print(s.str.repeat(2))

Its output is as follows −

    0                      Tom Tom 
    1     William Rick William Rick
    2                      JohnJohn
    3                Alber@tAlber@t
    dtype: object

count(pattern)

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

    print("The number of 'm's in each string:")
    print(s.str.count('m'))

Its output is as follows −

    The number of 'm's in each string:
    0    1
    1    1
    2    0
    3    0
    dtype: int64

startswith(pattern)

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

    print("Strings that start with 'T':")
    print(s.str.startswith('T'))

Its output is as follows −

    Strings that start with 'T':
    0     True
    1    False
    2    False
    3    False
    dtype: bool

endswith(pattern)

    import pandas as pd

    s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
    print("Strings that end with 't':")
    print(s.str.endswith('t'))

Its output is as follows −

    Strings
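As a complement to the per-method demos in this chapter, the sketch below chains several .str methods and shows how a missing value passes through them. The input Series is hypothetical, and the na=False and regex=False arguments are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input with stray whitespace, mixed case and one missing value
s = pd.Series(['  Tom ', 'william rick', np.nan, 'ALBER@T'])

# .str methods can be chained; the missing value stays missing at every step
cleaned = s.str.strip().str.lower()

# contains() returns NaN for missing values unless na= supplies a fill value;
# regex=False treats the pattern as a literal substring
has_at = s.str.contains('@', regex=False, na=False)

print(cleaned)
print(has_at)
```

Passing na=False is convenient when the result is going to be used as a Boolean mask, since a mask containing NaN cannot be used for selection.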
Python Pandas – Panel

A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the examples in this chapter run only on older releases.

The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −

items − axis 0, each item corresponds to a DataFrame contained inside.

major_axis − axis 1, it is the index (rows) of each of the DataFrames.

minor_axis − axis 2, it is the columns of each of the DataFrames.

pandas.Panel()

A Panel can be created using the following constructor −

    pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

The parameters of the constructor are as follows −

    Parameter    Description
    data         Data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame
    items        axis=0
    major_axis   axis=1
    minor_axis   axis=2
    dtype        Data type of each column
    copy         Copy data. Default, false

Create Panel

A Panel can be created in multiple ways, such as −

From ndarrays
From dict of DataFrames

From 3D ndarray

    # creating a panel from a 3D ndarray
    import pandas as pd
    import numpy as np

    data = np.random.rand(2,4,5)
    p = pd.Panel(data)
    print(p)

Its output is as follows −

    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
    Items axis: 0 to 1
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 4

Note − Compare the dimensions of this panel with those of the empty panel shown later; the axes of the two objects are different.

From dict of DataFrame Objects

    # creating a panel from a dict of DataFrames
    import pandas as pd
    import numpy as np

    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
       'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p)

Its output is as follows −

    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
    Items axis: Item1 to Item2
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 2

Create an Empty Panel

An empty panel can be created using the Panel constructor as follows −

    # creating an empty panel
    import pandas as pd

    p = pd.Panel()
    print(p)

Its output is as follows −

    <class 'pandas.core.panel.Panel'>
    Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
    Items axis: None
    Major_axis axis: None
    Minor_axis axis: None

Selecting the Data from Panel

Select the data from the panel using −

Items
Major_axis
Minor_axis

Using Items

    # selecting one item from a panel
    import pandas as pd
    import numpy as np

    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
       'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p['Item1'])

Its output is as follows −

              0         1         2
    0  0.488224 -0.128637  0.930817
    1  0.417497  0.896681  0.576657
    2 -2.775266  0.571668  0.290082
    3 -0.400538 -0.144234  1.110535

We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the Major_axis and Minor_axis dimensions.

Using major_axis

Data can be accessed using the method panel.major_xs(index).

    # selecting one major_axis label from a panel
    import pandas as pd
    import numpy as np

    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
       'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.major_xs(1))

Its output is as follows −

          Item1     Item2
    0  0.417497  0.748412
    1  0.896681 -0.557322
    2  0.576657       NaN

Using minor_axis

Data can be accessed using the method panel.minor_xs(index).

    # selecting one minor_axis label from a panel
    import pandas as pd
    import numpy as np

    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
       'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.minor_xs(1))

Its output is as follows −

          Item1     Item2
    0 -0.128637 -1.047032
    1  0.896681 -0.557322
    2  0.571668  0.431953
    3 -0.144234  1.302466

Note − Observe the changes in the dimensions.
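On pandas releases that no longer ship Panel, similar three-dimensional selections can be sketched with a DataFrame that carries a two-level index. This is one possible substitute, not a drop-in replacement; the level names item and major are hypothetical labels chosen to echo the Panel axes.

```python
import numpy as np
import pandas as pd

# Same shapes as the Panel examples: a 4x3 and a 4x2 DataFrame
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}

# Stack the frames vertically; the dict keys become the outer index level
stacked = pd.concat(data, names=['item', 'major'])

item1 = stacked.loc['Item1']             # one item, comparable to p['Item1']
row1 = stacked.xs(1, level='major')      # one major label across all items
col1 = stacked[1].unstack('item')        # one minor label across all items

print(stacked.shape, item1.shape, row1.shape, col1.shape)
```

Because Item2 has only two columns, its entries in the third column are filled with NaN, which mirrors the NaN shown in the major_xs output above.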
Python Pandas – Caveats & Gotchas

A caveat is a warning, and a gotcha is an unseen problem.

Using If/Truth Statement with Pandas

Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, or not. It is not clear what the result should be. Should it be True because it is not zero-length? False because there are False values? It is unclear, so instead, Pandas raises a ValueError −

    import pandas as pd

    if pd.Series([False, True, False]):
       print('I am True')

Its output is as follows −

    ValueError: The truth value of a Series is ambiguous.
    Use a.empty, a.bool(), a.item(), a.any() or a.all().

In an if condition, it is unclear what to do with the Series. The error message suggests stating the test explicitly, for example by comparing against None or by using any() or all().

    import pandas as pd

    if pd.Series([False, True, False]).any():
       print("I am any")

Its output is as follows −

    I am any

To evaluate single-element pandas objects in a Boolean context, use the method .bool() (newer pandas releases deprecate it; .item() on a single-element Series is an alternative) −

    import pandas as pd

    print(pd.Series([True]).bool())

Its output is as follows −

    True

Bitwise Boolean

Comparison operators like == and != are applied element-wise and return a Boolean Series, which is almost always what is required anyway.

    import pandas as pd

    s = pd.Series(range(5))
    print(s == 4)

Its output is as follows −

    0    False
    1    False
    2    False
    3    False
    4     True
    dtype: bool

isin Operation

This returns a Boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.

    import pandas as pd

    s = pd.Series(list('abc'))
    s = s.isin(['a', 'c', 'e'])
    print(s)

Its output is as follows −

    0     True
    1    False
    2     True
    dtype: bool

Reindexing vs ix Gotcha

Many users found themselves using the ix indexing capabilities as a concise means of selecting data from a Pandas object (ix was removed in pandas 1.0, so the ix examples below run only on older releases) −

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
       'four'], index=list('abcdef'))

    print(df)
    print(df.ix[['b', 'c', 'e']])

Its output is as follows −

            one       two     three      four
    a -1.582025  1.335773  0.961417 -1.272084
    b  1.461512  0.111372 -0.072225  0.553058
    c -1.240671  0.762185  1.511936 -0.630920
    d -2.380648 -0.029981  0.196489  0.531714
    e  1.846746  0.148149  0.275398 -0.244559
    f -1.842662 -0.933195  2.303949  0.677641

            one       two     three      four
    b  1.461512  0.111372 -0.072225  0.553058
    c -1.240671  0.762185  1.511936 -0.630920
    e  1.846746  0.148149  0.275398 -0.244559

This is, of course, completely equivalent in this case to using the reindex method −

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
       'four'], index=list('abcdef'))

    print(df)
    print(df.reindex(['b', 'c', 'e']))

Its output is as follows −

            one       two     three      four
    a  1.639081  1.369838  0.261287 -1.662003
    b -0.173359  0.242447 -0.494384  0.346882
    c -0.106411  0.623568  0.282401 -0.916361
    d -1.078791 -0.612607 -0.897289 -1.146893
    e  0.465215  1.552873 -1.841959  0.329404
    f  0.966022 -0.190077  1.324247  0.678064

            one       two     three      four
    b -0.173359  0.242447 -0.494384  0.346882
    c -0.106411  0.623568  0.282401 -0.916361
    e  0.465215  1.552873 -1.841959  0.329404

Some might conclude that ix and reindex are 100% equivalent based on this. This is true except in the case of integer indexing. For example, the above operation can alternatively be expressed as −

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three',
       'four'], index=list('abcdef'))

    print(df)
    print(df.ix[[1, 2, 4]])
    print(df.reindex([1, 2, 4]))

Its output is as follows −

            one       two     three      four
    a -1.015695 -0.553847  1.106235 -0.784460
    b -0.527398 -0.518198 -0.710546 -0.512036
    c -0.842803 -1.050374  0.787146  0.205147
    d -1.238016 -0.749554 -0.547470 -0.029045
    e -0.056788  1.063999 -0.767220  0.212476
    f  1.139714  0.036159  0.201912  0.710119

            one       two     three      four
    b -0.527398 -0.518198 -0.710546 -0.512036
    c -0.842803 -1.050374  0.787146  0.205147
    e -0.056788  1.063999 -0.767220  0.212476

       one  two  three  four
    1  NaN  NaN    NaN   NaN
    2  NaN  NaN    NaN   NaN
    4  NaN  NaN    NaN   NaN

It is important to remember that reindex is strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings.
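The truth-value caveat described in this chapter can be reproduced and worked around with explicit reductions and element-wise operators. The following is a minimal sketch; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([False, True, False])

# Using a multi-element Series directly in an if statement raises ValueError
try:
    if s:
        pass
except ValueError as err:
    message = str(err)

# Explicit reductions state what is actually being asked
any_true = bool(s.any())
all_true = bool(s.all())
is_empty = s.empty

# Element-wise logic uses &, | and ~ with parentheses, not and / or / not
nums = pd.Series(range(5))
mask = (nums > 1) & (nums != 4)

print(message)
print(any_true, all_true, is_empty)
print(nums[mask].tolist())
```

The parentheses around each comparison matter because & binds more tightly than the comparison operators.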
Introduction to Data Structures ”; Previous Next Pandas deals with the following three data structures − Series DataFrame Panel These data structures are built on top of Numpy array, which means they are fast. Dimension & Description The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame is a container of Series, Panel is a container of DataFrame. Data Structure Dimensions Description Series 1 1D labeled homogeneous array, sizeimmutable. Data Frames 2 General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns. Panel 3 General 3D labeled, size-mutable array. Building and handling two or more dimensional arrays is a tedious task, burden is placed on the user to consider the orientation of the data set when writing functions. But using Pandas data structures, the mental effort of the user is reduced. For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. Mutability All Pandas data structures are value mutable (can be changed) and except Series all are size mutable. Series is size immutable. Note − DataFrame is widely used and one of the most important data structures. Panel is used much less. Series Series is a one-dimensional array like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, … 10 23 56 17 52 61 73 90 26 72 Key Points Homogeneous data Size Immutable Values of Data Mutable DataFrame DataFrame is a two-dimensional array with heterogeneous data. For example, Name Age Gender Rating Steve 32 Male 3.45 Lia 28 Female 4.6 Vin 45 Male 3.9 Katie 38 Female 2.78 The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. 
Each column represents an attribute and each row represents a person.

Data Type of Columns

The data types of the four columns are as follows −

Column   Type
Name     String
Age      Integer
Gender   String
Rating   Float

Key Points

Heterogeneous data
Size Mutable
Data Mutable

Panel

Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to represent graphically, but it can be illustrated as a container of DataFrames.

Key Points

Heterogeneous data
Size Mutable
Data Mutable
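The Series and DataFrame described above can be built directly from the sample values in this chapter. This is a small sketch rather than a definitive recipe; the variable names `s` and `team` and the added `Senior` column are assumptions introduced for illustration.

```python
import pandas as pd

# A Series: one-dimensional and homogeneous (all integers here).
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

# A DataFrame: two-dimensional, and each column may have its own type.
team = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

print(s.dtype)      # one dtype for the whole Series
print(team.dtypes)  # one dtype per column

# A DataFrame is a container of Series: each column is itself a Series.
age = team["Age"]
print(type(age))

# Values are mutable: overwrite the first element of the Series.
s[0] = 11

# A DataFrame is size mutable: adding a (hypothetical) column grows it.
team["Senior"] = team["Age"] > 35
print(team.shape)
```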
Python Pandas – Environment Setup

The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip.

    pip install pandas

If you install a scientific Python distribution such as Anaconda, Pandas is installed by default. The options for each platform are as follows −

Windows

Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.

Canopy (https://www.enthought.com/products/canopy/) is available as a free as well as a commercial distribution with the full SciPy stack for Windows, Linux and Mac.

Python (x,y) is a free Python distribution with the SciPy stack and the Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)

Linux

The package managers of the respective Linux distributions are used to install one or more packages in the SciPy stack.

For Ubuntu Users

    sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

For Fedora Users

    sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
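Whichever installation route is used, a quick import check confirms that Pandas and NumPy are available to the interpreter. This is a minimal sketch; the small `smoke` frame is a hypothetical example used only to confirm that the library works.

```python
import numpy as np
import pandas as pd

# Report the installed versions; an ImportError here means the install failed.
print("pandas", pd.__version__)
print("numpy", np.__version__)

# Tiny smoke test: build a frame and compute one aggregate.
smoke = pd.DataFrame({"x": [1, 2, 3]})
total = int(smoke["x"].sum())
print(total)
```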
Descriptive Statistics
A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or by integer. For a DataFrame − "index" (axis=0, default), "columns" (axis=1).

Let us create a DataFrame and use this object throughout this chapter for all the operations.

Example

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df)

Its output is as follows −

        Age    Name  Rating
    0    25     Tom    4.23
    1    26   James    3.24
    2    25   Ricky    3.98
    3    23     Vin    2.56
    4    30   Steve    3.20
    5    29   Smith    4.60
    6    23    Jack    3.80
    7    34     Lee    3.78
    8    40   David    2.98
    9    30  Gasper    4.80
    10   51  Betina    4.10
    11   46  Andres    3.65

sum()

Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.sum())

Its output is as follows −

    Age                                                    382
    Name     TomJamesRickyVinSteveSmithJackLeeDavidGasperBe…
    Rating                                               44.92
    dtype: object

Each column is summed individually (strings are concatenated).

axis=1

Passing axis=1 sums across the columns of each row instead, which gives the output shown below.
    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    # numeric_only=True skips the string column 'Name' (required on pandas 2.0 and later)
    print(df.sum(axis=1, numeric_only=True))

Its output is as follows −

    0     29.23
    1     29.24
    2     28.98
    3     25.56
    4     33.20
    5     33.60
    6     26.80
    7     37.78
    8     42.98
    9     34.80
    10    55.10
    11    49.65
    dtype: float64

mean()

Returns the average value of each numeric column.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    # numeric_only=True skips the string column 'Name' (required on pandas 2.0 and later)
    print(df.mean(numeric_only=True))

Its output is as follows −

    Age       31.833333
    Rating     3.743333
    dtype: float64

std()

Returns the Bessel-corrected (sample) standard deviation of the numerical columns.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    # numeric_only=True skips the string column 'Name' (required on pandas 2.0 and later)
    print(df.std(numeric_only=True))

Its output is as follows −

    Age       9.232682
    Rating    0.661628
    dtype: float64

Functions & Description

Let us now understand the functions under Descriptive Statistics in Python Pandas. The following table lists the important functions −

Sr.No.   Function    Description
1        count()     Number of non-null observations
2        sum()       Sum of values
3        mean()      Mean of values
4        median()    Median of values
5        mode()      Mode of values
6        std()       Standard deviation of the values
7        min()       Minimum value
8        max()       Maximum value
9        abs()       Absolute value
10       prod()      Product of values
11       cumsum()    Cumulative sum
12       cumprod()   Cumulative product

Note − Since a DataFrame is a heterogeneous data structure, generic operations do not work with all functions.

Functions like sum() and cumsum() work with both numeric and character (or string) data elements without any error. Though in practice character aggregations are rarely used, these functions do not throw any exception.

Functions like abs() and cumprod() throw an exception when the DataFrame contains character or string data, because such operations cannot be performed.

Summarizing Data

The describe() function computes a summary of statistics pertaining to the DataFrame columns.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe())

Its output is as follows −

                 Age     Rating
    count  12.000000  12.000000
    mean   31.833333   3.743333
    std     9.232682   0.661628
    min    23.000000   2.560000
    25%    25.000000   3.230000
    50%    29.500000   3.790000
    75%    35.500000   4.132500
    max    51.000000   4.800000

This function gives the count, mean, std, min, max and quartile values (from which the IQR can be read off). By default, it excludes the character columns and summarizes only the numeric columns. 'include' is the argument used to specify which columns should be considered for summarizing. It takes a list of values; by default, only 'number' columns are summarized.
object − Summarizes String columns
number − Summarizes Numeric columns
all − Summarizes all columns together (should be passed as a plain string, not as a list value)

Now, use the following statement in the program and check the output −

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe(include=['object']))

Its output is as follows −

             Name
    count      12
    unique     12
    top     Ricky
    freq        1

Now, use the following statement and check the output −

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe(include='all'))

Its output is as follows −

                  Age   Name     Rating
    count   12.000000     12  12.000000
    unique        NaN     12        NaN
    top           NaN  Ricky        NaN
    freq          NaN      1        NaN
    mean    31.833333    NaN   3.743333
    std      9.232682    NaN   0.661628
    min     23.000000    NaN   2.560000
    25%     25.000000    NaN   3.230000
    50%     29.500000    NaN   3.790000
    75%     35.500000    NaN   4.132500
    max     51.000000    NaN   4.800000
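Because functions such as abs() and cumprod() fail on string data, one common approach is to select the numeric columns first and aggregate only those. The sketch below assumes a shortened three-row version of the chapter's DataFrame; `select_dtypes` is a standard DataFrame method, while the variable names are chosen for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the chapter's DataFrame.
df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],
    "Age": [25, 26, 25],
    "Rating": [4.23, 3.24, 3.98],
})

# Keep only the numeric columns so that abs() and cumprod() cannot hit strings.
numeric = df.select_dtypes(include="number")

cumulative = numeric.cumprod()   # same shape as the input
absolute = numeric.abs()

# The axis can be given by name or by integer; both sum across each row.
by_name = numeric.sum(axis="columns")
by_int = numeric.sum(axis=1)

# describe() on a string column reports count, unique, top and freq.
name_summary = df[["Name"]].describe()

print(cumulative)
print(by_name)
print(name_summary)
```

Selecting the numeric columns explicitly keeps the intent visible in the code and avoids relying on each function's handling of mixed column types.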
Python Pandas – Introduction
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mostly used for data munging and preparation and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

Key Features of Pandas

Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.