Python Pandas – Categorical Data ”; Previous Next Often in real-time, data includes the text columns, which are repetitive. Features like gender, country, and codes are always repetitive. These are the examples for categorical data. Categorical variables can take on only a limited, and usually fixed number of possible values. Besides the fixed length, categorical data might have an order but cannot perform numerical operation. Categorical are a Pandas data type. The categorical data type is useful in the following cases − A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory. The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order. As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types). Object Creation Categorical object can be created in multiple ways. The different ways have been described below − category By specifying the dtype as “category” in pandas object creation. Live Demo import pandas as pd s = pd.Series([“a”,”b”,”c”,”a”], dtype=”category”) print s Its output is as follows − 0 a 1 b 2 c 3 a dtype: category Categories (3, object): [a, b, c] The number of elements passed to the series object is four, but the categories are only three. Observe the same in the output Categories. pd.Categorical Using the standard pandas Categorical constructor, we can create a category object. pandas.Categorical(values, categories, ordered) Let’s take an example − Live Demo import pandas as pd cat = pd.Categorical([”a”, ”b”, ”c”, ”a”, ”b”, ”c”]) print cat Its output is as follows − [a, b, c, a, b, c] Categories (3, object): [a, b, c] Let’s have another example − Live Demo import pandas as pd cat = cat=pd.Categorical([”a”,”b”,”c”,”a”,”b”,”c”,”d”], [”c”, ”b”, ”a”]) print cat Its output is as follows − [a, b, c, a, b, c, NaN] Categories (3, object): [c, b, a] Here, the second argument signifies the categories. Thus, any value which is not present in the categories will be treated as NaN. Now, take a look at the following example − Live Demo import pandas as pd cat = cat=pd.Categorical([”a”,”b”,”c”,”a”,”b”,”c”,”d”], [”c”, ”b”, ”a”],ordered=True) print cat Its output is as follows − [a, b, c, a, b, c, NaN] Categories (3, object): [c < b < a] Logically, the order means that, a is greater than b and b is greater than c. Description Using the .describe() command on the categorical data, we get similar output to a Series or DataFrame of the type string. Live Demo import pandas as pd import numpy as np cat = pd.Categorical([“a”, “c”, “c”, np.nan], categories=[“b”, “a”, “c”]) df = pd.DataFrame({“cat”:cat, “s”:[“a”, “c”, “c”, np.nan]}) print df.describe() print df[“cat”].describe() Its output is as follows − cat s count 3 3 unique 2 2 top c c freq 2 2 count 3 unique 2 top c freq 2 Name: cat, dtype: object Get the Properties of the Category obj.cat.categories command is used to get the categories of the object. Live Demo import pandas as pd import numpy as np s = pd.Categorical([“a”, “c”, “c”, np.nan], categories=[“b”, “a”, “c”]) print s.categories Its output is as follows − Index([u”b”, u”a”, u”c”], dtype=”object”) obj.ordered command is used to get the order of the object. 
Live Demo import pandas as pd import numpy as np cat = pd.Categorical(['a', 'c', 'c', np.nan], categories=['b', 'a', 'c']) print cat.ordered Its output is as follows − False The property returned False because we have not specified any order. Renaming Categories Renaming categories is done by assigning new values to the series.cat.categories property. Live Demo import pandas as pd s = pd.Series(['a','b','c','a'], dtype='category') s.cat.categories = ['Group %s' % g for g in s.cat.categories] print s.cat.categories Its output is as follows − Index([u'Group a', u'Group b', u'Group c'], dtype='object') The initial categories [a,b,c] are updated through the s.cat.categories property of the object. Appending New Categories Using the Categorical.add_categories() method, new categories can be appended. Live Demo import pandas as pd s = pd.Series(['a','b','c','a'], dtype='category') s = s.cat.add_categories([4]) print s.cat.categories Its output is as follows − Index([u'a', u'b', u'c', 4], dtype='object') Removing Categories Using the Categorical.remove_categories() method, unwanted categories can be removed. Live Demo import pandas as pd s = pd.Series(['a','b','c','a'], dtype='category') print ('Original object:') print s print ('After removal:') print s.cat.remove_categories('a') Its output is as follows − Original object: 0 a 1 b 2 c 3 a dtype: category Categories (3, object): [a, b, c] After removal: 0 NaN 1 b 2 c 3 NaN dtype: category Categories (2, object): [b, c] Comparison of Categorical Data Comparing categorical data with other objects is possible in three cases − comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data; all comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same; all comparisons of categorical data to a scalar. Take a look at the following example − Live Demo import pandas as pd cat = pd.Series([1,2,3]).astype('category', categories=[1,2,3], ordered=True) cat1 = pd.Series([2,2,2]).astype('category', categories=[1,2,3], ordered=True) print cat>cat1 Its output is as follows − 0 False 1 False 2 True dtype: bool
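The examples in this chapter use the older pandas API and Python 2 print statements. As a minimal sketch of the same ideas in Python 3 with a recent pandas release (an illustration, not part of the original tutorial), an ordered categorical can be defined once with CategoricalDtype and then reused for conversion and comparison:

import pandas as pd
from pandas.api.types import CategoricalDtype

# An ordered categorical type with the logical order c < b < a
size_type = CategoricalDtype(categories=["c", "b", "a"], ordered=True)

s = pd.Series(["a", "b", "c", "a", "d"]).astype(size_type)
print(s)                 # "d" is not a declared category, so it becomes NaN
print(s.min(), s.max())  # min/max follow the logical order, not the lexical one
print(s > "b")           # scalar comparison also uses the category order

As in the pd.Categorical examples above, any value outside the declared categories is converted to NaN.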
Python Pandas – Quick Guide ”; Previous Next Python Pandas – Introduction Pandas is an open-source Python Library providing high-performance data manipulation and analysis tool using its powerful data structures. The name Pandas is derived from the word Panel Data – an Econometrics from Multidimensional data. In 2008, developer Wes McKinney started developing pandas when in need of high performance, flexible tool for analysis of data. Prior to Pandas, Python was majorly used for data munging and preparation. It had very little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of data — load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of fields including academic and commercial domains including finance, economics, Statistics, analytics, etc. Key Features of Pandas Fast and efficient DataFrame object with default and customized indexing. Tools for loading data into in-memory data objects from different file formats. Data alignment and integrated handling of missing data. Reshaping and pivoting of date sets. Label-based slicing, indexing and subsetting of large data sets. Columns from a data structure can be deleted or inserted. Group by data for aggregation and transformations. High performance merging and joining of data. Time Series functionality. Python Pandas – Environment Setup Standard Python distribution doesn”t come bundled with Pandas module. A lightweight alternative is to install NumPy using popular Python package installer, pip. pip install pandas If you install Anaconda Python package, Pandas will be installed by default with the following − Windows Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy stack. It is also available for Linux and Mac. Canopy (https://www.enthought.com/products/canopy/) is available as free as well as commercial distribution with full SciPy stack for Windows, Linux and Mac. Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/) Linux Package managers of respective Linux distributions are used to install one or more packages in SciPy stack. For Ubuntu Users sudo apt-get install python-numpy python-scipy python-matplotlibipythonipythonnotebook python-pandas python-sympy python-nose For Fedora Users sudo yum install numpyscipy python-matplotlibipython python-pandas sympy python-nose atlas-devel Introduction to Data Structures Pandas deals with the following three data structures − Series DataFrame Panel These data structures are built on top of Numpy array, which means they are fast. Dimension & Description The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame is a container of Series, Panel is a container of DataFrame. Data Structure Dimensions Description Series 1 1D labeled homogeneous array, sizeimmutable. Data Frames 2 General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns. Panel 3 General 3D labeled, size-mutable array. Building and handling two or more dimensional arrays is a tedious task, burden is placed on the user to consider the orientation of the data set when writing functions. But using Pandas data structures, the mental effort of the user is reduced. 
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. Mutability All Pandas data structures are value mutable (can be changed) and except Series all are size mutable. Series is size immutable. Note − DataFrame is widely used and one of the most important data structures. Panel is used much less. Series Series is a one-dimensional array like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, … 10 23 56 17 52 61 73 90 26 72 Key Points Homogeneous data Size Immutable Values of Data Mutable DataFrame DataFrame is a two-dimensional array with heterogeneous data. For example, Name Age Gender Rating Steve 32 Male 3.45 Lia 28 Female 4.6 Vin 45 Male 3.9 Katie 38 Female 2.78 The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person. Data Type of Columns The data types of the four columns are as follows − Column Type Name String Age Integer Gender String Rating Float Key Points Heterogeneous data Size Mutable Data Mutable Panel Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent the panel in graphical representation. But a panel can be illustrated as a container of DataFrame. Key Points Heterogeneous data Size Mutable Data Mutable Python Pandas – Series Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index. pandas.Series A pandas Series can be created using the following constructor − pandas.Series( data, index, dtype, copy) The parameters of the constructor are as follows − Sr.No Parameter & Description 1 data data takes various forms like ndarray, list, constants 2 index Index values must be unique and hashable, same length as data. Default np.arange(n) if no index is passed. 3 dtype dtype is for data type. If None, data type will be inferred 4 copy Copy data. Default False A series can be created using various inputs like − Array Dict Scalar value or constant Create an Empty Series A basic series, which can be created is an Empty Series. Example Live Demo #import the pandas library and aliasing as pd import pandas as pd s = pd.Series() print s Its output is as follows − Series([], dtype: float64) Create a Series from ndarray If data is an ndarray, then index passed must be of the same length. If no index is passed, then by default index will be range(n) where n is array length, i.e., [0,1,2,3…. range(len(array))-1]. Example 1 Live Demo #import the pandas library and aliasing as pd import pandas as pd import numpy as np data = np.array([”a”,”b”,”c”,”d”]) s
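The Series-from-ndarray example above is cut off in this text. A minimal sketch of what it demonstrates, written in Python 3 syntax (the index labels below are illustrative), would be:

import pandas as pd
import numpy as np

data = np.array(['a', 'b', 'c', 'd'])

# Default index: integers 0 .. len(data) - 1
s = pd.Series(data)
print(s)

# Explicit index labels; must be the same length as the data
s2 = pd.Series(data, index=[100, 101, 102, 103])
print(s2)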
Python Pandas – Window Functions ”; Previous Next For working on numerical data, Pandas provide few variants like rolling, expanding and exponentially moving weights for window statistics. Among these are sum, mean, median, variance, covariance, correlation, etc. We will now learn how each of these can be applied on DataFrame objects. .rolling() Function This function can be applied on a series of data. Specify the window=n argument and apply the appropriate statistical function on top of it. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(10, 4), index = pd.date_range(”1/1/2000”, periods=10), columns = [”A”, ”B”, ”C”, ”D”]) print df.rolling(window=3).mean() Its output is as follows − A B C D 2000-01-01 NaN NaN NaN NaN 2000-01-02 NaN NaN NaN NaN 2000-01-03 0.434553 -0.667940 -1.051718 -0.826452 2000-01-04 0.628267 -0.047040 -0.287467 -0.161110 2000-01-05 0.398233 0.003517 0.099126 -0.405565 2000-01-06 0.641798 0.656184 -0.322728 0.428015 2000-01-07 0.188403 0.010913 -0.708645 0.160932 2000-01-08 0.188043 -0.253039 -0.818125 -0.108485 2000-01-09 0.682819 -0.606846 -0.178411 -0.404127 2000-01-10 0.688583 0.127786 0.513832 -1.067156 Note − Since the window size is 3, for first two elements there are nulls and from third the value will be the average of the n, n-1 and n-2 elements. Thus we can also apply various functions as mentioned above. .expanding() Function This function can be applied on a series of data. Specify the min_periods=n argument and apply the appropriate statistical function on top of it. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(10, 4), index = pd.date_range(”1/1/2000”, periods=10), columns = [”A”, ”B”, ”C”, ”D”]) print df.expanding(min_periods=3).mean() Its output is as follows − A B C D 2000-01-01 NaN NaN NaN NaN 2000-01-02 NaN NaN NaN NaN 2000-01-03 0.434553 -0.667940 -1.051718 -0.826452 2000-01-04 0.743328 -0.198015 -0.852462 -0.262547 2000-01-05 0.614776 -0.205649 -0.583641 -0.303254 2000-01-06 0.538175 -0.005878 -0.687223 -0.199219 2000-01-07 0.505503 -0.108475 -0.790826 -0.081056 2000-01-08 0.454751 -0.223420 -0.671572 -0.230215 2000-01-09 0.586390 -0.206201 -0.517619 -0.267521 2000-01-10 0.560427 -0.037597 -0.399429 -0.376886 .ewm() Function ewm is applied on a series of data. Specify any of the com, span, halflife argument and apply the appropriate statistical function on top of it. It assigns the weights exponentially. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(10, 4), index = pd.date_range(”1/1/2000”, periods=10), columns = [”A”, ”B”, ”C”, ”D”]) print df.ewm(com=0.5).mean() Its output is as follows − A B C D 2000-01-01 1.088512 -0.650942 -2.547450 -0.566858 2000-01-02 0.865131 -0.453626 -1.137961 0.058747 2000-01-03 -0.132245 -0.807671 -0.308308 -1.491002 2000-01-04 1.084036 0.555444 -0.272119 0.480111 2000-01-05 0.425682 0.025511 0.239162 -0.153290 2000-01-06 0.245094 0.671373 -0.725025 0.163310 2000-01-07 0.288030 -0.259337 -1.183515 0.473191 2000-01-08 0.162317 -0.771884 -0.285564 -0.692001 2000-01-09 1.147156 -0.302900 0.380851 -0.607976 2000-01-10 0.600216 0.885614 0.569808 -1.110113 Window functions are majorly used in finding the trends within the data graphically by smoothing the curve. If there is lot of variation in the everyday data and a lot of data points are available, then taking the samples and plotting is one method and applying the window computations and plotting the graph on the results is another method. 
By applying these methods, we can smooth out the curve and make the underlying trend easier to see.
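As a rough illustration of the smoothing effect (a Python 3 sketch, not part of the original chapter), the three window variants can be computed side by side on a single column and compared:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 1),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A'])

smoothed = pd.DataFrame({
   'raw': df['A'],
   'rolling_3': df['A'].rolling(window=3).mean(),         # fixed three-day window
   'expanding': df['A'].expanding(min_periods=3).mean(),  # all observations so far
   'ewm': df['A'].ewm(com=0.5).mean(),                    # exponentially weighted
})
print(smoothed)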
Python Pandas – IO Tools ”; Previous Next The Pandas I/O API is a set of top level reader functions accessed like pd.read_csv() that generally return a Pandas object. The two workhorse functions for reading text files (or the flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object − pandas.read_csv(filepath_or_buffer, sep=”,”, delimiter=None, header=”infer”, names=None, index_col=None, usecols=None pandas.read_csv(filepath_or_buffer, sep=”t”, delimiter=None, header=”infer”, names=None, index_col=None, usecols=None Here is how the csv file data looks like − S.No,Name,Age,City,Salary 1,Tom,28,Toronto,20000 2,Lee,32,HongKong,3000 3,Steven,43,Bay Area,8300 4,Ram,38,Hyderabad,3900 Save this data as temp.csv and conduct operations on it. S.No,Name,Age,City,Salary 1,Tom,28,Toronto,20000 2,Lee,32,HongKong,3000 3,Steven,43,Bay Area,8300 4,Ram,38,Hyderabad,3900 Save this data as temp.csv and conduct operations on it. read.csv read.csv reads data from the csv files and creates a DataFrame object. import pandas as pd df=pd.read_csv(“temp.csv”) print df Its output is as follows − S.No Name Age City Salary 0 1 Tom 28 Toronto 20000 1 2 Lee 32 HongKong 3000 2 3 Steven 43 Bay Area 8300 3 4 Ram 38 Hyderabad 3900 custom index This specifies a column in the csv file to customize the index using index_col. import pandas as pd df=pd.read_csv(“temp.csv”,index_col=[”S.No”]) print df Its output is as follows − S.No Name Age City Salary 1 Tom 28 Toronto 20000 2 Lee 32 HongKong 3000 3 Steven 43 Bay Area 8300 4 Ram 38 Hyderabad 3900 Converters dtype of the columns can be passed as a dict. import pandas as pd df = pd.read_csv(“temp.csv”, dtype={”Salary”: np.float64}) print df.dtypes Its output is as follows − S.No int64 Name object Age int64 City object Salary float64 dtype: object By default, the dtype of the Salary column is int, but the result shows it as float because we have explicitly casted the type. Thus, the data looks like float − S.No Name Age City Salary 0 1 Tom 28 Toronto 20000.0 1 2 Lee 32 HongKong 3000.0 2 3 Steven 43 Bay Area 8300.0 3 4 Ram 38 Hyderabad 3900.0 header_names Specify the names of the header using the names argument. import pandas as pd df=pd.read_csv(“temp.csv”, names=[”a”, ”b”, ”c”,”d”,”e”]) print df Its output is as follows − a b c d e 0 S.No Name Age City Salary 1 1 Tom 28 Toronto 20000 2 2 Lee 32 HongKong 3000 3 3 Steven 43 Bay Area 8300 4 4 Ram 38 Hyderabad 3900 Observe, the header names are appended with the custom names, but the header in the file has not been eliminated. Now, we use the header argument to remove that. If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows. import pandas as pd df=pd.read_csv(“temp.csv”,names=[”a”,”b”,”c”,”d”,”e”],header=0) print df Its output is as follows − a b c d e 0 S.No Name Age City Salary 1 1 Tom 28 Toronto 20000 2 2 Lee 32 HongKong 3000 3 3 Steven 43 Bay Area 8300 4 4 Ram 38 Hyderabad 3900 skiprows skiprows skips the number of rows specified. import pandas as pd df=pd.read_csv(“temp.csv”, skiprows=2) print df Its output is as follows − 2 Lee 32 HongKong 3000 0 3 Steven 43 Bay Area 8300 1 4 Ram 38 Hyderabad 3900 Print Page Previous Next Advertisements ”;
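Note that the dtype example above refers to np.float64 without importing NumPy, and the snippets use Python 2 print statements. A self-contained Python 3 version of the same ideas (a sketch, assuming the temp.csv file shown earlier is in the working directory) would be:

import numpy as np
import pandas as pd

# Use S.No as the index and force Salary to float64
df = pd.read_csv('temp.csv', index_col='S.No', dtype={'Salary': np.float64})
print(df.dtypes)
print(df)

# Replace the header row with custom column names
df2 = pd.read_csv('temp.csv', names=['a', 'b', 'c', 'd', 'e'], header=0)
print(df2)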
Python Pandas – Concatenation ”; Previous Next Pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects. pd.concat(objs,axis=0,join=”outer”,join_axes=None, ignore_index=False) objs − This is a sequence or mapping of Series, DataFrame, or Panel objects. axis − {0, 1, …}, default 0. This is the axis to concatenate along. join − {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner for intersection. ignore_index − boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, …, n – 1. join_axes − This is the list of Index objects. Specific indexes to use for the other (n-1) axes instead of performing inner/outer set logic. Concatenating Objects The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create different objects and do concatenation. Live Demo import pandas as pd one = pd.DataFrame({ ”Name”: [”Alex”, ”Amy”, ”Allen”, ”Alice”, ”Ayoung”], ”subject_id”:[”sub1”,”sub2”,”sub4”,”sub6”,”sub5”], ”Marks_scored”:[98,90,87,69,78]}, index=[1,2,3,4,5]) two = pd.DataFrame({ ”Name”: [”Billy”, ”Brian”, ”Bran”, ”Bryce”, ”Betty”], ”subject_id”:[”sub2”,”sub4”,”sub3”,”sub6”,”sub5”], ”Marks_scored”:[89,80,79,97,88]}, index=[1,2,3,4,5]) print pd.concat([one,two]) Its output is as follows − Marks_scored Name subject_id 1 98 Alex sub1 2 90 Amy sub2 3 87 Allen sub4 4 69 Alice sub6 5 78 Ayoung sub5 1 89 Billy sub2 2 80 Brian sub4 3 79 Bran sub3 4 97 Bryce sub6 5 88 Betty sub5 Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame. We can do this by using the keys argument − Live Demo import pandas as pd one = pd.DataFrame({ ”Name”: [”Alex”, ”Amy”, ”Allen”, ”Alice”, ”Ayoung”], ”subject_id”:[”sub1”,”sub2”,”sub4”,”sub6”,”sub5”], ”Marks_scored”:[98,90,87,69,78]}, index=[1,2,3,4,5]) two = pd.DataFrame({ ”Name”: [”Billy”, ”Brian”, ”Bran”, ”Bryce”, ”Betty”], ”subject_id”:[”sub2”,”sub4”,”sub3”,”sub6”,”sub5”], ”Marks_scored”:[89,80,79,97,88]}, index=[1,2,3,4,5]) print pd.concat([one,two],keys=[”x”,”y”]) Its output is as follows − x 1 98 Alex sub1 2 90 Amy sub2 3 87 Allen sub4 4 69 Alice sub6 5 78 Ayoung sub5 y 1 89 Billy sub2 2 80 Brian sub4 3 79 Bran sub3 4 97 Bryce sub6 5 88 Betty sub5 The index of the resultant is duplicated; each index is repeated. If the resultant object has to follow its own indexing, set ignore_index to True. Live Demo import pandas as pd one = pd.DataFrame({ ”Name”: [”Alex”, ”Amy”, ”Allen”, ”Alice”, ”Ayoung”], ”subject_id”:[”sub1”,”sub2”,”sub4”,”sub6”,”sub5”], ”Marks_scored”:[98,90,87,69,78]}, index=[1,2,3,4,5]) two = pd.DataFrame({ ”Name”: [”Billy”, ”Brian”, ”Bran”, ”Bryce”, ”Betty”], ”subject_id”:[”sub2”,”sub4”,”sub3”,”sub6”,”sub5”], ”Marks_scored”:[89,80,79,97,88]}, index=[1,2,3,4,5]) print pd.concat([one,two],keys=[”x”,”y”],ignore_index=True) Its output is as follows − Marks_scored Name subject_id 0 98 Alex sub1 1 90 Amy sub2 2 87 Allen sub4 3 69 Alice sub6 4 78 Ayoung sub5 5 89 Billy sub2 6 80 Brian sub4 7 79 Bran sub3 8 97 Bryce sub6 9 88 Betty sub5 Observe, the index changes completely and the Keys are also overridden. If two objects need to be added along axis=1, then the new columns will be appended. 
Live Demo import pandas as pd one = pd.DataFrame({ ”Name”: [”Alex”, ”Amy”, ”Allen”, ”Alice”, ”Ayoung”], ”subject_id”:[”sub1”,”sub2”,”sub4”,”sub6”,”sub5”], ”Marks_scored”:[98,90,87,69,78]}, index=[1,2,3,4,5]) two = pd.DataFrame({ ”Name”: [”Billy”, ”Brian”, ”Bran”, ”Bryce”, ”Betty”], ”subject_id”:[”sub2”,”sub4”,”sub3”,”sub6”,”sub5”], ”Marks_scored”:[89,80,79,97,88]}, index=[1,2,3,4,5]) print pd.concat([one,two],axis=1) Its output is as follows − Marks_scored Name subject_id Marks_scored Name subject_id 1 98 Alex sub1 89 Billy sub2 2 90 Amy sub2 80 Brian sub4 3 87 Allen sub4 79 Bran sub3 4 69 Alice sub6 97 Bryce sub6 5 78 Ayoung sub5 88 Betty sub5 Concatenating Using append A useful shortcut to concat are the append instance methods on Series and DataFrame. These methods actually predated concat. They concatenate along axis=0, namely the index − Live Demo import pandas as pd one = pd.DataFrame({ ”Name”: [”Alex”, ”Amy”, ”Allen”, ”Alice”, ”Ayoung”], ”subject_id”:[”sub1”,”sub2”,”sub4”,”sub6”,”sub5”], ”Marks_scored”:[98,90,87,69,78]}, index=[1,2,3,4,5]) two = pd.DataFrame({ ”Name”: [”Billy”, ”Brian”, ”Bran”, ”Bryce”, ”Betty”], ”subject_id”:[”sub2”,”sub4”,”sub3”,”sub6”,”sub5”], ”Marks_scored”:[89,80,79,97,88]}, index=[1,2,3,4,5]) print one.append(two) Its output is as follows − Marks_scored Name subject_id 1 98 Alex sub1 2 90 Amy sub2 3 87 Allen sub4 4 69 Alice sub6 5 78 Ayoung sub5 1 89 Billy sub2 2 80 Brian sub4 3 79 Bran sub3 4 97 Bryce sub6 5 88 Betty sub5 The append function can take multiple objects as well − Live Demo import pandas as pd one = pd.DataFrame({ ”Name”: [”Alex”, ”Amy”, ”Allen”, ”Alice”, ”Ayoung”], ”subject_id”:[”sub1”,”sub2”,”sub4”,”sub6”,”sub5”], ”Marks_scored”:[98,90,87,69,78]}, index=[1,2,3,4,5]) two = pd.DataFrame({ ”Name”: [”Billy”, ”Brian”, ”Bran”, ”Bryce”, ”Betty”], ”subject_id”:[”sub2”,”sub4”,”sub3”,”sub6”,”sub5”], ”Marks_scored”:[89,80,79,97,88]}, index=[1,2,3,4,5]) print one.append([two,one,two]) Its output is as follows − Marks_scored Name subject_id 1 98 Alex sub1 2 90 Amy sub2 3 87 Allen sub4 4 69 Alice sub6 5 78 Ayoung sub5 1 89 Billy sub2 2 80 Brian sub4 3 79 Bran sub3 4 97 Bryce sub6 5 88 Betty sub5 1 98 Alex sub1 2 90 Amy sub2 3 87 Allen sub4 4 69 Alice sub6 5 78 Ayoung sub5 1 89 Billy sub2 2 80 Brian sub4 3 79 Bran sub3 4 97 Bryce sub6 5 88 Betty sub5 Time Series Pandas provide a robust tool for working time with Time series data, especially in the financial sector. While working with time series data, we frequently come across the following − Generating sequence of time Convert the time series to different frequencies Pandas provides a relatively compact and self-contained set of tools for performing the above tasks. Get Current Time datetime.now() gives you the current date and time. Live Demo import pandas as pd print pd.datetime.now() Its output is as follows − 2017-05-11 06:10:13.393147 Create a TimeStamp Time-stamped data is the most basic type of timeseries data that associates values with points in time. For pandas objects, it means using the points in time. Let’s take an example − Live Demo import pandas as pd print pd.Timestamp(”2017-03-01”) Its output is as follows − 2017-03-01 00:00:00 It is also possible to convert integer or float epoch times. The default unit for these is nanoseconds (since these are how Timestamps are stored). However, often epochs are stored in another unit which can be specified. 
Let’s take another example Live Demo import pandas as pd print pd.Timestamp(1587687255,unit=”s”) Its output is as follows − 2020-04-24 00:14:15 Create a Range of Time Live Demo import pandas as pd print pd.date_range(“11:00”,
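The date_range call above is truncated in this text, so its exact arguments are not shown. As a hedged sketch of what such a call typically looks like, the periods and frequency below are purely illustrative:

import pandas as pd

# Hourly timestamps starting at 11:00 (illustrative arguments)
print(pd.date_range('2020-01-01 11:00', periods=5, freq='H'))

# The same idea at a daily frequency
print(pd.date_range('2020-01-01', periods=5, freq='D'))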
Python Pandas – Options and Customization ”; Previous Next Pandas provide API to customize some aspects of its behavior, display is being mostly used. The API is composed of five relevant functions. They are − get_option() set_option() reset_option() describe_option() option_context() Let us now understand how the functions operate. get_option(param) get_option takes a single parameter and returns the value as given in the output below − display.max_rows Displays the default number of value. Interpreter reads this value and displays the rows with this value as upper limit to display. Live Demo import pandas as pd print pd.get_option(“display.max_rows”) Its output is as follows − 60 display.max_columns Displays the default number of value. Interpreter reads this value and displays the rows with this value as upper limit to display. Live Demo import pandas as pd print pd.get_option(“display.max_columns”) Its output is as follows − 20 Here, 60 and 20 are the default configuration parameter values. set_option(param,value) set_option takes two arguments and sets the value to the parameter as shown below − display.max_rows Using set_option(), we can change the default number of rows to be displayed. Live Demo import pandas as pd pd.set_option(“display.max_rows”,80) print pd.get_option(“display.max_rows”) Its output is as follows − 80 display.max_columns Using set_option(), we can change the default number of rows to be displayed. Live Demo import pandas as pd pd.set_option(“display.max_columns”,30) print pd.get_option(“display.max_columns”) Its output is as follows − 30 reset_option(param) reset_option takes an argument and sets the value back to the default value. display.max_rows Using reset_option(), we can change the value back to the default number of rows to be displayed. Live Demo import pandas as pd pd.reset_option(“display.max_rows”) print pd.get_option(“display.max_rows”) Its output is as follows − 60 describe_option(param) describe_option prints the description of the argument. display.max_rows Using reset_option(), we can change the value back to the default number of rows to be displayed. Live Demo import pandas as pd pd.describe_option(“display.max_rows”) Its output is as follows − display.max_rows : int If max_rows is exceeded, switch to truncate view. Depending on ”large_repr”, objects are either centrally truncated or printed as a summary view. ”None” value means unlimited. In case python/IPython is running in a terminal and `large_repr` equals ”truncate” this can be set to 0 and pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection. [default: 60] [currently: 60] option_context() option_context context manager is used to set the option in with statement temporarily. Option values are restored automatically when you exit the with block − display.max_rows Using option_context(), we can set the value temporarily. Live Demo import pandas as pd with pd.option_context(“display.max_rows”,10): print(pd.get_option(“display.max_rows”)) print(pd.get_option(“display.max_rows”)) Its output is as follows − 10 10 See, the difference between the first and the second print statements. The first statement prints the value set by option_context() which is temporary within the with context itself. After the with context, the second print statement prints the configured value. 
Frequently used Parameters Sr.No Parameter & Description 1 display.max_rows Maximum number of rows to display 2 display.max_columns Maximum number of columns to display 3 display.expand_frame_repr Whether to print wide DataFrames across multiple lines (pages) 4 display.max_colwidth Maximum width of a column 5 display.precision Precision for floating-point numbers
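A short, runnable recap of the options API in Python 3 (a sketch, not part of the original page) ties the five functions together:

import pandas as pd

print(pd.get_option('display.max_rows'))        # default: 60

pd.set_option('display.max_rows', 80)
print(pd.get_option('display.max_rows'))        # 80

with pd.option_context('display.max_rows', 10):
   print(pd.get_option('display.max_rows'))     # 10, only inside the with block

pd.reset_option('display.max_rows')
print(pd.get_option('display.max_rows'))        # back to the default, 60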
Python Pandas – Iteration ”; Previous Next The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and Panel, follow the dict-like convention of iterating over the keys of the objects. In short, basic iteration (for i in object) produces − Series − values DataFrame − column labels Panel − item labels Iterating a DataFrame Iterating a DataFrame gives column names. Let us consider the following example to understand the same. Live Demo import pandas as pd import numpy as np N=20 df = pd.DataFrame({ ”A”: pd.date_range(start=”2016-01-01”,periods=N,freq=”D”), ”x”: np.linspace(0,stop=N-1,num=N), ”y”: np.random.rand(N), ”C”: np.random.choice([”Low”,”Medium”,”High”],N).tolist(), ”D”: np.random.normal(100, 10, size=(N)).tolist() }) for col in df: print col Its output is as follows − A C D x y To iterate over the rows of the DataFrame, we can use the following functions − iteritems() − to iterate over the (key,value) pairs iterrows() − iterate over the rows as (index,series) pairs itertuples() − iterate over the rows as namedtuples iteritems() Iterates over each column as key, value pair with label as key and column value as a Series object. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(4,3),columns=[”col1”,”col2”,”col3”]) for key,value in df.iteritems(): print key,value Its output is as follows − col1 0 0.802390 1 0.324060 2 0.256811 3 0.839186 Name: col1, dtype: float64 col2 0 1.624313 1 -1.033582 2 1.796663 3 1.856277 Name: col2, dtype: float64 col3 0 -0.022142 1 -0.230820 2 1.160691 3 -0.830279 Name: col3, dtype: float64 Observe, each column is iterated separately as a key-value pair in a Series. iterrows() iterrows() returns the iterator yielding each index value along with a series containing the data in each row. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(4,3),columns = [”col1”,”col2”,”col3”]) for row_index,row in df.iterrows(): print row_index,row Its output is as follows − 0 col1 1.529759 col2 0.762811 col3 -0.634691 Name: 0, dtype: float64 1 col1 -0.944087 col2 1.420919 col3 -0.507895 Name: 1, dtype: float64 2 col1 -0.077287 col2 -0.858556 col3 -0.663385 Name: 2, dtype: float64 3 col1 -1.638578 col2 0.059866 col3 0.493482 Name: 3, dtype: float64 Note − Because iterrows() iterate over the rows, it doesn”t preserve the data type across the row. 0,1,2 are the row indices and col1,col2,col3 are column indices. itertuples() itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(4,3),columns = [”col1”,”col2”,”col3”]) for row in df.itertuples(): print row Its output is as follows − Pandas(Index=0, col1=1.5297586201375899, col2=0.76281127433814944, col3=- 0.6346908238310438) Pandas(Index=1, col1=-0.94408735763808649, col2=1.4209186418359423, col3=- 0.50789517967096232) Pandas(Index=2, col1=-0.07728664756791935, col2=-0.85855574139699076, col3=- 0.6633852507207626) Pandas(Index=3, col1=0.65734942534106289, col2=-0.95057710432604969, col3=0.80344487462316527) Note − Do not try to modify any object while iterating. 
Iterating is meant for reading only; the iterator returns a copy of the original object, not a view, so changes made to the copy are not written back to the original object. Live Demo import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3']) for index, row in df.iterrows(): row['a'] = 10 print df Its output is as follows − col1 col2 col3 0 -1.739815 0.735595 -0.295589 1 0.635485 0.106803 1.527922 2 -0.939064 0.547095 0.038585 3 -1.016509 -0.116580 -0.523158 Observe, no changes are reflected.
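Because assignments made inside iterrows() are lost, the usual alternatives are vectorised assignment or writing back through .loc when a loop is genuinely needed. A minimal Python 3 sketch (the new column names are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])

# Preferred: vectorised assignment creates or overwrites a column in one step
df['a'] = 10

# If a loop is unavoidable, write back through .loc so the change persists
for idx in df.index:
   df.loc[idx, 'b'] = df.loc[idx, 'col1'] * 2

print(df)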
Python Pandas – Working with Text Data ”; Previous Next In this chapter, we will discuss the string operations with our basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions on the DataFrame. Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values. Almost, all of these methods work with Python string functions (refer: https://docs.python.org/3/library/stdtypes.html#string-methods). So, convert the Series Object to String Object and then perform the operation. Let us now see how each operation performs. Sr.No Function & Description 1 lower() Converts strings in the Series/Index to lower case. 2 upper() Converts strings in the Series/Index to upper case. 3 len() Computes String length(). 4 strip() Helps strip whitespace(including newline) from each string in the Series/index from both the sides. 5 split(” ”) Splits each string with the given pattern. 6 cat(sep=” ”) Concatenates the series/index elements with given separator. 7 get_dummies() Returns the DataFrame with One-Hot Encoded values. 8 contains(pattern) Returns a Boolean value True for each element if the substring contains in the element, else False. 9 replace(a,b) Replaces the value a with the value b. 10 repeat(value) Repeats each element with specified number of times. 11 count(pattern) Returns count of appearance of pattern in each element. 12 startswith(pattern) Returns true if the element in the Series/Index starts with the pattern. 13 endswith(pattern) Returns true if the element in the Series/Index ends with the pattern. 14 find(pattern) Returns the first position of the first occurrence of the pattern. 15 findall(pattern) Returns a list of all occurrence of the pattern. 16 swapcase Swaps the case lower/upper. 17 islower() Checks whether all characters in each string in the Series/Index in lower case or not. Returns Boolean 18 isupper() Checks whether all characters in each string in the Series/Index in upper case or not. Returns Boolean. 19 isnumeric() Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean. Let us now create a Series and see how all the above functions work. 
Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom”, ”William Rick”, ”John”, ”Alber@t”, np.nan, ”1234”,”SteveSmith”]) print s Its output is as follows − 0 Tom 1 William Rick 2 John 3 Alber@t 4 NaN 5 1234 6 Steve Smith dtype: object lower() Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom”, ”William Rick”, ”John”, ”Alber@t”, np.nan, ”1234”,”SteveSmith”]) print s.str.lower() Its output is as follows − 0 tom 1 william rick 2 john 3 alber@t 4 NaN 5 1234 6 steve smith dtype: object upper() Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom”, ”William Rick”, ”John”, ”Alber@t”, np.nan, ”1234”,”SteveSmith”]) print s.str.upper() Its output is as follows − 0 TOM 1 WILLIAM RICK 2 JOHN 3 ALBER@T 4 NaN 5 1234 6 STEVE SMITH dtype: object len() Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom”, ”William Rick”, ”John”, ”Alber@t”, np.nan, ”1234”,”SteveSmith”]) print s.str.len() Its output is as follows − 0 3.0 1 12.0 2 4.0 3 7.0 4 NaN 5 4.0 6 10.0 dtype: float64 strip() Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s print (“After Stripping:”) print s.str.strip() Its output is as follows − 0 Tom 1 William Rick 2 John 3 Alber@t dtype: object After Stripping: 0 Tom 1 William Rick 2 John 3 Alber@t dtype: object split(pattern) Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s print (“Split Pattern:”) print s.str.split(” ”) Its output is as follows − 0 Tom 1 William Rick 2 John 3 Alber@t dtype: object Split Pattern: 0 [Tom, , , , , , , , , , ] 1 [, , , , , William, Rick] 2 [John] 3 [Alber@t] dtype: object cat(sep=pattern) Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s.str.cat(sep=”_”) Its output is as follows − Tom _ William Rick_John_Alber@t get_dummies() Live Demo import pandas as pd import numpy as np s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s.str.get_dummies() Its output is as follows − William Rick Alber@t John Tom 0 0 0 0 1 1 1 0 0 0 2 0 0 1 0 3 0 1 0 0 contains () Live Demo import pandas as pd s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s.str.contains(” ”) Its output is as follows − 0 True 1 True 2 False 3 False dtype: bool replace(a,b) Live Demo import pandas as pd s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s print (“After replacing @ with $:”) print s.str.replace(”@”,”$”) Its output is as follows − 0 Tom 1 William Rick 2 John 3 Alber@t dtype: object After replacing @ with $: 0 Tom 1 William Rick 2 John 3 Alber$t dtype: object repeat(value) Live Demo import pandas as pd s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print s.str.repeat(2) Its output is as follows − 0 Tom Tom 1 William Rick William Rick 2 JohnJohn 3 Alber@tAlber@t dtype: object count(pattern) Live Demo import pandas as pd s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print (“The number of ”m”s in each string:”) print s.str.count(”m”) Its output is as follows − The number of ”m”s in each string: 0 1 1 1 2 0 3 0 startswith(pattern) Live Demo import pandas as pd s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print (“Strings that start with ”T”:”) print s.str. 
startswith (”T”) Its output is as follows − 0 True 1 False 2 False 3 False dtype: bool endswith(pattern) Live Demo import pandas as pd s = pd.Series([”Tom ”, ” William Rick”, ”John”, ”Alber@t”]) print (“Strings that end with ”t”:”) print s.str.endswith(”t”) Its output is as follows − Strings
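The endswith() output above is cut off. For the remaining methods listed in the table (find, findall, swapcase, islower, isupper, isnumeric), a minimal Python 3 sketch on the same sample data would be:

import pandas as pd

s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print(s.str.find('e'))      # position of the first 'e', or -1 if absent
print(s.str.findall('e'))   # list of every match of the pattern
print(s.str.swapcase())     # swap upper and lower case
print(s.str.islower())      # are all cased characters lower case?
print(s.str.isupper())      # are all cased characters upper case?
print(s.str.isnumeric())    # are all characters numeric?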
Python Pandas – Basic Functionality ”; Previous Next By now, we learnt about the three Pandas DataStructures and how to create them. We will majorly focus on the DataFrame objects because of its importance in the real time data processing and also discuss a few other DataStructures. Series Basic Functionality Sr.No. Attribute or Method & Description 1 axes Returns a list of the row axis labels 2 dtype Returns the dtype of the object. 3 empty Returns True if series is empty. 4 ndim Returns the number of dimensions of the underlying data, by definition 1. 5 size Returns the number of elements in the underlying data. 6 values Returns the Series as ndarray. 7 head() Returns the first n rows. 8 tail() Returns the last n rows. Let us now create a Series and see all the above tabulated attributes operation. Example Live Demo import pandas as pd import numpy as np #Create a series with 100 random numbers s = pd.Series(np.random.randn(4)) print s Its output is as follows − 0 0.967853 1 -0.148368 2 -1.395906 3 -1.758394 dtype: float64 axes Returns the list of the labels of the series. Live Demo import pandas as pd import numpy as np #Create a series with 100 random numbers s = pd.Series(np.random.randn(4)) print (“The axes are:”) print s.axes Its output is as follows − The axes are: [RangeIndex(start=0, stop=4, step=1)] The above result is a compact format of a list of values from 0 to 5, i.e., [0,1,2,3,4]. empty Returns the Boolean value saying whether the Object is empty or not. True indicates that the object is empty. Live Demo import pandas as pd import numpy as np #Create a series with 100 random numbers s = pd.Series(np.random.randn(4)) print (“Is the Object empty?”) print s.empty Its output is as follows − Is the Object empty? False ndim Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns Live Demo import pandas as pd import numpy as np #Create a series with 4 random numbers s = pd.Series(np.random.randn(4)) print s print (“The dimensions of the object:”) print s.ndim Its output is as follows − 0 0.175898 1 0.166197 2 -0.609712 3 -1.377000 dtype: float64 The dimensions of the object: 1 size Returns the size(length) of the series. Live Demo import pandas as pd import numpy as np #Create a series with 4 random numbers s = pd.Series(np.random.randn(2)) print s print (“The size of the object:”) print s.size Its output is as follows − 0 3.078058 1 -1.207803 dtype: float64 The size of the object: 2 values Returns the actual data in the series as an array. Live Demo import pandas as pd import numpy as np #Create a series with 4 random numbers s = pd.Series(np.random.randn(4)) print s print (“The actual data series is:”) print s.values Its output is as follows − 0 1.787373 1 -0.605159 2 0.180477 3 -0.140922 dtype: float64 The actual data series is: [ 1.78737302 -0.60515881 0.18047664 -0.1409218 ] Head & Tail To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods. head() returns the first n rows(observe the index values). The default number of elements to display is five, but you may pass a custom number. 
Live Demo import pandas as pd import numpy as np #Create a series with 4 random numbers s = pd.Series(np.random.randn(4)) print (“The original series is:”) print s print (“The first two rows of the data series:”) print s.head(2) Its output is as follows − The original series is: 0 0.720876 1 -0.765898 2 0.479221 3 -0.139547 dtype: float64 The first two rows of the data series: 0 0.720876 1 -0.765898 dtype: float64 tail() returns the last n rows(observe the index values). The default number of elements to display is five, but you may pass a custom number. Live Demo import pandas as pd import numpy as np #Create a series with 4 random numbers s = pd.Series(np.random.randn(4)) print (“The original series is:”) print s print (“The last two rows of the data series:”) print s.tail(2) Its output is as follows − The original series is: 0 -0.655091 1 -0.881407 2 -0.608592 3 -2.341413 dtype: float64 The last two rows of the data series: 2 -0.608592 3 -2.341413 dtype: float64 DataFrame Basic Functionality Let us now understand what DataFrame Basic Functionality is. The following tables lists down the important attributes or methods that help in DataFrame Basic Functionality. Sr.No. Attribute or Method & Description 1 T Transposes rows and columns. 2 axes Returns a list with the row axis labels and column axis labels as the only members. 3 dtypes Returns the dtypes in this object. 4 empty True if NDFrame is entirely empty [no items]; if any of the axes are of length 0. 5 ndim Number of axes / array dimensions. 6 shape Returns a tuple representing the dimensionality of the DataFrame. 7 size Number of elements in the NDFrame. 8 values Numpy representation of NDFrame. 9 head() Returns the first n rows. 10 tail() Returns last n rows. Let us now create a DataFrame and see all how the above mentioned attributes operate. Example Live Demo import pandas as pd import numpy as np #Create a Dictionary of series d = {”Name”:pd.Series([”Tom”,”James”,”Ricky”,”Vin”,”Steve”,”Smith”,”Jack”]), ”Age”:pd.Series([25,26,25,23,30,29,23]), ”Rating”:pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])} #Create a DataFrame df = pd.DataFrame(d) print (“Our data series is:”) print df Its output is as follows − Our data series is: Age Name Rating 0 25 Tom 4.23 1 26 James 3.24 2 25 Ricky 3.98 3 23 Vin 2.56 4 30 Steve 3.20 5 29 Smith 4.60 6 23 Jack 3.80 T (Transpose) Returns the transpose of the DataFrame. The rows and columns will interchange. Live Demo import pandas as pd import numpy as np # Create a Dictionary of series d = {”Name”:pd.Series([”Tom”,”James”,”Ricky”,”Vin”,”Steve”,”Smith”,”Jack”]), ”Age”:pd.Series([25,26,25,23,30,29,23]), ”Rating”:pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])} # Create a DataFrame df = pd.DataFrame(d) print (“The transpose of the data series is:”) print df.T Its output is as follows −
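The transpose output above is cut off in this text. A short, runnable recap of the DataFrame attributes discussed in this chapter, written in Python 3 syntax with the same dictionary of series, is sketched below:

import pandas as pd

d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack']),
   'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
   'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(d)

print(df.T)        # rows and columns interchanged
print(df.axes)     # [row index, column index]
print(df.dtypes)   # dtype of each column
print(df.shape)    # (7, 3)
print(df.size)     # 21
print(df.head(2))  # first two rows
print(df.tail(2))  # last two rows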