Python Processing CSV Data

Python – Processing CSV Data

Reading data from CSV (comma-separated values) files is a fundamental necessity in data science. We often receive data from various sources exported to CSV format so that it can be used by other systems. The pandas library provides features with which we can read a CSV file in full, as well as in parts, for only a selected group of columns and rows.

Input as CSV File

A CSV file is a text file in which the values in the columns are separated by commas. Let us consider the following data, present in a file named input.csv. You can create this file using Windows Notepad by copying and pasting this data. Save the file as input.csv using the Save As dialog with the "All files (*.*)" option in Notepad.

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

Reading a CSV File

The read_csv function of the pandas library is used to read the content of a CSV file into the Python environment as a pandas DataFrame. The function can read a file from the OS when given a proper path to the file.

import pandas as pd
data = pd.read_csv('path/input.csv')
print(data)

When we execute the above code, it produces the following result. Please note how an additional index column, starting at zero, has been created by the function.

   id    name  salary start_date        dept
0   1    Rick  623.30 2012-01-01          IT
1   2     Dan  515.20 2013-09-23  Operations
2   3   Tusar  611.00 2014-11-15          IT
3   4    Ryan  729.00 2014-05-11          HR
4   5    Gary  843.25 2015-03-27     Finance
5   6   Rasmi  578.00 2013-05-21          IT
6   7  Pranab  632.80 2013-07-30  Operations
7   8    Guru  722.50 2014-06-17     Finance

Reading Specific Rows

The read_csv function of the pandas library can also be used to read specific rows for a given column.
We slice the result from the read_csv function using the code shown below to get the first 5 rows of the column named salary.

import pandas as pd
data = pd.read_csv('path/input.csv')
# Slice the result to the first 5 rows of the salary column
print(data[0:5]['salary'])

When we execute the above code, it produces the following result.

0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64

Reading Specific Columns

The read_csv function of the pandas library can also be used to read specific columns. We use the multi-axes indexer .loc for this purpose. We choose to display the salary and name columns for all rows.

import pandas as pd
data = pd.read_csv('path/input.csv')
# Use the multi-axes indexer .loc
print(data.loc[:, ['salary', 'name']])

When we execute the above code, it produces the following result.

   salary    name
0  623.30    Rick
1  515.20     Dan
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
7  722.50    Guru

Reading Specific Columns and Rows

The read_csv function of the pandas library can also be used to read specific columns and specific rows. We again use the multi-axes indexer .loc, this time choosing to display the salary and name columns for some of the rows.

import pandas as pd
data = pd.read_csv('path/input.csv')
# Use the multi-axes indexer .loc
print(data.loc[[1, 3, 5], ['salary', 'name']])

When we execute the above code, it produces the following result.

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading Specific Columns for a Range of Rows

The read_csv function of the pandas library can also be used to read specific columns for a range of rows. Note that with .loc, both endpoints of the row range are included.

import pandas as pd
data = pd.read_csv('path/input.csv')
# Use the multi-axes indexer .loc
print(data.loc[2:6, ['salary', 'name']])

When we execute the above code, it produces the following result.
   salary    name
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
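As an alternative to slicing after the whole file has been loaded, read_csv can restrict what is parsed in the first place, via its usecols and nrows parameters. The sketch below is self-contained: it feeds the chapter's sample data through an in-memory StringIO buffer instead of a file path, which is an assumption made purely so the example runs on its own; with a real file you would pass the path as before.

```python
from io import StringIO

import pandas as pd

# The sample data from this chapter, held in memory instead of on disk.
csv_text = """id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
"""

# usecols restricts parsing to the named columns; nrows limits how many
# data rows are read. Only 5 rows of 2 columns are ever materialized.
data = pd.read_csv(StringIO(csv_text), usecols=["name", "salary"], nrows=5)
print(data)
```

For very large files this saves both time and memory compared with loading everything and slicing afterwards.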

Python Data Science – Pandas

Python Data Science – Pandas

What is Pandas?

Pandas is an open-source Python library used for high-performance data manipulation and data analysis using its powerful data structures. Python with pandas is in use in a variety of academic and commercial domains, including finance, economics, statistics, advertising, web analytics, and more. Using pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data − load, organize, manipulate, model, and analyse.

Below are some of the important features of pandas that are used specifically for data processing and data analysis work.

Key Features of Pandas

Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High-performance merging and joining of data.
Time series functionality.

Pandas deals with the following two data structures −

Series
DataFrame

These data structures are built on top of the NumPy array, making them fast and efficient.

Dimension & Description

The best way to think of these data structures is that the higher-dimensional data structure is a container of its lower-dimensional data structure. For example, a DataFrame is a container of Series.

Data Structure   Dimensions   Description
Series           1            1D labeled homogeneous array, size-immutable.
DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

DataFrame is the most widely used and the most important of these data structures.

Series

A Series is a one-dimensional array-like structure with homogeneous data.
For example, the following series is a collection of integers 10, 23, 56, …

10  23  56  17  52  61  73  90  26  72

Key Points of Series

Homogeneous data
Size immutable
Values of data mutable

DataFrame

A DataFrame is a two-dimensional structure with heterogeneous data. For example,

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.

Data Type of Columns

The data types of the four columns are as follows −

Column   Type
Name     String
Age      Integer
Gender   String
Rating   Float

Key Points of DataFrame

Heterogeneous data
Size mutable
Data mutable

We will see many examples of using the pandas library of Python in data science work in the next chapters.
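The sales-team table above can be built directly as a DataFrame, which also illustrates the container relationship: each column of a DataFrame is itself a Series. This is a minimal sketch using the values shown in the table.

```python
import pandas as pd

# The sales-team table, as a DataFrame of four columns.
df = pd.DataFrame({
    "Name":   ["Steve", "Lia", "Vin", "Katie"],
    "Age":    [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

# Selecting one column yields a Series with homogeneous data.
ages = df["Age"]
print(type(ages).__name__)   # Series
print(df.dtypes)             # per-column data types, as in the table above
```

Printing df.dtypes shows the String/Integer/Float column types from the "Data Type of Columns" table (pandas reports strings as object).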

Python Data Science – SciPy

Python Data Science – SciPy

What is SciPy?

The SciPy library of Python is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as those for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, yet powerful enough to be depended on by some of the world's leading scientists and engineers.

SciPy Sub-packages

SciPy is organized into sub-packages covering different scientific computing domains. These are summarized in the following table −

scipy.constants     Physical and mathematical constants
scipy.fftpack       Fourier transforms
scipy.integrate     Integration routines
scipy.interpolate   Interpolation
scipy.io            Data input and output
scipy.linalg        Linear algebra routines
scipy.optimize      Optimization
scipy.signal        Signal processing
scipy.sparse        Sparse matrices
scipy.spatial       Spatial data structures and algorithms
scipy.special       Special mathematical functions
scipy.stats         Statistics

Data Structure

The basic data structure used by SciPy is the multidimensional array provided by the NumPy module. NumPy provides some functions for linear algebra, Fourier transforms and random number generation, but not with the generality of the equivalent functions in SciPy.

We will see many examples of using the SciPy library of Python in data science work in the next chapters.
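As a small taste of two of the sub-packages listed above, the sketch below integrates a function numerically with scipy.integrate and solves a linear system with scipy.linalg. The particular function and system are illustrative choices, not from the text.

```python
import numpy as np
from scipy import integrate, linalg

# Numerically integrate sin(x) over [0, pi]; the exact answer is 2.
area, abs_err = integrate.quad(np.sin, 0, np.pi)
print(area)

# Solve the linear system A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)   # the solution is x = 2, y = 3
```

Note how both routines accept and return plain NumPy arrays − the shared data structure described above.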

Python Processing XLS Data

Python – Processing XLS Data

Microsoft Excel is a very widely used spreadsheet program. Its user-friendliness and appealing features make it a very frequently used tool in data science. The pandas library provides features with which we can read an Excel file in full, as well as in parts, for only a selected group of data. We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.

Input as Excel File

We create an Excel file with multiple sheets in the Windows OS. The data in the different sheets is as shown below. You can create this file using the Excel program in the Windows OS. Save the file as input.xlsx.

# Data in Sheet1
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

# Data in Sheet2
id   name     zipcode
1    Rick     301224
2    Dan      341255
3    Tusar    297704
4    Ryan     216650
5    Gary     438700
6    Rasmi    665100
7    Pranab   341211
8    Guru     347480

Reading an Excel File

The read_excel function of the pandas library is used to read the content of an Excel file into the Python environment as a pandas DataFrame. The function can read a file from the OS when given a proper path to the file. By default, the function reads Sheet1.

import pandas as pd
data = pd.read_excel('path/input.xlsx')
print(data)

When we execute the above code, it produces the following result. Please note how an additional index column, starting at zero, has been created by the function.
   id    name  salary start_date        dept
0   1    Rick  623.30 2012-01-01          IT
1   2     Dan  515.20 2013-09-23  Operations
2   3   Tusar  611.00 2014-11-15          IT
3   4    Ryan  729.00 2014-05-11          HR
4   5    Gary  843.25 2015-03-27     Finance
5   6   Rasmi  578.00 2013-05-21          IT
6   7  Pranab  632.80 2013-07-30  Operations
7   8    Guru  722.50 2014-06-17     Finance

Reading Specific Columns and Rows

Similar to what we have already seen in the previous chapter for reading a CSV file, the read_excel function of the pandas library can also be used to read specific columns and specific rows. We use the multi-axes indexer .loc for this purpose. We choose to display the salary and name columns for some of the rows.

import pandas as pd
data = pd.read_excel('path/input.xlsx')
# Use the multi-axes indexer .loc
print(data.loc[[1, 3, 5], ['salary', 'name']])

When we execute the above code, it produces the following result.

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading Multiple Excel Sheets

Multiple sheets with different data formats can also be read by using the read_excel function with the help of a wrapper class named ExcelFile. It reads the multiple sheets into memory only once. In the below example we read Sheet1 and Sheet2 into two DataFrames and print them out individually.

import pandas as pd
with pd.ExcelFile('C:/Users/Rasmi/Documents/pydatasci/input.xlsx') as xls:
    df1 = pd.read_excel(xls, 'Sheet1')
    df2 = pd.read_excel(xls, 'Sheet2')

print("****Result Sheet 1****")
print(df1[0:5]['salary'])
print("")
print("****Result Sheet 2****")
print(df2[0:5]['zipcode'])

When we execute the above code, it produces the following result.

****Result Sheet 1****
0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64

****Result Sheet 2****
0    301224
1    341255
2    297704
3    216650
4    438700
Name: zipcode, dtype: int64
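A related convenience is passing sheet_name=None to read_excel, which returns every sheet at once as a dict keyed by sheet name. The sketch below first writes a small two-sheet workbook to a temporary file so that it is self-contained; it assumes an Excel engine such as openpyxl (the default for .xlsx in modern pandas) is installed.

```python
import os
import tempfile

import pandas as pd

# Two small frames standing in for Sheet1 and Sheet2 of input.xlsx.
df_salary = pd.DataFrame({"id": [1, 2], "name": ["Rick", "Dan"],
                          "salary": [623.3, 515.2]})
df_zip = pd.DataFrame({"id": [1, 2], "name": ["Rick", "Dan"],
                       "zipcode": [301224, 341255]})

# Write both frames as separate sheets of one workbook.
path = os.path.join(tempfile.mkdtemp(), "input.xlsx")
with pd.ExcelWriter(path) as writer:
    df_salary.to_excel(writer, sheet_name="Sheet1", index=False)
    df_zip.to_excel(writer, sheet_name="Sheet2", index=False)

# sheet_name=None reads every sheet into a dict keyed by sheet name.
sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets))
print(sheets["Sheet2"]["zipcode"].tolist())
```

This avoids naming each sheet explicitly when you simply want all of them.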

Python Data Science – NumPy

Python Data Science – NumPy

What is NumPy?

NumPy is a Python package whose name stands for "Numerical Python". It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.

Operations using NumPy

Using NumPy, a developer can perform the following operations −

Mathematical and logical operations on arrays.
Fourier transforms and routines for shape manipulation.
Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.

NumPy – A Replacement for MatLab

NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MatLab, a popular platform for technical computing. However, the Python alternative to MatLab is now seen as a more modern and complete programming language. It is open source, which is an added advantage of NumPy.

ndarray Object

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index. Every item in an ndarray takes the same size of block in memory. Each element in an ndarray has the same data type, described by a data-type object (dtype). Any item extracted from an ndarray object (by slicing) is represented by a Python object of one of the array scalar types.

We will see many examples of using the NumPy library of Python in data science work in the next chapters.
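The properties of the ndarray described above − a single shared dtype, zero-based indexing, and fast element-wise operations − can be seen in a short sketch (the array values are illustrative):

```python
import numpy as np

# A 2-D ndarray: every item shares one dtype and a fixed-size memory block.
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.ndim, a.shape, a.dtype)   # dimensions, shape, and the shared dtype

# Element-wise (vectorized) arithmetic, zero-based indexing, and slicing:
print(a * 2)          # every element doubled
print(a[0, 2])        # item at row 0, column 2
print(a[:, 1])        # the second column, itself an ndarray
```

Slicing out a[0, 2] returns an array scalar, as described above, rather than a plain Python int.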

Fundamentals

Python Deep Learning – Fundamentals

In this chapter, we will look into the fundamentals of Python deep learning.

Deep learning models/algorithms

Let us now learn about the different deep learning models/algorithms. Some of the popular models within deep learning are as follows −

Convolutional neural networks
Recurrent neural networks
Deep belief networks
Generative adversarial networks
Auto-encoders and so on

The inputs and outputs are represented as vectors or tensors. For example, a neural network may have inputs where the individual pixel RGB values in an image are represented as vectors.

The layers of neurons that lie between the input layer and the output layer are called hidden layers. This is where most of the work happens when the neural net tries to solve problems. Taking a closer look at the hidden layers can reveal a lot about the features the network has learned to extract from the data.

Different architectures of neural networks are formed by choosing which neurons to connect to the other neurons in the next layer.

Pseudocode for calculating output

Following is the pseudocode for calculating the output of a forward-propagating neural network −

# node[]      := array of topologically sorted nodes
# An edge from a to b means a is to the left of b
# If the neural network has R inputs and S outputs,
# then the first R nodes are input nodes and the last S nodes are output nodes.
# incoming[x] := nodes connected to node x
# weight[x]   := weights of incoming edges to x

For each neuron x, from left to right:
    if x <= R: do nothing    # it is an input node
    inputs[x] = [output[i] for i in incoming[x]]
    weighted_sum = dot_product(weights[x], inputs[x])
    output[x] = activation_function(weighted_sum)
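The pseudocode above can be turned into a minimal executable sketch. The sigmoid activation, the tiny 2-input / 1-hidden / 1-output topology, and the weight values here are illustrative assumptions, not part of the original pseudocode; nodes are 0-indexed, so the first R nodes are inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(nodes, incoming, weights, inputs, R):
    """nodes are topologically sorted; the first R are input nodes."""
    output = {}
    for x in nodes:
        if x < R:
            # Input node: its output is just the supplied input value.
            output[x] = inputs[x]
        else:
            ins = [output[i] for i in incoming[x]]
            weighted_sum = sum(w * v for w, v in zip(weights[x], ins))
            output[x] = sigmoid(weighted_sum)
    return output

# Toy network: inputs 0 and 1 feed hidden node 2, which feeds output node 3.
nodes = [0, 1, 2, 3]
incoming = {2: [0, 1], 3: [2]}
weights = {2: [0.5, -0.5], 3: [1.0]}
out = forward(nodes, incoming, weights, inputs={0: 1.0, 1: 1.0}, R=2)
print(out[3])
```

With these weights, node 2 computes sigmoid(0.5 − 0.5) = 0.5, and the output node computes sigmoid(0.5).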

Quick Guide

Python Deep Learning – Quick Guide

Python Deep Learning – Introduction

Deep structured learning, or hierarchical learning, or deep learning in short, is part of the family of machine learning methods, which are themselves a subset of the broader field of artificial intelligence.

Deep learning is a class of machine learning algorithms that use several layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input.

Deep neural networks, deep belief networks and recurrent neural networks have been applied to fields such as computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bioinformatics, where they have produced results comparable to, and in some cases better than, human experts.

Deep learning algorithms and networks −

are based on the unsupervised learning of multiple levels of features or representations of the data. Higher-level features are derived from lower-level features to form a hierarchical representation.
use some form of gradient descent for training.

Python Deep Learning – Environment

In this chapter, we will learn about the environment set up for Python deep learning. We have to install the following software for making deep learning algorithms.

Python 2.7+
SciPy with NumPy
Matplotlib
Theano
Keras
TensorFlow

It is strongly recommended that Python, NumPy, SciPy, and Matplotlib are installed through the Anaconda distribution, which comes with all of those packages.

We need to ensure that the different types of software are installed properly.
Let us go to our command line program and type in the following command −

$ python
Python 3.6.3 |Anaconda custom (32-bit)| (default, Oct 13 2017, 14:21:34)
[GCC 7.2.0] on linux

Next, we can import the required libraries and print their versions −

import numpy
print(numpy.__version__)

Output

1.14.2

Installation of Theano, TensorFlow and Keras

Before we begin with the installation of the packages − Theano, TensorFlow and Keras − we need to confirm that pip is installed. pip is Python's package installer, and it ships with the Anaconda distribution. To confirm the installation of pip, type the following in the command line −

$ pip

Once the installation of pip is confirmed, we can install Theano, TensorFlow and Keras by executing the following commands −

$ pip install theano
$ pip install tensorflow
$ pip install keras

Confirm the installation of Theano by executing the following line of code −

$ python -c "import theano; print(theano.__version__)"

Output

1.0.1

Confirm the installation of TensorFlow by executing the following line of code −

$ python -c "import tensorflow; print(tensorflow.__version__)"

Output

1.7.0

Confirm the installation of Keras by executing the following line of code −

$ python -c "import keras; print(keras.__version__)"
Using TensorFlow backend.

Output

2.1.5

Python Deep Learning – Basic Machine Learning

Artificial Intelligence (AI) is any code, algorithm or technique that enables a computer to mimic human cognitive behaviour or intelligence. Machine Learning (ML) is a subset of AI that uses statistical methods to enable machines to learn and improve with experience. Deep learning is a subset of machine learning that makes the computation of multi-layer neural networks feasible. Machine learning is seen as shallow learning, while deep learning is seen as hierarchical learning with abstraction.

Machine learning deals with a wide range of concepts.
The concepts are listed below −

supervised learning
unsupervised learning
reinforcement learning
linear regression
cost functions
overfitting
under-fitting
hyper-parameters, etc.

In supervised learning, we learn to predict values from labelled data. One ML technique that helps here is classification, where the target values are discrete; for example, cats and dogs. Another technique that can help is regression, where the target values are continuous; for example, stock market data can be analysed using regression.

In unsupervised learning, we make inferences from input data that is not labelled or structured. If we have a million medical records and we have to make sense of them − find the underlying structure, find outliers or detect anomalies − we use clustering techniques to divide the data into broad clusters.

Data sets are divided into training sets, testing sets, validation sets and so on.

A breakthrough in 2012 brought the concept of deep learning into prominence. An algorithm classified 1 million images into 1000 categories successfully, using 2 GPUs and the latest technologies like big data.

Relating Deep Learning and Traditional Machine Learning

One of the major challenges encountered in traditional machine learning models is a process called feature extraction. The programmer needs to be specific and tell the computer which features to look out for. These features help in making decisions. Entering raw data into the algorithm rarely works, so feature extraction is a critical part of the traditional machine learning workflow.

This places a huge responsibility on the programmer, and the algorithm's efficiency relies heavily on how inventive the programmer is. For complex problems such as object recognition or handwriting recognition, this is a huge issue.

Deep learning, with the ability to learn multiple layers of representation, is one of the few methods that has helped us with automatic feature extraction.
The lower layers can be assumed to be performing automatic feature extraction, requiring little or no guidance from the programmer.

Artificial Neural Networks

The artificial neural network, or just neural network for short, is not a new idea. It has been around for about 80 years. It was not until 2011, however, that deep neural networks became popular, with the use of new techniques, the availability of huge datasets, and powerful computers.

A neural network mimics a neuron, which has dendrites, a nucleus, an axon, and a terminal axon. For a network, we need two neurons. These neurons transfer information via the synapse between the dendrites of one and the terminal axon of another.

In a diagram of a neural network, the circles are neurons or nodes, with their functions on the data, and the lines/edges connecting them are the weights/information being passed along. Each column is a layer.
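A single artificial neuron can be captured in a few lines: it takes a weighted sum of its inputs plus a bias, then applies an activation function. The weights, bias, and sigmoid activation in this sketch are illustrative assumptions.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus a bias term...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...passed through a sigmoid activation function.
    return 1.0 / (1.0 + math.exp(-z))

# With these weights the weighted sum is 0.5*1.0 + (-0.25)*2.0 = 0,
# and sigmoid(0) = 0.5.
print(neuron([1.0, 2.0], [0.5, -0.25], bias=0.0))
```

Stacking many such neurons in columns (layers), with each layer feeding the next, gives the network structure described above.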

Computational Graphs

Computational Graphs

Backpropagation is implemented in deep learning frameworks like TensorFlow, Torch, Theano, etc., by using computational graphs. More significantly, understanding backpropagation on computational graphs unifies several different algorithms and their variations, such as backprop through time and backprop with shared weights. Once everything is converted into a computational graph, they are all the same algorithm − just backpropagation on computational graphs.

What is a Computational Graph

A computational graph is defined as a directed graph where the nodes correspond to mathematical operations. Computational graphs are a way of expressing and evaluating a mathematical expression.

For example, here is a simple mathematical equation −

$$p = x + y$$

We can draw a computational graph of the above equation as follows. The computational graph has an addition node (a node with a "+" sign) with two input variables x and y and one output p.

Let us take another, slightly more complex example. We have the following equation.

$$g = \left ( x + y \right ) \ast z$$

The above equation is represented by the following computational graph.

Computational Graphs and Backpropagation

Computational graphs and backpropagation are both important core concepts in deep learning for training neural networks.

Forward Pass

The forward pass is the procedure for evaluating the value of the mathematical expression represented by a computational graph. Doing a forward pass means passing the values of the variables forward from the left (input) to the right, where the output is.

Let us consider an example by giving some values to all of the inputs. Suppose the following values are given to all of the inputs.

$$x = 1, y = 3, z = -3$$

By giving these values to the inputs, we can perform the forward pass and get the following values for the outputs at each node. First, we use the values x = 1 and y = 3 to get p = 4. Then we use p = 4 and z = -3 to get g = -12.
We go from left to right, forwards.

Objectives of Backward Pass

In the backward pass, our intention is to compute the gradient of the final output with respect to each input. These gradients are essential for training the neural network using gradient descent.

For example, we desire the following gradients.

Desired gradients

$$\frac{\partial g}{\partial x}, \frac{\partial g}{\partial y}, \frac{\partial g}{\partial z}$$

Backward pass (backpropagation)

We start the backward pass by finding the derivative of the final output with respect to the final output (itself!). This is the identity derivative, and its value is equal to one.

$$\frac{\partial g}{\partial g} = 1$$

Our computational graph now looks as shown below −

Next, we will do the backward pass through the "*" operation. We will calculate the gradients at p and z. Since g = p*z, we know that −

$$\frac{\partial g}{\partial z} = p$$

$$\frac{\partial g}{\partial p} = z$$

We already know the values of z and p from the forward pass. Hence, we get −

$$\frac{\partial g}{\partial z} = p = 4$$

and

$$\frac{\partial g}{\partial p} = z = -3$$

We want to calculate the gradients at x and y −

$$\frac{\partial g}{\partial x}, \frac{\partial g}{\partial y}$$

However, we want to do this efficiently (although x and g are only two hops away in this graph, imagine them being really far from each other). To calculate these values efficiently, we will use the chain rule of differentiation. From the chain rule, we have −

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial p}\ast \frac{\partial p}{\partial x}$$

$$\frac{\partial g}{\partial y} = \frac{\partial g}{\partial p}\ast \frac{\partial p}{\partial y}$$

But we already know that $\frac{\partial g}{\partial p} = -3$, and $\frac{\partial p}{\partial x}$ and $\frac{\partial p}{\partial y}$ are easy, since p directly depends on x and y.
We have −

$$p = x + y \Rightarrow \frac{\partial p}{\partial x} = 1, \frac{\partial p}{\partial y} = 1$$

Hence, we get −

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial p}\ast \frac{\partial p}{\partial x} = \left ( -3 \right )\cdot 1 = -3$$

In addition, for the input y −

$$\frac{\partial g}{\partial y} = \frac{\partial g}{\partial p}\ast \frac{\partial p}{\partial y} = \left ( -3 \right )\cdot 1 = -3$$

The main reason for doing this backwards is that when we had to calculate the gradient at x, we only used already computed values and $\frac{\partial p}{\partial x}$ (the derivative of a node's output with respect to that same node's input). We used local information to compute a global value.

Steps for training a neural network

Follow these steps to train a neural network −

For a data point x in the dataset, we do a forward pass with x as input, and calculate the cost c as output.
We do a backward pass starting at c, and calculate the gradients for all nodes in the graph. This includes nodes that represent the neural network weights.
We then update the weights by doing W = W − learning_rate * gradients.
We repeat this process until the stop criteria is met.
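The whole worked example − the forward pass and the backward pass through both nodes of the graph for g = (x + y) * z − fits in a few lines of code, using the chapter's values x = 1, y = 3, z = -3:

```python
x, y, z = 1.0, 3.0, -3.0

# Forward pass, left to right through the graph.
p = x + y          # addition node:       p = 4
g = p * z          # multiplication node: g = -12

# Backward pass (chain rule), right to left, starting from dg/dg = 1.
dg_dg = 1.0
dg_dp = z * dg_dg      # local derivative of "*" w.r.t. p, times upstream
dg_dz = p * dg_dg      # local derivative of "*" w.r.t. z, times upstream
dg_dx = dg_dp * 1.0    # dp/dx = 1 for the "+" node
dg_dy = dg_dp * 1.0    # dp/dy = 1 for the "+" node

print(g, dg_dx, dg_dy, dg_dz)
```

Each gradient is computed from the node's local derivative and the already-computed upstream gradient, which is exactly the efficiency argument made above.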

Useful Resources

Python Deep Learning – Useful Resources

The following resources contain additional information on Python Deep Learning. Please use them to get more in-depth knowledge on this topic.

Useful Video Courses

Computer Vision and Deep Learning in Python: Novice to Expert − Abhilash Nelson (107 lectures, 13.5 hours)
Natural Language Processing with Deep Learning Course − Mike West (60 lectures, 2.5 hours)
Deep Learning And Neural Networks With Python − Spotle.ai (24 lectures, 2.5 hours)
Computer Vision Powered By Deep Learning, OpenCV And Python − Spotle.ai (20 lectures, 2 hours)
Performance Tuning Deep Learning in Python – A Masterclass − Packt Publishing (115 lectures, 4.5 hours)
Deep Learning with Python for Image Classification − Mazhar Hussain (22 lectures, 1.5 hours)

Discussion

Discuss Python Deep Learning

Python is a general-purpose, high-level programming language that is widely used in data science and for producing deep learning algorithms. This brief tutorial introduces Python and its libraries like NumPy, SciPy, Pandas, and Matplotlib, as well as frameworks like Theano, TensorFlow, and Keras. The tutorial explains how the different libraries and frameworks can be applied to solve complex real-world problems.