Python – Date and Time

Data science work often involves analysis based on temporal values. Python handles the various date and time formats gracefully. The datetime library provides the methods and functions needed for the following scenarios:

- Date Time Representation
- Date Time Arithmetic
- Date Time Comparison

We will study them one by one.

Date Time Representation

A date and its parts are represented through different datetime attributes and functions. Format specifiers control how the alphabetical parts of a date, such as the name of the month or the weekday, are displayed. The following code shows today's date and its parts.

    import datetime

    print("The Date Today is :", datetime.datetime.today())

    date_today = datetime.date.today()
    print(date_today)
    print("This Year :", date_today.year)
    print("This Month :", date_today.month)
    print("Month Name:", date_today.strftime("%B"))
    # .day is the day of the month, not the day of the week
    print("Day of Month :", date_today.day)
    print("Week Day Name:", date_today.strftime("%A"))

When we execute the above code, it produces a result like the following (the values depend on the day the code is run).

    The Date Today is : 2018-04-22 15:38:35.835000
    2018-04-22
    This Year : 2018
    This Month : 4
    Month Name: April
    Day of Month : 22
    Week Day Name: Sunday

Date Time Arithmetic

For calculations involving dates, we store the dates in variables and apply the relevant arithmetic operator to these variables.
    import datetime

    # Capture the first date
    day1 = datetime.date(2018, 2, 12)
    print("day1:", day1.ctime())

    # Capture the second date
    day2 = datetime.date(2017, 8, 18)
    print("day2:", day2.ctime())

    # Find the difference between the dates (a timedelta)
    print("Number of Days:", day1 - day2)

    date_today = datetime.date.today()

    # Create a delta of four days
    no_of_days = datetime.timedelta(days=4)

    # Use the delta for a past date
    before_four_days = date_today - no_of_days
    print("Before Four Days:", before_four_days)

    # Use the delta for a future date
    after_four_days = date_today + no_of_days
    print("After Four Days:", after_four_days)

When we execute the above code, it produces the following result (the last two lines assume the code was run on 2018-04-22).

    day1: Mon Feb 12 00:00:00 2018
    day2: Fri Aug 18 00:00:00 2017
    Number of Days: 178 days, 0:00:00
    Before Four Days: 2018-04-18
    After Four Days: 2018-04-26

Date Time Comparison

Dates and times are compared using the comparison operators (==, <, >). We must be careful to compare like with like, for example a date with a date and not a date with a datetime. In the example below we build a past and a future date and compare them with a fixed date using the Python if clause.

    import datetime

    date_today = datetime.date.today()
    print("Today is: ", date_today)

    # Create a delta of four days
    no_of_days = datetime.timedelta(days=4)

    # Use the delta for a past date and a future date
    before_four_days = date_today - no_of_days
    print("Before Four Days:", before_four_days)
    after_four_days = date_today + no_of_days

    date1 = datetime.date(2018, 4, 4)
    print("date1:", date1)

    if date1 == before_four_days:
        print("Same Dates")
    # True when date1 falls before today
    if date_today > date1:
        print("Past Date")
    # True when date1 falls before the date four days from now
    if date1 < after_four_days:
        print("Future Date")

When we execute the above code on 2018-04-22, it produces the following result.

    Today is:  2018-04-22
    Before Four Days: 2018-04-18
    date1: 2018-04-04
    Past Date
    Future Date
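As a compact recap of the arithmetic and comparison steps in this chapter, here is a minimal sketch that uses fixed dates so the result does not change from day to day. The variable names are illustrative, and the weekday name printed by strftime assumes an English locale.

```python
from datetime import date, datetime, timedelta

# Fixed dates keep the result reproducible (date.today() changes daily).
start = datetime.strptime("2017-08-18", "%Y-%m-%d").date()  # parse text into a date
end = date(2018, 2, 12)

gap = end - start                        # subtracting two dates yields a timedelta
days_between = gap.days                  # whole days as an int
one_week_later = end + timedelta(weeks=1)
is_earlier = start < end                 # dates compare chronologically

print(days_between)                      # 178
print(one_week_later.isoformat())        # 2018-02-19
print(end.strftime("%A, %d %B %Y"))      # weekday and month names follow the locale
print(is_earlier)                        # True
```

Reading the number of days from the timedelta's days attribute avoids parsing the "178 days, 0:00:00" text form.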
Python Data cleansing
Missing data is a constant problem in real-life scenarios. Areas like machine learning and data mining face severe accuracy issues in their model predictions because of poor data quality caused by missing values. In these areas, the treatment of missing values is a major point of focus for making models more accurate and valid.

When and Why Is Data Missed?

Consider an online survey for a product. People often do not share all the information related to them. Some share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. In one way or another a part of the data is always missing, and this is very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

    # import the pandas and numpy libraries
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])

    # Reindexing with labels that do not exist yet creates rows of NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    print(df)

Its output is as follows (the numbers are random, so they differ on every run):

            one       two     three
    a  0.077988  0.476149  0.965836
    b       NaN       NaN       NaN
    c -0.390208 -0.551605 -2.301950
    d       NaN       NaN       NaN
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    g       NaN       NaN       NaN
    h  0.085100  0.532791  0.887415

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.
Check for Missing Values

To make detecting missing values easier (and consistent across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    print(df['one'].isnull())

Its output is as follows:

    a    False
    b     True
    c    False
    d     True
    e    False
    f    False
    g     True
    h    False
    Name: one, dtype: bool

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which are illustrated in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace NaN with 0.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c'])

    print(df)
    print("NaN replaced with '0':")
    print(df.fillna(0))

Its output is as follows:

            one       two     three
    a -0.576991 -0.741695  0.553172
    b       NaN       NaN       NaN
    c  0.744328 -1.735166  1.749580
    NaN replaced with '0':
            one       two     three
    a -0.576991 -0.741695  0.553172
    b  0.000000  0.000000  0.000000
    c  0.744328 -1.735166  1.749580

Here we fill with the value zero; any other value can be used instead.

Fill NA Forward and Backward

Using the filling concepts discussed in the ReIndexing chapter, we can fill the missing values from their neighbours.
    Method            Action
    pad / ffill       Fill values forward
    bfill / backfill  Fill values backward

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    # Recent pandas releases deprecate the method argument; df.ffill() is the equivalent call
    print(df.fillna(method='pad'))

Its output is as follows:

            one       two     three
    a  0.077988  0.476149  0.965836
    b  0.077988  0.476149  0.965836
    c -0.390208 -0.551605 -2.301950
    d -0.390208 -0.551605 -2.301950
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    g -0.930230 -0.670473  1.146615
    h  0.085100  0.532791  0.887415

Drop Missing Values

If you simply want to exclude the missing values, use the dropna function along with the axis argument. By default axis=0, i.e., along rows, which means that if any value within a row is NA then the whole row is excluded.

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    print(df.dropna())

Its output is as follows:

            one       two     three
    a  0.077988  0.476149  0.965836
    c -0.390208 -0.551605 -2.301950
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    h  0.085100  0.532791  0.887415

Replace Missing (or) Generic Values

Many times we have to replace a generic value with some specific value. We can achieve this with the replace method. Replacing NA with a scalar value this way is equivalent to using the fillna() function.

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                       'two': [1000, 0, 30, 40, 50, 60]})

    # Map each value on the left to its replacement on the right
    print(df.replace({1000: 10, 2000: 60}))

Its output is as follows:

       one  two
    0   10   10
    1   20    0
    2   30   30
    3   40   40
    4   50   50
    5   60   60
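The cleansing steps in this chapter can be combined on one small frame. The sketch below is illustrative only: the data is made up so the results are reproducible, and filling with the column mean is one common choice that the chapter itself does not cover.

```python
import numpy as np
import pandas as pd

# Small hand-made frame (illustrative data, not taken from the chapter).
df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]})

missing_per_column = df.isnull().sum()    # number of NaN values in each column
filled_with_mean = df.fillna(df.mean())   # each NaN replaced by its column mean
forward_filled = df.ffill()               # carry the last valid value forward
rows_without_gaps = df.dropna()           # keep only rows that have no NaN

print(missing_per_column)
print(filled_with_mean)
print(forward_filled)
print(rows_without_gaps)
```

Forward filling leaves the leading NaN values in column two untouched, because there is no earlier value to carry forward.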
Python Data Operations
Python handles data of various formats mainly through two libraries, Pandas and NumPy. We have already seen the important features of these two libraries in the previous chapters. In this chapter we will see some basic examples from each library on how to operate on data.

Data Operations in Numpy

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index. An instance of the ndarray class can be constructed by different array creation routines described later in the tutorial. The basic ndarray is created using the array function in NumPy:

    numpy.array

Following are some examples of NumPy data handling.

Example 1

    # more than one dimension
    import numpy as np
    a = np.array([[1, 2], [3, 4]])
    print(a)

The output is as follows:

    [[1 2]
     [3 4]]

Example 2

    # minimum dimensions
    import numpy as np
    a = np.array([1, 2, 3, 4, 5], ndmin=2)
    print(a)

The output is as follows:

    [[1 2 3 4 5]]

Example 3

    # dtype parameter
    import numpy as np
    a = np.array([1, 2, 3], dtype=complex)
    print(a)

The output is as follows:

    [1.+0.j 2.+0.j 3.+0.j]

Data Operations in Pandas

Pandas handles data through Series, DataFrame and, in older releases, Panel. We will see some examples from each of these.

Pandas Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A pandas Series can be created using the following constructor:

    pandas.Series(data, index, dtype, copy)

Example

Here we create a Series from a NumPy array.
    # import the pandas library and alias it as pd
    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    print(s)

Its output is as follows:

    0    a
    1    b
    2    c
    3    d
    dtype: object

Pandas DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor:

    pandas.DataFrame(data, index, columns, dtype, copy)

Let us now create an indexed DataFrame using arrays.

    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
    print(df)

Its output is as follows (older pandas releases sort the columns alphabetically and show Age first):

            Name  Age
    rank1    Tom   28
    rank2   Jack   34
    rank3  Steve   29
    rank4  Ricky   42

Pandas Panel

A panel is a 3D container of data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the code below runs only on older releases.

A Panel can be created using the following constructor:

    pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

In the example below we create a panel from a dict of DataFrame objects.

    # creating a panel from a dict of DataFrames (pandas < 0.25 only)
    import pandas as pd
    import numpy as np

    data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
            'Item2': pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p)

Its output is as follows:

    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
    Items axis: Item1 to Item2
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 2
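Because the Panel class is not available in current pandas releases, the same dict of DataFrames can be held in a single DataFrame with a two-level row index. The sketch below shows one possible substitute rather than an equivalent of the Panel API; the level names "item" and "row" are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, so runs are repeatable

frames = {
    "Item1": pd.DataFrame(rng.standard_normal((4, 3))),
    "Item2": pd.DataFrame(rng.standard_normal((4, 2))),
}

# The dict keys become the outer level of a MultiIndex on the rows.
stacked = pd.concat(frames, names=["item", "row"])

print(stacked.shape)          # (8, 3): 2 items x 4 rows, 3 columns
print(stacked.loc["Item1"])   # the 4 x 3 block that belonged to Item1
print(stacked.loc["Item2"])   # column 2 is NaN because Item2 had only 2 columns
```

Selecting with stacked.loc["Item1"] plays the role that indexing a panel by item used to play.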
Python – Stemming and Lemmatization

In Natural Language Processing we come across situations where two or more words have a common root. For example, the three words agreed, agreeing and agreeable have the same root word, agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all the words to their root word. The NLTK library has methods to do this linking and return the root word.

The program below uses the Porter stemming algorithm for stemming.

    import nltk
    from nltk.stem.porter import PorterStemmer

    porter_stemmer = PorterStemmer()

    word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

    # First, word tokenization (requires the NLTK tokenizer data to be installed)
    nltk_tokens = nltk.word_tokenize(word_data)

    # Next, find the root of each word
    for w in nltk_tokens:
        print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))

When we execute the above code, it produces the following result.

    Actual: It  Stem: It
    Actual: originated  Stem: origin
    Actual: from  Stem: from
    Actual: the  Stem: the
    Actual: idea  Stem: idea
    Actual: that  Stem: that
    Actual: there  Stem: there
    Actual: are  Stem: are
    Actual: readers  Stem: reader
    Actual: who  Stem: who
    Actual: prefer  Stem: prefer
    Actual: learning  Stem: learn
    Actual: new  Stem: new
    Actual: skills  Stem: skill
    Actual: from  Stem: from
    Actual: the  Stem: the
    Actual: comforts  Stem: comfort
    Actual: of  Stem: of
    Actual: their  Stem: their
    Actual: drawing  Stem: draw
    Actual: rooms  Stem: room

Lemmatization is similar to stemming, but it brings vocabulary and context to the words. Instead of chopping off suffixes, it goes a step further and returns a valid dictionary form of the word (the lemma). For example, the plural cars is linked to car, and with part-of-speech information an irregular form such as better can be linked to good. In the program below we use the WordNet lexical database for lemmatization.
    import nltk
    from nltk.stem import WordNetLemmatizer

    wordnet_lemmatizer = WordNetLemmatizer()

    word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

    # Tokenize, then look up the lemma of each token (requires the WordNet data)
    nltk_tokens = nltk.word_tokenize(word_data)
    for w in nltk_tokens:
        print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))

When we execute the above code, it produces the following result.

    Actual: It  Lemma: It
    Actual: originated  Lemma: originated
    Actual: from  Lemma: from
    Actual: the  Lemma: the
    Actual: idea  Lemma: idea
    Actual: that  Lemma: that
    Actual: there  Lemma: there
    Actual: are  Lemma: are
    Actual: readers  Lemma: reader
    Actual: who  Lemma: who
    Actual: prefer  Lemma: prefer
    Actual: learning  Lemma: learning
    Actual: new  Lemma: new
    Actual: skills  Lemma: skill
    Actual: from  Lemma: from
    Actual: the  Lemma: the
    Actual: comforts  Lemma: comfort
    Actual: of  Lemma: of
    Actual: their  Lemma: their
    Actual: drawing  Lemma: drawing
    Actual: rooms  Lemma: room

The lemmatize() method treats every word as a noun unless a part of speech is passed, which is why learning and drawing are returned unchanged here while the stemmer reduced them to learn and draw.
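To make the idea of suffix stripping concrete without any NLTK data files, here is a deliberately naive stemmer. It is not the Porter algorithm and not part of NLTK: the function name, the suffix list and the minimum stem length are all assumptions made for illustration.

```python
# Suffixes to try, in order. A toy list, far smaller than Porter's rule set.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


words = ["learning", "skills", "rooms", "drawing", "idea"]
stems = {w: naive_stem(w) for w in words}
print(stems)
```

A rule list this short breaks quickly on real text (agreed would become agre), which is the reason to use a tested stemmer such as PorterStemmer in practice.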
Python Processing JSON Data
A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.

Input Data

Create a JSON file by copying the data below into a text editor such as Notepad. Save the file with a .json extension, choosing the file type All files (*.*).

    {
       "ID": ["1", "2", "3", "4", "5", "6", "7", "8"],
       "Name": ["Rick", "Dan", "Tusar", "Ryan", "Gary", "Rasmi", "Pranab", "Guru"],
       "Salary": ["623.3", "515.2", "611", "729", "843.25", "578", "632.8", "722.5"],
       "StartDate": ["1/1/2012", "9/23/2013", "11/15/2014", "5/11/2014", "3/27/2015", "5/21/2013",
          "7/30/2013", "6/17/2014"],
       "Dept": ["IT", "Operations", "IT", "HR", "Finance", "IT", "Operations", "Finance"]
    }

Read the JSON File

The read_json function of the pandas library reads the JSON file into a pandas DataFrame.

    import pandas as pd

    data = pd.read_json('path/input.json')
    print(data)

When we execute the above code, it produces the following result (the column order may vary with the pandas version).

             Dept  ID    Name  Salary   StartDate
    0          IT   1    Rick  623.30    1/1/2012
    1  Operations   2     Dan  515.20   9/23/2013
    2          IT   3   Tusar  611.00  11/15/2014
    3          HR   4    Ryan  729.00   5/11/2014
    4     Finance   5    Gary  843.25   3/27/2015
    5          IT   6   Rasmi  578.00   5/21/2013
    6  Operations   7  Pranab  632.80   7/30/2013
    7     Finance   8    Guru  722.50   6/17/2014

Reading Specific Columns and Rows

As we saw in the previous chapter for CSV files, once the JSON file has been read into a DataFrame we can select specific columns and specific rows. We use the multi-axes indexer .loc[] for this purpose. Here we display the Salary and Name columns for some of the rows. Column labels are case-sensitive, so they must match the keys in the file.

    import pandas as pd

    data = pd.read_json('path/input.json')

    # Use the multi-axes indexer: row labels first, then column labels
    print(data.loc[[1, 3, 5], ['Salary', 'Name']])

When we execute the above code, it produces the following result.
       Salary   Name
    1   515.2    Dan
    3   729.0   Ryan
    5   578.0  Rasmi

Writing the Data as JSON Records

We can also apply the to_json function, with suitable parameters, to write the DataFrame content back out as one JSON record per row.

    import pandas as pd

    data = pd.read_json('path/input.json')

    # orient='records' emits one object per row; lines=True puts each on its own line
    print(data.to_json(orient='records', lines=True))

When we execute the above code, it produces the following result.

    {"Dept":"IT","ID":1,"Name":"Rick","Salary":623.3,"StartDate":"1/1/2012"}
    {"Dept":"Operations","ID":2,"Name":"Dan","Salary":515.2,"StartDate":"9/23/2013"}
    {"Dept":"IT","ID":3,"Name":"Tusar","Salary":611.0,"StartDate":"11/15/2014"}
    {"Dept":"HR","ID":4,"Name":"Ryan","Salary":729.0,"StartDate":"5/11/2014"}
    {"Dept":"Finance","ID":5,"Name":"Gary","Salary":843.25,"StartDate":"3/27/2015"}
    {"Dept":"IT","ID":6,"Name":"Rasmi","Salary":578.0,"StartDate":"5/21/2013"}
    {"Dept":"Operations","ID":7,"Name":"Pranab","Salary":632.8,"StartDate":"7/30/2013"}
    {"Dept":"Finance","ID":8,"Name":"Guru","Salary":722.5,"StartDate":"6/17/2014"}
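The read, select and write steps of this chapter can be tried without creating a file by wrapping the JSON text in an in-memory buffer. This is a minimal sketch: the three-row data set is a shortened, illustrative version of the input file, with the numbers stored as JSON numbers rather than strings.

```python
from io import StringIO

import pandas as pd

# Shortened sample in the same column-oriented layout as the input file.
raw = (
    '{"ID": [1, 2, 3], '
    '"Name": ["Rick", "Dan", "Tusar"], '
    '"Salary": [623.3, 515.2, 611.0]}'
)

data = pd.read_json(StringIO(raw))                  # read from text instead of a path
subset = data.loc[[0, 2], ["Name", "Salary"]]       # rows 0 and 2, two columns
records = data.to_json(orient="records", lines=True)

print(subset)
print(records)
```

The same calls work with a file path in place of the StringIO object.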
Python NoSQL Databases
As more and more data becomes available in unstructured or semi-structured form, the need to manage it through NoSQL databases increases. Python can interact with NoSQL databases in a similar way to how it interacts with relational databases. In this chapter we will use Python to interact with MongoDB as a NoSQL database. If you are new to MongoDB, you can learn it in our MongoDB tutorial.

To connect to MongoDB, Python uses a library known as pymongo. You can add this library to your Python environment using the command below from the Anaconda environment.

    conda install pymongo

This library enables Python to connect to MongoDB using a db client. Once connected, we select the db name to be used for the various operations.

Inserting Data

To insert data into MongoDB we use the insert_one() method of a collection. First we connect to the db using the Python code shown below, and then we provide the document details in the form of a series of key-value pairs.

    # Import the python libraries
    from pymongo import MongoClient
    from pprint import pprint

    # Choose the appropriate client (defaults to a server on localhost)
    client = MongoClient()

    # Connect to the test db
    db = client.test

    # Use the employee collection
    employee = db.employee
    employee_details = {
        'Name': 'Raj Kumar',
        'Address': 'Sears Streer, NZ',
        'Age': '42'
    }

    # Use the insert_one method
    result = employee.insert_one(employee_details)

    # Query for the inserted document
    Queryresult = employee.find_one({'Age': '42'})
    pprint(Queryresult)

When we execute the above code, it produces the following result (the _id value is generated by MongoDB and will differ).

    {'Address': 'Sears Streer, NZ',
     'Age': '42',
     'Name': 'Raj Kumar',
     '_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Updating Data

Updating existing MongoDB data is similar to inserting. We use the update_one() method of the collection together with the $set operator. In the code below we overwrite the fields of the existing record with new key-value pairs.
Note how the filter condition decides which record is updated.

    # Import the python libraries
    from pymongo import MongoClient
    from pprint import pprint

    # Choose the appropriate client
    client = MongoClient()

    # Connect to db
    db = client.test
    employee = db.employee

    # Use the condition to choose the record
    # and use the update_one method
    db.employee.update_one(
        {"Age": "42"},
        {
            "$set": {
                "Name": "Srinidhi",
                "Age": "35",
                "Address": "New Omsk, WC"
            }
        }
    )

    Queryresult = employee.find_one({'Age': '35'})
    pprint(Queryresult)

When we execute the above code, it produces the following result.

    {'Address': 'New Omsk, WC',
     'Age': '35',
     'Name': 'Srinidhi',
     '_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Deleting Data

Deleting a record is also straightforward: we use the delete_one() method. Here too we pass the condition that chooses the record to be deleted.

    # Import the python libraries
    from pymongo import MongoClient
    from pprint import pprint

    # Choose the appropriate client
    client = MongoClient()

    # Connect to db
    db = client.test
    employee = db.employee

    # Use the condition to choose the record
    # and use the delete_one method
    db.employee.delete_one({"Age": "35"})

    Queryresult = employee.find_one({'Age': '35'})
    pprint(Queryresult)

When we execute the above code, it produces the following result.

    None

So we see that the record no longer exists in the db.
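The three operations in this chapter share one idea: a filter document selects the first matching record, and the operation then acts on it. The sketch below imitates that idea with a plain Python list of dicts so it can run without a MongoDB server. It is not pymongo code; the helper names find_first, set_fields_on_first and delete_first are hypothetical, and only simple equality filters are handled.

```python
from typing import Optional


def matches(document: dict, query: dict) -> bool:
    """True when every key in the filter equals the document's value."""
    return all(document.get(key) == value for key, value in query.items())


def find_first(collection: list, query: dict) -> Optional[dict]:
    """Return the first matching document, or None (like find_one's result)."""
    for document in collection:
        if matches(document, query):
            return document
    return None


def set_fields_on_first(collection: list, query: dict, fields: dict) -> int:
    """Overwrite the given fields on the first match; return how many were changed."""
    document = find_first(collection, query)
    if document is None:
        return 0
    document.update(fields)
    return 1


def delete_first(collection: list, query: dict) -> int:
    """Remove the first match; return how many documents were removed."""
    for position, document in enumerate(collection):
        if matches(document, query):
            del collection[position]
            return 1
    return 0


employees = [{"Name": "Raj Kumar", "Address": "Sears Streer, NZ", "Age": "42"}]
updated = set_fields_on_first(employees, {"Age": "42"}, {"Name": "Srinidhi", "Age": "35"})
found = find_first(employees, {"Age": "35"})
deleted = delete_first(employees, {"Age": "35"})
after_delete = find_first(employees, {"Age": "35"})
print(updated, found, deleted, after_delete)
```

Because Age is stored as the string "42", a filter on the integer 42 would not match; the same holds for the real database.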
Python Measuring Variance
In statistics, variance is a measure of how far the values in a data set lie from the mean value. In other words, it indicates how dispersed the values are. Dispersion is usually reported as the standard deviation. A related measure, skewness, describes how asymmetric the distribution is. Both are calculated using functions available in the pandas library.

Measuring Standard Deviation

Standard deviation is the square root of the variance. Variance is the average of the squared differences between the values in a data set and the mean value. In Python we calculate the standard deviation with the std() function from the pandas library, which by default computes the sample standard deviation (dividing by n - 1).

    import pandas as pd

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # Calculate the standard deviation of the numeric columns
    # (numeric_only skips the text column, which recent pandas would otherwise reject)
    print(df.std(numeric_only=True))

Its output is as follows:

    Age       7.265527
    Rating    0.661628
    dtype: float64

Measuring Skewness

Skewness is used to determine whether the data is symmetric or skewed. As a rule of thumb, if the index is between -1 and 1 the distribution is approximately symmetric. If the index is -1 or less the distribution is skewed to the left, and if it is 1 or more it is skewed to the right.

    import pandas as pd

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                            'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # Skewness of the numeric columns
    print(df.skew(numeric_only=True))

Its output is as follows:

    Age       1.443490
    Rating   -0.153629
    dtype: float64

So the distribution of rating is approximately symmetric, while the distribution of age is skewed to the right.
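The definition of standard deviation given in this chapter can be checked by hand with the standard library. The sketch below recomputes the Age figure from the chapter's data, showing both the population form (divide by n) and the sample form (divide by n - 1) that matches the pandas result. The variable names are illustrative.

```python
import math
import statistics

ages = [25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]

mean_age = sum(ages) / len(ages)
squared_deviations = [(age - mean_age) ** 2 for age in ages]

population_variance = sum(squared_deviations) / len(ages)        # divide by n
sample_variance = sum(squared_deviations) / (len(ages) - 1)      # divide by n - 1
sample_std = math.sqrt(sample_variance)

print(round(sample_std, 6))                                      # 7.265527
print(math.isclose(sample_std, statistics.stdev(ages)))          # True
print(population_variance < sample_variance)                     # True
```

The sample form is slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.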
Python Data Science – NumPy
What is NumPy?

NumPy is a Python package whose name stands for "Numerical Python". It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.

Operations using NumPy

Using NumPy, a developer can perform the following operations:

- Mathematical and logical operations on arrays.
- Fourier transforms and routines for shape manipulation.
- Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.

NumPy – A Replacement for MATLAB

NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MATLAB, a popular platform for technical computing. Python, the language behind this alternative, is now seen as a more modern and complete programming language. NumPy is also open source, which is an added advantage.

ndarray Object

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index. Every item in an ndarray occupies a block of memory of the same size, and each ndarray has an associated data-type object (called dtype) that describes how its items are interpreted. Any single item extracted from an ndarray object (by indexing) is represented by a Python object of one of the array scalar types.

We will see many examples of using the NumPy library in data science work in the next chapters.
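The properties of ndarray described above can be inspected directly. The following is a minimal sketch with an explicitly chosen dtype so the item size is the same on every platform; the variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], dtype=np.int64)

print(a.shape)       # (2, 2): two rows, two columns
print(a.dtype)       # int64: every item has the same type
print(a.itemsize)    # 8: bytes used by each item
print(a[1, 0])       # 3: zero-based indexing, row 1 and column 0

doubled = a * 2      # element-wise arithmetic on the whole array
product = a @ a      # matrix product, a basic linear algebra operation

print(doubled)
print(product)
```

Arithmetic operators act element by element, while the @ operator performs matrix multiplication.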
Data Science Python – Getting Started

What is Data Science?

Data science is the process of deriving knowledge and insights from a huge and diverse set of data by organizing, processing and analysing it. It involves many different disciplines, such as mathematical and statistical modelling, extracting data from its source and applying data visualization techniques. Often it also involves big data technologies for gathering both structured and unstructured data. Below are some example scenarios where data science is used.

Recommendation systems

As online shopping becomes more prevalent, e-commerce platforms are able to capture users' shopping preferences as well as the performance of various products in the market. This leads to recommendation systems, which build models that predict shoppers' needs and show the products a shopper is most likely to buy.

Financial Risk management

The financial risk involved in loans and credit is better analysed by using customers' past spending habits, past defaults, other financial commitments and many socio-economic indicators. This data is gathered from various sources in different formats. Organising it and gaining insight into customer profiles needs the help of data science. The outcome is reduced losses for the financial organization through the avoidance of bad debt.

Improvement in Health Care services

The health care industry deals with a variety of data, which can be classified into technical data, financial data, patient information, drug information and legal rules. All this data needs to be analysed in a coordinated manner to produce insights that save costs for both the health care provider and the care receiver while remaining legally compliant.

Computer Vision

Advances in image recognition by computers involve processing large sets of image data from multiple objects of the same category. Face recognition is one example.
These data sets are modelled, and algorithms are created to apply the model to newer images to get a satisfactory result. Processing these huge data sets and creating the models needs various tools used in data science.

Efficient Management of Energy

As the demand for energy soars, energy-producing companies need to manage the various phases of energy production and distribution more efficiently. This involves optimizing the production methods and the storage and distribution mechanisms, as well as studying customers' consumption patterns. Linking the data from all these sources and deriving insight is a daunting task, which is made easier by the tools of data science.

Python in Data Science

The programming requirements of data science demand a versatile yet flexible language in which code is simple to write but which can handle highly complex mathematical processing. Python is well suited to such requirements, as it has established itself as a language for both general computing and scientific computing. Moreover, it is continuously extended through new additions to its plethora of libraries aimed at different programming requirements. Below we discuss the features of Python that make it a preferred language for data science.

- It is a simple, easy to learn language that often achieves a result in fewer lines of code than similar languages such as R. Its simplicity also makes it robust for handling complex scenarios with minimal code and much less confusion about the general flow of the program.
- It is cross platform, so the same code works in multiple environments without needing any change. That makes it well suited to multi-environment setups.
- It can execute faster than other languages used for data analysis, such as R and MATLAB, particularly when the work is delegated to compiled libraries.
- Its memory management, especially garbage collection, lets it gracefully handle very large volumes of data transformation, slicing, dicing and visualization.
- Most importantly, Python has a very large collection of libraries that serve as special-purpose analysis tools. For example, the NumPy package deals with scientific computing, and its array needs much less memory than the conventional Python list for managing numeric data. The number of such packages is continuously growing.
- Python has packages that can directly use code from other languages such as Java or C. This helps in optimizing performance by reusing existing code in other languages whenever it gives a better result.

In the subsequent chapters we will see how we can leverage these features of Python to accomplish the tasks needed in the different areas of data science.
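The claim above about NumPy arrays needing less memory than Python lists can be checked with a rough measurement. This is a sketch under stated assumptions: it counts the list object plus each integer object it refers to, the exact list figure varies with the Python version and platform, and the variable names are illustrative.

```python
import sys

import numpy as np

n = 1000
values = list(range(n))
array = np.arange(n, dtype=np.int64)

# A list stores references, so the integer objects must be counted as well.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = array.nbytes          # raw data buffer: n items x 8 bytes each

print(list_bytes)
print(array_bytes)                  # 8000
print(array_bytes < list_bytes)     # True
```

The array stores raw 8-byte values in one contiguous buffer, while the list holds a reference to a separate Python object for every number.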
Python Data Science – Matplotlib

What is Matplotlib?

Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It has a module named pyplot, which makes plotting easy by providing features to control line styles, font properties, axis formatting and so on. It supports a very wide variety of graphs and plots, including histograms, bar charts, power spectra and error charts. It is used along with NumPy to provide an environment that is an effective open source alternative to MATLAB. It can also be used with graphics toolkits like PyQt and wxPython.

Conventionally, the package is imported into the Python script by adding the following statement:

    from matplotlib import pyplot as plt

Matplotlib Example

The following script produces a sine wave plot using matplotlib.

Example

    import numpy as np
    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on a sine curve
    x = np.arange(0, 3 * np.pi, 0.1)
    y = np.sin(x)
    plt.title("sine wave form")

    # Plot the points using matplotlib
    plt.plot(x, y)
    plt.show()

Its output is as follows:

    [Figure: plot of the sine curve, titled "sine wave form"]

We will see many examples of using the Matplotlib library in data science work in the next chapters.
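When a script runs without a display, for example on a server, plt.show() has nothing to draw on. A variation of the sine wave example is sketched below: it selects the non-interactive Agg backend and renders the figure to an in-memory PNG instead. The second (cosine) curve, the axis labels and the variable names are additions made for illustration.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 3 * np.pi, 0.1)

fig, ax = plt.subplots()
(sine_line,) = ax.plot(x, np.sin(x), label="sin(x)")
(cosine_line,) = ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_title("sine and cosine wave forms")
ax.set_xlabel("x (radians)")
ax.set_ylabel("amplitude")
ax.legend()

# Render to memory; pass a file name to savefig to write the image to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)

print(len(png_bytes) > 0)
```

Working with the fig and ax objects rather than the module-level plt functions keeps each plot's settings explicit, which helps once a script draws more than one figure.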