Python – Poisson Distribution

A Poisson distribution shows the likely number of times that an event will occur within a predetermined period of time. It is used for independent events which occur at a constant rate within a given interval of time. The Poisson distribution is a discrete distribution: the event can only be counted as occurring or not occurring, so the variable takes only whole-number values.

We use the seaborn library, which has built-in functions to create such probability distribution graphs. The scipy package helps in generating the Poisson samples.

from scipy.stats import poisson
import seaborn as sb

data_poisson = poisson.rvs(mu=4, size=10000)
ax = sb.distplot(data_poisson, kde=True, color='green', hist_kws={'linewidth': 25, 'alpha': 1})
ax.set(xlabel='Poisson', ylabel='Frequency')

The output is a histogram of the sampled values with a kernel density curve overlaid.
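As a sketch of the underlying mathematics, the Poisson probability mass function P(k) = e^(-mu) * mu^k / k! can be computed directly with the standard library (the helper name poisson_pmf below is ours, not part of scipy):

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean rate is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

# With mu=4, as in the sampling example above, the most likely counts
# are 3 and 4, which (because mu is an integer) have equal probability.
print(round(poisson_pmf(4, 4), 4))   # ~0.1954
print(round(poisson_pmf(3, 4), 4))   # ~0.1954
```

This is why the histogram produced by the seaborn code peaks around 3 and 4.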
Python – Data Aggregation

Python provides several methods to perform aggregations on data, using the pandas and numpy libraries. The data must be available as, or converted to, a DataFrame before the aggregation functions can be applied.

Applying Aggregations on DataFrame

Let us create a DataFrame and apply aggregations on it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r)

Its output is as follows −

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

Rolling [window=3,min_periods=1,center=False,axis=0]

We can aggregate by passing a function to the entire DataFrame, or select a column via the standard get-item method.
Apply Aggregation on a Whole Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))

Its output is as follows − first the raw DataFrame, then the rolling sums.

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469

Apply Aggregation on a Single Column of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))

Its output is as follows − first the raw DataFrame, then the rolling sums of column A.

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

2000-01-01 1.088512
2000-01-02 1.879182
2000-01-03 1.303660
2000-01-04 1.884801
2000-01-05 1.194699
2000-01-06 1.925393
2000-01-07 0.565208
2000-01-08 0.564129
2000-01-09 2.048458
2000-01-10 2.065750
Freq: D, Name: A, dtype: float64

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))

Its output is as follows − first the raw DataFrame, then the rolling sums of columns A and B.

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

A B
2000-01-01 1.088512 -0.650942
2000-01-02 1.879182 -1.038796
2000-01-03 1.303660 -2.003821
2000-01-04 1.884801 -0.141119
2000-01-05 1.194699 0.010551
2000-01-06 1.925393 1.968551
2000-01-07 0.565208 0.032738
2000-01-08 0.564129 -0.759118
2000-01-09 2.048458 -1.820537
2000-01-10 2.065750 0.383357
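Because the examples above use np.random.randn, their numbers change on every run. A minimal sketch of the same rolling aggregation on fixed values, so the result is predictable:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Rolling window of 3; min_periods=1 lets the first two windows
# be computed from fewer than 3 values instead of producing NaN.
r = s.rolling(window=3, min_periods=1).sum()
print(list(r))   # [1.0, 3.0, 6.0, 9.0, 12.0]
```

Each entry is the sum of the current value and at most two preceding values, which is exactly what the DataFrame examples compute column by column.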
Python – Time Series

A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year.

In the below example we take the value of stock prices every day for a quarter for a particular stock symbol. We capture these values in a CSV file and then organize them into a DataFrame using the pandas library. We then set the date field as the index of the DataFrame by assigning the ValueDate column as the index and deleting the original ValueDate column.

Sample Data

Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a file named stock.csv.

ValueDate,Price
01-01-2018,1042.05
02-01-2018,1033.55
03-01-2018,1029.7
04-01-2018,1021.3
05-01-2018,1015.4
…
23-03-2018,1161.3
26-03-2018,1167.6
27-03-2018,1155.25
28-03-2018,1154

Creating Time Series

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])

# Set the Date as Index
df['ValueDate'] = pd.to_datetime(df['ValueDate'])
df.index = df['ValueDate']
del df['ValueDate']

df.plot(figsize=(15, 6))
plt.show()

The output is a line plot of the stock price over the quarter.
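A small self-contained sketch of the same idea, using a generated date index instead of the CSV file (the values here are made up for illustration):

```python
import pandas as pd

# Ten consecutive daily observations starting 1 Jan 2018.
prices = pd.Series(range(10),
                   index=pd.date_range('2018-01-01', periods=10, freq='D'))

# With a DatetimeIndex in place, time-based operations become easy,
# e.g. aggregating the daily values into weekly totals.
weekly = prices.resample('W').sum()
print(weekly)
```

Setting the dates as the index, as in the stock example above, is what unlocks operations such as resampling, date-range slicing, and plotting against time.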
Python – Data Wrangling

Data wrangling involves processing the data in various ways – merging, grouping, concatenating etc. – for the purpose of analysing it or getting it ready to be used with another set of data. Python has built-in features to apply these wrangling methods to various data sets. In this chapter we will look at a few examples describing these methods.

Merging Data

The pandas library in Python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
   left_index=False, right_index=False, sort=True)

Let us now create two different DataFrames on which we can perform merging operations.

# import the pandas library
import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)

Its output is as follows −

Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5

Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5

Grouping Data

Grouping data sets is a frequent need in data analysis where we need the result in terms of the various groups present in the data set. Pandas has in-built methods which can roll the data into various groups. In the below example we group the data by year and then get the result for a specific year.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.get_group(2014))

Its output is as follows −

Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014

Concatenating Data

Pandas provides various facilities for easily combining Series and DataFrame objects. In the below example the concat function performs concatenation operations along an axis. Let us create different objects and do the concatenation.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(pd.concat([one, two]))

Its output is as follows −

Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
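The merging example at the start of this chapter only printed the two DataFrames without joining them; a sketch of an actual merge on the shared subject_id column:

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join: keep only the subject_ids present in both frames
# (sub2, sub4, sub5 and sub6), so the result has four rows.
# Overlapping column names get _x / _y suffixes.
merged = pd.merge(left, right, on='subject_id', how='inner')
print(merged[['Name_x', 'Name_y', 'subject_id']])
```

Changing how to 'left', 'right' or 'outer' selects the other standard join behaviours.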
Python – Processing JSON Data

A JSON file stores data as text in human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.

Input Data

Create a JSON file by copying the below data into a text editor like Notepad. Save the file with a .json extension, choosing the file type as All Files (*.*).

{
   "ID":["1","2","3","4","5","6","7","8"],
   "Name":["Rick","Dan","Tusar","Ryan","Gary","Rasmi","Pranab","Guru"],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
   "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013","7/30/2013","6/17/2014"],
   "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}

Read the JSON File

The read_json function of the pandas library can be used to read the JSON file into a pandas DataFrame.

import pandas as pd

data = pd.read_json('path/input.json')
print(data)

When we execute the above code, it produces the following result.

Dept ID Name Salary StartDate
0 IT 1 Rick 623.30 1/1/2012
1 Operations 2 Dan 515.20 9/23/2013
2 IT 3 Tusar 611.00 11/15/2014
3 HR 4 Ryan 729.00 5/11/2014
4 Finance 5 Gary 843.25 3/27/2015
5 IT 6 Rasmi 578.00 5/21/2013
6 Operations 7 Pranab 632.80 7/30/2013
7 Finance 8 Guru 722.50 6/17/2014

Reading Specific Columns and Rows

Similar to what we have already seen in the previous chapter for reading a CSV file, the read_json function of the pandas library can also be used to read specific columns and specific rows after the JSON file is read into a DataFrame. We use the multi-axes indexing method .loc for this purpose. We choose to display the Salary and Name columns for some of the rows.

import pandas as pd

data = pd.read_json('path/input.json')

# Use the multi-axes indexing function
print(data.loc[[1, 3, 5], ['Salary', 'Name']])

When we execute the above code, it produces the following result.
Salary Name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi

Reading the JSON File as Records

We can also apply the to_json function along with parameters to write the DataFrame content out as individual records.

import pandas as pd

data = pd.read_json('path/input.json')
print(data.to_json(orient='records', lines=True))

When we execute the above code, it produces the following result.

{"Dept":"IT","ID":1,"Name":"Rick","Salary":623.3,"StartDate":"1/1/2012"}
{"Dept":"Operations","ID":2,"Name":"Dan","Salary":515.2,"StartDate":"9/23/2013"}
{"Dept":"IT","ID":3,"Name":"Tusar","Salary":611.0,"StartDate":"11/15/2014"}
{"Dept":"HR","ID":4,"Name":"Ryan","Salary":729.0,"StartDate":"5/11/2014"}
{"Dept":"Finance","ID":5,"Name":"Gary","Salary":843.25,"StartDate":"3/27/2015"}
{"Dept":"IT","ID":6,"Name":"Rasmi","Salary":578.0,"StartDate":"5/21/2013"}
{"Dept":"Operations","ID":7,"Name":"Pranab","Salary":632.8,"StartDate":"7/30/2013"}
{"Dept":"Finance","ID":8,"Name":"Guru","Salary":722.5,"StartDate":"6/17/2014"}
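A self-contained sketch of the same round trip that avoids needing a file on disk, by reading the JSON from an in-memory string (the two-row data set here is our own illustration):

```python
from io import StringIO
import pandas as pd

raw = '{"Name":["Rick","Dan"],"Salary":[623.3,515.2]}'
data = pd.read_json(StringIO(raw))
print(data)

# Write it back out as one JSON record per line,
# as in the records example above.
print(data.to_json(orient='records', lines=True))
```

Wrapping the string in StringIO makes it look like a file to read_json, which is handy for testing JSON handling without touching the filesystem.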
Python – NoSQL Databases

As more and more data becomes available in unstructured or semi-structured form, the need to manage it through a NoSQL database increases. Python can interact with NoSQL databases in a similar way as it interacts with relational databases. In this chapter we will use Python to interact with MongoDB as a NoSQL database. In case you are new to MongoDB, you can learn it in our tutorial here.

In order to connect to MongoDB, Python uses a library known as pymongo. You can add this library to your Python environment using the below command from the Anaconda environment.

conda install pymongo

This library enables Python to connect to MongoDB using a db client. Once connected, we select the db name to be used for various operations.

Inserting Data

To insert data into MongoDB we use the insert_one() method which is available in the database environment. First we connect to the db using the Python code shown below, and then we provide the document details as a series of key-value pairs.

# Import the Python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to the test db
db = client.test

# Use the employee collection
employee = db.employee
employee_details = {
   'Name': 'Raj Kumar',
   'Address': 'Sears Street, NZ',
   'Age': '42'
}

# Use the insert_one method
result = employee.insert_one(employee_details)

# Query for the inserted document.
Queryresult = employee.find_one({'Age': '42'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

{'Address': 'Sears Street, NZ',
 'Age': '42',
 'Name': 'Raj Kumar',
 '_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Updating Data

Updating existing MongoDB data is similar to inserting. We use the update_one() method which is native to MongoDB. In the below code we are replacing the existing record with new key-value pairs.
Please note how we use the condition criteria to decide which record to update.

# Import the Python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to db
db = client.test
employee = db.employee

# Use the condition to choose the record
# and use the update_one method
db.employee.update_one(
   {'Age': '42'},
   {
      '$set': {
         'Name': 'Srinidhi',
         'Age': '35',
         'Address': 'New Omsk, WC'
      }
   }
)

Queryresult = employee.find_one({'Age': '35'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

{'Address': 'New Omsk, WC',
 'Age': '35',
 'Name': 'Srinidhi',
 '_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Deleting Data

Deleting a record is also straightforward: we use the delete_one method and again mention the condition used to choose the record to be deleted.

# Import the Python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to db
db = client.test
employee = db.employee

# Use the condition to choose the record
# and use the delete_one method
db.employee.delete_one({'Age': '35'})

Queryresult = employee.find_one({'Age': '35'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

None

So we see that the particular record no longer exists in the db.
Python – Relational Databases

We can connect to relational databases for analysing data using the pandas library along with an additional library for implementing database connectivity. This package is named sqlalchemy and provides full SQL language functionality to be used in Python.

Installing SQLAlchemy

The installation is very straightforward using Anaconda, which we have discussed in the chapter Data Science Environment. Assuming you have installed Anaconda as described in that chapter, run the following command in the Anaconda Prompt window to install the SQLAlchemy package.

conda install sqlalchemy

Reading Relational Tables

We will use SQLite as our relational database, as it is very lightweight and easy to use, though the SQLAlchemy library can connect to a variety of relational sources including MySQL, Oracle, PostgreSQL and MSSQL. We first create a database engine and then write a DataFrame into the database with the to_sql function of the SQLAlchemy-backed pandas API. In the below example we create the relational table by using the to_sql function from a DataFrame already created by reading a CSV file. Then we use the read_sql_query function from pandas to execute and capture the results of various SQL queries.

from sqlalchemy import create_engine
import pandas as pd

data = pd.read_csv('/path/input.csv')

# Create the db engine
engine = create_engine('sqlite:///:memory:')

# Store the dataframe as a table
data.to_sql('data_table', engine)

# Query 1 on the relational table
res1 = pd.read_sql_query('SELECT * FROM data_table', engine)
print('Result 1')
print(res1)
print('')

# Query 2 on the relational table
res2 = pd.read_sql_query('SELECT dept, sum(salary) FROM data_table GROUP BY dept', engine)
print('Result 2')
print(res2)

When we execute the above code, it produces the following result.
Result 1
index id name salary start_date dept
0 0 1 Rick 623.30 2012-01-01 IT
1 1 2 Dan 515.20 2013-09-23 Operations
2 2 3 Tusar 611.00 2014-11-15 IT
3 3 4 Ryan 729.00 2014-05-11 HR
4 4 5 Gary 843.25 2015-03-27 Finance
5 5 6 Rasmi 578.00 2013-05-21 IT
6 6 7 Pranab 632.80 2013-07-30 Operations
7 7 8 Guru 722.50 2014-06-17 Finance

Result 2
dept sum(salary)
0 Finance 1565.75
1 HR 729.00
2 IT 1812.30
3 Operations 1148.00

Inserting Data into Relational Tables

We can also insert data into relational tables using the sql.execute function available in pandas. In the below code we use the previous CSV file as the input data set, store it in a relational table and then insert another record using sql.execute.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')

# Store the data in a relational table
data.to_sql('data_table', engine)

# Insert another row; the first value fills the index column
sql.execute('INSERT INTO data_table VALUES(?,?,?,?,?,?)', engine,
   params=[(8, 9, 'Ruby', 711.20, '2015-03-27', 'IT')])

# Read from the relational table
res = pd.read_sql_query('SELECT ID, Dept, Name, Salary, start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

id dept name salary start_date
0 1 IT Rick 623.30 2012-01-01
1 2 Operations Dan 515.20 2013-09-23
2 3 IT Tusar 611.00 2014-11-15
3 4 HR Ryan 729.00 2014-05-11
4 5 Finance Gary 843.25 2015-03-27
5 6 IT Rasmi 578.00 2013-05-21
6 7 Operations Pranab 632.80 2013-07-30
7 8 Finance Guru 722.50 2014-06-17
8 9 IT Ruby 711.20 2015-03-27

Deleting Data from Relational Tables

We can also delete data from relational tables using the sql.execute function available in pandas. The below code deletes a row based on the given input condition.
from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine)

sql.execute('DELETE FROM data_table WHERE name = (?)', engine, params=[('Gary')])

res = pd.read_sql_query('SELECT ID, Dept, Name, Salary, start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

id dept name salary start_date
0 1 IT Rick 623.3 2012-01-01
1 2 Operations Dan 515.2 2013-09-23
2 3 IT Tusar 611.0 2014-11-15
3 4 HR Ryan 729.0 2014-05-11
4 6 IT Rasmi 578.0 2013-05-21
5 7 Operations Pranab 632.8 2013-07-30
6 8 Finance Guru 722.5 2014-06-17
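The same parameterized delete can be sketched with only the standard library's sqlite3 module, which avoids the SQLAlchemy dependency entirely (the table and rows here are a small made-up subset of the CSV data):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data_table (id INTEGER, name TEXT, salary REAL)')
conn.executemany('INSERT INTO data_table VALUES (?, ?, ?)',
                 [(1, 'Rick', 623.3), (5, 'Gary', 843.25), (8, 'Guru', 722.5)])

# Parameterized delete: the ? placeholder keeps the value out of the
# SQL text, exactly as in the pandas sql.execute example above.
conn.execute('DELETE FROM data_table WHERE name = ?', ('Gary',))

rows = conn.execute('SELECT name FROM data_table ORDER BY id').fetchall()
print(rows)   # [('Rick',), ('Guru',)]
conn.close()
```

Using placeholders rather than string formatting is what protects the query against SQL injection.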
Python – Processing Unstructured Data

Data that is already present in a row and column format, or which can be easily converted to rows and columns so that it can later fit nicely into a database, is known as structured data. Examples are CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, where missing values are represented as blanks between the delimiters. But sometimes we get data where the lines are not of fixed width, or they are just HTML, image or PDF files. Such data is known as unstructured data. While an HTML file can be handled by processing its HTML tags, a feed from Twitter or a plain text document from a news feed has neither a delimiter nor tags to handle. In such scenarios we use different in-built functions from various Python libraries to process the file.

Reading Data

In the below example we take a text file and read the file, segregating each of the lines in it. Next we can divide the output into further lines and words. The original file is a text file containing some paragraphs describing the Python language.

filename = 'path/input.txt'

with open(filename) as fn:
   # Read each line
   ln = fn.readline()
   # Keep count of lines
   lncnt = 1
   while ln:
      print('Line {}: {}'.format(lncnt, ln.strip()))
      ln = fn.readline()
      lncnt += 1

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.

Counting Word Frequency

We can count the frequency of the words in the file using the Counter class as follows.

from collections import Counter

with open('path/input2.txt') as f:
   p = Counter(f.read().split())
print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
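The counts above treat 'programming' and 'programming.' as different tokens because split() keeps punctuation attached. A sketch of a normalized count using a regular expression (the sample text here is our own two-sentence snippet):

```python
import re
from collections import Counter

text = ("Python is interpreted. Python has a design philosophy "
        "that emphasizes code readability.")

# Lower-case the text and keep only runs of letters, so 'Python'
# and 'Python.' are counted as the same word.
words = re.findall(r"[a-z]+", text.lower())
print(Counter(words).most_common(1))   # [('python', 2)]
```

This normalization step is usually the first thing added when moving from a quick word count to real text processing.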
Python – Date and Time

Often in data science we need analysis which is based on temporal values. Python can handle the various formats of date and time gracefully. The datetime library provides the necessary methods and functions to handle the following scenarios.

Date Time Representation
Date Time Arithmetic
Date Time Comparison

We will study them one by one.

Date Time Representation

A date and its various parts are represented by using different datetime functions. Also, there are format specifiers which play a role in displaying the alphabetical parts of a date, like the name of the month or weekday. The following code shows today's date and various parts of the date.

import datetime

print('The Date Today is :', datetime.datetime.today())

date_today = datetime.date.today()
print(date_today)
print('This Year :', date_today.year)
print('This Month :', date_today.month)
print('Month Name:', date_today.strftime('%B'))
print('This Day :', date_today.day)
print('Week Day Name:', date_today.strftime('%A'))

When we execute the above code, it produces the following result.

The Date Today is : 2018-04-22 15:38:35.835000
2018-04-22
This Year : 2018
This Month : 4
Month Name: April
This Day : 22
Week Day Name: Sunday

Date Time Arithmetic

For calculations involving dates we store the various dates in variables and apply the relevant mathematical operators to these variables.
import datetime

# Capture the First Date
day1 = datetime.date(2018, 2, 12)
print('day1:', day1.ctime())

# Capture the Second Date
day2 = datetime.date(2017, 8, 18)
print('day2:', day2.ctime())

# Find the difference between the dates
print('Number of Days:', day1 - day2)

date_today = datetime.date.today()

# Create a delta of Four Days
no_of_days = datetime.timedelta(days=4)

# Use Delta for Past Date
before_four_days = date_today - no_of_days
print('Before Four Days:', before_four_days)

# Use Delta for Future Date
after_four_days = date_today + no_of_days
print('After Four Days:', after_four_days)

When we execute the above code, it produces the following result.

day1: Mon Feb 12 00:00:00 2018
day2: Fri Aug 18 00:00:00 2017
Number of Days: 178 days, 0:00:00
Before Four Days: 2018-04-18
After Four Days: 2018-04-26

Date Time Comparison

Date and time are compared using logical operators. But we must be careful to compare the right parts of the dates with each other. In the below examples we take the future and past dates and compare them using the Python if clause along with logical operators.

import datetime

date_today = datetime.date.today()
print('Today is: ', date_today)

# Create a delta of Four Days
no_of_days = datetime.timedelta(days=4)

# Use Delta for Past Date
before_four_days = date_today - no_of_days
print('Before Four Days:', before_four_days)

after_four_days = date_today + no_of_days

date1 = datetime.date(2018, 4, 4)
print('date1:', date1)

if date1 == before_four_days:
   print('Same Dates')
if date_today > date1:
   print('Past Date')
if date1 < after_four_days:
   print('Future Date')

When we execute the above code, it produces the following result.

Today is: 2018-04-22
Before Four Days: 2018-04-18
date1: 2018-04-04
Past Date
Future Date
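Parsing a date out of a string works in the opposite direction to strftime, via strptime with the same format specifiers; a short sketch:

```python
import datetime

# Parse a day-month-year string into a date object...
d = datetime.datetime.strptime('23-03-2018', '%d-%m-%Y').date()
print(d)                             # 2018-03-23

# ...and format it back out with strftime.
print(d.strftime('%A, %d %B %Y'))    # Friday, 23 March 2018
```

This is the manual equivalent of what pd.to_datetime does when converting the date column of a CSV file into a DatetimeIndex.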
Python – Data Cleansing

Missing data is always a problem in real life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make models more accurate and valid.

When and Why Is Data Missed?

Consider an online survey for a product. Often, people do not share all the information related to them. A few share their experience, but not how long they have been using the product; a few share how long they have been using the product and their experience, but not their contact information. Thus, one way or another, a part of the data is always missing, and this is very common in real life.

Let us now see how we can handle missing values (say NA or NaN) using pandas.

# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Its output is as follows −

one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.
Check for Missing Values

To make detecting missing values easier (and across different array dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())

Its output is as follows −

a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning the missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace NaN with 0.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3),
   index=['a', 'c', 'e'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with 0:")
print(df.fillna(0))

Its output is as follows −

one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580

NaN replaced with 0:
one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580

Here we are filling with the value zero; instead we could fill with any other value.

Fill NA Forward and Backward

Using the concepts of filling discussed in the Reindexing chapter, we will fill the missing values.
Method − Action
pad / ffill − fill values forward
bfill / backfill − fill values backward

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.fillna(method='pad'))

Its output is as follows −

one two three
a 0.077988 0.476149 0.965836
b 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415

Drop Missing Values

If you want to simply exclude the missing values, use the dropna function along with the axis argument. By default, axis=0, i.e. along rows, which means that if any value within a row is NA then the whole row is excluded.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Its output is as follows −

one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415

Replace Missing (or) Generic Values

Many times we have to replace a generic value with some specific value. We can achieve this by applying the replace method. Replacing NA with a scalar value is equivalent behaviour to the fillna() function.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
   'two': [1000, 0, 30, 40, 50, 60]})
print(df.replace({1000: 10, 2000: 60}))

Its output is as follows −

one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
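Besides filling with a constant or copying a neighbouring value, pandas can also interpolate between the known values; a small sketch on fixed data:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation fills the gap evenly between the known endpoints,
# unlike pad/bfill, which simply copy an existing value.
print(list(s.interpolate()))   # [1.0, 2.0, 3.0, 4.0]
```

Interpolation is often a better choice than forward-filling for numeric time series, where values tend to change gradually between observations.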