Python – Poisson Distribution

A Poisson distribution shows the likely number of times that an event will occur within a predetermined period of time. It is used for independent events which occur at a constant rate within a given interval of time. The Poisson distribution is a discrete distribution: the event can only be counted as occurring or not occurring, so the variable takes only whole-number values.

We use the seaborn library, which has built-in functions to create such probability distribution graphs. The scipy package helps in generating the Poisson samples.

from scipy.stats import poisson
import seaborn as sb

data_poisson = poisson.rvs(mu=4, size=10000)
ax = sb.distplot(data_poisson, kde=True, color='green', hist_kws={'linewidth': 25, 'alpha': 1})
ax.set(xlabel='Poisson', ylabel='Frequency')

The output is a histogram of the sampled values with a kernel density curve overlaid.
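As a sketch of the underlying mathematics, the Poisson probability mass function P(k) = e^(-mu) * mu^k / k! can be computed directly with the standard library (the helper name poisson_pmf below is ours, not part of scipy):

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the mean rate is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

# With mu=4, as in the sampling example above, the most likely counts
# are 3 and 4, which (because mu is an integer) have equal probability.
print(round(poisson_pmf(4, 4), 4))   # ~0.1954
print(round(poisson_pmf(3, 4), 4))   # ~0.1954
```

This is why the histogram produced by the seaborn code peaks around 3 and 4.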
Python – Data Aggregation

Python provides several methods to perform aggregations on data, using the pandas and numpy libraries. The data must be available as, or converted to, a DataFrame before the aggregation functions can be applied.

Applying Aggregations on DataFrame

Let us create a DataFrame and apply aggregations on it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r)

Its output is as follows −

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

Rolling [window=3,min_periods=1,center=False,axis=0]

We can aggregate by passing a function to the entire DataFrame, or select a column via the standard get-item method.
Apply Aggregation on a Whole Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r.aggregate(np.sum))

Its output is as follows − first the raw DataFrame, then the rolling sums.

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 1.879182 -1.038796 -3.215581 -0.299575
2000-01-03 1.303660 -2.003821 -3.155154 -2.479355
2000-01-04 1.884801 -0.141119 -0.862400 -0.483331
2000-01-05 1.194699 0.010551 0.297378 -1.216695
2000-01-06 1.925393 1.968551 -0.968183 1.284044
2000-01-07 0.565208 0.032738 -2.125934 0.482797
2000-01-08 0.564129 -0.759118 -2.454374 -0.325454
2000-01-09 2.048458 -1.820537 -0.535232 -1.212381
2000-01-10 2.065750 0.383357 1.541496 -3.201469

Apply Aggregation on a Single Column of a Dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r['A'].aggregate(np.sum))

Its output is as follows − first the raw DataFrame, then the rolling sums of column A.

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

2000-01-01 1.088512
2000-01-02 1.879182
2000-01-03 1.303660
2000-01-04 1.884801
2000-01-05 1.194699
2000-01-06 1.925393
2000-01-07 0.565208
2000-01-08 0.564129
2000-01-09 2.048458
2000-01-10 2.065750
Freq: D, Name: A, dtype: float64

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
   index=pd.date_range('1/1/2000', periods=10),
   columns=['A', 'B', 'C', 'D'])
print(df)
r = df.rolling(window=3, min_periods=1)
print(r[['A', 'B']].aggregate(np.sum))

Its output is as follows − first the raw DataFrame, then the rolling sums of columns A and B.

A B C D
2000-01-01 1.088512 -0.650942 -2.547450 -0.566858
2000-01-02 0.790670 -0.387854 -0.668132 0.267283
2000-01-03 -0.575523 -0.965025 0.060427 -2.179780
2000-01-04 1.669653 1.211759 -0.254695 1.429166
2000-01-05 0.100568 -0.236184 0.491646 -0.466081
2000-01-06 0.155172 0.992975 -1.205134 0.320958
2000-01-07 0.309468 -0.724053 -1.412446 0.627919
2000-01-08 0.099489 -1.028040 0.163206 -1.274331
2000-01-09 1.639500 -0.068443 0.714008 -0.565969
2000-01-10 0.326761 1.479841 0.664282 -1.361169

A B
2000-01-01 1.088512 -0.650942
2000-01-02 1.879182 -1.038796
2000-01-03 1.303660 -2.003821
2000-01-04 1.884801 -0.141119
2000-01-05 1.194699 0.010551
2000-01-06 1.925393 1.968551
2000-01-07 0.565208 0.032738
2000-01-08 0.564129 -0.759118
2000-01-09 2.048458 -1.820537
2000-01-10 2.065750 0.383357
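Because the examples above use np.random.randn, their numbers change on every run. A minimal sketch of the same rolling aggregation on fixed values, so the result is predictable:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Rolling window of 3; min_periods=1 lets the first two windows
# be computed from fewer than 3 values instead of producing NaN.
r = s.rolling(window=3, min_periods=1).sum()
print(list(r))   # [1.0, 3.0, 6.0, 9.0, 12.0]
```

Each entry is the sum of the current value and at most two preceding values, which is exactly what the DataFrame examples compute column by column.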
Python – Time Series

A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year.

In the below example we take the value of stock prices every day for a quarter for a particular stock symbol. We capture these values in a CSV file and then organize them into a DataFrame using the pandas library. We then set the date field as the index of the DataFrame by assigning the ValueDate column as the index and deleting the original ValueDate column.

Sample Data

Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a file named stock.csv.

ValueDate,Price
01-01-2018,1042.05
02-01-2018,1033.55
03-01-2018,1029.7
04-01-2018,1021.3
05-01-2018,1015.4
…
23-03-2018,1161.3
26-03-2018,1167.6
27-03-2018,1155.25
28-03-2018,1154

Creating Time Series

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])

# Set the Date as Index
df['ValueDate'] = pd.to_datetime(df['ValueDate'])
df.index = df['ValueDate']
del df['ValueDate']

df.plot(figsize=(15, 6))
plt.show()

The output is a line plot of the stock price over the quarter.
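A small self-contained sketch of the same idea, using a generated date index instead of the CSV file (the values here are made up for illustration):

```python
import pandas as pd

# Ten consecutive daily observations starting 1 Jan 2018.
prices = pd.Series(range(10),
                   index=pd.date_range('2018-01-01', periods=10, freq='D'))

# With a DatetimeIndex in place, time-based operations become easy,
# e.g. aggregating the daily values into weekly totals.
weekly = prices.resample('W').sum()
print(weekly)
```

Setting the dates as the index, as in the stock example above, is what unlocks operations such as resampling, date-range slicing, and plotting against time.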
Python – Data Wrangling

Data wrangling involves processing the data in various ways – merging, grouping, concatenating etc. – for the purpose of analysing it or getting it ready to be used with another set of data. Python has built-in features to apply these wrangling methods to various data sets. In this chapter we will look at a few examples describing these methods.

Merging Data

The pandas library in Python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
   left_index=False, right_index=False, sort=True)

Let us now create two different DataFrames on which we can perform merging operations.

# import the pandas library
import pandas as pd

left = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
   'id': [1, 2, 3, 4, 5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
print(left)
print(right)

Its output is as follows −

Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5

Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5

Grouping Data

Grouping data sets is a frequent need in data analysis where we need the result in terms of the various groups present in the data set. Pandas has in-built methods which can roll the data into various groups. In the below example we group the data by year and then get the result for a specific year.
# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
      'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
   'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.get_group(2014))

Its output is as follows −

Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014

Concatenating Data

Pandas provides various facilities for easily combining Series and DataFrame objects. In the below example the concat function performs concatenation operations along an axis. Let us create different objects and do the concatenation.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
   'Marks_scored': [98, 90, 87, 69, 78]},
   index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
   'Marks_scored': [89, 80, 79, 97, 88]},
   index=[1, 2, 3, 4, 5])
print(pd.concat([one, two]))

Its output is as follows −

Marks_scored Name subject_id
1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
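The merging example at the start of this chapter only printed the two DataFrames without joining them; a sketch of an actual merge on the shared subject_id column:

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join: keep only the subject_ids present in both frames
# (sub2, sub4, sub5 and sub6), so the result has four rows.
# Overlapping column names get _x / _y suffixes.
merged = pd.merge(left, right, on='subject_id', how='inner')
print(merged[['Name_x', 'Name_y', 'subject_id']])
```

Changing how to 'left', 'right' or 'outer' selects the other standard join behaviours.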
Python – Processing JSON Data

A JSON file stores data as text in human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.

Input Data

Create a JSON file by copying the below data into a text editor like Notepad. Save the file with a .json extension, choosing the file type as All Files (*.*).

{
   "ID":["1","2","3","4","5","6","7","8"],
   "Name":["Rick","Dan","Tusar","Ryan","Gary","Rasmi","Pranab","Guru"],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
   "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013","7/30/2013","6/17/2014"],
   "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}

Read the JSON File

The read_json function of the pandas library can be used to read the JSON file into a pandas DataFrame.

import pandas as pd

data = pd.read_json('path/input.json')
print(data)

When we execute the above code, it produces the following result.

Dept ID Name Salary StartDate
0 IT 1 Rick 623.30 1/1/2012
1 Operations 2 Dan 515.20 9/23/2013
2 IT 3 Tusar 611.00 11/15/2014
3 HR 4 Ryan 729.00 5/11/2014
4 Finance 5 Gary 843.25 3/27/2015
5 IT 6 Rasmi 578.00 5/21/2013
6 Operations 7 Pranab 632.80 7/30/2013
7 Finance 8 Guru 722.50 6/17/2014

Reading Specific Columns and Rows

Similar to what we have already seen in the previous chapter for reading a CSV file, the read_json function of the pandas library can also be used to read specific columns and specific rows after the JSON file is read into a DataFrame. We use the multi-axes indexing method .loc for this purpose. We choose to display the Salary and Name columns for some of the rows.

import pandas as pd

data = pd.read_json('path/input.json')

# Use the multi-axes indexing function
print(data.loc[[1, 3, 5], ['Salary', 'Name']])

When we execute the above code, it produces the following result.
Salary Name
1 515.2 Dan
3 729.0 Ryan
5 578.0 Rasmi

Reading the JSON File as Records

We can also apply the to_json function along with parameters to write the DataFrame content out as individual records.

import pandas as pd

data = pd.read_json('path/input.json')
print(data.to_json(orient='records', lines=True))

When we execute the above code, it produces the following result.

{"Dept":"IT","ID":1,"Name":"Rick","Salary":623.3,"StartDate":"1/1/2012"}
{"Dept":"Operations","ID":2,"Name":"Dan","Salary":515.2,"StartDate":"9/23/2013"}
{"Dept":"IT","ID":3,"Name":"Tusar","Salary":611.0,"StartDate":"11/15/2014"}
{"Dept":"HR","ID":4,"Name":"Ryan","Salary":729.0,"StartDate":"5/11/2014"}
{"Dept":"Finance","ID":5,"Name":"Gary","Salary":843.25,"StartDate":"3/27/2015"}
{"Dept":"IT","ID":6,"Name":"Rasmi","Salary":578.0,"StartDate":"5/21/2013"}
{"Dept":"Operations","ID":7,"Name":"Pranab","Salary":632.8,"StartDate":"7/30/2013"}
{"Dept":"Finance","ID":8,"Name":"Guru","Salary":722.5,"StartDate":"6/17/2014"}
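A self-contained sketch of the same round trip that avoids needing a file on disk, by reading the JSON from an in-memory string (the two-row data set here is our own illustration):

```python
from io import StringIO
import pandas as pd

raw = '{"Name":["Rick","Dan"],"Salary":[623.3,515.2]}'
data = pd.read_json(StringIO(raw))
print(data)

# Write it back out as one JSON record per line,
# as in the records example above.
print(data.to_json(orient='records', lines=True))
```

Wrapping the string in StringIO makes it look like a file to read_json, which is handy for testing JSON handling without touching the filesystem.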
Python – NoSQL Databases

As more and more data becomes available in unstructured or semi-structured form, the need to manage it through a NoSQL database increases. Python can interact with NoSQL databases in a similar way as it interacts with relational databases. In this chapter we will use Python to interact with MongoDB as a NoSQL database. In case you are new to MongoDB, you can learn it in our tutorial here.

In order to connect to MongoDB, Python uses a library known as pymongo. You can add this library to your Python environment using the below command from the Anaconda environment.

conda install pymongo

This library enables Python to connect to MongoDB using a db client. Once connected, we select the db name to be used for various operations.

Inserting Data

To insert data into MongoDB we use the insert_one() method which is available in the database environment. First we connect to the db using the Python code shown below, and then we provide the document details as a series of key-value pairs.

# Import the Python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to the test db
db = client.test

# Use the employee collection
employee = db.employee
employee_details = {
   'Name': 'Raj Kumar',
   'Address': 'Sears Street, NZ',
   'Age': '42'
}

# Use the insert_one method
result = employee.insert_one(employee_details)

# Query for the inserted document.
Queryresult = employee.find_one({'Age': '42'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

{'Address': 'Sears Street, NZ',
 'Age': '42',
 'Name': 'Raj Kumar',
 '_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Updating Data

Updating existing MongoDB data is similar to inserting. We use the update_one() method which is native to MongoDB. In the below code we are replacing the existing record with new key-value pairs.
Please note how we use the condition criteria to decide which record to update.

# Import the Python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to db
db = client.test
employee = db.employee

# Use the condition to choose the record
# and use the update_one method
db.employee.update_one(
   {'Age': '42'},
   {
      '$set': {
         'Name': 'Srinidhi',
         'Age': '35',
         'Address': 'New Omsk, WC'
      }
   }
)

Queryresult = employee.find_one({'Age': '35'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

{'Address': 'New Omsk, WC',
 'Age': '35',
 'Name': 'Srinidhi',
 '_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Deleting Data

Deleting a record is also straightforward: we use the delete_one method and again mention the condition used to choose the record to be deleted.

# Import the Python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to db
db = client.test
employee = db.employee

# Use the condition to choose the record
# and use the delete_one method
db.employee.delete_one({'Age': '35'})

Queryresult = employee.find_one({'Age': '35'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

None

So we see that the particular record no longer exists in the db.
Python – Relational Databases

We can connect to relational databases for analysing data using the pandas library along with an additional library for implementing database connectivity. This package is named sqlalchemy and provides full SQL language functionality to be used in Python.

Installing SQLAlchemy

The installation is very straightforward using Anaconda, which we have discussed in the chapter Data Science Environment. Assuming you have installed Anaconda as described in that chapter, run the following command in the Anaconda Prompt window to install the SQLAlchemy package.

conda install sqlalchemy

Reading Relational Tables

We will use SQLite as our relational database, as it is very lightweight and easy to use, though the SQLAlchemy library can connect to a variety of relational sources including MySQL, Oracle, PostgreSQL and MSSQL. We first create a database engine and then write a DataFrame into the database with the to_sql function of the SQLAlchemy-backed pandas API. In the below example we create the relational table by using the to_sql function from a DataFrame already created by reading a CSV file. Then we use the read_sql_query function from pandas to execute and capture the results of various SQL queries.

from sqlalchemy import create_engine
import pandas as pd

data = pd.read_csv('/path/input.csv')

# Create the db engine
engine = create_engine('sqlite:///:memory:')

# Store the dataframe as a table
data.to_sql('data_table', engine)

# Query 1 on the relational table
res1 = pd.read_sql_query('SELECT * FROM data_table', engine)
print('Result 1')
print(res1)
print('')

# Query 2 on the relational table
res2 = pd.read_sql_query('SELECT dept, sum(salary) FROM data_table GROUP BY dept', engine)
print('Result 2')
print(res2)

When we execute the above code, it produces the following result.
Result 1
index id name salary start_date dept
0 0 1 Rick 623.30 2012-01-01 IT
1 1 2 Dan 515.20 2013-09-23 Operations
2 2 3 Tusar 611.00 2014-11-15 IT
3 3 4 Ryan 729.00 2014-05-11 HR
4 4 5 Gary 843.25 2015-03-27 Finance
5 5 6 Rasmi 578.00 2013-05-21 IT
6 6 7 Pranab 632.80 2013-07-30 Operations
7 7 8 Guru 722.50 2014-06-17 Finance

Result 2
dept sum(salary)
0 Finance 1565.75
1 HR 729.00
2 IT 1812.30
3 Operations 1148.00

Inserting Data into Relational Tables

We can also insert data into relational tables using the sql.execute function available in pandas. In the below code we use the previous CSV file as the input data set, store it in a relational table and then insert another record using sql.execute.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')

# Store the data in a relational table
data.to_sql('data_table', engine)

# Insert another row; the first value fills the index column
sql.execute('INSERT INTO data_table VALUES(?,?,?,?,?,?)', engine,
   params=[(8, 9, 'Ruby', 711.20, '2015-03-27', 'IT')])

# Read from the relational table
res = pd.read_sql_query('SELECT ID, Dept, Name, Salary, start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

id dept name salary start_date
0 1 IT Rick 623.30 2012-01-01
1 2 Operations Dan 515.20 2013-09-23
2 3 IT Tusar 611.00 2014-11-15
3 4 HR Ryan 729.00 2014-05-11
4 5 Finance Gary 843.25 2015-03-27
5 6 IT Rasmi 578.00 2013-05-21
6 7 Operations Pranab 632.80 2013-07-30
7 8 Finance Guru 722.50 2014-06-17
8 9 IT Ruby 711.20 2015-03-27

Deleting Data from Relational Tables

We can also delete data from relational tables using the sql.execute function available in pandas. The below code deletes a row based on the given input condition.
from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine)

sql.execute('DELETE FROM data_table WHERE name = (?)', engine, params=[('Gary')])

res = pd.read_sql_query('SELECT ID, Dept, Name, Salary, start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

id dept name salary start_date
0 1 IT Rick 623.3 2012-01-01
1 2 Operations Dan 515.2 2013-09-23
2 3 IT Tusar 611.0 2014-11-15
3 4 HR Ryan 729.0 2014-05-11
4 6 IT Rasmi 578.0 2013-05-21
5 7 Operations Pranab 632.8 2013-07-30
6 8 Finance Guru 722.5 2014-06-17
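The same parameterized delete can be sketched with only the standard library's sqlite3 module, which avoids the SQLAlchemy dependency entirely (the table and rows here are a small made-up subset of the CSV data):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data_table (id INTEGER, name TEXT, salary REAL)')
conn.executemany('INSERT INTO data_table VALUES (?, ?, ?)',
                 [(1, 'Rick', 623.3), (5, 'Gary', 843.25), (8, 'Guru', 722.5)])

# Parameterized delete: the ? placeholder keeps the value out of the
# SQL text, exactly as in the pandas sql.execute example above.
conn.execute('DELETE FROM data_table WHERE name = ?', ('Gary',))

rows = conn.execute('SELECT name FROM data_table ORDER BY id').fetchall()
print(rows)   # [('Rick',), ('Guru',)]
conn.close()
```

Using placeholders rather than string formatting is what protects the query against SQL injection.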
Python – Processing Unstructured Data

Data that is already present in a row and column format, or which can be easily converted to rows and columns so that it can later fit nicely into a database, is known as structured data. Examples are CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, where missing values are represented as blanks between the delimiters. But sometimes we get data where the lines are not of fixed width, or they are just HTML, image or PDF files. Such data is known as unstructured data. While an HTML file can be handled by processing its HTML tags, a feed from Twitter or a plain text document from a news feed has neither a delimiter nor tags to handle. In such scenarios we use different in-built functions from various Python libraries to process the file.

Reading Data

In the below example we take a text file and read the file, segregating each of the lines in it. Next we can divide the output into further lines and words. The original file is a text file containing some paragraphs describing the Python language.

filename = 'path/input.txt'

with open(filename) as fn:
   # Read each line
   ln = fn.readline()
   # Keep count of lines
   lncnt = 1
   while ln:
      print('Line {}: {}'.format(lncnt, ln.strip()))
      ln = fn.readline()
      lncnt += 1

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.

Counting Word Frequency

We can count the frequency of the words in the file using the Counter class as follows.

from collections import Counter

with open('path/input2.txt') as f:
   p = Counter(f.read().split())
print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
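The counts above treat 'programming' and 'programming.' as different tokens because split() keeps punctuation attached. A sketch of a normalized count using a regular expression (the sample text here is our own two-sentence snippet):

```python
import re
from collections import Counter

text = ("Python is interpreted. Python has a design philosophy "
        "that emphasizes code readability.")

# Lower-case the text and keep only runs of letters, so 'Python'
# and 'Python.' are counted as the same word.
words = re.findall(r"[a-z]+", text.lower())
print(Counter(words).most_common(1))   # [('python', 2)]
```

This normalization step is usually the first thing added when moving from a quick word count to real text processing.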
Python – Date and Time

Often in data science we need analysis which is based on temporal values. Python can handle the various formats of date and time gracefully. The datetime library provides the necessary methods and functions to handle the following scenarios.

Date Time Representation
Date Time Arithmetic
Date Time Comparison

We will study them one by one.

Date Time Representation

A date and its various parts are represented by using different datetime functions. Also, there are format specifiers which play a role in displaying the alphabetical parts of a date, like the name of the month or weekday. The following code shows today's date and various parts of the date.

import datetime

print('The Date Today is :', datetime.datetime.today())

date_today = datetime.date.today()
print(date_today)
print('This Year :', date_today.year)
print('This Month :', date_today.month)
print('Month Name:', date_today.strftime('%B'))
print('This Day :', date_today.day)
print('Week Day Name:', date_today.strftime('%A'))

When we execute the above code, it produces the following result.

The Date Today is : 2018-04-22 15:38:35.835000
2018-04-22
This Year : 2018
This Month : 4
Month Name: April
This Day : 22
Week Day Name: Sunday

Date Time Arithmetic

For calculations involving dates we store the various dates in variables and apply the relevant mathematical operators to these variables.
import datetime

# Capture the First Date
day1 = datetime.date(2018, 2, 12)
print('day1:', day1.ctime())

# Capture the Second Date
day2 = datetime.date(2017, 8, 18)
print('day2:', day2.ctime())

# Find the difference between the dates
print('Number of Days:', day1 - day2)

date_today = datetime.date.today()

# Create a delta of Four Days
no_of_days = datetime.timedelta(days=4)

# Use Delta for Past Date
before_four_days = date_today - no_of_days
print('Before Four Days:', before_four_days)

# Use Delta for Future Date
after_four_days = date_today + no_of_days
print('After Four Days:', after_four_days)

When we execute the above code, it produces the following result.

day1: Mon Feb 12 00:00:00 2018
day2: Fri Aug 18 00:00:00 2017
Number of Days: 178 days, 0:00:00
Before Four Days: 2018-04-18
After Four Days: 2018-04-26

Date Time Comparison

Date and time are compared using logical operators. But we must be careful to compare the right parts of the dates with each other. In the below examples we take the future and past dates and compare them using the Python if clause along with logical operators.

import datetime

date_today = datetime.date.today()
print('Today is: ', date_today)

# Create a delta of Four Days
no_of_days = datetime.timedelta(days=4)

# Use Delta for Past Date
before_four_days = date_today - no_of_days
print('Before Four Days:', before_four_days)

after_four_days = date_today + no_of_days

date1 = datetime.date(2018, 4, 4)
print('date1:', date1)

if date1 == before_four_days:
   print('Same Dates')
if date_today > date1:
   print('Past Date')
if date1 < after_four_days:
   print('Future Date')

When we execute the above code, it produces the following result.

Today is: 2018-04-22
Before Four Days: 2018-04-18
date1: 2018-04-04
Past Date
Future Date
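Parsing a date out of a string works in the opposite direction to strftime, via strptime with the same format specifiers; a short sketch:

```python
import datetime

# Parse a day-month-year string into a date object...
d = datetime.datetime.strptime('23-03-2018', '%d-%m-%Y').date()
print(d)                             # 2018-03-23

# ...and format it back out with strftime.
print(d.strftime('%A, %d %B %Y'))    # Friday, 23 March 2018
```

This is the manual equivalent of what pd.to_datetime does when converting the date column of a CSV file into a DatetimeIndex.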
Python – Data Cleansing

Missing data is always a problem in real life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make models more accurate and valid.

When and Why Is Data Missed?

Consider an online survey for a product. Often, people do not share all the information related to them. A few share their experience, but not how long they have been using the product; a few share how long they have been using the product and their experience, but not their contact information. Thus, one way or another, a part of the data is always missing, and this is very common in real life.

Let us now see how we can handle missing values (say NA or NaN) using pandas.

# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Its output is as follows −

one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g NaN NaN NaN
h 0.085100 0.532791 0.887415

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.
Check for Missing Values

To make detecting missing values easier (and across different array dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())

Its output is as follows −

a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning the missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace NaN with 0.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3),
   index=['a', 'c', 'e'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with 0:")
print(df.fillna(0))

Its output is as follows −

one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580

NaN replaced with 0:
one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580

Here we are filling with the value zero; instead we could fill with any other value.

Fill NA Forward and Backward

Using the concepts of filling discussed in the Reindexing chapter, we will fill the missing values.
Method − Action
pad / ffill − fill values forward
bfill / backfill − fill values backward

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.fillna(method='pad'))

Its output is as follows −

one two three
a 0.077988 0.476149 0.965836
b 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
g -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415

Drop Missing Values

If you want to simply exclude the missing values, use the dropna function along with the axis argument. By default, axis=0, i.e. along rows, which means that if any value within a row is NA then the whole row is excluded.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
   index=['a', 'c', 'e', 'f', 'h'],
   columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Its output is as follows −

one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.146615
h 0.085100 0.532791 0.887415

Replace Missing (or) Generic Values

Many times we have to replace a generic value with some specific value. We can achieve this by applying the replace method. Replacing NA with a scalar value is equivalent behaviour to the fillna() function.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
   'two': [1000, 0, 30, 40, 50, 60]})
print(df.replace({1000: 10, 2000: 60}))

Its output is as follows −

one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
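Besides filling with a constant or copying a neighbouring value, pandas can also interpolate between the known values; a small sketch on fixed data:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation fills the gap evenly between the known endpoints,
# unlike pad/bfill, which simply copy an existing value.
print(list(s.interpolate()))   # [1.0, 2.0, 3.0, 4.0]
```

Interpolation is often a better choice than forward-filling for numeric time series, where values tend to change gradually between observations.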