Python Data Science Archives - Page 2 of 5 - Donotsad

Aug 09

Python Chart Styling

Charts created in Python can be styled further using methods from the charting libraries. In this lesson we look at annotations, legends and the chart background. We continue with the code from the last chapter and modify it to add these styles to the chart.

Adding Annotations

We often need to annotate a chart by highlighting specific locations on it. In the example below we mark the sharp changes in value by adding annotations at those points.

    import numpy as np
    from matplotlib import pyplot as plt

    x = np.arange(0, 10)
    # Note: ^ is bitwise XOR in Python, not exponentiation (use ** for powers).
    # XOR yields the jagged series that the annotations below point at.
    y = x ^ 2
    z = x ^ 3
    t = x ^ 4

    # Labeling the axes and title
    plt.title("Graph Drawing")
    plt.xlabel("Time")
    plt.ylabel("Distance")
    plt.plot(x, y)

    # Annotate; the label is passed positionally because the keyword
    # was renamed from s to text in Matplotlib 3.3
    plt.annotate("Second Entry", xy=(2, 1))
    plt.annotate("Third Entry", xy=(4, 6))
    plt.show()

Adding Legends

We sometimes need a chart with several lines plotted on it. A legend identifies what each line represents. The chart below has three lines with a legend.

    import numpy as np
    from matplotlib import pyplot as plt

    x = np.arange(0, 10)
    y = x ^ 2  # bitwise XOR, as above
    z = x ^ 3
    t = x ^ 4

    # Labeling the axes and title
    plt.title("Graph Drawing")
    plt.xlabel("Time")
    plt.ylabel("Distance")
    plt.plot(x, y)

    # Annotate
    plt.annotate("Second Entry", xy=(2, 1))
    plt.annotate("Third Entry", xy=(4, 6))

    # Adding legends; loc=4 places the legend in the lower right corner
    plt.plot(x, z)
    plt.plot(x, t)
    plt.legend(["Race1", "Race2", "Race3"], loc=4)
    plt.show()

Chart Presentation Style

We can modify the presentation style of the chart by using different methods from the style package.

    import numpy as np
    from matplotlib import pyplot as plt

    # Style the background; the style is set before plotting so that
    # it applies to everything drawn afterwards
    plt.style.use("fast")

    x = np.arange(0, 10)
    y = x ^ 2  # bitwise XOR, as above
    z = x ^ 3
    t = x ^ 4

    # Labeling the axes and title
    plt.title("Graph Drawing")
    plt.xlabel("Time")
    plt.ylabel("Distance")
    plt.plot(x, y)

    # Annotate
    plt.annotate("Second Entry", xy=(2, 1))
    plt.annotate("Third Entry", xy=(4, 6))

    # Adding legends
    plt.plot(x, z)
    plt.plot(x, t)
    plt.legend(["Race1", "Race2", "Race3"], loc=4)
    plt.show()
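To see which styles the style package offers, and to apply one without changing the global settings, something like the following sketch may help. It assumes a non-interactive backend (Agg) so that it runs without a display, and it uses true powers (`**`) and per-line `label` arguments purely for illustration; neither choice comes from the lesson above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
import numpy as np
from matplotlib import pyplot as plt

# Names of the style sheets bundled with the installed Matplotlib
available_styles = sorted(plt.style.available)
print(available_styles)

x = np.arange(0, 10)

# Apply the "fast" style only inside this block;
# the previous settings are restored when the block ends
with plt.style.context("fast"):
    fig, ax = plt.subplots()
    ax.plot(x, x ** 2, label="Race1")  # illustrative data, not the lesson's XOR series
    ax.plot(x, x ** 3, label="Race2")
    ax.legend(loc="lower right")       # same position as loc=4
    fig.canvas.draw()

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
plt.close(fig)
```

Giving each line its own `label` keeps the legend text next to the data it describes, which is easier to maintain than a separate list passed to `legend`.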

Aug 09

Python Word Tokenization

Word tokenization is the process of splitting a large sample of text into words. It is a requirement in natural language processing tasks where each word needs to be captured and analysed further, for example classified or counted for a particular sentiment. The Natural Language Toolkit (NLTK) is a library used to achieve this. Install NLTK before proceeding with the Python program for word tokenization.

    conda install -c anaconda nltk

Next we use the word_tokenize method to split the paragraph into individual words.

    import nltk

    # word_tokenize relies on the Punkt tokenizer models, which are
    # downloaded once with nltk.download("punkt")
    # (the resource is named "punkt_tab" in recent NLTK releases)
    word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
    nltk_tokens = nltk.word_tokenize(word_data)
    print(nltk_tokens)

When we execute the above code, it produces the following result.

    ['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']

Tokenizing Sentences

We can also split a paragraph into sentences, just as we split it into words. We use the sent_tokenize method to achieve this. Below is an example.

    import nltk

    sentence_data = "Sun rises in the east. Sun sets in the west."
    nltk_tokens = nltk.sent_tokenize(sentence_data)
    print(nltk_tokens)

When we execute the above code, it produces the following result.

    ['Sun rises in the east.', 'Sun sets in the west.']
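For a feel of what a tokenizer does, and of the counting step mentioned at the start of this lesson, here is a rough sketch built only on the standard library. It is not NLTK's algorithm: the helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the regular expressions ignore cases such as abbreviations that NLTK handles.

```python
import re
from collections import Counter


def simple_word_tokenize(text):
    """Split text into lowercase word tokens with a regular expression."""
    return re.findall(r"[a-z0-9']+", text.lower())


def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sentence_data = "Sun rises in the east. Sun sets in the west."

sentences = simple_sent_tokenize(sentence_data)
word_counts = Counter(simple_word_tokenize(sentence_data))

print(sentences)
print(word_counts)
```

The `Counter` gives the frequency of each token, which is the kind of count a later analysis step would work from.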

Aug 09

Python Chart Properties

Python has excellent libraries for data visualization. A combination of pandas, numpy and matplotlib can create nearly all types of charts. In this chapter we get started with a simple chart and its various properties.

Creating a Chart

We use the numpy library to create the numbers to be plotted and the pyplot module in matplotlib to draw the actual chart.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(0, 10)
    # Note: ^ is bitwise XOR in Python, not exponentiation (use ** for powers)
    y = x ^ 2

    # Simple plot
    plt.plot(x, y)
    plt.show()

Labeling the Axes

We can apply labels to the axes, as well as a title for the chart, using the appropriate methods from the library as shown below.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(0, 10)
    y = x ^ 2

    # Labeling the axes and title
    plt.title("Graph Drawing")
    plt.xlabel("Time")
    plt.ylabel("Distance")

    # Simple plot
    plt.plot(x, y)
    plt.show()

Formatting Line Type and Colour

The style and colour of the line in the chart can be specified through the format string passed to plot, as shown below.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(0, 10)
    y = x ^ 2

    # Labeling the axes and title
    plt.title("Graph Drawing")
    plt.xlabel("Time")
    plt.ylabel("Distance")

    # Formatting the line colour: "r" draws a red line
    plt.plot(x, y, "r")

    # Formatting the line type: ">" draws a right-pointing
    # triangle marker at each data point
    plt.plot(x, y, ">")
    plt.show()

Saving the Chart File

The chart can be saved in different image file formats using the appropriate methods from the library as shown below.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(0, 10)
    y = x ^ 2

    # Labeling the axes and title
    plt.title("Graph Drawing")
    plt.xlabel("Time")
    plt.ylabel("Distance")

    # Formatting the line colour
    plt.plot(x, y, "r")

    # Formatting the line type
    plt.plot(x, y, ">")

    # Save in PDF format
    plt.savefig("timevsdist.pdf", format="pdf")

The above code creates the PDF file in the current working directory of the Python process.
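The colour, line type and marker can also be given as separate keyword arguments, and the same savefig call writes other formats. The sketch below is one way to do that; the Agg backend, the temporary output directory, the `dpi` value and the use of `**` for the data are all assumptions made for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
y = x ** 2  # illustrative data using true squares

fig, ax = plt.subplots()
# Colour, line type and marker spelled out instead of a format string
ax.plot(x, y, color="r", linestyle="--", marker=">")
ax.set_title("Graph Drawing")
ax.set_xlabel("Time")
ax.set_ylabel("Distance")

# Write a PNG into a temporary directory; the format is inferred
# from the file extension when format= is not given
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "timevsdist.png")
fig.savefig(png_path, dpi=150, bbox_inches="tight")
plt.close(fig)

print(png_path, os.path.getsize(png_path), "bytes")
```

Passing an explicit path to savefig avoids depending on whatever the current working directory happens to be.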

Aug 09

Python Heat Maps

A heatmap represents each value to be plotted as a shade of colour. Usually the darker shades represent higher values than the lighter shades. For a very different value, a completely different colour can also be used.

The example below is a two-dimensional plot of values that are mapped to the index and columns of a DataFrame.

    from pandas import DataFrame
    import matplotlib.pyplot as plt

    # Each row is a list; sets ({...}) would lose the order of the values
    data = [[2, 3, 4, 1], [6, 3, 5, 2], [6, 3, 5, 4], [3, 7, 5, 4], [2, 8, 1, 5]]
    Index = ["I1", "I2", "I3", "I4", "I5"]
    Cols = ["C1", "C2", "C3", "C4"]
    df = DataFrame(data, index=Index, columns=Cols)

    plt.pcolor(df)
    plt.show()
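A plain pcolor call shows neither the row and column names nor how the colours map to values. The following sketch adds both, assuming the Agg backend and the "Blues" colour map (chosen here because its darker shades stand for higher values); these choices are illustrative and not part of the example above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen, no display needed
import matplotlib.pyplot as plt
from pandas import DataFrame

data = [[2, 3, 4, 1], [6, 3, 5, 2], [6, 3, 5, 4], [3, 7, 5, 4], [2, 8, 1, 5]]
df = DataFrame(data,
               index=["I1", "I2", "I3", "I4", "I5"],
               columns=["C1", "C2", "C3", "C4"])

fig, ax = plt.subplots()
mesh = ax.pcolor(df.values, cmap="Blues")

# Cell (row i, column j) spans [j, j+1] x [i, i+1], so ticks go at the cell centres
ax.set_xticks([j + 0.5 for j in range(df.shape[1])])
ax.set_xticklabels(df.columns)
ax.set_yticks([i + 0.5 for i in range(df.shape[0])])
ax.set_yticklabels(df.index)

# The colour bar shows which shade corresponds to which value
fig.colorbar(mesh, ax=ax, label="Value")
fig.canvas.draw()

x_labels = [tick.get_text() for tick in ax.get_xticklabels()]
y_labels = [tick.get_text() for tick in ax.get_yticklabels()]
print(x_labels, y_labels)
plt.close(fig)
```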

Aug 09

Python Box Plots

A boxplot shows how the values in a data set are distributed. It uses three quartiles to divide the data set into four parts. The graph represents the minimum, maximum, median, first quartile and third quartile of the data set. It is also useful for comparing the distribution of data across data sets by drawing a boxplot for each of them.

Drawing a Box Plot

A boxplot can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column.

For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0,1).

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
    # grid takes a boolean, not the string "True"
    df.plot.box(grid=True)
    plt.show()
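The five values that a boxplot summarises can also be computed directly. A minimal sketch with numpy follows; the seeded random generator and the variable names are assumptions added so that the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the run repeatable
samples = rng.random((10, 5))    # 10 observations for each of 5 trials

# Rows of the result: minimum, first quartile, median, third quartile, maximum
# Columns: one per trial
summary = np.percentile(samples, [0, 25, 50, 75, 100], axis=0)

for name, column in zip("ABCDE", summary.T):
    print(name, np.round(column, 3))
```

Each printed row lists, for one trial, the values that the box and whiskers are drawn from.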

Aug 09

Python Measuring Central Tendency

Mathematically, central tendency describes the center, or typical location, of the values in a data set. It gives an idea of the average value of the data and provides a reference point against which the spread of the values can be judged. That in turn helps in evaluating the chances of a new input fitting into the existing data set, and hence the probability of success. There are three main measures of central tendency, which can be calculated using the methods in the pandas library.

Mean – the average value of the data: the sum of the values divided by the number of values.
Median – the middle value in the distribution when the values are arranged in ascending or descending order.
Mode – the most commonly occurring value in the distribution.

Calculating Mean and Median

The pandas functions can be used directly to calculate these values.

    import pandas as pd

    # Create a dictionary of Series
    d = {"Name": pd.Series(["Tom", "James", "Ricky", "Vin", "Steve", "Smith", "Jack",
                            "Lee", "Chanchal", "Gasper", "Naviya", "Andres"]),
         "Age": pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         "Rating": pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # numeric_only=True skips the text column "Name"
    print("Mean Values in the Distribution")
    print(df.mean(numeric_only=True))
    print("*******************************")
    print("Median Values in the Distribution")
    print(df.median(numeric_only=True))

Its output is as follows:

    Mean Values in the Distribution
    Age       31.833333
    Rating     3.743333
    dtype: float64
    *******************************
    Median Values in the Distribution
    Age       29.50
    Rating     3.79
    dtype: float64

Calculating Mode

A mode may or may not exist in a distribution, depending on whether the data is continuous and whether some value occurs more often than the others. We take a simple distribution below to find the mode. Here one value has the maximum frequency in the distribution.

    import pandas as pd

    # Create a dictionary of Series
    d = {"Name": pd.Series(["Tom", "James", "Ricky", "Vin", "Steve", "Smith", "Jack",
                            "Lee", "Chanchal", "Gasper", "Naviya", "Andres"]),
         "Age": pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.mode())

Its output is as follows. Every name occurs once, so all twelve names are modes of the Name column, while Age has the single mode 25. The column order shown comes from an older pandas release that sorted dictionary keys alphabetically; recent releases keep the order in which the keys were written.

         Age      Name
    0   25.0    Andres
    1    NaN  Chanchal
    2    NaN    Gasper
    3    NaN      Jack
    4    NaN     James
    5    NaN       Lee
    6    NaN    Naviya
    7    NaN     Ricky
    8    NaN     Smith
    9    NaN     Steve
    10   NaN       Tom
    11   NaN       Vin
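The same three measures are available in Python's standard library, which can be handy for checking a pandas result on a small list. The sketch below uses the Age values from the mode example; the variable names and the small tied list are illustrative assumptions, and `multimode` needs Python 3.8 or later.

```python
import statistics

ages = [25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]

age_mean = statistics.mean(ages)
age_median = statistics.median(ages)
age_mode = statistics.mode(ages)

# multimode returns every value tied for the highest frequency,
# which covers distributions with more than one mode
tied_modes = statistics.multimode([1, 1, 2, 2, 3])

print("Mean:", age_mean)
print("Median:", age_median)
print("Mode:", age_mode)
print("Modes of [1, 1, 2, 2, 3]:", tied_modes)
```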

Aug 09

Python Chi-Square Test

The chi-square test is a statistical method to determine whether two categorical variables have a significant association between them. Both variables should come from the same population and should be categorical, like Yes/No, Male/Female or Red/Green.

For example, we can build a data set with observations on people's ice-cream buying patterns and try to relate the gender of a person to the flavour of ice-cream they prefer. If an association is found, we can plan the stock of each flavour from the number of visitors of each gender.

The example below uses the stats module of the SciPy library to plot the chi-square distribution, on which the test is based, for several degrees of freedom.

    from scipy import stats
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 100)
    fig, ax = plt.subplots(1, 1)

    linestyles = [":", "--", "-.", "-"]
    deg_of_freedom = [1, 4, 7, 6]
    for df, ls in zip(deg_of_freedom, linestyles):
        # Label each curve so that plt.legend() has entries to show
        ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label="df = %d" % df)

    plt.xlim(0, 10)
    plt.ylim(0, 0.4)

    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.title("Chi-Square Distribution")

    plt.legend()
    plt.show()
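The plot shows the distribution only; the test itself can be run on a table of observed counts. The sketch below uses `scipy.stats.chi2_contingency` on made-up ice-cream counts. The numbers, the flavour names and the 0.05 significance level are assumptions for illustration, not data from this lesson.

```python
import numpy as np
from scipy import stats

# Hypothetical counts of buyers: rows are genders, columns are flavours
observed = np.array([
    [30, 20, 10],   # male:   vanilla, chocolate, strawberry
    [20, 25, 15],   # female: vanilla, chocolate, strawberry
])

# Test of independence between the row variable and the column variable
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-square statistic:", chi2)
print("Degrees of freedom:", dof)
print("p-value:", p_value)
print("Expected counts if gender and flavour were independent:")
print(expected)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Evidence of an association between gender and flavour")
else:
    print("No significant association found")
```

The degrees of freedom equal (rows - 1) x (columns - 1), which is what selects the curve of the chi-square distribution that the statistic is compared against.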

Aug 09

Python Normal Distribution

The normal distribution describes data whose values cluster symmetrically around the mean: most values lie close to the mean, and values become less likely the further they are from it.

We use functions in the numpy library to calculate the values for a normal distribution. A histogram of the samples is created, over which we plot the probability distribution curve.

    import matplotlib.pyplot as plt
    import numpy as np

    mu, sigma = 0.5, 0.1
    s = np.random.normal(mu, sigma, 1000)

    # Create the bins and histogram; density=True replaces the
    # normed=True argument that newer Matplotlib versions no longer accept
    count, bins, ignored = plt.hist(s, 20, density=True)

    # Plot the distribution curve
    plt.plot(bins, 1 / (sigma * np.sqrt(2 * np.pi)) *
             np.exp(-(bins - mu)**2 / (2 * sigma**2)), linewidth=3, color="y")
    plt.show()
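The curve plotted over the histogram is the normal probability density function. As a rough check of its two key properties, the sketch below confirms numerically that the area under the curve is about 1 and that roughly 68% of samples fall within one standard deviation of the mean. The helper name `normal_pdf`, the seed and the sample size are assumptions for illustration.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))


mu, sigma = 0.5, 0.1

# Approximate the area under the curve with a Riemann sum over +/- 6 sigma
grid = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 2001)
dx = grid[1] - grid[0]
area = np.sum(normal_pdf(grid, mu, sigma)) * dx

# Share of random samples that fall within one standard deviation of the mean
rng = np.random.default_rng(42)
s = rng.normal(mu, sigma, 100_000)
within_one_sigma = np.mean(np.abs(s - mu) < sigma)

print("Area under the curve:", area)
print("Share within one sigma:", within_one_sigma)
```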

Aug 09

Python Time Series

A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year.

In the example below we take the daily price of a particular stock symbol over one quarter. We capture these values in a CSV file and then organize them into a DataFrame using the pandas library. We then set the date field as the index of the DataFrame by copying the ValueDate column into the index and deleting the original column.

Sample Data

Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a file named stock.csv.

    ValueDate, Price
    01-01-2018, 1042.05
    02-01-2018, 1033.55
    03-01-2018, 1029.7
    04-01-2018, 1021.3
    05-01-2018, 1015.4
    …
    …
    …
    …
    23-03-2018, 1161.3
    26-03-2018, 1167.6
    27-03-2018, 1155.25
    28-03-2018, 1154

Creating Time Series

    from datetime import datetime
    import pandas as pd
    import matplotlib.pyplot as plt

    # skipinitialspace drops the blank that follows each comma in the file
    data = pd.read_csv("path_to_file/stock.csv", skipinitialspace=True)
    df = pd.DataFrame(data, columns=["ValueDate", "Price"])

    # Set the date as index; the dates are written day-month-year,
    # so the format is given explicitly
    df["ValueDate"] = pd.to_datetime(df["ValueDate"], format="%d-%m-%Y")
    df.index = df["ValueDate"]
    del df["ValueDate"]

    df.plot(figsize=(15, 6))
    plt.show()
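Once the dates form the index, the series can be sliced and summarised by time. The sketch below reads only the rows printed in the sample data (the elided rows are left out) from an in-memory string instead of stock.csv, and the monthly average is an illustrative use that is not part of the lesson.

```python
import io

import pandas as pd

# Only the rows shown in the sample data; the elided rows are omitted
csv_text = """ValueDate,Price
01-01-2018,1042.05
02-01-2018,1033.55
03-01-2018,1029.7
04-01-2018,1021.3
05-01-2018,1015.4
23-03-2018,1161.3
26-03-2018,1167.6
27-03-2018,1155.25
28-03-2018,1154
"""

df = pd.read_csv(io.StringIO(csv_text))

# The dates are day-month-year, so the format is stated explicitly
df["ValueDate"] = pd.to_datetime(df["ValueDate"], format="%d-%m-%Y")
df = df.set_index("ValueDate")

# Select one month through the date index
january = df.loc["2018-01"]

# Average price per calendar month (1 = January, 3 = March)
monthly_mean = df["Price"].groupby(df.index.month).mean()

print(january)
print(monthly_mean)
```

Stating the date format matters here: without it, a value such as 02-01-2018 could be read as 1 February rather than 2 January.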

Aug 09

Python Data Wrangling

Data wrangling involves processing data in various ways, such as merging, grouping and concatenating, for the purpose of analysing it or getting it ready to be used with another set of data. Python libraries provide features to apply these wrangling methods to various data sets to achieve the analytical goal. In this chapter we look at a few examples describing these methods.

Merging Data

The pandas library provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

    pd.merge(left, right, how="inner", on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=False)

Let us now create the two DataFrames on which merging operations can be performed.

    # import the pandas library
    import pandas as pd

    left = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
        "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"]})
    right = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
        "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"]})
    print(left)
    print(right)

Its output is as follows. The column order shown comes from an older pandas release that sorted dictionary keys alphabetically; recent releases keep the order in which the keys were written.

         Name  id subject_id
    0    Alex   1       sub1
    1     Amy   2       sub2
    2   Allen   3       sub4
    3   Alice   4       sub6
    4  Ayoung   5       sub5

        Name  id subject_id
    0  Billy   1       sub2
    1  Brian   2       sub4
    2   Bran   3       sub3
    3  Bryce   4       sub6
    4  Betty   5       sub5

Grouping Data

Grouping data sets is a frequent need in data analysis, where we want the result in terms of the various groups present in the data set. Pandas has built-in methods that can roll the data into groups.

In the example below we group the data by year and then get the result for a specific year.

    # import the pandas library
    import pandas as pd

    ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings",
                         "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
                "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
                "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
                "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
    df = pd.DataFrame(ipl_data)

    grouped = df.groupby("Year")
    print(grouped.get_group(2014))

Its output is as follows:

       Points  Rank    Team  Year
    0     876     1  Riders  2014
    2     863     2  Devils  2014
    4     741     3   Kings  2014
    9     701     4  Royals  2014

Concatenating Data

Pandas provides various facilities for easily combining Series and DataFrame objects. In the example below, the concat function performs the concatenation along an axis. Let us create different objects and concatenate them.

    import pandas as pd

    one = pd.DataFrame({
        "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
        "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"],
        "Marks_scored": [98, 90, 87, 69, 78]},
        index=[1, 2, 3, 4, 5])
    two = pd.DataFrame({
        "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
        "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"],
        "Marks_scored": [89, 80, 79, 97, 88]},
        index=[1, 2, 3, 4, 5])
    print(pd.concat([one, two]))

Its output is as follows:

       Marks_scored    Name subject_id
    1            98    Alex       sub1
    2            90     Amy       sub2
    3            87   Allen       sub4
    4            69   Alice       sub6
    5            78  Ayoung       sub5
    1            89   Billy       sub2
    2            80   Brian       sub4
    3            79    Bran       sub3
    4            97   Bryce       sub6
    5            88   Betty       sub5
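Returning to the merge function described under Merging Data: the lesson builds the two frames but stops before joining them. A minimal sketch of the join follows, assuming `subject_id` is the key to join on. The `suffixes` values and the result variable names are illustrative choices.

```python
import pandas as pd

left = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"]})
right = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"]})

# Inner join: keep only the subjects that appear in both frames
inner = pd.merge(left, right, on="subject_id", how="inner",
                 suffixes=("_left", "_right"))

# Left join: keep every row of `left`; unmatched columns from `right` become NaN
left_join = pd.merge(left, right, on="subject_id", how="left",
                     suffixes=("_left", "_right"))

print(inner)
print(left_join)
```

The `how` argument selects the join type, in the same way as INNER JOIN and LEFT JOIN in SQL, and the suffixes distinguish the `id` and `Name` columns that exist in both frames.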