Machine Learning – Density Plots

A density plot is a type of plot that shows the probability density function of a continuous variable. It is similar to a histogram, but instead of using bars to represent the frequency of each value, it uses a smooth curve to represent the probability density function. The x-axis represents the range of values of the variable, and the y-axis represents the probability density.

Density plots are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside the range of typical values for the variable.

Python Implementation of Density Plots

Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. In the example below, we will use Seaborn to implement density plots. We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset contains information about the characteristics of breast cancer cells and whether they are malignant or benign. The dataset has 30 features and 569 samples.

Example

Let's start by importing the necessary libraries and loading the dataset −

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

Next, we will create a density plot of the mean radius feature of the dataset −

plt.figure(figsize=(7.2, 3.5))
sns.kdeplot(data.data[:, 0], shade=True)
plt.xlabel("Mean Radius")
plt.ylabel("Density")
plt.show()

In this code, we have used the kdeplot() function from Seaborn to create a density plot of the mean radius feature of the dataset. We have set the shade parameter to True to shade the area under the curve. We have also added labels to the x and y axes using the xlabel() and ylabel() functions.

Output

The resulting density plot shows the probability density function of mean radius values in the dataset. We can see that the data is roughly normally distributed, with a peak around 12-14.

Density Plot with Multiple Data Sets

We can also create a density plot with multiple data sets to compare their probability density functions. Let's create density plots of the mean radius feature for both the malignant and benign samples −

Example

plt.figure(figsize=(7.5, 3.5))
sns.kdeplot(data.data[data.target == 0, 0], shade=True, label="Malignant")
sns.kdeplot(data.data[data.target == 1, 0], shade=True, label="Benign")
plt.xlabel("Mean Radius")
plt.ylabel("Density")
plt.legend()
plt.show()

In this code, we have used the kdeplot() function twice to create two density plots of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the shade parameter to True to shade the area under the curve, and we have added labels to the plots using the label parameter. We have also added a legend to the plot using the legend() function.

Output

On executing this code, you will get the following plot as the output −

The resulting density plot shows the probability density functions of mean radius values for both the malignant and benign samples. We can see that the probability density function for the malignant samples is shifted to the right, indicating a higher mean radius value.
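Since a density plot is essentially a smoothed histogram, it can be instructive to draw both on the same axes. The following is a small optional sketch (not part of the original example) that overlays a normalized histogram and its KDE curve for the same mean radius feature; the choice of 30 bins is arbitrary.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

plt.figure(figsize=(7.2, 3.5))
# Histogram of the mean radius feature, normalized so its area is 1,
# with the corresponding KDE curve drawn on top for comparison
sns.histplot(data.data[:, 0], bins=30, stat="density", kde=True)
plt.xlabel("Mean Radius")
plt.ylabel("Density")
plt.show()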
Machine Learning – Data Visualization

Data visualization is an important aspect of machine learning (ML) as it helps to analyze and communicate patterns, trends, and insights in the data. Data visualization involves creating graphical representations of the data, which can help to identify patterns and relationships that may not be apparent from the raw data.

Here are some of the ways data visualization is used in machine learning −

Exploring Data − Data visualization is an essential tool for exploring and understanding data. Visualization can help to identify patterns, correlations, and outliers, and can also help to detect data quality issues such as missing values and inconsistencies.

Feature Selection − Data visualization can help to select relevant features for the ML model. By visualizing the data and its relationship with the target variable, you can identify features that are strongly correlated with the target variable and exclude irrelevant features that have little predictive power.

Model Evaluation − Data visualization can be used to evaluate the performance of the ML model. Visualization techniques such as ROC curves, precision-recall curves, and confusion matrices can help to understand the accuracy, precision, recall, and F1 score of the model.

Communicating Insights − Data visualization is an effective way to communicate insights and results to stakeholders who may not have a technical background. Visualizations such as scatter plots, line charts, and bar charts can help to convey complex information in an easily understandable format.

Some popular libraries used for data visualization in Python include Matplotlib, Seaborn, Plotly, and Bokeh. These libraries provide a wide range of visualization techniques and customization options to suit different needs and preferences.

Univariate Plots: Understanding Attributes Independently

The simplest type of visualization is single-variable or "univariate" visualization. With the help of univariate visualization, we can understand each attribute of our dataset independently. Techniques to implement univariate visualization in Python are covered in the following chapters.

Multivariate Plots: Interaction Among Multiple Variables

Another type of visualization is multi-variable or "multivariate" visualization. With the help of multivariate visualization, we can understand the interaction between multiple attributes of our dataset. Techniques to implement multivariate visualization in Python are covered in the following chapters.

In the next few chapters, we will look at some of the popular and widely used visualization techniques available in machine learning.
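To make the distinction concrete, here is a small illustrative sketch (not from the original text) that produces one univariate plot (a histogram of a single attribute) and one multivariate plot (a scatter plot of two attributes colored by class) using the Iris dataset; the choice of features is arbitrary.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()

# Univariate plot: distribution of a single attribute (sepal length)
plt.figure(figsize=(7.5, 3.5))
plt.hist(iris.data[:, 0], bins=20)
plt.xlabel(iris.feature_names[0])
plt.ylabel("Frequency")
plt.show()

# Multivariate plot: interaction between two attributes, colored by class
plt.figure(figsize=(7.5, 3.5))
plt.scatter(iris.data[:, 0], iris.data[:, 2], c=iris.target)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[2])
plt.show()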
Machine Learning – Supervised vs. Unsupervised

Machine Learning approaches can be either Supervised or Unsupervised. If you can anticipate the extent of the data, and if it is possible to divide the data into categories, then the best approach is to help the algorithm become smarter through Supervised Learning. If you anticipate that the amount of data is massive, and if you think that the data cannot be simply classified or labelled, then it is better to go for the Unsupervised Learning approach and let the algorithms handle predictions smartly.

Differences between Supervised and Unsupervised Machine Learning

The table below shows some key differences between supervised and unsupervised machine learning −

Supervised Technique | Unsupervised Technique
Supervised machine learning algorithms are trained using training data together with its associated output, i.e., labeled data. | Unsupervised machine learning algorithms do not require labeled data for training.
A supervised machine learning model learns the association between the input training data and its labels. | An unsupervised machine learning model learns patterns and relationships from the given raw data.
A supervised ML model takes feedback to check whether it is predicting the correct output or not. | An unsupervised ML model does not take any kind of feedback.
As the name entails, supervised machine learning algorithms need supervision to train the model. | As the name entails, unsupervised machine learning algorithms do not need any kind of supervision to train the model.
We can divide supervised machine learning algorithms into two broad classes, namely Classification and Regression. | Clustering, Anomaly Detection, and Association are some of the broad classes of unsupervised machine learning algorithms.
In terms of computational complexity, supervised machine learning methods are computationally simple. | Unsupervised machine learning methods are computationally complex.
Supervised machine learning methods are highly accurate. | Unsupervised machine learning methods are less accurate.
In supervised machine learning, the learning takes place offline. | In unsupervised machine learning, the learning takes place in real time.
The number of classes is already known before implementing supervised machine learning methods. | In unsupervised learning methods, the number of classes is not known in advance.
One of the main drawbacks of supervised learning is the difficulty of classifying big data. | As the data used in unsupervised learning is not labeled, getting precise information regarding data sorting is one of its main drawbacks.
Some of the well-known supervised machine learning algorithms are KNN (k-nearest neighbors), Decision Tree, Logistic Regression, and Random Forest. | Some of the well-known unsupervised machine learning algorithms are Hebbian Learning, K-means Clustering, and Hierarchical Clustering.
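The following is a minimal sketch (not part of the original text) contrasting the two approaches on the Iris data: a supervised classifier (KNN) is trained with the labels, while an unsupervised algorithm (K-means) groups the same samples without ever seeing the labels.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the model is trained on the features AND their labels
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print("Supervised (KNN) training accuracy:", knn.score(X, y))

# Unsupervised: the model only sees the features and discovers groups itself
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)
print("Unsupervised (K-means) cluster labels for the first 10 samples:", clusters[:10])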
Machine Learning – Cross Validation

Cross-validation is a powerful technique used in machine learning to estimate the performance of a model on unseen data. It is an essential step in building a robust machine learning model, as it helps to identify overfitting or underfitting, and helps to determine the optimal model hyperparameters.

What is Cross-Validation?

Cross-validation is a technique used to evaluate the performance of a model by partitioning the dataset into subsets, training the model on a portion of the data, and then validating the model on the remaining data. The basic idea behind cross-validation is to use a subset of the data to train the model and another subset to test its performance. This allows the machine learning model to be trained on a variety of data and to generalize better to new data.

There are different types of cross-validation techniques available, but the most commonly used technique is k-fold cross-validation. In k-fold cross-validation, the data is partitioned into k equally sized folds. The model is then trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each of the k folds used once as the validation data. The final performance of the model is then averaged over the k iterations to obtain an estimate of the model's performance.

Why is Cross-Validation Important?

Cross-validation is an essential technique in machine learning because it helps to prevent overfitting or underfitting of a model. Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor performance on new data. On the other hand, underfitting occurs when the model is too simple and does not capture the underlying patterns in the data, resulting in poor performance on both the training and test data.

Cross-validation also helps to determine the optimal model hyperparameters. Hyperparameters are the settings that control the behavior of the model. For example, in a decision tree algorithm, the maximum depth of the tree is a hyperparameter that determines the level of complexity of the model. By using cross-validation to evaluate the performance of the model at different hyperparameter values, we can select the optimal hyperparameters that maximize the model's performance.

Implementing Cross-Validation in Python

In this section, we will discuss how to implement k-fold cross-validation in Python using the Scikit-learn library. Scikit-learn is a popular Python library for machine learning that provides a range of algorithms and tools for data preprocessing, model selection, and evaluation.

To demonstrate how to implement cross-validation in Python, we will use the famous Iris dataset. The Iris dataset contains measurements of the sepal length, sepal width, petal length, and petal width of three different species of iris flowers. The goal is to build a model that can predict the species of an iris flower based on its measurements.

First, we need to load the dataset using the Scikit-learn load_iris() function and split it into a training set and a test set using the train_test_split() function. The training set will be used to train the model, and the test set will be used to evaluate the performance of the model.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

Next, we will create a decision tree classifier using the Scikit-learn DecisionTreeClassifier() function.

from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

Now, we can use k-fold cross-validation to evaluate the performance of the model. We will use the cross_val_score() function from Scikit-learn to perform k-fold cross-validation. The function takes as input the model, the training data, the target variable, and the number of folds. It returns an array of scores, one for each fold.

from sklearn.model_selection import cross_val_score

# Perform k-fold cross-validation
scores = cross_val_score(clf, X_train, y_train, cv=5)

Here, we have specified the number of folds as 5, meaning that the data will be partitioned into 5 equally sized folds. The cross_val_score() function will train the model on 4 folds and test it on the remaining fold. This process will be repeated 5 times, with each fold used once as the validation data. The function returns an array of scores, one for each fold.

Finally, we can calculate the mean and standard deviation of the scores to get an estimate of the model's performance.

import numpy as np

# Calculate the mean and standard deviation of the scores
mean_score = np.mean(scores)
std_score = np.std(scores)

print("Mean cross-validation score: {:.2f}".format(mean_score))
print("Standard deviation of cross-validation score: {:.2f}".format(std_score))

The output of this code will be the mean and standard deviation of the scores. The mean score represents the average performance of the model across all folds, while the standard deviation represents the variability of the scores.

Example

Here is the complete implementation of Cross-Validation in Python −

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load the iris dataset
iris = load_iris()

# Define the features and target variables
X = iris.data
y = iris.target

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Perform k-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Calculate the mean and standard deviation of the scores
mean_score = np.mean(scores)
std_score = np.std(scores)

print("Mean cross-validation score: {:.2f}".format(mean_score))
print("Standard deviation of cross-validation score: {:.2f}".format(std_score))

Output

When you execute this code, it will produce the following output −

Mean cross-validation score: 0.95
Standard deviation of cross-validation score: 0.03
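cross_val_score() handles the fold bookkeeping internally. As an optional illustration of what happens under the hood (not from the original text), here is a small sketch that builds the five folds explicitly with KFold and scores each one; the shuffle and random_state settings are arbitrary choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train on 4 folds, validate on the remaining fold
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    score = clf.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    print("Fold {}: accuracy = {:.2f}".format(fold, score))

print("Mean accuracy: {:.2f}".format(np.mean(fold_scores)))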
Machine Learning – Correlation Matrix Plot

A correlation matrix plot is a graphical representation of the pairwise correlation between variables in a dataset. The plot consists of a matrix of scatterplots and correlation coefficients, where each scatterplot represents the relationship between two variables, and the correlation coefficient indicates the strength of the relationship. The diagonal of the matrix usually shows the distribution of each variable.

The correlation coefficient is a measure of the linear relationship between two variables and ranges from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, where an increase in one variable is associated with an increase in the other variable. A coefficient of -1 indicates a perfect negative correlation, where an increase in one variable is associated with a decrease in the other variable. A coefficient of 0 indicates no correlation between the variables.

Python Implementation of Correlation Matrix Plots

Now that we have a basic understanding of correlation matrix plots, let's implement them in Python. For our example, we will be using the Iris flower dataset from Sklearn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers belonging to three different species – Setosa, Versicolor, and Virginica.

Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
target = iris.target

corr = data.corr()
sns.set(style="white")
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

Output

This code will produce a correlation matrix plot of the Iris dataset, with each square representing the correlation coefficient between two variables. From this plot, we can see that the variables 'sepal width (cm)' and 'petal length (cm)' have a moderate negative correlation (-0.37), while the variables 'petal length (cm)' and 'petal width (cm)' have a strong positive correlation (0.96). We can also see that the variable 'sepal length (cm)' has a strong positive correlation (0.87) with the variable 'petal length (cm)'.
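To read the exact coefficients rather than estimating them from the colors, you can also print the correlation matrix directly. A minimal sketch (not part of the original example) −

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation coefficients, rounded for readability
print(data.corr().round(2))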
Machine Learning – Boost Model Performance

Boosting is a popular ensemble learning technique that combines several weak learners to create a strong learner. It works by iteratively training weak learners on subsets of the data and assigning higher weights to the misclassified samples to increase their importance in the subsequent iterations. This process is repeated until the desired level of performance is achieved.

Here are some techniques to boost model performance in machine learning −

Feature Engineering − Feature engineering involves creating new features from the existing features or transforming the existing features to make them more informative for the model. This can include techniques such as one-hot encoding, scaling, normalization, and feature selection.

Hyperparameter Tuning − Hyperparameters are parameters that are not learned during training but are set by the data scientist. They control the behavior of the model, and tuning them can significantly impact model performance. Grid search and randomized search are common techniques for hyperparameter tuning.

Ensemble Learning − Ensemble learning involves combining multiple models to improve performance. Techniques such as bagging, boosting, and stacking can be used to create ensembles. Random forests are an example of a bagging ensemble, while gradient boosting machines (GBMs) are an example of a boosting ensemble.

Regularization − Regularization is a technique that prevents overfitting by adding a penalty term to the loss function. L1 regularization (Lasso) and L2 regularization (Ridge) are common techniques used in linear models, while dropout is a technique used in neural networks.

Data Augmentation − Data augmentation involves generating new data from the existing data by applying transformations such as rotation, scaling, and flipping. This can help to reduce overfitting and improve model performance.

Model Architecture − The architecture of the model can significantly impact its performance. Techniques such as deep learning and convolutional neural networks (CNNs) can be used to create more complex models that are better able to learn complex patterns in the data.

Early Stopping − Early stopping is a technique used to prevent overfitting by stopping the training process once the model performance stops improving on a validation set. This prevents the model from continuing to learn the noise in the data and can help to improve generalization.

Cross-Validation − Cross-validation is a technique used to evaluate the performance of a model on multiple subsets of the data. This can help to identify overfitting and can be used to select the best hyperparameters for the model.

These techniques can be implemented in Python using various machine learning libraries such as scikit-learn, TensorFlow, and Keras. By using these techniques, data scientists can improve the performance of their models and create more accurate predictions.
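Several of the techniques above (hyperparameter tuning, cross-validation, ensembles) can be combined in a few lines of scikit-learn. Below is a small illustrative sketch (not from the original text) that grid-searches two hyperparameters of a gradient boosting classifier with 5-fold cross-validation; the parameter grid values are arbitrary choices.

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
X, y = iris.data, iris.target

# Candidate hyperparameter values to try (illustrative choices)
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5],
}

# Exhaustive search over the grid, scoring each combination with 5-fold CV
grid = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))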
The following example implements cross-validation using Scikit-learn −

Example

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier()

# Perform 5-fold cross-validation on the classifier
scores = cross_val_score(gb_clf, X, y, cv=5)

# Print the average accuracy and standard deviation of the cross-validation scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Output

When you execute this code, it will produce the following output −

Accuracy: 0.96 (+/- 0.07)

Performance Improvement with Ensembles

Ensembles can give us a boost in the machine learning result by combining several models. Basically, ensemble models consist of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance compared to a single model. Ensemble methods can be divided into the following two groups −

Sequential ensemble methods

As the name implies, in these kinds of ensemble methods, the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods

As the name implies, in these kinds of ensemble methods, the base learners are generated in parallel. The motivation of such methods is to exploit the independence among base learners.

Ensemble Learning Methods

The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models −

Bagging

The term bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models trained over randomly generated training samples. The final prediction of the ensemble model is given by calculating the average of all predictions from the individual estimators. One of the best examples of bagging methods is random forests.

Boosting

In the boosting method, the main principle of building the ensemble model is to build it incrementally by training each base model estimator sequentially. As the name suggests, it basically combines several weak base learners, trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During the training of the weak base learners, higher weights are assigned to the training samples that were misclassified earlier. An example of a boosting method is AdaBoost.

Voting

In this ensemble learning model, multiple models of different types are built and some simple statistics, like calculating the mean or median, are used to combine their predictions into the final prediction.

Bagging Ensemble Algorithms

The following are three bagging ensemble algorithms −

Bagged Decision Tree

As we know, bagging ensemble methods work well with algorithms that have high variance and, in this regard, the best one is the decision tree algorithm. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier function of sklearn with DecisionTreeClassifier (a classification & regression trees algorithm) on the Pima Indians diabetes dataset.
First, import the required packages as follows −

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
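The original recipe is truncated at this point. Below is a plausible completion, sketched under the assumption that the CSV file at the path above exists with the columns listed in headernames; it continues from the imports and variables defined above. The choices of 100 trees and 10 folds are illustrative, and note that the BaggingClassifier keyword is estimator in scikit-learn 1.2+ (base_estimator in older versions).

# Load the dataset and separate the features from the class label
dataframe = read_csv(path, names=headernames)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# Evaluate a bagged decision tree ensemble with 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cart = DecisionTreeClassifier()
model = BaggingClassifier(estimator=cart, n_estimators=100, random_state=7)  # use base_estimator in older sklearn

results = cross_val_score(model, X, Y, cv=kfold)
print("Mean accuracy: %.3f" % results.mean())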
Machine Learning – Statistics

Statistics is a crucial tool in machine learning because it helps us understand the underlying patterns in the data. It provides us with methods to describe, summarize, and analyze data. Let's see some of the basics of statistics for machine learning.

Descriptive Statistics

Descriptive statistics is a branch of statistics that deals with the summary and analysis of data. It includes measures such as mean, median, mode, variance, and standard deviation. These measures help us understand the central tendency, variability, and distribution of the data. In machine learning, descriptive statistics can be used to summarize the data, identify outliers, and detect patterns. For example, we can use the mean and standard deviation to describe the distribution of a dataset.

In Python, we can calculate descriptive statistics using libraries such as NumPy and Pandas. Below is an example −

Example

import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5])
df = pd.DataFrame(data, columns=["Values"])
print(df.describe())

Output

This will output a summary of the dataset, including the count, mean, standard deviation, minimum, and maximum values as follows −

         Values
count  5.000000
mean   3.000000
std    1.581139
min    1.000000
25%    2.000000
50%    3.000000
75%    4.000000
max    5.000000

Inferential Statistics

Inferential statistics is a branch of statistics that deals with making predictions and inferences about a population based on a sample of data. It involves using hypothesis testing, confidence intervals, and regression analysis to draw conclusions about the data. In machine learning, inferential statistics can be used to make predictions about new data based on existing data. For example, we can use regression analysis to predict the price of a house based on its features, such as the number of bedrooms and bathrooms.

In Python, we can perform inferential statistics using libraries such as Scikit-Learn and StatsModels. Below is an example −

Example

import statsmodels.api as sm
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

Output

This will output a summary of the regression model, including the coefficients, standard errors, t-statistics, and p-values.

In the next chapter, we will discuss various descriptive and inferential statistics measures, which are commonly used in machine learning, in detail along with Python implementation examples.
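The regression example above only prints the model summary. As a small additional illustration (not part of the original text), the sketch below uses the same fitted OLS model to inspect the 95% confidence intervals of its coefficients and to predict y for a new x value; the new value 6 is an arbitrary choice.

import numpy as np
import statsmodels.api as sm

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

# 95% confidence intervals for the intercept and slope
print(model.conf_int(alpha=0.05))

# Predict y for a new observation x = 6 (remember to add the constant column)
new_x = sm.add_constant(np.array([6.0]), has_constant="add")
print(model.predict(new_x))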
Time Series – Quick Guide

Time Series – Introduction

A time series is a sequence of observations over a certain period. A univariate time series consists of the values taken by a single variable at periodic time instances over a period, and a multivariate time series consists of the values taken by multiple variables at the same periodic time instances over a period. The simplest example of a time series that all of us come across on a day-to-day basis is the change in temperature throughout the day or week or month or year.

The analysis of temporal data is capable of giving us useful insights into how a variable changes over time, or how it depends on the change in the values of other variable(s). This dependence of a variable on its previous values and/or other variables can be analyzed for time series forecasting and has numerous applications in artificial intelligence.

Time Series – Programming Languages

A basic understanding of any programming language is essential for a user to work with or develop machine learning problems. A list of preferred programming languages for anyone who wants to work on machine learning is given below −

Python − It is a high-level interpreted programming language, fast and easy to code. Python can follow either procedural or object-oriented programming paradigms. The presence of a variety of libraries makes implementation of complicated procedures simpler. In this tutorial, we will be coding in Python, and the corresponding libraries useful for time series modelling will be discussed in the upcoming chapters.

R − Similar to Python, R is an interpreted multi-paradigm language, which supports statistical computing and graphics. The variety of packages makes it easier to implement machine learning modelling in R.

Java − It is an object-oriented programming language, which is widely famous for a large range of package availability and sophisticated data visualization techniques.

C/C++ − These are compiled languages, and two of the oldest programming languages. These languages are often preferred to incorporate ML capabilities in already existing applications as they allow you to customize the implementation of ML algorithms easily.

MATLAB − MATrix LABoratory is a multi-paradigm language which provides functionality to work with matrices. It allows mathematical operations for complex problems. It is primarily used for numerical operations, but some packages also allow graphical multi-domain simulation and model-based design.

Other preferred programming languages for machine learning problems include JavaScript, LISP, Prolog, SQL, Scala, Julia, SAS etc.

Time Series – Python Libraries

Python has an established popularity among individuals who perform machine learning because of its easy-to-write and easy-to-understand code structure as well as a wide variety of open source libraries. A few of such open source libraries that we will be using in the coming chapters have been introduced below.

NumPy − Numerical Python is a library used for scientific computing. It works on an N-dimensional array object and provides basic mathematical functionality such as size, shape, mean, standard deviation, minimum, maximum as well as some more complex functions such as linear algebraic functions and Fourier transform. You will learn more about these as we move ahead in this tutorial.

Pandas − This library provides highly efficient and easy-to-use data structures such as series, dataframes and panels.
It has enhanced Python's functionality from mere data collection and preparation to data analysis. The two libraries, Pandas and NumPy, make any operation on small to very large datasets very simple. To know more about these functions, follow this tutorial.

SciPy − Scientific Python is a library used for scientific and technical computing. It provides functionalities for optimization, signal and image processing, integration, interpolation and linear algebra. This library comes in handy while performing machine learning. We will discuss these functionalities as we move ahead in this tutorial.

Scikit Learn − This library is a SciPy toolkit widely used for statistical modelling, machine learning and deep learning, as it contains various customizable regression, classification and clustering models. It works well with NumPy, Pandas and other libraries, which makes it easier to use.

Statsmodels − Like Scikit Learn, this library is used for statistical data exploration and statistical modelling. It also operates well with other Python libraries.

Matplotlib − This library is used for data visualization in various formats such as line plots, bar graphs, heat maps, scatter plots, histograms etc. It contains all the graph-related functionalities required, from plotting to labelling. We will discuss these functionalities as we move ahead in this tutorial.

These libraries are very essential to start with machine learning with any sort of data. Besides the ones discussed above, another library especially significant for dealing with time series is −

Datetime − This library, together with the standard calendar module, provides all the necessary datetime functionality for reading, formatting and manipulating time. We shall be using these libraries in the coming chapters.

Time Series – Data Processing and Visualization

A time series is a sequence of observations indexed in equi-spaced time intervals. Hence, the order and continuity should be maintained in any time series.

The dataset we will be using is a multi-variate time series having hourly data for approximately one year, for air quality in a significantly polluted Italian city. The dataset can be downloaded from the link given below −

It is necessary to make sure that −

The time series is equally spaced, and
There are no redundant values or gaps in it.

In case the time series is not continuous, we can upsample or downsample it.

Showing df.head()

In [122]:
import pandas

In [123]:
df = pandas.read_csv("AirQualityUCI.csv", sep=";", decimal=",")
df = df.iloc[:, 0:14]

In [124]:
len(df)

Out[124]:
9471

In [125]:
df.head()

Out[125]:

For preprocessing the time series, we make sure there are no NaN (NULL) values in the dataset; if there are, we can replace them with either 0 or the average or the preceding or succeeding values. Replacing is a preferred choice over dropping so that the continuity of the time series is maintained. However, in our dataset the last few values appear to be NaN, hence we drop them as they carry no information.
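As an illustration of the replacement and resampling options mentioned above (not part of the original code), the following sketch builds a tiny synthetic hourly series with gaps, fills the NaNs with the preceding value, and downsamples it to daily means; the dates and numbers are made up for the example.

import pandas as pd
import numpy as np

# A small hourly series with two missing values, purely for illustration
idx = pd.date_range("2004-03-10 18:00", periods=6, freq="h")
series = pd.Series([2.6, np.nan, 2.2, np.nan, 1.6, 1.2], index=idx)

# Option 1: replace NaNs with the preceding value (keeps continuity)
filled = series.ffill()

# Option 2: replace NaNs with the series mean
filled_mean = series.fillna(series.mean())

# Downsample the hourly data to a daily mean if a coarser series is needed
daily = filled.resample("D").mean()

print(filled)
print(daily)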
Time Series – Error Metrics

It is important for us to quantify the performance of a model, to use it as feedback and for comparison. In this tutorial we have used one of the most popular error metrics, the root mean squared error. There are various other error metrics available. This chapter discusses them in brief.

Mean Square Error

It is the average of the square of the difference between the predicted values and the true values. Sklearn provides it as a function. Its units are the square of the units of the true and predicted values, and it is always positive.

$$MSE = \frac{1}{n} \displaystyle\sum\limits_{t=1}^{n} \lgroup y'_{t} - y_{t}\rgroup^{2}$$

Where $y'_{t}$ is the predicted value, $y_{t}$ is the actual value, and n is the total number of values in the test set. It is clear from the equation that MSE is more penalizing for larger errors, or outliers.

Root Mean Square Error

It is the square root of the mean square error. It is also always positive and has the same units as the data.

$$RMSE = \sqrt{\frac{1}{n} \displaystyle\sum\limits_{t=1}^{n} \lgroup y'_{t} - y_{t}\rgroup^{2}}$$

Where $y'_{t}$ is the predicted value, $y_{t}$ is the actual value, and n is the total number of values in the test set. Being in the same units as the data, it is more interpretable than MSE. RMSE is also more penalizing for larger errors. We have used the RMSE metric in our tutorial.

Mean Absolute Error

It is the average of the absolute difference between the predicted values and the true values. It has the same units as the predicted and true values and is always positive.

$$MAE = \frac{1}{n} \displaystyle\sum\limits_{t=1}^{n} \lvert y'_{t} - y_{t}\rvert$$

Where $y'_{t}$ is the predicted value, $y_{t}$ is the actual value, and n is the total number of values in the test set.

Mean Percentage Error

It is the average of the difference between the predicted values and the true values, divided by the true value and expressed as a percentage.

$$MPE = \frac{1}{n} \displaystyle\sum\limits_{t=1}^{n} \frac{y'_{t} - y_{t}}{y_{t}} \times 100\:\%$$

Where $y'_{t}$ is the predicted value, $y_{t}$ is the actual value, and n is the total number of values in the test set. However, the disadvantage of using this error is that positive errors and negative errors can offset each other. Hence the mean absolute percentage error is used.

Mean Absolute Percentage Error

It is the average of the absolute difference between the predicted values and the true values, divided by the true value and expressed as a percentage.

$$MAPE = \frac{1}{n} \displaystyle\sum\limits_{t=1}^{n} \frac{\lvert y'_{t} - y_{t}\rvert}{y_{t}} \times 100\:\%$$

Where $y'_{t}$ is the predicted value, $y_{t}$ is the actual value, and n is the total number of values in the test set.
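As a quick illustration (not part of the original chapter), the sketch below computes each of these metrics with NumPy on a toy pair of true and predicted series; the numbers are arbitrary.

import numpy as np

y_true = np.array([100.0, 102.0, 105.0, 103.0, 108.0])
y_pred = np.array([101.0, 100.0, 106.0, 105.0, 107.0])

errors = y_pred - y_true

mse = np.mean(errors ** 2)                      # Mean Square Error
rmse = np.sqrt(mse)                             # Root Mean Square Error
mae = np.mean(np.abs(errors))                   # Mean Absolute Error
mpe = np.mean(errors / y_true) * 100            # Mean Percentage Error
mape = np.mean(np.abs(errors) / y_true) * 100   # Mean Absolute Percentage Error

print("MSE:  %.3f" % mse)
print("RMSE: %.3f" % rmse)
print("MAE:  %.3f" % mae)
print("MPE:  %.3f %%" % mpe)
print("MAPE: %.3f %%" % mape)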
Time Series – LSTM Model

Now, we are familiar with statistical modelling on time series, but machine learning is all the rage right now, so it is essential to be familiar with some machine learning models as well. We shall start with the most popular model in the time series domain − the Long Short-Term Memory model. LSTM is a class of recurrent neural network. So before we can jump to LSTM, it is essential to understand neural networks and recurrent neural networks.

Neural Networks

An artificial neural network is a layered structure of connected neurons, inspired by biological neural networks. It is not one algorithm but a combination of various algorithms which allows us to do complex operations on data.

Recurrent Neural Networks

It is a class of neural networks tailored to deal with temporal data. The neurons of an RNN have a cell state/memory, and input is processed according to this internal state, which is achieved with the help of loops within the neural network. There are recurring module(s) of 'tanh' layers in RNNs that allow them to retain information. However, not for a long time, which is why we need LSTM models.

LSTM

It is a special kind of recurrent neural network that is capable of learning long term dependencies in data. This is achieved because the recurring module of the model has a combination of four layers interacting with each other. (In the standard LSTM diagram, the four neural network layers are shown in yellow boxes, point-wise operators in green circles, inputs in yellow circles and the cell state in blue circles.)

An LSTM module has a cell state and three gates, which provide it with the power to selectively learn, unlearn or retain information from each of the units. The cell state in LSTM helps the information to flow through the units without being altered by allowing only a few linear interactions. Each unit has an input, an output and a forget gate which can add or remove information to the cell state. The forget gate decides which information from the previous cell state should be forgotten, for which it uses a sigmoid function. The input gate controls the flow of new information into the current cell state using a point-wise multiplication of the outputs of a 'sigmoid' layer and a 'tanh' layer. Finally, the output gate decides which information should be passed on to the next hidden state.

Now that we have understood the internal working of the LSTM model, let us implement it. To understand the implementation of LSTM, we will start with a simple example − a straight line. Let us see if LSTM can learn the relationship of a straight line and predict it.

First let us create the dataset depicting a straight line.

In [402]:
import numpy
import matplotlib.pyplot as plt

x = numpy.arange(1, 500, 1)
y = 0.4 * x + 30
plt.plot(x, y)

Out[402]:
[<matplotlib.lines.Line2D at 0x1eab9d3ee10>]

In [403]:
trainx, testx = x[0:int(0.8*(len(x)))], x[int(0.8*(len(x))):]
trainy, testy = y[0:int(0.8*(len(y)))], y[int(0.8*(len(y))):]
train = numpy.array(list(zip(trainx, trainy)))
test = numpy.array(list(zip(testx, testy)))

Now that the data has been created and split into train and test, let's convert the time series data into the form of supervised learning data according to the value of the look-back period, which is essentially the number of lags which are seen to predict the value at time 't'.
So a time series like this −

time   variable_x
t1     x1
t2     x2
:      :
T      xT

when the look-back period is 1, is converted to −

x1     x2
x2     x3
:      :
xT-1   xT

In [404]:
def create_dataset(n_X, look_back):
   dataX, dataY = [], []
   for i in range(len(n_X)-look_back):
      a = n_X[i:(i+look_back), ]
      dataX.append(a)
      dataY.append(n_X[i + look_back, ])
   return numpy.array(dataX), numpy.array(dataY)

In [405]:
look_back = 1
trainx, trainy = create_dataset(train, look_back)
testx, testy = create_dataset(test, look_back)
trainx = numpy.reshape(trainx, (trainx.shape[0], 1, 2))
testx = numpy.reshape(testx, (testx.shape[0], 1, 2))

Now we will train our model. Small batches of training data are shown to the network; one run in which the entire training data is shown to the model in batches and the error is calculated is called an epoch. Epochs are run until the error stops reducing.

In [ ]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(trainx.shape[1], 2)))
model.add(LSTM(128, input_shape=(trainx.shape[1], 2)))
model.add(Dense(2))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainx, trainy, epochs=2000, batch_size=10, verbose=2, shuffle=False)
model.save_weights('LSTMBasic1.h5')

In [407]:
model.load_weights('LSTMBasic1.h5')
predict = model.predict(testx)

Now let's see what our predictions look like.

In [408]:
plt.plot(testx.reshape(-1, 2)[:, 0:1], testx.reshape(-1, 2)[:, 1:2])
plt.plot(predict[:, 0:1], predict[:, 1:2])

Out[408]:
[<matplotlib.lines.Line2D at 0x1eac792f048>]

Now, we should try and model a sine or cosine wave in a similar fashion. You can run the code given below and play with the model parameters to see how the results change.

In [409]:
x = numpy.arange(1, 500, 1)
y = numpy.sin(x)
plt.plot(x, y)

Out[409]:
[<matplotlib.lines.Line2D at 0x1eac7a0b3c8>]

In [410]:
trainx, testx = x[0:int(0.8*(len(x)))], x[int(0.8*(len(x))):]
trainy, testy = y[0:int(0.8*(len(y)))], y[int(0.8*(len(y))):]
train = numpy.array(list(zip(trainx, trainy)))
test = numpy.array(list(zip(testx, testy)))

In [411]:
look_back = 1
trainx, trainy = create_dataset(train, look_back)
testx, testy = create_dataset(test, look_back)
trainx = numpy.reshape(trainx, (trainx.shape[0], 1, 2))
testx = numpy.reshape(testx, (testx.shape[0], 1, 2))

In [ ]:
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(trainx.shape[1], 2)))
model.add(LSTM(256, input_shape=(trainx.shape[1], 2)))
model.add(Dense(2))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainx, trainy, epochs=2000, batch_size=10, verbose=2, shuffle=False)
model.save_weights('LSTMBasic2.h5')

In [413]:
model.load_weights('LSTMBasic2.h5')
predict = model.predict(testx)

In [415]:
plt.plot(testx.reshape(-1, 2)[:, 0:1], testx.reshape(-1, 2)[:, 1:2])
plt.plot(predict[:, 0:1], predict[:, 1:2])

Out[415]:
[<matplotlib.lines.Line2D at 0x1eac7a1f550>]

Now you are ready to move on to any dataset.