Time Series – Useful Resources The following resources contain additional information on Time Series. Please use them to get more in-depth knowledge on this. Useful Links on Time Series − Wikipedia reference for Time Series
Time Series – Walk Forward Validation In time series modelling, predictions become less and less accurate the further ahead they are made, so a more realistic approach is to re-train the model with actual data as it becomes available for further predictions. Since training statistical models is not time-consuming, walk-forward validation is the preferred way to obtain the most accurate results. Let us apply one-step walk-forward validation on our data and compare it with the results we got earlier.

In [333]:
prediction = []
data = train.values
for t in test.values:
    model = ExponentialSmoothing(data).fit()
    y = model.forecast(1)            # one-step-ahead forecast
    prediction.append(y[0])
    data = numpy.append(data, t)     # fold the observed value back into the history

In [335]:
test_ = pandas.DataFrame(test)
test_["predictionswf"] = prediction

In [341]:
plt.plot(test_["T"])
plt.plot(test_.predictionswf, "--")
plt.show()

In [340]:
error = sqrt(metrics.mean_squared_error(test.values, prediction))
print("Test RMSE for Triple Exponential Smoothing with Walk-Forward Validation: ", error)

Test RMSE for Triple Exponential Smoothing with Walk-Forward Validation: 11.787532205759442

We can see that our model performs significantly better now. In fact, the trend is followed so closely that on the plot the predictions overlap with the actual values. You can try applying walk-forward validation to ARIMA models too, as sketched below.
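As a starting point for that exercise, here is a minimal sketch of the same loop with an ARIMA model in place of exponential smoothing. The import path assumes a recent version of statsmodels, and the order (5, 1, 0) is purely an illustrative guess, not a tuned choice.

from statsmodels.tsa.arima.model import ARIMA

prediction_arima = []
data = train.values
for t in test.values:
    # order=(5, 1, 0) is a hypothetical choice; calibrate p, d, q for your data
    model = ARIMA(data, order=(5, 1, 0)).fit()
    y = model.forecast(1)            # forecast one step ahead
    prediction_arima.append(y[0])
    data = numpy.append(data, t)     # append the actual observation to the history

error = sqrt(metrics.mean_squared_error(test.values, prediction_arima))
print("Test RMSE for ARIMA with Walk-Forward Validation: ", error)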
Time Series – Prophet Model In 2017, Facebook open sourced the Prophet model, which is capable of modelling time series with trend and strong multiple seasonalities at the day, week and year level. It has intuitive parameters that a not-so-expert data scientist can tune for better forecasts. At its core, it is an additive regression model which can detect change points to model the time series. Prophet decomposes the time series into components of trend $g_{t}$, seasonality $s_{t}$ and holidays $h_{t}$. $$y_{t}=g_{t}+s_{t}+h_{t}+\epsilon_{t}$$ where $\epsilon_{t}$ is the error term. Similar packages for time series forecasting, such as CausalImpact and AnomalyDetection, were introduced in R by Google and Twitter respectively.
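As a minimal sketch of how Prophet is used in practice, assuming a DataFrame df with a date column named 'ds' and a value column named 'y' (the input format the library expects):

from prophet import Prophet   # in older releases the package was named fbprophet

# df is assumed to hold two columns: 'ds' (timestamps) and 'y' (observed values)
model = Prophet()
model.fit(df)

# extend the frame 30 days into the future and forecast over it
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())

The forecast frame carries the point estimate yhat along with an uncertainty interval, which is often the practical reason for choosing Prophet over a hand-tuned statistical model.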
Time Series – Applications We discussed time series analysis in this tutorial, which has given us the understanding that time series models first recognize the trend and seasonality from the existing observations and then forecast a value based on them. Such analysis is useful in various fields, such as − Financial Analysis − It includes sales forecasting, inventory analysis, stock market analysis, and price estimation. Weather Analysis − It includes temperature estimation, climate change, seasonal shift recognition, and weather forecasting. Network Data Analysis − It includes network usage prediction, anomaly or intrusion detection, and predictive maintenance. Healthcare Analysis − It includes census prediction, insurance benefits prediction, and patient monitoring.
Machine Learning – Box and Whisker Plots A boxplot is a graphical representation of a dataset that displays the five-number summary of the data – the minimum value, the first quartile, the median, the third quartile, and the maximum value. The boxplot consists of a box with whiskers extending from the top and bottom of the box. The box represents the interquartile range (IQR) of the data, which is the range between the first and third quartiles. The whiskers extend from the top and bottom of the box to the highest and lowest values that are within 1.5 times the IQR. Any values that fall outside this range are considered outliers and are represented as points beyond the whiskers. Python Implementation of Box and Whisker Plots Now that we have a basic understanding of boxplots, let's implement them in Python. For our example, we will be using the Iris dataset from Sklearn, which contains measurements of the sepal length, sepal width, petal length, and petal width of 150 iris flowers belonging to three different species – Setosa, Versicolor, and Virginica. To start, we need to import the necessary libraries and load the dataset. Example

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
target = iris.target

Next, we can create a boxplot of the sepal length for each of the three iris species using the Seaborn library.

plt.figure(figsize=(7.5, 3.5))
sns.boxplot(x=target, y=data[:, 0])
plt.xlabel("Species")
plt.ylabel("Sepal Length (cm)")
plt.show()

Output This code will produce a boxplot of the sepal length for each of the three iris species, with the x-axis representing the species and the y-axis representing the sepal length in centimeters. From this boxplot, we can see that the setosa species has a shorter sepal length compared to the versicolor and virginica species, which have a similar median and range of sepal lengths. Additionally, we can see that there are no outliers in the setosa species, but there are a few outliers in the versicolor and virginica species.
Machine Learning – Histograms A histogram is a bar graph-like representation of the distribution of a variable. It shows the frequency of occurrences of each value of the variable. The x-axis represents the range of values of the variable, and the y-axis represents the frequency or count of each value. The height of each bar represents the number of data points that fall within that value range. Histograms are useful for identifying patterns in data, such as skewness, modality, and outliers. Skewness refers to the degree of asymmetry in the distribution of the variable. Modality refers to the number of peaks in the distribution. Outliers are data points that fall outside of the range of typical values for the variable. Python Implementation of Histograms Python provides several libraries for data visualization, such as Matplotlib, Seaborn, Plotly, and Bokeh. For the example given below, we will use Matplotlib to implement histograms. We will use the breast cancer dataset from the Sklearn library for this example. The breast cancer dataset contains information about the characteristics of breast cancer cells and whether they are malignant or benign. The dataset has 30 features and 569 samples. Example Let's start by importing the necessary libraries and loading the dataset −

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

Next, we will create a histogram of the mean radius feature of the dataset −

plt.figure(figsize=(7.2, 3.5))
plt.hist(data.data[:,0], bins=20)
plt.xlabel("Mean Radius")
plt.ylabel("Frequency")
plt.show()

In this code, we have used the hist() function from Matplotlib to create a histogram of the mean radius feature of the dataset. We have set the number of bins to 20 to divide the data range into 20 intervals. We have also added labels to the x and y axes using the xlabel() and ylabel() functions. Output The resulting histogram shows the distribution of mean radius values in the dataset. We can see that the data is roughly normally distributed, with a peak around 12-14. Histogram with Multiple Data Sets We can also create a histogram with multiple data sets to compare their distributions. Let's create histograms of the mean radius feature for both the malignant and benign samples − Example

plt.figure(figsize=(7.2, 3.5))
plt.hist(data.data[data.target==0,0], bins=20, alpha=0.5, label="Malignant")
plt.hist(data.data[data.target==1,0], bins=20, alpha=0.5, label="Benign")
plt.xlabel("Mean Radius")
plt.ylabel("Frequency")
plt.legend()
plt.show()

In this code, we have used the hist() function twice to create two histograms of the mean radius feature, one for the malignant samples and one for the benign samples. We have set the transparency of the bars to 0.5 using the alpha parameter so that they don't overlap completely. We have also added a legend to the plot using the legend() function. Output On executing this code, you will get the following plot as the output − The resulting histogram shows the distribution of mean radius values for both the malignant and benign samples. We can see that the distributions are different, with the malignant samples having a higher frequency of higher mean radius values.
Machine Learning – Reinforcement These methods are a bit different from the previously studied methods and are used less often. In this kind of learning algorithm, there is an agent that we want to train over a period of time so that it can interact with a specific environment. The agent follows a set of strategies for interacting with the environment, and after observing the environment it takes actions based on the current state of the environment. Here are the major steps involved in reinforcement learning methods (a minimal code sketch of this loop follows the list) − Step 1 − First, we need to prepare an agent with some initial set of strategies. Step 2 − Then observe the environment and its current state. Step 3 − Next, select the optimal policy for the current state of the environment and perform an appropriate action. Step 4 − Now, the agent receives a corresponding reward or penalty in accordance with the action taken in the previous step. Step 5 − Now, we can update the strategies if required. Step 6 − At last, repeat steps 2-5 until the agent learns and adopts the optimal policy. The following diagram shows what type of task is appropriate for various ML problems −
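To make the six steps above concrete, here is a minimal sketch of the loop using tabular Q-learning. The environment object env (with classic Gym-style reset() and step() methods), the table sizes, and the hyperparameters are all assumptions for illustration, not part of any particular library or problem.

import numpy as np

n_states, n_actions = 16, 4              # hypothetical sizes for a toy environment
Q = np.zeros((n_states, n_actions))      # Step 1: initial set of strategies
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate

for episode in range(500):               # Step 6: repeat until the policy is learned
    state = env.reset()                  # Step 2: observe the current state
                                         # (env is a hypothetical Gym-style environment)
    done = False
    while not done:
        # Step 3: select an action (epsilon-greedy around the current policy)
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        # Step 4: act and receive the corresponding reward or penalty
        next_state, reward, done, info = env.step(action)
        # Step 5: update the strategy using the observed reward
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state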
Machine Learning – Unsupervised What is Unsupervised Learning? In unsupervised machine learning algorithms, we do not have any supervisor to provide any sort of guidance. Unsupervised learning algorithms are handy in scenarios in which we do not have the liberty, as in supervised learning, of having pre-labeled training data, and we want to extract useful patterns from input data. Examples of unsupervised machine learning algorithms include K-means clustering, hierarchical clustering, and principal component analysis. In regression, we train the machine to predict a future value. In classification, we train the machine to classify an unknown object in one of the categories defined by us. In short, we have been training machines so that they can predict Y for our data X. Given a huge data set with no category labels, it would be difficult for us to train the machine using supervised learning. What if the machine could look through and analyze big data running into several gigabytes and terabytes, and tell us that this data contains so many distinct categories? As an example, consider the voters' data. By considering some inputs from each voter (these are called features in AI terminology), let the machine predict that there are so many voters who would vote for political party X and so many who would vote for Y, and so on. Thus, in general, we are asking the machine, given a huge set of data points X, "What can you tell me about X?". Or it may be a question like "What are the five best groups we can make out of X?". Or it could even be "What three features occur together most frequently in X?". This is exactly what unsupervised learning is all about. Algorithms for Unsupervised Learning Let us now discuss one of the widely used algorithms for clustering in unsupervised machine learning. k-means clustering The 2000 and 2004 Presidential elections in the United States were close — very close. The largest percentage of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a small percentage of voters had switched sides, the outcome of the elections would have been different. There are small groups of voters who, when properly appealed to, will switch sides. These groups may not be huge, but with such close races, they may be big enough to change the outcome of the election. How do you find these groups of people? How do you appeal to them with a limited budget? The answer is clustering. Let us understand how it is done. First, you collect information on people, either with or without their consent: any sort of information that might give some clue about what is important to them and what will influence how they vote. Then you put this information into some sort of clustering algorithm. Next, for each cluster (it would be smart to choose the largest one first) you craft a message that will appeal to these voters. Finally, you deliver the campaign and measure to see if it is working. Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It is like automatic classification. You can cluster almost anything, and the more similar the items are within a cluster, the better the clusters are. In this chapter, we are going to study one type of clustering algorithm called k-means. It is called k-means because it finds 'k' unique clusters, and the center of each cluster is the mean of the values in that cluster. Cluster Identification Cluster identification tells an algorithm, "Here's some data.
Now group similar things together and tell me about those groups." The key difference from classification is that in classification you know what you are looking for, while that is not the case in clustering. Clustering is sometimes called unsupervised classification because it produces the same result as classification does, but without having predefined classes. Based on the ML tasks, unsupervised learning algorithms can be divided into the following broad classes: Clustering, Association, Dimensionality Reduction, and Anomaly Detection. Clustering Clustering methods are among the most useful unsupervised ML methods. These algorithms are used to find similarity and relationship patterns among data samples, and then cluster those samples into groups having similarity based on features. A real-world example of clustering is grouping customers by their purchasing behavior (a minimal k-means sketch appears at the end of this section). Association Another useful unsupervised ML method is Association, which is basically used to analyze large datasets to find patterns that represent interesting relationships between various items. It is also termed Association Rule Mining or Market Basket Analysis, and is mainly used to analyze customer shopping patterns. Dimensionality Reduction As the name suggests, this unsupervised ML method is used to reduce the number of feature variables for each data sample by selecting a set of principal or representative features. A question that arises here is: why do we need to reduce the dimensionality? The reason is the problem of feature-space complexity, which arises when we start analyzing and extracting millions of features from data samples. This problem is generally referred to as the "curse of dimensionality". PCA (Principal Component Analysis) and linear discriminant analysis are some of the popular algorithms for this purpose. Anomaly Detection This unsupervised ML method is used to detect rare events or observations that deviate markedly from the rest of the data. Using the learned knowledge, anomaly detection methods are able to differentiate between anomalous and normal data points. Some unsupervised approaches, like clustering and nearest-neighbor methods, can detect anomalies based on the data and its features.
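As a minimal sketch of k-means in practice, here is scikit-learn's implementation run on random two-dimensional data standing in for real voter features; the synthetic data and the choice of five clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))         # 300 hypothetical samples, 2 features each

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)        # the mean of each of the 5 clusters
print(kmeans.labels_[:10])            # cluster assignment of the first 10 samples

The cluster centers are literally the means that give the algorithm its name; each sample is assigned to whichever center it is closest to.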
Time Series – Moving Average For a stationary time series, a moving average model sees the value of a variable at time 't' as a linear function of residual errors from 'q' time steps preceding it. The residual error is calculated by comparing the value at time 't' to the moving average of the values preceding it. Mathematically it can be written as − $$y_{t} = c + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \theta_{2}\epsilon_{t-2} + \dots + \theta_{q}\epsilon_{t-q}$$ where 'q' is the order of the moving-average model, $\epsilon_{t}$ is white noise, and $\epsilon_{t-1}, \epsilon_{t-2}, \dots, \epsilon_{t-q}$ are the error terms at previous time periods. The value of 'q' can be calibrated using various methods; correlation plots are a common choice (note that for a pure MA(q) process it is the auto-correlation that cuts off after lag 'q', while the partial auto-correlation is the analogous tool for choosing the order of auto-regressive models). A partial auto-correlation plot shows the relation of a variable with itself at prior time steps with indirect correlations removed, unlike the auto-correlation plot which shows direct as well as indirect correlations. Let us see what it looks like for the 'temperature' variable of our data. Showing PACF

In [143]:
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(train, lags = 100)
plt.show()

A partial auto-correlation plot is read in the same way as a correlogram.
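Once a value of 'q' has been chosen, a pure MA(q) model can be fitted directly. Here is a minimal sketch; q = 2 is purely illustrative, and the import path assumes a recent version of statsmodels.

from statsmodels.tsa.arima.model import ARIMA

# an ARIMA(p=0, d=0, q) model is exactly the MA(q) model defined above
ma_model = ARIMA(train, order=(0, 0, 2)).fit()
print(ma_model.summary())   # the theta coefficients appear as ma.L1, ma.L2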
Time Series Tutorial A time series is a sequence of observations over a certain period. The simplest example of a time series that all of us come across on a day-to-day basis is the change in temperature throughout the day, week, month or year. The analysis of temporal data is capable of giving us useful insights into how a variable changes over time. This tutorial will teach you how to analyze and forecast time series data with the help of various statistical and machine learning models, in an elaborate and easy-to-understand way! Audience This tutorial is for the inquisitive minds who are looking to understand time series and time series forecasting models from scratch. At the end of this tutorial you will have a good understanding of time series modelling. Prerequisites This tutorial only assumes a preliminary understanding of the Python language. Although this tutorial is self-contained, it will be useful if you have an understanding of statistical mathematics. If you are new to either Python or statistics, we suggest you pick up a tutorial on these subjects first before you embark on your journey with Time Series.