Time Series – Applications

We discussed time series analysis in this tutorial and saw that time series models first recognize the trend and seasonality in the existing observations and then forecast a value based on that trend and seasonality. Such analysis is useful in various fields, such as −

Financial Analysis − It includes sales forecasting, inventory analysis, stock market analysis, price estimation.

Weather Analysis − It includes temperature estimation, climate change, seasonal shift recognition, weather forecasting.

Network Data Analysis − It includes network usage prediction, anomaly or intrusion detection, predictive maintenance.

Healthcare Analysis − It includes census prediction, insurance benefits prediction, patient monitoring.
Time Series – Moving Average

For a stationary time series, a moving average model sees the value of a variable at time ‘t’ as a linear function of the residual errors from the ‘q’ time steps preceding it. The residual error is calculated by comparing the value at time ‘t’ to the moving average of the preceding values.

Mathematically it can be written as −

$$y_{t} = c + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \theta_{2}\epsilon_{t-2} + \dots + \theta_{q}\epsilon_{t-q}$$

Where ‘q’ is the moving-average trend parameter, $\epsilon_{t}$ is white noise, and $\epsilon_{t-1}, \epsilon_{t-2}, \dots, \epsilon_{t-q}$ are the error terms at previous time periods.

The value of ‘q’ can be calibrated using various methods. One way of finding a suitable value of ‘q’ is plotting the partial auto-correlation plot. A partial auto-correlation plot shows the relation of a variable with itself at prior time steps with indirect correlations removed, unlike the auto-correlation plot, which shows direct as well as indirect correlations. Note that in the standard Box-Jenkins convention the auto-correlation plot is the usual guide for ‘q’, while the partial auto-correlation plot is the usual guide for the auto-regressive parameter ‘p’, so it is worth reading both plots together. Let’s see how the partial auto-correlation plot looks for the ‘temperature’ variable of our data.

Showing PACP

In [143]:
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_pacf

    # Partial auto-correlation of the training series for the first 100 lags
    plot_pacf(train, lags=100)
    plt.show()

A partial auto-correlation plot is read in the same way as a correlogram.
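To make the moving average equation concrete, here is a minimal sketch that evaluates it by hand. The function names `ma_value` and `simulate_ma`, the coefficients, and the noise settings are all illustrative assumptions, not part of the tutorial's dataset or of any library.

```python
import random

def ma_value(c, thetas, eps_t, past_eps):
    """Evaluate y_t = c + e_t + theta_1*e_{t-1} + ... + theta_q*e_{t-q}.

    past_eps[0] is e_{t-1}, past_eps[1] is e_{t-2}, and so on.
    """
    return c + eps_t + sum(theta * e for theta, e in zip(thetas, past_eps))

def simulate_ma(c, thetas, n, seed=0):
    """Generate n values of an MA(q) process driven by Gaussian white noise."""
    rng = random.Random(seed)
    q = len(thetas)
    # Draw q extra noise terms so the first value already has q past errors.
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + q)]
    series = []
    for t in range(q, n + q):
        past = [eps[t - i] for i in range(1, q + 1)]
        series.append(ma_value(c, thetas, eps[t], past))
    return series

# Hypothetical MA(2) process with c = 10, theta_1 = 0.5, theta_2 = 0.25
sample = simulate_ma(10.0, [0.5, 0.25], n=5)
print(sample)
```

Each simulated value depends only on the current noise term and the ‘q’ noise terms before it, which is why the influence of a shock disappears after ‘q’ steps.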
Time Series – Naive Methods

Introduction

Naive methods, such as assuming the predicted value at time ‘t’ to be the actual value of the variable at time ‘t-1’, or the rolling mean of the series, are used as baselines to weigh how well statistical and machine learning models perform and to show whether they are needed.

In this chapter, let us try these methods on one of the features of our time-series data.

First we shall see the mean of the ‘temperature’ feature of our data and the deviation around it. It is also useful to see the maximum and minimum temperature values. We can use the functionalities of the numpy library here.

Showing statistics

In [135]:
    import numpy

    print(
        "Mean: ", numpy.mean(df["T"]),
        "\nStandard Deviation: ", numpy.std(df["T"]),
        "\nMaximum Temperature: ", max(df["T"]),
        "\nMinimum Temperature: ", min(df["T"]),
    )

We now have the statistics for all 9357 observations, taken at equally spaced points in time, which help us understand the data.

Now we will try the first naive method, setting the predicted value at the present time equal to the actual value at the previous time, and calculate the root mean squared error (RMSE) to quantify the performance of this method.

Showing 1st naïve method

In [136]:
    # Shift the series by one step so that each row holds the previous value
    df["T_t-1"] = df["T"].shift(1)

In [137]:
    # Drop the first row, which has no previous value
    df_naive = df[["T", "T_t-1"]][1:]

In [138]:
    from sklearn import metrics
    from math import sqrt

    true = df_naive["T"]
    prediction = df_naive["T_t-1"]
    error = sqrt(metrics.mean_squared_error(true, prediction))
    print("RMSE for Naive Method 1: ", error)

RMSE for Naive Method 1: 12.901140576492974

Let us see the next naive method, where the predicted value at the present time is equated to the mean of the time periods preceding it. We will calculate the RMSE for this method too.
Showing 2nd naive method

In [139]:
    # Mean of the three preceding values, shifted so that it excludes the current one
    df["T_rm"] = df["T"].rolling(3).mean().shift(1)
    df_naive = df[["T", "T_rm"]].dropna()

In [140]:
    true = df_naive["T"]
    prediction = df_naive["T_rm"]
    error = sqrt(metrics.mean_squared_error(true, prediction))
    print("RMSE for Naive Method 2: ", error)

RMSE for Naive Method 2: 14.957633272839242

Here, you can experiment with the number of previous time periods, also called ‘lags’, that you want to consider; it is kept as 3 here. In this data, the error increases as you increase the number of lags. If the lag is kept at 1, the method becomes the same as the naïve method used earlier.

Points to Note

You can write a very simple function for calculating root mean squared error. Here, we have used the mean squared error function from the package ‘sklearn’ and then taken its square root.

In pandas, df[‘column_name’] can also be written as df.column_name. However, for this dataset df.T will not work the same as df[‘T’], because df.T is the attribute for transposing a dataframe. So use only df[‘T’], or consider renaming this column before using the other syntax.
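As a sketch of the "very simple function" for RMSE mentioned above, the following block implements it with the standard library only and applies it to both naive methods. The helper names and the short `temps` list are made up for illustration; they are not the tutorial's data.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

def persistence_forecast(series):
    """Naive method 1: the forecast for time t is the actual value at t-1."""
    return series[:-1]

def rolling_mean_forecast(series, lags):
    """Naive method 2: the forecast for time t is the mean of the `lags` values before t."""
    return [sum(series[t - lags:t]) / lags for t in range(lags, len(series))]

temps = [11.0, 12.0, 14.0, 13.0, 15.0, 16.0]  # made-up readings

# The first forecastable point is index 1 for method 1 and index `lags` for method 2.
print("Naive method 1:", rmse(temps[1:], persistence_forecast(temps)))
print("Naive method 2:", rmse(temps[3:], rolling_mean_forecast(temps, 3)))
```

With `lags=1` the rolling mean forecast returns exactly the same values as the persistence forecast, which matches the remark that a lag of 1 reduces to the first naive method.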
Time Series – ARIMA

We have already understood that for a stationary time series a variable at time ‘t’ can be modelled as a linear function of prior observations or of residual errors. Hence it is time for us to combine the two and have an auto-regressive moving average (ARMA) model.

However, at times the time series is not stationary, i.e. the statistical properties of the series, such as its mean and variance, change over time. The statistical models we have studied so far assume the time series to be stationary; therefore, we can include a pre-processing step of differencing the time series to make it stationary. Now, it is important for us to find out whether the time series we are dealing with is stationary or not.

Various methods to check the stationarity of a time series are looking for seasonality or trend in the plot of the time series, checking the difference in mean and variance for various time periods, the Augmented Dickey-Fuller (ADF) test, the KPSS test, Hurst’s exponent etc.

Let us see whether the ‘temperature’ variable of our dataset is a stationary time series or not using the ADF test.

In [74]:
    from statsmodels.tsa.stattools import adfuller

    result = adfuller(train)
    print("ADF Statistic: %f" % result[0])
    print("p-value: %f" % result[1])
    print("Critical Values:")
    # result[4] maps each significance level to its critical value
    for key, value in result[4].items():
        print("\t%s: %.3f" % (key, value))

ADF Statistic: -10.406056
p-value: 0.000000
Critical Values:
	1%: -3.431
	5%: -2.862
	10%: -2.567

Now that we have run the ADF test, let us interpret the result. First we compare the ADF statistic with the critical values: an ADF statistic greater than the critical values tells us the series is most likely non-stationary. Next, we look at the p-value. A p-value greater than 0.05 also suggests that the time series is non-stationary.

Alternatively, a p-value less than or equal to 0.05, or an ADF statistic less than the critical values, suggests that the time series is stationary.

Here the ADF statistic (-10.406) is well below all three critical values and the p-value is effectively zero. Hence, the time series we are dealing with is already stationary. In the case of a stationary time series, we set the ‘d’ parameter to 0.
We can also confirm the stationarity of the time series using the Hurst exponent.

In [75]:
    import hurst

    H, c, data = hurst.compute_Hc(train)
    print("H = {:.4f}, c = {:.4f}".format(H, c))

H = 0.1660, c = 5.0740

A value of H < 0.5 shows anti-persistent behavior, and H > 0.5 shows persistent behavior or a trending series. H = 0.5 shows random walk/Brownian motion. Here H < 0.5, which is consistent with our series being stationary.

For a non-stationary time series, we set the ‘d’ parameter to 1. Also, the values of the auto-regressive trend parameter ‘p’ and the moving average trend parameter ‘q’ are calculated on the stationary time series, i.e. by plotting the ACP and PACP after differencing the time series.

The ARIMA model, which is characterized by 3 parameters (p, d, q), is now clear to us, so let us model our time series and predict the future values of temperature.

In [156]:
    # Current import path; the older statsmodels.tsa.arima_model module is not available in recent statsmodels releases
    from statsmodels.tsa.arima.model import ARIMA

    # order=(p, d, q); d=0 because the series is already stationary
    model = ARIMA(train.values, order=(5, 0, 2))
    model_fit = model.fit()

In [157]:
    import pandas

    # Out-of-sample forecast, one value per test observation
    predictions = model_fit.forecast(steps=len(test))
    test_ = pandas.DataFrame(test)
    test_["predictions"] = predictions

In [158]:
    plt.plot(df["T"])
    plt.plot(test_.predictions)
    plt.show()

In [167]:
    error = sqrt(metrics.mean_squared_error(test.values, predictions))
    print("Test RMSE for ARIMA: ", error)

Test RMSE for ARIMA: 43.21252940234892

The RMSE shown is the value reported for the original run of this tutorial; the figure you obtain may differ.
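The differencing step behind the ‘d’ parameter can be sketched in a few lines. This is a hand-written illustration, not the routine statsmodels uses internally; the function names and the small example series are assumptions. ‘d’ is the number of times the differencing is applied.

```python
def difference(series, lag=1):
    """Return the differenced series: value at t minus value at t-lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def undo_difference(first_values, diffs, lag=1):
    """Rebuild the original series from its first `lag` values and its differences."""
    restored = list(first_values)
    for d in diffs:
        restored.append(restored[-lag] + d)
    return restored

series = [3, 5, 8, 12, 17]      # made-up series with a growing trend
once = difference(series)       # corresponds to d = 1
twice = difference(once)        # corresponds to d = 2

print(once)
print(twice)
print(undo_difference(series[:1], once))
```

Forecasts made on a differenced series have to be passed through the inverse step to get back to the original scale, which is what `undo_difference` illustrates.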
Time Series Tutorial

A time series is a sequence of observations over a certain period. The simplest example of a time series that all of us come across on a day-to-day basis is the change in temperature throughout the day, week, month or year. The analysis of temporal data can give us useful insights into how a variable changes over time.

This tutorial will teach you how to analyze and forecast time series data with the help of various statistical and machine learning models in an elaborate and easy-to-understand way!

Audience

This tutorial is for inquisitive minds who are looking to understand time series and time series forecasting models from scratch. At the end of this tutorial you will have a good understanding of time series modelling.

Prerequisites

This tutorial only assumes a preliminary understanding of the Python language. Although this tutorial is self-contained, it will be useful if you have an understanding of statistical mathematics.

If you are new to either Python or statistics, we suggest you pick up a tutorial on these subjects first, before you embark on your journey with time series.
Time Series – Further Scope

Machine learning deals with various kinds of problems. In fact, almost all fields have scope to be automated or improved with the help of machine learning. A few such problems on which a great deal of work is being done are given below.

Time Series Data

This is data which changes according to time, and hence time plays a crucial role in it. It is what we largely discussed in this tutorial.

Non-Time Series Data

This is data independent of time, and a major percentage of ML problems are on non-time series data. For simplicity, we shall categorize it further as −

Numerical Data − Computers, unlike humans, only understand numbers, so all kinds of data are ultimately converted to numerical data for machine learning. For example, image data is converted to (r, g, b) values, characters are converted to ASCII codes or words are indexed to numbers, and speech data is converted to MFCC features containing numerical data.

Image Data − Computer vision has revolutionized the world of computers; it has various applications in fields such as medicine and satellite imaging.

Text Data − Natural Language Processing (NLP) is used for text classification, paraphrase detection and text summarization. This is what makes Google and Facebook smart.

Speech Data − Speech processing involves speech recognition and sentiment understanding. It plays a crucial role in giving computers human-like qualities.
Time Series – Modeling

Introduction

A time series has 4 components as given below −

Level − It is the mean value around which the series varies.

Trend − It is the increasing or decreasing behavior of a variable with time.

Seasonality − It is the cyclic behavior of the time series.

Noise − It is the error in the observations added due to environmental factors.

Time Series Modeling Techniques

To capture these components, there are a number of popular time series modelling techniques. This section gives a brief introduction to each technique; we will discuss them in detail in the upcoming chapters −

Naïve Methods

These are simple estimation techniques, in which the predicted value is set equal to the mean of the preceding values of the time-dependent variable, or to the previous actual value. These are used for comparison with sophisticated modelling techniques.

Auto Regression

Auto regression predicts the values of future time periods as a function of values at previous time periods. Predictions of auto regression may fit the data better than those of naïve methods, but it may not be able to account for seasonality.

ARIMA Model

An auto-regressive integrated moving-average (ARIMA) model expresses the value of a variable as a linear function of previous values and residual errors at previous time steps of a stationary time series. However, real-world data may be non-stationary and have seasonality, thus Seasonal-ARIMA and Fractional-ARIMA were developed. ARIMA works on univariate time series; to handle multiple variables, VARIMA was introduced.

Exponential Smoothing

It models the value of a variable as an exponentially weighted linear function of previous values. This statistical model can handle trend and seasonality as well.

LSTM

The Long Short-Term Memory (LSTM) model is a recurrent neural network which is used for time series to account for long-term dependencies. It can be trained with a large amount of data to capture the trends in multi-variate time series.
The modelling techniques described above are used for time series regression. In the coming chapters, let us explore them one by one.
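To see how the four components (level, trend, seasonality and noise) combine, the sketch below builds a synthetic series from them. It assumes the components add together, which is only one possible choice, and every parameter value and name here is made up for illustration.

```python
import math
import random

def synthetic_series(n, level=10.0, slope=0.05, amplitude=2.0,
                     period=24, noise_sd=0.5, seed=0):
    """Build n points as level + trend + seasonality + noise (additive assumption)."""
    rng = random.Random(seed)
    series = []
    for t in range(n):
        trend = slope * t                                          # steady increase over time
        seasonal = amplitude * math.sin(2 * math.pi * t / period)  # repeats every `period` steps
        noise = rng.gauss(0.0, noise_sd)                           # random disturbance
        series.append(level + trend + seasonal + noise)
    return series

# Two hypothetical "days" of hourly readings
values = synthetic_series(48)
print(values[:5])
```

Setting `amplitude` or `noise_sd` to zero switches the corresponding component off, which is a convenient way to check how well a modelling technique recovers each part.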
Time Series – Auto Regression

For a stationary time series, an auto regression model sees the value of a variable at time ‘t’ as a linear function of the values at the ‘p’ time steps preceding it. Mathematically it can be written as −

$$y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \epsilon_{t}$$

Where ‘p’ is the auto-regressive trend parameter, $\epsilon_{t}$ is white noise, and $y_{t-1}, y_{t-2}, \dots, y_{t-p}$ denote the values of the variable at previous time periods.

The value of ‘p’ can be calibrated using various methods. One way of finding a suitable value of ‘p’ is plotting the auto-correlation plot. Note that in the standard Box-Jenkins convention the partial auto-correlation plot is the usual guide for ‘p’, so it is worth reading both plots together.

Note − We should separate the data into train and test sets at an 8:2 ratio of the total data available prior to doing any analysis on the data, because the test data is only used to find out the accuracy of our model, and the assumption is that it is not available to us until after predictions have been made. In the case of time series, the sequence of data points is essential, so one should take care not to lose the order while splitting the data.

An auto-correlation plot or correlogram shows the relation of a variable with itself at prior time steps. It makes use of Pearson’s correlation, and the shaded region of the plot marks the 95% confidence interval. Let’s see how it looks for the ‘temperature’ variable of our data.

Showing ACP

In [141]:
    # Keep the first 80% of the observations for training and the last 20% for testing
    split = len(df) - int(0.2 * len(df))
    train, test = df["T"][0:split], df["T"][split:]

In [142]:
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf

    plot_acf(train, lags=100)
    plt.show()

All the lag values lying outside the shaded blue region are considered to have a statistically significant correlation.
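The order-preserving split and the quantity drawn in a correlogram can both be sketched without any library. The estimator below is the common sample auto-correlation (covariance at a given lag divided by the overall variance); whether it matches a particular plotting routine to the last decimal is an assumption, and the function names are illustrative.

```python
def ordered_split(series, test_fraction=0.2):
    """Split a series into train and test parts without shuffling."""
    split = len(series) - int(test_fraction * len(series))
    return series[:split], series[split:]

def autocorrelation(series, lag):
    """Sample auto-correlation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[t] - mean) * (series[t - lag] - mean)
                     for t in range(lag, n))
    return covariance / variance

readings = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # made-up alternating series

train_part, test_part = ordered_split(list(range(10)))
print(train_part, test_part)
print([round(autocorrelation(readings, k), 3) for k in range(3)])
```

An alternating series like `readings` is strongly negatively correlated with itself at lag 1 and positively at lag 2, which is the kind of pattern the correlogram makes visible.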
Time Series – Exponential Smoothing

In this chapter, we will talk about the techniques involved in exponential smoothing of time series.

Simple Exponential Smoothing

Exponential smoothing is a technique for smoothing a univariate time series by assigning exponentially decreasing weights to data over a time period.

Mathematically, the value of the variable at time ‘t+1’ given the values up to time ‘t’, $y_{t+1|t}$, is defined as −

$$y_{t+1|t} = \alpha y_{t} + \alpha(1-\alpha)y_{t-1} + \alpha(1-\alpha)^{2}y_{t-2} + \dots + \alpha(1-\alpha)^{t-1}y_{1}$$

where $0 \leq \alpha \leq 1$ is the smoothing parameter, and $y_{1}, \dots, y_{t}$ are the previous values of the variable at times 1, 2, 3, …, t.

This is a simple method to model a time series with no clear trend or seasonality. But exponential smoothing can also be used for time series with trend and seasonality.

Triple Exponential Smoothing

Triple Exponential Smoothing (TES), or the Holt-Winters method, applies exponential smoothing three times: level smoothing $l_{t}$, trend smoothing $b_{t}$, and seasonal smoothing $S_{t}$, with $\alpha$, $\beta^{*}$ and $\gamma$ as smoothing parameters and ‘m’ as the frequency of the seasonality, i.e. the number of seasons in a year.

According to the nature of the seasonal component, TES has two categories −

Holt-Winters Additive Method − When the seasonality is additive in nature.

Holt-Winters Multiplicative Method − When the seasonality is multiplicative in nature.

For non-seasonal time series, we only have trend smoothing and level smoothing, which is called Holt’s Linear Trend Method.

Let’s try applying triple exponential smoothing on our data.
In [316]:
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # The value passed to trend is missing in the source; statsmodels accepts "add", "mul" or None here.
    # A seasonal component additionally needs the seasonal and seasonal_periods arguments.
    model = ExponentialSmoothing(train.values, trend= )
    model_fit = model.fit()

In [322]:
    # Out-of-sample forecast, one value per test observation
    predictions_ = model_fit.forecast(len(test))

In [325]:
    plt.plot(test.values)
    plt.plot(predictions_)
    plt.show()

Here, we have trained the model once with the training set and then we keep on making predictions. A more realistic approach is to re-train the model after one or more time steps. As we get the prediction for time ‘t+1’ from training data until time ‘t’, the next prediction for time ‘t+2’ can be made using the training data until time ‘t+1’, as the actual value at ‘t+1’ will be known by then. This methodology of making predictions for one or more future steps and then re-training the model is called rolling forecast or walk forward validation.
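The rolling forecast described above can be sketched with simple exponential smoothing as the model, using the weighted-sum definition from this chapter. This is a hand-written illustration under stated assumptions: the function names, the value of alpha and the toy data are made up, and with a statsmodels model the re-fitting call would sit inside the loop instead.

```python
from math import sqrt

def ses_forecast(history, alpha):
    """One-step forecast: alpha*y_t + alpha*(1-alpha)*y_{t-1} + alpha*(1-alpha)^2*y_{t-2} + ..."""
    return sum(alpha * (1 - alpha) ** j * y
               for j, y in enumerate(reversed(history)))

def walk_forward(train, test, alpha):
    """Forecast one step at a time, adding each actual value to the history afterwards."""
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(ses_forecast(history, alpha))
        history.append(actual)  # the true value is known before the next forecast
    return predictions

train_toy = [20.0, 21.0, 19.0, 22.0, 23.0]  # made-up readings
test_toy = [22.0, 24.0, 23.0]

preds = walk_forward(train_toy, test_toy, alpha=0.5)
error = sqrt(sum((a - p) ** 2 for a, p in zip(test_toy, preds)) / len(test_toy))
print(preds)
print("Walk-forward RMSE:", error)
```

Because every forecast only uses values that would already have been observed at that point, the test data never leaks into the predictions made for it.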
Time Series – Programming Languages

A basic understanding of a programming language is essential for a user to work with or develop machine learning solutions. A list of preferred programming languages for anyone who wants to work on machine learning is given below −

Python

It is a high-level interpreted programming language that is fast and easy to code in. Python can follow either procedural or object-oriented programming paradigms. The presence of a variety of libraries makes the implementation of complicated procedures simpler. In this tutorial, we will be coding in Python, and the corresponding libraries useful for time series modelling will be discussed in the upcoming chapters.

R

Similar to Python, R is an interpreted multi-paradigm language, which supports statistical computing and graphics. The variety of packages makes it easier to implement machine learning modelling in R.

Java

It is an object-oriented programming language whose code is compiled to bytecode and run on the Java Virtual Machine. It is widely known for its large range of available packages and sophisticated data visualization techniques.

C/C++

These are compiled languages, and two of the oldest programming languages still in wide use. They are often preferred for incorporating ML capabilities into already existing applications, as they allow you to customize the implementation of ML algorithms easily.

MATLAB

MATrix LABoratory is a multi-paradigm language which provides functions for working with matrices. It allows mathematical operations for complex problems. It is primarily used for numerical operations, but some packages also allow graphical multi-domain simulation and model-based design.

Other preferred programming languages for machine learning problems include JavaScript, LISP, Prolog, SQL, Scala, Julia, SAS etc.