Ask Analytics: Time Series Forecasting

ARIMA using SAS

We have covered basics about time series and also the basic methods of forecasting. It is time to learn the most important and most widely used for time series forecasting : ARIMA.

It is not possible to write ARIMA in a single stretch, it being full of complication, hence we plan to write it in series of article.
Related articles :

Time Series Forecasting - Part 1 (Overview)

Time Series Forecasting - Part 2 (SES)

Time Series Forecasting - Part 3 (DES)

Time Series Forecasting - Part 4 (TES)

Let's now learn ARIMA

ARIMA is an acronym for Auto Regressive Integrated Moving Average. It is one of the most popular methods used for time series forecasting. ARIMA, also known as Box-Jenkins, is mainly used for short term forecasting. It works best when the data exhibits consistent pattern over time, with a minimum amount of outliers.

Let's directly jump to "How to do ARIMA modeling?" and we would cover the important terminologies there and then only.

Starting with:

Step 0 : Plot and feel the series

proc Sgplot data = masterdata;

Series x = date y = Log_AIR;

run;

Quit;

It is clear from the chart that the series is having strong seasonality and trend.

Step 1 : Check Series Volatility

Let's now check whether the series is volatile or not :

As the data is diverging (Fan Shaped), it is volatile. It is required that we should make the series non-volatile for ARIMA procedure and to reduce its volatility, we transform it.

Step 2 : Stationarity Check

Once again we plot the series (transformed) and do the stationarity check.

What is stationary series ?

A time series is called stationary whose mean and variance is constant over a period of time, in other and precise words mean and variance is not be a function of time.

Now while a series is having a constantly increasing trend, if we take mean across two time zones, definitely the mean won't be same and hence it won't be stationary.

Example :

Hope the concept of stationarity in the time series is clear. Let's see how to do the same in SAS for our case.

Ideally for ARIMA, series should be non-stationary, however for procedural purpose, we need to make it stationary. A stationary series is generally is not very interesting.
Basically we capture the trend and seasonality while making a non-stationary series stationary in ARIMA process, while theoretically it is said that we remove the trend and seasonality.

Apart from visual check, there is more robust statistical way method to check the stationarity. We can use Augmented Dickey Fuller Test (recommended):

PROC ARIMA DATA= Masterdata;

IDENTIFY VAR = log_Air STATIONARITY= (ADF) ;

RUN;

Quit;

Here we have done 1 and 12 order differencing simultaneously, you can can do it one by one and check the results for better understanding.

Perform differencing and checking non-stationarity again:

PROC ARIMA DATA= masterdata ;

IDENTIFY VAR = Log_Air (1,12) STATIONARITY= (ADF) ;

RUN;
Quit;

You can try differencing the series in MS Excel for better understanding : First subtract first lag from each observation and plot it and then with the new series subtract 12th lag from each observation. You will get similar output as shown above.

All right! Now while series has been made stationary. It is now time to identify the potential order of the model.
Order of the ARIMA model is denoted by ARIMA (p, d, q) or AR(p), I(d), MA (q)

What are these p, d ,q ?

p : denoted the auto-regression components, which means relation of an observation with lagged individual observation. p also stands for Partial Auto Correlation Function (PACF).

d : order of differencing to make the data stationary. We have already dealt with this part.

q : denoted the moving average components, which means relation of an observation with lagged multiple observations vector. q also stands for Auto Correlation Function (ACF).

Step 3 : Potential Model Identification

Well, potential models can be identified using following methods :

1. Using ACF and PACF function plots visually
2. Using Minimum Information Criteria (Recommended)
'
Let's also break the data into two part here itself : Training and Validation
We will train out model on Training part and will check the goodness of model on another

Data Training validation;

set masterdata;

if date >= '01Jan1960'd then output validation;

else output Training;

Run;

/* Now run following for checking ACF and PACF plots */

PROC ARIMA DATA= Training;

IDENTIFY VAR = Log_Air (1,12);

RUN;Quit;

It is advised to use the second method for model selection process, though:

ods output MINIC = minic_data;

PROC ARIMA DATA= training;

IDENTIFY VAR = Log_Air(1,12)MINIC;

RUN;quit;

Here is the macro that you can use for doing model iterations :

%Macro top_models;

%do p = 0 %to 3 ;

%do q = 0 % to 3 ;

PROC ARIMA DATA= training ;

IDENTIFY VAR = Log_Air(1,12) ;

ESTIMATE P = &p. Q =&q. OUTSTAT= stats_&p._&q. ;

Forecast lead=12 interval = month id = date

out = result_&p._&q.;

RUN;

Quit;

data stats_&p._&q.;

set stats_&p._&q.;

p = &p.;

q = &q.;

Run;

data result_&p._&q.;

set result_&p._&q.;

p = &p.;

q = &q.;

Run;

%end;

Data final_stats ;

set %do p = 0 %to 3 ;

%do q = 0 % to 3 ;

stats_&p._&q.

%end;

%end;;
if p = 0 and q = 0 then delete;

Run;

Data final_results ;

set %do p = 0 %to 3 ;

%do q = 0 % to 3 ;

result_&p._&q.

%end;

%end;;
if p = 0 and q = 0 then delete;

Run;

%Mend;

%top_models

Thus we get the result of all the top 15 (except 0,0) models, from which we choose the top 5-7 models based on minimum average value of AIC and SBC (AIC and SBC are both errors).

Proc SQL outobs = 7;

create table final_stats_1 as select p,q, sum(_VALUE_)/2 as mean_aic_sbc from final_stats

where _STAT_ in ('AIC','SBC')

group by p,q

order by mean_aic_sbc;

Quit;

/* Finally we calculate MAPE value for top 7 models only to decide the final model

MAPE = Abs(Actual – Predicted) / Actual *100 */

Proc SQL;

create table final_results_1 as select a.p, a.q, a.date,a.forecast, b.log_air

from final_results as a join validation as b

on a.date = b.date right join final_stats_1 as c

on a.p = c.p

and a.q = c.q;

quit;

Data Mape;

set final_results_1 ;

Ind_Mape = abs(log_air - forecast)/ log_air;

Run;

Proc Sql;

create table mape as select p, q, mean(ind_mape) as mape from mape

group by p, q

order by mape ;

quit;

And finally we arrive on the final model with p = 0 and q = 3 as final model and with 1 % as final MAPE value. MAPE should be less than equal to 5% to be acceptable.

I this blog I have avoided few theoretical points and terms such as unit root, white noise etc. as there terms are important to know theoretically and for interview point of view but not much for procedural purpose. I don't want to confuse you unnecessarily but making article too heavy to digest!!!

Humble appeal: