ARIMA using SAS
We have covered basics about time series and also the basic methods of forecasting. It is time to learn the most important and most widely used for time series forecasting : ARIMA.
It is not possible to write ARIMA in a single stretch, it being full of complication, hence we plan to write it in series of article.
Related articles :
Time Series Forecasting - Part 1 (Overview)
Time Series Forecasting - Part 2 (SES)
Time Series Forecasting - Part 3 (DES)
Time Series Forecasting - Part 4 (TES)
Let's now learn ARIMA
ARIMA is an acronym for Auto Regressive Integrated Moving Average. It is one of the most popular methods used for time series forecasting. ARIMA, also known as Box-Jenkins, is mainly used for short term forecasting. It works best when the data exhibits consistent pattern over time, with a minimum amount of outliers.Let's directly jump to "How to do ARIMA modeling?" and we would cover the important terminologies there and then only.
Starting with:
Step 0 : Plot and feel the series
proc
Sgplot
data
= masterdata;
Series x = date
y =
Log_AIR;
run;
Quit;
It is clear from the chart that the series is having strong seasonality and trend.
Step 1 : Check Series Volatility
Let's now check whether the series is volatile or not :
As the data is diverging (Fan Shaped), it is volatile. It is required that we should make the series non-volatile for ARIMA procedure and to reduce its volatility, we transform it.
Step 2 : Stationarity Check
Once again we plot the series (transformed) and do the stationarity check.
What is stationary series ?
A time series is called stationary whose mean and variance is constant over a period of time, in other and precise words mean and variance is not be a function of time.
Now while a series is having a constantly increasing trend, if we take mean across two time zones, definitely the mean won't be same and hence it won't be stationary.
Example :
Hope the concept of stationarity in the time series is clear. Let's see how to do the same in SAS for our case.
What is stationary series ?
A time series is called stationary whose mean and variance is constant over a period of time, in other and precise words mean and variance is not be a function of time.
Now while a series is having a constantly increasing trend, if we take mean across two time zones, definitely the mean won't be same and hence it won't be stationary.
Example :
Hope the concept of stationarity in the time series is clear. Let's see how to do the same in SAS for our case.
Ideally for ARIMA, series should be non-stationary, however for procedural purpose, we need to make it stationary. A stationary series is generally is not very interesting.Apart from visual check, there is more robust statistical way method to check the stationarity. We can use Augmented Dickey Fuller Test (recommended):
Basically we capture the trend and seasonality while making a non-stationary series stationary in ARIMA process, while theoretically it is said that we remove the trend and seasonality.
PROC
ARIMA
DATA=
Masterdata;
IDENTIFY VAR = log_Air STATIONARITY=
(ADF) ;
RUN;
Quit;
Here we have done 1 and 12 order differencing simultaneously, you can can do it one by one and check the results for better understanding.
Perform differencing and checking non-stationarity again:
PROC
ARIMA
DATA=
masterdata
;
IDENTIFY VAR = Log_Air (1,12)
STATIONARITY=
(ADF) ;
RUN;
Quit;
Quit;
You can try differencing the series in MS Excel for better understanding : First subtract first lag from each observation and plot it and then with the new series subtract 12th lag from each observation. You will get similar output as shown above.
All right! Now while series has been made stationary. It is now time to identify the potential order of the model.
Order of the ARIMA model is denoted by ARIMA (p, d, q) or AR(p), I(d), MA (q)
What are these p, d ,q ?
p : denoted the auto-regression components, which means relation of an observation with lagged individual observation. p also stands for Partial Auto Correlation Function (PACF).
d : order of differencing to make the data stationary. We have already dealt with this part.
q : denoted the moving average components, which means relation of an observation with lagged multiple observations vector. q also stands for Auto Correlation Function (ACF).
Step 3 : Potential Model Identification
Well, potential models can be identified using following methods :
1. Using ACF and PACF function plots visually
2. Using Minimum Information Criteria (Recommended)
'
Let's also break the data into two part here itself : Training and Validation
We will train out model on Training part and will check the goodness of model on another
Data Training validation;
set masterdata;
if date
>= '01Jan1960'd
then
output
validation;
else output Training;
Run;
/* Now run following for checking ACF and PACF plots */
PROC
ARIMA
DATA= Training;
IDENTIFY VAR = Log_Air (1,12);
RUN;Quit;
It is advised to use the second method for model selection process, though:
ods output
MINIC = minic_data;
PROC
ARIMA
DATA=
training;
IDENTIFY VAR = Log_Air(1,12)MINIC;
RUN;quit;
Here is the macro that you can use for doing model iterations :
%Macro top_models;
%do p = 0 %to 3 ;
%do q = 0 % to 3 ;
PROC ARIMA DATA= training ;
IDENTIFY VAR = Log_Air(1,12) ;
ESTIMATE P = &p. Q =&q. OUTSTAT= stats_&p._&q. ;
Forecast lead=12 interval = month id = date
out = result_&p._&q.;
RUN;
Quit;
data stats_&p._&q.;
set stats_&p._&q.;
p = &p.;
q = &q.;
Run;
data result_&p._&q.;
set result_&p._&q.;
p = &p.;
q = &q.;
Run;
%end;
%end;
Data final_stats ;
set %do p = 0 %to 3 ;
%do q = 0 % to 3 ;
stats_&p._&q.
%end;
%end;;
if p = 0 and q = 0 then delete;
if p = 0 and q = 0 then delete;
Run;
Data final_results ;
set %do p = 0 %to 3 ;
%do q = 0 % to 3 ;
result_&p._&q.
%end;
%end;;
if p = 0 and q = 0 then delete;
if p = 0 and q = 0 then delete;
Run;
%Mend;
%top_models
Thus we get the result of all the top 15 (except 0,0) models, from which we choose the top 5-7 models based on minimum average value of AIC and SBC (AIC and SBC are both errors).
Proc SQL outobs = 7;
create table final_stats_1 as select p,q, sum(_VALUE_)/2 as mean_aic_sbc from final_stats
where _STAT_ in ('AIC','SBC')
group by p,q
order by mean_aic_sbc;
Quit;
/* Finally we calculate MAPE value for top 7 models only to decide the final model
MAPE = Abs(Actual –
Predicted) / Actual *100 */
Proc SQL;
create table final_results_1 as select a.p, a.q, a.date,a.forecast, b.log_air
from final_results as a join validation as b
on a.date = b.date right join final_stats_1 as c
on a.p = c.p
and a.q = c.q;
quit;
Data Mape;
set final_results_1 ;
Ind_Mape = abs(log_air - forecast)/ log_air;
Run;
Proc Sql;
create table mape as select p, q, mean(ind_mape) as mape from mape
group by p, q
order by mape ;
quit;
And finally we arrive on the final model with p = 0 and q = 3 as final model and with 1 % as final MAPE value. MAPE should be less than equal to 5% to be acceptable.
I this blog I have avoided few theoretical points and terms such as unit root, white noise etc. as there terms are important to know theoretically and for interview point of view but not much for procedural purpose. I don't want to confuse you unnecessarily but making article too heavy to digest!!!
Enjoy reading our other articles and stay tuned with us.
Kindly do provide your feedback in the 'Comments' Section and share as much as possible.