The examples below illustrate how Analytic Solver Data Mining can be used to explore the Income.xlsx dataset to uncover trends and seasonalities in a dataset.
Click Help – Example Models on the Data Mining ribbon, then Forecasting/Data Mining Examples.
This dataset contains the average income of tax payers by state.
Typically the following steps are performed in a time series analysis.
1. The data is partitioned into two sets with 60% of the data assigned to the Training Set and 40% assigned to the Validation Set.
2. Exploratory techniques are applied to both the Training and Validation Sets. If the results are in synch, then the model can be fit. If the ACF and PACF plots are the same, then the same model can be used for both sets.
3. The model is fit using the ARIMA (Autoregressive Integrated Moving Average) method.
4. When a model is fit using the ARIMA method, XLMiner displays the ACF and PACF plots for residuals. If these plots are in the band of UCL and LCL, then the residuals are random and the model is adequate.
5. If the residuals are not within the bands, then some correlations exist, and the model should be improved.
First we must perform a partition on the data. Click Partition within the Time Series group on the Data Mining ribbon to open the following dialog.
Select Year under Variables and click > to define the variable as the Time Variable. Select the remaining variables under Variables and click > to include them in the partitioned data.
Select Specify #Records under Specify Partitioning Options to specify the number of records assigned to the training and validation sets. Then select Specify #Records under Specify #Records for Partitioning. Enter 50 for the number of Training Set records and 21 for the number of Validation Set records.
If Specify Percentages is selected under Specify Partitioning Options, Analytic Solver Data Mining will assign a percentage of records to each set according to the values entered by the user or automatically entered under Specify Percentages for Partitioning.
Click OK. TSPartition is inserted to the right of the Income worksheet.
Note in the output above, the partitioning method is sequential (rather than random). The first 50 observations have been assigned to the training set and the remaining 21 observations have been assigned to the validation set.
Open the Lag Analysis dialog by clicking ARIMA – Lag Analysis. Select CA under Variables in input data, then click > to move the variable to Selected variable. Enter 1 for Minimum Lag and 40 for Maximum Lag under Parameters: Training and 1 for Minimum Lag and 15 for Maximum Lag under Parameters: Validation.
Under Charting, select ACF chart, ACVF chart, and PACF chart to include each chart in the output.
Click OK. TS_Lags is inserted right of the TSPartition worksheet.
First, let's take a look at the ACF charts. Note on each chart, the autocorrelation decreases as the number of lags increase. This suggests that a definite pattern does exist in each partition. However, since the pattern does not repeat, it can be assumed that no seasonality is included in the data. In addition, both charts appear to exhibit a similar pattern.
Note: To view these two charts in the Cloud app, click the Charts icon on the Ribbon, select TS_Lags for Worksheet and ACF/ACVF/PACF Training/Validation Data for Chart.
The PACF functions show a definite pattern which means there is a trend in the data. However, since the pattern does not repeat, we can conclude that the data does not show any seasonality.
The screenshots below display the autocovariance values.
All three charts suggest that a definite pattern exists in the data, but no seasonality. In addition, both datasets exhibit the same behavior in both the training and validation sets which suggests that the same model could be appropriate for each. Now we are ready to fit the model.
The ARIMA model accepts three parameters: p – the number of autoregressive terms, d – the number of non-seasonal differences, and q – the number of lagged errors (moving averages).
Recall that the ACF plot showed no seasonality in the data which means that autocorrelation is almost static, decreasing with the number of lags increasing. This suggests setting q = 0 since there appears to be no lagged errors. The PACF plot displayed a large value for the first lag but minimal plots for successive lags. This suggest setting p =1. With most datasets, setting d =1 is sufficient or can at least be a starting point.
Click back to the TSPartition tab and then click ARIMA – ARIMA Model to bring up the Time Series – ARIMA dialog.
Select CA under Variables in input data then click > to move the variable to the Selected Variable field. Under Nonseasonal Parameters set Autoregressive (p) to 1, Difference (d) to 1 and Moving Average (q) to 0.
Click Advanced to open the ARIMA – Advanced Options dialog. Select Fitted Values and residuals, Produce forecasts, and Report Forecast Confidence Intervals. The default Confidence Level setting of 95 is automatically entered. The option Variance-covariance matrix is selected by default.
Click OK on the ARIMA-Advanced Options dialog and again on the Time Series – ARIMA dialog. Analytic Solver Data Mining calculates and displays various parameters and charts in four output sheets, Arima_Output, Arima_Fitted, Arima_Forecast and Arima_Stored. Click the Arima_Output tab to view the Output Navigator.
Click the ARIMA Model link on the Output Navigator to move to display the ARIMA Model and Ljung-Box Test Results on Residuals.
Analytic Solver has calculated the constant term and the AR1 term for our model, as seen above. These are the constant and f1 terms of our forecasting equation. See the following output of the Chi - square test.
The very small p-values for the constant term (1.119E-7) and AR1 term (1.19e-89) suggest that the model is a good fit to our data.
Click the Fitted link on the Output Navigator. This table plots the actual and fitted values and the resulting residuals for the training partition. As shown in the graph below, the Actual and Forecasted values match up fairly well. The usefulness of the model in forecasting will depend upon how close the actual and forecasted values are in the Forecast, which we will inspect later.
Use your mouse to select a point on the graph to compare the Actual value to the Forecasted value.
Note: To view these two charts in the Cloud app, click the Charts icon on the Ribbon, select Arima_Fitted for Worksheet and ACF/ACVF/PACF Training/Validation Data for Chart.
Take a look at the ACF and PACF plots for Errors found at the bottom of ARIMA_Output. Analytic Solver contains one more additional chart, the ACVF Plot for the Residuals.
With the exception of Lag1, the majority of the lags in the PACF and ACF charts are either clearly within the UCL and LCL bands or just outside of these bands. This suggests that the residuals are random and are not correlated.
Click the Forecast link on the Output Navigator to display the Forecast Data table and charts.
The table shows the actual and forecasted values along with LCI (Lower Confidence Interval), UCI (Upper Confidence Interval) and Residual values. The "Lower" and "Upper" values represent the lower and upper bounds of the confidence interval. There is a 95% chance that the forecasted value will fall into this range. The graph to the right plots the Actual values for CA against the Forecasted values. Again, click any point on either curve to compare the Actual against the Forecasted values.