Data Analysis of Monthly CO$_{2}$
1 Preanalysis
The data are the monthly CO2 levels at Alert, Canada (01/1994–12/2004). First, we observe the overall trend of the CO2 level over this period.
Judging from the plot, there are no significant outliers, and there is no need to transform the data. There is an obvious seasonal component in the series, and we can count 11 complete cycles over this period. Since the series covers 132 months, each cycle contains 12 time points.
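The cycle arithmetic can be checked directly (a trivial sketch; the cycle count comes from the plot described above):

```python
# Months from 01/1994 through 12/2004, inclusive:
n_months = (2004 - 1994 + 1) * 12      # 132 monthly observations
n_cycles = 11                          # complete annual cycles seen in the plot
points_per_cycle = n_months // n_cycles  # 12 time points per cycle
```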
2 Remove Seasonal Components
We remove the seasonal component by differencing at lag = 12 and plot the residuals after seasonal differencing. In the seasonal ARIMA model, this corresponds to D = 1.
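The original analysis is presumably done in R (e.g. `diff(x, lag = 12)`); as a minimal sketch, lag-12 seasonal differencing can be written in Python as follows, with a toy series standing in for the CO2 data:

```python
import numpy as np

def seasonal_diff(x, lag=12):
    """Return x[t] - x[t-lag]; the result is len(x) - lag points long."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

# Toy series: linear trend plus an exact period-12 seasonal component.
t = np.arange(48)
y = 0.1 * t + np.sin(2 * np.pi * t / 12)

dy = seasonal_diff(y, lag=12)
# The sine component cancels exactly; only the trend increment 0.1 * 12 = 1.2 remains.
```

On the real data the seasonal pattern is not exactly periodic, so the differenced series contains residual noise rather than a constant.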
The residuals do not look stationary: they do not appear to have a constant mean. So we conduct the ADF and KPSS tests to check whether the residuals are stationary.
The augmented Dickey–Fuller test’s null hypothesis is that the series is not stationary (the time series has a unit root). The p-value of the Dickey–Fuller test is 0.1575, so we fail to reject the null hypothesis that the residuals are not stationary.
The KPSS test’s null hypothesis is that the time series does not have a unit root, i.e. the series is stationary. The p-value is 0.1, so we fail to reject the null hypothesis that the residuals are stationary.
Combining these two results, we tend to believe that some non-stationarity remains in the residuals.
3 Find Stationary Series
So we take the first difference of the residuals after removing the seasonal component; in the seasonal ARIMA model, this corresponds to d = 1. Then we conduct the ADF and KPSS tests again on the differenced series.
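Taking d = 1 on top of D = 1 applies the operator (1 − B)(1 − B¹²) to the original series. A small sketch confirms that the two differencing steps commute, so the order in which they are applied does not matter:

```python
import numpy as np

def lag_diff(x, lag):
    """x[t] - x[t-lag]."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

rng = np.random.default_rng(1)
y = rng.normal(size=60).cumsum()

a = lag_diff(np.diff(y), 12)   # first difference, then seasonal difference
b = np.diff(lag_diff(y, 12))   # seasonal difference, then first difference
# a and b are identical: (1 - B)(1 - B^12) = (1 - B^12)(1 - B).
```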
The Dickey–Fuller test’s null hypothesis is that the series is not stationary (the time series has a unit root). The p-value of the Dickey–Fuller test is now 0.01, so we reject the null hypothesis and conclude that the residuals are stationary.
The KPSS test’s null hypothesis is that the time series does not have a unit root, i.e. the series is stationary. The p-value is 0.1, so we fail to reject the null hypothesis that the residuals are stationary.
Therefore, the residuals after taking the first difference are stationary, and we can fit a seasonal ARIMA model to them.
4 Find the Model
We plot the ACF and PACF to identify a model for the stationary residuals.
First, we look at the lags that are multiples of the seasonal period s = 12 (i.e. 12, 24, 36, 48, 60) to choose an ARMA model for the seasonal component. As the plot shows, the ACF is significant at lag 12 and drops to 0 afterwards, and lag 12 is the first of the seasonal lags. The PACF appears to tail off to 0, with significant values at lags 12, 24, and 36. So we fit an MA(1) model to the seasonal component, and the seasonal ARIMA model so far is $ARIMA(p,1,q)(0,1,1)_{12}$.
Then we look at the nonseasonal part of the ACF and PACF plots, whose lags are not multiples of 12. In the ACF, the values drop to 0 after lag 1, with two other significant lags at 11 and 13, which we regard as type I errors. In the PACF, the values drop to 0 after lag 2, with significant lags at 1, 2, 11, and 22; we regard lags 11 and 22 as type I errors here. So the possible models for the nonseasonal part are MA(1) or AR(2). We fit both and compare which one forecasts better.
5 Model Selection
5.1 ARIMA(2,1,0)(0,1,1)[12]
We use an AR(2) model for the nonseasonal component. The fitting results are shown below.
Then we plot the ACF and PACF of the residuals of the fitted ARIMA(2,1,0)(0,1,1)[12] model and conduct the Ljung–Box test.
The ACF and PACF are not significant except for the PACF at lag 18; this significant lag might be a type I error. So we conclude that no dependence structure remains in the residuals: they are uncorrelated. The Ljung–Box test gives a p-value of 0.005462 < 0.05, so we reject its null hypothesis and conclude that the noises are not independent. This conclusion is not wholly contradictory to the ACF and PACF results, since uncorrelatedness does not imply independence. As long as the residuals are uncorrelated, the white noise assumption holds, so we accept that the residuals after fitting the seasonal ARIMA model are white noise.
Then we conduct the Shapiro–Wilk test to check the normality assumption.
The p-value is 0.1235 > 0.05, so we fail to reject the null hypothesis that the noises are normal.
5.2 ARIMA(0,1,1)(0,1,1)[12]
We use an MA(1) model for the nonseasonal component. The fitting results are shown below.
Then we plot the ACF and PACF of the residuals of the fitted ARIMA(0,1,1)(0,1,1)[12] model and conduct the Ljung–Box test.
The ACF and PACF are not significant except for the PACF at lag 18; this significant lag might be a type I error. So we conclude that no dependence structure remains in the residuals: they are uncorrelated. The Ljung–Box test gives a p-value of 0.01112 < 0.05, so we reject its null hypothesis and conclude that the noises are not independent. As before, this is not wholly contradictory to the ACF and PACF results, since uncorrelatedness does not imply independence; as long as the residuals are uncorrelated, the white noise assumption holds, and we accept that the residuals are white noise.
Then we conduct the Shapiro–Wilk test to check the normality assumption.
The p-value is 0.04 < 0.05, so we reject the null hypothesis: normality does not hold for the noises of the MA(1) model.
5.3 “auto.arima” Function Model
We use the “auto.arima” function to fit the CO2 data. The results are shown below.
The model selected is ARIMA(2,0,1)(1,1,0)$_{12}$. Then we plot the ACF and PACF of its residuals and conduct the Ljung–Box test.
The ACF and PACF are not significant, so we conclude that no dependence structure remains in the residuals: they are uncorrelated. The Ljung–Box test gives a p-value of 0.001398 < 0.05, so we reject its null hypothesis and conclude that the noises are not independent. Again, this is not wholly contradictory to the ACF and PACF results, since uncorrelatedness does not imply independence; as long as the residuals are uncorrelated, the white noise assumption holds, and we accept that the residuals are white noise.
Then we conduct the Shapiro–Wilk test to check the normality assumption.
The p-value is 0.04276 < 0.05, so we reject the null hypothesis: normality does not hold for the noises of the ARIMA(2,0,1)(1,1,0)$_{12}$ model.
5.4 Comparison
We compare these three models in terms of AIC, the white-noise assumption on the residuals, and the normality assumption on the residuals.

In terms of AIC, the third model’s AIC is much larger than those of the first two models.

The residuals of all three models are white noise.

In terms of normality, which is of great importance for forecasting, only the first model satisfies the normality assumption.
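The AIC comparison can be sketched as follows; the log-likelihoods and parameter counts below are illustrative placeholders, not the fitted values from this report:

```python
# AIC = 2k - 2*log-likelihood; smaller is better.
def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Hypothetical log-likelihoods for the three candidates (illustrative only):
candidates = {
    "ARIMA(2,1,0)(0,1,1)[12]": aic(-60.0, 4),
    "ARIMA(0,1,1)(0,1,1)[12]": aic(-62.0, 3),
    "ARIMA(2,0,1)(1,1,0)[12]": aic(-75.0, 5),
}
best = min(candidates, key=candidates.get)   # model with the smallest AIC
```

Note that AIC alone would not settle the choice here; the report also weighs the normality of the residuals, which only the first model satisfies.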
After weighing these trade-offs, we choose the first model for forecasting because its normality assumption holds. When the normality assumption holds, the forecast intervals in the next step are more reliable and less biased.
Hence we choose ARIMA(2,1,0)(0,1,1)$_{12}$.
6 Forecast in 2005
We use the “forecast” function to forecast the CO2 level in 2005. The forecast plot, point forecasts, and confidence intervals are shown below.
The 95% forecast intervals are reliable since the normality assumption of the residuals holds.