Evaluating the Performance of Autoregressive Model for Solar Radiation Forecasting

Solar radiation data from 14 meteorological stations in Nigeria have been analyzed. The periodic component of the data, which covered a period of 13 years (mostly 1977-1989), was removed via Fourier analysis, while the residual series was subjected to autoregressive analysis. It was evident from the t-test and autocorrelation plots of the modified series (i.e. the series without the periodic component) that there is significant persistence at nine stations: Sokoto, Nguru, Kano, Maiduguri, Bauchi, Yola, Minna, Ibadan, and Benin. The autocorrelations at Jos, Bida, Ikeja, Enugu and Port Harcourt were, however, found to be insignificant. As the sample partial autocorrelation function cuts off after lag 1, a non-seasonal autoregressive model of order 1, AR(1), was identified for the stations with significant autocorrelation. The Q-statistic of the error series suggested that the models were adequate as identified. Moreover, the exploratory plots of the model residual series agreed with the quantitative statistics and thus reinforce the inference that the models were adequate for monthly mean daily global solar radiation forecasts at some of the study stations. It is interesting to note that all the stations within the sub-sahelian region showed significant persistence, whereas all the stations in the coastal region except Benin showed insignificant autocorrelation. As expected, the performance evaluation of the model gave impressive results for the stations within the sub-sahelian region but relatively weak results for the coastal region. The results for the midland region were mixed, whereas it was difficult to draw a conclusion for the Guinea savannah region with results from only one station.


Introduction
Solar radiation is the radiant energy flux emanating from the sun. As it travels through space to the top of the Earth's atmosphere, it is relatively unaffected in spectral distribution and intensity. However, on its way to the Earth's surface, it is subjected to scattering and absorption by atmospheric constituents. Studies have shown that some 25-70% of solar radiation reaching the top of the atmosphere is received at the Earth's surface. Thus, seasonal variation in the composition and concentration of atmospheric constituents determines to a great extent the spatial and temporal variability of solar radiation at the Earth's surface (Brooks, 2006;Lindsey, 2009).
In research and application fields such as architecture, energy, agriculture and meteorology, to mention a few, solar radiation usually needs to be measured continuously and accurately over a long term. However, continuous measurement of solar radiation at all locations carries a huge cost. Consequently, many studies have been carried out to develop methods for estimating climatic variables such as solar radiation at places where measurements are not available but which share similar climatic conditions with the measured locations. Some researchers capitalize on the ability of an individual variable to predict itself (persistence); others rely on other measured parameters, known as explanatory variables, to establish empirical regression models.
Numerous empirical studies have been carried out to investigate the predictive skill of persistence in meteorological and geophysical time series. For instance, in the study of Sulaiman et al. (1997), daily solar radiation data from four different locations in Malaysia were analyzed using the Box-Jenkins approach. The deterministic component was removed using Fourier analysis and the stochastic component was subjected to Autoregressive Moving Average (ARMA) analysis. Similarly, Zakaria (2011) developed a model for the periodic and stochastic components of monthly rainfall data from Purajaya station, Indonesia, and was able to adequately synthesize monthly rainfall data for the Purajaya area. Zaharim et al. (2009) applied the Box-Jenkins method to solar radiation data from Bangi, Malaysia. The non-seasonal autoregressive model of order 1, ARMA(1, 0), was found to model the data adequately. In an insightful work, Barnett et al. (1984) investigated the role of persistence in the predictability of short-term climate variation over the Eurasian continent and found that persistence in air temperature anomalies accounted for most of the predictability, ahead of sea surface temperature and sea level pressure anomalies as other predictors. Adeyemi and Fasoranbaku (2008) sought to predict daily mean air temperature over Nigeria and developed non-seasonal autoregressive models of varying order for eight meteorological sites. The results obtained were found to compare well with existing models. Some other authors have fitted the seasonal ARMA (SARMA) model and the periodic autoregressive (PAR) model to meteorological time series in order to account explicitly for seasonal variation. For instance, Tesfaye et al. (2006) developed model identification and simulation techniques based on the periodic autoregressive moving average (PARMA) model that could capture the seasonal variations in river flow statistics. The technique was applied to monthly flow data for the Fraser River in British Columbia, and it was found that the statistical analysis of the PARMA model residuals, including a truncated Pareto model for the extreme tails, produced a realistic simulation of these river flows. Smadi (2009) compared the traditional SARMA model with the relatively new PAR model for modeling the average monthly maximum temperature data in Jordan. He found that the PAR model does a better job, but the large number of parameters involved, especially for monthly data, remained its main drawback.
Besides, meteorological time series can also be estimated based on other measured variables. For instance solar radiation has been estimated based on variables such as sunshine hours (Page, 1964;Akpabio et al., 2004;Falayi and Rabiu, 2005;Safari and Gasore, 2009), air temperature (Okundamiya and Nzeako, 2011), precipitation and relative humidity (Trabea and Shaltout, 2000). The most widely used model in those studies is the Angstrom-Prescott model.
In this paper, we sought to develop persistence models for monthly mean daily global solar radiation forecast over Nigeria. This specific effort looked fruitful since prior studies have suggested the existence of strong persistence in global solar radiation elsewhere within the tropical belt (Zaharim et al., 2009;Raji et al., 2012).

Data sourcing and cleaning
The data for this study were obtained from the Nigerian Meteorological Agency (NIMET), Federal Ministry of Aviation, Oshodi, Lagos. Thirteen years of monthly mean daily global solar radiation data from fourteen meteorological stations in Nigeria were acquired. The first 8 years of data were used to calibrate the model and the remaining 5 years were used to validate it.
At most of the stations the data contained missing measurements. Each missing value was replaced by the long-term mean of the month to which it belongs. The data were also inspected for outliers using notched box plots. Values lying beyond the whiskers are marked as outliers in comparison to what is expected from a normal distribution with the same mean and variance as the sample data. A number of outliers were found at several stations, including Sokoto, Nguru, Maiduguri, Bauchi, Jos, Bida, Ibadan, Ikeja, Enugu and Benin. This check is especially important since the least squares method of model parameter estimation employed in this study is very sensitive to outliers and could lead to spurious results if they are left in place. The outliers were removed and replaced by the long-term mean of the month to which they belong in the data set.
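The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run for the study: the function names are hypothetical, missing values are assumed to be coded as `None`, the series is assumed to start in January, the whiskers are assumed to sit at 1.5 times the interquartile range (the usual box-plot convention), and the monthly means are assumed to be computed from the values that are neither missing nor flagged.

```python
from statistics import quantiles


def month_means(values, keep):
    """Long-term mean of each calendar month over the entries where keep[i] is True."""
    means = []
    for m in range(12):
        sel = [values[i] for i in range(m, len(values), 12) if keep[i]]
        # Raises ZeroDivisionError if a calendar month has no usable value at all.
        means.append(sum(sel) / len(sel))
    return means


def clean_series(raw, whisker=1.5):
    """Replace missing values (None) and box-plot outliers by the long-term monthly mean.

    raw     : monthly values, first entry assumed to be a January value
    whisker : whisker length in units of the interquartile range (assumed 1.5)
    """
    observed = [x for x in raw if x is not None]
    q1, _, q3 = quantiles(observed, n=4)           # quartiles of the observed values
    low = q1 - whisker * (q3 - q1)                 # lower whisker limit
    high = q3 + whisker * (q3 - q1)                # upper whisker limit
    # keep[i] is True only for values that are present and inside the whiskers
    keep = [x is not None and low <= x <= high for x in raw]
    means = month_means(raw, keep)
    return [x if k else means[i % 12] for i, (x, k) in enumerate(zip(raw, keep))]
```

Calling `clean_series` on a station record returns a series of the same length in which every gap or flagged value carries the mean of the remaining values for that calendar month.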

Deterministic and stochastic components
Solar radiation, like most meteorological variables, consists of a periodic (deterministic) component and a random (non-deterministic) component. The periodic component is predictable with great precision even far into the future, but this is not so for the random component (Jones, 1964). It suffices, then, that a statistical analysis of a meteorological time series should aim to determine the predictability of this non-deterministic component. Thus, suppose $X_0, X_1, \ldots, X_{N-1}$ is an arbitrary time series of length $N$; the time series can be expressed as (Sharma, 1989):
$$X_t = P_t + S_t \tag{1}$$
where $P_t$ is the periodic component and $S_t$ is the stochastic or random component of the series.
The periodic component can be expressed as the sum of sinusoids at the dominant Fourier frequencies of the series as (Sharma, 1989; Zakaria, 2011):
$$P_t = A_0 + \sum_{j=1}^{k} \left[ A_j \cos(\omega_j t) + B_j \sin(\omega_j t) \right] \tag{2}$$
The summation in (2) is over the Fourier frequencies $\omega_j$, where $j = 1, 2, \ldots, k$ are the harmonics, and $A_0$, $A_j$ and $B_j$ are the Fourier coefficients. The dominant harmonics are obtained from the spectral estimates of the time series (Sharma, 1989). In particular, for the study stations it was found that the first two harmonics are significant, implying that $k = 2$ for our time series. Thus the time distribution of the series may be written as:
$$\hat{P}_t = A_0 + \sum_{j=1}^{2} \left[ A_j \cos(\omega_j t) + B_j \sin(\omega_j t) \right]$$
where $\hat{P}_t$ denotes the estimated value of the periodic component. The deterministic and stochastic components can be separated by subtracting the estimated periodic component $\hat{P}_t$ month by month and dividing the result by the sample standard deviation for the same month, $s_m$ (Sharma, 1989; Adamowski and Smith, 1972). Thus:
$$Z_t = \frac{X_t - \hat{P}_t}{s_m}$$
where $Z_t$ is the standardized stochastic component with zero mean and unit variance. The stochastic component may in turn be further decomposed to:
$$Z_t = \sum_{j=1}^{p} \phi_j Z_{t-j} + \varepsilon_t$$
where the summation is the deterministic component representing persistence or autocorrelation in the residual series, $\phi_j$ are the autocorrelation coefficients and $\varepsilon_t$ is white noise, which is independently distributed with zero mean and constant variance and is assumed to approximate a normal probability distribution (Sharma, 1989; Zakaria, 2011).
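The separation of the periodic and stochastic components described above might look as follows in Python. The sketch rests on assumptions that the text does not state explicitly: the harmonic frequencies are taken as multiples of the annual cycle (2πj/12 for monthly data), the record is assumed to span a whole number of years so that the least-squares coefficients reduce to simple projections, and all function names are hypothetical.

```python
import math


def fit_harmonics(x, k=2, period=12):
    """Mean and first k harmonic coefficients of x by least squares.

    Assumes len(x) is a whole number of periods and k < period / 2, so the
    sinusoids are orthogonal and each coefficient is a plain projection.
    Returns (a0, [(omega_j, A_j, B_j), ...]).
    """
    n = len(x)
    a0 = sum(x) / n
    coeffs = []
    for j in range(1, k + 1):
        w = 2 * math.pi * j / period               # assumed frequency of harmonic j
        a = 2 / n * sum(v * math.cos(w * t) for t, v in enumerate(x))
        b = 2 / n * sum(v * math.sin(w * t) for t, v in enumerate(x))
        coeffs.append((w, a, b))
    return a0, coeffs


def periodic_component(a0, coeffs, n):
    """Evaluate the fitted periodic component at t = 0, ..., n - 1."""
    return [a0 + sum(a * math.cos(w * t) + b * math.sin(w * t) for w, a, b in coeffs)
            for t in range(n)]


def standardize(x, p_hat, period=12):
    """Subtract the periodic estimate and divide by the same-month sample std of x."""
    z = [0.0] * len(x)
    for m in range(period):
        idx = list(range(m, len(x), period))
        vals = [x[i] for i in idx]
        mu = sum(vals) / len(vals)
        # Sample standard deviation for calendar month m (fails if it is zero).
        sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / (len(vals) - 1))
        for i in idx:
            z[i] = (x[i] - p_hat[i]) / sd
    return z
```

With these helpers, `standardize(x, periodic_component(*fit_harmonics(x), len(x)))` yields the series that is passed on to the autoregressive analysis.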

Stochastic modelling
The autoregressive model, like all other persistence models, describes a stationary process. Consequently, a preliminary step in any time series modeling is to investigate whether the underlying process generating the data is stationary. If the series is non-stationary, it must be transformed to a stationary series. Differencing is commonly used to transform a non-stationary time series into a stationary one. A stationary time series can be identified by conducting a unit root test on the series. The common test statistics for a unit root are the Phillips-Perron test (PP test) and the Augmented Dickey-Fuller test (ADF test), which is basically an extension of the Dickey-Fuller test (DF test). These statistics evaluate the null hypothesis of the presence of a unit root in the model underlying a series against the alternative hypothesis of a root smaller than one. Suppose that $Z_t$ is a first-order autoregressive process; the objective of the tests is to evaluate the hypotheses:
$$H_0: \phi_1 = 1 \quad \text{(i.e. the series contains a unit root)}$$
$$H_1: \phi_1 < 1 \quad \text{(i.e. the series is stationary)}$$
for the underlying model given by:
$$Z_t = \phi_1 Z_{t-1} + \varepsilon_t$$
The decision rule is: reject $H_0$ if $\tau_j < C_{\text{value}}$, where $\tau_j$ is the computed statistic, say the Phillips-Perron statistic at lag $j$, and $C_{\text{value}}$ is the critical value, which does not follow the usual t-distribution but follows a non-standard distribution derived from Monte Carlo experiments (Fuller, 1976).
Most standard statistical software, including MATLAB, which is used in this study, presents a table of critical values for any given level of significance. Where the computed test statistic is more negative than the critical value, the null hypothesis of a unit root is rejected, implying a stationary series.
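As a rough illustration of the decision rule, the following Python sketch computes the plain Dickey-Fuller statistic for the no-constant, no-lag case. It is a simplified stand-in, not the Phillips-Perron statistic mentioned above and not the MATLAB routine used in the study. The function names are hypothetical, and the 5% critical value of about -1.95 is the commonly tabulated large-sample value for this case, quoted here as an assumption.

```python
import math

# Approximate 5% critical value for the Dickey-Fuller test without a constant
# (commonly tabulated large-sample value; an assumption of this sketch).
DF_CRITICAL_5PCT = -1.95


def dickey_fuller_tau(z):
    """t-ratio of gamma in the regression  dz_t = gamma * z_{t-1} + e_t.

    gamma equals phi_1 - 1, so the unit-root null phi_1 = 1 corresponds to gamma = 0.
    """
    lagged = z[:-1]
    diff = [b - a for a, b in zip(z[:-1], z[1:])]
    sxx = sum(v * v for v in lagged)
    gamma = sum(v * d for v, d in zip(lagged, diff)) / sxx
    sse = sum((d - gamma * v) ** 2 for v, d in zip(lagged, diff))
    se = math.sqrt(sse / (len(diff) - 1) / sxx)    # standard error of gamma
    return gamma / se


def rejects_unit_root(z, critical=DF_CRITICAL_5PCT):
    """True when the statistic is more negative than the critical value."""
    return dickey_fuller_tau(z) < critical
```

A series for which `rejects_unit_root` returns `True` would be treated as stationary under this simplified test; a steadily trending series gives a positive statistic and the null hypothesis is retained.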
A complementary approach is to examine the behavior of the correlograms for the sample autocorrelation function (SACF) and the sample partial autocorrelation function (SPACF) of the time series for several lags (say 20 lags). The sample autocorrelation function at lag $k$ is defined as:
$$r_k = \frac{\sum_{t=0}^{N-k-1} (Z_t - \bar{Z})(Z_{t+k} - \bar{Z})}{\sum_{t=0}^{N-1} (Z_t - \bar{Z})^2}$$
where $\bar{Z}$ is the mean of the series $Z_t$. The plot of the SACF as a function of lag is called the correlogram.
The sample partial autocorrelation function (SPACF) at lag $k$ is defined as (Durbin, 1960):
$$\hat{\phi}_{kk} = \begin{cases} r_1, & k = 1 \\ \dfrac{r_k - \sum_{j=1}^{k-1} \hat{\phi}_{k-1,j}\, r_{k-j}}{1 - \sum_{j=1}^{k-1} \hat{\phi}_{k-1,j}\, r_j}, & k = 2, 3, \ldots \end{cases}$$
with $\hat{\phi}_{k,j} = \hat{\phi}_{k-1,j} - \hat{\phi}_{kk}\, \hat{\phi}_{k-1,k-j}$ for $j = 1, 2, \ldots, k-1$. To establish the significance of the ACF values at the various lags for each station, 95% confidence bounds may be computed and superimposed on the correlograms for easy identification. In general, if the SACF of the time series either cuts off or dies down quickly, the time series should be considered stationary; conversely, if the SACF values die down extremely slowly, the series is considered non-stationary (Bowerman et al., 2005). The presence of a trend and/or a periodic signal can violate the condition of stationarity. A common approach to removing a trend and a periodic signal from a time series is de-trending and harmonic analysis, respectively (Sharma, 1989; Zakaria, 2011).
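The two sample functions can be computed directly from their definitions. The Python sketch below uses hypothetical function names and the large-sample approximation of 1.96 divided by the square root of the series length for the 95% bounds, which is an assumption about how the bounds in the correlograms were obtained.

```python
import math


def sacf(z, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of the series z."""
    n = len(z)
    zbar = sum(z) / n
    c0 = sum((v - zbar) ** 2 for v in z)
    return [sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(n - k)) / c0
            for k in range(1, max_lag + 1)]


def spacf(r):
    """Sample partial autocorrelations by the Durbin-Levinson recursion.

    r[k - 1] holds the sample autocorrelation at lag k.
    """
    pacf = []
    prev = []                                      # phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(r) + 1):
        if k == 1:
            phi_kk = r[0]
        else:
            num = r[k - 1] - sum(prev[j] * r[k - 2 - j] for j in range(k - 1))
            den = 1 - sum(prev[j] * r[j] for j in range(k - 1))
            phi_kk = num / den
        # Update phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j} for j = 1..k-1
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


def confidence_bound(n):
    """Approximate 95% bound for a sample autocorrelation of white noise."""
    return 1.96 / math.sqrt(n)
```

For autocorrelations that decay geometrically, as in a first-order autoregressive process, `spacf` returns a single non-zero value at lag 1, which is the cut-off pattern used for model identification.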
The structure and order of the model are determined by examining the behavior of the SACF and the sample partial autocorrelation function (SPACF). If the SACF dies down and the SPACF cuts off after lag $p$, then an autoregressive process of order $p$ (Box and Jenkins, 1976; Chatfield, 2004) is indicated and the model, in general, is given as:
$$Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} + \varepsilon_t$$
The historic data are then fitted to the identified model by the least squares method. We can judge the importance of the autocorrelation parameters using the t-test. In doing this we test the null hypothesis $H_0$ against the alternative hypothesis $H_1$ as follows:
$$H_0: \phi_j = 0 \qquad H_1: \phi_j \neq 0$$
where $\phi_j$ are the autoregressive parameter estimates from lag 1 to $p$. Thus we establish the decision rule as: reject $H_0$ if $|t| > t_{\alpha/2}^{(n - n_p - 1)}$, where $|t|$ is the absolute value of the computed t statistic and $t_{\alpha/2}^{(n - n_p - 1)}$ is the value from the Student's t distribution table at half the preset significance level $\alpha$ (since it is a two-tail test) and at $n - n_p - 1$ degrees of freedom.
Possible values of $\alpha$ include 0.10, 0.05 and 0.01, $n$ is the sample size and $n_p$ is the number of parameter estimators. Thus, a rejection of the null hypothesis would suggest that the parameter estimate in question is statistically significant.
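For the first-order case, the least-squares fit and the t-test can be sketched as follows. The function names are hypothetical, the regression is assumed to have no intercept (the series being standardized to zero mean), and the default critical value of 1.99 is only a rough two-tailed 5% value for about 90 degrees of freedom; in practice the table value for the actual degrees of freedom would be used.

```python
import math


def fit_ar1(z):
    """Least-squares fit of z_t = phi_1 * z_{t-1} + e_t.

    Returns (phi_1, t_statistic, residuals). The residual variance uses
    len(z) - 2 degrees of freedom, i.e. n - n_p - 1 with n_p = 1.
    """
    x, y = z[:-1], z[1:]
    sxx = sum(v * v for v in x)
    phi = sum(a * b for a, b in zip(x, y)) / sxx
    resid = [b - phi * a for a, b in zip(x, y)]
    dof = len(z) - 2
    s2 = sum(e * e for e in resid) / dof           # residual variance
    se = math.sqrt(s2 / sxx)                       # standard error of phi_1
    return phi, phi / se, resid


def is_significant(t_stat, critical=1.99):
    """Two-tail decision: reject phi_1 = 0 when |t| exceeds the critical value."""
    return abs(t_stat) > critical
```

The returned residuals are the error series whose randomness is examined in the diagnostic step.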
After establishing the model, we check its adequacy by examining the 'whiteness' of the model residuals. This is of key importance in showing the effectiveness of the model in monitoring the underlying autoregressive process. The primary tool for model diagnosis is residual analysis. If the model adequately describes the time series, then the residuals should be independent of one another, since an effective autocorrelation model is expected to remove the autocorrelation in the time series, leaving a random series. To test the randomness of the residuals, a portmanteau test such as the one proposed by Ljung and Box (1978) is carried out on them. The Ljung-Box statistic is also known as the Q-statistic and is given by:
$$Q = n'(n' + 2) \sum_{l=1}^{k} \frac{r_l^2(\hat{\varepsilon})}{n' - l}$$
with $n' = n - d$, where $n$ is the number of observations in the series, $d$ is the degree of differencing and $r_l^2(\hat{\varepsilon})$ is the square of the sample autocorrelation of the residuals at lag $l$.
The Q-statistic, computed from the lowest $k$ autocorrelations, say at lags 1, 2, 3, ..., 20, follows a $\chi^2$ distribution with $(k - p - q)$ degrees of freedom, where $p$ and $q$ are the AR and MA orders of the model. The Q-statistic is used to evaluate the following hypotheses:
$$H_0: \phi_1(\hat{\varepsilon}) = \phi_2(\hat{\varepsilon}) = \cdots = \phi_k(\hat{\varepsilon}) = 0 \qquad H_1: \text{at least one } \phi_l(\hat{\varepsilon}) \neq 0$$
where $\phi_l(\hat{\varepsilon})$ are the autocorrelations of the residual $\hat{\varepsilon}$ from lag 1 to $k$. Since the alternative hypothesis is a two-tail test, we adopt the following decision rule: reject $H_0$ if $Q > \chi^2_{k-p-q,\,\alpha/2}$, where $\chi^2_{k-p-q,\,\alpha/2}$ is the value from the $\chi^2$ distribution table at half the preset significance level $\alpha$ and $k - p - q$ degrees of freedom. Consequently, a failure to reject the null hypothesis $H_0$ suggests that the residual series is random and the model may be judged to be adequate. A complementary approach is to examine the autocorrelation plot of the residual series with superimposed confidence bounds. The residual series is judged to be random if the autocorrelations at all lags fall within the confidence bounds.
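The Q-statistic itself takes only a few lines of Python. In this sketch the function name is hypothetical, the residual series is assumed to need no further differencing so that its own length plays the role of n', and the critical value must be supplied from a chi-square table.

```python
def ljung_box_q(resid, max_lag):
    """Ljung-Box Q-statistic from the first max_lag residual autocorrelations.

    The length of resid is used as n' (no differencing assumed).
    """
    n = len(resid)
    mean = sum(resid) / n
    c0 = sum((e - mean) ** 2 for e in resid)
    q = 0.0
    for lag in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at this lag
        r = sum((resid[t] - mean) * (resid[t + lag] - mean) for t in range(n - lag)) / c0
        q += r * r / (n - lag)
    return n * (n + 2) * q


def residuals_look_random(resid, max_lag, critical):
    """True when Q does not exceed the supplied chi-square critical value."""
    return ljung_box_q(resid, max_lag) <= critical
```

For example, with 20 lags and an AR(1) model there are 19 degrees of freedom, and the tabulated upper-tail value at 0.025 is about 32.85; that number is quoted here from standard tables as an illustration, not from the paper.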

Results and Discussion
The correlograms for two stations (Jos and Port Harcourt) are shown in Figure 1. It is evident that the correlograms show alternating spikes with repeated patterns and a constant period, indicating the presence of seasonal fluctuation. Harmonic analysis of the series showed that the first two harmonics were significant for all the stations. Table 1 and the sample autocorrelation function plot of the modified series shown in Figure 3 indicate that the series was made stationary. For illustration, Figure 3 shows the correlogram for Maiduguri, a station with significant autocorrelation, and the correlogram for Jos, a station with insignificant autocorrelation. For Maiduguri, the autocorrelations for the first few lags fell outside the 95% confidence bounds, showing significant persistence, whereas for Jos the autocorrelation values at all lags fell within the 95% confidence bounds, showing insignificant persistence in radiation anomalies at this station. This observation suggests that the conventional autoregressive model may be employed to forecast monthly mean daily global radiation anomalies at the nine stations where autocorrelation was found to be significant, but not at the other five stations with insignificant autocorrelation.
Following the method outlined in Section 4.0 for identifying the structure and order of the model (e.g. see Figure 3 for Maiduguri), we tentatively identify the non-seasonal autoregressive model of order 1 given by:
$$Z_t = \phi_1 Z_{t-1} + \varepsilon_t \tag{16}$$
The point estimates of the parameter $\phi_1$ in model (16) for the stations are presented in Table 2. The statistical significance of the model parameters was tested using the t-test for the study stations, and the results are also presented in Table 2. The fact that the absolute t values at Sokoto, Nguru, Kano, Maiduguri, Bauchi, Yola, Minna, Ibadan, and Benin are higher than the critical value $t_{0.975}$ from the Student's t distribution table causes us to reject the null hypothesis of zero autocorrelation at the preset significance level $\alpha = 0.05$. Thus we can conclude that the estimated parameters at those stations are statistically significant. However, a comparison of the absolute t values of the parameters at Jos, Bida, Ikeja, Enugu and Port Harcourt with $t_{0.975}$ causes us not to reject the null hypothesis of zero autocorrelation, since the computed t statistics at those stations are lower than $t_{0.975}$. Thus, we conclude that the parameter estimates at those stations are not statistically significant.
The Ljung-Box statistic of the error series was computed for all the study stations and the results are presented in Table 3. The computed Q statistic was found to be lower than the critical value from the $\chi^2$ distribution table at the preset significance level of 0.05 and 19 degrees of freedom. This causes us not to reject the null hypothesis of zero autocorrelation, or randomness, in the error series. In addition, the autocorrelation function (ACF) of the model residuals plotted for all the stations revealed that the residuals are random, with the autocorrelations falling within the 95% confidence interval around zero. Figure 4 illustrates the residual autocorrelation plots for Maiduguri, Bida, Ibadan and Benin. This is the desired result, as an effective autoregressive model should explain persistence and yield random residuals (Chatfield, 2004). Thus, we can conclude that at the preset significance level of 0.05 we cannot reject the adequacy of the autoregressive models for forecasting purposes. The plots in Figure 5 illustrate a great deal of agreement between predicted and measured values of the monthly mean global solar radiation at Sokoto (a sub-sahelian station), Minna (a midland station), Ibadan (a Guinea savannah station) and Benin (a coastal station), indicating that the observed data are well monitored by the model for those stations. However, errors in intensity are common, especially around turning points. This could be because the model did not take into account the effect of weather phenomena such as cloud cover and varying atmospheric constituents in estimating the global solar radiation. The performance of the model was evaluated by computing the coefficient of determination ($R^2$), the mean bias error (MBE) and the root mean square error (RMSE) of prediction, and the results are presented in Table 4.
Generally, the model gave relatively high $R^2$ values and low RMSE values, especially for stations within the sub-sahelian region. However, similar performance is seen at Minna (55.22%) and Yola (86.92%) in the midland region, as well as at Ibadan (43.46%), a Guinea savannah station. The situation in the coastal region is different, with poorer performance indicated by lower $R^2$ and higher RMSE values. This tends to suggest that the model is region selective, performing best in the sub-sahelian region but very poorly in the coastal region. The MBE values showed that the model slightly underestimated radiation at Kano, Minna, Ikeja and Enugu but slightly overestimated it at most of the other stations studied.
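The three performance measures can be computed as in the Python sketch below. The exact formulas behind Table 4 are not given in the text, so the definitions here are assumptions: MBE is taken as the mean of predicted minus measured values (negative values then indicate underestimation), RMSE as the square root of the mean squared difference, and the coefficient of determination as one minus the ratio of the error sum of squares to the total sum of squares. The function name is hypothetical.

```python
import math


def performance(measured, predicted):
    """Return (R2, MBE, RMSE) for paired measured and predicted values."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mbe = sum(errors) / n                          # mean bias error
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_obs = sum(measured) / n
    sst = sum((m - mean_obs) ** 2 for m in measured)   # total sum of squares
    sse = sum(e * e for e in errors)                   # error sum of squares
    r2 = 1 - sse / sst
    return r2, mbe, rmse
```

Multiplying the first returned value by 100 gives a percentage comparable in form to the figures quoted for the individual stations.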

Conclusions
The solar radiation data taken from 14 meteorological stations in Nigeria has been utilized to develop self-predicting models for monthly mean daily global solar radiation forecast over Nigeria. After removing the deterministic component through harmonic analysis, it was found that the stochastic component of the monthly mean daily global solar radiation over most of the study locations may well be monitored by the first-order autoregressive model, AR(1). The situation at Jos, Bida, Ikeja, Enugu and Port Harcourt however suggests that the underlying mechanism generating the stochastic series at those stations is not autoregressive. In addition, a simulation of the measured global solar radiation revealed that the model did well for the stations within the sub-sahelian region but not so well for the stations in the coastal region. Hence, the model is recommended for use to forecast solar radiation over all the stations within the sub-sahelian region but not appropriate for the coastal stations. The situation in the midland region showed that the model is station specific in terms of performance whereas it is difficult to conclude on the Guinea savannah region with result from only one station.