A Hybrid Approach Based on Seasonal Autoregressive Integrated Moving Average and Neural Network Autoregressive Models to Predict Scorpion Sting Incidence in El Oued Province, Algeria, From 2005 to 2020

Background: This study was designed to find the best statistical approach to scorpion sting predictions. Study Design: A retrospective study. Methods: Multiple regression, seasonal autoregressive integrated moving average (SARIMA), neural network autoregressive (NNAR), and hybrid SARIMA-NNAR models were developed to predict monthly scorpion sting cases in El Oued province. The root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) were used to quantitatively compare different models. Results: In total, 96909 scorpion stings were recorded in El Oued province from 2005 to 2020. The incidence rate decreased gradually until 2012, and only slight fluctuations have been noted since then. Scorpion stings occurred throughout the year, with peaks in September followed by July and August and troughs in December and January. Sting cases were not evenly distributed across demographic groups; the most affected age group was 15-49 years, and males were more likely to be stung. Of the reported deaths, more than half were in children 15 and younger. Scorpion activity was conditioned by climate factors, and temperature had the highest effect. The SARIMA(2,0,2)(1,1,1)$_{12}$, NNAR(1,1,2)$_{12}$, and SARIMA(2,0,2)(1,1,1)$_{12}$-NNAR(1,1,2)$_{12}$ were selected as the best-fitting models. The RMSE, MAE, and MAPE of the SARIMA and SARIMA-NNAR models were lower than those of the NNAR model in fitting and forecasting; however, the NNAR model could produce better predictive accuracy. Conclusion: The NNAR model is preferred for short-term monthly scorpion sting predictions. An in-depth understanding of the epidemiologic triad of scorpionism and the development of predictive models ought to establish enlightened, informed, better-targeted, and more effective policies.


Seasonal Autoregressive Integrated Moving Average Model
The Box-Jenkins approach to the analysis of time series data, also known as autoregressive integrated moving average (ARIMA) modelling, is one of the most widely used predictive techniques in epidemiological surveillance.
Given a stationary time series $y_t$ ($t = 1, \dots, n$), the SARIMA model, denoted by SARIMA$(p, d, q)(P, D, Q)_s$, can be expressed by the following difference equation:

$$\phi_p(B)\,\Phi_P(B^s)\,\nabla^d \nabla_s^D\, y_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t$$

where the backward shift operator $B$ is defined as $B^k y_t = y_{t-k}$ and $s$ represents the seasonality period. In addition, $d$, $D$, and $p$ denote the number of nonseasonal differences, the degree of seasonal integration, and the number of AR terms, respectively. Further, $P$, $q$, and $Q$ are the degree of seasonal AR terms, the number of MA terms, and the degree of the seasonal MA model, respectively. Furthermore, $\nabla = 1 - B$, $\nabla_s = 1 - B^s$, and $\varepsilon_t$ denote the differencing operator, the seasonal differencing operator, and the white noise process, respectively. The polynomials $\phi_p(B)$, $\Phi_P(B^s)$, $\theta_q(B)$, and $\Theta_Q(B^s)$ are the AR, the seasonal AR, the MA, and the seasonal MA polynomials, respectively.
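As a worked illustration of the difference equation, consider a deliberately simple case that is not the model selected in this study: a SARIMA(1,0,0)(0,1,0)$_{12}$ model for a series $y_t$ with backward shift operator $B$ and white noise $\varepsilon_t$. Assuming the usual sign convention $\phi_1(B) = 1 - \phi_1 B$, the operator form expands into an explicit recursion:

```latex
\begin{aligned}
(1 - \phi_1 B)(1 - B^{12})\, y_t &= \varepsilon_t \\
y_t - \phi_1 y_{t-1} - y_{t-12} + \phi_1 y_{t-13} &= \varepsilon_t \\
y_t &= y_{t-12} + \phi_1 \left(y_{t-1} - y_{t-13}\right) + \varepsilon_t
\end{aligned}
```

In words, under these assumed orders the current month equals the same month of the previous year plus a fraction $\phi_1$ of last month's year-on-year change.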
SARIMA modelling is best performed by following a protocol. The first step is to check the stationarity condition. The augmented Dickey-Fuller unit-root test was used for this purpose. To stabilize the variance of a time series that exhibits non-stationary variance, transformations such as the logarithm, square root, or reciprocal can be applied to each observation $y_t$ ($t = 1, \dots, n$). To stabilize the mean, an appropriate order of differencing can render a non-stationary series stationary. The orders $p$ and $q$ are identified from the lags at which the partial autocorrelation function and the autocorrelation function cut off, respectively.
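The two stabilizing steps can be sketched in a few lines of Python. This is a minimal illustration only: the monthly counts are hypothetical values, the helper name `difference` is introduced here, and the unit-root test itself is not implemented.

```python
import math

def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly counts covering two years (illustration only).
counts = [12, 15, 30, 55, 90, 140, 210, 205, 220, 120, 40, 14,
          10, 14, 28, 50, 85, 130, 200, 198, 215, 110, 38, 12]

log_counts = [math.log(c) for c in counts]       # variance-stabilizing transform
seasonal_diff = difference(log_counts, lag=12)   # applies (1 - B^12)
regular_diff = difference(seasonal_diff, lag=1)  # then applies (1 - B)
```

Each differencing pass shortens the series by its lag, so 24 observations leave 12 values after seasonal differencing and 11 after the additional first difference.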
Once orders are determined, the parameters may be estimated by a nonlinear optimization technique or the least squares procedure.
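To make the least squares idea concrete, the sketch below estimates the simplest possible case, an AR(1) model with intercept, by ordinary least squares on the lagged series. It is not a full SARIMA estimator; the function name and the noise-free test series are assumptions made for illustration.

```python
def fit_ar1_least_squares(series):
    """Estimate (c, phi) in y_t = c + phi * y_{t-1} + e_t by ordinary least squares."""
    x = series[:-1]   # lagged values y_{t-1}
    y = series[1:]    # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi

# Noise-free series generated with c = 2.0 and phi = 0.5, so the
# estimates should recover these values up to floating-point error.
series = [10.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])

c_hat, phi_hat = fit_ar1_least_squares(series)
```

Models with MA terms are not linear in their parameters, which is why a nonlinear optimization routine is generally needed for them.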

Neural Network Model
Given the observed nonlinear trend in the data, an artificial neural network (ANN) is one of the appropriate models for approximating various nonlinearities in the data. The single hidden layer feedforward network is the most widely applied model form for time series modeling and forecasting. This model is characterized by a network of three layers of simple processing units connected by acyclic links, namely, the input layer (input variables), the hidden layer (nodes between the input and output layers), and the output layer (output variables).
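The relationship between the output $y_t$ and the inputs $y_{t-1}, y_{t-2}, \dots, y_{t-p}$ is formalized as follows:

$$y_t = \alpha_0 + \sum_{j=1}^{q} \alpha_j\, g\!\left(\beta_{0j} + \sum_{i=1}^{p} \beta_{ij}\, y_{t-i}\right) + \varepsilon_t$$

where $p$ and $q$ are the number of input nodes and the number of hidden nodes, respectively.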
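The deterministic part of this relationship can be written as a short forward pass. The sketch below assumes the logistic transfer function used in this study, takes $\alpha_j$ as the output-layer weights and $\beta_{ij}$ as the hidden-layer weights (index 0 being the bias), and uses hypothetical weight values; it shows how one output is computed, not how the weights are trained.

```python
import math

def logistic(x):
    """Logistic transfer function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def network_output(lags, alpha, beta):
    """One-step output of a single-hidden-layer feedforward network.

    lags  : [y_{t-1}, ..., y_{t-p}]
    alpha : [alpha_0, alpha_1, ..., alpha_q]  (output bias and weights)
    beta  : q rows, each [beta_0j, beta_1j, ..., beta_pj]  (hidden bias and weights)
    """
    total = alpha[0]
    for alpha_j, beta_j in zip(alpha[1:], beta):
        activation = beta_j[0] + sum(b * y for b, y in zip(beta_j[1:], lags))
        total += alpha_j * logistic(activation)
    return total

# Hypothetical weights for p = 2 inputs and q = 2 hidden nodes.
alpha = [0.1, 0.8, -0.4]
beta = [[0.0, 0.5, -0.2],
        [0.3, -0.1, 0.4]]
prediction = network_output([1.2, 0.7], alpha, beta)
```

Setting every hidden weight to zero makes each hidden node output $g(0) = 0.5$, which gives a quick sanity check on the implementation.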

Hybrid SARIMA-NNAR Forecasting
Almost all real-world time series contain both linear and nonlinear correlation structures among the observations. Neither ARIMA nor ANN is universally suitable for all types of time series. Indeed, the approximation of nonlinear time series by ARIMA models or linear time series by ANN models may not be appropriate.

The NNAR$(p, P, k)_m$ model was employed in this study. It is one type of ANN model, in which the lagged values of the data are used as inputs to the neural network. An NNAR$(p, P, k)_m$ model has inputs $(y_{t-1}, y_{t-2}, \dots, y_{t-p}, y_{t-m}, y_{t-2m}, \dots, y_{t-Pm})$ and $k$ neurons in the hidden layer. An NNAR$(p, P, 0)_m$ model is equivalent to an ARIMA$(p, 0, 0)(P, 0, 0)_m$ model but without the restrictions on the parameters that ensure stationarity.

The hybrid method is proposed for combining the linear and nonlinear models. To perform this method, the original time series at time $t$ is assumed to be composed of an autocorrelated linear component ($L_t$) and a nonlinear component ($N_t$):

$$y_t = L_t + N_t$$
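The input structure of an NNAR model can be illustrated by building the lagged input rows directly. The helper below is a sketch with a hypothetical name (`nnar_inputs`) and a made-up series; it only assembles the inputs and targets and does not train a network.

```python
def nnar_inputs(series, p, P, m):
    """Build (inputs, target) pairs for an NNAR(p, P, k)_m model.

    Each input row is [y_{t-1}, ..., y_{t-p}, y_{t-m}, ..., y_{t-P*m}]
    and the target is y_t. Assumes at least one lag (p + P >= 1).
    """
    lags = list(range(1, p + 1)) + [j * m for j in range(1, P + 1)]
    lags = sorted(set(lags))  # avoid repeating a lag when p >= m
    rows = []
    for t in range(max(lags), len(series)):
        rows.append(([series[t - lag] for lag in lags], series[t]))
    return rows

# Made-up series of 30 monthly values; p = 1, P = 1, m = 12 mirrors
# the lag structure of an NNAR(1,1,k)_12 model.
series = list(range(100, 130))
rows = nnar_inputs(series, p=1, P=1, m=12)
```

With these settings each row pairs the previous month and the same month one year earlier with the current value, so the first usable target is the 13th observation.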
First, the SARIMA model is used to capture the linear component of the data. Thereafter, the NNAR model is used to capture the nonlinear component in the residuals. The residuals are expressed as $e_t = y_t - \hat{L}_t$, where $\hat{L}_t$ is the forecast value at time $t$ estimated by the SARIMA model, and are represented as follows:

$$e_t = f(e_{t-1}, e_{t-2}, \dots, e_{t-p}) + \varepsilon_t$$

where $p$ and $\varepsilon_t$ represent the optimal number of lags and the white noise, respectively. Additionally, $\hat{N}_t$ denotes the forecast value at time $t$ by the NNAR model, and $f$ is a nonlinear function determined by the multilayer perceptron.
The linear and nonlinear forecast values obtained from the SARIMA and NNAR models are then combined to obtain the final forecast:

$$\hat{y}_t = \hat{L}_t + \hat{N}_t$$
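The flow of the hybrid method (fit the linear model, model its residuals, add the two forecasts) can be sketched as follows. The two model functions are placeholders introduced only for this illustration: a seasonal-naive rule stands in for a fitted SARIMA model and a mean of recent residuals stands in for a fitted NNAR model. Neither is the model used in this study, and the series is synthetic.

```python
def residuals(actual, linear_fitted):
    """e_t = y_t - L_hat_t for two aligned sequences."""
    return [y - l for y, l in zip(actual, linear_fitted)]

def combine(linear_forecast, nonlinear_forecast):
    """y_hat_t = L_hat_t + N_hat_t."""
    return linear_forecast + nonlinear_forecast

# --- Placeholders standing in for fitted SARIMA and NNAR models ---
def placeholder_linear(history, m=12):
    """Seasonal-naive stand-in for the linear forecast: repeats y_{t-m}."""
    return history[-m]

def placeholder_nonlinear(past_residuals, p=2):
    """Stand-in for f(e_{t-1}, ..., e_{t-p}): mean of the last p residuals."""
    recent = past_residuals[-p:]
    return sum(recent) / len(recent)

# Synthetic series: a seasonal pattern plus a steady upward drift.
m = 12
series = [100 + 10 * (t % m) + t for t in range(36)]

# Step 1: linear one-step fits and their residuals.
fitted = [placeholder_linear(series[:t], m) for t in range(m, len(series))]
e = residuals(series[m:], fitted)

# Step 2: forecast the next residual, then combine both components.
l_hat = placeholder_linear(series, m)
n_hat = placeholder_nonlinear(e)
y_hat = combine(l_hat, n_hat)
```

In this synthetic case the linear stand-in misses the drift, which shows up as a constant residual; the residual model recovers it, and the combined forecast matches the next value of the generating rule.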

Measures of Accuracy
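Frequently used metrics to measure performance, estimate the accuracy of the forecasts, and compare different models are the RMSE, MAE, and MAPE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|, \qquad \mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$

where $n$, $\hat{y}_t$, and $y_t$ are the size of the test set, the forecasted observation, and the actual observation at time $t$, respectively. The model with the lowest values of these error measures is the better-performing model.

In the neural network model, $\alpha_j$ ($j = 0, 1, \dots, q$) and $\beta_{ij}$ ($i = 0, 1, \dots, p$; $j = 1, \dots, q$) represent the parameters of the model, and $g$ denotes the hidden layer transfer function. The logistic function, defined by $g(x) = \frac{1}{1 + e^{-x}}$, was used as the hidden layer transfer function. It is noteworthy that the neural network and the nonlinear AR model have similar representations.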
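The three error measures (RMSE, MAE, and MAPE) are straightforward to compute directly. The sketch below uses their standard definitions, with MAPE expressed in percent, and made-up actual and forecast values.

```python
import math

def rmse(actual, forecast):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)

def mae(actual, forecast):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / n

def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero.
    """
    n = len(actual)
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / n

# Made-up test-set values (illustration only).
actual = [100, 200, 400]
forecast = [110, 180, 400]

scores = {
    "RMSE": rmse(actual, forecast),
    "MAE": mae(actual, forecast),
    "MAPE": mape(actual, forecast),
}
```

Because MAPE divides by the actual value, it cannot be computed for months with zero recorded cases, which is worth checking before using it on sparse series.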
In SARIMA modelling, the residuals are analyzed to test the model for goodness-of-fit. The residuals should be uncorrelated with a mean of zero and follow a Gaussian distribution; moreover, the autocorrelations of the residuals should not be significantly different from zero. The correlation structure provides various choices for the $p$ and $q$ values, thus generating several candidate models. The best-fit model is selected on the basis of criteria such as the smallest Akaike information criterion (AIC), smallest Schwarz Bayesian information criterion (BIC), smallest root mean squared error (RMSE), smallest mean absolute error (MAE), smallest mean absolute percentage error (MAPE), and the highest adjusted $R^2$, in addition to the stationarity and invertibility conditions and the white noise condition for the residuals.
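One simple way to check whether residual autocorrelations differ significantly from zero is to compare each sample autocorrelation with the approximate 95% bounds of +/- 1.96/sqrt(n). This rule of thumb is an assumption of the sketch below, not necessarily the test applied in this study; the function names and the example residuals are hypothetical.

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given positive lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

def significant_lags(residuals, max_lag):
    """Lags whose sample autocorrelation lies outside +/- 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(len(residuals))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(residuals, k)) > bound]

# Strongly alternating residuals: clearly not white noise.
bad_residuals = [1.0, -1.0] * 20
flagged = significant_lags(bad_residuals, max_lag=3)
```

An adequate model should return an empty list, or very few flagged lags, for its residuals; a pattern like the alternating example indicates structure that the model has not captured.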