Indian COVID-19 Dynamics: Prediction Using Autoregressive Integrated Moving Average Modelling

Background: Forecasting the dynamics of Coronavirus Disease-19 (COVID-19) is a centrepiece of evidence-based disease management. Numerous mathematical modelling approaches have been used to predict the outcome of the pandemic, including data-driven, empirical and hybrid models. This study aimed to predict the evolution of COVID-19 in India using an autoregressive integrated moving average (ARIMA) model. Material and Methods: Real-time Indian data on cumulative cases and deaths of COVID-19 were retrieved from the Johns Hopkins dashboard. The dataset from 11 March 2020 to 25 June 2020 (n = 107 time points) was used to fit the ARIMA model. The model with the minimum Akaike Information Criterion (AIC) was used for forecasting. The predicted root mean square error (PredRMSE) and base root mean square error (BaseRMSE) were used to validate the model. Results: The ARIMA (1,3,2) and ARIMA (3,3,1) models fit best for cumulative cases and deaths, respectively, with the minimum AIC. The predictions of cumulative cases and deaths for the next 10 days, from 26 June 2020 to 5 July 2020, showed a continuously increasing trend. The PredRMSE and BaseRMSE of the ARIMA (1,3,2) model were 21,137 and 166,330, respectively. Similarly, the PredRMSE and BaseRMSE of the ARIMA (3,3,1) model were 668.7 and 5,431, respectively. Conclusion: It is proposed that data on COVID-19 be collected continuously and that forecasting continue in real time. COVID-19 forecasts can assist governments in resource optimisation and evidence-based decision making as the situation evolves.


Introduction
Coronavirus Disease-19 (COVID-19), caused by the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has caused a pandemic with global devastation to human life and health. SARS-CoV-2 is closely related to two bat-derived SARS-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21. It spreads from human to human via droplets or direct contact, and the mean incubation period is 6.4 days. [1] According to the World Health Organisation, 10,185,374 confirmed cases and 503,862 deaths had been recorded by 1 July 2020. [2] India had recorded 568,092 confirmed cases and 17,400 deaths by the same date. [3] The novelty and rapid spread of SARS-CoV-2 have challenged medical science across the disciplines of epidemiology, clinical signs and symptoms, pathophysiology, disease progression and evolution, management, and preventive protocols. Meanwhile, mathematical models have been used to predict the course of disease and mortality. [4][5][6][7] Such models have potential in disease management and in cost-effective preventive protocols that help allocate resources optimally. [8][9][10][11][12][13][14] An econometric model is proposed in the present study to predict and extrapolate the transmission of COVID-19. The model applies autoregressive integrated moving average (ARIMA) modelling to the epidemiological dataset of COVID-19. The objective of this study was to forecast COVID-19 cumulative cases and deaths for India. It is proposed that data on COVID-19 be collected continuously and that forecasting continue in real time to assist governments in evidence-based decision making.

Methods
A descriptive study was run to forecast COVID-19 evolution in the Indian subcontinent. The time series of COVID-19 cumulative cases and deaths from 11 March 2020 to 25 June 2020 (n = 107 time points) was used for prediction with ARIMA modelling. A useful ARIMA model depends on the number of sample time points, and a good series would have more than 50 sample points. [15]

Data acquisition
The Indian data on COVID-19 cumulative cases and deaths from 11 March 2020 to 25 June 2020 were sourced from the official website of the Johns Hopkins University Center for Systems Science and Engineering (https://systems.jhu.edu/) and its repository (https://github.com/CSSEGISandData/COVID-19). [16] Excel 2010 was used to build the database.
To validate the model, the dataset was divided into a training set and a validation set. The training dataset from 11 March 2020 to 4 June 2020 (n = 86) was used to fit the ARIMA model and estimate its parameters. The validation dataset from 5 June 2020 to 25 June 2020 was used to validate the model (Table 1).
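The chronological split and error-based validation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' MATLAB workflow: the series is synthetic, the helper names `chronological_split` and `rmse` are hypothetical, and only the set sizes (86 training points out of 107) come from the text. The study does not define how BaseRMSE was computed, so the last-value baseline used here is an assumption.

```python
import math

def chronological_split(series, n_train):
    """Split a time series into training and validation parts without shuffling."""
    if not 0 < n_train < len(series):
        raise ValueError("n_train must fall strictly inside the series")
    return series[:n_train], series[n_train:]

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Synthetic cumulative counts standing in for the 107 daily observations
# (11 March 2020 to 25 June 2020); these are NOT the study's data.
series = [50 * t * t for t in range(1, 108)]

train, validation = chronological_split(series, n_train=86)

# Assumed naive baseline: carry the last training value forward.
baseline = [train[-1]] * len(validation)

# Assumed stand-in forecast: extend the last observed daily increment.
step = train[-1] - train[-2]
forecast = [train[-1] + step * h for h in range(1, len(validation) + 1)]

print("training points:", len(train), "validation points:", len(validation))
print("forecast RMSE:", round(rmse(validation, forecast), 1))
print("baseline RMSE:", round(rmse(validation, baseline), 1))
```

Keeping the split chronological matters for time series: shuffling would let the model see observations that postdate the points it is asked to predict.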

Model description
The autoregressive integrated moving average model was used to forecast COVID-19 evolution. [17][18][19] The Box-Jenkins methodology was followed (Figure 1). The methodology encompasses three phases: identification, estimation and testing, and application. The identification phase involves data preparation and model selection. The data were analysed for trend and seasonal components. The Augmented Dickey-Fuller (ADF) unit root test was performed on the time series to examine stationarity. Logarithmic transformation and differencing were applied to stabilise the time series. Selecting an ARIMA model requires establishing the order of the autoregressive (AR) process 'p', the order of differencing 'd', and the order of the moving average (MA) process 'q'. The following system of equations defines the ARIMA(p,d,q) model:

$$W_t = \nabla^d X_t$$

$$W_t = \alpha_1 W_{t-1} + \alpha_2 W_{t-2} + \dots + \alpha_p W_{t-p} + a_t + \beta_1 a_{t-1} + \beta_2 a_{t-2} + \dots + \beta_q a_{t-q}$$
where $\{X_t\}$ is the original time series, $\{W_t\}$ is the time series obtained after differencing $\{X_t\}$, $\nabla^d$ is the difference operator of order $d$, $\alpha_1, \dots, \alpha_p$ and $\beta_1, \dots, \beta_q$ are parameters, and $a_t$ is white noise. The autocorrelation function (ACF) and partial autocorrelation function (PACF) were used to guide the moving average and autoregressive orders, respectively.
Supplemental Figure S1: Correlograms of the time series of cumulative cases of COVID-19: (a) sample autocorrelation function; (b) sample partial autocorrelation function.
The Akaike Information Criterion (AIC) was used to select the best model. The AIC considers the maximum log-likelihood (Log L) and the number of parameters as the criteria for best model selection. The following quantity is minimised:

AIC = -2 log (maximum likelihood) + 2k
where k = p + q + 1 is the number of parameters and 2k serves as a penalty term. The model with the minimum AIC value was considered the best. Parameters were estimated by maximum likelihood. The model assumptions were checked with residual diagnostics. Forecasts of cumulative cases and deaths were produced from the best selected ARIMA model.
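Two steps of the identification phase, differencing and AIC-based selection, can be illustrated with a short Python sketch. It assumes nothing beyond the formulas given in the text: the function names `difference` and `aic` are hypothetical, the cubic series is a toy example, and the log-likelihood values are invented for illustration rather than taken from the study.

```python
def difference(series, d=1):
    """Apply the difference operator d times, i.e. W_t = nabla^d X_t."""
    if d < 0:
        raise ValueError("d must be non-negative")
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out

def aic(log_likelihood, p, q):
    """AIC = -2 log L + 2k with k = p + q + 1, as defined in the text."""
    k = p + q + 1
    return -2.0 * log_likelihood + 2 * k

# Toy example: a cubic trend becomes constant after third-order differencing,
# which is why d = 3 can remove a strongly curved trend.
cubic = [t ** 3 for t in range(8)]
print(difference(cubic, d=3))

# Hypothetical maximised log-likelihoods for candidate (p, d, q) orders.
# These numbers are made up and are NOT the study's results.
candidates = {(1, 3, 2): -120.4, (3, 3, 1): -121.0, (2, 3, 2): -120.1}
scores = {order: aic(ll, order[0], order[2]) for order, ll in candidates.items()}

# The candidate with the smallest AIC is preferred.
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```

In this toy comparison the (2,3,2) candidate has the highest log-likelihood but loses to (1,3,2) once the 2k penalty for its extra parameter is added, which is the trade-off the criterion is designed to capture.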

Statistical analysis
The estimates from the ARIMA model were assumed to follow a normal distribution, and forecast variation was reported within 95 % confidence intervals. MS Excel 2010 was used to maintain the database and MATLAB 2016a (version 9.0.0.341360) was used for analysis. [20]

Results
The ADF test showed stationarity (p value < 0.05). The autocorrelation function and partial autocorrelation function were calculated, and correlograms were used to guide the autoregressive order and the moving average order of the ARIMA models (Supplemental Figure 1 [...]). The ARIMA (1,3,2) and ARIMA (3,3,1) models were chosen for cumulative cases (AIC = 2.32) and deaths (AIC = 108.75), respectively, as they had the lowest AIC values. Goodness of fit and model assumptions were tested with residual analysis. The graph of standardised residuals showed a random distribution of residuals. The middle values of the QQ plot lay on a straight line, and the autocorrelation and partial autocorrelation functions of the residuals stayed within random limits, indicating normally distributed and independent residuals (Supplemental Figures 3 and 4). The parameters of both ARIMA models were estimated (Supplemental Tables S1 and S2). [...] (Table 3).

Discussion
The dynamics of any infectious disease involve interactions of three elements: host, agent and environment. [21] India lies in the Northern and Eastern hemispheres, and its range of northern latitudes and eastern longitudes results in high environmental variability, with subsequent effects on human behaviour and society. [22] Mathematical models are central to forecasting the dynamics of infectious disease pandemics. The fundamental models used in prediction include data-driven, [23][24][25][26][27] empirical [28][29][30] and hybrid [31][32] models. The parameters of empirical models (incubation period, attack rate and recovery rate) follow probability distributions such as the Erlang and Poisson distributions. [21] The data-driven models include ARIMA, single-input and single-output (SISO) models [33] and AI-based models (machine learning and deep learning techniques). [24] The significance of the ARIMA model lies in modelling nonstationary time series. The ARIMA model assumes that trends will continue indefinitely into the future, whereas empirical models assume convergence. A few studies used nonparametric models such as Fourier decomposition methods to predict turn-around dates of the epidemic, and the results agreed with popular SIR models. [34]
SARS-CoV-2 was first diagnosed in India on 30 January 2020. By 1 July 2020, the cumulative cases and deaths in India had reached 568,092 and 17,400, respectively. [3] India ranked 57th among the 100 countries in the Global Health Security Index 2019, a scale that gauges preparedness for outbreaks of serious infectious diseases. [35] Forecasts of COVID-19 cases help government agencies prepare early for what follows. COVID-19 patients can be classified into confirmed, recovered, admitted and deceased cases. Each category has different importance from the management point of view. The requirement for hospital beds, medical equipment and hospital staff is a function of the number of admitted cases. Similarly, procurement of plasma for antibody therapy is a function of the number of recovered cases.
The present study forecasts the cumulative cases and deaths for India from 26 June 2020 to 5 July 2020. The forecast shows continuously increasing trends for both cumulative cases and deaths, with no decreasing trend until October 2020.
ARIMA models have been proposed by various authors for forecasting COVID-19 evolution. Data from 10 January 2020 to 20 February 2020 were used to fit ARIMA (1,0,4) and ARIMA (1,0,3) models for cumulative diagnoses and new diagnoses to forecast the next two days. [36] A hybrid ARIMA and wavelet model was proposed to forecast ten time points of cumulative cases for India from 5 April 2020 to 14 April 2020. The forecast showed oscillations, possibly due to the effects of lockdown. [32] In another data-driven approach, a bidirectional LSTM (long short-term memory) model gave a 15-day prediction of actual cases in India from 30 April 2020 to 14 May 2020 with an error of less than 3 %. [31] A Susceptible-Exposed-Infectious-Recovered (SEIR) model was used to predict cumulative cases of COVID-19 in India during and after lockdown. The model predicted a peak of around 43,000 cumulative cases in mid-May. However, a 7-21 % increase in the peak value was predicted for the post-lockdown period, reflecting relaxation of control strategies. [30] The evolution of COVID-19 in the most affected states of India was modelled using a SISO model. [33] The most severely affected states were Maharashtra, Gujarat, Tamil Nadu, Delhi and Rajasthan. [33][37][38][39] One time series analysis used genetic programming to predict COVID-19 evolution in India. On 13 May 2020, the cumulative cases and deaths were 80,000 and 2,500, respectively. The ten-day prediction gave 142,000 cumulative cases and 4,200 deaths on 23 May 2020. [40] Sujath used linear regression, multilayer perceptron and vector autoregression methods with 80 time points up to 10 April 2020 to predict confirmed cases, deaths and recovered cases from 11 April 2020 to 18 June 2020, although the predictions varied across methods and did not seem very accurate. [41] Yadav used six regression analysis-based machine learning models for prediction and found that the six-degree polynomial model predicted values very close to the observed data. [42] As the epidemic continues, the effects of different interventional strategies become embedded in the time series, and thus data-driven models seem to be more accurate than empirical models. Real-time data modelling is proposed for monitoring trends in the Indian subcontinent. Evidence-based interventions should be implemented to control the pandemic.

Conclusion
The present study produces encouraging results with the potential to serve as a good adjunct to existing models for continuous predictive monitoring of the COVID-19 pandemic. Forecasts of COVID-19 may assist public health authorities and governmental agencies in early preparedness and evidence-based decision making.

The limitations of the study
Firstly, the ARIMA model used in forecasting does not capture the non-linear and chaotic dynamics of the pandemic. Secondly, the parameter selection procedure must be repeated with each update of the time series.

Contribution of Authors
All authors contributed equally.