Evolving trajectories of COVID-19 curves in India: Prediction using autoregressive integrated moving average modeling.

The forecasting of Coronavirus Disease-19 (COVID-19) dynamics is a centerpiece of evidence based disease management. Numerous mathematical modeling approaches have been used to predict the outcome of the pandemic, including data driven, empirical and hybrid models. This study aimed to predict COVID-19 evolution in India using a model based on autoregressive integrated moving average (ARIMA). Real time data were retrieved from the Johns Hopkins dashboard from 11 Mar 2020 to 25 Jun 2020 (N = 107 time points) to fit the model. The ARIMA(1,3,2) and ARIMA(3,3,1) models fit best for cumulative cases and deaths respectively, with minimum Akaike Information Criterion. The prediction of cumulative cases and deaths for the next 10 days, from 26 Jun 2020 to 05 Jul 2020, showed a trend toward continuous increase. The predicted root mean square error (PredRMSE) and base root mean square error (BaseRMSE) of the ARIMA(1,3,2) model were 21137 and 166330 respectively. Similarly, the PredRMSE and BaseRMSE of the ARIMA(3,3,1) model were 668.7 and 5431 respectively. We propose that data on COVID-19 be collected continuously, and that forecasting continue in real time. COVID-19 forecasts can assist governments in resource optimization and evidence based decision making for a subsequent state of affairs.


Introduction
Coronavirus Disease-19 (COVID-19), caused by the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has caused a pandemic with global devastation to human life and health. SARS-CoV-2 is closely related to two bat-derived SARS-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21. It spreads from human to human via droplets or direct contact, and the mean incubation period is 6.4 days [1]. According to the World Health Organization, 10,185,374 confirmed cases and 503,862 deaths had been recorded by 1 Jul 2020 [2]. India recorded 568,092 confirmed cases and 17,400 deaths by the same date [3]. The novelty and rapid spread of SARS-CoV-2 have challenged medical science across epidemiology, clinical signs and symptoms, pathophysiology, disease progression and evolution, management, and preventive protocols. Meanwhile, mathematical models have been used to predict the course of disease and mortality [4]. Such models have potential in disease management and in cost effective preventive protocols that can help allocate resources optimally.
An econometric model is proposed in the present study to predict and extrapolate the transmission of COVID-19. The model applies autoregressive integrated moving average modeling to the epidemiological dataset of COVID-19. Our objective is to forecast COVID-19 cumulative cases and deaths for India. We propose that data on COVID-19 be collected continuously, and that forecasting continue in real time in order to assist governments in evidence based decision making.

Methods
A retrospective cohort study was conducted to forecast COVID-19 evolution in the Indian subcontinent. The time series of COVID-19 cumulative cases and deaths from 11 Mar 2020 to 25 Jun 2020 (N = 107 time points) were used for prediction with autoregressive integrated moving average modeling. A useful ARIMA model depends on the number of sample time points, and a good series would have more than 50 sample points [5].

Data acquisition:
The Indian data on COVID-19 cumulative cases and deaths from 11 March 2020 to 25 Jun 2020 were sourced from the official website of the Johns Hopkins University Center for Systems Science and Engineering. [...] Jun 2020 (N = 21) was used to validate the model (Table 1).

Model description
The autoregressive integrated moving average model was proposed to estimate the forecast of COVID-19 evolution [7][8][9]. We followed the Box-Jenkins methodology (Figure 1). The methodology encompasses three phases: identification, estimation and testing, and application. The identification phase involves data preparation and model selection. The data were analyzed for trends and seasonal components. The Augmented Dickey-Fuller (ADF) unit root test was performed on the time series to examine stationarity. Logarithmic transformation and differencing operations were performed to stabilize the time series. Selecting an ARIMA model required establishing the order of the autoregressive (AR) process, 'p', the order of differencing, 'd', and the order of the moving average (MA) process, 'q'. In the ARIMA(p,d,q) model, {X_t} is the original time series, W_t is the time series obtained by applying the difference operator ∇^d to {X_t}, α_p and β_q are the autoregressive and moving average parameters, and a_t is white noise. The autocorrelation function (ACF) and partial autocorrelation function (PACF) were used to guide the moving average and autoregressive orders respectively. The Akaike Information Criterion (AIC) was used to select the best model. The AIC weighs the maximized log likelihood (Log L) against the number of parameters, and the quantity to be minimized is AIC = -2 Log L + 2k, where k = p + q + 1 and 2k serves as a penalty term. The model with the minimum AIC value was considered the best. Parameters were estimated by the method of maximum likelihood. The assumptions of the model were checked with residual diagnostics. Forecasts of cumulative cases and deaths were produced from the best selected ARIMA model.
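For reference, the model described by the symbols above can be written in a standard form as follows. This is a sketch, not necessarily the authors' exact typesetting: the backshift operator B, the coefficient indices i and j, and the plus sign on the moving average terms are assumptions (some texts, including Box and Jenkins, write the moving average terms with a minus sign).

```latex
\begin{align}
W_t &= \nabla^{d} X_t = (1 - B)^{d} X_t, \qquad B X_t = X_{t-1} \\
W_t &= \sum_{i=1}^{p} \alpha_i W_{t-i} + a_t + \sum_{j=1}^{q} \beta_j a_{t-j}
\end{align}
```

Read this way, an ARIMA(1,3,2) model has p = 1, d = 3 and q = 2: the series is differenced three times, and each value of W_t depends on one lagged value of W and two lagged noise terms.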
The model was validated on the validation dataset by computing the Predicted Root Mean Square Error (PredRMSE) and comparing it with the Base Root Mean Square Error (BaseRMSE). Cumulative cases and deaths were forecast from 26 Jun to 05 Jul 2020.
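The comparison of the two error measures can be sketched in a few lines of Python. The text does not state how BaseRMSE is computed, so this sketch assumes a naive baseline that carries the last training value forward. The function names and the numbers are hypothetical and are not the study's data.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def base_rmse(actual, last_training_value):
    """RMSE of a naive baseline (assumption: last training value carried forward)."""
    return rmse(actual, [last_training_value] * len(actual))


# Hypothetical validation values, for illustration only.
observed = [110.0, 120.0, 130.0, 140.0]
model_forecast = [108.0, 121.0, 133.0, 138.0]

pred_rmse = rmse(observed, model_forecast)
baseline_rmse = base_rmse(observed, last_training_value=100.0)

# The model adds value only if its error is below the baseline error.
model_beats_baseline = pred_rmse < baseline_rmse
```

Under this reading, a PredRMSE well below the BaseRMSE indicates that the fitted model forecasts better than the naive reference.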

Statistical analysis
The estimates from the ARIMA model follow a normal distribution, and variation of the forecast was considered within 95% confidence intervals. MS Excel 2010 was used to maintain the database and MATLAB 2016a (version 9.0.0.341360) was used for analysis [10].
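A minimal sketch of how a 95% interval follows from the normality assumption, written in Python rather than the MATLAB used in the study. It assumes the forecast and its standard error are on the log scale (because the series were log-transformed) and are mapped back by exponentiation. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def forecast_interval(log_forecast, log_std_error, level=0.95):
    """Return (lower, point, upper) on the original scale.

    Assumes the forecast error is normally distributed on the log scale.
    """
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = math.exp(log_forecast - z * log_std_error)
    upper = math.exp(log_forecast + z * log_std_error)
    return lower, math.exp(log_forecast), upper


# Hypothetical log-scale forecast and standard error, for illustration only.
lower, point, upper = forecast_interval(math.log(500000.0), 0.05)
```

Because the interval is symmetric on the log scale, it is asymmetric after back-transformation: the upper limit lies further from the point forecast than the lower limit.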

Results
To achieve stationarity, the datasets (cumulative cases and deaths) were log-transformed, as they showed upward trends. Difference operations were then performed on the cumulative cases (d = 3) and deaths (d = 2) datasets. The Augmented Dickey-Fuller (ADF) test showed stationarity (P value < 0.05). The autocorrelation and partial autocorrelation functions were calculated, and correlograms were used to guide the autoregressive order and the moving average order of the ARIMA models (Supplemental Figures 1 and 2). The ARIMA(1,3,2) model for cumulative cases (AIC = 2.32) and the ARIMA(3,3,1) model for deaths (AIC = 108.75) were chosen as having the lowest AIC values. Goodness of fit and model assumptions were tested with residual analysis. The graph of standardized residuals showed a random distribution of residuals. The middle values of the QQ plot lay on a straight line. The autocorrelation and partial autocorrelation functions of the residuals were within random limits. Together these indicated normally distributed and independent residuals (Supplemental Figures 3 and 4). The parameters of both ARIMA models were estimated (Supplemental Tables S1 and S2). [...] (Table 2). Forecasts of cumulative cases and deaths were estimated with 95% confidence intervals (Figures 2 and 3; Table 3).
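The transformation steps described above can be illustrated with a short Python sketch (the study itself used MATLAB). It applies a natural logarithm followed by d rounds of first differencing. The helper name and the sample series are hypothetical, and the ADF stationarity test is not reproduced here.

```python
import math


def log_difference(series, d):
    """Log-transform a positive series, then difference it d times."""
    values = [math.log(x) for x in series]
    for _ in range(d):
        # Each pass replaces the series with its consecutive differences.
        values = [b - a for a, b in zip(values, values[1:])]
    return values


# Hypothetical cumulative counts that double at every step.
cumulative = [100, 200, 400, 800, 1600]

once = log_difference(cumulative, d=1)   # constant growth rate: log(2) per step
twice = log_difference(cumulative, d=2)  # differences of a constant: about zero
```

Each round of differencing shortens the series by one point, so differencing a 107-point series three times leaves 104 points for model fitting.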

Discussion
The dynamics of any infectious disease involve interactions of three elements: host, agent and environment [11]. India lies in both the Northern and Eastern hemispheres, between latitudes 8°4' and 37°6' north and longitudes 68°7' and 97°25' east, which results in high environmental variability and subsequent effects on human behavior and society [12]. Mathematical models are central to forecasting the dynamics of infectious disease pandemics. The fundamental models used in prediction include data driven, empirical and hybrid models [20][21][22]. The parameters of empirical models (incubation period, attack rate and recovery rate) follow probability distributions such as the Erlang and Poisson distributions [11]. Data driven models include ARIMA, the single-input single-output (SISO) model [23] and AI based models (machine learning and deep learning techniques) [14]. The significance of the ARIMA model lies in modeling non-stationary time series. The ARIMA model assumes that trends will continue indefinitely, whereas empirical models assume convergence. A few studies used non-parametric models such as Fourier decomposition methods to predict turnaround dates of the epidemic, and the results were found to agree with popular SIR models [24].
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) reached India with the first case diagnosed on 30 Jan 2020. By 01 Jul 2020 the cumulative cases and deaths in India had reached 568,092 and 17,400 respectively [3]. India ranked 57th among the 100 countries in the Global Health Security Index 2019, a scale to gauge preparedness for the outbreak of serious infectious diseases [25]. Forecasts of COVID-19 cases help government agencies prepare early for the subsequent state of affairs. COVID-19 patients can be classified into confirmed, recovered, admitted and death cases. Each category has a different importance from the management point of view. The requirement for hospital beds, medical equipment and hospital staff is a function of the number of admitted cases. Similarly, procurement of plasma for antibody therapy is a function of the number of recovered cases.
The present study forecasts the cumulative cases and deaths for India from 26 Jun 2020 to 05 Jul 2020. The forecast shows continuously increasing trends for both cumulative cases and deaths, with no decreasing trend until October 2020.
The ARIMA model has been proposed by various authors for forecasting COVID-19 evolution. Data from 10 Jan 2020 to 20 Feb 2020 were used to fit ARIMA(1,0,4) and ARIMA(1,0,3) models for cumulative diagnoses and new diagnoses, to forecast the next two days [26]. An ARIMA and wavelet hybrid model was proposed to forecast ten time points of cumulative cases from 05 Apr 2020 to 14 Apr 2020 for India. The forecast showed oscillations, possibly due to the effects of lockdown [22]. In another data driven model, a bi-directional LSTM (long short term memory) model, a 15 day prediction of actual cases in India from 30 Apr 2020 to 14 May 2020 showed an error of less than 3% [21]. A Susceptible-Exposed-Infectious-Recovered (SEIR) model was used to predict cumulative cases of COVID-19 in India during and after lockdown. The model predicted a peak of cumulative cases of around 43,000 in mid-May. However, a 7-21% increase in the peak value of cases was predicted for the post lockdown period, reflecting relaxation of control strategies [20]. The evolution of COVID-19 in the most affected states of India was modeled using the SISO model [23]. The most severely affected states were Maharashtra, Gujarat, Tamil Nadu, Delhi and Rajasthan [23][27][28][29]. One time series analysis used genetic programming to predict COVID-19 evolution in India. On 13 May 2020, the cumulative cases and deaths were 80,000 and 2,500 respectively. The prediction for ten days later, 23 May 2020, was 142,000 cumulative cases and 4,200 deaths [30]. Sujath used linear regression, multilayer perceptron and vector autoregression methods with 80 time points up to 10 Apr 2020 to predict confirmed cases, deaths and recovered cases from 11 Apr 2020 to 18 Jun 2020, although the predictions varied across methods and did not seem very accurate [31]. Yadav used six regression based machine learning models for prediction and found that a sixth degree polynomial model predicted values very close to the observed data [32].
As the epidemic continues, the effects of different interventional strategies become embedded in the time series, and thus data driven models seem to be more accurate than empirical models. Real-time data modeling has been proposed for monitoring trends in the Indian subcontinent. Evidence based interventions should be implemented to control the pandemic.

Conclusion
The present study produces encouraging results with the potential to serve as a good adjunct to existing models for continuous predictive monitoring of the COVID-19 pandemic. Forecasts of COVID-19 may assist public health authorities and governmental agencies in early preparedness and evidence based decision making.

Limitations Of The Study
First, the ARIMA model used in forecasting does not capture the non-linear and chaotic dynamics of the pandemic. Second, the parameter selection procedure must be repeated with each time series update.

Declarations
Contribution of Authors: All authors contributed equally.
Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data Availability: Data are available on reasonable request from the corresponding author.

BaseRMSE: base root mean square error.
Table 3. Forecast of cumulative cases and death cases using ARIMA(1,3,2) and ARIMA(3,3,1) models respectively. The dataset used for prediction is from 11 Mar 2020 to 25 Jun 2020 (N = 107 time points).