ORIGINAL ARTICLE
Year : 2021  Volume : 14  Issue : 12  Page : 564-574

Predicting COVID-19 fatality rate based on age group using LSTM

Zahra Ramezani^{1}, Seyed Abbas Mousavi^{2}, Ghasem Oveis^{3}, Mohammad Reza Parsai^{4}, Fatemeh Abdollahi^{5}, Jamshid Yazdani Charati^{1}

^{1} Department of Epidemiology and Biostatistics, School of Health, Mazandaran University of Medical Sciences, Sari, Iran
^{2} Department of Psychiatry, Psychiatry and Behavioral Sciences Research Center, Addiction Institute, Mazandaran University of Medical Sciences, Sari, Iran
^{3} Health Vice-Chancellor of Mazandaran University of Medical Sciences, Sari, Iran
^{4} Control Disease Center, Mazandaran University of Medical Sciences, Sari, Iran
^{5} Department of Public Health, Psychiatry and Behavioral Sciences Research Center, Mazandaran University of Medical Sciences, Sari, Iran

Correspondence Address:

Objective: To predict the daily incidence and fatality rates based on long short-term memory (LSTM) in 4 age groups of COVID-19 patients in Mazandaran Province, Iran.

Methods: To predict the daily incidence and fatality rates by age group, this epidemiological study was conducted based on the LSTM model. All COVID-19 data were collected daily at Mazandaran University of Medical Sciences from February 22, 2020 to April 10, 2021 for training the LSTM model. We defined 4 age groups, i.e., patients under 29, between 30 and 49, between 50 and 59, and over 60 years old. Then, LSTM models were applied to predict the trend of daily incidence and fatality rates from 14 to 40 days in the different age groups. The results of the different methods were compared with each other.

Results: This study evaluated 50 826 patients and 5 109 deaths with COVID-19, recorded daily in 20 cities of Mazandaran Province. Among the patients, 25 240 were females (49.7%), and 25 586 were males (50.3%).
The predicted daily incidence rates on April 11, 2021 were 91.76, 155.84, 150.03, and 325.99 per 100 000 people for the 4 age groups, respectively; for the fourteenth day (April 24, 2021), the predicted daily incidence rates were 35.91, 92.90, 83.74, and 225.68 per 100 000 people in each group. Furthermore, according to the established LSTM model, the predicted average daily incidence rates over 40 days for the 4 age groups were 34.25, 95.68, 76.43, and 210.80 per 100 000 people, and the daily fatality rates were 8.38, 4.18, 3.40, and 22.53 per 100 000 people. The findings demonstrated daily incidence and fatality rates of 417.16 and 38.49 per 100 000 people for all age groups over the next 40 days.

Conclusions: The results highlighted the proper performance of the LSTM model for predicting the daily incidence and fatality rates. It can clarify the path of spread or decline of the COVID-19 outbreak and the priority of vaccination in age groups.

1. Introduction

Coronavirus disease 2019 (COVID-19) has urged many countries to control its spread through social distancing, masking, and tracing the people who have been in contact with an infected person[1],[2],[3]. Many scientific and medical studies have investigated how to prevent its spread[4],[5]. However, one of the most important issues is predicting the epidemic trend of COVID-19[6],[7]. Although traditional time series methods work well with time-dependent sequences of observations, they have many limitations. For example, outliers can cause biased estimation of model parameters; when a large number is estimated, direct human intervention and evaluation are necessary to select the final model[8]. Time series models are often linear, so they might not explain nonlinear behavior well. Many traditional statistical methods do not adapt well to newly arriving data and require periodic re-evaluation. Neural networks can overcome these limitations, or at least they have fewer problems than traditional statistical time series methods[9]. Although they are inherently nonlinear, they are also able to model linear patterns[8],[10].

Kırbaş et al. performed a comparative analysis in Turkey and employed AutoRegressive Integrated Moving Average (ARIMA), Nonlinear AutoRegressive Neural Network (NARNN), and long short-term memory (LSTM) methods to model the COVID-19 confirmed cases in Denmark, Belgium, Germany, France, the United Kingdom, Finland, Switzerland, and Turkey. They used six model performance metrics (including MSE, PSNR, NMSE, MAPE, and SMAPE) to choose the most precise model. The results of the first stage of their study confirmed LSTM as the most precise model. The second stage showed that it was successful in producing a 14-day forecast and that the growth rate would slightly drop in many countries[11]. In 2020, Arora et al.
conducted a study to predict and analyze positive cases of COVID-19 in India using deep learning-based models. To achieve their goal, they employed different LSTM models based on recurrent neural networks (RNNs), including Deep LSTM, Convolutional LSTM, and Bidirectional LSTM. Finally, they selected the LSTM model with minimal error to predict the daily and weekly cases[12]. Moreover, Rashed et al. proposed an LSTM architecture to predict the spread of COVID-19 considering various factors such as public mobility estimates and meteorological data, and applied it to data collected in Japan. They predicted the positive cases in six prefectures of Japan for different time frames[13]. Other studies have forecast new cases and deaths using vanilla, stacked, bidirectional, and multilayer LSTM models. Chatterjee et al. tried to limit the exponential spread to slow down the transmission rate (spread factor) and then assessed the risk factors associated with COVID-19. Their results indicated that vanilla, bidirectional, and stacked LSTM models outperformed multilayer LSTM models[14]. Albahli et al. applied a semantic analysis with three levels (negative, neutral, and positive) to measure people's feelings towards the pandemic and lockdown in the Gulf countries[15]. In another study by Odhiambo et al., an RNN with LSTM was compared to the traditional ARIMA method in countries with limited data availability, such as Kenya. The results demonstrated that the LSTM network was more precise than the traditional time series method when forecasting future systematic fatality risks[16].

Unlike previous studies, we predict the daily incidence and fatality rates in each age group in detail. The daily incidence rate is the ratio of the number of cases to the total population, multiplied by the WHO Standard Population and expressed per 100 000 people.
Likewise, the fatality rate is the ratio of the number of deaths to the total population, multiplied by the WHO Standard Population and expressed per 100 000 people. The advantage of this study is that it predicts the daily incidence and fatality rates of COVID-19 cases in different age groups, based on different populations, by LSTM in areas near the Caspian Sea. In this way, proper decisions can be made to prevent the spread of the disease and to prioritize vaccination. To predict the daily incidence and fatality rates from 14 to 40 days for each age group, we focused our analysis on the data recorded by Mazandaran University of Medical Sciences.

2. Materials and methods

2.1. Study design and data collection

To predict the daily incidence and fatality rates of COVID-19 by age group in Mazandaran Province, diagrams and descriptive statistics tables were used to describe the existing conditions. This helped us to investigate the effect of age on the increase in daily incidence and fatality rates. Then, the groups were compared in terms of prevalence and the prediction of daily incidence and fatality rates. As for modeling, we attempted to predict the daily incidence and fatality rates on a daily and a monthly basis. Thus, the training data were collected from 50 826 admitted patients and 5 109 deaths of COVID-19 in 20 cities in Mazandaran Province from February 22, 2020 to April 10, 2021. After we prepared the data, regression coefficients, confidence intervals, a correlation heatmap, and comparison graphs for daily incidence and fatality rates were presented for clear description and better decision making. Then, the traditional ARIMA model and the LSTM models were implemented for forecasting.

2.2. Proposed model

We used an expert-based standard checklist to collect data, including disease symptoms, demographic characteristics, history of disease, and other risk factors.
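
The rate definitions given in the Introduction can be illustrated with a short sketch. The exact standardization formula used in the study is not spelled out in the text, so the code below assumes a plain rate per 100 000 people for each age group and, as one common reading, a directly standardized overall rate that weights the group rates by standard-population shares. The function names, the counts, and the weights are hypothetical placeholders and are not WHO values or study data.

```python
def rate_per_100k(events, population):
    """Rate of events (cases or deaths) per 100 000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events / population * 100_000


def directly_standardized_rate(events_by_group, population_by_group, weights):
    """Weighted average of age-specific rates (direct standardization).

    `weights` are standard-population shares per age group; they are
    normalized here, so they need not sum to 1.
    """
    if not (len(events_by_group) == len(population_by_group) == len(weights)):
        raise ValueError("all inputs must have one entry per age group")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        w * rate_per_100k(e, p)
        for e, p, w in zip(events_by_group, population_by_group, weights)
    )
    return weighted / total_weight


# Hypothetical example with two age groups (not study data).
group_rates = [rate_per_100k(10, 100_000), rate_per_100k(30, 100_000)]
overall = directly_standardized_rate([10, 30], [100_000, 100_000], [0.6, 0.4])
```

Normalizing the weights inside the function keeps the result on the per-100 000 scale even when raw standard-population counts are passed instead of shares.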
This study attempted to predict the daily incidence and fatality rates in Mazandaran Province based on the WHO standard population[17]. Because the data form a time series and are large in volume, we used LSTM networks, which are widely applicable in time-dependent studies, for forecasting. Statistical analyses were done with SPSS software version 26 and Python version 3.7.

The LSTM model is an RNN in which the prediction for the next time unit is based on the current situation and previous knowledge[18],[19]. The LSTM network can also consider short-term and long-term correlations within the time series by using the hidden layer as a memory block, which can learn long-term dependencies of the content[20]. Each LSTM cell consists of input, output, and forget gates in a hidden layer. The internal memory of the LSTM cell stores only useful and relevant information. [Figure 1] depicts the structure of an LSTM network with 3 gates. The LSTM network is defined using the following equations:

{Figure 1}

[INLINE:2]

where x_{t} and h_{t} are the input and output vectors, respectively, f_{t} is the forget gate vector, c_{t} is the cell state vector, i_{t} is the input gate vector, o_{t} is the output gate vector, and W and b are the parameter matrices. By assigning different functions to the gates, the LSTM memory block can record complex feature correlations in short-term and long-term time series, which is a significant advantage over a plain RNN[21]. We should note that other appropriate transformations may be used if necessary to satisfy conditions and assumptions and to obtain better estimates. The data are divided into training and testing datasets, and finally, the prediction is made on the experimental data. The purpose of normalization is generally to reduce the computation time by shrinking the numbers. The mean squared logarithmic error (MSLE) criterion and the Adam optimizer were chosen for better forecasting. The lower the MSLE value, the better the model estimate.

3. Results

Before predicting the daily incidence and fatality rates, we compared the different age groups according to the available data from 20 cities in Mazandaran Province. COVID-19 case data were recorded daily from February 22, 2020 to April 10, 2021 at Mazandaran University of Medical Sciences. The daily incidence and fatality rates of the different age groups were calculated daily according to the WHO World Standard Population. [Table 1] indicates the characteristics and behavior of COVID-19. Among the patients infected with this virus, 25 240 were females (49.7%), and 25 586 were males (50.3%). A total of 5 109 patients died, among which 2 763 (54.1%) were women and 2 346 (45.9%) were men. [Table 2] shows the population of the province by age group and the population of urban/rural men and women. Of the total population, 1 581 594 were urban and 1 175 263 were rural. We classified the data into 4 age groups, i.e., patients under 29, between 30 and 49, between 50 and 59, and over 60 years old, in [Table 2]. The P-value is calculated based on the Chi-square test among the 4 age groups (P<0.001).

{Table 1}{Table 2}

Next, we analyzed the collected data to identify patterns and trends. [Table 3] examines the effects of several specific disease histories on the fatality age of COVID-19 patients. The coefficient estimates the marginal effect of a one-unit increase (a disease) in that independent variable on the dependent variable (age category), holding all other variables in the model constant. According to the disease history of the people of the region, the results show that cardiovascular disease, diabetes, and other diseases, including asthma, have the greatest impact on the age categories. COVID-19 patients with diabetes (regression coefficient 0.545) were at a higher risk in the age groups.
Although the coefficient on cardiovascular disease in the model (regression coefficient 0.610) is larger than the coefficient on diabetes (regression coefficient 0.545), it does not make sense to compare those coefficients directly. Other diseases, including asthma, rank next in terms of regression coefficients. Smaller regression coefficients indicate a lesser effect on the age categories. The negative coefficient of liver disease is due to the low frequency of this disease in the study population or the lack of registration of this type of disease in COVID-19 patients. Regression coefficients and confidence intervals are presented in [Table 3] to assess the significance level. The history of various diseases, such as diabetes, is significant at P<0.001.

{Table 3}

The correlation heatmap of the real COVID-19 data is depicted in [Figure 2]. The high correlation value r indicates that the number of new fatalities increases as age increases.

{Figure 2}

[Figure 3] shows the average daily incidence and fatality rates of the groups based on the World Standard Population per 100 000 people. [Figure 4]A shows the daily incidence trend of the registered cases in 20 cities in Mazandaran Province. [Figure 4]B and [Figure 4]C show the daily incidence and fatality rates for each age group in Mazandaran Province per 100 000 people. As shown in [Figure 4]C, which evaluates and compares the COVID-19 fatality rate in the 4 age groups in Mazandaran Province, patients over 60 and between 50 and 59 have the highest fatality rates according to the WHO World Standard Population.

{Figure 3}{Figure 4}

3.1. Time series ARIMA model

ARIMA is a time series prediction model that is a form of regression analysis and is used to forecast future trends in a time series dataset. The model captures the autocorrelation in the data and computes future values based on the correlations between previous values.
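
The stationarity check used in this section compares the Dickey-Fuller test statistic with critical values at the 1%, 5%, and 10% levels. A minimal sketch of that decision rule follows. The function name `adf_decision` is hypothetical, and the sketch assumes that the reported statistic (2.83) and critical values (3.45, 2.87, and 2.57) are negative, as Dickey-Fuller statistics normally are.

```python
def adf_decision(test_statistic, critical_values):
    """Apply the Dickey-Fuller decision rule at each significance level.

    The null hypothesis (unit root, i.e. non-stationary series) is rejected
    at a given level when the test statistic is smaller (more negative)
    than the critical value for that level.
    """
    return {
        level: test_statistic < critical
        for level, critical in critical_values.items()
    }


# Values taken from the text, with negative signs assumed.
statistic = -2.83
critical = {"1%": -3.45, "5%": -2.87, "10%": -2.57}

# True means "reject the unit root", i.e. treat the series as stationary.
stationary_at = adf_decision(statistic, critical)
```

Under these assumed signs the rule rejects the unit root only at the 10% level, which matches the conclusion drawn in the text.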
A traditional ARIMA model was applied to the COVID-19 data before the LSTM model was considered, and the results predicted by the ARIMA model are presented here. First, the Dickey-Fuller test is used to examine whether the time series is stationary. The null hypothesis (H0) of a unit root is rejected when the P-value of the Dickey-Fuller test is ≤ 0.05, indicating that the data do not have a unit root and are stationary. Equivalently, if the test statistic is less than the critical value, we reject the null hypothesis; if the test statistic is greater than the critical value, we fail to reject it. Here, the test statistic (-2.83) is greater than the 1% critical value (-3.45) and the 5% critical value (-2.87), so the data are not stationary at those levels. The test statistic is less than the 10% critical value (-2.57), so the data are stationary at that level. The data would have to be transformed to become stationary at the 1% and 5% levels. However, because the data are stationary at the 10% significance level, we apply the ARIMA model at that level. An ARIMA statistical model was used to predict the daily incidence trend of the COVID-19 outbreak in the time series [Figure 5].

{Figure 5}

Note that making a series stationary involves modeling and estimating the trend and seasonal changes in the series and removing them from it. Only then can forecasting techniques be applied to the data. In what follows, it can be seen that the LSTM models do not have the complexities of traditional time series methods and produce more accurate results that are closer to the actual data.

3.2. LSTM model

In this section, we describe the applied hyperparameters, the various LSTM models, and the loss functions considered for the proposed model. The optimizer explores specific configurations that speed up or slow down learning where this is beneficial.
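
The gate structure described in Section 2.2 can be illustrated with one forward step of an LSTM cell. The sketch below uses the standard textbook formulation with scalar inputs and states, which is an assumption and not necessarily the exact form used in this study. The function names and the parameter layout (one weight for the input, one for the previous output, and one bias per gate) are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a scalar LSTM cell (standard formulation).

    `params` maps each of "f", "i", "o", "c" to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w, u, b = params[name]
        return w * x_t + u * h_prev + b

    f_t = sigmoid(pre_activation("f"))      # forget gate
    i_t = sigmoid(pre_activation("i"))      # input gate
    o_t = sigmoid(pre_activation("o"))      # output gate
    c_hat = math.tanh(pre_activation("c"))  # candidate cell state

    c_t = f_t * c_prev + i_t * c_hat        # new cell state
    h_t = o_t * math.tanh(c_t)              # new output (hidden state)
    return h_t, c_t


# Run the cell over a short, made-up sequence of normalized rates.
example_params = {name: (0.5, 0.1, 0.0) for name in ("f", "i", "o", "c")}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.3]:
    h, c = lstm_step(x, h, c, example_params)
```

The forget gate scales how much of the previous cell state is kept and the input gate scales how much of the new candidate is written, which is how the cell retains both short-term and long-term information.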
The Adam optimizer uses a learning rate of 0.001, provides a reliable variant of the stochastic gradient descent algorithm, and computes adaptive learning rates for each parameter. Fifty epochs were specified so that the loss curve and its convergence could be observed during training. The main hyperparameters, including the sequence length, activation function, learning rate, batch size, epochs, optimizer, loss function, and n_hidden, are listed in [Table 4].

{Table 4}

The training set is 85% of the data, while the remaining 15% is used as the testing set[11]. We considered a prediction period of approximately 14 to 40 days for the testing data. More specifically, the data were split in two ways. The first split was composed of training data (from February 22, 2020 to April 10, 2021) and test data (the last 14 days, from April 11, 2021 to April 24, 2021). The second split was composed of training data (from February 22, 2020 to April 10, 2021) and test data (the last 40 days, from April 11, 2021 to May 25, 2021) for prediction analysis. [Table 5] illustrates the average performance results of various LSTM models. In this study, the differences in loss values between models are insignificant because sufficient data were available and each age group was investigated in more detail. Although the results show that vanilla, stacked, and bidirectional LSTM models outperform other LSTM models, we selected a simple LSTM model for faster training and prediction with lower loss. The MSLE loss function was selected as a suitable metric for training the LSTM model to predict the daily incidence and fatality rates. For models without data grouping, a stacked LSTM is more appropriate because it is a deeper model.

{Table 5}

Daily incidence and fatality rates of the real data are evaluated in [Table 6] for 12 consecutive months from March 20, 2020 (March 20 is the first day of the first month of the year in Iran).
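
The 85%/15% split and the MSLE loss described in this section can be sketched as follows. This is a minimal illustration, assuming a chronological split (earlier observations for training, later ones for testing) and the usual MSLE definition based on log(1 + y); the function names and the toy series are hypothetical.

```python
import math


def chronological_split(series, train_fraction=0.85):
    """Split a time series into training and testing parts without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean of (log(1 + y) - log(1 + y_hat))^2."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        (math.log1p(t) - math.log1p(p)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return sum(squared) / len(squared)


# Toy daily series of 100 values (not study data).
toy_series = [float(day) for day in range(100)]
train, test = chronological_split(toy_series)

# Compare the test part with a naive forecast that repeats the last training value.
naive_forecast = [train[-1]] * len(test)
naive_loss = msle(test, naive_forecast)
```

Because MSLE compares logarithms, it penalizes relative errors rather than absolute ones, which suits rates that range from single digits to several hundred per 100 000 people.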
Because the data were not well recorded or the disease was not diagnosed in the first days of the outbreak in the country, the daily incidence and fatality rates of the real data were calculated from March 20, 2020. In general, daily training data from February 22, 2020 (i.e., the first recorded data) were used to predict the COVID-19 outbreak with the LSTM model.

{Table 6}

[Table 6] separately displays the COVID-19 daily incidence rate for 12 consecutive months in the 4 groups. For example, the 10th month has the highest daily incidence rate, and the vulnerable category of patients over 60 years old has the highest rate, 405.53 per 100 000 people. In the same way, [Table 6] also shows the fatality rate in each age group for 12 consecutive months, indicating a trend similar to that of the daily incidence rate in the groups. The LSTM architecture was trained on the data from February 22, 2020 to April 10, 2021. [Figure 6] shows the trend of the training and validation loss values when predicting the confirmed cases and fatality rate in two age groups as examples. Similar results were achieved for the other groups. Group 1 refers to patients under 29, group 2 to those between 30 and 49, group 3 to those between 50 and 59, and group 4 to those over 60 years old. Then, we predicted the daily incidence and fatality rates for 14 to 40 days from April 11, 2021.

{Figure 6}

[Figure 7] and [Figure 8] illustrate the performance of the proposed model and its predictions by age group in Mazandaran Province. [Figure 7] shows the predicted values for COVID-19 patients in Mazandaran Province in the 4 age groups with the LSTM model for 14 days after the last training date. [Figure 8], in turn, depicts the LSTM prediction of the daily incidence rate in Mazandaran Province for the 4 age groups over 40 days.
The trend of the daily incidence rate in the training data (before April 11, 2021) is shown to the left of the vertical line, and the predicted trend of the daily incidence rate is shown after this line.

{Figure 7}{Figure 8}

[Table 7] shows the predicted cases and daily incidence rates for the four groups for 14 consecutive days, from April 11, 2021 to April 24, 2021. For a simpler and more meaningful representation of the predicted values for 40 consecutive days, the prediction trend of daily COVID-19 cases is shown in [Figure 8] for all 4 groups.

{Table 7}

In addition, the average predicted values of the daily incidence and fatality rates over 40 days are shown for each age category in [Table 8]. Predictions in stable conditions are very close to the actual values[22].

{Table 8}

4. Discussion

Previous studies have mainly focused on effective factors such as age, underlying diseases, and the fatality rate of COVID-19[23],[24]. Moreover, they investigated COVID-19 predictions and the fatality rate regardless of the incidence rate in age groups[12],[25]. For example, Sasson showed that the age pattern of COVID-19 fatality in different countries might indicate differences in population health, clinical care standards, or data quality[26]. Researchers have shown that COVID-19 is very common in elderly patients with underlying diseases such as cardiovascular disease, high blood pressure, and diabetes. Owing to the diversity in demographic statistics, underlying diseases, and health systems, the fatality rate of COVID-19 was predicted for 187 countries, ranging from 0.43% in Sub-Saharan Africa to 1.45% in Eastern Europe[27]. What distinguishes this research from other studies is the accurate prediction of incidence and fatality rates by age group using the LSTM deep learning technique. Furthermore, we achieved accurate results compared to those who worked on the general case without age grouping.
Thus, identifying the high-risk age group and the predicted values sheds light on the future course of the outbreak. A meta-analysis with a large number of patients highlights the determining effect of age on fatality. The data of this study were collected from patients in 20 cities near the Caspian Sea in Mazandaran Province, and the daily incidence and fatality rates of each age group were predicted in detail. Because the data form a time series and are large in volume, the researchers selected LSTM networks, which are widely applicable to time-dependent problems, for forecasting. Evaluation metrics are loss functions such as mean absolute error (MAE), mean squared error (MSE), mean squared logarithmic error (MSLE), binary cross-entropy, categorical cross-entropy, residual forecast error/forecast error, forecast bias/mean forecast error, root mean square error (RMSE), and the R^{2} score as adjusted R-squared for the model. To assess individual regression models, we applied the MAE, MSE, MSLE, and R^{2} regression metrics. The LSTM model is compiled with the Adam optimizer, the MSLE loss function, and accuracy.

In a comparative study of national report data from China, Italy, Spain, the United Kingdom, and New York State as of May 7, 2020, Bonanad et al. showed an overall fatality rate of 12.10%. The fatality rate varied between countries, with relevant thresholds at ages >50 and >60 years. The lowest fatality rate was in China (3.1%), and the highest rates were in the United Kingdom (20.8%) and New York State (20.99%). The fatality rate was <1.1% in patients aged <50 years, and it increased exponentially in older patients in the data recorded in the 5 countries. Besides, the highest fatality rate occurred in patients aged 80 years[24].

This study scrutinized 50 826 COVID-19 patients with 5 109 deaths in 20 cities of Mazandaran Province from February 22, 2020 to April 10, 2021.
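
The regression metrics named in the Discussion (MAE, MSE, RMSE, and R^{2}) can be computed as in the sketch below. It uses the standard definitions and the plain (unadjusted) coefficient of determination; the function name and the sample values are hypothetical and are not study data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R^2 for paired observed/predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)   # residual sum of squares
    mse = ss_res / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R^2 is undefined when the observed values are all identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


# Hypothetical observed and predicted daily rates per 100 000 people.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
metrics = regression_metrics(observed, predicted)
```

MAE and RMSE are expressed in the same units as the rates themselves, whereas R^{2} is unitless and summarizes how much of the variation in the observed values the predictions explain.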
The researchers assessed the mean standardized incidence and fatality rates by age group based on the training data available for 12 months. The results revealed that in each age group, that is, patients under 29, between 30 and 49, between 50 and 59, and over 60 years old, the standard incidence rates per 100 000 people were 31.27, 57.13, 28.70, and 70.69 in the first month, respectively. In the 12th month, the standard incidence rates were 61.18, 70.83, 52.92, and 193.92 in each age group, respectively. Moreover, the fatality rates in each age group in the first month were 2.32, 4.35, 6.08, and 33.97 per 100 000 people, while in the 11th month they were 1.73, 3.30, 6.28, and 55.58, and in the 12th month they were 0.53, 2.70, 2.33, and 31.07 per 100 000 people. The results demonstrate that the daily incidence rates fluctuate between months and that the incidence rates increase with age. In addition, we obtained the daily numbers of cases and deaths by age group. Finally, we predicted the standard incidence and fatality rates in each age group for the next 14 to 40 days. The predicted values were close to the real values. The daily incidence rates on April 11, 2021 were 91.76, 155.84, 150.03, and 325.99 per 100 000 people, respectively. In general, the average standard daily incidence rates for the 4 age groups per 100 000 people were 34.25, 95.68, 76.43, and 210.80 for the next 40 days, respectively. Correspondingly, the average daily fatality rates for the 4 age groups were 8.38, 4.18, 3.40, and 22.53, respectively. Although a fixed parameter cannot be a single factor, COVID-19 infections are inherently associated with the age pattern. In this article, all indices were based on the WHO standard population. Our calculations are also underestimates, because patients with mild COVID-19 were not included in the study.
Overall, the results show that COVID-19 is life-threatening not only for older adults but also for middle-aged people, and that high or low risk is predictable in the coming days.

Similar to any other study, this research is subject to several limitations. First, training the model with more data leads to better results when compared across different countries. In addition, the accuracy of the LSTM prediction improves when more parameters are considered instead of relying on the univariate trend of the time series data. Currently, this model can predict 14 to 40 days ahead with acceptable accuracy. Moreover, our calculations are underestimates because mild cases were not included in the study. In line with the purpose of the study, i.e., predicting the growth of coronavirus disease in different age groups, we applied the LSTM models. Since the results were obtained with limited data availability (i.e., 20 cities near the Caspian Sea in Mazandaran Province), the researchers also used the results of other studies conducted in different countries. However, information on transmission distance for the different variants was not available due to the lack of appropriate technology. This is recommended as a topic for future study, taking different age groups into account.

The results show that the main priority of preventive measures should be older patients, who are more susceptible to this disease. If public health measures reduce infection in older patients, fatality can be reduced significantly. By predicting the number of admitted patients and the fatality and incidence rates of patients in each age group, we can help curb the prevalence of COVID-19.

In conclusion, we predicted COVID-19 incidence and fatality rates by age group using the LSTM network based on the WHO population. The LSTM network predicted the number of confirmed cases and the incidence and fatality rates over 14 to 40 days.
For example, the incidence rate obtained for patients over 60 years old was 210.80 per 100 000 people. The results showed that the incidence and fatality rates of COVID-19 patients in Mazandaran Province are higher in the age group of 60 years and above than in the other groups. The prediction results show fluctuations in the incidence and fatality rates, though the values are accurately predicted for each age group. By differentiating age groups when predicting the numbers or rates of incidence and fatality, the researchers obtained accurate results compared to predictions made without differentiating groups. By predicting the incidence and fatality rates of the different groups, we can make better decisions about essential health measures as well as vaccination prioritization.

Conflict of interest statement

The authors declare that there is no conflict of interest.

Authors' contributions

ZR performed the research, designed the analysis, implemented the Python programming, analyzed and interpreted the data, and wrote and revised the manuscript. SAM contributed to COVID-19 data acquisition. GO participated in the discussion. MRP contributed to COVID-19 data acquisition. FA participated in the discussion. JYCh designed the research, contributed to the interpretation, and edited the manuscript.

References


