### dataset

Our study focuses on the use of a regression model to predict COVID-19 trends using data from the Hospital Insular de Gran Canaria (Spain). This dataset spans from the beginning of 2020 to March 29, 2022, and in the simplest case includes only two inputs: the date and the number of new daily COVID-19 cases. Despite this simplicity, our analysis demonstrates the ability of the model to accurately predict future COVID-19 trends, identifying temporal patterns, seasonality, and the impact of interventions. This work underlines the value of accessible data and shows how even minimal input data can yield valuable insights for research and analysis. As mentioned above, the database is owned by the Government of the Canary Islands (Spain) and the data are public^{11}; they can be accessed or downloaded from https://opendata.sitcan.es/dataset/capacidad-asistential-covid-19.

### performance index

A set of statistical parameters has been used to evaluate the accuracy of the model. These parameters were selected because of their widespread use in the literature, which allows our results to be compared with the current state of the art. The most prominent are *RMSE*, *MAE*, *MAPE*, and *R*^{2}, which measure the accuracy of the predictions as well as their dispersion and correlation. Their mathematical expressions are shown in the following equations, where \({y}_{i}\) are the observed values, \(\widehat{{y}_{i}}\) the estimated values, and \(\overline{y}\) the mean of the observed values^{12,13,14,15}:

Mean square error (*MSE*):

$$MSE=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}$$

(1)

Root mean square error (*RMSE*):

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}$$

(2)

Mean absolute error (*MAE*):

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i}-\widehat{{y}_{i}}\right|$$

(3)

Mean absolute percentage error (*MAPE*):

$$MAPE=\frac{100}{n}\sum_{i=1}^{n}\frac{\left|{y}_{i}-\widehat{{y}_{i}}\right|}{{y}_{i}}$$

(4)

Coefficient of determination (*R*^{2}):

$$R^{2}=1-\frac{\sum_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{\sum_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}$$

(5)
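As a sketch of Eqs. (1)–(5), the five indices can be computed with NumPy; the function name `regression_metrics` is illustrative, not part of the original work:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE and R^2 following Eqs. (1)-(5)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)                        # Eq. (1)
    rmse = np.sqrt(mse)                            # Eq. (2)
    mae = np.mean(np.abs(err))                     # Eq. (3)
    mape = 100.0 * np.mean(np.abs(err) / y_true)   # Eq. (4); undefined if y_true has zeros
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                     # Eq. (5)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

Note that MAPE divides by the observed values, so days with zero reported cases must be excluded or handled separately.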

### data preprocessing

To perform data preprocessing and labeling, the “new daily cases” variable was separated into one vector and the date variable into another. A labeling window of varying size was then used to assign a label, “Ytrain”, to the “new daily cases” values; these “Ytrain” values depend on the size of the window. Thus, for a window n = 2, the “new daily cases” of dates n = 1 and n = 2 are grouped in the first row of the “Xtrain” vector, and their “Ytrain” value is that of the immediately following date, i.e., n = 3. Next, with a step of 1, dates n = 2 and n = 3 are grouped in the second row, and the “Ytrain” value is that of date n = 4.

This study was conducted with window sizes ranging from n = 1 to n = 20 to test which window best suited the data and could best handle the steep slopes of the COVID-19 waves. Figure 4 shows a scheme of the starting vectors “date” and “new daily cases” and the labeling process for windows n = 2 and n = 5. The dataset is available from the link given in the previous section.
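The windowing and labeling described above can be sketched as follows; `make_windows` is a hypothetical helper name and the case counts are illustrative:

```python
import numpy as np

def make_windows(series, window, step=1):
    """Slice a 1-D series into (Xtrain, Ytrain) pairs: each row of X holds
    `window` consecutive daily-case values, and its label y is the value of
    the date immediately following the window."""
    X, y = [], []
    for start in range(0, len(series) - window, step):
        X.append(series[start:start + window])
        y.append(series[start + window])
    return np.array(X), np.array(y)

# Window n = 2, step = 1: days (1, 2) -> label day 3, days (2, 3) -> label day 4, ...
cases = [10, 12, 15, 20, 30]
X, y = make_windows(cases, window=2)
```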

### network architecture

To accurately predict COVID-19 data, an architecture has been designed that is capable of analyzing time series and, through the use of deep learning, capturing the differences in slope generated by the different waves. The layers used in the overall architecture are described in detail below.

#### LSTM-BiLSTM

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that is particularly useful for modeling sequential data. Algorithms of this type have been applied to a wide variety of tasks, including speech recognition, natural language processing, and time series forecasting^{16}. By using memory cells, LSTMs can retain useful data from the current or previous steps and use it in the future. To do so, they employ algorithmic gates that decide which information to retain for future use.

These networks can also be combined to improve the overall architecture. Variants with different functions exist, such as the bidirectional LSTM (BiLSTM), gated recurrent units (GRU), or newer algorithms built around the attention layer, called “transformers”, introduced by Vaswani et al. in their 2017 work “Attention Is All You Need”^{17}. In the case of BiLSTM, the only difference is the relationship between states: the connections are bidirectional, so the network can take into account data from the previous state as well as the next state.

The LSTM consists of a memory cell and three gates, which can be expressed mathematically as follows^{18}:

**Input gate:** the layer responsible for updating the state of the network through a sigmoid function.

$$i_{t} = \sigma \left( {W_{i} \cdot \left[ {h_{t-1} , x_{t} } \right] + b_{i} } \right)$$

(6)

\({W}_{i}\) is the weight matrix of the input gate, \({b}_{i}\) the corresponding bias, \({x}_{t}\) the current time step, and \({h}_{t-1}\) the output of the previous time step. The sigmoid σ takes values in [0, 1], where 0 represents completely discarding the data and 1 completely retaining it^{16,18}.

**Forget gate:** the layer responsible for deciding whether to retain or discard information. This is the first stage of the LSTM.

$$f_{t} = \sigma \left( {W_{f} \cdot \left[ {h_{t-1} , x_{t} } \right] + b_{f} } \right)$$

(7)

\({W}_{f}\) is the weight matrix of the forget gate, \({b}_{f}\) the corresponding bias, \({x}_{t}\) the current time step, and \({h}_{t-1}\) the output of the previous time step.

**Output gate:** this is where the output information is determined. The output is based on a filtered version of the cell state: a sigmoid layer decides which parts to output, and the result is multiplied by the cell state passed through a tanh function^{18}:

$$o_{t} = \sigma \left( {W_{o} \cdot \left[ {h_{t-1} , x_{t} } \right] + b_{o} } \right)$$

(8)

$$h_{t} = o_{t} \cdot \tanh \left( C_{t} \right)$$

(9)

\({W}_{o}\) is the weight matrix of the output gate, \({b}_{o}\) the corresponding bias, and \({x}_{t}\) the current time step; \({h}_{t}\) is the output of the LSTM layer at the current time step. Finally, the previous cell state \({C}_{t-1}\) must be updated. It is computed from the forget and input gates, as shown in Figure 5.

$$C_{t} = f_{t} \cdot C_{t-1} + i_{t} \cdot g_{t}$$

(10)

where \({g}_{t}\) is the cell candidate, obtained through a tanh layer.

Furthermore, the BiLSTM model is composed of two LSTM networks and is capable of reading input sequences in both the forward and backward directions: the forward LSTM processes information from left to right, while the backward LSTM processes it from right to left^{19}.
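A minimal NumPy sketch of a single LSTM time step following Eqs. (6)–(10); the weight shapes and dictionary layout are illustrative assumptions, not the internal layout used by any particular framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (6)-(10).
    W and b are dicts keyed by gate name: 'i', 'f', 'o' and the
    cell candidate 'g'; each W[k] has shape (hidden + input, hidden)."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    i_t = sigmoid(z @ W["i"] + b["i"])        # input gate, Eq. (6)
    f_t = sigmoid(z @ W["f"] + b["f"])        # forget gate, Eq. (7)
    o_t = sigmoid(z @ W["o"] + b["o"])        # output gate, Eq. (8)
    g_t = np.tanh(z @ W["g"] + b["g"])        # cell candidate
    c_t = f_t * c_prev + i_t * g_t            # cell state update, Eq. (10)
    h_t = o_t * np.tanh(c_t)                  # hidden state, Eq. (9)
    return h_t, c_t
```

A BiLSTM would run one such recurrence over the sequence left to right and a second one right to left, concatenating the two hidden states at each step.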

#### dense layer

A dense or fully connected layer, also known as a fully connected feedforward layer, is a type of artificial neural network layer in which every neuron in one layer is connected to every neuron in the next. The basic formula for a fully connected network with one hidden layer and one output layer (\({y}_{fc}\)) can be written as^{20}:

$$y_{fc} = f\left( {\mathop \sum \limits_{i = 1}^{n} \left( {W_{i} \cdot x_{i} } \right) + b} \right)$$

(11)

where \({x}_{i}\) is the input vector to the network, \({W}_{i}\) are the weight matrices for the connections between layers, \(b\) is the bias, and \(f\) is the activation function applied to the output of each layer (sigmoid, ReLU, tanh).

It is important to note that this formula is for a network with a single hidden layer; in practice, fully connected neural networks usually have multiple hidden layers, in which case the formula becomes more complex, with an additional weight matrix and bias vector for each additional layer.
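The dense layer of Eq. (11), and the stacking of several such layers, can be sketched as follows (the function name, layer sizes, and activations are illustrative):

```python
import numpy as np

def dense(x, W, b, f):
    """Fully connected layer, Eq. (11): y = f(sum_i W_i * x_i + b)."""
    return f(x @ W + b)

relu = lambda z: np.maximum(0.0, z)
identity = lambda z: z

# Stacking layers: each additional layer brings its own weight matrix and bias.
rng = np.random.default_rng(42)
x = rng.normal(size=8)
h = dense(x, rng.normal(size=(8, 16)), np.zeros(16), relu)      # hidden layer
y = dense(h, rng.normal(size=(16, 1)), np.zeros(1), identity)   # linear output
```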

#### dropout

Dropout is a regularization technique used in deep learning to avoid overfitting. It works by randomly “dropping” (i.e., setting to zero) a certain number of neurons during each training iteration. The mechanism of the dropout layer is simple: it is applied to the output of the previous layer and multiplies the input vector by a binary mask. This mask is randomly generated for each training iteration; its size is the same as the input, and each element is either 0 or 1. The probability that each element of the mask is dropped is called the dropout rate, a hyperparameter typically set between 0.2 and 0.5 depending on the specific application and the complexity of the model. Typically, a lower dropout rate is used for the input layer and a higher one for the hidden layers. During the testing phase, a dropout rate of 0 is used, meaning that all neurons are active: dropout is only applied during training and is not used during testing^{21}.
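A minimal sketch of the masking mechanism described above. This version uses the common “inverted dropout” variant, which additionally rescales the kept activations by 1/(1−rate) so that expected activations match between training and testing; the rescaling is an implementation detail not mentioned in the text:

```python
import numpy as np

def dropout(x, rate, training=True, rng=None):
    """Multiply x by a random binary mask at train time (each element is
    zeroed with probability `rate`); identity at test time."""
    if not training or rate == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= rate).astype(x.dtype)  # 0/1 mask
    return x * mask / (1.0 - rate)                        # inverted-dropout rescale
```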

### hyperparameter

The network was trained and tested using Python's TensorFlow. The adaptive moment estimation (Adam) method, a widely used optimization algorithm for neural network training, was employed. Adam combines the techniques of RMSprop and momentum optimizers to adjust the weights of the neural network efficiently and effectively during training; see Eqs. (12)–(14) below^{22,23}:

$$m_{t} = \beta_{1} m_{t-1} + \left( {1 - \beta_{1} } \right)g_{t}$$

(12)

$$v_{t} = \beta_{2} v_{t-1} + \left( {1 - \beta_{2} } \right)g_{t}^{2}$$

(13)

$$\theta_{t} = \theta_{t-1} - \frac{\alpha }{{\sqrt {v_{t} } + \epsilon }}m_{t}$$

(14)

where \({m}_{t}\) is the first-moment update (mean), \({v}_{t}\) the second-moment update (variance), \({\beta }_{1}\) and \({\beta }_{2}\) the moment decay parameters, \({g}_{t}\) the gradient at the current step, α the learning rate, ϵ (“epsilon”) a small numerical constant to avoid division by zero, and \({\theta }_{t}\) the current value of the parameter being updated, i.e., the parameter the algorithm optimizes.

These hyperparameters were set to 1·10^{-6} for “epsilon” and 1·10^{-4} for the learning rate in the training options. Batch sizes were set to 5 and 15, with 1000 epochs and a “shuffle” at each epoch. The value of \({\beta }_{1}\) was set to 0.99 and that of \({\beta }_{2}\) to 0.999. Training and testing were carried out using the holdout method for regression with a 40–60 training–testing split.
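A single Adam update implementing Eqs. (12)–(14) as written (i.e., without the bias-correction terms of the original Adam formulation), using the hyperparameter values reported above as defaults; the function name is illustrative:

```python
import numpy as np

def adam_step(theta, g, m, v, lr=1e-4, beta1=0.99, beta2=0.999, eps=1e-6):
    """One Adam update following Eqs. (12)-(14)."""
    m = beta1 * m + (1 - beta1) * g               # first moment, Eq. (12)
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment, Eq. (13)
    theta = theta - lr * m / (np.sqrt(v) + eps)   # parameter update, Eq. (14)
    return theta, m, v
```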

### overall network architecture

The architecture developed in this work takes as input sequences of the previously defined temporal windows. The sequence passes through three levels. In the first, there is an LSTM layer whose hidden layer has 128 units and which returns sequences. Levels 2 and 3 each contain a BiLSTM layer with 128 units, also returning sequences. At the output of the last level, a dense fully connected layer with 128 connections is applied. Then, to reduce the randomness of the weights, a dropout layer with a value of 0.4 is added, followed by a flatten layer that flattens the output sequence into a vector^{24}. Finally, a dense layer with one neuron and linear activation produces the output. Figure 6 shows a scheme of the entire architecture.
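Assuming the layers listed above are stacked with TensorFlow's Keras API (one feature per time step, window length as input size), the architecture might be sketched as follows; the original code is not shown in this section, so this is a reconstruction from the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(window):
    """LSTM -> 2x BiLSTM -> Dense(128) -> Dropout(0.4) -> Flatten ->
    Dense(1, linear), with the Adam settings reported in the text."""
    model = models.Sequential([
        layers.Input(shape=(window, 1)),
        layers.LSTM(128, return_sequences=True),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dense(128),
        layers.Dropout(0.4),
        layers.Flatten(),
        layers.Dense(1, activation="linear"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=1e-4, beta_1=0.99, beta_2=0.999, epsilon=1e-6),
        loss="mse",
    )
    return model
```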