
Nonlinear regression of COVID-19 infected cases

In 1927, W. O. Kermack and A. G. McKendrick described one of the first mathematical models for infectious diseases using a set of differential equations. This model is called SIR after the three states an individual can be in.
These states are:

  • Susceptible: The individuals that can be infected by the disease.
  • Infected: The individuals that have been infected and have the disease.
  • Recovered: The individuals that have recovered from the disease and have become immune.

The equations that represent these states are as follows:

  1. The number of susceptible individuals decreases over time at a rate proportional to the transmission factor β, the susceptible population, and the infected population.
  2. The variation of the infected individuals is the number of newly infected people, drawn from those who are still susceptible, minus the number of people who recover and therefore become immune.
  3. The variation of the recovered individuals depends directly on the number of infected multiplied by α, a factor that determines the time the infected need to recover.
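In the usual notation, these three statements correspond to the standard SIR system. The symbols S, I and R (the number of susceptible, infected and recovered individuals at time t) are the conventional ones and are assumed here.

```latex
\begin{align}
\frac{dS}{dt} &= -\beta S I \tag{1}\\
\frac{dI}{dt} &= \beta S I - \alpha I \tag{2}\\
\frac{dR}{dt} &= \alpha I \tag{3}
\end{align}
```

Adding the three equations gives d(S + I + R)/dt = 0, so the total population S + I + R does not change with time.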

The boundary conditions are:

  • The population must always remain constant.
  • At t=0, the model starts from the initial values of the three populations.

The analytical solution of this system can be found in different articles, for example here: arXiv:1403.2160

Instead, I will focus on equation (2) and note that it is a Bernoulli differential equation.
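As a sketch of why this holds: a Bernoulli equation has the general form shown first below. If the recovered population is neglected in the balance, so that S ≈ N − I with N the total population (a simplifying assumption made here for illustration), the equation for the infected takes that form with exponent n = 2.

```latex
\begin{align}
\frac{dy}{dt} + P(t)\,y &= Q(t)\,y^{n} \\
\frac{dI}{dt} - (\beta N - \alpha)\,I &= -\beta\,I^{2}
\end{align}
```

Here P(t) = −(βN − α) and Q(t) = −β are constants. The linear term drives the initial exponential growth and the quadratic term slows it down as the number of infected grows.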


The solution of this Bernoulli differential equation is the logistic function, used here in its most general form.


In the epidemiological context, this logistic function represents the cumulative number of infected people as a function of time.
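One common four-parameter way of writing the logistic function is shown below. The exact parameterization is an assumption here, chosen to match the four parameters a, b, c and d fitted later: a is the lower asymptote, b the upper asymptote (the final cumulative number of cases), c the time of the inflection point, and d a time scale that controls how steep the curve is.

```latex
f(t;\,a,b,c,d) \;=\; a + \frac{b - a}{1 + e^{(c - t)/d}}
```

With a = 0 this reduces to the standard logistic curve with plateau b, and at t = c the curve sits exactly halfway between a and b.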

This model can be fitted to the real data to estimate the values of its parameters. The fit is obtained by minimizing the sum of squared residuals, which serves as the loss function.
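Written out, and assuming the usual least-squares criterion, the loss is the sum of squared differences between the observed cumulative cases y_i on day t_i and the model value f(t_i; θ), where θ stands for the parameters being fitted. The symbols y_i, t_i, θ and n (the number of observations) are notation introduced here.

```latex
L(\theta) \;=\; \sum_{i=1}^{n} \bigl( y_i - f(t_i;\,\theta) \bigr)^{2}
```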

Because the function to be fitted is not linear in its parameters, the method used to minimize the loss function must be suitable for nonlinear regression. To do this regression, I used the nls function in R (part of the stats package), which uses the Gauss-Newton algorithm by default.

The data are the number of infected people in Spain as a function of time, as provided by the Ministry of Health.

This graph represents the data.

How to execute the regression using R.

  1. Load the CSV with data using read_csv

library(readr)

# Columns not listed in cols() (such as Casos) are left for readr to guess
descarga <- read_csv("serie_historica_acumulados.csv",
                     col_types = cols(Fallecidos = col_double(),
                                      Fecha = col_date(format = "%d/%m/%Y"),
                                      Hospitalizados = col_double(),
                                      Recuperados = col_double(),
                                      UCI = col_double(),
                                      X8 = col_skip()))


  2. Group the data by date and sum over all regions

library(dplyr)

# One row per date, with the national totals for each column
agregados_por_fecha <- descarga %>%
  group_by(Fecha) %>%
  summarize(Fallecidos = sum(Fallecidos),
            Casos = sum(Casos),
            Hospitalizados = sum(Hospitalizados),
            UCI = sum(UCI),
            Recuperados = sum(Recuperados))


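  3. Create a sequence to use as a time scale

# Day index: 1 for the first date in the series, 2 for the second, and so on
s <- seq_len(nrow(agregados_por_fecha))
agregados_por_fecha["dia"] <- s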


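  4. Use nls to fit the curve. For a good fit, it is necessary to provide starting values for the parameters that are compatible with the data. These have to be chosen manually.

# logis() is the four-parameter logistic function being fitted; it must be defined before this call
logis.m1 <- nls(Casos ~ logis(dia, a, b, c, d), data = agregados_por_fecha,
                start = list(a = 0, b = 180000, c = 40, d = 5))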
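The nls call relies on a function named logis whose definition is not shown. The following is a minimal sketch of the whole fitting step on synthetic data, not the Ministry of Health series. The form of logis, the "true" parameter values and the noise level are all assumptions chosen for illustration.

```r
# ASSUMPTION: four-parameter logistic; the definition used in the article is not shown.
# a = lower asymptote, b = upper asymptote, c = inflection day, d = time scale in days.
logis <- function(x, a, b, c, d) {
  a + (b - a) / (1 + exp((c - x) / d))
}

# Synthetic cumulative cases (hypothetical values), with some noise added
set.seed(1)
dia <- 1:60
synthetic <- data.frame(
  dia = dia,
  Casos = logis(dia, 0, 175000, 39, 5.4) + rnorm(length(dia), sd = 1000)
)

# Same call shape as in the article; starting values are chosen by eye
fit <- nls(Casos ~ logis(dia, a, b, c, d), data = synthetic,
           start = list(a = 0, b = 180000, c = 40, d = 5))

print(round(coef(fit), 2))
```

If the starting values are far from the data (for example, an upper asymptote b that is off by an order of magnitude), Gauss-Newton may fail to converge, which is why they are picked manually after looking at the plot.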


  5. Use summary to retrieve the details of the regression.

summary(logis.m1)

Formula: Casos ~ logis(dia, a, b, c, d)

    Estimate Std. Error t value Pr(>|t|)
a -2.320e+03  5.344e+02  -4.342 0.000115 ***
b  1.788e+05  2.111e+03  84.706  < 2e-16 ***
c  3.914e+01  1.317e-01 297.217  < 2e-16 ***
d  5.362e+00  1.033e-01  51.920  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This graph represents the data and the regression curve.


  • The regression found values for the parameters that are compatible with the data.
  • The inflection point occurred on day 39 (March 29).
  • According to the fit, the maximum number of infected people will be about 180,000.
  • According to the fit, the number of infected will keep growing until May 15th.
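These conclusions can be read off the fitted parameters. The sketch below uses the estimates from the summary output and rests on two assumptions: that logis has the four-parameter form written in the code, and that day 1 is February 20, 2020, which is inferred from "day 39 (March 29)".

```r
# Estimates copied from the summary output
est <- c(a = -2.320e+03, b = 1.788e+05, c = 3.914e+01, d = 5.362e+00)

# ASSUMPTION: form of the fitted function (its definition is not shown in the article)
logis <- function(x, a, b, c, d) a + (b - a) / (1 + exp((c - x) / d))

# ASSUMPTION: day 1 corresponds to 2020-02-20, so day 0 is 2020-02-19
day_to_date <- function(day) as.Date("2020-02-19") + day

# Inflection point: parameter c, rounded to a whole day
inflection_date <- day_to_date(round(est[["c"]]))

# Fraction of the distance from a to b already covered on May 15th
day_may_15 <- as.numeric(as.Date("2020-05-15") - as.Date("2020-02-19"))
value_may_15 <- logis(day_may_15, est[["a"]], est[["b"]], est[["c"]], est[["d"]])
fraction_may_15 <- (value_may_15 - est[["a"]]) / (est[["b"]] - est[["a"]])

print(inflection_date)
print(fraction_may_15)
```

Under these assumptions the plateau is the parameter b, and by May 15th the fitted curve has essentially reached it, which is the sense in which growth stops on that date.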


Comment by Pablo Gutierrez on April 16, 2020 at 1:15am

Hi Jason.

The data is for cumulative cases. Of course most of these people will recover after a while, but the analysis is focused on the infection process.


Comment by Jason Chia Kim Leng on April 15, 2020 at 7:34pm

This seems to model a scenario where the number of infected COVID-19 cases will eventually plateau instead of declining back down to zero, which suggests that life will never return to normal. That is, unless the plot's y axis represents the total cumulative case count since the first day, without taking into account the number of cases that have recovered. However, there is minimal description or annotation of the plot to settle on one interpretation.

Comment by Pablo Gutierrez on April 13, 2020 at 10:16pm

Thanks Peter. I will check it out. Regards

Comment by Peter Cotton on April 13, 2020 at 9:02am

Nice exposition. Thanks. Of course there are some issues with assuming a representative agent. See this post for some discussion of population density etc and simulations using pandemic on PyPI. I'll follow up with a blog article here. 

Comment by Habib on April 12, 2020 at 7:30pm

Could you please share the R code and the data files in English so that the results can be replicated?

