# R Nonlinear Regression Analysis

Nonlinear Regression and Generalized Linear Models:
A regression model is nonlinear when at least one of its parameters enters the model nonlinearly. Nonlinear regression is widely used to analyze data in industries such as retail and banking, and to draw conclusions and predict future trends from users' online activity.
Nonlinear regression analysis is the process of building a nonlinear function that predicts the outcome of a dependent variable from one or more independent variables. The model parameters capture the strength of the relationship among the variables.
Generalized linear models (GLMs) extend regression to cases where the variance in the sample data is not constant or where the errors are not normally distributed.

Generalized linear models are commonly applied in the following situations:

• Count data are expressed as proportions (e.g. logistic regression)
• Count data are not expressed as proportions (e.g. log-linear models of counts)
• The response variable is binary (e.g. "yes/no", "day/night", "sleep/awake", "buy/not buy")
• The data show a constant coefficient of variation (e.g. time data with gamma errors)
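As a rough illustration of the situations listed above, the sketch below fits three GLMs with different error families. The data are simulated and the variable names are made up for this example only.

```r
set.seed(42)
n <- 100
x <- runif(n, 0, 2)

# Binary response -> binomial errors (logistic regression)
y_bin <- rbinom(n, 1, plogis(-1 + 1.5 * x))
fit_bin <- glm(y_bin ~ x, family = binomial)

# Counts not expressed as proportions -> Poisson errors (log-linear model)
y_cnt <- rpois(n, exp(0.3 + 0.8 * x))
fit_cnt <- glm(y_cnt ~ x, family = poisson)

# Positive data with a constant coefficient of variation -> gamma errors
y_gam <- rgamma(n, shape = 5, rate = 5 / exp(0.5 + 0.4 * x))
fit_gam <- glm(y_gam ~ x, family = Gamma(link = "log"))

# Intercept and slope of each fit, side by side
sapply(list(binomial = fit_bin, poisson = fit_cnt, gamma = fit_gam), coef)
```

Only the family argument changes between the three calls; the model formula stays the same.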

## Logistic Regression

In statistics, logistic regression is one of the most commonly used forms of nonlinear regression. It is used to estimate the probability of an event based on one or more independent variables. Logistic regression identifies the relationship between an enumerated variable and the independent variables using probability theory.

A variable is said to be enumerated if it can possess only one value from a given set of values.

Logistic regression models are generally used in cases when the rate of growth does not remain constant over a period of time. For example when a new technology is introduced in the market, firstly its demand increases at a faster rate but then gradually slows down.

Logistic Regression Types:

• Multivariate Logistic Regression – Here the logistic regression includes more than one independent variable. Multivariate logistic regression is commonly used in medicine and the social sciences. It is also used in electoral analysis, for example to estimate the voter turnout in a particular area; the breakdown of voters by sex, age, annual income, and residential location; and the chances of a particular candidate winning or losing in that area. This analysis helps to predict the success or failure of a proposed system, product, or process.
• Multinomial Logistic Regression – Here the response can take more than 2 possible values. It is used to estimate the value of an enumerated variable when that value depends on the values of 2 or more other variables.

Logistic regression is defined using the logit() function:

$$f(x) = \operatorname{logit}(x) = \log\left(\frac{x}{1-x}\right)$$

Suppose P(x) represents the probability of the occurrence of an event, such as diabetes, on the basis of an independent variable, such as the age of a person. The probability P(x) is given as follows:

$$P(x) = \frac{\exp(\beta_0 + \beta_1 x_1)}{1 + \exp(\beta_0 + \beta_1 x_1)}$$

Here β0 and β1 are the regression coefficients.

Taking the logit of the above equation, we get:

$$\operatorname{logit}(P(x)) = \log\left(\frac{P(x)}{1 - P(x)}\right)$$

On simplifying the above equation, we get:

$$\operatorname{logit}(P(x)) = \beta_0 + \beta_1 x_1$$

The logistic function, which is represented by an S-shaped curve, is also known as the sigmoid function.
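The relationship between the logit and the logistic (sigmoid) function can be checked numerically. The following is a minimal sketch; the coefficient values and ages are hypothetical and chosen only for illustration.

```r
logit   <- function(p) log(p / (1 - p))
sigmoid <- function(z) exp(z) / (1 + exp(z))

beta0 <- -4      # hypothetical intercept
beta1 <- 0.08    # hypothetical slope per year of age
age   <- c(20, 40, 60, 80)

p <- sigmoid(beta0 + beta1 * age)   # probabilities, strictly between 0 and 1
p

# Applying the logit recovers the linear predictor beta0 + beta1 * age
all.equal(logit(p), beta0 + beta1 * age)

# Base R provides the same pair of functions as plogis() and qlogis()
all.equal(p, plogis(beta0 + beta1 * age))
all.equal(logit(p), qlogis(p))
```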

The technology-adoption example mentioned earlier follows this S-shaped curve: demand for a new technology usually rises quickly in the first few months and then gradually levels off over time.

Multivariate logit() Function

In the case of multiple predictor variables, the following equation represents the logistic function:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}$$

Here, p is the expected probability; x1,x2,x3,…,xn are independent variables; and β0, β1, β2,…βn are the regression coefficients.

Estimating the β coefficients by hand is error-prone and time-consuming because it involves long and complex calculations. In practice, such estimates are computed with statistical software.

The β coefficients are estimated by maximizing the likelihood, using the following steps:

• First, write down the logarithm of the likelihood function.
• Next, take the partial derivative of the log-likelihood with respect to each β coefficient. For n unknown β coefficients, there will be n partial derivatives.
• Set each partial derivative equal to zero, which gives n equations.
• Finally, solve the n equations for the n unknown β coefficients.
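For a binary response, the steps above can be written out as follows. The observation index i, the sample size m, and the convention $x_{i0} = 1$ for the intercept are notation introduced here for illustration.

$$
\ell(\beta) = \sum_{i=1}^{m} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right],
\qquad
p_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in})}
$$

$$
\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{m} (y_i - p_i)\, x_{ij} = 0
\qquad \text{for each coefficient } \beta_j
$$

These equations are nonlinear in the β coefficients, so they are solved iteratively rather than in closed form. R's glm() does this with iteratively reweighted least squares.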

An interaction is a relationship among three or more variables in which two or more independent variables act together on a dependent variable, so that the effect of one depends on the level of another. Logistic regression can include such interacting variables.
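A minimal sketch of a logistic regression with an interaction term is shown here. The data are simulated and the names x1, x2 and y are assumptions made for this example.

```r
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
eta <- -0.5 + 0.8 * x1 - 0.6 * x2 + 1.0 * x1 * x2   # linear predictor with interaction
y   <- rbinom(n, 1, plogis(eta))

# In a model formula, x1 * x2 expands to x1 + x2 + x1:x2
# (both main effects plus their interaction)
fit <- glm(y ~ x1 * x2, family = binomial)
coef(summary(fit))
```

The row labelled x1:x2 in the coefficient table is the estimated interaction effect.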

In logistic regression, an enumerated variable can have an order but it cannot have magnitude. This makes numeric arrays unsuitable for storing enumerated variables, because numbers imply both order and magnitude. Enumerated variables are therefore encoded using dummy (indicator) variables, each of which takes one of two values: 0 or 1.
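In R, an enumerated variable is normally stored as a factor, and the dummy variables are generated automatically when a model is fitted. The sketch below uses a made-up region variable to show the 0/1 columns that R would create.

```r
region <- factor(c("north", "south", "east", "south", "north"))

# model.matrix() returns the design matrix used in model fitting.
# The first factor level ("east") is the baseline and gets no column of its own.
dummies <- model.matrix(~ region)
dummies
```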

After developing a logistic regression model, you have to check how well it predicts. Some useful techniques for checking the adequacy of a logistic regression model are listed below:

• Residual Deviance – A high residual deviance indicates an inadequate logistic regression model. The ideal value of the residual deviance is 0.
• Parsimony – Logistic regression models with fewer explanatory variables are more reliable, and often more useful, than models with a large number of explanatory variables.
• Classification Accuracy – This involves setting a threshold probability for classifying the response variable on the basis of the explanatory variables. It makes the result of a logistic regression model easier to interpret.
• Prediction Accuracy – Cross-validation is used to check the accuracy of predictions from a logistic regression model. It is repeated several times to obtain a more reliable estimate of prediction accuracy.
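The checks above can be sketched in a few lines of R. The data are simulated, and the 0.5 threshold and the choice of 5 folds are assumptions made for this example rather than recommendations.

```r
set.seed(123)
n   <- 300
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + 1.5 * dat$x))

fit <- glm(y ~ x, family = binomial, data = dat)
deviance(fit)   # residual deviance (lower is better)
AIC(fit)        # penalises extra parameters, so it rewards parsimony

# Classification accuracy at a 0.5 probability threshold
pred_class <- as.integer(fitted(fit) > 0.5)
mean(pred_class == dat$y)

# 5-fold cross-validated prediction accuracy
folds  <- sample(rep(1:5, length.out = n))
cv_acc <- sapply(1:5, function(k) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  m <- glm(y ~ x, family = binomial, data = train)
  p <- predict(m, newdata = test, type = "response")
  mean(as.integer(p > 0.5) == test$y)
})
mean(cv_acc)
```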

Fitting the Regression Model

```r
model <- glm(occupied ~ resources, family = binomial)
```

This fits a logistic regression model to the given data. The binomial argument specifies the error distribution, which applies to situations with two possible outcomes.

Drawing Logistic Regression Line

```r
# xv: grid of x values, yv: fitted probabilities at xv (assumed already created)
lines(xv, yv)
```

To check the validity of the predictions against the fitted line, cut the ranked values on the x-axis into 5 categories and then work out the mean and standard error of the proportions in each group.

Dividing the Range of Given Values

```r
cutr <- cut(resources, 5)
```

It divides the range of values into 5 intervals.

Calculating the Probabilities for Divided Range

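Calculate the observed probabilities for each interval by using the following commands:

```r
# Assumption: the proportion occupied in each interval is used as the probability;
# the original listing used probs without showing how it was created.
probs <- tapply(occupied, cutr, mean)
probs <- as.vector(probs)
resmeans <- tapply(resources, cutr, mean)
resmeans <- as.vector(resmeans)
```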

Plotting the Generated Points

```r
points(resmeans, probs, pch = 16, cex = 2)
```

This plots the observed group proportions on top of the logistic regression line.
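The commands in this section assume that occupied, resources, xv and yv already exist. A self-contained sketch of the whole workflow might look like the following; the data are simulated here, and the way xv, yv and the standard errors are built is an assumption about what the original example intended.

```r
set.seed(2)
resources <- runif(150, 0, 1000)                              # simulated predictor
occupied  <- rbinom(150, 1, plogis(-3 + 0.006 * resources))   # simulated 0/1 response

model <- glm(occupied ~ resources, family = binomial)

# Raw data and fitted logistic curve
plot(resources, occupied, xlab = "resources", ylab = "P(occupied)")
xv <- seq(0, 1000, length.out = 200)
yv <- predict(model, newdata = data.frame(resources = xv), type = "response")
lines(xv, yv)

# Observed proportion and mean x value in each of 5 intervals
cutr     <- cut(resources, 5)
probs    <- as.vector(tapply(occupied, cutr, mean))
resmeans <- as.vector(tapply(resources, cutr, mean))
se       <- sqrt(probs * (1 - probs) / as.vector(table(cutr)))

points(resmeans, probs, pch = 16, cex = 2)
segments(resmeans, probs - se, resmeans, probs + se)   # +/- 1 standard error
```

If the model is adequate, the fitted curve should pass close to the five plotted points.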

Statistical interpretation is performed by government agencies and business organizations to draw inferences and derive conclusions on the basis of research data. Such conclusions help organizations analyze the efficacy of current measures as well as decide future trends for further growth and profit.

## Line Estimation Using MLE

Regression lines for models are generated on the basis of the parameter values that appear in the regression model. So first you need to estimate the parameters for the regression model. Parameter estimation is used to improve the accuracy of linear and nonlinear statistical models.

A widely used method for estimating the parameters of a regression model is Maximum Likelihood Estimation (MLE), which chooses the parameter values that make the observed data most probable.
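As an illustration of the idea, the sketch below estimates the two parameters of a logistic model by minimizing the negative log-likelihood with optim() and compares the result with glm(). The simulated data and the function name negloglik are assumptions made for this example.

```r
set.seed(10)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of a logistic model with parameters beta = (b0, b1)
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

mle <- optim(c(0, 0), negloglik, method = "BFGS")
mle$par                                  # maximum likelihood estimates

glm_fit <- glm(y ~ x, family = binomial)
coef(glm_fit)                            # should agree closely with mle$par
```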

We can estimate the parameters in any of the following ways:

• You can specify model parameters under certain conditions, such as the resistance and inertia of a mechanical engine.
• You can manipulate input and output test data, such as the rate of current influx and the output of the mechanical engine in revolutions per minute (rpm).
• You can take a number of measurements of a function at different values of a variable.

The presence of bias while collecting data for parameter estimation might lead to uneven and misleading results. Bias can occur while selecting the sample or collecting the data.

## Transformation of a Nonlinear Model into a Linear Model

The linear least squares method fits a straight line through the data points of a model. However, in many cases, the data points form a curve.

Nonlinear models are sometimes transformed into linear models, because linear models are easier to fit. Consider the following nonlinear equation for exponential growth:

$$y = c\,e^{bx}\,u$$

Here b is the growth rate, u is the random error term, and c is a constant.

We can fit the above equation by using the linear regression method. Use the following steps to transform the nonlinear equation into a linear equation:

• Taking the natural (base-e) logarithm of the equation gives ln(y) = ln(c) + bx + ln(u).
• Substituting Y for ln(y), C for ln(c), and U for ln(u) gives the linear equation Y = C + bx + U.
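The transformation can be tried out on simulated data. In this sketch the true values of c and b, and the size of the multiplicative error, are arbitrary choices made for illustration.

```r
set.seed(5)
x <- seq(0, 10, length.out = 50)
c_true <- 2.0
b_true <- 0.3
u <- exp(rnorm(length(x), sd = 0.1))   # multiplicative error, so ln(u) is normal
y <- c_true * exp(b_true * x) * u

# Linear regression on the log scale: ln(y) = ln(c) + b * x + ln(u)
fit <- lm(log(y) ~ x)

b_hat <- unname(coef(fit)[2])          # slope estimates b directly
c_hat <- unname(exp(coef(fit)[1]))     # back-transform the intercept to get c
c(c = c_hat, b = b_hat)
```

Note that this approach assumes the error is multiplicative on the original scale. If the error is additive, fitting the nonlinear model directly with nls() is usually more appropriate.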

## Other Nonlinear Regression Models

There are several models for specifying the relationship between y and x. For each of them, the parameters and their standard errors can be estimated from data.

Some of the most frequently appearing nonlinear regression models are:

a) Michaelis-Menten

$$y = \frac{ax}{1 + bx}$$

b) 2-parameter asymptotic exponential

$$y = a\left(1 - e^{-bx}\right)$$

c) 3-parameter asymptotic exponential

$$y = a - b\,e^{-cx}$$

Below are a few S-shaped functions:

d) 2-parameter logistic

$$y = \frac{e^{a+bx}}{1 + e^{a+bx}}$$

e) 3-parameter logistic

$$y = \frac{a}{1 + b\,e^{-cx}}$$

f) 4-parameter logistic

$$y = a + \frac{b - a}{1 + e^{(c - x)/d}}$$

g) Weibull

$$y = a - b\,e^{-(c x^{2})}$$

Below are a few humped curves:

h) Ricker curve

$$y = a x\,e^{-bx}$$

i) First-order compartment

$$y = k\left[\exp(-\exp(a)\,x) - \exp(-\exp(b)\,x)\right]$$

j) Bell-shaped

$$y = a \exp\left(-|bx|^{2}\right)$$

k) Biexponential

$$y = a\,e^{bx} - c\,e^{-dx}$$

The accuracy of a statistical interpretation largely depends on the correctness of the statistical model on which it depends.

The following are the most common types of statistical models:

• Fully Parametric – The data-generating process is assumed to be described completely by a known, finite number of parameters.
• Non Parametric – Assumptions are based on the features of the available data. These models do not use parameters to describe the process of generating data.
• Semi Parametric – Both parametric and nonparametric approaches describe the process of data generation. These models use parameters as well as the main features of the data to derive conclusions.

An example of nonlinear regression: This example is based on the relationship between jaw bone length and age in deer.

a) Reading the dataset from the jaws.txt file. The path of the file is passed as an argument.

```r
deer <- read.table("c:\\temp\\jaws.txt", header = TRUE)
```

b) Fitting the model. The nonlinear equation is passed to the nls() command together with starting values for the parameters a, b and c. The result is stored in the model object.

```r
# data = deer tells nls() where to find bone and age
model <- nls(bone ~ a - b * exp(-c * age), data = deer,
             start = list(a = 120, b = 110, c = 0.064))
```

c) Displaying information about the model object by passing it to the summary() command, as shown below:

```r
summary(model)
```

d) Fitting a simpler model

$$y = a\left(1 - e^{-cx}\right)$$

e) Applying the nls() command to the simplified regression model. The result is stored in the model2 object.

```r
model2 <- nls(bone ~ a * (1 - exp(-c * age)), data = deer,
              start = list(a = 120, c = 0.064))
```

f) Comparing the models. Use the anova() command to compare the result objects model and model2, which are passed to it as arguments.

```r
anova(model, model2)
```

g) Drawing the fitted curve. Generate the curve by passing the av and bv objects to the lines() command.

```r
# av: grid of ages, bv: model predictions at av (assumed already created)
lines(av, bv)
```

h) Viewing the components of the new model2 object, as below:

```r
summary(model2)
```
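The jaws.txt file is not included here, so the steps above can be tried on simulated data instead. In the sketch below the "true" parameter values, the noise level and the sample size are all assumptions chosen to resemble the example; the way av and bv are built is likewise an assumption.

```r
set.seed(11)
age  <- runif(54, 0, 50)
bone <- 115 - 118 * exp(-0.12 * age) + rnorm(54, sd = 5)   # simulated jaw bone lengths
deer <- data.frame(age, bone)

# 3-parameter asymptotic exponential, then the simpler 2-parameter version
model  <- nls(bone ~ a - b * exp(-c * age), data = deer,
              start = list(a = 120, b = 110, c = 0.064))
model2 <- nls(bone ~ a * (1 - exp(-c * age)), data = deer,
              start = list(a = 120, c = 0.064))

anova(model, model2)   # does dropping b make the fit significantly worse?

# Plot the data and draw the fitted curve of the simpler model
plot(deer$age, deer$bone, xlab = "age", ylab = "bone")
av <- seq(0, 50, 0.1)
bv <- predict(model2, newdata = data.frame(age = av))
lines(av, bv)
```

A non-significant p-value in the anova() table would support keeping the simpler model.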

Sometimes we can see that the relationship between y and x is nonlinear but we don’t have any theory or any mechanistic model to suggest a particular functional form (mathematical equation) to describe the relationship. In such circumstances, Generalized Additive Models (GAMs) are particularly useful because they fit a nonparametric curve to the data without requiring us to specify any particular mathematical model to describe the nonlinearity.

GAMs are useful because they allow you to identify the relationship between y and x without choosing a particular parametric form. Generalized additive models are implemented in R by the gam() command.

The gam() command has many of the attributes of both glm() and lm(), and the output can be modified using the update() command. You can use all of the familiar methods such as print, plot, summary, anova, predict, and fitted after a GAM has been fitted to data. The gam() function is available in the mgcv package.
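A minimal sketch of fitting a GAM is shown below. It assumes the mgcv package is installed, and it uses simulated data whose curved shape is arbitrary.

```r
library(mgcv)   # provides gam() and the s() smooth term

set.seed(3)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + 0.1 * x^2 + rnorm(200, sd = 0.3)

# s(x) requests a smooth function of x; no parametric form is specified
fit <- gam(y ~ s(x))
summary(fit)

# The usual methods work on the fitted object
pred <- predict(fit, newdata = data.frame(x = c(2.5, 7.5)))
pred
```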

## Self-Starting Functions

In nonlinear regression analysis, the nonlinear least squares method can fail to converge when the user's initial guesses for the starting parameter values are poor. The simplest solution is to use R's self-starting models.

Self-starting models work out the starting values automatically, which removes the risk of a fit failing because the user guessed the starting values badly.

Some of the most frequently used self-starting functions are:

a) Michaelis-Menten model (SSmicmen)

R has a self-starting version of the Michaelis-Menten model called SSmicmen, which uses the following form:

$$y = \frac{ax}{b + x}$$

Here, a is the asymptotic value of y, and b is the value of x at which the response reaches half of its maximum, a/2.
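As a short illustration, SSmicmen can be fitted to the Puromycin dataset that ships with R. In R's parameterization, Vm plays the role of a and K plays the role of b.

```r
# Enzyme reaction rates for the treated cells only
PurTrt <- Puromycin[Puromycin$state == "treated", ]

# No start = argument is needed: SSmicmen works out the starting values itself
fit <- nls(rate ~ SSmicmen(conc, Vm, K), data = PurTrt)
coef(fit)
```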

b) Asymptotic regression model (SSasymp)

SSasymp is the self-starting version of the asymptotic regression model. The 3-parameter asymptotic exponential equation can be written as:

$$y = a - b\,e^{-cx}$$

Here, a is the horizontal asymptote, b = a − R0 where R0 is the intercept (the response when x is 0), and c is the rate constant.
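The sketch below fits SSasymp to one tree from R's built-in Loblolly dataset and converts the result to the a, b, c form used above. R's SSasymp is parameterized with Asym, R0 and lrc, where lrc is the natural logarithm of the rate constant.

```r
Lob.329 <- Loblolly[Loblolly$Seed == "329", ]

fit <- nls(height ~ SSasymp(age, Asym, R0, lrc), data = Lob.329)
est <- coef(fit)

# Translate R's parameters into a, b and c
abc <- c(a = est[["Asym"]],
         b = est[["Asym"]] - est[["R0"]],
         c = exp(est[["lrc"]]))
abc
```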

c) Four-parameter logistic model (SSfpl)

SSfpl is the self-starting version of the four-parameter logistic regression:

$$y = A + \frac{B - A}{1 + e^{(D - x)/c}}$$

Here, A is the horizontal asymptote on the left (for low values of x), B is the horizontal asymptote on the right (for large values of x), D is the value of x at the point of inflection of the curve, and c is a numeric scale parameter on the x-axis.
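For example, SSfpl can be fitted to the growth of a single chick from R's built-in ChickWeight dataset. R names the four parameters A, B, xmid and scal, which correspond to A, B, D and c above.

```r
Chick.1 <- ChickWeight[ChickWeight$Chick == 1, ]

fit <- nls(weight ~ SSfpl(Time, A, B, xmid, scal), data = Chick.1)
coef(fit)
```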

d) Self-Starting First-Order Compartment Function (SSfol)

SSfol is the self-starting version of the first-order compartment function, which is given as follows:

$$y = k\left[\exp(-\exp(a)\,x) - \exp(-\exp(b)\,x)\right]$$

Here, k = Dose*exp(a+b−c)/(exp(b) − exp(a)), and Dose is a vector of identical values provided to the fit.
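A short sketch using R's built-in Theoph dataset (theophylline concentrations over time) is given below. In R's SSfol, the parameters lKe, lKa and lCl correspond to a, b and c in the expression above.

```r
# Concentration-time data for the first subject only
Theoph.1 <- Theoph[Theoph$Subject == 1, ]

fit <- nls(conc ~ SSfol(Dose, Time, lKe, lKa, lCl), data = Theoph.1)
coef(fit)
```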

e) Self-Starting Weibull Growth Function (SSweibull)

SSweibull is the self-starting version of the Weibull growth function. R's parameterization of the function is as follows:

Asym-Drop*exp(-exp(lrc)*x^pwr)

Here, Asym is the horizontal asymptote on the right

Drop is the difference between the asymptote and the intercept (the value of y at x=0)

lrc is the natural logarithm of the rate constant

pwr is the power to which x is raised.
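As an illustration, SSweibull can be fitted to one chick from the built-in ChickWeight dataset. The observation at Time 0 is excluded in this sketch, following the example in R's documentation for SSweibull.

```r
Chick.6 <- subset(ChickWeight, (Chick == 6) & (Time > 0))

fit <- nls(weight ~ SSweibull(Time, Asym, Drop, lrc, pwr), data = Chick.6)
coef(fit)
```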

## Bootstrapping a Family of Nonlinear Regressions

When performing nonlinear regression analysis, we often have only a single sample, which is not enough to judge how variable the parameter estimates are. In this case, we can create new samples from the existing one.

Bootstrapping is the method of creating new samples from an existing sample. It can be applied to parameter estimation in nonlinear models in two broad ways:

• Sample the data points at random with replacement, so that some points are duplicated and others are left out, and refit the model to each resampled dataset.
• Fit the model and estimate the residuals, then allocate the residuals at random, adding them to different fitted values in different simulations.

Code for Bootstrapping Nonlinear Regression

```r
bv <- numeric(1000)
```

It creates the bv vector with room for 1000 values.

```r
cv <- numeric(1000)
```

It creates the cv vector with room for 1000 values.

```r
# Time, Viscosity and Wt are assumed to be vectors of length 23 that already exist
for (i in 1:1000) {                    # loop that executes 1000 times
  ss <- sample(1:23, replace = TRUE)   # sample the indices of the 23 cases with replacement
  y  <- Time[ss]                       # values of Time at the sampled indices
  x1 <- Viscosity[ss]                  # values of Viscosity at the sampled indices
  x2 <- Wt[ss]                         # values of Wt at the sampled indices
  # Fit the regression equation with starting values for b and c
  model <- nls(y ~ b * x1 / (x2 - c), start = list(b = 29, c = 2))
  bv[i] <- coef(model)[1]              # store the estimate of b at index i
  cv[i] <- coef(model)[2]              # store the estimate of c at index i
}
```

Using the code above, you can fit the model to 1000 different resamples of the given data. The idea is simple: you have a single sample of n measurements, but you can resample it in many ways as long as you allow some values to appear more than once and others to be left out.
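The second approach listed earlier, resampling the residuals, can be sketched in a similar way. This example uses R's built-in Puromycin data and the self-starting SSmicmen model rather than the viscosity data. The number of resamples (200) and the use of tryCatch() to skip the occasional failed fit are choices made for this illustration.

```r
set.seed(99)
PurTrt <- Puromycin[Puromycin$state == "treated", ]
fit <- nls(rate ~ SSmicmen(conc, Vm, K), data = PurTrt)

fv  <- as.vector(fitted(fit))   # fitted values from the original model
res <- as.vector(resid(fit))    # residuals from the original model

n_boot  <- 200
boot_Vm <- rep(NA_real_, n_boot)
for (i in seq_len(n_boot)) {
  # New response: fitted values plus residuals drawn at random with replacement
  sim <- data.frame(conc = PurTrt$conc,
                    rate = fv + sample(res, replace = TRUE))
  refit <- tryCatch(nls(rate ~ SSmicmen(conc, Vm, K), data = sim),
                    error = function(e) NULL)   # skip resamples where the fit fails
  if (!is.null(refit)) boot_Vm[i] <- coef(refit)[["Vm"]]
}

# Percentile interval for Vm from the bootstrap distribution
quantile(boot_Vm, c(0.025, 0.975), na.rm = TRUE)
```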

## Applications of Logistic Regression

Logistic regression is one of the most commonly used forms of regression analysis in practice. It is particularly useful for classifying new cases into one of two outcome categories.

A few applications, for example, are as follows:

• Loan Acceptance – Using logistic regression, organizations that provide loans, such as banks, can predict from a customer's previous behavior whether the customer will accept an offered loan. Explanatory variables include the customer's age, experience, income, family size, CCAvg, mortgage, etc.
• German Credit Data – The German credit dataset was obtained from the UCI (University of California, Irvine) Machine Learning Repository (Asuncion and Newman, 2007). The dataset, which contains attributes and outcomes of 1,000 loan applications, was provided in 1994 by Professor Dr. Hans Hofmann of the Institut fuer Statistik und Oekonometrie at the University of Hamburg. It has also served as an important test dataset for several credit-scoring algorithms. Logistic regression is used to estimate the probability of default, using continuous variables (duration, amount, installment, age) and categorical variables (loan history, purpose, foreign, rent) as explanatory variables.
• Delayed Airplanes – Logistic regression analysis can also predict possible flight delays. Explanatory variables include the arrival airport, the departure airport, the carrier, weather conditions, the day of the week, and a categorical variable for the hour of departure.
