Steps of Modelling - DataScienceCentral.com

Regression analysis is a method for finding functional relationships among variables. The relationship is expressed as an equation or model that connects the response (dependent) variable to one or more explanatory (predictor) variables.

Regression analysis includes the following steps:

Statement of the problem – Regression analysis usually starts with the formulation of the problem, which includes the question(s) to be answered by the analysis. This is the first and perhaps the most important step, because an incorrectly defined problem leads to wasted effort: an irrelevant set of variables may be selected, or the wrong statistical method of analysis may be chosen. A question that has not been carefully formulated can also lead to the wrong choice of model.

Selection of potentially relevant variables – The next step is to select a set of variables that domain experts think will explain or predict the response variable. The response variable is denoted by Y and the explanatory or predictor variables are denoted by X1, X2, . . . , Xp, where p is the number of predictor variables.

Data collection – The next step is to collect the data to be analyzed from the environment under study. Sometimes the data are collected in a controlled setting so that factors that are not of primary interest can be held constant. More often the data are collected under non-experimental conditions, where the investigator can control very little. In either case, the collected data consist of observations on n subjects (often also called individuals). Each of these n observations consists of a measurement for each of the potentially relevant variables.
The data are usually recorded in rows and columns. A column represents a variable, whereas a row represents an observation, which is a set of p + 1 values for a single subject, i.e., one value for the response variable and one value for each of the p predictors. Each variable can be classified as either quantitative or qualitative. When the response variable is binary, the technique used is called logistic regression. In regression analysis, the predictor variables can be quantitative, qualitative, or a mix of both. For computation, however, any qualitative variables have to be coded into a set of indicator or dummy variables. If all predictor variables are qualitative, the techniques used to analyze the data are called analysis of variance techniques (special cases of regression analysis). If some of the predictor variables are quantitative while others are qualitative, the regression analysis is called the analysis of covariance.
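The coding of a qualitative predictor into indicator variables can be sketched as follows. This is a minimal illustration in plain Python: the data, the column names (`y`, `x1`, `region`) and the helper `dummy_code` are all hypothetical, and the choice of k - 1 indicators with the first level as baseline is one common convention (it suits a model with an intercept), not something the text prescribes.

```python
# Hypothetical data: each dict is one row (observation), each key one column (variable).
rows = [
    {"y": 12.0, "x1": 1.5, "region": "north"},
    {"y": 15.5, "x1": 2.0, "region": "south"},
    {"y": 11.0, "x1": 1.0, "region": "west"},
    {"y": 16.0, "x1": 2.5, "region": "south"},
]


def dummy_code(rows, column):
    """Replace a qualitative column with k - 1 indicator (0/1) columns.

    The first level in sorted order is treated as the baseline and gets
    no column of its own (an assumed convention, see the note above).
    """
    levels = sorted({row[column] for row in rows})
    baseline, others = levels[0], levels[1:]
    coded = []
    for row in rows:
        # Copy every variable except the qualitative one being coded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in others:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        coded.append(new_row)
    return coded, baseline


coded_rows, baseline = dummy_code(rows, "region")
print("baseline level:", baseline)
for row in coded_rows:
    print(row)
```

With three regions, the single qualitative column becomes two 0/1 columns, and an observation from the baseline region has zeros in both.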
Model specification – Initially, the form of the model that is assumed to explain the relationship between the response variable and the set of predictor variables is usually specified by experts in the area of study, based on their knowledge or on their objective and/or subjective judgment. The hypothesized model can then be either confirmed or rejected by the analysis of the collected data. Note that the model needs to be specified only in form; it can still depend on unknown parameters. We need to select the form of the function f(X1, X2, …, Xp) that relates Y to the predictors. This function can be classified into two types: linear and nonlinear. Note that the term linear (nonlinear) here does not describe the relationship between Y and X1, X2, …, Xp. It refers to the fact that the regression parameters enter the equation linearly (nonlinearly). Nonlinear functions that can be transformed into linear functions are called linearizable functions. Accordingly, the class of linear models is wider than it first appears, because it includes all linearizable functions. Note, however, that not all nonlinear functions are linearizable. A regression model with one predictor variable is called a simple regression model. A model with more than one predictor variable is called a multiple regression model. When there is only one response variable, the analysis is called univariate regression; when there are two or more response variables, it is called multivariate regression. Simple versus multiple regression should not be confused with univariate versus multivariate regression.
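To make the linear, linearizable, and nonlinear distinction concrete, here are three illustrative model forms with a single predictor X and an error term ε. These functions are textbook-style examples chosen for illustration and are not taken from the text above. The second one assumes Y > 0, β0 > 0 and a positive multiplicative error, so that logarithms can be taken.

```latex
% Linear in the parameters, even though Y is a curved function of X:
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]

% Nonlinear in \beta_1, but linearizable by taking logarithms of both sides:
\[ Y = \beta_0 e^{\beta_1 X} \varepsilon
   \quad\Longrightarrow\quad
   \ln Y = \ln \beta_0 + \beta_1 X + \ln \varepsilon \]

% Nonlinear in \beta_1, and not made linear by a simple transformation such as the logarithm:
\[ Y = \beta_0 + e^{\beta_1 X} + \varepsilon \]
```

In the second case the transformed model is linear in the parameters ln β0 and β1, so it can be fitted with the same tools as any linear model.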
Choice of fitting method – After defining the initial model and collecting the data, the next step is to choose how to estimate the parameters of the model from the collected data. Estimating the parameters is also referred to as parameter estimation or model fitting. The most commonly used method of estimation is the least squares method. Under certain assumptions, the least squares method produces estimators with desirable properties. In some instances (e.g., when one or more of the assumptions do not hold), other estimation methods may be superior to least squares. Other estimation methods that can be considered include the maximum likelihood method, the ridge method, and the principal components method.
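As a rough illustration of how the choice of estimation method changes the result, the sketch below compares least squares with the ridge method for a simple (one-predictor) model. The function `fit_simple`, the toy data and the penalty value 5.0 are hypothetical. The ridge variant assumed here penalizes only the slope, which for one predictor gives a closed-form estimate.

```python
def fit_simple(x, y, ridge_penalty=0.0):
    """Fit y = b0 + b1 * x by minimizing
        sum((y_i - b0 - b1 * x_i) ** 2) + ridge_penalty * b1 ** 2.

    ridge_penalty = 0 gives ordinary least squares; the intercept is
    never penalized (an assumption of this sketch).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / (s_xx + ridge_penalty)  # the penalty shrinks the slope toward 0
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols = fit_simple(x, y)
ridge = fit_simple(x, y, ridge_penalty=5.0)
print("least squares (b0, b1):", ols)
print("ridge         (b0, b1):", ridge)
```

The ridge slope is smaller in absolute value than the least squares slope. That shrinkage is the point of the method: it trades some bias for lower variance, which can help when the least squares assumptions are strained, for example when predictors are strongly correlated.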
Model fitting – The next step in the analysis is to estimate the regression parameters, that is, to fit the model to the collected data using the chosen estimation method. The estimates of the regression parameters $\beta_0, \beta_1, \ldots, \beta_p$ are denoted by $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$. The estimated regression equation then becomes $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_p X_p$.
A hat on top of a parameter denotes the estimate of that parameter. The value $\hat{Y}$ (pronounced "Y hat") is called the fitted value. Using the estimated regression equation we can compute n fitted values, one for each of the n observations in our data. The equation can also be used to predict the response variable for values of the predictor variables not observed in our data. In this case, the obtained $\hat{Y}$ is called the predicted value. The difference is that a fitted value is computed at predictor values that correspond to one of the n observations in our data, whereas a predicted value can be obtained for any set of predictor values. It is generally not recommended to predict the response variable for predictor values far outside the range of our data. When the values of the predictor variables represent future values of the predictors, the predicted value is referred to as the forecasted value.
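The fitting step and the distinction between fitted and predicted values can be sketched in plain Python. This is a minimal illustration, not a production routine: it solves the least squares normal equations directly, all function names and data are hypothetical, and the data are generated without noise from Y = 1 + 2 X1 + 3 X2 so that the estimates are easy to check by eye. The per-variable range check is a crude stand-in for judging whether a prediction extrapolates beyond the data.

```python
def solve_linear_system(a, b):
    """Solve a @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(predictors, response):
    """Return [b0, b1, ..., bp] solving the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Evaluate the estimated regression equation at one set of predictor values."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


def outside_observed_range(predictors, row):
    """True if any predictor value lies outside the range seen in the data."""
    columns = list(zip(*predictors))
    return any(x < min(col) or x > max(col) for x, col in zip(row, columns))


# Hypothetical noise-free data: n = 5 observations on p = 2 predictors.
predictors = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
response = [9.0, 8.0, 19.0, 18.0, 26.0]

coefs = fit_least_squares(predictors, response)
fitted = [predict(coefs, row) for row in predictors]  # one fitted value per observation

for new_row in [(3.5, 2.5), (10.0, 10.0)]:
    note = " (outside observed range)" if outside_observed_range(predictors, new_row) else ""
    print(new_row, "->", predict(coefs, new_row), note)
```

The fitted values are computed at the five observed rows, while the two new rows give predicted values. The second new row is flagged because both of its predictor values lie beyond anything in the data.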
Model validation and criticism – The validity of a statistical method, such as regression analysis, depends on certain assumptions. Assumptions are usually made about the data and the model. The accuracy of the analysis and of the conclusions derived from it depends crucially on the validity of these assumptions. Regression analysis is viewed here as an iterative process, in which the outputs are used to diagnose, validate, criticize, and possibly modify the inputs. The process has to be repeated until a satisfactory output has been obtained. A satisfactory output is an estimated model that satisfies the assumptions and fits the data reasonably well.
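Two basic quantities often examined at this stage are the residuals (observed minus fitted values) and the coefficient of determination R². The sketch below computes both for hypothetical toy data. The coefficients 0.05 and 1.99 are assumed to be the least squares estimates for this particular data set, and a real validation would go further, for example by plotting the residuals against the fitted values.

```python
def residuals(observed, fitted):
    """Observed minus fitted value for each observation."""
    return [y - f for y, f in zip(observed, fitted)]


def r_squared(observed, fitted):
    """Share of the variation in the response explained by the fitted model."""
    y_bar = sum(observed) / len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in observed)              # total variation
    return 1.0 - sse / sst


# Hypothetical toy data and an assumed fitted line Y-hat = 0.05 + 1.99 * X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [0.05 + 1.99 * xi for xi in x]

res = residuals(y, fitted)
r2 = r_squared(y, fitted)
print("residuals:", [round(r, 3) for r in res])
print("sum of residuals:", round(sum(res), 10))
print("R squared:", round(r2, 4))
```

For a least squares fit with an intercept the residuals sum to zero, so a clearly nonzero sum would point to a mistake in the fit. A high R² alone does not validate the model: the residuals should also show no systematic pattern.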
Using the chosen model(s) for the solution of the posed problem – The explicit determination of the regression equation is the most important product of the analysis. It is a summary of the relationship between Y (the response variable) and the set of predictor variables X1, X2, . . . , Xp. The equation may be used for several purposes: to evaluate the importance of individual predictors, to analyze the effects of a policy that involves changing the values of the predictor variables, or to forecast values of the response variable for a given set of predictor values.