
Is predictive modeling different from interpolation? Do we really need stats?

You can get the same forecasts whether you use a statistical model (predictive modeling) or traditional numerical mathematics (interpolation, a technique from numerical analysis). Indeed, Hastie has shown in his famous data mining book that many apparently very different methodologies, such as logistic regression, decision trees, nearest neighbors, and neural networks, can produce exactly the same results, depending on how their parameters are tuned.
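To make that concrete, here is a minimal sketch in Python (the data and the query point x = 4.5 are invented for illustration): on smooth, noise-free data, a least-squares fit and a plain numerical interpolant return essentially the same prediction.

    import numpy as np

    x = np.linspace(0.0, 10.0, 11)      # training inputs
    y = 2.0 * x + 1.0                   # a noise-free linear signal

    # "Statistical" route: least-squares fit of a straight line.
    slope, intercept = np.polyfit(x, y, deg=1)
    model_pred = slope * 4.5 + intercept

    # "Numerical analysis" route: piecewise-linear interpolation.
    interp_pred = np.interp(4.5, x, y)

    print(model_pred, interp_pred)      # both are 10.0 (up to rounding)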

One might argue that you can create confidence intervals for your predicted values if you use a statistical model, but you can also do so without models.
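For example, a bootstrap percentile interval makes no parametric assumptions at all. A minimal sketch, with simulated data and a simple k-nearest-neighbour predictor that are purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = np.sin(x) + rng.normal(scale=0.3, size=200)

    def knn_predict(x_train, y_train, x_new, k=10):
        # Average the k nearest neighbours: no distributional assumptions.
        nearest = np.argsort(np.abs(x_train - x_new))[:k]
        return y_train[nearest].mean()

    # Bootstrap: resample the data, re-predict, and take percentiles.
    preds = []
    for _ in range(2000):
        idx = rng.integers(0, len(x), size=len(x))
        preds.append(knn_predict(x[idx], y[idx], x_new=5.0))

    lo, hi = np.percentile(preds, [2.5, 97.5])
    print("95% bootstrap interval for the prediction at x = 5:", lo, hi)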

So what are the advantages of predictive modeling (if any) over non-stochastic methods?


Replies to This Discussion


Statistical models allow you to predict into regions of the data space where you do not currently have observations. Also, if the distributional properties of the process you are modeling are already understood from prior experience, you can build a new model with relatively little data by using the known statistical form.
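For instance (a minimal sketch; the data and the linear form are assumptions made for illustration): a parametric model fitted to observations on [0, 5] can still return a prediction at x = 8, whereas a plain interpolant has no observations to work from out there.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 5, size=50))             # observations only on [0, 5]
    y = 3.0 * x - 2.0 + rng.normal(scale=0.5, size=50)

    slope, intercept = np.polyfit(x, y, deg=1)
    print("model prediction at x = 8:", slope * 8 + intercept)   # extrapolates, roughly 22
    print("interpolation at x = 8:   ", np.interp(8, x, y))      # clamped to the last observed value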

When the underlying phenomena generating your data are complex, using an empirical distribution when you have a lot of data (as in your confidence interval example) will often do better than using a statistical model that doesn't fit well.

The real key is understanding the nature of your problem, the nature of the data you have (or will be collecting), and using a method, statistical or otherwise, that best answers the question. If several methods are capable of modeling your data, you should get similar results from each, as Hastie shows. My advice: pick the one that best matches your problem and provides the kind of information you need to make your decisions.


In one word: generalization.

In many more words:

When you say 'interpolation' of the data, I assume you don't mean threading a function through all of the data points, but rather doing some sort of least-squares fitting over intervals (e.g. using splines a la Hastie, Tibshirani, and Friedman's chapter 5).
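For concreteness, a minimal sketch of that distinction with scipy (the data are simulated and the smoothing parameter is an illustrative guess): a spline with zero smoothing threads every point, while a smoothing spline does a penalised least-squares fit.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(0, 10, size=100))
    y = np.sin(x) + rng.normal(scale=0.3, size=100)

    exact = UnivariateSpline(x, y, s=0)       # s = 0: interpolates every point
    smooth = UnivariateSpline(x, y, s=9.0)    # s > 0: penalised least-squares fit

    print(exact(5.0), smooth(5.0), np.sin(5.0))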

The goal of statistical modeling is to distinguish between what's there by coincidence (noise) and what's there because of something inherent to the object of study (signal). Usually, we care about the signal, but not the noise. Without statistics, you're modeling noise + signal, and the noise doesn't generalize. If we didn't think there was any noise (coincidence) in the data, then you would be correct: we might as well throw out our probabilistic models and use approximation theory, perhaps with an approach like Nick Trefethen's chebfun:

http://www.maths.ox.ac.uk/chebfun/

But the noise doesn't generalize, so the interpolated function will get us further from the 'truth,' on average, than using a regression-based method.
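A minimal sketch of that claim (simulated data, so the exact numbers are only illustrative): an exact interpolant reproduces the training noise, while a low-order least-squares fit averages it away, so the regression fit is typically closer to the underlying function at new points.

    import numpy as np
    from scipy.interpolate import interp1d

    rng = np.random.default_rng(3)
    x_train = np.linspace(0, 10, 40)
    y_train = np.sin(x_train) + rng.normal(scale=0.4, size=40)

    x_test = np.linspace(0.2, 9.8, 200)
    truth = np.sin(x_test)

    interp = interp1d(x_train, y_train, kind="cubic")   # threads every training point
    coeffs = np.polyfit(x_train, y_train, deg=7)        # low-order least-squares fit
    reg_pred = np.polyval(coeffs, x_test)

    print("interpolation RMSE vs the truth:", np.sqrt(np.mean((interp(x_test) - truth) ** 2)))
    print("regression RMSE vs the truth:   ", np.sqrt(np.mean((reg_pred - truth) ** 2)))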

How to build 'model-free' regression methods is an open question. See, for example, Section 2.2 of this paper:

http://www.rmm-journal.de/downloads/Article_Wasserman.pdf

But even here, we don't assume that the data are *right*.

If the world weren't noisy, we could just use interpolation and throw out the stochastic considerations inherent in statistical models. But large parts of the world *are* noisy.

It is a good read, no doubt! Being a beginner in the field of analytics, my understanding is that once you understand the business problem you can then use any model. Further, with statistical methods you would be working from samples, which may give better insights depending on the confidence level you set up.
