# Sometimes outliers are real data

How do you know whether an outlier is the result of a data glitch or a real data point, and indeed maybe not an outlier at all? It is a difficult question to answer, but the chart below shows that in some cases the outlier is not an error.

In this example, you could argue that we are not using the right metrics: comparing health expenditures in the US (twice the average among developed countries) when the US after-tax salary is also twice the average among developed countries leads to a bias. Once corrected for this salary bias, the US might no longer be an outlier in the chart.

Also, is life expectancy the right metric to use? What if a large group of people die very young because of gang membership, and another group (the majority) dies pretty old? What would be interesting to see is the impact over time, in the US, of increased health expenditures on life expectancy, after excluding people who die from gunshots or car accidents. Note that a more stressful life (typical in the US) can cause early death despite higher health expenditures.

Note the massive impact of the USA dot (the outlier) on R^2 (shown in the bottom right corner): it makes R^2 much smaller than it should be (R^2 = 0.51). But R^2 is a bad metric, sensitive to outliers, and should not be used. Use this metric instead, to measure quality of fit. Indeed, the entire black curve going through the cloud is bent too much towards the south-east because of this outlier.
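To make the sensitivity of R^2 to a single point concrete, here is a minimal sketch. It uses synthetic, made-up numbers (not the data behind the chart), fits a straight line rather than the curve shown in the chart, and adds one hypothetical high-spending, low-longevity point. The names `fit_line`, `r_squared`, `spend` and `life` are illustrative assumptions.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """R^2 of the least-squares line fitted to (x, y)."""
    slope, intercept = fit_line(x, y)
    my = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic data, invented for illustration only: spending per person
# and a life expectancy that rises linearly with it, plus small noise.
random.seed(0)
spend = [1000 + 200 * i for i in range(21)]
life = [70 + 0.002 * s + random.gauss(0, 0.5) for s in spend]

# One hypothetical outlier: very high spending, modest life expectancy.
spend_out = spend + [8000]
life_out = life + [78.0]

r2_clean = r_squared(spend, life)
r2_with_outlier = r_squared(spend_out, life_out)
print(f"R^2 without the outlier: {r2_clean:.2f}")
print(f"R^2 with the outlier:    {r2_with_outlier:.2f}")
```

In this toy setup a single extra point is enough to pull R^2 down noticeably, which is the effect described above.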

Comment by Michael Clayton on March 8, 2014 at 11:20am

http://www.economist.com/content/big-mac-index   Perhaps weight cost scales by Big Mac Index?

Then weight life expectancies by actual Big Mac Sales per citizen? :-)

Comment by Vincent Granville on March 8, 2014 at 11:16am

Other interesting fact: there are two main clusters. The one at the bottom where health expenditures directly correlate with longevity, and the bigger one at the top where this correlation disappears. It is as if beyond \$2,000/year per person in health expenditures, there is no additional longevity gain.

If this extra spending does not help with longevity, does it help with something else? Maybe more babies born alive? Better quality of life? Or you could argue that in the US, if expenditures were slashed by half, longevity would be even lower. How to test this hypothesis? Maybe you would need to look at people who spend no money on healthcare, comparing the UK with the US. Maybe health care is so expensive in the US that many people can't afford it and die earlier. Maybe in the US the focus is not on prevention, so problems get treated when it's too late. Maybe other countries will catch up and see longevity go down once they start consuming sugar and fats in vast amounts, just like in the US.

The French drinking habit does not seem to hurt French longevity though, but they've been used to drinking wine for thousands of years, unlike in the US (so there might be some kind of genetic resistance to alcohol).

Comment by Michael Clayton on March 8, 2014 at 10:39am

It may be an example of a contrived graph to "prove" a prior conclusion, as this data has been analyzed many ways recently by various political spin doctors, each proving their spin with the same data plotted differently.

This type of graph can be replotted many ways to learn something interesting if you avoid R-sq nonsense and fit lines based on normality assumptions.

For example:

Normalize the "spent" axis by GDP of country per citizen.

Show quartiles (boxplots) for life expectancies of each country to show variability.

Would love to see each country broken up into wealth segments to see if life expectancy drops vs wealth in different ways.

-----

And in some comments, bad performance of US was mentioned in terms of STRESS.

Having been in many of these countries, and lived in the US so far to age 78, I am sure stress levels are very high in every country as a function of safety net presence or absence.

But why the negativism about robust statistical methods?

They are best for this kind of data, and often prevent people from going down wrong paths for further study.
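As one illustration of a robust method (the comment above does not name a specific one, so the choice of the Theil-Sen estimator here is an assumption), the sketch below compares an ordinary least-squares slope with the median of all pairwise slopes. The data is synthetic and made up, with a true slope of 0.002 and one hypothetical outlier; the function names are illustrative.

```python
import random
from itertools import combinations
from statistics import mean, median


def ols_slope(x, y):
    """Slope of the ordinary least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def theil_sen_slope(x, y):
    """Theil-Sen estimate: median of the slopes over all pairs of points."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]  # skip vertical pairs
    ]
    return median(slopes)


# Synthetic data, invented for illustration only (true slope = 0.002),
# followed by one hypothetical outlier with high spending.
random.seed(2)
spend = [1000 + 200 * i for i in range(21)]
life = [70 + 0.002 * s + random.gauss(0, 0.5) for s in spend]
spend.append(8000)
life.append(78.0)

ols = ols_slope(spend, life)
robust = theil_sen_slope(spend, life)
print(f"OLS slope:       {ols:.5f}")
print(f"Theil-Sen slope: {robust:.5f}")
```

Because the median ignores the minority of pairs that involve the outlier, the robust slope stays much closer to the slope used to generate the bulk of the points.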

Comment by Papadopoulos Zacharias Dimitrios on March 7, 2014 at 10:49pm

I have to agree with Julio Rodriguez. More specifically, when it comes to outliers I find them quite useful in terms of data interpretation and parametrization. Thus I can understand some aspects of the data set long before I start working on it.

Comment by jaap Karman on March 7, 2014 at 10:33pm

Using graphs and drawing lines also says something about interpretation.
You could also bin into 3 groups:
1/ USA, still an outlier, set apart by lower life expectancy and high cost
2/ all those above 78 and spending between 2000 and 6000
3/ those below 78 and spending less than 2000

Is there a reason for those groups? (I agree with Julio.)
Not speculation, but a real, reasoned explanation, compared against the others.

Comment by Elaine Allen on March 7, 2014 at 8:10pm

Interesting chart. So I would say that when you have a picture like this you should always look at the influence statistics or do a jackknife (leave out one country at a time & refit the model) to see how it changes when the 'outlier' is included or excluded. Also, sometimes the 'outlier' is the most interesting point in your data, just like the anomalies that are found when you run a machine-learning algorithm. I agree with Julio - we must try to understand why they are anomalies.
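The leave-one-out refit described above can be sketched as follows. This is a minimal illustration on synthetic, made-up data with hypothetical labels (`country_0`, ..., `outlier`) and a straight-line model, which is an assumption; it only tracks how the fitted slope moves when each point is dropped.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Slope of the ordinary least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def leave_one_out_slopes(x, y):
    """Refit the line once per point, each time with that point removed."""
    slopes = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        slopes.append(ols_slope(x_rest, y_rest))
    return slopes


# Synthetic data, invented for illustration only, plus one hypothetical
# outlier appended as the last point.
random.seed(1)
spend = [1000 + 200 * i for i in range(21)]
life = [70 + 0.002 * s + random.gauss(0, 0.5) for s in spend]
labels = [f"country_{i}" for i in range(21)]
spend.append(8000)
life.append(78.0)
labels.append("outlier")

full_slope = ols_slope(spend, life)
loo = leave_one_out_slopes(spend, life)

# Influence of each point = how far the slope moves when it is left out.
shifts = [abs(s - full_slope) for s in loo]
most_influential = labels[shifts.index(max(shifts))]
print("Most influential point:", most_influential)
```

Points whose removal barely changes the fit are unremarkable; a point whose removal shifts the slope far more than any other is the one worth investigating, whether it turns out to be an error or the most interesting observation.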

Comment by Eric A. King on March 7, 2014 at 5:23am

I'm not an outlier!  I just haven't found the right distribution yet.  X^D

Comment by Julio Rodriguez Martino on March 7, 2014 at 1:51am

All data points are real. There is always an explanation for every "odd" behavior. It is our task as scientists to understand every single point. Then we can decide which ones we use to draw conclusions. This decision must always be well understood.
