# Sometimes outliers are real data

How do you know whether an outlier is the result of a data glitch or a real data point, and perhaps not an outlier at all? It is a difficult question to answer, but the chart below shows that in some cases, the outlier is not an error.

In this example, you could argue that we are not using the right metrics: comparing health expenditures in the US (twice the average among developed countries) without accounting for the fact that US salary (after tax) is also twice the average among developed countries leads to a bias. Once corrected for this salary bias, the US might no longer be an outlier in the above chart.

Also, is life expectancy the right metric to use? What if a large group of people die very young because of gang membership, while another group (the majority) dies fairly old? It would be interesting to see the impact over time, in the US, of increased health expenditures on life expectancy, after excluding people who die from gunshots or car accidents. Note that a more stressful life (typical in the US) can cause early death despite higher health expenditures.

Note the massive impact of the USA dot (outlier) on R^2 (at the bottom right corner), making it much smaller than it should be (R^2 = 0.51). But R^2 is a bad metric: it is sensitive to outliers and should not be used. Use this metric instead, to measure quality of fit. Indeed, the entire black curve going through the cloud is bent too much towards the southeast because of this outlier.
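To make the sensitivity of R^2 concrete, here is a minimal sketch in plain Python. The data points are synthetic and chosen only for illustration; they are not the values behind the chart, and the fit is a straight line rather than the curve shown there. The helper name `r_squared` is an assumption of this sketch.

```python
def r_squared(xs, ys):
    """R^2 of an ordinary least-squares straight-line fit of ys on xs.

    For a simple linear fit, R^2 equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Synthetic data (NOT the chart's values): spending in thousands of dollars
# per person, life expectancy in years.
spend = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
life = [74.0, 76.0, 77.5, 79.0, 80.0, 81.0, 81.5, 82.5]

# One high-spending, low-longevity point added to mimic the outlier.
r2_clean = r_squared(spend, life)
r2_with_outlier = r_squared(spend + [8.2], life + [78.0])

print(f"R^2 without outlier: {r2_clean:.2f}")
print(f"R^2 with outlier:    {r2_with_outlier:.2f}")
```

With these made-up numbers, a single added point is enough to pull R^2 down sharply, which is the effect described above.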

Comment

Comment by Mohamed Sameer on September 21, 2017 at 8:11pm

Good read on Outliers

Comment by Tomas Ramirez on December 27, 2016 at 11:08pm

1- Social Services has nothing to do with Health Care: https://stats.oecd.org/glossary/detail.asp?ID=1216. Social Services is an aggregate that includes Health Care: https://stats.oecd.org/glossary/detail.asp?ID=2441. So you have to compare apples with apples and pears with pears.

2- Even though they are included in the OECD, Chile, Mexico, Brazil or Latvia are not developed countries; you must compare the US with countries that have an equivalent GDP per capita, in the upper part of the table. https://en.wikipedia.org/wiki/List_of_countries_by_average_wage.

3- In any case, you have to use GDP per capita, because tax policies and health systems are different in the US and Europe. In Europe, in general, health care does not depend on salaries, and everybody has access no matter the salary. In the rest of the OECD countries, most of the expenditure is public. In 2012, 48% of health spending in the United States was publicly financed, well below the average of 72% in OECD countries. https://www.oecd.org/unitedstates/Briefing-Note-UNITED-STATES-2014.pdf If you remove the US, the public-sector share in the OECD is much higher.

4- Health expenditure as a share of GDP, selected G7 countries, 2005-13 http://www.keepeek.com/Digital-Asset-Management/oecd/social-issues-...

5-In terms of measuring the efficiency of the system, those countries that have lower per capita costs with better life expectancies use their resources better. If in order to have the same life expectancy one country spends twice as much as another, that country is less efficient or has a health system less efficient than the other.

6-Countries such as Greece, Korea, Spain, Italy, Portugal and even Chile have a life expectancy much higher than that of the United States with a per capita cost of a quarter.

7-The life expectancy of the United States can only be compared with that of the underdeveloped countries of the OECD.

8-A study of international health care spending levels published in the health policy journal Health Affairs in the year 2000 found that the United States spends substantially more on health care than any other country in the Organization for Economic Co-operation and Development (OECD), and that the use of health care services in the U.S. is below the OECD median by most measures. The authors of the study conclude that the prices paid for health care services are much higher in the U.S. than elsewhere.[117] While the 19 next most wealthy countries by GDP all pay less than half what the U.S. does for health care, they have all gained about six years of life expectancy more than the U.S. since 1970.[96] https://en.wikipedia.org/wiki/Health_care_in_the_United_States#Syst...

9-While the cost of health services in the United States was in line with the rest of the countries until 1980, after that date the cost in the United States has grown twice as much as in the rest of the developed countries. http://www.commonwealthfund.org/publications/issue-briefs/2015/oct/...

10-Finally I wanted to express that a country where people are not shooting each other is definitely a healthier country.

Comment by Vivek Agarwal on February 9, 2015 at 11:47pm

Yes, outliers may be the most valuable data points for some business problems. I also posted an article on this some time back on LinkedIn. Below is the link:

https://www.linkedin.com/pulse/outliers-good-bad-vivek-agarwal

Please read & provide your comments.

Comment by Stephan Meyn on February 9, 2015 at 6:24pm

Outliers are always real data. It's only when what makes them the outlier is specifically not relevant that they become non real data.

This demonstrates nicely why visualisation is such a powerful tool. And it raises for me the one question: is Watson going to be the application that will be able to replace the need for visualisation - by providing advanced insight in handling outliers to the point that we can much quicker understand why something is an outlier and whether we can disregard it or not?

Comment by Jordi Raso on February 8, 2015 at 8:13am

I've read this post rather late, but if someone has interest, I've found this interesting explanation of the USA outlier:

http://paxonbothhouses.blogspot.com.es/2013/09/5-of-people-account-...

It confirms my impression and yours that behind some outliers are very interesting hidden data.

Thanks for the interesting post!!

Comment by Maciej Rudziński on July 15, 2014 at 3:07pm

If you do not know whether an outlier is important, it is better to keep it and to use robust tests that reduce its weight.

Also, having a correlation for clusters could be more reasonable in this case.
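As one illustration of a robust approach, the sketch below compares an ordinary least-squares slope with the Theil-Sen slope (the median of all pairwise slopes). Theil-Sen is a substitute chosen here for simplicity: it limits an outlier's influence through medians rather than explicit down-weighting, and it is not necessarily the method the comment has in mind. The data are synthetic, not the chart's values, and the function names are assumptions of this sketch.

```python
from itertools import combinations
from statistics import median


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Theil-Sen estimator: median of the slopes over all pairs of points."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    return median(slopes)


# Synthetic data (NOT the chart's values); the last point plays the outlier.
spend = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 8.2]
life = [74.0, 76.0, 77.5, 79.0, 80.0, 81.0, 81.5, 82.5, 78.0]

slope_ols = ols_slope(spend, life)
slope_robust = theil_sen_slope(spend, life)

print(f"OLS slope:       {slope_ols:.2f} years per $1,000")
print(f"Theil-Sen slope: {slope_robust:.2f} years per $1,000")
```

On this toy data the outlier stays in the sample, yet the median-based slope remains close to the trend of the other points, while the least-squares slope is dragged down.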

Good source for robust metrics:

Comment by Vincent Granville on March 12, 2014 at 10:15am

Also interesting is the fact that the two outliers (Russia and USA) are below the curve. It looks like it is possible to strongly underperform, but you can't really beat the upper limit of 83 years for longevity. Note that ISR and ITA, which seem to be well above the curve, are not really that much above it: once you remove the two real outliers (Russia and USA), the curve will move up and ISR and ITA will get closer to it.

Comment by Vincent Granville on March 12, 2014 at 7:47am

I agree that outliers are the most interesting points. I find the relationship (curve) between X and Y to be quite interesting too, as it can sometimes lead to a physical and thus true causal interpretation. In this case, the interesting fact is that beyond \$2000 per year per person in health expenditures, you gain very little in terms of longevity, assuming cross-country comparisons are valid (this could be another apples-to-oranges comparison).

Now if your focus is on forecasting, you need to understand outliers (maybe refine your model) and most of the time, remove them. They introduce a systematic bias in your forecasts.

Comment by Michael Clayton on March 10, 2014 at 1:26pm

Based on the comments so far, one could conclude that this simplistic use of global data sources ignoring known causal factors and known personal and state income distribution impacts can be used to mislead in many ways.

1. Saving every baby rather than letting the weakest die may be skewing the US distribution of life expectancy (as well as the "average" of anything else about population performance or health).

2. Insert here your favorite explanation! It's a rich, complex dataset in raw form, where the "flaw of averages" simply tells us something is unique about the US in healthcare results vs spending.

Comment by George Vander Meulen on March 10, 2014 at 12:55pm

The US spends more than \$8200 per capita to achieve a longevity outcome worse than every other country spending more than \$2000? Hmmm...

Omitting the US would obviously make the curve steeper. Does that imply that increased spending leads to longer life? Doesn't this chart also imply that life expectancy is limited to about 84 years (for now) regardless of what you spend? These two observations seem to be at odds with each other.

Just for fun I tried a little scoring system. It looks like the US spends about 8.2K to live 78 years, so I gave them a score of 9.5 (78/8.2). Italy gets a 27.7 (83/3) and Russia gets 43.1 (69/1.6). Who would have ever guessed that Italy and Russia deliver health care so much more efficiently?

One could also conclude that overspending on health care leads to diminishing returns. There's a lot of stuff here.
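The scoring above is simply life expectancy divided by spending in thousands of dollars. A small sketch reproduces the arithmetic, using only the approximate figures quoted in this comment; the variable names are assumptions of the sketch.

```python
# (life expectancy in years, spending in thousands of dollars per capita),
# approximate values as quoted in the comment above.
countries = {
    "US": (78, 8.2),
    "Italy": (83, 3.0),
    "Russia": (69, 1.6),
}

# Score = years of life expectancy per $1,000 spent, rounded to one decimal.
scores = {
    name: round(years / spend_k, 1)
    for name, (years, spend_k) in countries.items()
}

for name, score in scores.items():
    print(f"{name}: {score}")
```

A ratio like this rewards low spending regardless of outcome, which is why Russia scores highest despite the lowest life expectancy of the three.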
