
# How to Lie with Visualizations: Statistics, Causation vs Correlation, and Intuition!

Post adapted from Correlation vs Causation: Visualization, Statistics, and Intuition!

As someone who has a tendency to think in numbers, I love when success is quantifiable.

However, I suppose that means I must accept defeat (or, in true statistician fashion, try to discredit the correlation) when the numbers don't demonstrate what I had hoped for or intuitively believed!

With that, I decided to look into how my working at Cameron relates to the company's stock price. Alongside this analysis, I'll include a quick demo of scaling and data manipulation for visualization.

Of course, this post is meant to highlight one of the basic lessons of statistics in a mildly entertaining way.

To begin, I pulled Stock Price over my first ~90 Days. Since the market is only open on business days, it fits perfectly with the number of days worked.

If only every analysis were this convenient! From there, I merely added a column that counts the number of days worked.

Eventually the data looked something like this:

Neat! Now, let's graph Adjusted Close Price vs Days Worked.

Super! As you can see in this graph, there's obviously no Relationship!

Not so fast. Let's regress Stock Price on Days Worked.

It's important to realize that while visualization is a phenomenal tool and incredibly insightful way to ingest data, it's not the whole story.

With an R-squared of 0.88 and a p-value that only becomes nonzero around the 42nd decimal place, traditional statistics would say we are incredibly confident about these results!

So what do all those numbers really say? Well one interpretation would be that we can explain Stock Price by:

StockPrice = \$75.99 - \$0.29672 × (NumberDaysAlexHasWorked)
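For anyone who wants to see where an intercept, slope, and R-squared like these come from, here is a minimal sketch of a one-variable least-squares fit in plain Python. The data is made up (a downward trend plus a small deterministic wiggle), since the original spreadsheet isn't reproduced here, and the names `ols_fit`, `days_worked`, and `adj_close` are purely illustrative.

```python
def ols_fit(x, y):
    """Simple least-squares fit of y on x; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical stand-in data, NOT the real stock prices from the post.
days_worked = list(range(1, 91))
adj_close = [75.99 - 0.29672 * d + 0.4 * ((d * 37) % 11 - 5) for d in days_worked]

intercept, slope, r_squared = ols_fit(days_worked, adj_close)
print(f"StockPrice = {intercept:.2f} + ({slope:.5f}) * DaysWorked, R^2 = {r_squared:.2f}")
```

The fitted numbers will not match the ones in the post exactly, because the data here is invented; the point is only the mechanics of the fit.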

That's a heck of a deal! I cost a little under 30 cents a day...

WRONG. That's per share. Since the company currently has ~197.45M Shares Outstanding, that means, based on these statistically significant results, I cost \$58,587,364 per day.
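The arithmetic behind that figure is just the per-share slope multiplied by the share count. A quick check, using only the two numbers quoted above (the variable names are illustrative):

```python
shares_outstanding = 197_450_000   # ~197.45M shares, as quoted in the post
slope_per_share = 0.29672          # dollars per share per day, from the regression

cost_per_day = shares_outstanding * slope_per_share
print(f"${cost_per_day:,.0f}")     # prints $58,587,364
```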

Well this is awkward...

Quick! Let's see if we can perform some "Transformations" on the data to get a "Better result".

First, let's Scale Stock Price from 0 (lowest price) to 1 (highest price). To do so, we'll get the Minimum value and Maximum value. With those, we'll be able to get the Spread/ Span.

That calculation is simply Spread= Maximum - Minimum. Simple enough!

Now how do we scale every datapoint? Great question.

We'll take (Stock Price X - Minimum)/ Spread. Boom! Scaled.
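As a rough sketch, the same min-max scaling can be written as a small Python function. The name `min_max_scale` and the sample prices are made up for illustration; the guard against a zero spread is an addition that the spreadsheet version wouldn't need for real price data.

```python
def min_max_scale(values):
    """Scale values so the minimum maps to 0 and the maximum maps to 1."""
    minimum = min(values)
    maximum = max(values)
    spread = maximum - minimum
    if spread == 0:
        raise ValueError("all values are identical, so there is nothing to scale")
    return [(v - minimum) / spread for v in values]


# Hypothetical prices, just to show the shape of the output.
print(min_max_scale([50.0, 62.5, 75.0]))  # [0.0, 0.5, 1.0]
```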

Now let's graph that!

Oh great! No relationship! Just as I wanted.

Whoa, whoa, whoa... that doesn't seem right? Ok? Then what do you propose? Scale the number of days worked?

Well I guess we could try that. Same formula/ process applied to days worked.

Ok, so maybe there's a relationship here... I suppose we should Invert days worked so that the lines go in the same general direction.

See how the Orange line (Days worked) currently starts at 0 and goes to 1. Let's flip that. How? We'll apply the formula Inverted Days Worked= 1-Scaled Days Worked. Now the line is flipped!
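In code, the flip is a one-liner applied to the already-scaled column. A tiny sketch with made-up scaled values (the function name `invert_scaled` is illustrative):

```python
def invert_scaled(scaled_values):
    """Flip values already scaled to [0, 1], so 0 becomes 1 and 1 becomes 0."""
    return [1 - s for s in scaled_values]


scaled_days_worked = [0.0, 0.25, 0.5, 1.0]   # hypothetical scaled day counts
print(invert_scaled(scaled_days_worked))     # [1.0, 0.75, 0.5, 0.0]
```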

Let's Graph them.

Holy Moly. I see it now.

So now we have taken two vectors of differing relative magnitudes, scaled them to an equivalent range, and controlled for directionality, thereby enabling a linear depiction of the relationship and a more intuitive visualization!

Sorry, that was unnecessary. Nerdiness got the best of me.

So what then does this mean? Now what are the results of the regression?!

You better sit down for this... The regression results, in absolute terms, are EXACTLY the same. Even though the equation (shown on the final graph) is apparently different, once we "undo" all the scaling and transformations, and we get the numbers back into their original values... they will be the exact same as the original!

Hm.. Why is that? Because we're just transforming data! Every step (shift, stretch, flip) is the same linear rescaling applied to every data point, so we're not changing the underlying geometry of the relationships. Relative to one another, the points sit exactly where they did before. We didn't pick out one data point and change JUST that one. We changed all of them at the same time.
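Here is a minimal sketch of that "undo" step, again on made-up data rather than the real prices. It fits the raw data, fits the scaled-and-inverted data, and then converts the second set of coefficients back into original units. The helper names and the back-transformation formulas are my own working, derived from the scaling formulas described above, so treat them as an illustration rather than the exact spreadsheet steps.

```python
def ols_fit(x, y):
    """Simple least-squares fit of y on x; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy ** 2 / (sxx * syy)


def min_max_scale(values):
    """Scale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical stand-in data, NOT the real stock prices from the post.
days = list(range(1, 91))
prices = [75.99 - 0.29672 * d + 0.4 * ((d * 37) % 11 - 5) for d in days]

# Fit 1: raw data.
a, b, r2 = ols_fit(days, prices)

# Fit 2: scaled prices against scaled-then-inverted days.
scaled_prices = min_max_scale(prices)
inverted_days = [1 - s for s in min_max_scale(days)]
a_s, b_s, r2_s = ols_fit(inverted_days, scaled_prices)

# Undo the transformations to express fit 2 in original units.
x_min, x_spread = min(days), max(days) - min(days)
y_min, y_spread = min(prices), max(prices) - min(prices)
b_back = -b_s * y_spread / x_spread
a_back = y_min + y_spread * (a_s + b_s * (1 + x_min / x_spread))

print(f"raw fit:          intercept={a:.4f}, slope={b:.5f}, R^2={r2:.4f}")
print(f"scaled fit:       intercept={a_s:.4f}, slope={b_s:.5f}, R^2={r2_s:.4f}")
print(f"back-transformed: intercept={a_back:.4f}, slope={b_back:.5f}")
```

The scaled fit reports a different-looking equation (and a positive slope, thanks to the flip), but the R-squared is unchanged and the back-transformed coefficients agree with the raw fit up to floating-point rounding.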

In other words, we're just moving the data's perspective in a multi-dimensional space, relative to US the viewers. You can zoom, stretch, angle, compress, and turn data in any way you want!

Let's take a second and think about this. For a moment, think of our data as a cube-- just to help conceptualize what's going on.

If we turn, flip, invert, scale, zoom out, or angle the cube in any way-- has the cube itself changed? Absolutely not. It's the exact same cube!

We're simply looking at it from a different perspective. So when we transform a "Data Vector / Cube" (as long as we "undo" those changes when we analyze the data in real terms)-- we're just finding that perfect angle to tell our story and create a compelling visual. That's powerful and exciting!

Victory is mine! Data hath been conquered!

Even with these marvelous findings, we must address the issue of primary concern--Causation vs Correlation! Based on statistics--- "data driven" results, and the interpretation we proposed earlier-- I'm the worst!

However, that's a myopic approach to statistics. Rather-- I bet you there's a 3rd variable indicative of the movement of stock price. What does days worked really represent? It is merely a count of the past ~90 days. So what else has happened in that period?

Well, if we consider the fact that the company is a major oilfield services firm or pick our head up and look at the companies and markets around us-- we quickly realize-- the missing link is the price of oil (at least I certainly hope so!).
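To make the "third variable" idea concrete, here is a small sketch on purely synthetic data. Nothing in it is real oil or stock data: the "oil" series is simply constructed to drift down over the 90 days, and the "stock" series is constructed to depend only on oil. The names `pearson_r` and `partial_r` are illustrative helpers, and the partial correlation used here is the standard first-order formula.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic assumption: oil drifts down over time, stock follows oil only.
days = list(range(1, 91))
oil = [100 - 0.5 * d + random.gauss(0, 2) for d in days]
stock = [20 + 0.5 * o + random.gauss(0, 1) for o in oil]

r_raw = pearson_r(days, stock)
r_given_oil = partial_r(days, stock, oil)

print(f"correlation(days, stock):             {r_raw:.3f}")
print(f"correlation(days, stock | oil price): {r_given_oil:.3f}")
```

In this constructed example the raw correlation between days and stock price is strongly negative, while the correlation that remains after accounting for oil is much weaker. That is the pattern you would expect if days worked were only standing in for something else that moved over the same period.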

What you should realize is that these relationships aren't always evident or obvious! In fact, visualizations in their raw form could disguise relationships! Statistics is still a subjective science-- subject to the availability of information and robustness of the analyst's forethought and interpretation!

More importantly, we can identify the role the macro oil market plays in stock price, rather than otherwise extraneous relationships! For brevity's sake, we'll omit another full analysis saga.

Most importantly, this should help to exemplify one of the most exciting value potentials of "Big Data". Essentially, we now have access to incredible amounts of information tied to the "Universal Variable" of time. With that common point to relate on, we can now see how major indexes, markets, events, weather patterns, customer announcements, etc. interrelate!

As we move towards an even smaller and more interconnected world, expect to see more "Universal" data points (and actively promote them: in the long run, it'll make your analysis more resilient and dynamic!).


Comment by Dominic Lusinchi on December 17, 2017 at 8:37am

Thank you, Duncan. I will try that, although for statistical analyses I do not use Excel but either Stata, SPSS or R. But I will try that just to see what results I get. Best - Dominic

Comment by duncan on December 17, 2017 at 2:59am

For Dominic ...

You are right that Excel does repeat the 95% levels in the ToolPak Regression utility. In the dialogue box for regression you can change the confidence level from 95% to whatever you want. When you do that, you will get the 95% results and your own results instead of 95% twice.

Duncan

Comment by Sriram Sitharaman (Latentview) on January 27, 2015 at 7:46pm
Comment by Sione Palu on January 12, 2015 at 8:03am

The data transformations (not all) can be said to be invariant:

http://en.wikipedia.org/wiki/Invariant_(mathematics)

Comment by Alex Jones on January 9, 2015 at 9:58am

Ah interesting. Great catches!

Honestly, I never use Excel for analysis-- just too many challenges (scale, robustness, etc)! But that's really the overarching theme of this article-- that statistics (and even visualization) isn't just something you run in Excel or do without proper forethought.

So in the spirit of things, the flaws are by design and you're on point-- there are a lot of errors and we should be cautious of being bamboozled by "data-driven results" because not all of them are created equal!

Comment by Dominic Lusinchi on January 9, 2015 at 8:57am

And another thing - which has nothing to do with your analysis. I'm sure you have noticed that Excel generates 95% confidence interval for the regression coefficients - twice!! This has been going on since I can remember using Excel (i.e. the 90s). I am using Excel 2010 now, and Microsoft has still not fixed this problem. Have they fixed it in the most recent version? I'm not a programmer, so I don't know how difficult it would be to fix this problem.

Comment by Dominic Lusinchi on January 8, 2015 at 1:20pm

Alex: should you not use a different data analytic approach since OLS regression assumes that observations are independent? Is it safe to assume that stock prices are independent? (I'm not a financial analyst.)