Let me share a little bit of data-analysis wisdom I picked up in graduate school, in the Psych Department*. Social psychology relies on experimental research designs -- our interest is in identifying cause and effect. In an experiment you randomly assign subjects to groups, and administer a particular level of an independent variable to each group. Then you measure a dependent variable to see if the groups differ. Since you, the experimenter, have controlled the group treatments, you can know if the effect was caused by your manipulation. With the right statistical tests you can say with known confidence whether the variable you manipulated caused a difference in the variable you measured.
In most of what we call “data science” we are not conducting actual experiments. But we often do have numerical data separated into groups, say, gender or age or nationality or brand. And we want to know, what is the effect of gender, or brand -- or both -- on some measurement? You can learn a lot by looking at your data as if it had come from an experiment. Specifically, you can look at main effects and interactions of your independent variables.
We will imagine an experiment. Say you had a headache drug. You recruit a bunch of people with chronic headaches and randomly assign each one to a group. You give members of Group One 0.5 milligrams, Group Two gets 5 milligrams of your wonder-drug, and Group Three gets 20 milligrams of the stuff. Then you wait some amount of time and assess their headaches. A high Cranial Felicity (CF) score means they feel pretty good; a low score means a bad headache.
Now the scientific question is, how ya gonna tell if the stuff worked? The three group means are bound to differ by some amount, simply due to chance, and you would like to know whether the observed difference was caused by your drug or by random noise.
The ideal answer is that you would use analysis of variance, or ANOVA. But let's not get into that today. We don’t need to get caught up in a bunch of squishy psychology stuff, like statistics. You need to be able to show your boss a graph so he’ll know what to invest in tomorrow, or who's going to win the election.
So good idea, let's make a graph. With three levels of one independent variable we can make a line graph. Then your eyeballs can do a statistical test to determine whether it appears the drug had an effect; and this is tested perceptually by answering the question, is the line level? If it is level all the way across then there is no effect. Slope means causation. Here is a graph with almost no slope. It does not appear that Dosage has any effect on headaches.
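As a sketch of the eyeball test, here is what computing and plotting those group means might look like. The CF scores are invented for illustration, as are the group labels:

```python
# Hypothetical CF scores for the three dosage groups (invented numbers).
groups = {
    "0.5 mg": [62, 58, 65, 60],
    "5 mg":   [61, 63, 59, 62],
    "20 mg":  [60, 64, 61, 59],
}

# The "eyeball test": plot the group means as one line.
means = {dose: sum(scores) / len(scores) for dose, scores in groups.items()}
print(means)  # nearly identical means -> a level line, no apparent Dosage effect

# With matplotlib you would draw it like so:
# import matplotlib.pyplot as plt
# plt.plot(list(means), list(means.values()), marker="o")
# plt.xlabel("Dosage"); plt.ylabel("Mean Cranial Felicity"); plt.show()
```

With these made-up numbers the means land within a fraction of a point of one another, which your eyeballs perceive as a level line.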
The line may be level in sections, and sloped in other sections. Maybe Group Three's CF rating is higher than Two's. Or maybe Two and Three are higher than One. In either case, your tip-off is that the line is not level. Here are a couple of patterns you might see.
A more interesting experiment involves two independent variables. Let's say you have reason to think your headache drug works on Type Z people but not Type Y. I don't know what that means, it doesn't matter; maybe redheads, or people with eleven or more fillings in their teeth. Maybe you have a group of college students and a group of seniors. Now you have a 3x2 research design, with three levels of Dosage and two levels of Type.
Let's plot again with both independent variables. Hmm, it does look like something is going on between the two Types. The lines are level, but it appears Cranial Felicity is higher for Type Y than for Type Z subjects across all Dosage levels. Your eyeballs perceive this as a gap between the two groups’ lines.
In fluffy social-science statistics we call this a main effect for Type. One independent variable has a consistent effect across all levels of the other independent variable. And the way your eyeballs know this is that there is either a slope in the lines indicating an effect of Dosage or a gap between them indicating a main effect for Type. In the example above there is a main effect for Type of subject, because there is separation between the lines, and none for Dosage, because the lines are level.
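The gap your eyeballs see can be put in numbers. A minimal sketch, with invented cell means shaped like the graph described above (Type Y's line sitting above Type Z's, both level):

```python
# Hypothetical 3x2 cell means, Type x Dosage (invented numbers).
# Both lines are level, but Type Y sits above Type Z at every dose:
# a main effect for Type, none for Dosage.
cell_means = {
    "Y": {"0.5 mg": 70.0, "5 mg": 70.5, "20 mg": 70.0},
    "Z": {"0.5 mg": 55.0, "5 mg": 55.5, "20 mg": 55.0},
}

# The "gap": the average height of each Type's line.
type_means = {t: sum(d.values()) / len(d) for t, d in cell_means.items()}
gap = type_means["Y"] - type_means["Z"]
print(gap)  # a constant separation between the lines at every dose
```

The averaged-over-Dosage difference between the two lines is exactly what "main effect for Type" means.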
Sloping lines and a gap: look for these to discover main effects when the lines are parallel.
Now it starts to get cool. You have two independent variables, and you could have two main effects: say, your drug works better in higher doses, and it also works better on Type Y subjects. Your eyeballs know this because there is a gap between the lines, they are not level, and they are parallel. Type has an effect (gap) and Dosage has an effect (slope), and each is consistent across levels of the other (they're parallel).
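"Parallel" has a simple numerical meaning: the gap between the lines is the same at every dose. A sketch with invented cell means showing two main effects and no interaction:

```python
# Hypothetical cell means: sloping, parallel lines (invented numbers).
# Dosage has a slope, Type has a gap, and the gap is constant:
# two main effects, no interaction.
cell_means = {
    "Y": {"0.5 mg": 60.0, "5 mg": 65.0, "20 mg": 70.0},
    "Z": {"0.5 mg": 50.0, "5 mg": 55.0, "20 mg": 60.0},
}
doses = ["0.5 mg", "5 mg", "20 mg"]

# Parallel means the Y-minus-Z gap is identical at every dose.
gaps = [cell_means["Y"][d] - cell_means["Z"][d] for d in doses]
print(gaps)  # [10.0, 10.0, 10.0] -> parallel lines
```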
So far this is good, you have found that your drug works, and also that it is better for one kind of population than another. It's like you've done two experiments in one! The boss likes it when you're efficient. Plus, you've got dataviz, and you know management loves that.
But this gets even more powerful. Look at the next graph, where the lines are not parallel. This is what squishy social scientists call an interaction, and it means that the effect of one independent variable depends on the level of another one. The effect of the Dose of your wonder-drug depends on the Type of subject -- or, equivalently, the effect of Type depends on Dosage. In this chart, the lower dose works better for Type Z subjects and the higher dosages are better for Type Y.
It is sometimes assumed that an effect is just a sum of its causes, but this is often not the case in interesting systems. Causes can cancel one another out or potentiate one another; causes can have effects on other causes. Think of the common regression model: it is typically, quite literally, a sum of hypothesized causes -- economists, I'm looking at you now. It is important to understand interaction effects if you want to make predictions about, or gain understanding of, a system that contains any real degree of complexity. You can put interactions into a regression model, of course, but you have to know they’re there; you have to specify them.
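To make "you have to specify them" concrete, here is a sketch using numpy. The data are invented; the point is that the additive model is literally a sum of causes, while the interaction model adds a product column:

```python
import numpy as np

# Hypothetical data: dose in mg, type coded 0 (Y) / 1 (Z), CF response.
dose = np.array([0.5, 5.0, 20.0, 0.5, 5.0, 20.0])
typ  = np.array([0.0, 0.0, 0.0,  1.0, 1.0, 1.0])
cf   = np.array([58.0, 65.0, 72.0, 68.0, 60.0, 52.0])

# Additive model: CF ~ b0 + b1*dose + b2*type  (a pure sum of causes).
X_add = np.column_stack([np.ones_like(dose), dose, typ])
# Interaction model adds the product term dose*type -- you must specify it.
X_int = np.column_stack([X_add, dose * typ])

beta_add, *_ = np.linalg.lstsq(X_add, cf, rcond=None)
beta_int, *_ = np.linalg.lstsq(X_int, cf, rcond=None)
# The coefficient on dose*type captures "the effect of Dose depends on Type":
# here Type Y improves with dose while Type Z gets worse.
print(beta_int)
```

In these invented numbers the dose slope is positive for Type Y and the interaction coefficient is negative, which is exactly the "it depends" story an additive model cannot tell.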
Your visual tip-off to an interaction is that lines are not parallel. Sometimes they form a kind of trapezoid, sometimes they cross each other in a big “X,” sometimes they stick out in opposite directions. Sometimes one line is level while another slopes. This non-parallelism means the effect of the level of one thing depends on the level of the other thing.
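The non-parallelism check is the same gap calculation as before, except now the gap changes. A sketch with invented cell means for crossing lines (the big “X”):

```python
# Hypothetical cell means with crossing lines (invented numbers):
# the low dose favors Type Z, the high dose favors Type Y.
cell_means = {
    "Y": {"0.5 mg": 55.0, "5 mg": 62.0, "20 mg": 70.0},
    "Z": {"0.5 mg": 68.0, "5 mg": 60.0, "20 mg": 51.0},
}
doses = ["0.5 mg", "5 mg", "20 mg"]

# Non-parallel means the Y-minus-Z gap changes with dose.
gaps = [cell_means["Y"][d] - cell_means["Z"][d] for d in doses]
parallel = max(gaps) == min(gaps)
print(gaps, parallel)  # unequal gaps -> an interaction
```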
And by the way, you can do this with any number of levels of your independent variables. You could include 50 and 75 milligrams, and Type U, V, and X. Slope, gaps, and parallelism work just the same.
In the presence of main effects you can say things about one or the other causal variable, like, “The drug works best at a low dose,” or, “The drug works better for Type Y than for Type Z subjects.” If there is a slope to the lines and they are parallel you can say both: “A low dose is best, and the drug works best on Type Y subjects.” Those are main-effect statements.
It is crucially important to know when you have a statistical interaction. When you see an interaction between causes you can make a more nuanced statement about causality, for instance, “Type Z subjects respond to a lower dose more than to a higher one, but Type Y subjects respond better to a moderate or high dose.” Identifying causal interactions gives you a much better understanding of the system you are studying.
It may be possible to have main effects in the presence of interactions, but you want to study the topic more thoroughly before you can say so. It may turn out, for instance, that your results are affected by the levels of the independent variables that you sampled, and that the main effects would wash out if you had sampled different levels. If you have an interaction, you should almost always ignore the main effects.
Independent variables are called factors in this kind of model, and no, it has nothing to do with factor analysis. Or factoring. If we actually analyzed the variance this would be called a factorial model. It can get tricky with more than two independent variables, but you can set two charts next to one another and eyeball the same things -- slopes, gaps between the lines, and the presence or absence of parallelism. Higher-order interactions can be very difficult to explain.
I'm talking here about continuous numeric data from grouped observations. Statistically this is ANOVA, not regression (though both are variations of the same general linear model). ANOVA is a potentially complicated set of methods for working with variance between and within groups, and with your degrees of freedom, to test a causal hypothesis. You need a few graduate seminars to get that; it won't be explained in an infographic titled “The Eleven Absolutely Necessary Statistical Tools For Your Data Science Toolkit.” You can put confidence intervals into your graphs to get an informal measure of significant differences; these graphs are just for discovering effects, and how you report them is up to you and your statistical quality standard.
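For the curious, here is an informal sketch of what the between-and-within-groups machinery looks like: a one-way F statistic computed by hand on invented CF scores. For real work you would use a tested routine (scipy's `stats.f_oneway`, for instance) rather than this:

```python
# Hypothetical CF scores per dosage group (invented numbers).
groups = [
    [62.0, 58.0, 65.0, 60.0],   # 0.5 mg
    [61.0, 63.0, 59.0, 62.0],   # 5 mg
    [70.0, 74.0, 71.0, 73.0],   # 20 mg
]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total observations
grand = sum(sum(g) for g in groups) / n  # grand mean

# Between-groups and within-groups sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within  = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# F = (SS_between / df_between) / (SS_within / df_within).
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))  # a large F: the 20 mg group stands apart
```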
Even without analysis of variance you can squeeze a lot of insight out of your dataset by posing it as a factorial design, with one or more hypothesized causes, and looking at the pattern. Graphing it in this simple form can help your eyeballs discover the story that is hidden between the numbers.
*After writing this post I decided to see what similar discussions existed online already. Interestingly, the closest exposition to this is a very thorough and good one posted by the Psych Department at UNC Chapel Hill -- still teaching the same approach as when I went there. To learn more, follow that link.