Over the years I’ve often been asked by beginners where they should start in statistics, what they should do first, and which parts of statistics they should prioritise to get them to where they want to be (which is usually a higher paid job).
Now, as I’m almost completely self-taught, I don’t really consider myself an authority on where one should get started, and I struggle to answer this question with any great conviction.
Sure, I have some thoughts about this subject, but they are coloured by my own experiences.
So I thought I’d reach out to some of our statistics friends to see what they can bring to the party.
Each of the statisticians in this post was asked the same question:
If you had to start statistics all over again, where would you start?
The answers were astounding — they turned out to be a roadmap of how to become a modern statistician from scratch.
In short, how to be a future statistician without ever needing a single lesson!
There is a schism in statistics, and that is between the frequentists and the Bayesians.
Let’s see what the statisticians have to say about this debate.
“I am not a statistician, nor have I ever had a single course in statistics, though I did teach it at a university. How’s that possible?”
Funnily enough, that was the same for me! So where did he get all his stats from?
“I learned basic statistics in undergraduate physics and then I learned more in graduate school and beyond while doing data analysis as an astrophysicist for many years. I then learned more stats when I started exploring data mining, statistical learning, and machine learning about 22 years ago. I have not stopped learning statistics ever since then”.
This is starting to sound eerily like my stats education. All you need to do is drop the ‘astro’ from astrophysics and they’re identical! So what does he think of starting stats all over again?
“I would have started with Bayesian inference instead of devoting all of my early years to simple descriptive data analysis. That would have led me to statistical learning and machine learning much earlier. And I would have learned to explore and exploit the wonders and powers of Bayesian networks much sooner”.
This is also what Frank Harrell (Twitter: @f2harrell), author and professor of biostatistics at Vanderbilt University School of Medicine in Nashville, thinks about hitting the reset button on statistics. He told me:
“I would start with Bayesian statistics and thoroughly learn that before learning anything about sampling distributions or hypothesis tests”.
“If I had to start statistics all over again, I’d start by tackling 3 basics: t-test, Bayesian probability & Pearson correlation”.
Personally, I haven’t done very much Bayesian stats, and it’s one of my biggest regrets in statistics. I can see the potential in doing things the Bayesian way, but as I’ve never had a teacher or a mentor, I’ve never really found a way in.
Maybe one day I will — but until then I will continue to pass on the messages from the statisticians in here.
Repeat after me:
Learn Bayesian stats.
Learn Bayesian stats.
LEARN BAYESIAN STATS!
As I was reaching out and gathering quotations I got a rather cryptic response from Josh Wills (Twitter: @josh_wills), software engineer at Slack and founder of the Apache Crunch project (he also describes himself as an ‘ex-statistician’):
“Computation before calculus is the pithy answer”, he told me.
This intrigued me, so I asked him if he could elaborate a little, and here is his reply:
“So I think stats can be and is taught in three ways:
1. a set of recipes
2. from the perspective of calculus — mostly integrals and what not, and
3. computationally (like the bootstrap as a fundamental thing)”
“Most folks do the recipes approach, which doesn’t really help with understanding stuff but is what you do when you don’t know calculus”.
Ah, I understand the ‘set of recipes approach’, but I didn’t know anyone was still doing the calculus approach. He went further:
“I was a math major, so I did the calculus based approach, because that’s what you did back in the day. You mostly do some integrals with a head nod to computational techniques for distributions that are too hard to do via integrals. But the computational approach, even though it was discovered last, is actually the right and good way to teach stats”.
Whew, thank God for that — I thought he was saying that we should all learn the calculus approach!
“The computational approach can be made accessible to folks who don’t know calculus, and it’s actually most of what you use in the hard parts of real world statistics problems anyway. The calculus approach is historically interesting, but (and I feel heretical for saying this) it should be relegated to a later course on the history of statistical thought — not part of the intro sequence”.
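To make the computational approach a little more concrete, here is a minimal sketch of the bootstrap in Python. It is an illustration only, not something Josh supplied: the sample values are made up, and `bootstrap_ci` is a hypothetical helper name. The idea is to resample the data with replacement many times, recompute the statistic each time, and read a rough confidence interval off the spread of those estimates, with no integrals required.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Hypothetical helper for illustration; assumes the observations are
    independent draws from the same population.
    """
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read off the empirical alpha/2 and 1 - alpha/2 quantiles.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up sample, purely for demonstration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]
low, high = bootstrap_ci(sample)
print(f"median = {statistics.median(sample):.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

Swapping `stat` for `statistics.mean` or any other function of the sample reuses the same recipe, which is a large part of the appeal. The percentile interval shown here is only one of several bootstrap intervals, and it can be unreliable for very small samples.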
It’s interesting to see the evolution of statistics in this light and shows just how far we’ve come — and in particular how much computers and computing power have developed over the past couple of decades.
It’s truly mind-blowing to think that when I was doing my PhD 20 years ago it was difficult getting hold of data, and when you did get some, you had to network computers together to get enough computing power. Now we’re all swimming in data and err, well, we still struggle to get enough computing power to do what we want — but it’s still way more than we used to have!
I also got a really interesting perspective from Cassie Kozyrkov, Head of Decision Intelligence at Google (Twitter: @quaesita), who told me that she’d:
“Probably enjoy making a bonfire out of printed statistical tables!”
Well, amen to that, but seriously though, where would you start again with stats?
“Simulation! If I had to start all over again, I’d want to start with a simulation-based approach to statistics”.
OK, I’m with you, but why specifically simulation?
“The ‘traditional’ approach taught in most STAT101 classes was developed in the days before computers and is unnecessarily reliant on restrictive assumptions that cram statistical questions into formats you can tackle analytically with common distributions and those nasty obsolete printed tables”.
Got you. So what exactly have you got against the printed tables?
“Well, I often wonder whether traditional courses do more harm than good, since I keep seeing their survivors making ‘Type III errors’ — correctly answering the wrong convenient questions. With simulation, you can go back to first principles and discover the real magic of statistics”.
Statistics has magic?
“Sure it does! My favorite part is that learning statistics with simulation forces you to confront the role that your assumptions play. After all, in statistics, your assumptions are at least as important as your data, if not more so”.
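As a rough illustration of what a simulation-based approach can look like (a sketch with made-up numbers, not Cassie’s own example), here is a permutation test in Python. Instead of looking up a critical value in a printed table, it simulates the null hypothesis directly by shuffling the group labels. The function name `permutation_p_value` and both data sets are assumptions made for this example. Notice that the key assumption, that the labels are exchangeable if there is no real difference, is written into the code rather than hidden inside a formula.

```python
import random
import statistics


def permutation_p_value(a, b, n_sims=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    Hypothetical helper for illustration. Assumption: if there is no real
    difference between the groups, the group labels are exchangeable.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_sims):
        # Simulate the null hypothesis by shuffling the labels.
        rng.shuffle(pooled)
        sim_a, sim_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(sim_a) - statistics.fmean(sim_b)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_sims + 1)


# Made-up measurements, purely for demonstration.
group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
group_b = [10.2, 10.9, 11.1, 10.5, 11.3, 10.0]
p = permutation_p_value(group_a, group_b)
print(f"simulated p-value = {p:.4f}")
```

If a different assumption seems more believable for your data, you change the simulation step, and the consequences of that choice show up directly in the result.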
“I would start with Leo Breiman’s paper on Two Cultures, plus I would study Bayesian inferencing”.
If you haven’t read that paper (which is open access), Leo Breiman lays out the case for algorithmic modelling, where the process that generated the data is treated as an unknown black box and models are judged by how well they predict, rather than assuming up front that the data follow a prescribed statistical model.
This is what Cassie was getting at — statistical models rarely fit real-world data, and we are left to either try to shoe-horn the data into the model (getting the right answer to the wrong question) or switch it up and do something completely different — simulations!
This is an excerpt of my original post, which is quite long, too long to post here in its entirety (there are more than 30 world-class contributors!).
If you’re enjoying reading, you might be interested to hear what Dez Blanchfield had to say about domain experts, or what Michael Friendly and Alberto Cairo said about the past, present and future of data visualisation.
There’s also a free book to download detailing all the comments made by the contributors, including what Jacqueline Nolis and Kristen Kehrer had to say about starting their careers over.
And don’t get me started on the epic suggestions that Natalie Dean and Jen Stirrup had about Information Flow and Detective Work.
Awesome — you really don’t want to miss them!
Read more here