Over the years I’ve often been asked by beginners where they should start in statistics, what they should do first, and which parts of statistics they should prioritise to get them to where they want to be (which is usually a higher paid job).

Now, as I’m almost completely self-taught I don’t really consider myself an authority on where one should get started, and I struggle to answer this question with any great conviction.

Sure, I have some thoughts about this subject, but they are coloured by my own experiences.

So I thought I’d reach out to some of our statistics friends to see what they can bring to the party.

Each of the statisticians in this post was asked the same question:

If you had to start statistics all over again, where would you start?

The answers were astounding - they turned out to be a roadmap of how to become a modern statistician from scratch.

In short, how to be a future statistician without ever needing a single lesson!

Frequentist Statistics vs Bayesian Statistics

There is a schism in statistics, and that is between the frequentists and the Bayesians.

Let’s see what the statisticians have to say about this debate.

We start with Kirk Borne (Twitter: @KirkDBorne), astrophysicist and rocket scientist (well, rocket data scientist). Surprisingly, he tells me he’s never had any interest in being an astronaut!

“I am not a statistician, nor have I ever had a single course in statistics, though I did teach it at a university. How’s that possible?”

Funnily enough, that was the same for me! So where did he get all his stats from?

“I learned basic statistics in undergraduate physics and then I learned more in graduate school and beyond while doing data analysis as an astrophysicist for many years. I then learned more stats when I started exploring data mining, statistical learning, and machine learning about 22 years ago. I have not stopped learning statistics ever since then”.

This is starting to sound eerily like my stats education.
All you need to do is drop the ‘astro’ from astrophysics and they’re identical! So what does he think of starting stats all over again?

“I would have started with Bayesian inference instead of devoting all of my early years to simple descriptive data analysis. That would have led me to statistical learning and machine learning much earlier. And I would have learned to explore and exploit the wonders and powers of Bayesian networks much sooner”.

This is also what Frank Harrell (Twitter: @f2harrell), author and professor of biostatistics at Vanderbilt University School of Medicine in Nashville, thinks about hitting the reset button on statistics. He told me:

“I would start with Bayesian statistics and thoroughly learn that before learning anything about sampling distributions or hypothesis tests”.

And Lillian Pierson, CEO of Data-Mania (Twitter: @Strategy_Gal), also mentioned Bayesian statistics when I asked her where she would start:

“If I had to start statistics all over again, I’d start by tackling 3 basics: t-test, Bayesian probability & Pearson correlation”.

Personally, I haven’t done very much Bayesian stats, and it’s one of my biggest regrets in statistics. I can see the potential in doing things the Bayesian way, but as I’ve never had a teacher or a mentor I’ve never really found a way in.

Maybe one day I will, but until then I will continue to pass on the messages from the statisticians in here.

Repeat after me:

Learn Bayesian stats.

Learn Bayesian stats.

LEARN BAYESIAN STATS!

Statistical Recipes vs Calculus vs Simulated Statistics

As I was reaching out and gathering quotations I got a rather cryptic response from Josh Wills (Twitter: @josh_wills), software engineer at Slack and founder of the Apache Crunch project (he also describes himself as an ‘ex-statistician’):

“Computation before calculus is the pithy answer”, he told me.

This intrigued me, so I asked him if he could elaborate a little, and here is his reply:

“So I think stats can be and is taught in three ways:

1. a set of recipes
2. from the perspective of calculus - mostly integrals and what not, and
3. computationally (like the bootstrap as a fundamental thing)”

“Most folks do the recipes approach, which doesn’t really help with understanding stuff but is what you do when you don’t know calculus”.

Ah, I understand the ‘set of recipes approach’, but I didn’t know anyone was still doing the calculus approach. He went further:

“I was a math major, so I did the calculus based approach, because that’s what you did back in the day. You mostly do some integrals with a head nod to computational techniques for distributions that are too hard to do via integrals. But the computational approach, even though it was discovered last, is actually the right and good way to teach stats”.

Whew, thank God for that - I thought he was saying that we should all learn the calculus approach!

“The computational approach can be made accessible to folks who don’t know calculus, and it’s actually most of what you use in the hard parts of real world statistics problems anyway. The calculus approach is historically interesting, but (and I feel heretical for saying this) it should be relegated to a later course on the history of statistical thought - not part of the intro sequence”.

It’s interesting to see the evolution of statistics in this light; it shows just how far we’ve come, and in particular how much computers and computing power have developed over the past couple of decades.

It’s truly mind-blowing to think that when I was doing my PhD 20 years ago it was difficult getting hold of data, and when you did get some, you had to network computers together to get enough computing power.
Now we’re all swimming in data and, err, well, we still struggle to get enough computing power to do what we want - but it’s still way more than we used to have!

Simulated Statistics is the New Black

I also got a really interesting perspective from Cassie Kozyrkov, Head of Decision Intelligence at Google (Twitter: @quaesita), who told me that she’d:

“Probably enjoy making a bonfire out of printed statistical tables!”

Well, amen to that, but seriously though, where would you start again with stats?

“Simulation! If I had to start all over again, I’d want to start with a simulation-based approach to statistics”.

OK, I’m with you, but why specifically simulation?

“The ‘traditional’ approach taught in most STAT101 classes was developed in the days before computers and is unnecessarily reliant on restrictive assumptions that cram statistical questions into formats you can tackle analytically with common distributions and those nasty obsolete printed tables”.

Got you. So what exactly have you got against the printed tables?

“Well, I often wonder whether traditional courses do more harm than good, since I keep seeing their survivors making ‘Type III errors’ - correctly answering the wrong convenient questions. With simulation, you can go back to first principles and discover the real magic of statistics”.

Statistics has magic?

“Sure it does! My favorite part is that learning statistics with simulation forces you to confront the role that your assumptions play.
After all, in statistics, your assumptions are at least as important as your data, if not more so”.

And when it came to offering his advice, Gregory Piatetsky, founder of KDnuggets (Twitter: @kdnuggets), suggested that:

“I would start with Leo Breiman’s paper on Two Cultures, plus I would study Bayesian inferencing”.

If you haven’t read that paper (which is open access), Leo Breiman lays out the case for algorithmic modelling, where the mechanism that generates the data is treated as a black box and a model is judged on how well it predicts, rather than assuming the data follow a prescribed statistical model.

This is what Cassie was getting at - statistical models rarely fit real-world data, and we are left to either try to shoe-horn the data into the model (getting the right answer to the wrong question) or switch it up and do something completely different - simulations!

And There’s More…

This is an excerpt of my original post, which is quite long - too long to post here in its entirety (there are more than 30 world-class contributors!).

If you’re enjoying reading, you might be interested to hear what Dez Blanchfield had to say about domain experts, or what Michael Friendly and Alberto Cairo said about the past, present and future of data visualisation.

There’s also a free book to download detailing all the comments made by the contributors, including what Jacqueline Nolis and Kristen Kehrer had to say about starting their careers over.

And don’t get me started with the epic suggestions that Natalie Dean and Jen Stirrup had about Information Flow and Detective Work.

Awesome - you really don’t want to miss them!

Read more here
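To make the ‘computational’ idea a little more concrete, here is a minimal sketch of the bootstrap that Josh Wills mentions: resample the data with replacement many times, recompute the statistic each time, and read a confidence interval straight off the resampled values, with no calculus and no printed tables. The measurements, the function name bootstrap_ci and its parameters are all made up for illustration; none of the contributors supplied this code, and the percentile interval used here is only one of several bootstrap variants.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read off the empirical alpha/2 and 1 - alpha/2 quantiles.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, invented purely for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
low, high = bootstrap_ci(data)
print(f"sample mean = {statistics.mean(data):.2f}")
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

Swapping statistics.mean for statistics.median (or any other function of the sample) gives an interval for that statistic with no change to the rest of the code, which is a large part of the appeal of the computational approach.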


Knowing when and how to choose the right statistical hypothesis test is no mean feat. It can take years of learning and practice before you get comfortable with it.

Fortunately, there are ways to shortcut this by having a process, a strategy and a nice, big diagram!

Here I'm going to give you all three!

Getting Started

I think everyone responds well to a good visualisation, so that's where we're going to start.

I've created what I call The Hypothesis Wheel, and here it is making its debut in the world:

[Image: The Hypothesis Wheel]

Now, there's a HUGE amount of information in there, and I don't expect anyone to absorb it all with just a quick glance, so it will take you quite a bit of study time to get to grips with it all.

Hypothesis Testing - a 4 Step Strategy

When making decisions about which hypothesis test to select, you need a plan of action, and here's my 4 step strategy:

1. Deduce the properties of your outcome variable (aka dependent or hypothesis variable)
2. Deduce the properties of your input variable (aka independent or predictor variable)
3. Deduce the parameters of the relationship
4. Look up the statistic on the Hypothesis Wheel

Steps 1 & 2: Your Variable Properties

As steps 1 and 2 involve the same checks, you can do them together. The properties you need to check for your input and outcome variables are Data Type, Distribution and Number of Classes.

There are 4 distinct data types that you'll come across in your research, and they are Ratio (R), Interval (I), Ordinal (O) and Nominal (N), like this:

[Image: the four data types]

In terms of the distribution, you need to check whether your data (Ratio or Interval data only) are normally distributed (ND) or non-normally distributed (NND). Actually, all you really need to know is whether they are symmetrical or not - they don't actually need to be full-blown Gaussian distributions to qualify here. Finally, for Ordinal or Nominal data only, you need to check how many classes (categories) there are in your data.
It's easier to explain what that means by example - the variable Gender has 2 classes (Male and Female), whereas Colour Of The Rainbow has 7 (ROYGBIV). What you really need to know is whether your variable has 2 classes or more than 2.

Step 3: Relationship Parameters

The relationship parameter you need to know for the Hypothesis Wheel is which type of analysis you are conducting, univariate or multivariate, like this:

[Image: univariate vs multivariate analysis]

Hypothesis Wheel Colour Codes

To help you navigate around the hypothesis wheel I've colour coded various parts of it, like this:

[Image: Hypothesis Wheel colour codes]

Step 4: Look up Your Statistic on the Hypothesis Wheel

We always start in the centre with the properties of the hypothesis variable, coloured in purple. There are 3 concentric circles corresponding to Data Type, Distribution and Number of Classes.

Spinning further out, in red we have the properties of the predictor variable - again, there are 3 circles for Data Type, Distribution and Number of Classes.

Then we have a blue circle for the relationship parameters, which denotes whether our analysis is univariate (UV) or multivariate (MV). When you look closely you'll see that there are 2 hypothesis wheels: the larger one contains only univariate hypothesis tests, while the smaller one has only the multivariate hypothesis tests.

Finally, the outer orange circle tells us which hypothesis test we should choose in any given circumstance.

Hypothesis Wheel Example

Let's zoom in on a particular example to see how you would use the hypothesis wheel to tell you which univariate test you should use.

Let's say that your hypothesis variable is Ordinal with more than 2 classes, and your predictor variable is Nominal, also with more than 2 classes.

Now let's see what that looks like on the hypothesis wheel:

[Image: the example path on the Hypothesis Wheel]

Starting from the centre, locate the data type of your hypothesis variable (Ordinal). It has more than 2 classes, so we locate that too. Spinning out to the red segment, locate the data type of your predictor variable (Nominal).
In this case, since the hypothesis variable has more than 2 classes it doesn't matter how many classes the predictor variable has - the correct statistic is the Chi-Squared Test.

Summary

The Hypothesis Wheel is more than just another flow chart that helps you choose which statistical hypothesis test you should use. The world doesn't need another flow chart, it needs a better one - and I believe this is it.

The Hypothesis Wheel is a framework for helping you to ask the right questions of your data so you can get the correct answers. All you need to do is ask 3 questions to correctly select your hypothesis test:

1. What are my data types (RION)?
2. What are their distributions (ND, NND), and/or how many categories do they have (2, >2)?
3. What type of analysis am I looking to perform (UV, MV)?

Once you've answered these questions - and they are right there on the chart to help you decide - the Hypothesis Wheel will help you choose the correct statistical tool to use.

But this isn't why it is a framework. It is a framework because if there is a statistical test that is not present on the chart (I've only included the most used hypothesis tests), it is really easy to see exactly where it should fit on the Hypothesis Wheel, like this:

[Image: placing an additional test on the Hypothesis Wheel]

Hypothesis Wheel - Free Download

If you want your very own hypothesis wheel to download and keep, you can get a high definition PDF right here.
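For readers who like to see a lookup written down, the worked example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the WHEEL table encodes just the one combination described in the example (the rest of the wheel would have to be copied in from the diagram), the names look_up_test, classes_label and chi_squared are hypothetical helpers, and the survey data are invented. The Chi-Squared statistic is computed by hand from the contingency table so that the sketch needs nothing beyond the standard library.

```python
from collections import Counter

# Only the single combination worked through in the example is encoded here;
# every other entry would need to be filled in from the wheel itself.
# Key: (analysis, hypothesis data type, hypothesis classes, predictor data type)
WHEEL = {
    ("UV", "Ordinal", ">2", "Nominal"): "Chi-Squared Test",
}


def look_up_test(analysis, hyp_type, hyp_classes, pred_type):
    """Return the test for a combination, or None if it is not encoded."""
    return WHEEL.get((analysis, hyp_type, hyp_classes, pred_type))


def classes_label(values):
    """Steps 1 & 2: does a categorical variable have 2 classes or more than 2?"""
    k = len(set(values))
    if k < 2:
        raise ValueError("a categorical variable needs at least 2 classes")
    return "2" if k == 2 else ">2"


def chi_squared(outcome, predictor):
    """Pearson chi-squared statistic and degrees of freedom for two categorical lists."""
    n = len(outcome)
    row_totals = Counter(predictor)
    col_totals = Counter(outcome)
    observed = Counter(zip(predictor, outcome))
    statistic = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # count expected under independence
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented data: satisfaction (Ordinal, 3 classes) by department (Nominal, 3 classes).
department = ["A"] * 6 + ["B"] * 6 + ["C"] * 6
satisfaction = (
    ["low", "low", "low", "low", "medium", "high"]
    + ["low", "medium", "medium", "medium", "medium", "high"]
    + ["low", "medium", "high", "high", "high", "high"]
)

test_name = look_up_test("UV", "Ordinal", classes_label(satisfaction), "Nominal")
statistic, dof = chi_squared(satisfaction, department)
print(test_name, f"statistic = {statistic:.2f}", f"dof = {dof}")
```

In practice you would probably hand the contingency table to a statistics library (for example scipy.stats.chi2_contingency) to get a p-value as well; the hand-rolled version is here only to show what the test chosen from the wheel actually computes.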

