Did you have a good, relaxing break over the summer? Are you refreshed and re-energised, looking forward to a new start, a new you and brushing up on your data analysis skills? If so, I’ve thrown together a collection of a few excellent (and free!) statistics eBooks for your Kindle to sharpen up your stats while you’re on the long commute to work. Just try not to read them while driving!
These books require different levels of existing knowledge: some are for early-stage data scientists, while others are for more hard-core physicists and mathematicians. Nonetheless, it’s likely that you’ll find something in here that will get your mental juices flowing with ideas about how to tackle your data.
There’s even a bonus book at the end about the reasons why correlation does not necessarily imply causation.
All these books are free, so dive in and enjoy!
Authors: Trevor Hastie, Robert Tibshirani and Jerome Friedman
During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. Data analysis, which not long ago was primarily the domain of statistics, has evolved dramatically in the last few decades. This is almost entirely a consequence of the revolution in computing which has occurred over that period. At the start of this revolution, researchers were enabled to perform analyses that they might previously have balked at. But gradually things advanced so that nowadays tools can be applied which would be quite inconceivable without machine assistance. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of colour graphics.
The book’s coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, graphical models, random forests, ensemble methods, least angle regression & path algorithms for the lasso, non-negative matrix factorization and spectral clustering. There is also a chapter on methods for ‘wide’ data, including multiple testing and false discovery rates.
This is a beautiful book. Not only in presentation, where it makes excellent use of colour, but also in content and style. It would make a first-class text for an advanced undergraduate or an initial graduate course in modern statistical tools. A true forerunner to what is now called Data Science.
Author: Allen B. Downey
If you know how to program, you have the skills to turn data into knowledge using the tools of probability and statistics.
This concise introduction shows you how to perform statistical analysis computationally, rather than mathematically, with programs written in Python.
Think Stats emphasizes simple techniques you can use to explore real data sets and answer interesting questions, and you are encouraged to work on a project with real datasets.
If you have basic skills in Python, you can use them to learn concepts in probability and statistics, and many of the exercises use short programs to run experiments and help you develop understanding.
You'll work with a case study throughout the book to help you learn the entire data analysis process – from collecting data and generating statistics to identifying patterns and testing hypotheses.
Along the way, you'll become familiar with distributions, the rules of probability, visualization, and many other tools and concepts.
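To give a flavour of what "computationally, rather than mathematically" can look like, here is a minimal sketch in Python. It is not taken from Think Stats; the function name, the birthday example and the assumption of 365 equally likely birthdays are all mine, chosen purely for illustration. Instead of deriving the probability with a formula, it estimates it by running the experiment many times.

```python
import random


def estimate_shared_birthday(group_size, trials, seed=None):
    """Estimate by simulation the chance that at least two people
    in a group share a birthday (assumes 365 equally likely days)."""
    rng = random.Random(seed)  # seeded generator so runs are repeatable
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        # A repeated day means the set is smaller than the list.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / trials


estimate = estimate_shared_birthday(group_size=23, trials=10_000, seed=1)
print(f"Estimated probability: {estimate:.3f}")  # the exact answer is about 0.507
```

Increasing `trials` should tighten the estimate around the exact value, which is a handy way to check your intuition before (or instead of) doing the algebra.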
Author: Sanjoy Mahajan
Traditional mathematics teaching is largely about solving exactly stated problems exactly, yet life often hands us partly defined problems needing only moderately accurate solutions. Street-Fighting Mathematics teaches us how to guess answers without needing a proof or an exact calculation.
In Street-Fighting Mathematics, Sanjoy Mahajan describes six tools: dimensional analysis, easy cases, lumping, picture proofs, successive approximation, and reasoning by analogy. Illustrating each tool with numerous examples, he carefully separates the tool – the general principle – from the particular application so that the reader can most easily grasp the tool itself to use on problems of particular interest.
Given the title of this book (and its subtitle “The Art of Educated Guessing and Opportunistic Problem Solving”), I was expecting a pop-maths book, but instead found this book to be a straight-up maths textbook. As a physicist and mathematician, I enjoyed it immensely. Learning to see problems the way Mahajan sees them takes deep thought, time, and practice, but that is what makes Street-Fighting Mathematics an enjoyable read that provides an enlightening look at solving problems. On the other hand, I would hesitate to recommend it to those who might have difficulties with maths. You definitely need a strong understanding of calculus, differential equations, statistics and basic physics to get the best out of this book.
Author: Roger D. Peng
Data science has taken the world by storm. Every field of study and area of business has been affected as people increasingly realize the value of the incredible quantities of data being generated.
This book covers the essential exploratory techniques for summarizing data with R.
Exploratory data analysis is a key part of the data science process because it allows you to sharpen your question and refine your modelling strategies to develop more complex statistical models.
This book covers the plotting systems in R as well as some of the basic principles of constructing informative data graphics and some of the common multivariate statistical techniques used to visualize high-dimensional data.
Some of the topics covered are making exploratory graphs, principles of analytic graphics, plotting systems and graphics devices in R, clustering methods, and dimension reduction techniques.
Author: Brian Caffo
Statistical inference is the process of drawing conclusions about populations or scientific truths from data.
There are many modes of performing inference including statistical modelling, data-oriented strategies and explicit use of designs and randomisation in analyses. Further complications include alternative broad statistical theories (frequentist, Bayesian, likelihood, design-based) and numerous complexities (missing data, observed and unobserved confounding, biases) for performing inference that can leave you in a debilitating maze of techniques, philosophies and nuance.
This book – the accompaniment to the online Coursera Course on Statistical Inference – presents the fundamentals of inference in a practical approach for getting things done, and is designed to help you to understand the broad directions of statistical inference and use this information for making informed choices in analysing data.
Topics covered include probability, random variables, expectations, variability, distributions, limits and confidence intervals, testing, p-values, power, bootstrapping and permutation tests.
Author: Lee Baker
Correlation Is Not Causation.
You know it and I know it, and yet we are constantly having to be reminded of it because we can’t seem to help but get it wrong.
How many times have you heard someone really smart say something like ‘wow, this correlation has a p-value of 0.000001 so A must be causing B…’?
It’s not our fault though – we’re only human. We seek explanation for patterns and events that happen around us, and if something defies logic, we try to find a reason why it might make sense. If something doesn’t add up, we make it up.
OK, so if correlation does not necessarily imply causation, there must be a reason for that, and there must be something that is causing what we observe. That is what this book is all about.
If we discover a correlation between a pair of variables there are five alternatives to one being the direct cause of the other, and we’ll unmask all five in this book.
Then, once we understand each of these alternatives, we’ll formulate a plan to discover whether we have a direct causal link or whether there is some other explanation.
Correlation Is Not Causation explains how to systematically test for the five most common correlation-causation pitfalls that even the pros fall into (occasionally). We’ll learn to create strategies to analyse the data and interpret the results in a way that is easy to understand.
Best of all, there is no technical or statistical jargon – it is written in plain English.
It is packed with visually intuitive examples and makes no assumptions about your previous experience with correlations – in short, it is perfect for beginners!
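To see how a strong correlation can appear without any direct causal link, here is a minimal sketch in Python. It is my own illustration, not an example from the book, and it covers only one well-known alternative explanation: a hidden common cause. The variable names (`z`, `a`, `b`), the noise level and the function names are all assumptions made for the demo. Here `a` and `b` never influence each other, yet they correlate strongly because both depend on `z`.

```python
import math
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_common_cause(n, seed=None):
    """Return (correlation of a and b, partial correlation given z)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]       # hidden common cause
    a = [zi + rng.gauss(0, 0.5) for zi in z]      # a depends only on z
    b = [zi + rng.gauss(0, 0.5) for zi in z]      # b depends only on z
    r_ab, r_az, r_bz = pearson(a, b), pearson(a, z), pearson(b, z)
    # Partial correlation of a and b once z is accounted for.
    partial = (r_ab - r_az * r_bz) / math.sqrt((1 - r_az**2) * (1 - r_bz**2))
    return r_ab, partial


r_ab, partial = simulate_common_cause(n=5000, seed=42)
print(f"corr(a, b)          = {r_ab:.2f}")
print(f"corr(a, b) given z  = {partial:.2f}")
```

With this setup the raw correlation should come out high, while the partial correlation given `z` should sit close to zero. In real data you rarely get to measure the hidden variable so neatly, which is exactly why a tiny p-value on its own tells you nothing about cause.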
So there you have it – 5 free statistics eBooks (plus a bonus book) to get your back-to-work-after-the-holidays head back on and into the swing of things.
I hope you enjoy them, and it would be great if you would leave brief reviews of these books in the comments below – I’m sure all the authors would appreciate your comments and shares.
About the Author
Lee Baker is an award-winning software creator with a passion for turning data into a story.
A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!
Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress - but 100 times the fun!
He also wanted to be rich, famous and good looking. Ah well...
PS - Don't forget to connect with me on Twitter: @eelrekab