Before Los Angeles Times reporter Laurie Becklund died of metastatic breast cancer earlier this month, she wrote a powerful final column. She asked friends and family not to say that she died fighting a courageous battle, a meaningless, trite phrase. She excoriated the pink ribbon and early detection breast cancer campaigns, calling the focus on awareness shamefully out of date. And, in a particularly telling excerpt that has caught the Internet’s attention, she decried a “criminal” lack of available data about the people at the center of all this, the ones facing the fatal diagnosis.
“We now know that breast cancer is not one disease,” she wrote. “What works for one person might not for another: There is no one ‘cure.’ We are each, in effect, one-person clinical trials. Yet the knowledge generated from those trials will die with us because there is no comprehensive database of metastatic breast cancer patients, their characteristics and what treatments did and didn’t help them.”
“In the Big Data era,” Becklund concluded, “this void is criminal.”
Becklund’s poignant plea is only the latest and most dramatic appeal to unleash the potential of meaningful Big Data, and its power to change people’s lives for the better. The ability to sift through huge amounts of raw data can offer us everything from personalized medicine to shortcuts through the vast information available on the Internet. The enthusiasm Big Data creates is entirely legitimate, and so is our need to use it to our greatest advantage.
Before I talk about those advantages, a note of caution. Big Data’s popularity has prompted some commentary that I’d describe as frothy and meaningless — a faddish approach to a serious subject. Browse an airport bookshop and you might find the popular book Big Data shelved between airplane reads on using people skills in business and on how children succeed. Big Data is caught in the same hype cycle that accompanied the early days of the Internet: outsized claims followed by disappointment, settling back into a more realistic view of its transformational potential.
The authors of those airport Big Data books feed the hype by incorrectly declaring an end to the need for statistical sampling, one reason statisticians and data scientists seem at war over the Big Data phenomenon.
Let’s look more carefully at this question. It’s true that, in some cases, interesting patterns and information emerge only when huge amounts of raw data are collected. It’s also true that once the pattern is identified, the data that carry the information of interest may still be of modest size.
Here’s where Big Data really hits a home run: It can do things that can’t be done with small data, no matter how well designed the sampling procedure. Take Google search — it works by storing all the search strings that have ever been run (or run recently) and finding out who else searched for the same thing that you just searched for. Then it has to collect enough such searches so that it can evaluate what people subsequently clicked on, and thus make good recommendations about what links you should see and in what order. If you search for “nude pictures of Angelina Jolie” you might be able to collect a big enough sample of similar searches fairly quickly. But if you search for “nocturnal habits of aardvarks,” you really need Big Data. Note that the relevant data here can still be small in size — the number of qualifying searches needed to make a good first-pass judgment might be in the low thousands. But you have to sift through hundreds of millions, or billions, to get those relevant records.
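The arithmetic of that last point can be made concrete with a small sketch. The log below is simulated, and the query strings are just the examples from this column; the point is only that for a rare query, the useful records number in the low thousands, yet finding them requires scanning the entire store of searches.

```python
import random

random.seed(0)

RARE = "nocturnal habits of aardvarks"
COMMON = "nude pictures of Angelina Jolie"

# Simulated log of a million searches; roughly 1 in 1,000 is the rare query.
log = [RARE if random.random() < 0.001 else COMMON for _ in range(1_000_000)]

# Finding enough relevant records to rank links for the rare query
# means scanning the whole log; the useful subset itself stays small.
relevant = [q for q in log if q == RARE]

print(f"scanned {len(log):,} records, found {len(relevant):,} relevant")
```

No sampling scheme, however well designed, gets you those thousand-odd records without first having the million on hand — which is the sense in which the data must be big even when the information of interest is small.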
When it comes to medical and other research, we can take Big Data and apply statistical design and sampling principles to learn more, and learn more definitively, about what these large amounts of information are really telling us. One principle is to separate exploration (roaming around the data to see what looks interesting) from confirmation (stating a hypothesis in advance, then looking systematically at fresh data to see whether it is confirmed).
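The exploration/confirmation split can be sketched with a simple holdout scheme. The records below are simulated with no real link built in; this illustrates the principle only, not the Danish study discussed next.

```python
import random

random.seed(1)

# Simulated patient records: (has_condition, died_early), with no true link.
records = [(random.random() < 0.2, random.random() < 0.1)
           for _ in range(10_000)]

# Split ONCE, up front: one half for exploring, one half held in reserve.
explore, confirm = records[:5_000], records[5_000:]

def early_death_rate(data, condition):
    """Share of records with the given condition status that died early."""
    subset = [died for has, died in data if has == condition]
    return sum(subset) / len(subset)

# Exploration: roam the first half looking for an interesting pattern.
diff_explore = (early_death_rate(explore, True)
                - early_death_rate(explore, False))

# Confirmation: state the hypothesis, then test it on fresh data only.
diff_confirm = (early_death_rate(confirm, True)
                - early_death_rate(confirm, False))

print(f"explore: {diff_explore:+.3f}, confirm: {diff_confirm:+.3f}")
```

A pattern that turns up in the exploratory half but shrinks toward zero in the confirmation half is a warning that we were looking at noise, not a discovery.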
After obtaining the medical records of 2 million people and following them for 30 years, researchers in Denmark concluded that the chances of premature death are higher for people who have ADHD, or attention deficit hyperactivity disorder, according to a study published in The Lancet.
As the news site Vox deftly deconstructed, the study had limitations. It was done in Denmark, and Vox speculated that the medical profession there may diagnose and treat ADHD differently from other countries, making the findings unique to the Danes.
But the main caution is to determine which realm the study falls in — exploration or confirmation. Did the investigators have some other basis for believing there was a link between ADHD and early death, then look to these data for confirmation? Or were they exploring the data, looking for some interesting pattern? Rapid growth in the digitization and availability of patient data and health data in general holds great potential for medical research and personalized medicine. Yet the appropriate statistical methodology is still needed to unlock this potential, and guard against error.
Laurie Becklund’s plea tells us in a particularly powerful way how crucial it is for us to capitalize on the ways Big Data can transform our lives. At the same time, we need to use our full array of statistical tools and techniques to fully achieve that transformational potential. In assessing the impact of Big Data, that’s the reality, and not the hype.
(Peter Bruce is founder of The Institute for Statistics Education at Statistics.com, the leading online provider of analytics and statistics courses since 2002. He is also the author of the newly released Introductory Statistics and Analytics: A Resampling Perspective (Wiley).)