I focus here on a typical example, the kind of topic you regularly find in news outlets. On the face of it, the title is not a lie. It states a true fact, but one that results from cherry-picking and a careful choice of words. Many people with a statistical background would consider it misleading at best. There may or may not be political motivation behind the article. But the main goal is definitely clickbait, to generate impressions and advertising revenue.
On May 14, 2023, I read an article entitled “Northwest Heat Wave Could Smash Records In Washington, Oregon”. I explain here why it is misleading and how it exploits the innumeracy of most readers. There were indeed pretty high temperatures, and even government warnings about the risk of heat stroke and so on. The coverage was somewhat exaggerated, but the main issue is the use of words such as “all-time record”, even if the claim was indeed true.
Obviously, summer temperatures are getting hotter in Seattle. This is a reality. But it is much less dramatic than what the article makes it out to be. In short, the article is not a good example to illustrate the warming problem. The same applies to many similar articles popping up regularly, regardless of the location.
What the Article Says
According to the author, a few record highs were set on Friday, including in Portland, Oregon (90 degrees), and Olympia, Washington (85 degrees). This may very well be true. It does not sound that high, but for May and in Seattle, it could be a 100-year record. The author pointed that out. Since I don’t deny this statement, why do I claim it is misleading? I now explain what is problematic in the statement in question.
What is the Probability of an All-time Record?
To answer the question in relation to the article, you must rephrase it. The article's claim is about the probability for a specific day (say May 25) to be the daily record over one hundred years of historical data. But it does not matter which day, and this is the critical part missing in the article. The author picked the day when the record actually took place, but it could have been three months later or sooner. This turns an otherwise common event into an all-time record. The goal is to get people to click and read the article. If the reader were told upfront about the actual probability of this happening, no one would read it. That probability is a lot higher than most people think, and thus the event is not as exceptional as the author makes it out to be.
Mathematical Computation of the Odds
The probability that (say) May 25 is the hottest May 25 in 100 years is 1/100: this is the probability, assuming no trend, that one number picked at random out of 100 is the largest one. But picking May 25 is misleading. The record could have happened 3 months later or earlier. So what you should actually look at is the probability, in a given year, that at least one day is the hottest in one hundred years, compared to the same day in all other years. This probability is one minus the probability of never hitting a 100-year record for the same day, over the course of a year. Since one year is 365 days, and treating the days as independent, it is equal to $1 - (99/100)^{365}$. It turns out to be about 97.5%.
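The arithmetic above can be checked in a few lines of Python. This is a minimal sketch under the same assumptions as the text (no trend, days treated as independent). The function names `prob_at_least_one_record` and `expected_record_days` are illustrative, not from the original article.

```python
def prob_at_least_one_record(n_years: int = 100, n_days: int = 365) -> float:
    """Probability that at least one day of the year is the hottest
    among n_years for its calendar date (no trend, independent days)."""
    # Chance that a given day is NOT the record for its date: (n_years - 1) / n_years
    p_no_record_one_day = (n_years - 1) / n_years
    # Chance that none of the n_days is a record, then take the complement
    return 1.0 - p_no_record_one_day ** n_days


def expected_record_days(n_years: int = 100, n_days: int = 365) -> float:
    """Expected number of record days in a year: each day is a record
    with probability 1 / n_years, so the expectation is n_days / n_years."""
    return n_days / n_years


if __name__ == "__main__":
    print(f"P(at least one record day) = {prob_at_least_one_record():.3f}")
    print(f"Expected record days per year = {expected_record_days():.2f}")
```

Under these assumptions, a given year is expected to have between three and four such "100-year record" days (365/100 = 3.65).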
So the event, presented as an all-time record, is actually very common. Each year is expected to have a day like that just by chance. A 97.5% chance is hardly small enough to qualify as an “exceptional event” or a “100-year record”.
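A quick simulation gives the same picture. The sketch below is illustrative only. It assumes no trend and independent days, and uses uniform random numbers as stand-ins for temperatures, since only their ranking matters. The names `year_has_record` and `estimate_probability` are hypothetical helpers introduced for this example.

```python
import random


def year_has_record(rng: random.Random, n_years: int = 100, n_days: int = 365) -> bool:
    """Return True if at least one day of the current year is hotter than
    the same calendar day in all the other n_years - 1 years."""
    for _ in range(n_days):
        current = rng.random()  # stand-in for this year's temperature on that day
        hottest_other = max(rng.random() for _ in range(n_years - 1))
        if current > hottest_other:
            return True  # one record day is enough, stop early
    return False


def estimate_probability(n_trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the chance that a year has a record day."""
    rng = random.Random(seed)
    hits = sum(year_has_record(rng) for _ in range(n_trials))
    return hits / n_trials


if __name__ == "__main__":
    print(f"Simulated estimate: {estimate_probability():.3f}")
    print(f"Formula 1 - 0.99^365: {1 - 0.99 ** 365:.3f}")
```

The simulated frequency should land close to the value given by the formula, up to sampling noise. Real daily temperatures are autocorrelated, so this independence-based sketch is only a first approximation.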
By playing with words and cherry-picking, news outlets can regularly find events that they can market as incredibly rare. They do so by misrepresenting the laws of probability. And the public, due to innumeracy, actually believes in the exceptional rarity of the events discussed. Most likely, this is not done on purpose. The authors are just as ignorant as their public. Unfortunately, it can lead to some dramatization and eventually to government policies that address not reality, but an artificial (exaggerated) version of it: a perception. This is true in public health and in many other domains. Regardless of political orientation, the publishers make the same mistakes.
They present the information in a way that is not technically a lie, and is indeed an actual fact. But the wording surrounding it is what I call distortion. The drive to generate clicks further fuels this trend, as does competition: a newspaper failing to mention the “exceptional heat wave” in question, or mentioning the 97.5% chance of this happening in any given year instead of an “all-time record”, would be erroneously perceived as ignorant.
On the other hand, reporting at the end of December that 2 days this year and 3 days last year, in 5 separate events, were the top 5 records over a 30-year time period would carry a lot more weight. By contrast, saying that 3 days in a row are top records, given that one of them is already a top record, has little weight, due to very strong autocorrelation in daily temperatures. Yet the public tends to be impressed by statistics like these. In the end, widespread innumeracy is also what allows lotteries to be successful. Incidentally, I don’t play the lottery, I create them: see here.
About the Author
Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com and co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).
Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.