A quick demonstration of polling confidence interval calculations using simulation

At dinner with friends last Sunday, the conversation fixated on, what else, the upcoming presidential election. That morning, the Washington Post had released a poll showing the Democrat leading the Republican by 12 percentage points among likely voters, 54% to 42%. Most of the back and forth revolved around the precise meaning of the numbers. Mindful of what happened in 2016, one person correctly opined that much could happen to change the vote in the last three weeks. Also noted was that the mix of voters can evolve quickly, so that some who appeared to be on the sidelines might well engage at the end. That also appeared to happen in 2016. And finally, these numbers represent the popular vote, while the Electoral College determines elections.

Everyone seemed to have at least a naive sense of the margin of error, it having to do with the murky concepts of variability and sampling. My own $.02 was that the confidence bands represented what would happen if the identical survey had been conducted many times at the same calendar point. Thus, if the same WP survey had been run 1000 times, an average of 54% of the votes would have been for D, and 95% of such polls would report D proportions between .50 and .58. Likewise, R’s numbers would average .42 with 95% in the range .38 to .46.

The results noted by the WP are derived from the sampling distribution of proportions, a basic result covered in Stats 101. The standard error of a sample proportion p from n respondents is sqrt(p(1-p)/n), so the confidence interval becomes “skinnier” the larger the sample size (in this case, 725) and the more extreme the proportion, that is, the farther it sits from .5.
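To make the Stats 101 calculation concrete, here is a minimal Python sketch of the normal-approximation interval for a proportion. It is not the author's R code, which is not reproduced in this excerpt. The function name `proportion_ci` is hypothetical, and the sketch assumes simple random sampling with the reported 54% and 42% plugged in as the proportions and 725 as the sample size.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, level=0.95):
    """Normal-approximation confidence interval for a sample proportion.

    Hypothetical helper for illustration; assumes simple random sampling.
    """
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Standard error of a proportion: sqrt(p * (1 - p) / n).
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


n = 725  # sample size cited in the post
for label, p in (("D", 0.54), ("R", 0.42)):
    lo, hi = proportion_ci(p, n)
    print(f"{label}: {p:.2f} with 95% interval {lo:.3f} to {hi:.3f}")
```

Under these assumptions the D interval comes out at roughly .50 to .58 and the R interval at roughly .38 to .46, in line with the figures quoted above.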

Advances in statistical computation over the last 20 years have enabled stats guys like me to easily “conduct” the polling experiment mentioned above many times and track the results using random number generation. In what follows, I first show the simple stats of the normal distribution calculations. I then pivot to a Monte Carlo simulation where I run the poll many times, accumulating the calculations for a summary view. Hopefully, the results confirm the theoretical calculations!
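The repeated-poll experiment can be sketched in a few lines of Python using only the standard library. This is an illustration of the idea rather than the author's R implementation. It assumes that the reported shares (54% D, 42% R, and the remaining 4% lumped into a hypothetical "Other" category) are the true population shares, that each poll is a simple random sample of 725 respondents, and that 1000 polls are run. The function names and the seed are likewise arbitrary choices made for this sketch.

```python
import random
from statistics import mean, quantiles


def simulate_polls(shares, n_respondents, n_polls, seed=None):
    """Run n_polls simulated polls and return the sample proportions
    observed for each candidate label (hypothetical helper)."""
    rng = random.Random(seed)
    labels = list(shares)
    weights = [shares[k] for k in labels]
    results = {k: [] for k in labels}
    for _ in range(n_polls):
        # One poll: draw n_respondents independent answers.
        sample = rng.choices(labels, weights=weights, k=n_respondents)
        for k in labels:
            results[k].append(sample.count(k) / n_respondents)
    return results


def summarize(props):
    """Return the mean and the empirical 2.5% and 97.5% quantiles."""
    cuts = quantiles(props, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return mean(props), cuts[0], cuts[-1]


# Assumed "true" shares, taken from the poll figures quoted in the post.
shares = {"D": 0.54, "R": 0.42, "Other": 0.04}
sims = simulate_polls(shares, n_respondents=725, n_polls=1000, seed=2020)

for label in ("D", "R"):
    avg, lo, hi = summarize(sims[label])
    print(f"{label}: mean {avg:.3f}, middle 95% of polls {lo:.3f} to {hi:.3f}")
```

If the simulation behaves as the theory predicts, the middle 95% of simulated D proportions should land close to the normal-approximation band, with small differences from run to run due to simulation noise.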

The supporting platform is a Wintel 10 notebook with 128 GB RAM, running JupyterLab 1.2.4 and R 4.0.2. The R data.table and tidyverse packages do the heavy lifting below.

Read the remainder of the blog here.