Johns Hopkins, Covid-19 and R, Part III, World Data.

Summary: This blog is part III of a series showcasing management and analytics of the daily world Covid-19 case/death data published by the Center for Systems Science and Engineering at Johns Hopkins University. Whereas parts I & II focused on U.S. data, part III looks at the world as well. Of particular interest are moving averages of new cases and deaths, along with the case fatality rate: deaths as a percentage of total cases. The technology deployed is R, driven by its splendid data.table package. Analysts with several months of R experience should benefit from the notebook below.

A little over a month ago, President Trump made the provocative assertion that 99% of Covid-19 cases are benign: “We have tested over 40 million people. By so doing, we show cases, 99 percent of which are totally harmless.” Press secretary Kayleigh McEnany attempted to cover for the president by proffering, “What the president is noting is that, at the height of this pandemic, we were at 2,500 deaths per day. We are now at a place where, on July Fourth, there were 254; that’s a tenfold decrease in mortality.”

Apparently, the argument was that since daily fatalities were only around 1% of new daily cases at that point, the other 99% of cases were innocuous. Dr. Anthony Fauci interpreted the argument that way and countered it: "I'm trying to figure out where the president got that number," Fauci said. "What I think happened is that someone told him that the general mortality is about 1%. And he interpreted, therefore, that 99% is not a problem, when that's obviously not the case."

The percentage of a disease’s cases that result in death is called the case fatality (CF) rate and is often computed as simply the ratio of total fatalities to total cases. A better CF rate would tie individual cases to their outcomes, but that is generally impractical, as it is now. There is also usually a lag between new cases and the resulting fatalities, so a calculation that accounted for that delay would be welcome. In the end, though, the CF rate is often just computed as cumulative fatalities/cumulative cases.
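
In symbols, the simple rate and one possible lag-adjusted variant might be written as follows. The names are assumptions made for illustration: $C_t$ and $D_t$ stand for cumulative cases and deaths through day $t$, and $\ell$ is an assumed number of days between case report and death. The lagged form is only one common way to account for the delay, not necessarily the one used in the notebook.

```latex
\mathrm{CFR}_t = 100 \times \frac{D_t}{C_t},
\qquad
\mathrm{CFR}_t^{(\ell)} = 100 \times \frac{D_t}{C_{t-\ell}}
```

While cumulative cases are still growing, $C_{t-\ell} \le C_t$, so the lagged rate is never smaller than the simple one.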

The current day's Johns Hopkins CSSE data are available for download at midnight CDT daily. For both the U.S. and the world, there are separate case and death files with similar structures. Each row is a geography, such as a country or a county within a state, and a new column is added each day holding the cumulative count for each geography. Data munging revolves around pivoting or melting the data into R data.tables and computing daily counts as differences of successive cumulative records.
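
The melt-then-difference step can be sketched roughly as below. This is a minimal illustration on made-up numbers, not the notebook's code: the table `wide`, the column names `country`, `cumcases`, and `dailycases`, and the date-header format are all assumptions.

```r
library(data.table)

# Toy stand-in for a wide cumulative file: one row per geography,
# one column of cumulative counts per date (values are made up).
wide <- data.table(
  country  = c("A", "B"),
  `8/1/20` = c(100, 10),
  `8/2/20` = c(130, 10),
  `8/3/20` = c(150, 25)
)

# Melt to long form: one row per geography per date.
long <- melt(wide, id.vars = "country",
             variable.name = "date", value.name = "cumcases",
             variable.factor = FALSE)
long[, date := as.Date(date, format = "%m/%d/%y")]
setkey(long, country, date)

# Daily counts are differences of successive cumulative values within each
# geography; the first day has no predecessor, so it keeps its cumulative value.
long[, dailycases := cumcases - shift(cumcases, fill = 0), by = country]

print(long)
```

Keying on geography and date before differencing matters here, since `shift()` works on row order within each group.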

One problem with the data for both the U.S. and the world is that cases/fatalities tend to be underreported on weekends which, coupled with a frequent one-day lag in reporting, produces significantly lower counts on Sundays and Mondays. I work around this problem by emphasizing moving averages over daily counts.
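
A 7-day window is a natural choice for smoothing a weekly reporting cycle, since every window then contains exactly one of each weekday. The sketch below uses data.table's `frollmean()` on made-up counts. The 7-day width, the trailing alignment, and the names `daily`, `cases`, and `ma7` are assumptions, not details taken from the notebook.

```r
library(data.table)

# Made-up daily counts for one geography over two weeks, starting on a
# Sunday, with artificially low values on each Sunday and Monday.
daily <- data.table(
  date  = seq(as.Date("2020-08-02"), by = "day", length.out = 14),
  cases = c(40, 50, 100, 100, 100, 100, 70,
            40, 50, 100, 100, 100, 100, 70)
)

# Trailing 7-day moving average; the first six days are NA because
# a full window is not yet available.
daily[, ma7 := frollmean(cases, n = 7, align = "right")]

print(daily)
```

Because the toy pattern repeats weekly, the moving average is flat once a full window is available, even though the daily counts swing between 40 and 100.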

After loading and munging the data, I assemble functions to report on cases/deaths using powerful data.table syntax. Some of these functions then feed ggplot visuals that demonstrate the disease's workings over time. The grouping power of data.table allows country/state-level case-death reports to be generated in a few statements.
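
As a rough idea of what such a grouped report can look like, here is a minimal sketch on made-up data. The table `covid`, its column names, and the report layout are hypothetical and are not the author's functions.

```r
library(data.table)

# Made-up long-format data: cumulative cases and deaths by country and date.
covid <- data.table(
  country   = rep(c("A", "B"), each = 3),
  date      = rep(as.Date("2020-08-01") + 0:2, times = 2),
  cumcases  = c(100, 130, 150, 1000, 1100, 1250),
  cumdeaths = c(2, 3, 3, 40, 44, 50)
)

# One grouped statement: latest totals and case fatality rate per country.
report <- covid[order(date),
  .(asof   = last(date),
    cases  = last(cumcases),
    deaths = last(cumdeaths),
    cfrpct = round(100 * last(cumdeaths) / last(cumcases), 2)),
  by = country]

print(report)
```

The same pattern extends to finer geographies by adding columns to `by`.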

The supporting platform is a Wintel 10 notebook with 128 GB RAM, running JupyterLab 1.2.4 and R 4.0.2. The R data.table, tidyverse, pryr, plyr, fst, feather, and knitr packages are featured, as well as functions from my personal stash, detailed below. Read the entire blog here.
