Summary: This blog showcases the handling of daily Covid-19 case and death data for the U.S., published by the Center for Systems Science and Engineering at Johns Hopkins University. The technology deployed to manage and explore the data is R along with its splendid data.table package. Analysts with several months of R experience should benefit from the notebook below.
It's pretty hard to consume any analytics media these days without seeing explorations of Covid-19 data. I was late to Covid EDA, but am now all in, hoping I can make even a small contribution to the pandemic response. A good starting point for Covid data is the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, my alma mater. The CSSE maintains a Covid-19 dashboard and posts confirmed case and fatality files daily for the U.S. and the world.
I started looking at that data about a week ago using R, planning later to examine the same data with Python and Julia. The downloadable case and death files hint at their spreadsheet origins: each row carries an ever-expanding repeating group of date columns holding the cumulative case/death counts. The granularity of the data is county or other jurisdiction within state, so ultimately a normalized relational structure would key on the combination of state, jurisdiction, and date. A problem with the data, noted on the website, is that "The time series tables are subject to be updated if inaccuracies are identified in our historical data. The daily reports will not be adjusted in these instances to maintain a record of raw data." In other words, there are some anomalies in the data that must be accounted for. I try to manage around them as best I can with summarization and moving averages.
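To make the wide-to-normalized idea concrete, here is a minimal sketch with data.table. It is not the notebook's actual code: the tiny table, its counts, and the column names cumcases, dailycases, and ma3 are all made up for illustration, and the 3-day window is an arbitrary choice. The identifier columns Province_State and Admin2 are assumed to mirror the CSSE U.S. files.

```r
library(data.table)

# Hypothetical miniature of the wide layout: one row per county,
# one cumulative-count column per date. All values are invented.
wide <- data.table(
  Province_State = c("Illinois", "Illinois"),
  Admin2         = c("Cook", "DuPage"),
  `4/1/20` = c(10, 2),
  `4/2/20` = c(15, 2),
  `4/3/20` = c(14, 5),   # a downward revision, the kind of anomaly to expect
  `4/4/20` = c(22, 7)
)

# Reshape so each row is identified by state, jurisdiction, and date.
long <- melt(
  wide,
  id.vars         = c("Province_State", "Admin2"),
  variable.name   = "date",
  value.name      = "cumcases",
  variable.factor = FALSE
)
long[, date := as.Date(date, format = "%m/%d/%y")]
setkey(long, Province_State, Admin2, date)

# Daily counts are first differences of the cumulative series within a county;
# the first date has no predecessor, so its daily count is NA.
long[, dailycases := cumcases - shift(cumcases), by = .(Province_State, Admin2)]

# A trailing 3-day mean dampens revisions such as the negative daily count.
long[, ma3 := frollmean(dailycases, 3), by = .(Province_State, Admin2)]

print(long)
```

Keying on the three columns sorts the rows, so the within-group shift() and frollmean() calls see dates in order. A negative daily count, as on 4/3/20 for the first county, is left in place here; the moving average simply spreads it over the window.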
Any data management work I do in R is built on the nonpareil data.table package, which adds a great deal of functionality to R's native data.frame. A newbie serious about learning R for analytics should make an investment in data.table. It'll take some time, but the rewards are well worth the effort. Python programmers are starting to see the Python datatable package, which is modeled on R's data.table, as a competitor to the venerable Pandas.
This is the first of a two-part series on R with the CSSE case/fatality data. Part I here details the loading/shaping/grouping of the data, while Part II will explore the data using ggplot. My hope is that readers will find some of the code useful in their own work.
The supporting platform is a Wintel 10 notebook with 128 GB RAM, running JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, pryr, plyr, fst, and knitr packages are featured, as well as functions from my personal stash, detailed below.