As part of my PhD work, I recently had to analyze a dataset of my choice and present the findings. I ended up conducting a study on US county-level Covid-19 data, and I want to share my key findings in this blog post.
The primary question I wanted to address through data analysis was “Do counties’ socioeconomic factors such as population density, poverty rate, unemployment rate, education percent and urbanization percent have any direct impact on the Covid-19 outbreak across counties within the United States?”
The method used in this analysis was to find correlations between US counties’ socioeconomic factors and the number of Covid-19 cases reported per county. The socioeconomic factors studied were population density, poverty rate, unemployment rate, education percent and urbanization percent. Correlation analysis is a statistical method that indicates whether relationships exist between quantitative variables and provides a metric for the strength of those relationships.
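To make the idea concrete, here is a minimal sketch of a correlation computation in pandas. It is not the code from my notebook: the column names and the five rows of toy data are made up for illustration, and the Pearson coefficient is an assumption, since other coefficients (for example Spearman) could be used instead.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
counties = pd.DataFrame({
    "population_density": [50.0, 120.0, 900.0, 3000.0, 15.0],
    "poverty_rate": [18.2, 14.1, 11.5, 16.0, 21.3],
    "total_cases": [40, 150, 2100, 9800, 12],
})

# Pearson correlation of every factor against the case count (assumed method).
# Values near +1 or -1 suggest a strong linear relationship, values near 0 a weak one.
correlations = (
    counties.corr(method="pearson")["total_cases"].drop("total_cases")
)
print(correlations)
```

A coefficient only describes how strongly two columns move together; it does not by itself show that one factor causes the other.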
The analysis steps were scripted in a Jupyter Notebook using Python 3. Python data analysis libraries such as NumPy, pandas and scikit-learn were used for data cleansing, integration, wrangling and computing correlations, while the plotting libraries Matplotlib and Seaborn were used to plot the results.
The steps involved extracting the required data fields from each dataset, cleansing and wrangling the data, computing additional fields, and integrating the records into a dataset ready for correlation analysis. Cleansing involved imputing missing values in the different datasets and standardizing the state and county fields so that records from different datasets could be mapped to each other. Wrangling involved collapsing and consolidating metric values and reshaping the datasets. Each dataset had county and state name fields, so the integration step matched records from all the processed datasets using these two fields together as the key, producing a consolidated dataset. The final step was to run the correlation analysis on the integrated dataset.
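The integration step can be sketched as a pandas merge on the two key fields. This is only an illustration under assumptions: the frame names, column names and values below are hypothetical, and an inner join is one possible choice (it silently drops counties that are missing from either dataset).

```python
import pandas as pd

# Hypothetical, already-standardized datasets; values are illustrative only.
cases = pd.DataFrame({
    "state": ["AL", "AL", "NY"],
    "county": ["Baldwin", "Mobile", "Kings"],
    "total_cases": [300, 1500, 50000],
})
poverty = pd.DataFrame({
    "state": ["AL", "NY", "NY"],
    "county": ["Baldwin", "Kings", "Queens"],
    "poverty_rate": [10.1, 19.0, 12.2],
})

# County names repeat across states, so state and county are used together as the key.
# validate="one_to_one" raises an error if a key is duplicated on either side.
merged = cases.merge(
    poverty, on=["state", "county"], how="inner", validate="one_to_one"
)
print(merged)
```

Only the counties present in both frames survive the inner join, so comparing row counts before and after the merge is a quick way to spot keys that failed to match.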
While poverty rate and unemployment rate were used directly from their respective datasets, population density, education percent and urbanization percent were derived. Population density was calculated as population per square mile, using the counties’ population estimates from the population dataset and the counties’ land area from the land area dataset. The education percent of a county was computed by adding the percent with some college education and the percent with a bachelor’s degree or higher, both found in the counties’ education rates dataset. The urban percent of a county was obtained by subtracting its rural percent from 100.
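The three derived factors are simple column arithmetic. A minimal sketch, assuming hypothetical column names and made-up values for two counties:

```python
import pandas as pd

# Hypothetical integrated records; column names and values are illustrative only.
df = pd.DataFrame({
    "population": [200000, 50000],
    "land_area_sq_mi": [1600.0, 500.0],
    "pct_some_college": [30.0, 25.0],
    "pct_bachelors_or_higher": [28.0, 15.0],
    "rural_pct": [40.0, 75.0],
})

# Population density: people per square mile of land area.
df["population_density"] = df["population"] / df["land_area_sq_mi"]

# Education percent: some college plus bachelor's degree or higher.
df["education_pct"] = df["pct_some_college"] + df["pct_bachelors_or_higher"]

# Urban percent: the complement of the rural percent.
df["urban_pct"] = 100 - df["rural_pct"]

print(df[["population_density", "education_pct", "urban_pct"]])
```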
The state and county fields were represented in different formats across datasets. In some datasets, the county field held values such as ‘Baldwin County, Alabama’, ‘Baldwin County, AL’ or ‘Baldwin, AL’, where the state was given either as a full name or as a state code. In other datasets, the county field held values such as ‘Baldwin County’ or ‘Baldwin’, with the state captured in a separate field, again either as a full name or as a state code.
As part of data preparation, the county field was standardized to contain only the actual name, such as ‘Baldwin’. Similarly, the state field was standardized to contain only the state code, such as ‘AL’. In datasets where the state field held the full name, it was transformed into the two-letter state code.
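The standardization can be sketched as a small helper that handles the formats listed above. The function name, the partial state lookup table and the regular expression are all hypothetical choices made for this illustration; real data would need the full table of states and extra handling for names that do not end in ‘County’ (for example Louisiana parishes).

```python
import re

# Hypothetical, deliberately partial lookup; a real one would cover every state.
STATE_CODES = {"Alabama": "AL", "New York": "NY"}

def standardize(county_value, state_value=None):
    """Return (county_name, state_code) for the formats described above."""
    # Forms like 'Baldwin County, Alabama' carry the state after the comma.
    if "," in county_value:
        county_value, state_value = county_value.split(",", 1)
    if state_value is None:
        raise ValueError("state is missing for " + repr(county_value))
    # Drop a trailing 'County' so only the actual name remains.
    county = re.sub(r"\s+County$", "", county_value.strip())
    # Map a full state name to its code; values that are already codes pass through.
    state = state_value.strip()
    return county, STATE_CODES.get(state, state)

print(standardize("Baldwin County, Alabama"))
print(standardize("Baldwin", "AL"))
```

Applied to every dataset before integration, a helper like this gives each record the same pair of key values regardless of how the source file wrote them.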
Metric values in the different datasets were on different scales. For example, the total number of Covid-19 cases was in the hundreds of thousands, while population density values were in the hundreds. This made the data hard to visualize, because values on such different scales could not fit in the same chart area. As such, the data values had to be normalized and scaled specifically for visualization purposes. For example, a logarithmic scale was used for the total number of Covid-19 cases, while min-max normalization coupled with a scaling factor was used to visualize the other metrics.
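Both transformations are short in NumPy and pandas. The sketch below is illustrative only: the column names, the toy values, the base-10 logarithm and the scaling factor of 5 are assumptions, not the values used in my charts.

```python
import numpy as np
import pandas as pd

# Hypothetical values on very different scales; illustrative only.
df = pd.DataFrame({
    "total_cases": [10, 1000, 100000],
    "population_density": [50.0, 300.0, 550.0],
})

# Logarithmic scale compresses the case counts (assumes every count is > 0).
df["cases_log10"] = np.log10(df["total_cases"])

# Min-max normalization maps the column to [0, 1]; the scaling factor then
# stretches it to a range comparable to the log-scaled cases.
SCALE = 5.0  # hypothetical factor chosen to match the log range above
density = df["population_density"]
df["density_scaled"] = (
    (density - density.min()) / (density.max() - density.min()) * SCALE
)

print(df)
```

Counties with zero reported cases would need separate handling before taking the logarithm, for example by using `np.log1p` instead.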
Correlations were evaluated for all counties in the US and separately for counties in the 4 states with the highest number of reported cases: New York, New Jersey, California and Illinois. Figure 2 below captures the correlations between the counties’ socioeconomic factors and the total number of Covid-19 cases.
Results also showed correlations among the socioeconomic factors themselves. The following are key non-covid related inferences.
Figure 3 below shows a scaled view of county-wise population density vs. Covid-19 cases in New York state.
If anyone is interested in the analysis artifacts, please message me on LinkedIn.
Coronavirus (Covid-19) data in the United States [Data set]. (2020). The New York Times. Retrieved from https://github.com/nytimes/covid-19-data
Counties’ urban-rural classification data in United States [Data set]. (2020). United States Census Bureau. Retrieved from https://www.census.gov/programs-surveys/geography/guidance/geo-area...
County-level land area data in United States [Data set]. (2020). United States Census Bureau. Retrieved from https://www.census.gov/library/publications/2011/compendia/usa-coun...
County-level socio-economic data in United States [Data set]. (2020). United States Department of Agriculture Economic Research Service. Retrieved from https://www.ers.usda.gov/data-products/county-level-data-sets/
Garattini, C., Raffle, J., Aisyah, D. N., Sartain, F., & Kozlakidis, Z. (2019). Big data analytics, infectious diseases and associated ethical impacts. Philosophy & Technology, 1, 69. http://dx.doi.org/10.1007/s13347-017-0278-y
Morgan, O. (2019). How decision makers can use quantitative approaches to guide outbreak responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 374(1776), 1
Redding, D. W., Atkinson, P. M., Cunningham, A. A., Lo Iacono, G., Moses, L. M., Wood, J. L. N., & Jones, K. E. (2019). Impacts of environmental and socio-economic factors on emergence and epidemic potential of Ebola in Africa. Nature Communications, 10(1), 4531. http://dx.doi.org/10.1038/s41467-019-12499-6