There’s no such thing as perfect data, but there are several factors that qualify data as good [1]:
Following a few best practices will ensure that any data you collect and analyze will be as good as it gets.
1. Collect Data Carefully
Even good data sets come with flaws, and those flaws should be readily apparent. For example, an honest data set will have any errors or limitations clearly noted. However, it’s really up to you, the analyst, to make an informed decision about the quality of the data once you have it in hand. Use the same due diligence you would use when making a major purchase: once you’ve found your “perfect” data set, run a few more web searches with the goal of uncovering any flaws.
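Part of that due diligence can be automated once the file is on your machine. The following is a minimal sketch, not a complete audit: it counts empty cells per column and exact duplicate rows in a CSV. The `audit_rows` helper and the sample data are hypothetical, made up for illustration, and the sketch assumes no row has more fields than the header.

```python
import csv
import io
from collections import Counter


def audit_rows(rows):
    """Summarize basic quality signals for a list of dict rows."""
    missing = Counter()  # column name -> number of empty cells
    seen = set()         # row contents already encountered
    duplicates = 0
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE = """state,population,median_age
A,1000,35.2
B,,41.0
A,1000,35.2
C,2500,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = audit_rows(rows)
print(report)
```

A report like this will not tell you whether the values are right, only where the obvious gaps and repeats are, so it complements rather than replaces reading the documentation that comes with the data.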
Some key questions to consider [3]:
Three great sources to collect data from
US Census Bureau
U.S. Census Bureau data is available to anyone for free. To download a CSV file:
The wide range of good data held by the Census Bureau is staggering. For example, I typed “Institutional” to bring up the population in institutional facilities by sex and age, while data scientist Emily Kubiceka used U.S. Census Bureau data to compare hearing and deaf Americans [5].
Data.gov
Data.gov [6] contains data from many different US government agencies on topics including climate, food safety, and government budgets. There's a vast amount of information to be gleaned. As an example, I found 40,261 datasets for "covid-19" including:
Kaggle
Kaggle [7] is a huge repository for public and private data. It’s where you’ll find data from The University of California, Irvine’s Machine Learning Repository, data on the Zika virus outbreak, and even data on people attempting to buy firearms. Unlike with the government websites listed above, you'll need to check the license information before re-using a particular dataset. Plus, not all data sets are wholly reliable: check your sources carefully before use.
2. Analyze with Care
So, you’ve found the ideal data set, and you’ve checked it to make sure it’s not riddled with flaws. Your analysis is going to be passed along to many people, most (or all) of whom aren’t mind readers. They may not know what steps you took in analyzing your data, so make sure your steps are clear with the following best practices [3]:
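As one illustration of keeping your steps visible, you can record each transformation as you apply it, so the log travels with the result. This is only a sketch under stated assumptions: the `apply_steps` helper, the sample readings, and the cutoff of 100 are all hypothetical and not taken from any particular data set.

```python
def apply_steps(values, steps):
    """Apply each (description, function) pair in order, keeping a log."""
    log = []
    for description, func in steps:
        before = len(values)
        values = func(values)
        log.append(f"{description}: {before} -> {len(values)} values")
    return values, log


# Hypothetical raw readings; None marks a missing value.
raw = [12.0, None, 15.5, 980.0, 14.2, None]

steps = [
    ("Drop missing readings",
     lambda xs: [x for x in xs if x is not None]),
    ("Remove readings above 100 (assumed sensor limit)",
     lambda xs: [x for x in xs if x <= 100]),
]

cleaned, log = apply_steps(raw, steps)
for entry in log:
    print(entry)
print("Mean of cleaned data:", sum(cleaned) / len(cleaned))
```

Anyone who receives the mean can also see that two missing readings and one outlier were removed first, and can question the cutoff if it looks wrong.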
3. Don’t Be the Weak Link in the Chain
Bad data doesn’t appear from nowhere. The data set you started with was created by someone, possibly several people, in several different stages. If they too have followed these best practices, then the result will be a helpful piece of data analysis. But if you introduce errors and fail to account for them, those errors are going to be compounded as the data gets passed along.
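To get a rough feel for how errors compound, consider a simplified model: each stage independently leaves a fixed fraction of records correct, so the fraction still correct after several stages is that value raised to the number of stages. The independence assumption, the 98% figure, and the `fraction_correct` helper are illustrative assumptions, not measurements.

```python
def fraction_correct(per_stage_accuracy, stages):
    """Fraction of records still correct after the given number of stages,
    assuming each stage independently preserves the same fraction."""
    return per_stage_accuracy ** stages


# Assumed accuracy of 98% per stage, for illustration only.
for stages in (1, 3, 5, 10):
    print(stages, "stages:", round(fraction_correct(0.98, stages), 3))
```

Under these assumptions, a 2% error rate at each handoff leaves roughly 90% of records correct after five stages, which is why small, unrecorded errors matter more than they first appear to.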
References
Data set image: Pro8055, CC BY-SA 4.0 via Wikimedia Commons
[2] Learning from reproducing computational results: introducing three ...
[3] How to avoid trouble: principles of good data analysis
[4] United States Census Bureau
[5] Better data lead to better forecasts
[6] Data.gov
[7] Kaggle
Posted 12 April 2021