We are all very fortunate to be alive during this exciting time in history. Some truly disruptive technologies are on the verge of exploding into reality and it is difficult to imagine what the future holds. With these new technologies, however, we must not ignore the technically sound practices that allowed us to reach this point – managing data integrity is one of those practices.
As promised in my last post, I will discuss the importance of data integrity in the context of Big Data. I started to gloss over the subject by demonstrating some basic data cleansing concepts, using data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. I was then going to demonstrate how this newly cleansed data could be used with some adroitly applied R code to show weather patterns of storms, plot the results on Google Maps . . . blah, blah, blah.
This has been done before, and I don't think it would get my point across about what it takes to transform data into relevant, actionable information. These concepts apply whether you are talking about Big Data, Data Warehouses, Online Analytic Processing (OLAP), or Online Transaction Processing (OLTP), and whether you are using an Oracle Exadata, Hadoop, or Teradata Aster/Hadoop platform.
So, how does one go about maintaining data quality when the source is web scraping, streaming data from a local network feed, or a third-party API? The problem is the same as with data acquired from your own corporate site, a transaction database, or a local data warehouse. If the collection and maintenance of the data are not performed properly, then the quality of the results from any analysis performed on it must be called into question.
The objective here is to demonstrate what is required to obtain valid results, and to show what a pain it is to tidy data – especially since it is mostly avoidable if a little engineering work is done on the front end. Obviously, when you are depending on others for your data, you take the data as you can get it. However, the most important component of maintaining data integrity is the data collection process. If you do it right from the beginning, then the manual cleansing process is unnecessary.
So rather than just providing a link to a data source (like NOAA's storm database), demonstrating the process of cleansing, or tidying, the data, and then applying some analysis techniques, I am going to walk through a use case where the NOAA data is routinely downloaded and analyzed. Given the state of the data, a preliminary analysis determined that cleansing it manually each month would cost more than the information is worth. Therefore, we are going to walk through a somewhat contrived use case to demonstrate the principles and processes in detail. These concepts should be applied to all data when the ROI warrants the time and effort. A quick caveat here: most of the time the ROI does warrant the additional effort and cost. Unfortunately, all too frequently, the easiest path is too tempting and is rationalized as the best path forward.
As we walk through this exercise, you will see that there are a lot of checks that must be performed to ensure that the data is accurate. The NOAA data is only a few million records. Imagine performing all of these checks on a much larger data set. Even a few million records from a streaming data feed could be overwhelming if checks were not put in place to enforce the necessary structure.
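To make "checks" a little more concrete, here is a minimal sketch in R of the kind of structural validation I have in mind. The column names and the short reference list of event types are assumptions for illustration only; verify them against the actual file layout before using anything like this.

```r
# A minimal sketch of structural checks on a storm data frame.
# Column names and the event type list are assumptions for illustration.
check_storm_data <- function(df) {
  required_cols <- c("EVENT_TYPE", "STATE", "DEATHS_DIRECT", "INJURIES_DIRECT")

  # Every required column must be present
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }

  # Fatality and injury counts must be non-negative
  stopifnot(all(df$DEATHS_DIRECT >= 0, na.rm = TRUE),
            all(df$INJURIES_DIRECT >= 0, na.rm = TRUE))

  # Event types should come from an agreed-upon reference list
  valid_events <- c("Tornado", "Flood", "Hail", "Thunderstorm Wind")  # partial list
  unknown <- setdiff(unique(df$EVENT_TYPE), valid_events)
  if (length(unknown) > 0) {
    warning("Unrecognized event types: ", paste(unknown, collapse = ", "))
  }

  invisible(df)
}
```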
Regardless of whom I quote to lend credibility to my argument, the basic concepts remain the same. From the fathers of relational modeling, like E.F. Codd and C.J. Date, to more contemporary authorities, like data warehouse guru W.H. Inmon, or Hadley Wickham, Chief Scientist at RStudio and adjunct assistant professor of statistics at Rice University, they all acknowledge the importance of data quality. Using the more modern terminology of tidy data (quoted from Hadley Wickham in my last post):
- Each column should contain one variable. In other words, you don’t want a column that contains both work and home telephone numbers in the same field.
- Each observation of that variable should be in a different row.
- There should be a different table for each type of entity.
- Multiple tables that are related should have a column that links the tables together (referential integrity).
There are more details, but this is the basic idea. Lastly, all of the authorities concur that there should be a data dictionary that describes each entity and attribute of the data in detail. They may call it something different, but the contents remain the same. A data dictionary, or code book, usually consists of the entity/table name, the column/attribute name, the data type and length, whether the attribute is required (nullability), whether the attribute is a primary, foreign, or alternate key, and a description of the entity and/or attribute.
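As an illustration, a code book does not have to be anything fancy; even a small data frame will do. The entries below are a sketch based on my assumptions about a few columns in the NOAA details file, not the official documentation.

```r
# A minimal code book sketch: one row per attribute.
# The columns and descriptions below are illustrative assumptions about
# the NOAA storm details file, not an authoritative layout.
code_book <- data.frame(
  table       = "storm_details",
  column      = c("EVENT_ID", "STATE", "EVENT_TYPE", "DEATHS_DIRECT"),
  data_type   = c("integer", "character", "character", "integer"),
  nullable    = c(FALSE, FALSE, FALSE, TRUE),
  key         = c("primary", NA, NA, NA),
  description = c("Unique identifier assigned to each event",
                  "State where the event occurred",
                  "Standardized type of the weather event",
                  "Fatalities directly attributed to the event"),
  stringsAsFactors = FALSE
)
```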
The data I have chosen to work with does not require Hadoop, and the R code I present can be run on most modern personal computers with 8 GB of RAM or more (sessionInfo() output is provided at the end of the discussion). The principles still apply, though, whether you are working with petabytes or megabytes of data. The number one rule in data management and analysis in general is that you must know your data.
So once you know what the data are, how well do the values represent what you think should be there? Does the data refer to a whole other set of attributes residing in another table or file? Is it reference data? When you are looking at clickstream data, what do all of those columns of data mean, and how will you extract meaningful information from all that data? If you have performed your due diligence on the front end, you already know the answer to these questions.
Understanding your data is absolutely essential, and if you don’t know your data, it will remain just that, data. Harnessing the power of data as meaningful, and actionable information will remain out of reach without this knowledge.
Enough of the talking; let's look at some examples of good and bad data and see what the difference in results might look like.
Use the following link to download the data used in this post:

NOAA Storm Events Repository: http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
According to NOAA, “NCDC has performed data reformatting and standardization of event types but has not changed any data values for locations, fatalities, injuries, damage, narratives and any other event specific information. Please refer to the Database Details page for more information: http://www.ncdc.noaa.gov/stormevents/details.jsp”
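To give a feel for the routine download step, here is a sketch. The exact file names in the repository follow a versioned naming convention, so the name below is a placeholder assumption; check the directory listing for the real ones.

```r
# Sketch of pulling one year's detail file from the NOAA repository.
# The file name is a placeholder; the real files carry versioned names.
base_url  <- "http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
file_name <- "StormEvents_details_d1950.csv.gz"  # placeholder name

if (!dir.exists("data")) dir.create("data")
download.file(paste0(base_url, file_name),
              destfile = file.path("data", file_name), mode = "wb")

# read.csv() can read gzip-compressed files directly
storm_1950 <- read.csv(file.path("data", file_name), stringsAsFactors = FALSE)
```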
Before heading into any project analysis, it is best to have some idea of what you are looking for. What questions are you trying to answer?
The objective of this analysis is to:
- Compare weather related events that resulted in fatalities or injuries over 6 consecutive decades: 1950-1959, 1960-1969, 1970-1979, 1980-1989, 1990-1999, and 2000-2009.
- Plot the events with the most fatalities, per state, per decade, color coded by decade on a US Map.
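As a rough sketch of the aggregation behind these objectives, assuming the yearly files have already been combined into one data frame with YEAR, STATE, EVENT_TYPE, and DEATHS_DIRECT columns (those names are my assumptions), something along these lines would surface the deadliest event type per state per decade.

```r
library(dplyr)

# storm_data is assumed to hold the combined 1950-2009 records with
# YEAR, STATE, EVENT_TYPE, and DEATHS_DIRECT columns (assumed names).
top_fatal_events <- storm_data %>%
  filter(YEAR >= 1950, YEAR <= 2009) %>%
  mutate(decade = paste0(floor(YEAR / 10) * 10, "s")) %>%
  group_by(decade, STATE, EVENT_TYPE) %>%
  summarise(fatalities = sum(DEATHS_DIRECT, na.rm = TRUE)) %>%
  # summarise() drops the last grouping level, so the data is still
  # grouped by decade and STATE here
  filter(fatalities == max(fatalities)) %>%
  ungroup()
```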
This information can then be used to answer questions like:
- Have weather patterns changed?
- Are the leading causes of fatalities changing from region to region?
- Have the number of storms increased? (There is already discussion on NOAA's web site that the increase in the number of storms reported is likely due to the improved reporting and detection systems installed in recent decades, such as Doppler radar.)
- . . . . .
As more questions are asked, the requirements for specific data elements will change. Therefore, all available storm data will be captured and cleansed for potential future use.
I will stop here and continue in the next post. In that post I will take a look at the data, discuss some methods for managing it, and implement constraints that enforce the principles of tidy data. When complete, we will have a defined data structure, a data dictionary, and some R code to perform the analysis. More importantly, we will have a repeatable process for managing a recurring influx of data that can be trusted and used to obtain reliable, consistent, and actionable information.