The figure titled "Data Pipeline" is from an article by Jeffrey T. Leek & Roger D. Peng titled "Statistics: P values are just the tip of the iceberg." Both are well-known scientists in the fields of statistics and data science, and for them there is no need to debate the importance of data integrity; it is a fundamental concept. Current terminology uses the phrase "tidy data," coined by Hadley Wickham in an article of the same name. Whatever you call it, as scientists, they understand the consequences of bad data. Business decisions today are frequently driven by the results of data analysis, and, as such, today's executives must also understand those same consequences. Bad data leads to bad decisions.
Data Management Strategy
OK, case closed. There is nothing more to discuss or debate, right? On the surface, this is an obvious conclusion, and you would think there would be no need to discuss it any further. (I have been accused of having a stranglehold on the obvious on more than one occasion.) However, if this is so obvious, why, in my 20+ years of working as an information architect and data engineer, do I continue to see bad data? When I am engaged to help a company with its data, the first thing I should be handed is documentation defining the company's data management strategy. This has seldom happened (of course, they probably wouldn't need me if they could hand me their data management strategy documentation).
Typically, the first thing I do is obtain access to "at least one" of their major databases and reverse engineer it with a tool like Erwin to see how they are managing their most important data. Invariably, I see very nicely arranged rows and columns in hundreds of tables without a single relationship or primary key assigned. In addition, you hardly ever find a data dictionary. If you have a question about the data, you are generally required to schedule an appointment with a very busy individual who is the keeper of that data and is considered the local subject matter expert.
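The kind of reverse-engineering check described above can be automated. Here is a minimal sketch, using Python's standard `sqlite3` module against a hypothetical in-memory schema (the table names are illustrative, not from any real client database), that flags tables declared without any primary key:

```python
import sqlite3

# Hypothetical in-memory database standing in for a reverse-engineered schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE details (event_id INTEGER PRIMARY KEY, event_type TEXT);
CREATE TABLE locations (event_id INTEGER, lat REAL, lon REAL);  -- no key at all
""")

def tables_without_primary_key(conn):
    """Return names of user tables that declare no primary key column."""
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    missing = []
    for name in names:
        cols = conn.execute(f"PRAGMA table_info({name})").fetchall()
        # Column 5 of table_info is the pk ordinal; 0 means not part of the key.
        if not any(col[5] for col in cols):
            missing.append(name)
    return missing

print(tables_without_primary_key(conn))  # → ['locations']
```

A commercial RDBMS exposes the same information through its catalog views; the point is that "how many tables have no key?" is a one-query question.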
I quoted "at least one" in the paragraph above to highlight that there are usually several major databases and numerous lesser ones. The fact that numerous databases are siloed throughout the company is a good indicator that there is a lot of work to do. There are large companies that literally have thousands of data stores of duplicated data, and a massive ETL team that is busy moving data from one database to another.
Relative to Big Data?
This is a "Big Data" forum, so what does this discussion have to do with big data? If your company looks anything like the one described above, then you are not ready to manage a big data project. Organizations that successfully implement a big data strategy have a documented corporate data management strategy, and big data is simply one part of the overall strategy for properly managing this valuable asset. We have all heard of failed big data projects, and there are numerous reasons for them. The lack of a corporate data management strategy, and a general lack of understanding of data management, can explain most failed projects.
When data is received from a third party, as discussed in my previous post, Data Integrity: The Rest of the Story Part II, there needs to be a process in place for managing this data upstream. A huge mistake, in my opinion, is to put data into a data lake, or any other type of data store, without first putting it through the learning and cleansing process. It is far too easy to rationalize shortcuts, and far too difficult to justify revisiting the same work. Clean the data upstream before it is allowed into your data stores, and the manipulation and analysis of that data will always serve its purpose.
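An upstream "quality gate" of the kind argued for above can be as simple as a validation function that every incoming row must pass before it is allowed into a data store. The field names and rules below are illustrative assumptions, not any real third-party schema:

```python
# A minimal sketch of an upstream quality gate: third-party rows are validated
# before they reach any data store. Field names and rules are illustrative.
def validate_row(row):
    """Return a list of validation errors; an empty list means the row is clean."""
    errors = []
    if not row.get("event_id"):
        errors.append("missing event_id")
    lat = row.get("lat")
    if lat is None or not (-90.0 <= lat <= 90.0):
        errors.append("latitude out of range")
    return errors

def quality_gate(rows):
    """Split incoming rows into clean rows and rejected rows with reasons."""
    clean, rejected = [], []
    for row in rows:
        errs = validate_row(row)
        if errs:
            rejected.append((row, errs))
        else:
            clean.append(row)
    return clean, rejected

incoming = [
    {"event_id": 1, "lat": 35.2},
    {"event_id": None, "lat": 97.1},  # bad: no key, impossible latitude
]
clean, rejected = quality_gate(incoming)
print(len(clean), len(rejected))  # → 1 1
```

Rejected rows go back to the supplier (or to a remediation queue) instead of quietly landing in the lake.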
This gets into the subject of Data Governance and Data Quality management (components of a Data Management Strategy). We will leave those for another forum discussion, but I didn't want to discuss the data pipeline and data integrity without at least mentioning the key component of governance.
NOAA Storm Data Analysis
Using the example from my last post regarding the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database: if NOAA had properly maintained their data, it would be of far more value to the consumer than it is in its current state. Let me restate that I am not picking on NOAA; these examples are everywhere, and in many cases are much worse than NOAA's storm database. Since my last post, I have worked on this data only a little, and in that short time I found more errors than were discussed last month. Not only were there 2013 duplicates in the FATALITIES data and 28 FATALITIES records without a storm, but there are 28,332 LOCATION records that refer to a storm event that does not exist in the DETAILS table.
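Orphaned rows like those LOCATION records are easy to surface with an anti-join. A sketch, using simplified stand-ins for the NOAA tables (the schema below is a toy, not NOAA's actual layout):

```python
import sqlite3

# Toy stand-ins for the NOAA tables: LOCATION rows whose event has no
# matching row in DETAILS are orphans.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE details (event_id INTEGER PRIMARY KEY);
CREATE TABLE locations (event_id INTEGER, location TEXT);
INSERT INTO details VALUES (1), (2);
INSERT INTO locations VALUES (1, 'RALEIGH'), (2, 'APEX'), (99, 'NOWHERE');
""")

# Anti-join: keep only LOCATION rows with no match in DETAILS.
orphans = conn.execute("""
    SELECT l.event_id, l.location
    FROM locations l
    LEFT JOIN details d ON d.event_id = l.event_id
    WHERE d.event_id IS NULL
""").fetchall()
print(orphans)  # → [(99, 'NOWHERE')]
```

Run against the real tables, `COUNT(*)` over the same anti-join is how a figure like 28,332 is obtained.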
As you will recall from the previous post, the problems identified in NOAA's storm database included:
The last problem area identified in the NOAA data is exactly why you need referential integrity on a normalized data set with properly defined constraints. All of these controls protect your data from sloppy data management practices. The duplicates discovered in the NOAA data were probably introduced during the data cleansing process the data went through in 2014. The file dates on which the duplicates occurred all fall between 1997 and 2014, so human error probably introduced the duplicates during a batch update. This happens to the best of us, and it is exactly why we need stringent controls on our data -- to protect us from ourselves.
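Here is a small sketch of those controls in action, again with toy tables: once a primary key and a foreign key are declared, the database itself rejects both duplicates and orphans, regardless of how sloppy a batch update is:

```python
import sqlite3

# With a primary key and a foreign key declared, the database rejects
# duplicates and orphans at insert time. Toy schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
conn.executescript("""
CREATE TABLE details (event_id INTEGER PRIMARY KEY);
CREATE TABLE fatalities (
    fatality_id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES details(event_id)
);
INSERT INTO details VALUES (1);
INSERT INTO fatalities VALUES (10, 1);
""")

for bad in ("INSERT INTO fatalities VALUES (10, 1)",     # duplicate key
            "INSERT INTO fatalities VALUES (11, 999)"):  # orphan: no event 999
    try:
        conn.execute(bad)
        print("accepted:", bad)
    except sqlite3.IntegrityError as e:
        print("rejected:", e)
```

Both inserts fail with an integrity error; a fatality without a storm simply cannot get in.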
The list of errors discussed for this data set is not complete. I addressed some of the major issues, and the quality of the data is greatly enhanced at this point. However, as I started developing exploratory plots to demonstrate the errors, I discovered that there are 248,982 records in the LOCATIONS table that contain no values for LATITUDE and LONGITUDE, and of these, 187,367 records contain no value for LOCATION (most of those that do have a LOCATION are very general, like "Countywide," yet no county name is provided). I guess my question here is: why create a LOCATION record with no location?
The example in the previous paragraph highlights the importance of understanding your data. Someone could mistakenly think that there are valid locations for 1,001,608 storms, when in reality the number is closer to 724,294. Regarding the LOCATION data, one violation of normalization rules is that LAT/LONG data appears in the DETAILS table as well as the LOCATIONS table. Which one is correct? Do the LAT/LONG values provided in the DETAILS table match those in the LOCATIONS table? I will leave that for someone like NOAA to fix. Lastly, the existing LAT/LONG values are not all in the correct format, nor are they all within a valid range. All longitudes west of the Prime Meridian, which includes the entire United States, should be negative numbers, yet the LON2 values in the LOCATIONS table are positive. The range of valid latitude and longitude values for the 48 contiguous United States is:
+48.987386 is the northernmost latitude
+18.005611 is the southernmost latitude
-124.626080 is the westernmost longitude
-62.361014 is the easternmost longitude
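A bounding-box check against the values listed above is a one-liner. This sketch simply encodes those bounds as constants; it is a coarse rectangle, not a true US boundary test:

```python
# Coarse bounding-box check using the contiguous-US bounds listed above.
CONUS_LAT = (18.005611, 48.987386)    # southernmost, northernmost
CONUS_LON = (-124.626080, -62.361014) # westernmost, easternmost

def in_conus(lat, lon):
    """True when a coordinate pair falls inside the contiguous-US bounding box."""
    return (CONUS_LAT[0] <= lat <= CONUS_LAT[1]
            and CONUS_LON[0] <= lon <= CONUS_LON[1])

print(in_conus(35.78, -78.64))   # Raleigh, NC → True
print(in_conus(97.10, 171.37))   # out-of-range values seen in the data → False
```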
In the NOAA LOCATIONS table, the ranges of values for the beginning coordinates of the storms are:
Latitude Range = -14.46 to 97.10
Longitude Range = -170.92 to 171.37
And for the ending location of the storm, the value ranges are:
LAT2 Range = -1427 to 6457000
LON2 Range = -17122 to 16012816
As you can see, these are far outside the ranges for the United States. Let's quickly take a look at the NOAA DETAILS ranges of values for the beginning and ending latitudes and longitudes:
Range of values for the beginning latitudes = "" to "REDLANDS" (that's correct -- these are values in the latitude column of the DETAILS table).
Range of values for the beginning longitudes = "" to " FL."
Range of values for the ending latitudes = "" to "RALEIGH"
Range of values for the ending longitudes = "" to " APEX"
The values are quite different once the data has been coerced into the correct data type. Taking just the BEGIN_LAT ranges: instead of a range from "" to "REDLANDS", you get a range of values from -14.4 to 70.5. Still not all exactly in the United States. Values less than 0 for latitude would be somewhere south of the Equator, but we can assume that the 70th parallel is somewhere up in Alaska (you can Google the 70th parallel; it goes through the Arctic and the northern tip of Alaska). Positive values for longitude would be east of the Prime Meridian (somewhere in Europe).
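The coercion step described above can be sketched in a few lines: non-numeric strings like "REDLANDS" become missing values rather than polluting the range. The sample values are taken from the ranges quoted above:

```python
# Coerce text latitude values to floats; non-numeric strings become None.
def coerce_numeric(values):
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError):
            out.append(None)  # "REDLANDS", "", " FL." all end up missing
    return out

begin_lat = ["35.2", "", "REDLANDS", "-14.4", "70.5"]
coerced = coerce_numeric(begin_lat)
numeric = [v for v in coerced if v is not None]
print(min(numeric), max(numeric))  # → -14.4 70.5
```

(In R, `as.numeric` with its coercion warning does the same job; the principle is identical.)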
Nevertheless, this once again would be a very simple thing to control with a numeric data type and a domain constraint on the acceptable range of values.
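For instance, a CHECK clause on a numeric column is all the domain constraint takes. A sketch with a toy single-column table:

```python
import sqlite3

# A numeric column with a CHECK clause restricting latitude to a valid range.
# Toy table for illustration; a production constraint would be tighter.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE locations (
    event_id INTEGER PRIMARY KEY,
    begin_lat REAL CHECK (begin_lat BETWEEN -90.0 AND 90.0)
)""")
conn.execute("INSERT INTO locations VALUES (1, 35.2)")  # accepted

try:
    # 6457000 is the impossible LAT2 value seen in the NOAA data above.
    conn.execute("INSERT INTO locations VALUES (2, 6457000)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

With this in place, a value like 6457000 never makes it into the table in the first place.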
To Normalize or Not, that is the question!
Over the years I have had many debates with colleagues over the value of a rigidly controlled database versus a loosely controlled one. What some fail to understand is that well-defined standards and structures enable flexibility, extensibility, reuse, and resilience in a database. I like to think of it in terms of plug-and-play. The hardware standards that evolved in the '90s revolutionized the computer industry, and it was all because of very well defined standards. Who recalls trying to find a sound card that would work with your Compaq computer?
The same applies to data management best practices and standards; they enable flexibility. Relational data maintained at the ATOMIC level can easily be presented in multiple views, dependent on the requirements. Aggregate, dimensional, and fact tables can be created as views, and then easily modified to accommodate changing requirements. The same applies to Big Data analysis.
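The point about views over atomic data can be sketched concretely: keep the base table atomic, and derive the aggregate presentation as a view that can be redefined at any time without touching the data. Toy schema again:

```python
import sqlite3

# Atomic base table plus a derived aggregate view. Changing requirements means
# redefining the view, not restructuring the data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE details (event_id INTEGER PRIMARY KEY,
                      event_type TEXT, deaths INTEGER);
INSERT INTO details VALUES (1, 'Tornado', 3), (2, 'Tornado', 1), (3, 'Flood', 2);
CREATE VIEW deaths_by_type AS
    SELECT event_type, SUM(deaths) AS total_deaths
    FROM details GROUP BY event_type;
""")
rows = conn.execute(
    "SELECT * FROM deaths_by_type ORDER BY event_type").fetchall()
print(rows)  # → [('Flood', 2), ('Tornado', 4)]
```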
In my opinion, there are very few legitimate reasons for removing constraints and indexes, or for denormalizing. Complexity and performance are the two arguments I hear most frequently, and neither holds water in my opinion. Yes, there are use cases where exceptions are made, but they are few and far between. I also believe that your data has to get really big, and come at you really fast, to warrant non-standard data practices.
Big Data is simply a lot of small data. That said, there are legitimate use cases for a Hadoop platform (e.g., sensor data, clickstream analysis, and real-time predictive analysis à la Amazon and Netflix), but my personal opinion is that an integrated platform with a reputable RDBMS like Oracle is the way to go. My next post will be on the topic of using Oracle R Advanced Analytics for Hadoop on Oracle's Big Data platform.
To wrap up this series of posts on data integrity, I hope you have taken something positive away from the discussion and, if nothing else, have a greater appreciation for the importance of data integrity, as well as a better understanding of what is involved in maintaining clean data. And please, if you disagree, find errors, or whatever, feel free to leave a comment. I would love to hear your opinion on the subject. I will provide a link to the code, exploratory plots, and maps on RPubs later this week.