Buzz words are one of my least favorite things, but as buzz words go, I can appreciate the term “Data Lake.” It is one of the few buzz words that communicates a meaning very close to its intended definition. As you might imagine, with the advent of large scale data processing, there would be a need to name the location where lots of data resides, ergo, data lake. I personally prefer to call it a series of redundant commodity servers with Direct-Attached Storage, or hyperscale computing with large storage capacity, but I see the need for a shorter name. Also, the person that first coined the phrase could just as easily have called it Data Ocean, but they were modest and settled for Data Lake. I like that.
Not to be outdone, technology vendors, being who they are, came up with their own “original” buzz word, “Data Reservoir,” implying a place where clean, processed data resides. Whatever you want to call it is fine with me as long as you understand that unless you follow a solid data management process, even your data reservoir could end up being a data landfill (also a popular buzz word). The point of this post is to demonstrate what a real data management process might look like by providing real examples, with real code, and, most importantly, to emphasize the importance of having a good process in place. Once implemented, you will have good data, from which solid analysis can be performed, and from which good decisions can be made.
Data quality suffers because everyone is in a rush, quality is costly, and enforcing it can actually be very difficult and complex, as you will see when you follow through the whole exercise below. It is much easier to take the easy way out. After all, executives don’t realize how bad their data is until they’ve been bitten, and even then it is difficult to trace the problem back to its root cause.
Have you ever wondered how many events were caused by poor data management practices that were subsequently blamed on the executive’s decision making process? Do you think the executives at AIG, or Lehman Brothers, knew the full scope of their exposure to subprime mortgage loans? Sure, they were aware of problems, and they may have been given reports that were “directionally correct” but just “wrong.” Maybe the slope of that line should have been much steeper. Of course, greed can easily change the slope of a line. There is no way of telling, but when critical decisions are being made, being “directionally correct” is not good enough.
Information Data Flow Process
The flow of data into an organization employs many stages and processes along the way. Each phase is critical to achieving the ultimate goal, which is to provide reliable information that enables accurate conclusions and decisions. This discussion will focus on knowing the data, and cleaning, or tidying of data. This does not mean that other phases are less critical. Even with tidy data that is well understood, improper data collection procedures, faulty analysis, and misleading reports can be just as devastating as bad data. Every phase is important, and errors at any phase can cause a break in the information data flow process.
Whether it is a Big Data scenario, or not, data quality must be managed to the maximum extent possible. I am not stating that all Big Data, or any other data, must be managed as closely as what I demonstrate below. However, I am stating that any data that is to be used for analysis upon which decisions will be made could, and, in most cases, should be managed with rigorous data integrity rules enforced.
A quick side note here, these principles apply to more than just data flowing in and out of a database for analysis. The same rigor must be applied in managing data of all kinds. All successful Service Oriented Architecture (SOA) and Enterprise Service Bus implementations have a mature Governance process in place. Service contracts are managed rigorously just as you would a normalized database. Authoritative data sources are defined, security for sensitive data is managed closely, and well defined data structures and schemas (Canonical models and XML Schemas) are defined to enable seamless information exchange, and so on. Managing data, at all levels is a complicated matter and requires a great deal of hard work and diligence to ensure its security and trustworthiness.
With the terms Velocity, Volume, and Variety used to define Big Data, images are conjured up of data coming in fast and furious, with little, or no time to plan for this flood of streaming data. However, it does not work that way, or at least it shouldn’t. Yes, a company should have an infrastructure and processes in place that allow for rapid and accurate responses to a fast paced economy. However, this does not mean that data should be integrated without proper due diligence.
Being flexible, agile, and responsive should not translate to being reactionary, or careless. Before Big Data, or any other data, is incorporated into a company’s knowledge stores, planning must occur. This planning involves knowing the data, understanding its value, and ensuring its proper management, and security. This discussion will hopefully highlight the importance of managing data to ensure its integrity.
Data Processing Exercise:
The use case for this exercise is that a company will routinely download data from a 3rd party vendor. For the purposes of this discussion, the 3rd party data will come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This data was selected because it is relatively simple, yet complicated enough to demonstrate more advanced topics, and relatively small in size. It consists of three tables: 1) Details, 2) Fatalities, and 3) Locations. These three tables are broken up into 65 files each (one for each year starting in 1950 and ending in 2014). Across the three table types, there are a total of 73 columns and approximately 2.3M records. There is also a fourth table called UGC that provides State and County codes. However, there are numerous problems with how this reference data has been used. It will be loaded and discussed briefly.
Obviously, the scope of this effort would be quite different at the enterprise level. Multiply what is done below by an appropriate factor and you will get an idea of the scope of work involved when you are dealing with several hundred distinct tables (not three) with several thousand columns. Getting to know and understand data is not a trivial task.
For this exercise the NOAA storm database was selected. Currently, this data set covers storm activity across the United States starting in January 1950 through December 2014. According to NOAA, “NCDC has performed data reformatting and standardization of event types but has not changed any data values for locations, fatalities, injuries, damage, narratives and any other event specific information. Please refer to the Database Details page for more information: http://www.ncdc.noaa.gov/stormevents/details.jsp”
Given this disclosure, the expectation was that there would be one or two types of errors discovered, bringing me to the rapid conclusion: "even with data that has undergone a thorough review, checks should always be performed." Unfortunately, in very little time significant problems were discovered, thus the length of this post. For the sake of brevity, some solutions are just mentioned, and the code does not cover a production level solution.
This is not intended to be critical of NOAA. It is simply pointing out that all data can be corrupt, and enforcement of data integrity is a constant battle if not managed properly. NOAA is now on version 3.0 as of May 12, 2014 (http://www.ncdc.noaa.gov/stormevents/versions.jsp), and it appears there is still room for improvement.
Use the following link to download the data used in this post:
NOAA Storm Events Repository - http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
The data was downloaded and loaded using the R programming language. The data store is an Oracle 12c XE RDBMS. However, this could just as easily be a Hadoop Data Lake, Teradata, DB2, MySQL, or PostgreSQL database (part of the reason the RJDBC package was chosen instead of ROracle was to accommodate various databases - hopefully, this will at least minimize the required modifications).
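The actual load in this exercise is done in R with RJDBC against Oracle (see the code link below), but the shape of the step is easy to sketch. The following is a minimal, illustrative Python version using the standard library, with sqlite3 standing in for the Oracle target and a tiny fabricated sample standing in for one of the 65 yearly CSV files; the column names are real NOAA columns, the rows are made up.

```python
import csv
import sqlite3

# A tiny stand-in for one of the yearly NOAA detail files; the real
# download is 65 per-year CSVs for each of the three tables.
sample_rows = [
    ["EVENT_ID", "STATE", "EVENT_TYPE"],
    ["10096222", "TEXAS", "Hail"],
    ["10120412", "KANSAS", "Tornado"],
]
with open("details_sample.csv", "w", newline="") as f:
    csv.writer(f).writerows(sample_rows)

# Load into a database; sqlite3 stands in for the Oracle 12c XE target.
# The same portability idea motivated choosing RJDBC over ROracle.
conn = sqlite3.connect(":memory:")
with open("details_sample.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)                       # first row holds column names
    cols = ", ".join('"%s" TEXT' % c for c in header)
    conn.execute("CREATE TABLE details (%s)" % cols)
    conn.executemany(
        "INSERT INTO details VALUES (%s)" % ", ".join("?" * len(header)),
        reader,
    )

count = conn.execute("SELECT COUNT(*) FROM details").fetchone()[0]
print(count)  # 2 data rows loaded
```

Note that everything lands as TEXT here, just as everything initially landed as varchar2(255) in the exercise; tightening the types is part of the tidying discussed below.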
Know and Tidy the Data:
Note: With data, there are many terms used to define data elements. For instance, in logical modeling the terms ENTITY and ATTRIBUTE are used, whereas in physical modeling TABLE and COLUMNS are used. To avoid confusion, we will use the physical modeling terminology for the remainder of the discussion.
Know – the Data:
The NOAA data will serve as a good example of how detailed the data study must be. The data could just as easily be bank, lending, financial markets and investment, commodities, vendor inventory lists, marketing data, . . . When data is being incorporated into a company infrastructure, whether it is a data lake, or just a couple of million records going into an RDBMS, the concepts proposed could, and should, be implemented in most scenarios.
The first task is to develop a data dictionary (or acquire one from the 3rd Party Source if it exists). NOAA provides a good start to the data dictionary, but it is not complete.
A data dictionary typically consists of a document in spreadsheet, or table format with, at minimum, the table name, column name, data type and size, optionality (null/not null), key designations (primary, foreign, alternate), and a business definition for each column.
Here is a sample of the dictionary created for this project. This particular format does not show the column size, but that information is displayed in the next figure, the Physical Data Model:
Depending on the project, it is usually very helpful to develop a data model that depicts the table relationships. I prefer to use Computer Associates’ Erwin Data Modeler. It is an industry standard for enterprise modeling, although there are many tools that would be suitable for smaller projects like this one, and XMLSPY is an excellent tool for depicting XML schemas. The model depicted below reflects how the data is modeled after having run through the exercise below. Previously, all character columns were varchar2(255), and there were no primary, or foreign keys assigned.
To develop the data model and dictionary, I could have used the documents provided by NOAA exclusively, and entered each table and column name. However, I tend to be lazy and find it much easier to load the data, tidy it as much as possible, which includes assigning primary, foreign, and alternate keys as necessary, modifying the data types through the use of R (detailed in the code below), and then reverse engineering the database using Erwin. This provides most of what is needed for the data dictionary. If a logical model was required, the names would need to be changed to a more human readable format. Then all I have to do is copy over the definitions and I’m done. So far, so good: this did not require a lot of work, and it exposed a lot of problems with the data, as we will discuss now. Also, please keep in mind, this is four tables total, not a hundred.
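The reverse-engineering step is worth a quick sketch. Erwin pulls this from Oracle's catalog (e.g. ALL_TAB_COLUMNS); the same skeleton can be pulled programmatically, shown here with sqlite3 and a hypothetical three-column details table standing in for the real schema. The point is that names, types, and key flags come from the database, and only the definitions need to be copied in by hand.

```python
import sqlite3

# Hypothetical, simplified version of the Details table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE details (
        event_id   INTEGER PRIMARY KEY,
        state      VARCHAR(30),
        event_type VARCHAR(50)
    )
""")

# PRAGMA table_info returns (cid, name, type, notnull, default, pk)
# per column -- enough to generate the dictionary skeleton.
dictionary = [
    {"table": "details", "column": row[1], "type": row[2],
     "pk": bool(row[5]), "definition": ""}   # definitions copied in later
    for row in conn.execute("PRAGMA table_info(details)")
]
for entry in dictionary:
    print(entry)
```

The dictionary rows produced this way stay in sync with the physical schema, which matters once keys and data types start changing during tidying.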
Note: The model depicted is not a fully normalized model. The lack of normalization causes a few issues that are discussed below, but this model will suffice for our purposes.
Before heading into any data project, it is best to have some idea of what you are trying to accomplish. What are the requirements? What questions are you trying to answer? Defining the use case is an important step and no data project should be undertaken without clear objectives in mind. Believe it or not, the lack of well defined requirements is a common problem in data management. Agile has come to mean no documentation, and I don’t believe that was the intent. So, for no other reason than to remind everyone of its importance (including myself), we will at least use the following as our high level requirements for this project.
The objective of this analysis is to:
Due to space limitations on the post, if you would also like to see the associated code for this exercise, please go to the following link: http://rpubs.com/rshane/81297
However, the following provides a summary of the issues discovered:
A design like this is just asking for trouble and is really unnecessary when you have a date data type that enforces all of those rules for you. Why did Hadley Wickham develop lubridate if we were going to keep our date fields so neatly parsed? :) For this reason, the columns pertaining to dates (YEARMONTH, YEAR, MONTH, HOUR, . . . ) were removed from the data. This type of data is readily available from the DATE and TIME fields provided.
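The point is that every one of those removed columns is derivable from a single well-typed timestamp, and a derived value can never disagree with itself the way separately stored parts can. A quick Python illustration (the date value here is invented):

```python
from datetime import datetime

# One properly typed timestamp replaces the redundant
# YEARMONTH/YEAR/MONTH/HOUR columns: each is derived on demand.
begin = datetime.strptime("28-APR-1950 14:30:00", "%d-%b-%Y %H:%M:%S")

year, month, hour = begin.year, begin.month, begin.hour
yearmonth = begin.strftime("%Y%m")
print(year, month, hour, yearmonth)  # 1950 4 14 195004
```

The same derivations are one-liners in R with lubridate (`year()`, `month()`, `hour()`), or in SQL with `EXTRACT`.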
Note: Two duplicates out of 1,301,016 rows do not in themselves seem troubling. However, keep in mind that all duplicates essentially cause a Cartesian join. Without constraints and checks, errors are magnified. In the Fatalities table there are 2,013 duplicates, and as we will see there are many more problems introduced by the lack of integrity constraints.
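To see the magnification, consider a key duplicated on both sides of a join: the matching rows multiply rather than add. The toy rows below are made up; in the real data the join is Details to Fatalities on EVENT_ID.

```python
# A duplicated key on each side of a join produces a mini Cartesian
# product: 2 x 2 = 4 joined rows where only 1 was expected.
details = [("E1",), ("E1",)]             # E1 accidentally duplicated
fatalities = [("E1", "D"), ("E1", "D")]  # same duplication here

joined = [(d, f) for (d,) in details for (e, f) in fatalities if d == e]
print(len(joined))  # 4
```

A unique constraint on the key would have rejected the second copy at load time, which is exactly what the lack of integrity constraints gives up.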
I am sure this list is not exhaustive, but you get the idea. The fatalities table, if you look at the data, only provides a few details about the fatality, like date, age, and sex. The rest of the information about the type of weather event is contained in the Details table. There are 28 records in the Fatalities table with no matching record in the Details table. In other words, a fatality without a storm event.
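Finding those orphans is a simple anti-join: keep every fatality whose EVENT_ID has no match in Details. The IDs below are toys standing in for real EVENT_IDs; a foreign key constraint would have rejected such rows at load time.

```python
# Orphan check: a fatality record whose EVENT_ID is absent from the
# Details table is a child without a parent (28 such rows in the data).
detail_ids = {"E1", "E2", "E3"}
fatality_ids = ["E1", "E1", "E9"]      # E9 has no storm event

orphans = [fid for fid in fatality_ids if fid not in detail_ids]
print(orphans)  # ['E9']
```

In SQL this is the classic `WHERE NOT EXISTS` (or left join with an IS NULL filter) pattern against the parent table.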
The following Venn diagram gives you a visual of the differences in the data before and after removing duplicates and orphans, and enforcing referential integrity. Notice that there is some intersection between Fatalities and Locations. However, this just means that the EVENT that occurred at a specific location, caused fatalities. The data does not provide the location of the fatality. Both direct and indirect deaths could have occurred miles away in a hospital a month later.
So, how many of these match? Is every fatality record with a "D" for FATALITY_TYPE represented in the details table's DEATHS_DIRECT column? You can do the checks, but hopefully you get the point. This is denormalized, and unenforceable, which is not good. The difference in counts can be explained by missing historical data. The sex, age and date of every storm victim is probably unknown. However, inconsistencies are not a result of missing data, just mismanaged data.
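The check itself is straightforward to sketch: per event, count the Fatalities rows flagged "D" and compare against the event's DEATHS_DIRECT value in Details. The data below is fabricated, with E2 deliberately inconsistent to show what the check flags.

```python
from collections import Counter

# Denormalization check: does the per-event count of "D" (direct)
# fatality records match the DEATHS_DIRECT column in Details?
details = {"E1": 2, "E2": 1}                      # EVENT_ID -> DEATHS_DIRECT
fatalities = [("E1", "D"), ("E1", "D"), ("E2", "D"), ("E2", "D")]

direct = Counter(eid for eid, ftype in fatalities if ftype == "D")
mismatches = [e for e, n in details.items() if direct[e] != n]
print(mismatches)  # ['E2']: 2 fatality rows but DEATHS_DIRECT says 1
```

Storing the count in one table and the detail rows in another, with nothing enforcing agreement, is precisely why this kind of reconciliation has to be run at all.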
There would still be some work to do on this data: property and crop damages are not in a good format for analysis, the latitude and longitude values in the Locations table are inconsistent and need standardization, and one thing NOAA did not do was categorize the storms more narrowly. However, I will stop here. If you have made it this far, you are probably a data geek like me, and you get my point(s). Hopefully, you have also developed a better understanding of what’s involved in maintaining data integrity. If you have something to add where I have overlooked, or misstated a point, or made an error, I would love to hear from you. Data geeks love data, and for the rare “non-data geek” that made it this far, it’s probably a lot like watching paint dry. I appreciate you hanging in there. Next time!