Data Integrity: The Rest of the Story Part II

Buzz words are one of my least favorite things, but as buzz words go, I can appreciate the term “Data Lake.” It is one of the few buzz words that communicates a meaning very close to its intended definition. As you might imagine, with the advent of large scale data processing, there would be a need to name the location where lots of data resides, ergo, data lake. I personally prefer to call it a series of redundant commodity servers with Direct-Attached Storage, or hyperscale computing with large storage capacity, but I see the need for a shorter name. Also, the person that first coined the phrase could just have easily called it Data Ocean, but they were modest and settled for Data Lake. I like that.

Not to be outdone, and technology vendors being who they are, came up with their own “original” buzz word, “Data Reservoir”, implying a place where clean, processed data resides. Whatever you want to call it is fine with me as long as you understand that unless you follow a solid data management process, even your data reservoir could end up being a data landfill (also a popular buzz word). The point of this post is to demonstrate what a real data management process might look like by providing real examples, with real code, and, most importantly, to emphasize the importance of having a good process in place. Once implemented, you will have good data, from which solid analysis can be performed, and from which good decisions can be made.

The reason data quality suffers is that everyone is in a rush, it is costly, and can actually be very difficult and complex as you will see when you follow through the whole exercise below. It is much easier to take the easy way out. After all, executives don’t realize how bad their data is until they’ve been bitten, and even then it is difficult to trace it back to the root cause.

Have you ever wondered how many events were caused by poor data management practices that were subsequently blamed on the executive’s decision making process? Do you think the executives at AIG, or Lehman Brothers knew the full scope of their exposure to the subprime mortgage loans? Sure they were aware of problems, and they may have been given reports that were “directionally correct”, they were just “wrong.” Maybe the slope of that line should have been much steeper. Of course, greed can easily change the slope of a line. No way of telling, but when critical decisions are being made, being “directionally correct” is not good enough.

Information Data Flow Process

The flow of data into an organization employs many stages and processes along the way. Each phase is critical to achieving the ultimate goal, and this goal is to provide reliable information that enables accurate conclusions and decisions. This discussion will focus on knowing the data, and cleaning, or tidying of data. This does not mean that other phases are less critical. Even with tidy data that is well understood, improper data collection procedures, faulty analysis and misleading reports can be just as devestating as bad data. Every phase is important, and errors at any phase can cause a break in the information data flow process.

Whether it is a Big Data scenario, or not, data quality must be managed to the maximum extent possible. I am not stating that all Big Data, or any other data, must be managed as closely as what I demonstrate below. However, I am stating that any data that is to be used for analysis upon which decisions will be made could, and, in most cases, should be managed with rigorous data integrity rules enforced.

A quick side note here, these principles apply to more than just data flowing in and out of a database for analysis. The same rigor must be applied in managing data of all kinds. All successful Service Oriented Architecture (SOA) and Enterprise Service Bus implementations have a mature Governance process in place. Service contracts are managed rigorously just as you would a normalized database. Authoritative data sources are defined, security for sensitive data is managed closely, and well defined data structures and schemas (Canonical models and XML Schemas) are defined to enable seamless information exchange, and so on. Managing data, at all levels is a complicated matter and requires a great deal of hard work and diligence to ensure its security and trustworthiness.

Common Misconception

With the terms Velocity, Volume, and Variety used to define Big Data, images are conjured up of data coming in fast and furious, with little, or no time to plan for this flood of streaming data. However, it does not work that way, or at least it shouldn’t. Yes, a company should have an infrastructure and processes in place that allow for rapid and accurate responses to a fast paced economy. However, this does not mean that data should be integrated without proper due diligence.

Being flexible, agile, and responsive should not translate to being reactionary, or careless. Before Big Data, or any other data, is incorporated into a company’s knowledge stores, planning must occur. This planning involves knowing the data, understanding its value, and ensuring its proper management, and security. This discussion will hopefully highlight the importance of managing data to ensure its integrity.

Data Processing Exercise:

The use case for this exercise is that a company will routinely download data from a 3rd party vendor. For the purposes of this discussion, the 3rd party data will come from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This data was selected beacuse it is relatively simple, yet complicated enough to demonstrate more advanced topics, and relatively small in size. It consists of three tables: 1) Details, 2) Fatalities, and 3) Locations. These three tables are broken up into 65 files each (one for each year starting in 1950 and ending in 2014). Of the three table types, there are a total of 73 columns and approximately 2.3M records across all three tables. There is also a fourth table called UGC that provides State and County codes. However, there are numerous problems with how this reference data has been used. It will be loaded and discussed briefly.

Obviously, the scope of this effort would be quite different at the enterprise level. Multiply what is done below by an appropriate factor and you will get an idea of the scope of work involved when you are dealing with several hundred distinct tables (not three) with several thousand columns. Getting to know and understand data is not a trivial task.

For this exercise the NOAA storm database was selected. Currently, this data set covers storm activity across the United States starting in January 1950 through December 2014. According to NOAA, “NCDC has performed data reformatting and standardization of event types but has not changed any data values for locations, fatalities, injuries, damage, narratives and any other event specific information. Please refer to the Database Details page for more information: http://www.ncdc.noaa.gov/stormevents/details.jsp

Given this disclosure, the expectation was that there would be one or two types of errors discovered, bringing me to the rapid conclusion: "even with data that has undergone a thorough review, checks should always be performed." Unfortunately, in very little time significant problems were discovered, thus the length of this post. For the sake of brevity, some solutions are just mentioned, and the code does not cover a production level solution.

This is not intended to be critical of NOAA. It is simply pointing out that all data can be corrupt, and enforcement of data integrity is a constant battle if not managed properly. NOAA is now on version 3.0 as May 12, 2014. http://www.ncdc.noaa.gov/stormevents/versions.jsp and it appears there is still room for improvement.

Use the following link to download the data used in this post:

NOAA Storm Events Repository - http: http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/

The data was downloaded, and loaded using the R Programming language. The data store for this data is an Oracle 12c XE RDBMS. However, this could just as easily be a Hadoop Data Lake, Teradata, DB2, MySQL, or PostgreSQL database (part of the reason the RJDBC package was chosen instead of ROracle was to accommodate various databases - hopefully, this will at least minimize the required modifications).

Know and Tidy the Data:

Note: With data, there are many terms used to define data elements. For instance, in logical modeling the terms ENTITY and ATTRIBUTE are used, whereas in physical modeling TABLE and COLUMNS are used. To avoid confusion, we will use the physical modeling terminology for the remainder of the discussion.

Know – the Data:

The NOAA data will serve as a good example of how detailed the data study must be. The data could just as easily be bank, lending, financial markets and investment, commodities, vendor inventory lists, marketing data, . . . When data is being incorporated into a company infrastructure, whether it is a data lake, or just a couple of millon records going into an RDBMS, the concepts proposed could, and should, be implemented in most scenarios.

The first task is to develop a data dictionary (or acquire one from the 3rd Party Source if it exists). NOAA provides a good start to the data dictionary, but it is not complete.

A data dictionary typically consists of a document in spreadsheet, or table format with the following minimum information:

  1. All table and coulumn names.
  2. Definitions for all tables and columns.
  3. Each column is further defined by:
    • Data type (character, number, date, timestamp, CLOB, . . . )
    • The data size (now many bytes of information can the column hold?)
    • Domain value contraints (Limited list of acceptable values: 0/1, T/F, blue/brown/black, . . .)
    • Whether the column can contain a null or not
    • Whether the column is a primary, foriegn, or alternate key

Here is a sample of the dictionary created for this project. This particular format does not show the column size, but that information is displayed in the next figure, the Physical Data Model:

Depending on the project, it is usually very helpful to develop a data model that depicts the table relationships. I prefer to use Computer Associates’ Erwin Data Modeler. It is an industry standard for enterprise modeling, although there are many tools that would be suitable for smaller projects like this one, and XMLSPY is an excellent tool for depicting XML schemas. The model depicted below reflects how the data is modeled after having run through the exercise below. Previously, all character columns were varchar2(255), and there were no primary, or foreign keys assigned.

To develop the data model and dictionary, I could have used the documents provided by NOAA exclusively, and entered each table and column name. However, I tend to be lazy and find it much easier to load the data, tidy it as much as possible, which includes assigning primary, foreign, and alternate keys as necessary, modifying the data types through the use of R (detailed in the code below), and then reverse engineering the database using Erwin. This provides most of what is needed for the data dictionary. If a logical model was required, the names would need to be changed to a more human readable format. Then all I have to do is copy over the definitions and I’m done. So far, so good, and this did not require a lot of work, and exposed a lot of problems with the data as we will discuss now. Also, please keep in mind, this is four tables total, not a hundred.

Note: The model depicted is not a fully normalized model. The lack of normalization causes a few issues that are discussed below, but this model will suffice for our purposes.

Data Analysis:

Before heading into any data project, it is best to have some idea of what you are trying to accomplish. What are the requirements? What questions are you trying to answer? Defining the use case is an important step and no data project should be undertaken without clear objectives in mind. Believe it or not, the lack of well defined requirements is a common problem in data management. Agile has come to mean no documentation, and I don’t believe that was the intent. So, for no other reason than to remind everyone of its importance (including myself), we will at least use the following as our high level requirements for this project.

The objective of this analysis is to:

  1. Download, and automate the loading and storage of all weather data as frequently as made available by NOAA.
  2. Compare weather related events that resulted in peronal injuries, or damanges.
  3. Check for changes in patterns by season, and location by in-depth data analysis.

Due to space limitations on the post, if you would also like to see the associated code for this exercise,  please go to the following link: http://rpubs.com/rshane/81297

However, the following provides a summary of the issues discovered:

  • First Problem: Transitive Dependency. BEGIN_DAY in table NOAA_DETAILS has a maximum length of 186 characters. In other words, the BEGIN_DAY is a non-key attribute and is dependent on other non-key attributes (month and year). For this to be allowed, there would have to be constraints in place to enforce several date rules (e.g. the number cannot be greater than 31 on months Jan, Mar, May, Jul, . . . , or greater than 30 on Apr, Jun, . . . , or greater than 28 in Feb except for leap years where it cannot be greater thant 29, . . . ) – basically what we call a DATE field.

A design like this is just asking for trouble and is really unnecessary when you have a date data type that enforces all of those rules for you. Why did Hadley Wickham develop lubridate if we were going to keep our date fields so neatly parsed? :) For this reason, the attributes pertaining to Dates (YEARMONTH, YEAR, MONTH, HOUR, . . . ) were removed from the data. This type of data is readily available from the DATE and TIME fields provided.

  • Second Problem: Lack of integrity constraints.
    • There are two duplicates in the Details table which indicates that the key is not currently enforced. In the Oracle database we will set the primary key to be the EVENT_ID and remove the two duplicates. As we will see below, there are many more problems with duplications, and they become much more significant. The point here is that any duplications indicate a lack of contraints to enforce integrity and the error is repeated throughout the data.
    • b There are 2013 duplicates in the Fatalities table. These duplicates indicate that there is no primary key constraint being enforced. Per the definition in the “Storm Data Export, Field Names” file: “fatality_id Ex: 17582, 17590, 17597, 18222 (ID assigned by NWS to denote the individual fatality that occurred within a storm event)” Individual implies uniqueness, and a primary key is a unique key that identifies a specific row, or observation.

Note: Two duplicates do not in themselves seem to be troubling, out of 1301016 rows. However, keep in mind that all duplicates essentially cause a Cartesian join. Without contraints and checks, errors are magnified. In the Fatalities table there are 2013 duplicates, and as we will see there are many more problems introduced by the lack of integrity constraints.

  • Third Problem: Lack of Referential Integrity Constraints. Constraints (e.g., data types, uniqueness, primary keys) and referential integrity contraints enforce rules. Without these rules being enforced, the data loses it meaning and value. The implied rules from reviewing the definitions and studying the data so far state:
    • There is one EPISODE, and this EPISODE can consist of multiple EVENTS
    • Not every EVENT is associated with an EPISODE
    • Each EVENT can appear multiple times within an EPISODE with a unique INDEX providing an azimuth, range, latitude, and longitude reading (the NOAA definition provides little information, but I assume the azimuth is to allow for more precise location reading through intersection and resection)
    • Each EPISODE should have a minimum of one EVENT
    • Each EVENT is associated with a LOCATION (some are missing which I assume is the result of missing historical data)
    • Each FATALITY is associated with one EVENT. There should be one record in the Details table for every record in the Fatalities table.

I am sure this list is not exhaustive, but you get the idea. The fatalities table, if you look at the data, only provides a few details about the fatality, like date, age, and sex. The rest of the information about the type of weather event is contained in the Details table. There are 28 records in the Fatalities table with no matching record in the Details table. In other words, a fatality without a storm event.

The following Venn diagram gives you a visual of the differences in the data before and after removing duplicates and orphans, and enforcing referential integrity. Notice that there is some intersection between Fatalities and Locations. However, this just means that the EVENT that occurred at a specific location, caused fatalities. The data does not provide the location of the fatality. Both direct and indirect deaths could have occurred miles away in a hospital a month later.

  • Fourth Problem: Denormalized Data. This is a vague problem statement that Implies a multitude of problems. There are multiple reasons for enforcing what is called the 3rd Normal Form (3NF) in data and the subject is far too broad to cover here. However, a couple of examples of how the lack of normalization causes problems in the NOAA data.
    • The fatalities table provides a column called FATALITY_TYPE. The acceptable values are “D” for DEATHS_DIRECT, and “I”, for DEATHS_INDIRECT. However, there is nothing enforcing these values. The value in the details table for DEATHS_DIRECT, and DEATHS_INDIRECT are cumulative – total number of deaths from the storm event. This implies that there will be the same number of individual records in fatalities table, as the value provided in the DEATHS_DIRECT column in the details table. There are 8942 events in the Details table where a fatality occurred either directly or indirectly. The total number of fatalities reported in the Details table for deaths caused directly from a weather event comes to 14939, and the total number from indirect causes is 1579, for a total of 16518. Yet, there are only 11884 observations listed in the Fatalities table after the duplicates are removed. Of these, 10584 are deaths caused directly from a storm, and 1300 are deaths caused indirectly form the storm.

So, how many of these match? Is every fatality record with a "D" for FATALITY_TYPE represented in the details table's DEATHS_DIRECT column? You can do the checks, but hopefully you get the point. This is denormalized, and unenforceable, which is not good. The difference in counts can be explained by missing historical data. The sex, age and date of every storm victim is probably unknown. However, inconsistencies are not a result of missing data, just mismanaged data. 

    • There are 464,047 records in the NOAA_DETAILS table without a matching record in NOAA_LOCATIONS. This could be acceptable since the data probably was lost, or not available for some reason. However, there are 134 in the NOAA_LOCATIONS table with no matching EVENT_ID in the NOAA_DETAILS table (a location without an Event). This is not acceptable. Again, it is in perspective of storm data, but it could be Accounts and Balance, or Transcation data.
    • There are numerous other examples, some more severe than others. The UGC data that I alluded to earlier is not referenced by its primary key, and I am not even addressing the fact that columns that have nothing to do with the EVENT are in each table. A tornado does not have a FLOOD_CAUSE, and a flood does not have a Fujita Scale. Tidy rule number 3, a separate table for entity type. There are other problems caused by this issue which I will not address, but the best rule of thumb is to enforce as many integrity constraints as you can. Performance arguments are often a red herring disguising laziness. 
  • Fifth Problem: Sloppy Data Management Practices. Data collection, management, and analysis is difficult, and mistakes are easy to make. This is why processes, testing, Quality Assurance, Data Quality programs, and the like are established. An example of poor data management processes, is when errors are introduced during the processing. The 2013 duplicates in the NOAA_FATALITIES table would have never occured had there been referential integrity in place, but barring that, when you know there is a place where errors can be introduced, develop checks to validate the data. In the Fatalities csv files, there are only a handful of files that contain duplicate records and it appears that during the processing on August 24, September 15, September 30, 2014, and February 24, and March 26, 2015 there were some errors introduced. For a further discussion on this error, and to see the code that exposed this error please us the following link:  http://rpubs.com/rshane/81297

There would still be some work to do on this data, such as property and crop damages are not in a good format for analysis, the latitude and longitude in the locations table is inconsistent and needs standardization, and one thing that NOAA did not do was to categorize the storms more narrowly. However, I will stop here. If you have made it this far, you are probably a data geek like me, and you get my point(s). Hopefully, you have also developed a better understanding of what’s involved in maintaining data integrity. If you have something to add where I have overlooked, or misstated a point, or made an error, I would love to hear from you. Data geeks love data, and for the rare "non-data geek" that made it this far, its probably a lot like watching paint dry. I appreciate you hanging in there. Next time!

Views: 3316

Tags: Analytics, Big, Data, Hadoop, IPv6, IRR, NPV, Network, Programming, R, More…ROI


You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Steven Neumersky on May 25, 2015 at 2:27pm
Years ago, Gartner published a statistic: in order to obtain good data quality one must spend, on average, about 5 hours per column. While I am not sure the context is the same here, you definitely make a good case for it!

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service