
Data Integrity - A Sequence of Words Lost in the World of Big Data

The subject of this post might seem rather rudimentary to those who already understand the importance of properly managing data. If you are one of those readers, I hope you will still find the post worth reading and will offer constructive feedback to extend the discussion. The purpose of this post is to highlight the necessity of keeping data clean and orderly so that the results of any analysis are reliable: if data integrity is intact, the information derived from that data can be trusted and acted upon.

The title for this post is meant to be a bit tongue-in-cheek. Over the years, the word "data" has seldom been used on its own to mean anything other than the standard definition:

  • "data |ˈdatəˈdātənoun treated as sing.or pl. ] facts and statistics collected together for reference or analysis. See also datum.• Computing the quantities, characters, or symbols on which operations are performed by a computer, being stored and transmitted in the form of electrical signals . . . ."

However, the word data takes on different meanings when coupled with other words. In some cases, the pairing conjures up a fairly accurate meaning; in other cases, I believe it can be very confusing. For instance, Data Analysis makes sense: values are examined and analyzed to extract some sort of meaningful information. Data Modeling is a bit ambiguous if the concepts of Normal Forms, Referential Integrity, Dimension and Fact Tables, Stars, Snowflakes, and Cubes are not readily understood. Still other terms, like Big Data, add just as much confusion, if not more. What is Big Data? Is it a really big number, or very voluminous data, arriving at very high velocity, with a lot of variety (I had to stick with the V's, no matter how contrived)?


Still, the term Big Data gives you an idea that you are dealing with a lot of data, and with a little reading an IT professional can surmise that Big Data poses great challenges for obtaining near real-time results with existing operating systems and analytical tools. Big Data might possibly require a different way of looking at data altogether. In fact, it does. There is no question about it: Big Data that satisfies at least the three V's listed above requires a clustered environment, with distributed storage and distributed processing across commodity hardware (enter Hadoop). With a little imagination you can see how taming these large data sets can provide a wealth of meaningful information and be a truly disruptive technology.

That said, the quality of the results derived from Big Data, Data Analytics, and Data Mining technologies is only as good as the data itself. Properly gathering and managing the data in the first place will save money and time, reduce errors, and simplify the process of gleaning knowledge from a data set. Properly managing data involves a variety of approaches that maintain Data Integrity throughout the data lifecycle.

Before I go further, I want to explain that this is not an attempt to revive the teachings and virtues of EF Codd, CJ Date, and WH Inmon (the list could be much longer). Each of them made significant contributions to the world of data management and engineering, and their contributions are still applicable today. To bolster that claim, I would refer you to more contemporary subject matter experts like Hadley Wickham and his paper on "Tidy Data," and to numerous other postings from professors, data scientists, and companies that are producing tools to work with messy data. Performing any proper data analysis requires a thorough understanding of the data being analyzed. Furthermore, many data sets are simply not suitable for analysis until extensive processing has occurred. Why not do what is necessary to ensure data integrity up front and avoid the hassle and costs of cleaning the data afterwards?

Data is everywhere today. You can go to numerous online sites where data is made available, and you will find a thoroughly defined Application Programming Interface (API) and a data dictionary that defines each data element in the set. Take a look at some of the following sites and you will find either an API for extracting data or a CSV file, along with a definition of the attributes. Some of the data is in better shape than the rest, but at least some basics are provided to get you started:

(For examples on how to access some of these data sets using R, see Hadley Wickham's Git Repo)
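
For readers not working in R, a comparable first pass in Python (with pandas, and a placeholder URL standing in for whichever portal you choose) might look like this minimal sketch:

```python
import pandas as pd

# Placeholder URL: substitute the CSV endpoint published by the data portal.
url = "https://example.org/open-data/sample.csv"
df = pd.read_csv(url)

# Compare what actually arrived against the published data dictionary.
print(df.dtypes)      # column names and inferred types
print(df.head())      # first few observations
print(df.describe())  # quick summary statistics for numeric columns
```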

Once the data is acquired, the quality of that data must be evaluated (a short code sketch of a few of these checks follows the list):

  1. Are there missing values?
  2. Is there more than one variable per column?
  3. Does a variable depend on another value?
  4. Does each attribute of a row belong to that one observation?
  5. Does each group of data provide attributes only for that group?
  6. Are the business rules enforced in the data?
  7. Are there values outside of the acceptable domain?
  8. Is there referential integrity?
  9. . . .
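
As a minimal sketch of how a few of these checks might be automated (Python with pandas, using a hypothetical persons.csv file and made-up column names), consider:

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
df = pd.read_csv("persons.csv")

# Check 1: count missing values per column.
print(df.isna().sum())

# Check 7: flag rows whose state code falls outside the acceptable domain.
valid_states = {"ND", "SD", "MN", "IA", "WI"}   # abbreviated list for the sketch
print(df[~df["state"].isin(valid_states)])

# Check 2: a pattern like "70-150" in a single column suggests two variables
# (e.g., height and weight) packed into one field.
print(df[df["height_weight"].str.contains("-", na=False)])
```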

Data Integrity should, in my opinion, be enforced at the source to the greatest extent possible, to avoid unnecessary work at the end. I hate to bring it up in this day of Hadoop, but Relational Database Management Systems (RDBMS) still have their place in the world of data management. As stated earlier, there is no question that Big Data satisfying at least the three V's listed above requires new approaches, e.g., Hadoop. However, if your data is not that big, consider more traditional means of managing it; you can still obtain cutting-edge advantages, like predictive analysis. Analysis of good data is worth a ton of analysis of bad data. Even in a Big Data environment, there are principles that, when enforced, will ensure the quality of the data and enhance its value.
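
To make "enforced at the source" concrete, here is a minimal sketch (Python's built-in sqlite3 module, with a hypothetical schema) of the kind of declarative constraints an RDBMS applies at insert time:

```python
import sqlite3

# Hypothetical schema, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when this is on

conn.execute("CREATE TABLE state (code TEXT PRIMARY KEY)")  # lookup of valid codes

conn.execute("""
    CREATE TABLE person (
        person_id      INTEGER PRIMARY KEY,
        last_name      TEXT NOT NULL,
        state_code     TEXT REFERENCES state(code),         -- referential integrity
        zip_code       TEXT CHECK (length(zip_code) = 5),   -- domain constraint
        marital_status TEXT CHECK (marital_status IN ('S', 'M', 'D', 'W'))
    )
""")

conn.execute("INSERT INTO state (code) VALUES ('ND')")

# This insert is rejected at the source: 'BD' is not in the state lookup table.
try:
    conn.execute(
        "INSERT INTO person (last_name, state_code, zip_code, marital_status) "
        "VALUES ('Smith', 'BD', '58501', 'S')"
    )
except sqlite3.IntegrityError as err:
    print("Rejected at the source:", err)
```

The same idea carries over to any full-featured RDBMS; the point is that bad values never make it into the table in the first place.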

All of this discussion should be taken in the context of the business or scientific use case. Not all data needs to be perfect at every precise moment in time. However, there are proven technologies that enforce the integrity of data, and unless there is a good reason not to implement them, it is my belief that it is better to err on the side of integrity.

Let me provide some examples of what I am referring to regarding the maintenance of data integrity. Data in and of itself is simply data; values that stand alone have very little meaning without context. For instance, what does the data in the following table tell you?

[Sample table: unlabeled columns of names, addresses, identification numbers, and dates for several individuals and one organization, discussed below]

It would appear that the first seven fields refer to a person's first name, middle name, last name, street address, city, state, and zip code. The 8th field appears to contain the Social Security Number, while the 9th field appears to be some sort of code, with a hyphen separating two digits from three digits. The last three columns contain information that we assume relates to the individual. Because the last field is a date, and all of the values appear to be dates from some time back, we might assume it to be the date of birth.

A closer look at just those first seven fields reveals one first name of "Widgets R' Us." Apparently the table does not just refer to individuals, but to organizations as well. Also, you will notice that some data is missing, and, in what appears to be a column of state abbreviations, there is a code of BD while all the remaining values in that column are valid state abbreviations.

There are numerous problems with this data, and if there were thousands or millions of rows, cleaning it up would involve a considerable amount of work. Just to simplify, here is what each column, or attribute, is supposed to be:

  1. First Name
  2. Middle Name
  3. Last Name
  4. Street Address
  5. City
  6. State
  7. Zip Code
  8. Social Security Number
  9. Height in Inches and Weight
  10. Age
  11. Marital Status
  12. Date of Birth

There are too many problems to address all of them here, but for starters (a code sketch of a couple of the fixes follows this list):

  1. There were no variable (attribute) names for the values until I provided them above.
  2. It is uncertain what the organization is doing in this table. Organizations do not have an SSN, height/weight, marital status, or date of birth. An organization would have an incorporation date, an Employer Identification Number (EIN), and possibly a Doing Business As (DBA) name and gross revenues, and it might play the role of an employer. Clearly, the attributes, or variables, that apply to an individual do not apply to an organization. Organizations should be in their own table.
  3. While the state codes may contain just the one bad code (BD), it could be that this is a valid code in another country. If that is the case, then an alternate key requiring the state and country code combination to be unique should be enforced. If BD is just a typo for ND, then referential integrity against a lookup table would have prevented the value from being added. Alternatively, a constraint listing the 50+ valid values would work as well.
  4. A domain constraint on the Zip Code field would have prevented the six-digit zip code from being entered.
  5. The SSN and EIN values are mixed in the same field. There are cases where this can happen, but in this context it is impossible to know whether a given value is a typo or not.
  6. The ninth field is one of the worst violations. The first two digits are supposed to be the person's height in inches, and the last three digits are the person's weight. Two values are stored in one variable; using this data would require parsing the values apart, which is error-prone.
  7. The tenth column violates several rules as well. This field provides the individual's age. As of this writing, the date is March 8, 2015; in the table, the second individual has an age of 45, yet that person will not turn 45 until March 31, 2015. The Age field is therefore dependent not only on the date of birth but on the current date, which is not recorded anywhere. If today's date were 11-Jun-2015, then the last individual, in row 8, would actually be 58 years old, not 57.
  8. The last issue I would like to address is the marital status field. Enforcing integrity on this field is a very simple task, and there are numerous ways to do it, e.g., referential integrity against a lookup table or a check constraint.
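
As a minimal sketch of the after-the-fact repair work that items 6 and 7 above imply (Python with pandas, using hypothetical column names and made-up sample values), splitting the packed height/weight field and deriving age from the date of birth might look like this; enforcing integrity at the source would have made this work unnecessary:

```python
import pandas as pd

# Hypothetical column names and sample values, purely for illustration.
df = pd.DataFrame({
    "height_weight": ["70-150", "66-210"],   # two variables packed into one field
    "dob": ["1970-03-31", "1957-06-11"],
})

# Item 6: split the packed field into two separate variables.
df[["height_in", "weight_lb"]] = (
    df["height_weight"].str.split("-", expand=True).astype(int)
)

# Item 7: derive age from the date of birth as of a known date,
# rather than storing it as a stale value.
as_of = pd.Timestamp("2015-03-08")
dob = pd.to_datetime(df["dob"])
had_birthday = (dob.dt.month < as_of.month) | (
    (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
)
df["age"] = as_of.year - dob.dt.year - (~had_birthday).astype(int)

print(df[["height_in", "weight_lb", "age"]])
```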

That is a long way to go just to state that Data Integrity is, and will always be, a critical part of data management. No matter what technologies are in play, if the data is bad, then the information coming out cannot be trusted. Hopefully, we will start hearing the words "Data" and "Integrity" used together more often in Big Data discussions.

If you made it this far, you either agree with me and understand the point(s), or you are thinking this guy is stuck in the '90s. If you agree, good! If you are in the latter group, please wait for the next post, where I will discuss how some of the above concepts apply in the context of Big Data solutions.

 

 
