The subject of this blog post might seem rather rudimentary to those who fully understand the importance of properly managing data. For those readers, I hope you will still find the post worth reading, offer constructive feedback, and augment the discussion. The purpose of this post is to highlight the necessity of keeping data clean and orderly so that the results of any analysis are reliable and trustworthy: if data integrity is intact, the information derived from that data can be trusted, and trusted information is actionable.
The title of this post is meant to be a bit tongue-in-cheek. Over the years, the word "data" has seldom been used on its own to mean anything other than the standard definition:
However, the word data takes on different meanings when coupled with other words. In some cases, the coupling conjures up a fairly accurate meaning; in other cases, I believe it can be very confusing. For instance, Data Analysis makes sense: values are examined and analyzed to extract some sort of meaningful information. Data Modeling is a bit ambiguous if the concepts of Normal Forms, Relational Integrity, Dimension and Fact Tables, Stars, Snowflakes, and Cubes are not readily understood. Still others, like Big Data, add just as much confusion, if not more. What is Big Data? Is it a really big number, or a very voluminous data set, with very high velocity and a lot of variety? (I had to stick with the V's, no matter how contrived.)
Still, Big Data gives you the idea that you are dealing with a lot of data, and with a little reading an IT professional can surmise that Big Data poses great challenges to obtaining near real-time results with existing operating systems and analytical tools. Big Data might require a different way of looking at data altogether. In fact, it does. There is no question about it: Big Data that satisfies at least the three V's listed above requires a clustered operating system with distributed storage and distributed processing across commodity hardware (enter Hadoop). With a little imagination you can see how taming these large data sets can provide a wealth of meaningful information and be a truly disruptive technology.
That said, the quality of the results derived from Big Data, Data Analytics, and Data Mining technologies is only as good as the data itself. Properly gathering and managing the data in the first place will save money and time, reduce errors, and simplify the process of gleaning knowledge from a data set. Properly managing data involves a variety of approaches that maintain Data Integrity throughout the data's lifecycle.
Before I go further, I want to explain that this is not an attempt to revive the teachings and virtues of EF Codd, CJ Date, and WH Inmon (the list could be much longer). Each of them made significant contributions to the world of data management and engineering, and those contributions are still applicable today. To bolster that claim, I would refer you to more contemporary subject matter experts like Hadley Wickham and his paper on "Tidy Data," and to numerous other postings from professors, data scientists, and companies producing tools to work with messy data. Performing any proper data analysis requires a thorough understanding of the data being analyzed. Furthermore, many data sets are simply not suitable for analysis until extensive processing has occurred. Why not do what is necessary to ensure data integrity up front and avoid the hassle and cost of cleaning the data afterwards?
Data is everywhere today. Numerous online sites make data available, typically with a thoroughly defined Application Programming Interface (API) and a data dictionary that defines each data element in the set. Take a look at some of the following sites and you will find either an API for extracting data or a CSV file, along with definitions of the attributes. Some of the data is in better shape than others, but at least the basics are provided to get you started:
(For examples of how to access some of these data sets using R, see Hadley Wickham's GitHub repo.)
Once the data is acquired, the quality of that data must be evaluated:
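To make the evaluation step concrete, here is a minimal sketch of the kind of automated quality checks I have in mind, written in Python. The rows, column names, valid-value set, and SSN pattern are all hypothetical illustrations, not part of any real data set:

```python
import re

# Hypothetical acquired rows; the second one carries deliberate defects.
rows = [
    {"last_name": "Smith", "state": "VA", "ssn": "123-45-6789"},
    {"last_name": "",      "state": "BD", "ssn": "123-45-6789"},
]

VALID_STATES = {"VA", "MD", "DC"}  # illustrative subset of valid codes
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def evaluate(rows):
    """Return a list of (row_index, issue) findings for basic quality checks."""
    findings = []
    seen_ssns = set()
    for i, row in enumerate(rows):
        if not row["last_name"].strip():
            findings.append((i, "missing last_name"))       # completeness
        if row["state"] not in VALID_STATES:
            findings.append((i, "invalid state code"))      # domain validity
        if not SSN_PATTERN.match(row["ssn"]):
            findings.append((i, "malformed SSN"))           # format conformance
        if row["ssn"] in seen_ssns:
            findings.append((i, "duplicate SSN"))           # uniqueness
        seen_ssns.add(row["ssn"])
    return findings
```

Running `evaluate(rows)` flags the second row three times (missing name, invalid state, duplicate SSN) while the first row passes cleanly; real evaluations would add many more dimensions, but the shape is the same.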
Data Integrity should, in my opinion, be enforced at the source to the greatest extent possible, to avoid unnecessary work at the end. I hate to bring it up in this day of Hadoop, but Relational Database Management Systems (RDBMS) still have their place in the world of data management. As stated earlier, Big Data that satisfies at least the three V's listed above requires new approaches, e.g., Hadoop. However, if your data is not that big, consider more traditional means of managing it and still obtain cutting-edge advantages like predictive analysis. An ounce of analysis on good data is worth a ton of analysis on bad data. Even in a Big Data environment, there are principles that, when enforced, will ensure the quality of the data and enhance its value.
All of this discussion should be taken in the context of the business or scientific use cases. Not all data needs to be perfect at a precise moment in time. However, there are proven technologies that enforce the integrity of data, and unless there is a good reason not to implement them, I believe it is better to err on the side of integrity.
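As one sketch of what "enforcing integrity at the source" can look like, the snippet below uses SQLite (chosen purely for portability; the table and column names are my own invention) to show declarative constraints rejecting bad data before it ever lands:

```python
import sqlite3

# In-memory database; enable foreign key enforcement (off by default in SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# A reference table of valid state codes, constrained to two characters.
conn.execute("""
    CREATE TABLE states (
        code TEXT PRIMARY KEY CHECK (length(code) = 2)
    )
""")

# The main table: NOT NULL and a foreign key enforce integrity at insert time.
conn.execute("""
    CREATE TABLE parties (
        id INTEGER PRIMARY KEY,
        last_name TEXT NOT NULL,
        state_code TEXT REFERENCES states(code)
    )
""")

conn.execute("INSERT INTO states VALUES ('VA')")
conn.execute("INSERT INTO parties (last_name, state_code) VALUES ('Smith', 'VA')")

# A bogus code like 'BD' is rejected by the database, not cleaned up later.
try:
    conn.execute("INSERT INTO parties (last_name, state_code) VALUES ('Jones', 'BD')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The design point is that the constraint lives with the data: every application writing to the table gets the same guarantee, instead of each one re-implementing (or forgetting) the validation.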
Let me provide some examples of what I mean by maintaining data integrity. Data in and of itself is simply data: values that stand alone have very little meaning without a context. For instance, what does the data in the following table tell you?
It would appear that the first seven fields hold a person's first, middle, and last name, street address, city, state, and ZIP code. The 8th field appears to contain a Social Security Number, while the 9th appears to be some other sort of code; both are formatted as hyphen-separated groups of digits. The last three columns contain information that we assume relates to the individual. Because the last field is a date, and all of the values are dates from some time back, we might assume it to be the date of birth.
A closer look at just those first seven fields reveals a first name of "Widgets R' Us". Apparently the table refers not just to individuals, but to organizations as well. You will also notice that some data is missing, and that in what appears to be a column of state abbreviations there is a code "BD," while all the remaining values in that column are valid state abbreviations.
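Untangling the mixed person/organization records is exactly the kind of after-the-fact cleanup that enforcing integrity up front would have avoided. As a sketch of what that cleanup looks like, here is a crude, hypothetical heuristic (the marker words and function name are my own illustration, not a proven rule) for flagging "first names" that are really organization names:

```python
# Words that, in a multi-word "first name", suggest an organization record
# crammed into a person-name column. Purely illustrative, not exhaustive.
ORG_WORDS = {"inc", "inc.", "llc", "corp", "corp.", "co", "co.", "company", "us"}

def looks_like_organization(first_name: str) -> bool:
    """Crude heuristic: multi-word names containing a business marker word."""
    words = first_name.lower().replace("'", "").split()
    return len(words) > 1 and bool(set(words) & ORG_WORDS)
```

With this, "Widgets R' Us" is flagged while "Mary Ann" and "John" are not; real cleanup would need review of the flagged rows, since any heuristic like this will miss cases and produce false positives.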
There are numerous problems with this data, and if there were thousands or millions of rows like this, there would be a considerable amount of cleanup work involved. To simplify, here is what each column, or attribute, is supposed to be:
There are too many problems to address all of them here, but for starters:
That was a long way to go just to state that Data Integrity is still, and will always be, a critical part of data management. No matter what technologies are in play, if the data is bad, then the information coming out cannot be trusted. Hopefully, we will start hearing the words "Data" and "Integrity" used together in Big Data discussions.
If you made it this far, you either agree with me and understand the point(s), or you are thinking this guy is stuck in the '90s. If you agree, good! If you are in the latter group, please wait for the next post, where I will discuss how some of the above concepts apply in the context of Big Data solutions.