
When a data glitch turns great data into worthless gibberish

Data gets screwed up when imported or exported. Poor architecture makes it worse. How do you detect and fix it?

What I found with Access is its inability to properly import/export data types. Automated and even forced type casting, from a database table to text or CSV files (or the other way around), is poor and results in many errors. The environment is too rigid, not flexible enough to do smart type detection and casting. Ever had to deal with a date that was recorded as an integer, while the date in the next row was recorded as mmddyy? Or with a comma inside a data field value that screws up your CSV file?
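
Here is a minimal Python sketch of the kind of heuristic date detection such an import step could apply. The file name, column position, and the list of formats tried are all assumptions; the ambiguity between a 6-digit mmddyy string and a serial date number is precisely why fully automated casting goes wrong so often.

    import csv
    import re
    from datetime import datetime, timedelta

    def cast_date(raw):
        """Try several plausible encodings of a date found in a messy export.
        Returns a date, or None if nothing matches. The formats tried are assumptions."""
        raw = raw.strip().strip('"')
        for fmt in ("%m%d%y", "%m%d%Y", "%m/%d/%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(raw, fmt).date()
            except ValueError:
                pass
        # A bare integer may be an Access/Excel serial date (days since 1899-12-30)
        if re.fullmatch(r"\d{4,6}", raw):
            serial = int(raw)
            if 20000 < serial < 80000:      # sanity window, roughly 1954-2119
                return (datetime(1899, 12, 30) + timedelta(days=serial)).date()
        return None

    # csv.reader honors quoting, so a comma inside "Smith, John" stays in one field
    with open("export.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            print(cast_date(row[2]))        # assumes the date sits in the third column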

I've received many data files with a data corruption rate above 5%. Sometimes two different types of values (advertiser keyword vs. user query, or referral domain vs. ad network domain) are stored in the same field (solution: add an extra field that tells you the type of keyword or referrer). Sometimes the fields are not properly aligned. Sometimes it looks like the people who produced the data were not experts in regular expressions, or chose their separators very badly. Sometimes it's the software itself that screws things up: that happened to me many times with Access.
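
As a rough illustration of the "add a type field" fix, the sketch below tags each value of a mixed referrer column with a guessed type before any analysis runs. The column name, file names, and regular expression are hypothetical and would need tuning for a real feed.

    import csv
    import re

    # Crude heuristic: a value that looks like a hostname is an ad network or
    # referral domain; anything containing whitespace is treated as a user query.
    DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)

    def referrer_type(value):
        value = value.strip()
        if DOMAIN_RE.match(value):
            return "domain"
        if " " in value:
            return "user_query"
        return "keyword"                      # single token that is not a domain

    with open("raw.csv", newline="") as src, open("tagged.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["referrer_type"])
        writer.writeheader()
        for row in reader:
            row["referrer_type"] = referrer_type(row["referrer"])   # column name assumed
            writer.writerow(row)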

Sometimes these screw-ups have undesirable side effects, such as one user in your database becoming a kind of "garbage collector" that absorbs all the mismatched records, rendering data insights worthless. Sometimes it's not a big deal.

Case Studies

  • Wells Fargo: TeaLeaf user sessions were broken into multiple sessions because each server had its own web log and the blending of all these web logs was not done properly. This was mostly an architecture design issue. Fortunately, I found the problem, TeaLeaf fixed its bug, and once it was fixed we were able to derive great insights.
  • eBay: A data transfer removed special characters from French and German keywords, making reconciliation impossible. We had to build a temporary lookup table of foreign keywords, with correct and erroneous spellings, to fix the problem.
  • Click fraud detection: The keyword field sometimes represented a user query (like a Google query) and sometimes a prespecified keyword category, depending on the type of ad network affiliate. Affiliates where keyword = keyword category were erroneously and heavily penalized, because their keyword distribution was (by design, not because of fraud) very poor. Adding a new field that specifies the type of keyword fixed the problem (see the sketch after this list).
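
Here is a minimal sketch of how the extra keyword-type field removes the false positives: the keyword-distribution score (Shannon entropy in this toy version) is computed only within the segment where a concentrated distribution is actually suspicious. The row layout and labels are assumptions, not the production scoring logic.

    from collections import Counter, defaultdict
    from math import log2

    def keyword_entropy(keywords):
        """Shannon entropy of the keyword distribution; low entropy = very concentrated traffic."""
        counts = Counter(keywords)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def score_affiliates(rows):
        """rows: iterable of (affiliate_id, keyword, keyword_type) tuples."""
        by_affiliate = defaultdict(list)
        for affiliate, keyword, ktype in rows:
            if ktype == "user_query":   # skip prespecified categories: poor diversity is expected there
                by_affiliate[affiliate].append(keyword)
        return {a: keyword_entropy(kws) for a, kws in by_affiliate.items()}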

Questions

  • How do you address these issues?
  • How do you automatically detect these screw-ups? (This is a quality assurance issue; see the sketch after this list.)
  • Is it worse with big data? (Is detecting or fixing it more difficult?)
  • How much bad data can we tolerate? One percent (maybe less than that in problems such as fraud detection)?
  • What proportion of your time is wasted on fixing these issues?
  • How to design smart type casting?
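
On the automatic-detection question, one pragmatic answer is a profiling script that checks each column of an incoming feed against expected patterns and flags any column whose error rate exceeds what you are willing to tolerate. The rules, file name, and 1% threshold below are placeholders; this is a sketch of the idea, not a general-purpose tool.

    import csv
    import re
    from datetime import datetime

    def _is_date(v):
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                pass
        return False

    # Hypothetical per-column rules; in practice they come from a schema or from
    # profiling a known-good sample of the same feed.
    RULES = {
        "date":   _is_date,
        "clicks": str.isdigit,
        "domain": lambda v: re.fullmatch(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", v.lower()) is not None,
    }
    THRESHOLD = 0.01   # tolerate at most 1% bad values per column

    def audit(path):
        bad, total = {c: 0 for c in RULES}, 0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += 1
                for col, ok in RULES.items():
                    value = row.get(col) or ""          # missing or short rows count as bad
                    if not ok(value):
                        bad[col] += 1
        return {col: n / total for col, n in bad.items() if total and n / total > THRESHOLD}

    print(audit("daily_feed.csv"))   # columns and file name are hypothetical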


Replies to This Discussion

Criticizing Access for being too "rigid" when importing data seems odd to me.  Any time one wants to load data into a relational database it always involves three basic steps.  We've come to associate these steps with a certain class of tools created for that purpose:  Extraction, Transform, and Load.  Expecting to perform complex transformations using the facility of Access intended to merely load the data will always lead to disappointment.  On the other hand, spending the "big dollars" for an ETL tool, just to load data into a relatively inexpensive desktop database, would seem excessive to most of us.  Therefore, usually, when I'm preparing data to load into an Access database I first transform it using the "Practical Extraction and Reporting Language".  Yes, I use "PERL".  It is free, incredibly powerful, and once mastered falls into the category of a "quick and dirty" language.  By this last statement I mean that the development of a transform program of even moderate complexity can be done in minutes.  The "dirty" part comes from the unfortunate property of PERL that renders its source impenetrable to the author within a few hours of having written it.  But, that's seldom a problem, because you can always re-write it. (I jest.)

Or, if PERL isn't available for some obscure reason, and one wants to limit the processing entirely to Access, I first bring the data into a "raw" table where every field is simply text.  Then, I can bring the power of various SQL queries or Visual Basic for Applications (VBA) programs to bear on the raw table to transform and load the data into the final normalized relational tables against which the "reports" run.
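
For readers who would rather script this than stay inside Access, here is a small Python/sqlite3 stand-in for the staging workflow described above: everything is loaded as text into a raw table first, then only the rows that pass basic validation are cast into the typed table, and the rest are held back for manual review. Table names, columns, and the validation rules are illustrative only.

    import csv
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw_sales (sale_date TEXT, amount TEXT, region TEXT)")
    con.execute("CREATE TABLE sales (sale_date TEXT, amount REAL, region TEXT)")

    # Stage the file as raw text so that one malformed row cannot abort the whole import.
    with open("sales_export.csv", newline="") as f:       # file name and columns are hypothetical
        reader = csv.reader(f)
        next(reader, None)                                # skip the header row
        con.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", reader)

    # Promote only the rows that survive basic validation; keep the rest for inspection.
    con.execute("""INSERT INTO sales
                   SELECT sale_date, CAST(amount AS REAL), region
                   FROM raw_sales
                   WHERE sale_date LIKE '____-__-__' AND amount GLOB '[0-9]*'""")
    held_back = con.execute("""SELECT COUNT(*) FROM raw_sales
                               WHERE sale_date NOT LIKE '____-__-__'
                                  OR amount NOT GLOB '[0-9]*'""").fetchone()[0]
    print(f"{held_back} rows held back in the raw table for manual review")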

 
