Last weekend, I was waiting in New York’s Penn Station when the public address announcer gave the familiar “See Something, Say Something” message. It took a minute to sink in, but I had to laugh. Midtown Manhattan IS suspicious and unusual activity.

Speaking of outliers

In practice, data is dirty and big data is filthy. Analysts munge, wrangle and clean their sources, and a good analysis will acknowledge the observations it rejected. In August, the NY Times joined the recent crowd calling this "janitorial" work and claimed that data scientists spend "50 percent to 80 percent of their time mired in this more mundane labor". It is not glamorous, and it is getting more difficult. But it is necessary, even priceless.

Suppressing data can be argued well on several grounds:

  • Materiality - Observations can be dropped if their absence would be insignificant to aggregates and would not change the directional conclusion of the analysis.
  • Statistics - Formal methods can be applied for rejecting data. Look at Peirce's criterion, Grubbs' test, Chauvenet's criterion, Dixon's Q test or, frankly, propose a new one that sounds as serious.
  • Reasonableness - Some elements just don't make sense.  If one attribute is wrong, the observation may be considered suspicious and discarded.
  • Completeness - Most databases and statistical tools expect NAs, nulls or NaNs (not a number). Data can be optional, and processes can be incomplete. So, dropping empty data is tempting.
  • Error - The observation violates some stated business rule. Software captures data, and software can have bugs. So we expect some data to be defective and ignore it.
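
As a rough illustration of the list above, the sketch below applies three of those arguments (completeness, error and statistics) to a handful of made-up records and keeps every rejected observation alongside its reason, rather than silently dropping it. The record layout, the `amount` field, the non-negative business rule and the function names are all assumptions made for the example. The statistical step uses Chauvenet's criterion, which presumes roughly normal data.

```python
import math
import statistics


def chauvenet_flags(values):
    """Return True for each value that Chauvenet's criterion would reject."""
    n = len(values)
    if n < 3:
        return [False] * n  # too few points to judge anything
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return [False] * n
    flags = []
    for x in values:
        z = abs(x - mean) / stdev
        # Two-tailed probability of a deviation at least this large (normal model)
        p = math.erfc(z / math.sqrt(2))
        # Reject when fewer than half an observation this extreme is expected
        flags.append(n * p < 0.5)
    return flags


def screen(records, field="amount"):
    """Split records into (kept, rejected); rejected holds (record, reason) pairs."""
    rejected = []
    candidates = []
    for rec in records:
        value = rec.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            rejected.append((rec, "completeness"))
        elif value < 0:  # hypothetical business rule: amounts are never negative
            rejected.append((rec, "error"))
        else:
            candidates.append(rec)
    flags = chauvenet_flags([rec[field] for rec in candidates])
    kept = []
    for rec, flagged in zip(candidates, flags):
        if flagged:
            rejected.append((rec, "statistics"))
        else:
            kept.append(rec)
    return kept, rejected


# Made-up sample data for the illustration
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 11.0},
    {"id": 3, "amount": 9.5},
    {"id": 4, "amount": 10.5},
    {"id": 5, "amount": 10.2},
    {"id": 6, "amount": 9.8},
    {"id": 7, "amount": 10.1},
    {"id": 8, "amount": 55.0},
    {"id": 9, "amount": None},
    {"id": 10, "amount": -3.0},
]

kept, rejected = screen(records)
for rec, reason in rejected:
    print(rec["id"], reason)
```

With this sample, record 8 is flagged on statistical grounds, while records 9 and 10 fail the completeness and business-rule checks. The point of returning `rejected` instead of discarding it is that the reasons stay attached to the observations and can be reported or investigated later.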

Missed opportunities

All those dropped observations have value, though.

First, when we find a problem, we should tell someone. We don't have to, but we should. Like that "See Something, Say Something" announcement, communicating exceptions is an analyst's responsibility. Software gets fixed, other analysts save time, lessons get learned, and customers get a better experience.
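
One lightweight way to "say something" is to ship a short summary of what was rejected, and why, along with the analysis. The sketch below is only an illustration. The function name, the (id, reason) pairs and the sample values are hypothetical.

```python
from collections import Counter


def exception_report(total, rejected):
    """Build a plain-text summary of rejected observations, grouped by reason."""
    counts = Counter(reason for _, reason in rejected)
    lines = [f"{len(rejected)} of {total} observations rejected"]
    # Most frequent reasons first, so the biggest problem leads the message
    for reason, count in counts.most_common():
        ids = [str(rec_id) for rec_id, why in rejected if why == reason]
        lines.append(f"  {reason}: {count} (ids: {', '.join(ids)})")
    return "\n".join(lines)


# Made-up rejections: (observation id, reason it was dropped)
rejected = [(9, "completeness"), (10, "error"), (8, "statistics"), (12, "error")]
report = exception_report(12, rejected)
print(report)
```

A summary like this can go to whoever owns the source system, or into the appendix of the analysis, so the next analyst does not have to rediscover the same exceptions.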

Second, this data may deserve some digging. If there's a process, people will find a workaround. Machine-generated data shows that computers do the same thing with controls. Data exceptions have stories that lead to new business rules and pattern discoveries. As with data errors, we don't have to pursue these stories, but we should. Researching outliers has a poor "a priori" business case: you don't know what you'll find. Tracking the value of what you have already learned is almost as good, and it gives you an anecdotal business case.

The next time a package promises to automatically clean data, report that suspicious and unusual activity to anyone who will listen.

© 2017 Data Science Central