Last weekend I was waiting in New York’s Penn Station when the public-address system gave the familiar “See Something, Say Something” message. It took a minute to sink in, but I had to laugh. Midtown Manhattan IS suspicious and unusual activity.
Speaking of outliers…
In practice, data is dirty and big data is filthy. Analysts munge, wrangle and clean their sources, and a good analysis will account for the rejected observations. In August, the NY Times joined the recent chorus calling this "janitorial" work and claimed that data scientists spend "50 percent to 80 percent of their time mired in this more mundane labor". It is not glamorous, and it is getting more difficult. But it is necessary, even priceless.
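One lightweight habit that keeps this janitorial work honest is to partition records rather than silently drop them, so every reject carries a reason. A minimal sketch (the field names and validation rules here are hypothetical, not from any particular pipeline):

```python
def clean(records):
    """Split records into (clean, rejected); each reject keeps its reason."""
    clean_rows, rejected = [], []
    for row in records:
        if row.get("amount") is None:
            rejected.append((row, "missing amount"))
        elif row["amount"] < 0:
            rejected.append((row, "negative amount"))
        else:
            clean_rows.append(row)
    return clean_rows, rejected

raw = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -4.0},
]
good, bad = clean(raw)
print(len(good), len(bad))  # 1 2
```

The rejected pile is the point: it is what the analysis can later recognize, report, and dig into.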
There are good arguments for suppressing bad data. All those dropped observations have value, though.
First, when we find a problem, we should tell someone. We don't have to, but we should. Like that "See Something, Say Something" announcement, communicating exceptions is an analyst's responsibility. Software gets fixed, other analysts save time, lessons get learned, customers get a better experience.
Second, this data may deserve some digging. If there's a process, people will find a workaround; machine-generated data shows that computers do the same thing with controls. Data exceptions have stories that lead to new business rules and pattern discoveries. As with data errors, we don't have to pursue these stories, but we should. Researching outliers has a poor a priori business case: you don't know what you'll find. Tracking the value of what you have already learned is almost as good. That's an anecdotal business case.
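The pattern discovery can start very simply: tally the rejection reasons and see which exceptions recur. A sketch, assuming a hypothetical log of rejection reasons:

```python
from collections import Counter

# Hypothetical rejection log; in practice this would come from the
# cleaning step of a real pipeline.
rejections = [
    "missing amount",
    "negative amount",
    "missing amount",
    "future timestamp",
    "missing amount",
]

# Tallying reasons surfaces the exceptions worth digging into.
by_reason = Counter(rejections)
print(by_reason.most_common(1))  # [('missing amount', 3)]
```

A reason that dominates the tally is usually a workaround, a broken control, or a business rule waiting to be written down.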
The next time a package promises to automatically clean data, report that suspicious and unusual activity to anyone who will listen.