Full disclaimer: I'm not a data scientist, but I am working with a large data set that requires a lot of analytics, and I was hoping for your help with outlier detection.
Each month we receive about 12 million transactions that we have to process in about two weeks. Each one of those transactions has a number of fields: company, commodity code, country of departure, country of arrival, value, weight, mode of transportation, etc. Unfortunately, those fields can be filled in incorrectly, which can have a significant impact on our data quality. These problems can be particularly obvious when we look at disaggregated data, such as shipments of a commodity from a particular country in a given month. So, our main problem is one of outlier detection and correction. For analytical purposes, we may be able to retain multiple months, perhaps two to three years' worth, with which to build up estimators and such.
For example, one of the variables of particular interest is the physical weight of the shipment, for which we've developed a roundabout way of detecting outliers for correction. We divide the value of the transaction by the weight of the shipment, and if the resulting unit value (i.e., price) is "very different" from unit values for the same commodity shipped from the same country by the same company, we remove the outlier and replace it with a more typical value for that commodity/country/company combination. There are a few obvious issues: what does "very different" mean -- that is, what method should we use to detect the outliers? How do we build a large enough distribution against which to compare the unit value (typically we aggregate across countries and/or companies to get a large enough distribution)? And, for that matter, how do we choose a more "typical" replacement value?
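As a concrete sketch of the unit-value idea above, here is one common way to make "very different" precise: a modified z-score based on the median and MAD of log unit values within each commodity/country/company group, which is robust to the very outliers being hunted. The column names (`value`, `weight`), the cutoff `k=3.5`, and the minimum group size of 30 are all assumptions to be tuned, not recommendations; groups that are too small are skipped so the caller can retry with a coarser grouping (e.g., dropping company).

```python
import numpy as np
import pandas as pd

def flag_unit_value_outliers(df, group_cols, k=3.5, min_group=30):
    """Flag transactions whose log unit value is far from the group median.

    Uses the modified z-score 0.6745 * (x - median) / MAD, which is
    robust to the outliers we are trying to find. Groups smaller than
    `min_group` are left unflagged so a coarser grouping can be tried.
    """
    df = df.copy()
    df["log_uv"] = np.log(df["value"] / df["weight"])
    df["outlier"] = False
    df["replacement_uv"] = np.nan
    for _, grp in df.groupby(group_cols):
        vals = grp["log_uv"]
        if len(vals) < min_group:
            continue  # too few observations; retry with coarser grouping
        med = vals.median()
        mad = (vals - med).abs().median()
        if mad == 0:
            continue  # degenerate group (all identical unit values)
        mod_z = 0.6745 * (vals - med) / mad
        out_idx = mod_z[mod_z.abs() > k].index
        df.loc[out_idx, "outlier"] = True
        # one candidate for a "typical" value: the group's median unit value
        df.loc[out_idx, "replacement_uv"] = np.exp(med)
    return df
```

Working on log unit values matters here because prices tend to be right-skewed; on the raw scale, a symmetric cutoff would flag cheap shipments too eagerly and expensive ones too rarely.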
More generally, I am looking for ways to perform automated outlier detection on a number of variables in this large dataset, to be completed in a few days. Some of the variables are numerical, but some are text-based (e.g., a company name incorrectly entered). Of course, not all companies (there might be about 70,000 of them) will be active every month, and when they are, not necessarily for the same combination of commodity (about 11,000), geography (about 200), etc.
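For the text-based case (a misspelled company name), one simple approach is fuzzy matching against the canonical list of known companies. The sketch below uses only the standard library's `difflib`; for ~70,000 names a dedicated edit-distance library with blocking (e.g., grouping by name prefix) would scale better, but the logic is the same. The `cutoff` of 0.85 is an assumption to be tuned, and the example company names are invented.

```python
from difflib import get_close_matches

def suggest_company(name, known_companies, cutoff=0.85):
    """Suggest a canonical company name for a possibly misspelled entry.

    `known_companies` is assumed to hold normalized (uppercase) names.
    Returns the best close match above `cutoff`, or None if nothing
    is similar enough -- in which case the entry should go to an
    analyst rather than be auto-corrected.
    """
    name = name.strip().upper()
    if name in known_companies:
        return name  # exact match, nothing to correct
    matches = get_close_matches(name, known_companies, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning `None` rather than forcing the nearest match is deliberate: a low-similarity "correction" is exactly the kind of change an analyst should review first.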
Ideally, the analyst would be presented with a log of all outliers that were detected, their original values, and their corrected values. The analyst might then go in and reverse some of those corrections, based on further analysis. Ideally, some supervised learning would go on here, so that the application would learn from those corrections that were reversed, and no longer make those corrections, even if the outlier continued to be detected and reported in the log.
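The review loop described above (log every correction, let the analyst reverse some, stop re-applying reversed corrections while still reporting them) can be sketched very simply as a correction log plus a suppression list. This is only a minimal illustration: in production the tables would live in a database (e.g., Oracle), and a learned model could generalize from reversals rather than suppressing exact keys.

```python
class CorrectionLog:
    """Minimal sketch of the analyst review loop.

    Every proposed correction is logged with its key (here, any
    hashable tuple identifying the transaction and field). When an
    analyst reverses a correction, its key is suppressed: the outlier
    is still detected and logged, but no longer auto-corrected.
    """

    def __init__(self):
        self.entries = []        # audit log of all proposals
        self.suppressed = set()  # keys the analyst has reversed

    def propose(self, key, original, corrected):
        """Log a proposed correction; apply it unless suppressed."""
        applied = key not in self.suppressed
        self.entries.append({"key": key, "original": original,
                             "corrected": corrected, "applied": applied})
        return corrected if applied else original

    def reverse(self, key):
        """Analyst rejects this correction; never auto-apply it again."""
        self.suppressed.add(key)
```

Keeping detection and application separate like this is the key design point: the log stays complete for audit purposes even after the system "learns" not to touch certain records.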
We have applications built on top of Oracle but, going forward, are looking at other solutions as well, in terms of software but also possible methodologies. We do have programmers and statisticians on staff and would be looking to take control of the applications (i.e., nothing canned). Since we're a government agency, we have to be mindful of existing security and procurement policies. Incidentally, one of our tools for doing some of our current analysis is SAS. We know that SAS has some data analytics capabilities, but we still have some questions about using it in a production environment for large data sets.
Any and all help would be greatly appreciated!
Have you used Trifacta Wrangler? You might need the enterprise version of the tool, but it is pretty quick.
Thanks: I've started looking at some of the clips. I like some of the features for describing the data, including the data visualization. I've also come across another link that discusses anomaly detection, which I think is a lot of what I need to be doing (on top of supervised machine learning):
Anyway, the hunt continues!