Full disclaimer: I'm not a data scientist, but am working with a large data set requiring lots of analytics, and was hoping for your help with outlier detection.
Each month we receive about 12 million transactions that we have to process in about two weeks. Each one of those transactions has a number of fields: company, commodity code, country of departure, country of arrival, value, weight, mode of transportation, etc. Unfortunately, those fields can be filled in incorrectly, which can have a significant impact on our data quality. These problems can be particularly obvious when we look at disaggregated data, such as shipments of a commodity from a particular country in a given month. So, our main problem is one of outlier detection and correction. For analytical purposes, we may be able to retain multiple months, perhaps two to three years' worth, with which to build up estimators and such.
For example, one of the variables of particular interest is the physical weight of the shipment, for which we've developed a roundabout way of detecting outliers for correction. We divide the value of the transaction by the weight of the shipment, and if the resulting unit value (i.e., price) is "very different" from unit values for the same commodity shipped from the same country by the same company, we remove the outlier and replace it with a more typical value for that commodity/country/company combination. There are a few obvious issues: what counts as "very different" (i.e., what method should we use for detecting outliers)? How do we build up a large enough distribution against which to compare the unit value (typically we aggregate across countries and/or companies to get a large enough sample)? And, for that matter, how do we choose a more "typical" replacement value?
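To make the question concrete, here is a minimal sketch of the kind of rule I have in mind, using a modified z-score based on the median and median absolute deviation (MAD). The 3.5 threshold and function names are just illustrative assumptions, not anything we have in production:

```python
from statistics import median

def mad_outliers(unit_values, threshold=3.5):
    """Flag unit values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    robust to the very outliers we are trying to detect.
    """
    med = median(unit_values)
    mad = median(abs(v - med) for v in unit_values)
    if mad == 0:
        # All values essentially identical; nothing to flag.
        return [False] * len(unit_values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in unit_values]

def typical_value(unit_values):
    """A robust 'typical' replacement candidate: the median unit value."""
    return median(unit_values)
```

So, for a commodity/country/company group with unit values like [10, 11, 9, 10, 12, 10, 500], only the 500 would be flagged, and the replacement candidate would be the median, 10. Whether this is the right definition of "very different" is exactly what I'm unsure about.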
More generally, I am looking for ways to perform automated outlier detection for a number of variables in this large dataset, within a processing window of a few days. Some of the variables are numerical, but some are text-based (e.g., a company name entered incorrectly). Of course, not all companies (there might be about 70,000 of them) will be active every month, and those that are will not necessarily appear with the same combination of commodity (about 11,000) / geography (about 200) / etc.
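For the text-based fields, one simple approach would be fuzzy matching against the list of known company names. A rough sketch (the 0.85 similarity cutoff is an arbitrary assumption, and real company-name matching would likely need normalization for case, punctuation, and legal suffixes):

```python
from difflib import get_close_matches

def suggest_company(name, known_companies, cutoff=0.85):
    """Suggest a likely intended company for a possibly mistyped name.

    Returns the closest known name above the similarity cutoff,
    or None if nothing is close enough (the name may be genuinely new).
    """
    matches = get_close_matches(name, known_companies, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With 70,000 companies, a pairwise scan like this per record may be too slow, so some blocking or indexing scheme would presumably be needed; but it illustrates the kind of correction I mean.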
Ideally, the analyst would be presented with a log of all outliers that were detected, their original values, and their corrected values. The analyst might then go in and reverse some of those corrections, based on further analysis. Ideally, some supervised learning would go on here, so that the application would learn from those corrections that were reversed, and no longer make those corrections, even if the outlier continued to be detected and reported in the log.
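To illustrate the feedback loop I have in mind: corrections the analyst reverses would go into a suppression set, and future runs would still log those outliers but skip the auto-correction. A minimal sketch (the record/field keying scheme and field names here are hypothetical):

```python
def apply_corrections(records, detect, correct, suppressed):
    """Run outlier detection, skipping corrections the analyst reversed.

    `suppressed` is a set of (record id, field) keys whose corrections
    were previously reversed; those outliers are still logged for the
    analyst, just no longer auto-corrected.
    """
    log = []
    for rec in records:
        if detect(rec):
            key = (rec["id"], "weight")  # hypothetical keying scheme
            if key in suppressed:
                log.append({"record": rec["id"], "action": "flagged only"})
            else:
                old = rec["weight"]
                rec["weight"] = correct(rec)
                log.append({"record": rec["id"], "action": "corrected",
                            "old": old, "new": rec["weight"]})
    return log
```

A fuller version would learn which *kinds* of corrections get reversed rather than suppressing only exact repeats, which is where the supervised learning would come in.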
We have applications built on top of Oracle but, going forward, are looking at other solutions as well, in terms of software but also possible methodologies. We do have programmers and statisticians on staff and would be looking to take control of the applications (i.e., nothing canned). Since we're a government agency, we have to be mindful of existing security and procurement policies. Incidentally, one of our tools for doing some of our current analysis is SAS. We know that SAS has some data analytics capabilities, but we still have questions about using it in a production environment for large data sets.
Any and all help would be greatly appreciated!
Have you used Trifacta Wrangler? You might need the enterprise version of the tool, but it is pretty quick.
Thanks: I've started looking at some of the clips. I like some of the features for describing the data, including data visualization. I've also come upon another link that talks about anomaly detection, which I think is a lot of what I need to be doing (on top of supervised machine learning):
Anyway, the hunt continues!
There is a packaged solution called Anodot that may fit your needs as described. We at DSC did a webinar with them a few months back and you can find that under the webinar tab at the top of the page.
Let me know if you need contact information.
Here is a list of vendors:
My disclaimer: I do work for one of the companies on this list.
Have you considered Data Robot's interface for developing your AI? https://www.datarobot.com/
I just saw their representative present at a meeting, and the interface makes it so you can put your data in, specify variables, click "submit", and then it runs a bunch of different models on it. I gave a better description on my blog post about the meeting: https://dethwench.com/review-of-snowflakes-data-for-breakfast-boston/
I heard a VP from Data Robot speak on a podcast (sponsored by Data Science Central - see my blog post about it: https://dethwench.com/cautionary-tale-data-science-ai-healthcare/), and then I saw this presenter, and I realized that the company is really hands-on in helping you figure out the parameters to put in (and how to transform them), and also in model selection. Therefore, as long as you get the concepts of data science, you don't need to be a bona fide data scientist (or know how to program) to use it.
If your goal is to identify outliers in big data, Data Robot could help you set up the rules for this. It rides on Snowflake, so your big data will fit. If you are unfamiliar with Snowflake, read my blog post above about the presenter; it was at a Snowflake "Data for Breakfast" meeting where they were introducing Snowflake as a big data cloud solution. Unlike Data Robot, with Snowflake you have to be a SQL programmer.
Good luck with your project!