
If you wanted to estimate the value of every home in the US, you would benefit from having big data: the square footage, age, acreage, value of neighboring houses based on recent sales, and so on, for each house.

But what about fraud detection, where you are looking for macro-patterns (fraud cases, that is, large buckets of fraudulent transactions) rather than micro-patterns (whether a single transaction, isolated from its context, is fraudulent or not)?

Interestingly, more data is sometimes worse than less data. An example where "more data" failed is the spiralup Botnet (http://www.datashaping.com/ppc7.shtml), where fraudulent activity was detected on a very small data set using very few metrics. Google, despite the gigantic data set it collects on trillions of clicks, failed to detect this massive fraud. Sometimes big data is very sparse and filled with noise, which can make the real signal harder to detect than in a carefully selected sample.
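To make the macro- vs. micro-pattern distinction concrete, here is a minimal Python sketch (not from the botnet case itself; column names like affiliate_id and ip, and the 3x-median cutoff, are illustrative assumptions). It flags fraud at the bucket level by aggregating a small click sample per affiliate, which a per-click score would miss:

# Minimal sketch: macro-pattern (bucket-level) fraud screening on a small click sample.
# Column names (affiliate_id, ip) and the 3x-median threshold are illustrative assumptions.
import pandas as pd

def flag_suspicious_buckets(clicks: pd.DataFrame, min_clicks: int = 50) -> pd.DataFrame:
    """Group clicks into buckets (one per affiliate) and flag buckets whose traffic
    is concentrated on very few IPs -- a pattern invisible when each click is
    scored in isolation."""
    buckets = clicks.groupby("affiliate_id").agg(
        n_clicks=("ip", "size"),
        n_unique_ips=("ip", "nunique"),
    )
    buckets = buckets[buckets["n_clicks"] >= min_clicks]
    buckets["clicks_per_ip"] = buckets["n_clicks"] / buckets["n_unique_ips"]
    # Flag buckets far above the typical concentration (assumed 3x-median cutoff).
    cutoff = 3 * buckets["clicks_per_ip"].median()
    buckets["suspicious"] = buckets["clicks_per_ip"] > cutoff
    return buckets.sort_values("clicks_per_ip", ascending=False)

if __name__ == "__main__":
    import random
    random.seed(0)
    # Tiny synthetic sample: normal traffic spread over many IPs...
    rows = [{"affiliate_id": f"aff_{i % 10}",
             "ip": f"10.0.{random.randint(0, 250)}.{random.randint(0, 250)}"}
            for i in range(900)]
    # ...plus one concentrated bucket: 100 clicks from only 3 IPs.
    rows += [{"affiliate_id": "aff_bot", "ip": f"10.9.9.{i % 3}"} for i in range(100)]
    print(flag_suspicious_buckets(pd.DataFrame(rows)))

The point is that a few hundred well-chosen rows, aggregated the right way, can expose a pattern that per-transaction scoring over billions of clicks never surfaces.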

However, I think it is very hard to find statisticians who can do good sampling on huge data sets: they exist, but they are rare and expensive. It is also difficult to find business analysts and data architects who can design a data architecture using the minimum number of fields and the minimum level of granularity relevant to the business, while still meeting current and future business needs. This is why data is growing faster than it should. What do you think?
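As an aside, here is a minimal sketch of the kind of field reduction and sampling mentioned above, assuming a pandas DataFrame with illustrative column names (region, amount, label); the per-stratum sample size is an arbitrary assumption:

# Minimal sketch: keep only the fields the analysis needs, then sample per stratum
# so rare strata are not drowned out by a uniform sample of the whole table.
import pandas as pd

KEEP_COLUMNS = ["region", "amount", "label"]   # assumed minimal field set

def stratified_sample(df: pd.DataFrame, stratum: str = "region",
                      per_stratum: int = 1_000, seed: int = 42) -> pd.DataFrame:
    """Draw up to `per_stratum` rows from each stratum of the slimmed-down table."""
    slim = df[KEEP_COLUMNS]                    # drop fields with no business use
    return (slim.groupby(stratum, group_keys=False)
                .apply(lambda g: g.sample(n=min(per_stratum, len(g)),
                                          random_state=seed)))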

Related article: How do you quantify data as large, big, or huge?


Replies to This Discussion

Funny coincidence. I wrote this last night in relation to another article:

"All the quoted applications already exist. We notice moreover a convergence with the political speech about fraud. Do methodologies have to evolve ? One can indeed treat bigger volumes of data compared with samples, but doesn't it also require more qualification of the data, however automatizable, of which added value it is necessary to measure? It is true that an example of early detection of a weak signal by a Social network or Google is sometimes mentioned. Wasn'it before the mini-crash of August, 2011?... "

Rare events like small-scale fraud are usually hard to detect. Getting a good sample of a rare event means that you have to hypothesize the event and then pull a relevant sample to confirm your hypothesis. Defenses for the obvious types of fraud have already been built into the system, so thinking outside the box (i.e., thinking like a scammer) is often necessary to discover the less obvious ones.

In general, small-scale data discovery is a controlled experiment, or at least a retrospective study with fairly reliable data. This is often better than working at large scale, especially with fraud, where the scammer may purposely attempt to disguise the fraud as something legitimate.
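A minimal sketch of the "hypothesize, then pull a relevant sample" step described above; the scenario (chargebacks clustering just under an assumed $500 review threshold) and the column names (amount, is_chargeback) are purely hypothetical:

# Minimal sketch: test a specific fraud hypothesis on a targeted slice of the data.
# Hypothesis (assumed): fraudulent transactions cluster just under a review threshold.
import pandas as pd

def test_threshold_hypothesis(txns: pd.DataFrame,
                              threshold: float = 500.0,
                              band: float = 25.0) -> dict:
    """Compare the chargeback rate of transactions sitting just under the
    review threshold against the rate for all other transactions."""
    near = txns[(txns["amount"] >= threshold - band) & (txns["amount"] < threshold)]
    rest = txns[(txns["amount"] < threshold - band) | (txns["amount"] >= threshold)]
    return {
        "near_threshold_rate": near["is_chargeback"].mean(),
        "baseline_rate": rest["is_chargeback"].mean(),
        "n_near": len(near),
    }

If the near-threshold rate is clearly higher than the baseline, the targeted sample has confirmed the hypothesized pattern without scoring every transaction in isolation.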
