As the world becomes more tech-savvy, advances in information technology, especially in the healthcare industry, have opened up new areas in data mining and machine learning. Within data mining, one technique that has gained a lot of popularity, as well as skepticism, among auditors and fraud detectives is Benford’s Law, or “the Law of the First Digit.”
In the past, some researchers in Canada used the Benford’s Law distribution to detect anomalies in the claim amount data of a healthcare organization. In this article we will go through the mechanics of this technique and look at its practical use on a sample of claim amount data.
What is Benford’s Law?
Benford’s Law is an observation about the frequency distribution of leading digits. According to the law, the digit 1 appears as the leading digit with a probability of about 30%, far more often than the 11.1% (1/9) expected if all nine digits were equally likely. The effect was first noticed in tables of logarithms, whose first pages were more worn than the last ones (Newcomb 1881). The law has proven useful in real-life situations, especially where finances are involved.
The “First Digit Law” also gained a lot of popularity and attention when it was used in the television crime drama Numb3rs (the Season 2 episode “The Running Man”).
This law is often used as an indicator of potentially fraudulent data and can assist in auditing financial data. Benford’s distribution is non-uniform: numbers starting with 1 are more likely than numbers starting with larger digits such as 9.
Benford's Distribution Table
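The expected percentages follow from the standard first-digit formula P(d) = log10(1 + 1/d) for d = 1 to 9. The snippet below is a minimal sketch that reproduces them; the function name benford_expected is chosen here for illustration and does not come from the original article.

```python
import math


def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


# Print the expected share of each leading digit as a percentage.
for digit, probability in benford_expected().items():
    print(f"{digit}: {probability:.1%}")
```

The nine probabilities sum to 1, with the digit 1 at roughly 30.1% and the digit 9 at roughly 4.6%.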
Best Data Types for Benford’s Law
Testing the usability of this law in the real life scenario of Medicare claim submission amounts
Because no precise measure of health care fraud exists, perpetrators can cost health care programs billions of dollars while putting recipients’ health and well-being at risk.
We will look at the applicability of this technique to one of the definitions of Medicare abuse: charging excessively for services and supplies. In this article we will look at a sample of 1033 submitted Medicare claim amounts to understand whether they look spurious, that is, whether or not they follow the Benford’s Law curve.
Understanding the Distribution
The graphical summary above shows that the data meets the characteristics required for Benford’s Law.
Having established that this data is suitable for the Benford’s Law technique, let’s see how the technique can uncover some interesting patterns within this claims submission data set. The goal of applying Benford’s Law is to understand how “natural” these claim submissions are.
The process:
Sample the data – As the expression goes, “the more the merrier”: the more observations, the better. For illustration purposes, however, I am using 1033 claim submissions out of ~100K.
Parse the leading digit – As discussed above, Benford’s Law focuses on the leading digits in sets of naturally occurring numbers. The size of the claim amount, whether it is $100, $200 or $300, is unimportant; only its first digit matters. In Excel, the LEFT function extracts the leading digit of each dollar amount.
Create the frequency distribution – The next step is to create the frequency distribution of the leading digits parsed from the sample data. This can be done with either the COUNTIF function or a pivot table in MS Excel.
Compute the distribution – Per Benford’s Law, ~30.1% of leading digits should be 1, and 9 should be the least frequent at ~4.6%. With this standard in mind, compute the actual distribution of the leading digits, compare it with the Benford’s Law distribution, and identify any potential outliers. Refer to the image below to see what the end result looks like.
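For readers who prefer a script to a spreadsheet, the steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, amounts are treated as dollar values with two decimals, and the sample list is made up rather than taken from the claims data used in this article.

```python
import math
from collections import Counter


def leading_digit(amount):
    """Return the first non-zero digit of an amount, or None if it has none."""
    # Assumption: amounts are dollars and cents, so two decimals are enough.
    for ch in f"{abs(amount):.2f}":
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_distribution(amounts):
    """Return {digit: observed share} for leading digits 1-9."""
    digits = [d for d in map(leading_digit, amounts) if d is not None]
    if not digits:
        raise ValueError("no amounts with a non-zero leading digit")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def compare_to_benford(amounts):
    """Return rows of (digit, observed share, Benford share, difference)."""
    observed = first_digit_distribution(amounts)
    rows = []
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        rows.append((d, observed[d], expected, observed[d] - expected))
    return rows


# Made-up amounts for illustration only; not the article's claims data.
sample = [125.00, 1890.50, 230.10, 17.25, 310.00,
          96.40, 1045.00, 2.99, 150.75, 480.00]

for digit, obs, exp, diff in compare_to_benford(sample):
    print(f"{digit}: observed {obs:.1%}, Benford {exp:.1%}, diff {diff:+.1%}")
```

Digits with a large positive or negative difference are the candidates for a closer look. With only ten values the differences mean little; a real check needs a much larger sample.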
The graphs above indicate an unusual number of claim submissions with leading digits 1, 2 and 3. This points to potential manipulation, error or even fraud. Auditors can go further and apply a test such as the chi-square test, a “goodness of fit” statistic that measures how well the observed distribution agrees with the theoretical one. A high p-value such as 90% indicates a good fit, whereas a small one such as 3% indicates a poor fit.
Most business data, such as sales counts, costs, accounts receivable, payments, and even buyers’ street addresses, can be considered naturally occurring numbers. By comparing the first-digit frequency distribution of such data with Benford's probability curve, auditors can readily spot possible data flaws or fraudulent transactions. Hence, when used appropriately, Benford's Law can be a valuable, low-cost tool for flagging spurious transactions for further analysis.
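A chi-square goodness-of-fit check against Benford’s distribution can be sketched as follows. This is a minimal standard-library illustration, not the procedure used for the figures in this article: the function name and the example counts are hypothetical, and instead of a p-value the statistic is compared with 15.507, the 5% critical value for 8 degrees of freedom (nine digits minus one).

```python
import math

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_DF8_ALPHA_05 = 15.507


def benford_chi_square(counts):
    """Chi-square statistic for observed leading-digit counts vs. Benford's Law.

    counts maps each digit 1-9 to how often it was observed as leading digit.
    """
    n = sum(counts.get(d, 0) for d in range(1, 10))
    if n == 0:
        raise ValueError("counts must contain at least one observation")
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count under Benford
        statistic += (counts.get(d, 0) - expected) ** 2 / expected
    return statistic


# Hypothetical counts for illustration: every digit observed equally often.
uniform_counts = {d: 100 for d in range(1, 10)}
stat = benford_chi_square(uniform_counts)
print(f"chi-square = {stat:.1f}")
print("poor fit at the 5% level" if stat > CHI2_CRITICAL_DF8_ALPHA_05
      else "no significant deviation at the 5% level")
```

A statistic above the critical value corresponds to a p-value below 5%. If SciPy is available, scipy.stats.chisquare returns both the statistic and the p-value directly.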
Comment
Dear Raul,
Thanks for your comment. I totally agree that "Any serious work should cite those references". I hope this will help. I would also suggest reading about Mark Nigrini, who is best known for his work on Benford's Law and its usage.
Mark Nigrini: https://www.nigrini.com/
http://www.ottawacitizen.com/news/Thinking+about+tricking+Beware+lo...
Dear Sunil
You wrote "In the past some researchers in Canada used the Benford’s Law distribution to detect anomalies within the claims amount data for one of the healthcare organization."
Please, who are those people?
Any serious work should cite those references.
The chi-square test has excessive power when used with big data sets. Even for a sample of 1k elements, it may lead to false positives, that is, flagging as suspicious empirical distributions that could just as well be a draw from an actual Benford's distribution.
There are better techniques for such tests, based on bootstrap procedures, which can be found in the following two articles:
SUH, I.; HEADRICK, T.C. A comparative analysis of the bootstrap ver...
SUH, I., HEADRICK, T.C.; MINABURO, S. An Effective and Efficient An...
These approaches lead to much more accurate results, but the downside is the expensive computation they require.
With that in mind, I have developed a package for Apache Spark which implements both methods, allowing for assessing the goodness of fit of large datasets. I also keep it running on a Spark local cluster at AWS EC2 for demonstration purposes: Benford Analysis for Spark.
Thanks, Ted Dunning, for your comment. Please keep in mind, however, that for this article I am looking at a sample of 1033 claims out of ~100k claim transactions, which can produce results one would not want to see when applying the Benford's Law technique (as mentioned under the Best Data Types section of this article). Also, Benford’s Law is a test of whether the observed frequency distribution follows the “natural” flow of the Benford’s Law distribution, and even the slightest deviation will prompt auditors to dig deeper when the stakes are high. It is helpful to know that Frank Benford never limited his study to the leading digits of naturally occurring numbers. He also developed frequency distributions for subsequent digits to explore second- and third-digit anomalies in such numbers, and I believe the classic study you referred to in your comment used the second-digit distribution, where the digits 7, 8, 9 etc. emerged. This article aims to introduce one way to identify fraud; other statistical tests should be used in conjunction with it to reach a solid decision. Once again, thanks a lot for sharing your views and reading this article.
This is unlikely to be enough deviation to signal blatant fraud.
In the classic demonstration of fraud and structuring, the leading digits were 7, 8 or 9 about 90% of the time. This is wildly different from the Benford distribution. Your results are only slightly different, and there could easily be process structure that explains the issue, such as a required round-number amount for certain procedures.
Most unsupervised fraud detection schemes find structure even more easily than fraud. Co-pay limits, required reimbursements, and such are examples of such structure.
© 2019 Data Science Central ®