Guest blog post by Vijay Rajan.
One of the best books I have started reading recently is "Numbers Rule Your World" by Kaiser Fung. The book discusses how statistical methods are used in everyday decisions that affect lives and businesses, and it works through numerous case studies in which statistics is applied.
One such case study covers the E. coli outbreak in the US in 2006. Epidemiologists declare an epidemic when the number of cases of a particular disease reported across hospitals in a region exceeds a predefined threshold for that time of year. This is the detection phase. The harder phase is finding the root cause of the outbreak, which proceeds in several steps.
Five patients in a hospital who were diagnosed with E. coli were interviewed about everything they had consumed in the previous week and where they had consumed it. The answers ranged from bottled water to beef, other meat, vegetables, fruit and so on. This exercise gives the analysts in the epidemiology department a large amount of data. The next step is to take an "intersection" of the items that at least 3 of the infected patients consumed; requiring 3 rather than all 5 allows for omissions in what patients recall. Many of the patients recalled consuming bottled water, spinach, chicken and beef, among many other things, so this minimum-3 intersection was very large. Deeper analysis ruled out bottled water because, had it been the cause, the epidemic would have been far worse.
Then came the turn of spinach. A known prior was that at most 5% of people in that state consume spinach in a given week at the time of year when the epidemic broke out. In the interviews, 4 of the 5 patients recalled eating spinach. This is where the formula for Bernoulli trials comes in.
Assume spinach was not the culprit. What is the likelihood of finding that 4 people out of 5 ate spinach? It is given by the formula for the number of successes in repeated Bernoulli trials (the binomial probability), taken from Wikipedia: https://en.wikipedia.org/wiki/Bernoulli_trial
which is [5 choose 4] * (likelihood that a random person ate spinach)^4 * (likelihood that a random person did not eat spinach)^(5-4)
With a spinach-eating rate of 5%, this works out to 5 * 0.05^4 * 0.95, roughly 0.00003, which is a very small number. If the assumption that spinach was not the culprit makes it this unlikely to find 4 out of 5 patients who ate spinach, then the assumption is hard to sustain and spinach becomes a strong suspect. The epidemiologists got it right: the investigation, which ended with lab tests, traced the root cause to spinach from a field in California.
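For readers who want to check the arithmetic, here is a minimal Python sketch of the spinach calculation. The function name binomial_pmf and the variable names are my own, chosen for illustration. The 5% rate is treated as exact, although the post gives it as an upper bound, so the result is itself an upper bound on the probability. The sketch also prints the probability of "at least 4 of 5", which is the more conventional quantity to report for this kind of check.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumption: the "at most 5%" prior from the post is used as exactly 5%.
p_spinach = 0.05

exactly_4 = binomial_pmf(4, 5, p_spinach)
at_least_4 = exactly_4 + binomial_pmf(5, 5, p_spinach)

print(f"P(exactly 4 of 5 ate spinach)  = {exactly_4:.2e}")
print(f"P(at least 4 of 5 ate spinach) = {at_least_4:.2e}")
```

Both values come out at roughly 3 in 100,000, which is why the "spinach is innocent" assumption looks so implausible.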
The example above is simple, and it should be a wake-up call for data analysts who investigate fraud. A close friend and ex-colleague of mine, a very able and astute senior data analyst, used Bernoulli trials very successfully to catch fraud at a cab company. As I write this post, it occurs to me that I am playing Dr. Watson to his Sherlock Holmes in this find.
This colleague was looking at the app version numbers used for cab bookings. Using a SQL "group by", he identified drivers who were getting an unusually large number of bookings from older (suspect) versions of the app. It must be noted that older versions of the app may take different paths through the server code to make bookings, and that this code, being old and largely unread and untouched, could contain ticking fraud bombs for cab drivers to explore and exploit. He found a set of drivers who had made many trips booked through older versions of the app. This was suspicious, but not conclusive evidence. He had prior information that no more than 3% of users were on older versions of the app, and he applied Bernoulli trials to find that the likelihood of some of these drivers getting so many bookings from older apps by chance was much smaller than 0.00008%. The suspicion turned into a strong case, and the drivers were questioned and found guilty.
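The same check can be sketched for the cab bookings. This is not my colleague's actual code or data. The driver names and booking counts below are made up, and using 0.00008% as a flagging threshold is my own assumption for illustration (the post only says the observed likelihood was much smaller than that). For each driver, the sketch computes the chance of seeing at least that many old-version bookings if every booking independently had a 3% chance of coming from an old version.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical data: driver -> (total bookings, bookings from old app versions).
bookings = {
    "driver_a": (40, 1),
    "driver_b": (40, 12),
    "driver_c": (25, 0),
}

OLD_VERSION_RATE = 0.03  # prior from the post: at most 3% of users on old versions
THRESHOLD = 8e-7         # 0.00008% as a probability; the cutoff itself is an assumption


def flag_suspects(bookings, p, threshold):
    """Return drivers whose old-version booking count is too unlikely under p."""
    suspects = {}
    for driver, (total, old) in bookings.items():
        tail = binomial_tail(old, total, p)
        if tail < threshold:
            suspects[driver] = tail
    return suspects


suspects = flag_suspects(bookings, OLD_VERSION_RATE, THRESHOLD)
for driver, tail in suspects.items():
    print(f"{driver}: P(this many or more old-version bookings) = {tail:.2e}")
```

With these made-up numbers only driver_b is flagged. One caution: when many drivers are tested at once, a few will look unusual by chance alone, so a tiny probability is a reason to investigate, not proof of guilt.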
Knowing distributions, that is, the means and standard deviations of a given metric across various dimensions, is really important for business data analysts. However, without due process and handy tools like Bernoulli trials, analysis will NEVER reach the end zone. Many other problems and fraudsters can be caught if Bernoulli trials are applied to more than one dimension and metric, in various businesses and even within the cab business. Everything you ever learn in probability and statistics needs to be studied with 30 case studies and word problems, and discussed in forums and classrooms, so that applying gems like Bernoulli trials becomes common knowledge and is used everywhere.