Contributed by Stephen Penrice, a graduate of the NYC Data Science Academy's 12-week full-time Data Science Bootcamp (September 23 to December 18, 2015). This post is based on his fourth class project, due in the eighth week of the program.
Here’s some quick background for readers who are not familiar with lotteries. In the games I studied, the lottery draws 5 or 6 distinct numbers from a set of about 40 integers, and the order in which the numbers are drawn has no effect on prize amounts. For example, New Jersey Cash 5 draws 5 numbers from 1 to 43. The set from which the numbers are selected is called a “matrix” (not to be confused with the mathematical object with the same name). The Cash 5 games have several hundred thousand possible outcomes, and the Oregon game has about 12 million outcomes. The odds of winning the prizes I discuss range from about 1 in 100 to 1 in 1,000. The target quantity for each model is the prize amount that the lottery will pay to each winner given a set of drawn numbers.
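For concreteness, the outcome counts and odds quoted above can be checked directly. The author's analysis was done in R; this is a quick Python sketch using New Jersey Cash 5 (5 numbers from 1 to 43):

```python
from math import comb

# New Jersey Cash 5: 5 distinct numbers drawn from a 43-number matrix.
outcomes = comb(43, 5)
print(outcomes)  # 962598 possible outcomes

# Odds of matching exactly 3 of the 5 drawn numbers:
# choose 3 of the 5 winners and 2 of the 38 losers.
p_match3 = comb(5, 3) * comb(38, 2) / outcomes
print(round(1 / p_match3))  # roughly 1 in 137
```

This lands inside the "1 in 100 to 1 in 1,000" range quoted for the prizes studied.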
The “Data” box in the upper portion of the diagram represents the tables holding the data I had scraped from the various lottery websites, with a table for each game. The “Games” box is a table that holds the key information for each game: how many numbers are selected, the size of the matrix, the earliest drawing date that should be included in the analyses, and the name of the table that holds the data for that game. The “Analyses” box represents a table that contains the necessary information about each analysis: the id of the game in the previous table, the prize being analyzed, and any filters that need to be applied when querying the data tables. This structure lets R retrieve the data it needs for a given analysis using just the id from the analysis table; after pulling the data, it is ready to calculate features for each draw.
I kept a uniform feature structure across all games and analyses. In order to discuss these features generally, let’s say we’re drawing \(k\) distinct numbers from the set \(\{1, ..., n\}\). The most basic features are the numbers selected, \(n_1\), \(n_2\), …, \(n_k\), where \(n_1 \lt n_2 \lt ... \lt n_k\). I also derived various features from these numbers. To summarize the magnitudes of the numbers drawn, I calculated the sum \(n_1 + n_2 + ... + n_k\). I also wanted to model the possibility that players choose numbers from a small range, so I included \(n_k - n_1\), the difference between the largest and smallest numbers drawn. To test the effect of evenly spaced numbers, I used the standard deviation of the gaps between consecutive numbers, i.e. the standard deviation of \(\{n_2-n_1, n_3-n_2, ..., n_k-n_{k-1}\}\). Finally, to capture aspects of the numbers related to players’ preferences, superstitions, etc., I included flags \(F_1, F_2, ..., F_n\) where \(F_i = 1\) if \(i\) was drawn and 0 otherwise.
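The feature set described above is straightforward to compute for a single draw. Here is a Python sketch (the author worked in R; the function name and dictionary layout are illustrative only):

```python
import statistics

def draw_features(draw, n):
    """Features for one draw: sorted numbers, their sum, the range
    (largest minus smallest), the standard deviation of the gaps
    between consecutive numbers, and one 0/1 flag per number."""
    nums = sorted(draw)
    drawn = set(nums)
    gaps = [b - a for a, b in zip(nums, nums[1:])]
    return {
        "numbers": nums,
        "sum": sum(nums),
        "range": nums[-1] - nums[0],
        "gap_sd": statistics.stdev(gaps),
        "flags": [1 if i in drawn else 0 for i in range(1, n + 1)],
    }

feats = draw_features([7, 11, 23, 31, 40], n=43)
print(feats["sum"], feats["range"], round(feats["gap_sd"], 3))  # 112 33 3.304
```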
The other potential source of complication in this project was the variety of machine learning models I wanted to apply to all of my analyses:
- regression
- elastic net
- k nearest neighbors
- random forests
- boosting applied to random forests with trees of depth up to 3
Fortunately, the R package “caret” (short for “classification and regression training”) provides standardized functions that make it easy to tune and train a variety of models.
Once I had everything standardized, training the models was straightforward. I cut off the data at July 31, 2015 so that I would have a set of recent data that had been untouched by any training, validation, or model selection processes. I split the training/test sets in 75/25 proportions and used root mean squared error on the test set as the criterion for final model selection. I used 5-fold cross-validation to tune the models, and I generally used caret’s default grids for the possible tuning parameters.
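The evaluation setup (75/25 split, test-set RMSE as the selection criterion) can be sketched in miniature. The author's pipeline used caret in R; the data and model below are hypothetical stand-ins that only illustrate the split-and-score step:

```python
import math
import random

random.seed(42)
# Hypothetical (feature, prize) pairs standing in for one game's draws.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(100)]

# 75/25 train/test split, as in the model-selection setup described above.
random.shuffle(data)
cut = int(0.75 * len(data))
train, test = data[:cut], data[cut:]

# Score a (stand-in) fitted model by root mean squared error on the test set.
predict = lambda x: 2.0 * x
rmse = math.sqrt(sum((predict(x) - y) ** 2 for x, y in test) / len(test))
print(round(rmse, 2))
```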
Now let’s look at the results of the best models to emerge from this process.
## fl_fantasy_5 prize3
## MAPE: 0.0361
## actual
## predict 7 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5
## 7 1 0 0 0 0 0 0 0 0 0 0
## 8 0 1 1 0 0 0 0 0 0 0 0
## 8.5 0 2 5 1 0 0 0 0 0 0 0
## 9 0 0 1 3 4 0 0 0 0 0 0
## 9.5 0 0 0 3 2 1 0 0 0 0 0
## 10 0 0 1 1 3 5 3 0 0 0 0
## 10.5 0 0 0 0 0 5 4 0 0 0 0
## 11 0 0 0 0 0 2 4 2 4 0 0
## 11.5 0 0 0 0 0 0 0 1 0 5 0
## 12 0 0 0 0 0 0 0 0 0 1 1
## nj_cash_5 prize3
## MAPE: 0.0526
## actual
## predict 9 10 11 12 13 14 15 16 17 18 19 20
## 10 2 0 1 0 0 0 0 0 0 0 0 0
## 11 0 2 3 0 0 0 0 0 0 0 0 0
## 12 0 1 1 3 4 1 0 0 0 0 0 0
## 13 0 0 0 0 3 1 1 0 0 0 0 0
## 14 0 0 0 0 2 5 2 1 0 0 0 0
## 15 0 0 0 0 0 7 6 4 0 0 0 0
## 16 0 0 0 0 0 0 4 5 2 0 0 0
## 17 0 0 0 0 0 0 0 2 5 3 1 0
## 18 0 0 0 0 0 0 0 1 6 1 1 1
## 19 0 0 0 0 0 0 0 0 1 2 2 1
## 20 0 0 0 0 0 0 0 1 0 2 1 0
## pa_cash_5 prize3
## MAPE: 0.0419
## actual
## predict 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5 13 13.5 14 14.5 15
## 8 1 1 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 8.5 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## 9.5 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 1 5 6 0 0 0 0 0 0 0 0 0 0
## 10.5 0 0 0 0 0 0 1 2 8 3 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 11.5 0 0 0 0 0 0 0 0 1 1 6 3 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0 0 2 4 2 0 0 0 0 0
## 12.5 0 0 0 0 0 0 0 0 0 0 1 1 1 3 0 0 0 0
## 13 0 0 0 0 0 0 0 0 0 0 0 2 1 1 1 1 0 0
## 13.5 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 3 0
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0
## 14.5 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 1
## actual
## predict 15.5 16 16.5 17
## 8 0 0 0 0
## 8.5 0 0 0 0
## 9 0 0 0 0
## 9.5 0 0 0 0
## 10 0 0 0 0
## 10.5 0 0 0 0
## 11 0 0 0 0
## 11.5 0 0 0 0
## 12 0 0 0 0
## 12.5 0 0 0 0
## 13 0 0 0 0
## 13.5 0 0 0 0
## 14 0 0 0 0
## 14.5 1 1 0 0
## 15 0 1 1 1
## nc_cash_5 prize3
## MAPE: 0.0235
## actual
## predict 3 4 5 6 7
## 3 2 1 0 0 0
## 4 1 24 2 0 0
## 5 0 3 49 1 0
## 6 0 0 0 3 1
## tx_cash_5 prize3
## MAPE: 0.0515
## actual
## predict 7 8 9 10 11 12 13
## 7 2 0 0 0 0 0 0
## 8 0 1 1 0 0 0 0
## 9 0 4 8 3 0 0 0
## 10 0 0 4 9 3 0 0
## 11 0 0 1 9 17 9 2
## 12 0 0 0 0 2 3 1
## fl_fantasy_5 prize4
## MAPE: 0.0865
## nj_cash_5 prize4
## MAPE: 0.1264
## pa_cash_5 prize4
## MAPE: 0.1237
## nc_cash_5 prize4
## MAPE: 0.1487
## tx_cash_5 prize4
## MAPE: 0.1393
## tx_cash_5 prize4
## MAPE: 0.1596
## or_megabucks prize4
## MAPE: 0.0605
More formally, given a model \(P\) for estimating the expected prize for \(m\) matches, the following expression gives the expected prize amount for a given selection \(S\):
\[f_P(S) = \frac{1}{|N(S)|}\sum_{T \in N(S)} P(T)\]
where
\[ N(S) = \{T: |S\cap T| = m\} \]
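On a toy game, \(f_P(S)\) can be evaluated directly from this definition. In the Python sketch below the prize model \(P\) is an arbitrary stand-in (the sum of the drawn numbers), not one of the fitted models:

```python
from itertools import combinations
from math import comb

# Toy game: draw k = 3 numbers from {1..10}; count m = 2 matches.
n, k, m = 10, 3, 2
P = lambda T: sum(T)  # stand-in prize model

def f_P(S):
    # Average P over all draws T that share exactly m numbers with S.
    neighbors = [T for T in combinations(range(1, n + 1), k)
                 if len(set(S) & set(T)) == m]
    assert len(neighbors) == comb(k, m) * comb(n - k, k - m)  # 3 * 7 = 21
    return sum(P(T) for T in neighbors) / len(neighbors)

print(f_P((1, 2, 3)))  # 11.0
```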
In general, \(N(S)\) is large: \(\binom{k}{m}\binom{n-k}{k-m}\) when the game selects \(k\) numbers from \(\{1,2,...,n\}\). For example, in the 3-match analysis of New Jersey Cash 5, \(|N(S)| = 7030\). Since there are \(\binom{n}{k}\) selections to evaluate (962,598 in the New Jersey example), we need to make the model calculations as efficient as possible. One tactic is to precompute the model \(P\) on all \(\binom{n}{k}\) combinations and simply look up these values when evaluating \(f_P(S)\). The list of precomputed values is most efficient when kept in lexicographic order, because there is then a fast algorithm for finding the position of a given value \(P(T)\) on the list using just the elements of \(T\).
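One standard way to implement that position lookup is the combinatorial number system, which counts how many \(k\)-subsets precede \(T\) in lexicographic order. A Python sketch (0-based index; the function name is illustrative):

```python
from math import comb

def lex_index(T, n):
    """0-based position of the sorted k-subset T of {1..n}
    in the lexicographic ordering of all k-subsets."""
    k = len(T)
    idx, prev = 0, 0
    for pos, t in enumerate(sorted(T)):
        # Count subsets with a smaller element at this position
        # and the same prefix before it.
        for v in range(prev + 1, t):
            idx += comb(n - v, k - pos - 1)
        prev = t
    return idx

print(lex_index((1, 2, 3, 4, 5), 43))       # 0 (first 5-subset)
print(lex_index((39, 40, 41, 42, 43), 43))  # 962597 (last of the 962,598)
```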
Unfortunately, I found that even with these efficiencies, it takes about 0.8 seconds to calculate \(f_P(S)\) for a single selection \(S\). At this rate it would take 8 to 9 days to evaluate all the expected 3-match prizes for New Jersey Cash 5, and that’s just one of my twelve analyses! So I needed to find a faster implementation.
I was willing to sacrifice some accuracy in order to speed up the calculations of \(f_P(S)\), and it occurred to me that using a linear function might be helpful. If \(P\) has the form \[P(T) = \beta_0 + \sum_{i = 1}^{p} \beta_i{X_i}(T)\] then \[f_P(S) = \frac{1}{|N(S)|}\sum_{T \in N(S)} \left(\beta_0 + \sum_{i = 1}^{p} \beta_i{X_i}(T)\right) = \beta_0 + \sum_{i = 1}^{p} \beta_i\bar{X_i}\] where \(\bar{X_i}\) is the average of \({X_i}(T)\) over all \(T\) in \(N(S)\). This will not necessarily speed up the calculations, because we still need to average \(X_i\) over all of \(N(S)\). But it does help when we do a regression on the flags \(F_i\). If
\[P(T) = \beta_0 + \sum_{i = 1}^{n-1} \beta_i{F_i}(T)\]
then
\[ f_P(S) = \beta_0 + \sum_{i=1}^{n-1}w_i \beta_i \]
where \[
\begin{equation}
w_i =
\begin{cases}
\frac{m}{k}& \text{if $i \in S$},\\
\frac{k-m}{n-k}& \text{if $i \not\in S$}.
\end{cases}
\end{equation}
\]
which can be evaluated very quickly: all 12 of my analyses ran in about one hour. (See the Appendix for a proof that \(\bar{F_i} = w_i\).)
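The closed form can be checked against the direct average on a small instance. In the Python sketch below, the \(\beta\) values are random placeholders rather than fitted coefficients; the point is only that the weighted sum reproduces the brute-force average exactly:

```python
from itertools import combinations
import random

# Toy game: k = 3 from {1..10}, m = 2 matches; flags F_1..F_{n-1}.
n, k, m = 10, 3, 2
random.seed(0)
beta0 = 1.0
beta = {i: random.random() for i in range(1, n)}  # placeholder coefficients

def P(T):
    return beta0 + sum(beta[i] for i in T if i in beta)

def f_brute(S):
    # Direct average of P over the neighborhood N(S).
    N = [T for T in combinations(range(1, n + 1), k)
         if len(set(S) & set(T)) == m]
    return sum(P(T) for T in N) / len(N)

def f_fast(S):
    # Closed form: weight m/k for i in S, (k-m)/(n-k) otherwise.
    w_in, w_out = m / k, (k - m) / (n - k)
    return beta0 + sum((w_in if i in S else w_out) * b for i, b in beta.items())

S = (2, 5, 9)
assert abs(f_brute(S) - f_fast(S)) < 1e-9  # the two agree exactly
```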
So we are finally in a position to find the selections for each game that have the 10 lowest expected prize amounts. Here are the results for each 3-match analysis.
FL Fantasy 5:
## n1 n2 n3 n4 n5 avgprize
## [1,] 3 5 7 9 11 7.943926
## [2,] 5 7 9 10 11 7.955978
## [3,] 3 7 9 10 11 7.962647
## [4,] 5 7 8 9 11 7.963678
## [5,] 3 7 8 9 11 7.970347
## [6,] 5 7 9 11 12 7.973561
## [7,] 3 7 9 11 12 7.980230
## [8,] 7 8 9 10 11 7.982399
## [9,] 7 9 10 11 12 7.992283
## [10,] 7 8 9 11 12 7.999983
New Jersey Cash 5:
## n1 n2 n3 n4 n5 avgprize
## [1,] 3 5 7 8 12 10.96077
## [2,] 3 5 7 9 12 10.98105
## [3,] 3 5 7 8 9 10.99249
## [4,] 5 7 8 9 12 10.99942
## [5,] 3 7 8 9 12 11.02162
## [6,] 3 5 7 11 12 11.02523
## [7,] 3 5 7 8 11 11.03667
## [8,] 5 7 8 11 12 11.04359
## [9,] 3 5 7 9 11 11.05695
## [10,] 5 7 9 11 12 11.06387
Pennsylvania Cash 5:
## n1 n2 n3 n4 n5 avgprize
## [1,] 5 7 9 11 12 7.803290
## [2,] 3 5 7 11 12 7.818146
## [3,] 5 7 8 11 12 7.830267
## [4,] 5 7 10 11 12 7.862501
## [5,] 3 5 7 9 11 7.866244
## [6,] 5 7 8 9 11 7.878364
## [7,] 3 5 7 8 11 7.893220
## [8,] 5 7 9 10 11 7.910598
## [9,] 3 5 7 10 11 7.925455
## [10,] 3 5 7 9 12 7.937129
North Carolina Cash 5:
## n1 n2 n3 n4 n5 avgprize
## [1,] 5 7 8 9 11 3.576551
## [2,] 3 5 7 9 11 3.589897
## [3,] 3 7 8 9 11 3.600528
## [4,] 3 5 7 8 11 3.611937
## [5,] 5 7 9 11 12 3.612730
## [6,] 7 8 9 11 12 3.623360
## [7,] 5 7 8 11 12 3.634770
## [8,] 3 7 9 11 12 3.636706
## [9,] 3 5 8 9 11 3.638946
## [10,] 5 7 9 10 11 3.641547
Texas Cash 5:
## n1 n2 n3 n4 n5 avgprize
## [1,] 3 5 7 9 11 7.859660
## [2,] 5 7 8 9 11 7.879023
## [3,] 5 7 9 10 11 7.889834
## [4,] 3 7 8 9 11 7.904373
## [5,] 3 7 9 10 11 7.915184
## [6,] 3 5 7 8 9 7.920539
## [7,] 3 5 7 8 11 7.926993
## [8,] 5 7 9 11 12 7.929643
## [9,] 3 5 7 9 10 7.931350
## [10,] 7 8 9 10 11 7.934547
The level of agreement across the different data sets is remarkable. The combinations are made up of low numbers (none exceeds 12), yet the even numbers 2, 4, and 6 appear on none of the lists, while 7 and 11 appear in almost every combination.
There is still the question of how much players are disadvantaged when they choose these popular combinations. To quantify this, we can look at the smallest expected prizes (already shown above), the average expected prize, and the largest expected prize. Here are the results for the 3-match analyses.
## Game Minimum Average Maximum
## 1 FL Fantasy 5 7.94 9.97 12.27
## 2 NJ Cash 5 10.96 15.08 21.55
## 3 PA Cash 5 7.80 11.55 16.55
## 4 NC Cash 5 3.58 4.66 5.96
## 5 TX Cash 5 7.86 10.14 12.67
We should also scale these numbers to the probability of winning a 3-match prize. This also allows for an apples-to-apples comparison across games and puts the differences on the same scale as the expected prize payout, typically about $0.50.
## Game Minimum Average Maximum
## 1 FL Fantasy 5 0.0979 0.1230 0.1513
## 2 NJ Cash 5 0.0800 0.1101 0.1574
## 3 PA Cash 5 0.0570 0.0844 0.1209
## 4 NC Cash 5 0.0349 0.0454 0.0581
## 5 TX Cash 5 0.0894 0.1154 0.1442
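As a sanity check, the NJ Cash 5 row of the scaled table can be reproduced by multiplying the unscaled minimum, average, and maximum by the probability of a 3-match win (a Python sketch):

```python
from math import comb

# NJ Cash 5: probability of matching exactly 3 of 5 from a 43-number matrix.
p3 = comb(5, 3) * comb(38, 2) / comb(43, 5)  # = 7030 / 962598

# Scale the unscaled NJ expected prizes (min/avg/max from the earlier table).
for prize in (10.96, 15.08, 21.55):
    print(f"{p3 * prize:.4f}")  # 0.0800, 0.1101, 0.1574
```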
So we can see that the difference in expected payouts between the most and least popular selections is often around 10% of the total expected prize payout. And this does not even account for the 4-match prizes, or for the fact that players who hit the jackpot with a popular combination are likely to have to share that prize. Whether or not it was a conscious design choice, it would seem that parimutuel lotteries give greater reinforcement to their casual players, i.e. the ones who don’t select their own numbers.
Appendix: proof that \(\bar{F_i} = w_i\). Any set in \(N(S)\) consists of an \(m\)-element subset of \(S\) and a \((k-m)\)-element subset of \(S' = \{1,2,...,n\} - S\). In the case where \(i \in S\) we only need to find the fraction of \(m\)-element subsets of \(S\) that contain \(i\). There are \(\binom{k-1}{m-1}\) such sets because that is the number of ways we can choose the elements other than \(i\). So the fraction that contain \(i\) is
\[\frac{\binom{k-1}{m-1}}{\binom{k}{m}} = \frac{m}{k} \]
Similarly, in the case where \(i \not\in S\) we only need to find the fraction of \((k-m)\)-element subsets of \(S'\) that contain \(i\). There are \(\binom{n-k-1}{k-m-1}\) such sets because that is the number of ways we can choose the elements other than \(i\) from the remaining \(n-k-1\) elements of \(S'\). So the fraction that contain \(i\) is
\[\frac{\binom{n-k-1}{k-m-1}}{\binom{n-k}{k-m}} = \frac{k-m}{n-k} \]
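Both fractions can be verified by exhaustive enumeration on a small instance (the values of \(n\), \(k\), and \(m\) below are arbitrary):

```python
from itertools import combinations
from fractions import Fraction

n, k, m = 10, 3, 2
S = {1, 2, 3}
Sp = set(range(1, n + 1)) - S  # the complement S'

# Fraction of m-element subsets of S containing a fixed i in S.
i = 1
subs = list(combinations(S, m))
frac_in = Fraction(sum(i in T for T in subs), len(subs))
assert frac_in == Fraction(m, k)  # m/k = 2/3

# Fraction of (k-m)-element subsets of S' containing a fixed j not in S.
j = 7
subs = list(combinations(Sp, k - m))
frac_out = Fraction(sum(j in T for T in subs), len(subs))
assert frac_out == Fraction(k - m, n - k)  # (k-m)/(n-k) = 1/7
```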
Posted 1 March 2021
© 2021 TechTarget, Inc.