
Contributed by Stephen Penrice. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from September 23 to December 18, 2015. This post is based on his third class project (due in the sixth week of the program).


In a lottery game, the numbers that the lottery selects are random, but the numbers that players choose to play are not. To the best of my knowledge, data on player selections are not publicly available. However, lotteries do publish data on the numbers they draw and the amounts of the prizes they award. In games where prizes are parimutuel, that is, when a certain percentage of sales is divided equally among the winners, one can infer the popularity of the numbers drawn from the prize amounts: popular numbers result in smaller prizes because there are more winners splitting the prize money. The primary component of this project is scraping a variety of lottery websites, using several different techniques, in order to gather data for an analysis that relates prize amounts to the numbers drawn. Ultimately, I would like to build machine learning models that predict prize amounts as a function of the numbers drawn. However, here I simply present some visualizations and perform some hypothesis tests to investigate whether there is a relationship between prize amounts and the sum of the numbers drawn.

Scraping Strategies


In this project a single observation is a lottery drawing, with the data comprising a date, the numbers drawn by the lottery, the number of winners at each prize level, and the prize amount at each level. In order to get all of these data components, one has to visit a separate page for each drawing. Beautiful Soup can easily scrape each of these pages, so the primary challenge was visiting each page within a site in an automated fashion.

Since I was accessing several different websites, I had to employ several different strategies. In increasing order of complexity, they were: encoding dates into URLs, using Selenium to click a link, and using Selenium to fill in a form.
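The sections below walk through the first and third strategies with full code. For the middle strategy, clicking through results with Selenium generally looks something like the short sketch below; the URL and link text in it are hypothetical placeholders, not code taken from any of the lotteries scraped here.

# Generic sketch of the "click a link" strategy; the URL and the link text
# below are hypothetical placeholders, not from an actual lottery site.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

driver = webdriver.Firefox()
driver.get('http://example.com/lottery/past-results')   # hypothetical results page

for _ in range(10):                                      # walk back through ten drawings
    soup = BeautifulSoup(driver.page_source)
    # ... parse the current results page with Beautiful Soup here ...
    driver.find_element_by_link_text('Previous Draw').click()   # hypothetical link text
    sleep(5)                                             # crude wait for the next page to load

driver.quit()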

Encoding Dates into a URL


Florida’s Fantasy 5 game is a good example of a website well suited to this strategy. A typical results page looks like this.

While it is possible to access individual pages using menus, visiting one of these pages reveals that the URLs have a particular format that encodes the game name and the date of the drawing. For example,

http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=dat...


is the URL for the page that displays the data for the Fantasy 5 drawing that occurred on October 13, 2015, the key portion of the address being the string

10%2F13%2F2015
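Here %2F is just the URL-encoded forward slash, so the string decodes to 10/13/2015. As a quick illustration (not part of the scraper itself), the same encoding can be produced with the standard library:

# Illustration only: '%2F' is the percent-encoded '/', so 10/13/2015
# URL-encodes to 10%2F13%2F2015.
from urllib import quote   # Python 2; in Python 3 this is urllib.parse.quote

print quote('10/13/2015', safe='')   # prints 10%2F13%2F2015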


The following code uses the datetime library to create a date object that it uses to iterate through a specified range of dates, creating a URL string for each one that can be used to access a page which is then processed using Beautiful Soup.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    # Build the URL-encoded mm%2Fdd%2Fyyyy piece plus the submit parameter.
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fl5 = open('fl_fant_5.csv','w')
fl5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn='
start_date = date(2007,1,1)
end_date = date(2015,10,26)
current = start_date
while current <= end_date:
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    # The drawn numbers live in a single div; split out the five numbers.
    numbers = bsPage.find_all("div",class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]',temp)
    draws = draws[1:6]
    # Winner counts and prize amounts sit in table columns 2 and 3.
    winners = bsPage.find_all("td",class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fl5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print current.strftime('%Y-%m-%d')
    current = current + timedelta(1)

fl5.close()
print 'done'


The code for Florida’s Lucky Money game is very similar. The only meaningful difference is that Lucky Money draws happen on Tuesdays and Fridays only, so the code checks the day of the week before building the URL in order to avoid getting an error caused by trying to access a non-existent page.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    # Build the URL-encoded mm%2Fdd%2Fyyyy piece plus the submit parameter.
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fllm = open('fl_lucky_money.csv','w')
fllm.write(','.join(['drawdate','n1','n2','n3','n4','luckyball','win41','win40','win31','win30','win21','win11','win20','prize41','prize40','prize31','prize30','prize21','prize11','prize20'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=LUCKYMONEY&singleDateIn='
start_date = date(2014,7,4)
end_date = date(2015,10,24)
current = start_date
while current <= end_date:
    # Lucky Money is drawn on Tuesdays (2) and Fridays (5) only;
    # skip ahead to the next drawing day before building the URL.
    while current.strftime('%w') not in ['2','5']:
        current = current + timedelta(1)
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    numbers = bsPage.find_all("div",class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]',temp)
    draws = draws[1:6]
    winners = bsPage.find_all("td",class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fllm.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print current.strftime('%Y-%m-%d')
    current = current + timedelta(1)

fllm.close()


North Carolina’s Cash 5 game requires the same strategy. The structure of the code is the same as the Fantasy 5 code, with the differences arising from the page structure and tags. A sample data page can be found here.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

ncc5 = open('nc_cash_5.csv','w')
ncc5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.nc-educationlottery.org/cash5_payout.aspx?drawDate='
start_date = date(2006,10,27)
end_date = date(2015,10,27)
current = start_date
p = re.compile('[,$]')   # strips commas and dollar signs from scraped values
while current <= end_date:
    print current.strftime('%Y-%m-%d')
    url = url_stem + current.strftime('%m/%d/%Y')
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)

    # Each drawn number sits in its own span, identified by id.
    draws = []
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num1")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num2")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num3")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num4")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num5")[0].get_text()))

    winners = []
    winners.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match5")[0].get_text()))
    winners.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match4")[0].get_text()))
    winners.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match3")[0].get_text()))

    prizes = []
    prizes.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match5Prize")[0].get_text()))
    prizes.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match4Prize")[0].get_text()))
    prizes.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match3Prize")[0].get_text()))
    # A 'Rollover' top prize means no jackpot winners; record it as 0.
    if prizes[0] == 'Rollover':
        prizes[0] = '0'
    ncc5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    current = current + timedelta(1)

ncc5.close()
print 'finished'

Using Selenium to Fill in a Form


Past results from the Oregon Lottery website can be accessed only by using a form on the results page. Once again, Selenium is up to the challenge. As in the Florida and North Carolina cases, the code iterates through a date object and checks for a valid day of the week (Monday, Wednesday, or Saturday). However, here Selenium enters the date into the form in two places, “Start Date” and “End Date.” (Using the same date in both parts of the form simplifies both the iteration and the Beautiful Soup processing.) Then Selenium clicks the submit button.

While testing this I noticed that the code sometimes repeats results from a previous selection, most likely because the new page fails to load fast enough. The code deals with this issue in two ways. First, the sleep function from the time module pauses the code for 30 seconds, greatly reducing the likelihood of the problem occurring. As an extra safety measure, the code also checks that the date on the page matches the one entered into the form before writing the results to a file. If the dates don’t match, the desired date, i.e. the one Selenium entered on the form, is written to an error log.

from selenium import webdriver
from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re
from time import sleep

ormb = open('or_megabucks.csv','a')
ormb_err = open('or_megabucks_errors.csv','w')
ormb.write(','.join(['drawdate','n1','n2','n3','n4','n5','n6','winners6','winners5','winners4','prize6','prize5','prize4'])+'\n')
start_date = date(2012,10,30)
end_date = date(2015,10,29)
current = start_date

driver = webdriver.Firefox()
driver.get('http://www.oregonlottery.org/games/draw-games/megabucks/past-results')

while current <= end_date:
    # Megabucks is drawn on Monday (1), Wednesday (3), and Saturday (6) only.
    while current.strftime('%w') not in ['1','3','6']:
        current = current + timedelta(1)
    # Enter the same date as both the start and the end of the search range.
    driver.find_element_by_id("FromDate").clear()
    driver.find_element_by_id("ToDate").clear()
    driver.find_element_by_id("FromDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element_by_id("ToDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element_by_css_selector(".viewResultsButton").click()
    sleep(30)   # give the results page time to load
    soup = BeautifulSoup(driver.page_source)
    test1 = soup.find_all("td")
    numbers = [test1[i].get_text() for i in range(2,8)]
    test2 = soup.find_all("strong")
    winners = [test2[1].get_text().replace(',','')]
    prizes = [test2[0].get_text().replace('$','').replace(',','')]
    for i in range(0,2):
        winners.append(test2[4*i+3].get_text().replace(',',''))
        prizes.append(test2[4*i+2].get_text().replace('$','').replace(',',''))
    # Verify that the page shows the date we asked for before writing the row.
    testdate = test1[0].get_text().split('/')
    testdate = date(int(testdate[2]),int(testdate[0]),int(testdate[1]))
    if current.strftime('%Y-%m-%d') == testdate.strftime('%Y-%m-%d'):
        ormb.write(','.join([testdate.strftime('%Y-%m-%d')] + numbers + winners + prizes)+'\n')
    else:
        ormb_err.write(current.strftime('%Y-%m-%d') + '\n')

    current = current + timedelta(1)

ormb.close()
ormb_err.close()
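The fixed 30-second sleep works but is slow over a three-year date range. As a possible alternative (not used in the code above), Selenium's explicit waits can poll until the page actually shows the requested date. The sketch below assumes, as the parsing code above does, that the first table cell on the results page holds the draw date:

# Sketch of an explicit wait as an alternative to sleep(30); not from the
# original code. Assumes the first <td> on the page contains the draw date.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def wait_for_date(driver, date_string, timeout=30):
    # Block until the first table cell contains the date just submitted.
    WebDriverWait(driver, timeout).until(
        EC.text_to_be_present_in_element((By.CSS_SELECTOR, "td"), date_string)
    )

# e.g. replace sleep(30) in the loop above with:
# wait_for_date(driver, current.strftime('%m/%d/%Y'))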

Visualizations


Any number of visualizations of the scraped data are possible, but here let’s focus on a type of plot that not only suggests an association between the numbers drawn and the prize amounts but also motivates a statistical test to be performed later.

The summary statistic that we will use is simply the sum of the numbers drawn. The plots will show histograms of this sum for two sets of drawings: those where the prize amounts were less than the 25th percentile for all draws (labelled “Small Prizes”) and those where the prize amounts were greater than the 75th percentile for all draws (labelled “Large Prizes”).
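A minimal sketch of how such a comparison can be produced from one of the scraped files is shown below. It uses the nc_cash_5.csv file written above, assumes the match-3 prize (prize3) as the prize level being compared, and is an illustration rather than the exact code behind the figures that follow:

# Sketch of the small-prize vs. large-prize comparison for NC Cash 5;
# illustrative only, using prize3 as the prize level (an assumption).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('nc_cash_5.csv')
df['numsum'] = df[['n1', 'n2', 'n3', 'n4', 'n5']].sum(axis=1)

# Split the draws on the quartiles of the chosen prize amount.
q25, q75 = df['prize3'].quantile([0.25, 0.75])
small = df.loc[df['prize3'] < q25, 'numsum']   # popular numbers -> smaller prizes
large = df.loc[df['prize3'] > q75, 'numsum']

plt.hist(small, bins=30, alpha=0.5, label='Small Prizes')
plt.hist(large, bins=30, alpha=0.5, label='Large Prizes')
plt.xlabel('Sum of numbers drawn')
plt.legend()
plt.show()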

North Carolina Cash 5


[Histogram: sum of numbers drawn, Small Prizes vs. Large Prizes]

Oregon Megabucks


[Histogram: sum of numbers drawn, Small Prizes vs. Large Prizes]

Tennessee Cash


[Histogram: sum of numbers drawn, Small Prizes vs. Large Prizes]

Florida Lucky Money


[Histogram: sum of numbers drawn, Small Prizes vs. Large Prizes]

Florida Fantasy 5


[Histogram: sum of numbers drawn, Small Prizes vs. Large Prizes]
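The separation visible in these histograms suggests a formal two-sample comparison of the number sums. The post's actual hypothesis tests are not reproduced here, but a minimal sketch of one such test, reusing the small and large groups from the plotting sketch above, might look like this (a Mann-Whitney U test is one reasonable choice; the original analysis may have used a different test):

# Illustrative two-sample test on the number sums; not the original post's test.
from scipy.stats import mannwhitneyu

stat, p_value = mannwhitneyu(small, large)   # 'small' and 'large' from the sketch above
print 'Mann-Whitney U = %s, p-value = %s' % (stat, p_value)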

Conclusion


The visualizations presented here provide multiple examples of parimutuel lotteries where there appears to be a relationship between the numbers drawn and the prize amounts. The project of predicting prize amounts from the drawn numbers is therefore likely to produce results, and the sum of the drawn numbers appears to be a promising starting point.
