Contributed by Stephen Penrice. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp from September 23 to December 18, 2015. This post is based on his third class project, due in the sixth week of the program.
Since I was accessing several different websites, I had to employ several different strategies. In increasing order of complexity, they were: encoding dates into URLs, using Selenium to click a link, and using Selenium to fill in a form.
While it is possible to access individual pages using menus, visiting one of these pages reveals that the URLs have a particular format that encodes the game name and the date of the drawing. For example,
http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=dat...
is the URL for the page that displays the data for the Fantasy 5 drawing that occurred on October 13, 2015; the key portion of the address is the string
10%2F13%2F2015
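A quick check with the standard library (using Python 3's urllib.parse here, though the scraping code below is written in Python 2 style) confirms that %2F is simply the percent-encoding of the / character, so this string is an ordinary mm/dd/yyyy date:

```python
from urllib.parse import quote, unquote

# '%2F' is the percent-encoding of '/', so the encoded string is just a date
print(quote('10/13/2015', safe=''))   # 10%2F13%2F2015
print(unquote('10%2F13%2F2015'))      # 10/13/2015
```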
The following code uses the datetime library to iterate through a specified range of dates, creating a URL string for each date, fetching the corresponding page, and processing it with Beautiful Soup.
from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    # Build the mm%2Fdd%2FYYYY fragment the site expects ('%2F' is an encoded '/')
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fl5 = open('fl_fant_5.csv','w')
fl5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn='
start_date = date(2007,1,1)
end_date = date(2015,10,26)
current = start_date
while current <= end_date:
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    # The winning numbers live in a single div; split out the five numbers
    numbers = bsPage.find_all("div", class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]', temp)
    draws = draws[1:6]
    # Winner counts and prize amounts come from the payout table
    winners = bsPage.find_all("td", class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fl5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print current.strftime('%Y-%m-%d')
    current = current + timedelta(1)
fl5.close()
print 'done'
The code for Florida's Lucky Money game is very similar. The only meaningful difference is that Lucky Money draws happen only on Tuesdays and Fridays, so the code checks the day of the week before building the URL to avoid requesting a page that does not exist.
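The weekday test relies on strftime('%w'), which returns the day of the week as a string from '0' (Sunday) to '6' (Saturday), so Tuesday and Friday are '2' and '5'. A small sketch of the skip-ahead logic:

```python
from datetime import date, timedelta

# strftime('%w') runs from '0' (Sunday) to '6' (Saturday)
d = date(2015, 10, 18)                       # a Sunday
while d.strftime('%w') not in ['2', '5']:    # advance to Tuesday or Friday
    d += timedelta(1)
print(d, d.strftime('%A'))                   # 2015-10-20 Tuesday
```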
from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    # Build the mm%2Fdd%2FYYYY fragment the site expects ('%2F' is an encoded '/')
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fllm = open('fl_lucky_money.csv','w')
fllm.write(','.join(['drawdate','n1','n2','n3','n4','luckyball','win41','win40','win31','win30','win21','win11','win20','prize41','prize40','prize31','prize30','prize21','prize11','prize20'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=LUCKYMONEY&singleDateIn='
start_date = date(2014,7,4)
end_date = date(2015,10,24)
current = start_date
while current <= end_date:
    # Advance to the next Tuesday ('2') or Friday ('5'), the only draw days
    while current.strftime('%w') not in ['2','5']:
        current = current + timedelta(1)
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    numbers = bsPage.find_all("div", class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]', temp)
    draws = draws[1:6]
    winners = bsPage.find_all("td", class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fllm.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print current.strftime('%Y-%m-%d')
    current = current + timedelta(1)
fllm.close()
North Carolina’s Cash 5 game requires the same strategy. The structure of the code is the same as the Fantasy 5 code, with the differences coming from the differences in the page structures and tags. A sample data page can be found here.
from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

ncc5 = open('nc_cash_5.csv','w')
ncc5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.nc-educationlottery.org/cash5_payout.aspx?drawDate='
start_date = date(2006,10,27)
end_date = date(2015,10,27)
current = start_date
p = re.compile('[,$]')  # strips commas and dollar signs
while current <= end_date:
    print current.strftime('%Y-%m-%d')
    url = url_stem + current.strftime('%m/%d/%Y')
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    # Each number, winner count, and prize amount sits in its own labeled span
    draws = []
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num1")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num2")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num3")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num4")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num5")[0].get_text()))
    winners = []
    winners.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match5")[0].get_text()))
    winners.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match4")[0].get_text()))
    winners.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match3")[0].get_text()))
    prizes = []
    prizes.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match5Prize")[0].get_text()))
    prizes.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match4Prize")[0].get_text()))
    prizes.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match3Prize")[0].get_text()))
    # A rollover means no top-prize winner, so record the payout as zero
    if prizes[0] == 'Rollover':
        prizes[0] = '0'
    ncc5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    current = current + timedelta(1)
ncc5.close()
print 'finished'
There are two types of links that are of interest here. First there are the "details" links to the right. I chose to deal with these by having Beautiful Soup read the URLs encoded in the anchor tags and use them to access each page. A more challenging problem is using the "Next Page" link at the bottom of the page to access the next set of 40 "details" links. For this I used the Selenium package. (Read the documentation here.) Fortunately, the link has an id that remains the same no matter how many times we click, so the code is straightforward.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
from datetime import date
from time import sleep

def GetTnCashData(url):
    # Scrape winner counts and prize amounts from a "details" pop-up page
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    temp = bsPage.find_all("td", class_="SmallBlackText")
    winners = []
    prizes = []
    for i in range(1,8):
        winners.append(temp[3*i+1].get_text())
        prizes.append(temp[3*i+2].get_text().replace('$','').replace(',',''))
    return winners + prizes

def cleanDate(strdate):
    # Convert mm/dd/yyyy to yyyy-mm-dd
    temp = strdate.split('/')
    return date(int(temp[2]), int(temp[0]), int(temp[1])).strftime('%Y-%m-%d')

tnc = open('tn_cash.csv','w')
tnc.write(','.join(['drawdate','n1','n2','n3','n4','n5','cashball','win51','win50','win41','win40','win31','win30','win21','prize51','prize50','prize41','prize40','prize31','prize30','prize21'])+'\n')
driver = webdriver.Firefox()
driver.get('https://www.tnlottery.com/winningnumbers/TennesseeCashlist.aspx?TCShowall=y#TennesseeCashball')
html = driver.page_source
nextLink = "navTennesseeCashNextPage"
soup = BeautifulSoup(html)
for pg in range(0,20):
    temp = soup.find_all("td", align="center")
    top = (len(temp)-4)/3 + 1  # number of result rows on this page
    print pg, len(temp)
    for i in range(1,top):
        drawDate = [cleanDate(temp[3*i].get_text())]
        NumsDrawn = temp[3*i+1].get_text().replace('-',' ').split(' ')
        # Pull the draw id out of the javascript href on the "details" link
        drawID = temp[3*i+2].a.get('href')
        drawID = drawID[drawID.index('=')+1:]
        drawID = drawID[:drawID.index("'")]
        drawData = GetTnCashData('https://www.tnlottery.com/winningnumbers/TennesseeCashdetails_popup.aspx?id='+drawID)
        tnc.write(','.join(drawDate + NumsDrawn + drawData) + '\n')
    driver.find_element_by_id(nextLink).click()
    sleep(30)  # give the next page time to load before re-parsing
    soup = BeautifulSoup(driver.page_source)
tnc.close()
print 'Done'
Note that this code builds each data point from two different sources: the date and numbers drawn are read from the main page while the winner counts and prize amounts are read from the pop-up window you see when you click a “details” link.
While testing this I noticed that sometimes the code repeats results from a previous selection, most likely because the new page fails to load fast enough. The code deals with this issue in two ways. First, the sleep function from the time module pauses the code for 30 seconds, greatly reducing the likelihood of the problem occurring. As an extra safety measure, the code also checks that the date on the page matches the one entered into the form before writing the results to a file. If the dates don't match, the desired date, i.e. the one Selenium entered on the form, is written to an error log.
from selenium import webdriver
from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re
from time import sleep

ormb = open('or_megabucks.csv','a')
ormb_err = open('or_megabucks_errors.csv','w')
ormb.write(','.join(['drawdate','n1','n2','n3','n4','n5','n6','winners6','winners5','winners4','prize6','prize5','prize4'])+'\n')
start_date = date(2012,10,30)
end_date = date(2015,10,29)
current = start_date
driver = webdriver.Firefox()
driver.get('http://www.oregonlottery.org/games/draw-games/megabucks/past-results')
while current <= end_date:
    # Advance to the next Monday ('1'), Wednesday ('3'), or Saturday ('6') draw day
    while current.strftime('%w') not in ['1','3','6']:
        current = current + timedelta(1)
    # Fill in the search form with a single-date range and submit it
    driver.find_element_by_id("FromDate").clear()
    driver.find_element_by_id("ToDate").clear()
    driver.find_element_by_id("FromDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element_by_id("ToDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element_by_css_selector(".viewResultsButton").click()
    sleep(30)  # give the results time to load
    soup = BeautifulSoup(driver.page_source)
    test1 = soup.find_all("td")
    numbers = [test1[i].get_text() for i in range(2,8)]
    test2 = soup.find_all("strong")
    winners = [test2[1].get_text().replace(',','')]
    prizes = [test2[0].get_text().replace('$','').replace(',','')]
    for i in range(0,2):
        winners.append(test2[4*i+3].get_text().replace(',',''))
        prizes.append(test2[4*i+2].get_text().replace('$','').replace(',',''))
    # Verify the page shows the date we asked for before writing the row
    testdate = test1[0].get_text().split('/')
    testdate = date(int(testdate[2]), int(testdate[0]), int(testdate[1]))
    if current.strftime('%Y-%m-%d') == testdate.strftime('%Y-%m-%d'):
        ormb.write(','.join([testdate.strftime('%Y-%m-%d')] + numbers + winners + prizes)+'\n')
    else:
        ormb_err.write(current.strftime('%Y-%m-%d') + '\n')
    current = current + timedelta(1)
ormb.close()
ormb_err.close()
The summary statistic that we will use is simply the sum of the numbers drawn. The plots will show histograms of this sum for two sets of drawings: those where the prize amounts were less than the 25th percentile for all draws (labelled “Small Prizes”) and those where the prize amounts were greater than the 75th percentile for all draws (labelled “Large Prizes”).
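The grouping just described can be sketched in a few lines of Python. The draws below are made-up stand-ins for the scraped CSV rows, and the actual histogram plotting (e.g. with matplotlib) is omitted:

```python
from statistics import quantiles

# Illustrative (numbers drawn, top prize) pairs; not real lottery data
draws = [
    ([1, 5, 9, 12, 20], 50000.0),   ([2, 6, 10, 14, 22], 60000.0),
    ([3, 7, 11, 16, 24], 70000.0),  ([4, 8, 13, 18, 26], 80000.0),
    ([5, 9, 15, 20, 28], 90000.0),  ([6, 10, 17, 22, 30], 100000.0),
    ([7, 11, 19, 24, 32], 110000.0), ([8, 12, 21, 26, 34], 120000.0),
]

sums = [sum(nums) for nums, _ in draws]            # summary statistic per draw
q1, _, q3 = quantiles([p for _, p in draws], n=4)  # 25th and 75th percentiles

small = [s for s, (_, p) in zip(sums, draws) if p < q1]   # "Small Prizes" group
large = [s for s, (_, p) in zip(sums, draws) if p > q3]   # "Large Prizes" group
print(small, large)  # [47, 54] [93, 101]
```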
Comment
Intriguing analysis. Am I understanding right that the difference between the distributions comes down to the prizes being split between multiple players?
A couple of years ago in August 2013, I used Tableau to help me make some Powerball selections. I scraped the historical data from the powerball site and then looked at the "hotness" of the balls from each slot (i.e., ball 1, ball 2...powerball). I then used this "probabilistic" approach to buy 5 tickets. I was able to match 2 of 6 balls on the 8/7/13 drawing. Since I don't normally play the lottery, I didn't continue data collection after that week. It was a fun exercise. Here is a link to my Tableau workbook: https://public.tableau.com/profile/3danim8#!/vizhome/Ball_Analysis_2012_2013/Ball1
© 2019 Data Science Central ®