March Madness officially arrived at 6 PM CDT, Sunday 3/17/2019. Sixty-eight D1 schools (32 league champions and 36 at-large selections) received invitations to this year’s tournament, which starts Tuesday.
Today and tomorrow, fans will work feverishly preparing their brackets. Most will use intuition or simply guess to pick game results. Others like me, though, will use analytics to outsmart chance.
One sports analytics expert I’ve followed closely over the years is Ken Pomeroy. Pomeroy has developed a serious stable of statistical measures for college basketball and adopted a “freemium” business model that offers some content for free while reserving advanced goodies for subscribers. An analytics geek can get more than she bargained for with KenPom.
2019’s March Madness has piqued my interest in dataset building as well as statistics. In this blog, my challenge is one of data gathering/organizing/munging rather than analytics per se. My self-assigned task is to download 18 years of college hoops data — 2002 through 2019 — from the KenPom site and build a coherent dataset that can be analyzed in Python/Pandas.
The code in the remainder of this notebook assembles the dataset, starting with web scraping and advancing to manipulation/wrangling in Python/Pandas. The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, bs4 (BeautifulSoup) 4.6.0, NumPy 1.14.3, and Pandas 0.23.0.
The data are first scraped from the KenPom website using the Python requests library, then “liberated” from HTML using BeautifulSoup functionality. The resulting lists are subsequently wrangled using core Python, NumPy, and Pandas. In the end, 18 years of KenPom data are concatenated in a Pandas dataframe.
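To make the shape of that pipeline concrete, here is a minimal sketch of the parse, wrangle, and concatenate steps. It is not the notebook's actual code: the inline HTML, the table layout, the column names, the team names and values, and the helpers `parse_season` and `build_dataset` are all made up for illustration, and the real KenPom markup may well differ. In a live run, each page's HTML would come from something like `requests.get(url).text` rather than from a hard-coded string.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for downloaded pages, keyed by season.
# Teams and numbers are invented; the real site's markup may differ.
SAMPLE_PAGES = {
    2018: """
    <table>
      <thead><tr><th>Rk</th><th>Team</th><th>AdjEM</th></tr></thead>
      <tbody>
        <tr><td>1</td><td>Team A</td><td>+30.1</td></tr>
        <tr><td>2</td><td>Team B</td><td>+28.4</td></tr>
      </tbody>
    </table>""",
    2019: """
    <table>
      <thead><tr><th>Rk</th><th>Team</th><th>AdjEM</th></tr></thead>
      <tbody>
        <tr><td>1</td><td>Team C</td><td>+31.7</td></tr>
        <tr><td>2</td><td>Team A</td><td>+29.9</td></tr>
      </tbody>
    </table>""",
}


def parse_season(html, year):
    """Turn one season's HTML table into a dataframe tagged with its season."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")

    # "Liberate" the header and body cells from the HTML as plain lists.
    header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]

    df = pd.DataFrame(rows, columns=header)
    # Scraped cells arrive as strings, so convert the numeric columns.
    df["Rk"] = pd.to_numeric(df["Rk"])
    df["AdjEM"] = pd.to_numeric(df["AdjEM"])
    df["Season"] = year
    return df


def build_dataset(pages):
    """Parse every season and stack the results into one dataframe."""
    frames = [parse_season(html, year) for year, html in sorted(pages.items())]
    return pd.concat(frames, ignore_index=True)


kenpom = build_dataset(SAMPLE_PAGES)
print(kenpom)
```

Tagging each frame with a `Season` column before concatenating is what keeps the stacked rows distinguishable once all the years live in a single dataframe.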
The complete blog can be read here.