March Madness officially arrived at 6 PM CDT on Sunday, 3/17/2019. 68 D1 schools -- 32 league champions and 36 at-large selections -- received invitations to this year's tournament, which starts Tuesday. Today and tomorrow, fans will work feverishly preparing their brackets. Most will use intuition or simply guess to pick game results. Others like me, though, will use analytics to outsmart chance.

One sports analytics expert I've followed closely over the years is Ken Pomeroy. Pomeroy's developed a serious stable of statistical measures for college basketball and adopted a "freemium" business model that avails some content for free while withholding advanced goodies for subscription customers. An analytics geek can get more than she bargained for with KenPom.

2019's March Madness has piqued my interest in dataset building as well as statistics. In this blog, my challenge is one of data gathering/organizing/munging rather than analytics per se. My self-assigned task is to download 18 years of college hoops data -- 2002 through 2019 -- from the KenPom site and build a coherent dataset that can be analyzed in Python/Pandas.

The code in the remainder of this notebook assembles the dataset, starting with web scraping and advancing to manipulation/wrangling in Python/Pandas. The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, bs4 (BeautifulSoup) 4.6.0, NumPy 1.14.3, and Pandas 0.23.0.

The data are first scraped from the KenPom website using the Python requests library, then "liberated" from HTML using BeautifulSoup functionality. The resulting lists are subsequently wrangled using core Python, NumPy, and Pandas. In the end, 18 years of KenPom data are concatenated in a Pandas dataframe.

The complete blog can be read here.
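The scrape-then-liberate pattern described above can be sketched as follows. This is a minimal sketch only: the HTML snippet, the column names, and the `parse_ratings` helper are illustrative assumptions, not KenPom's actual markup or any code from the blog.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Toy stand-in for one season's scraped ratings page (hypothetical markup).
SAMPLE_HTML = """
<table id="ratings-table">
  <tr><td>1</td><td>Duke</td><td>ACC</td><td>+32.1</td></tr>
  <tr><td>2</td><td>Virginia</td><td>ACC</td><td>+31.0</td></tr>
</table>
"""

def parse_ratings(html, year):
    """Liberate one season's table rows from HTML into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [[td.get_text() for td in tr.find_all("td")]
            for tr in soup.find("table").find_all("tr")]
    df = pd.DataFrame(rows, columns=["rank", "team", "conf", "adjem"])
    df["year"] = year  # tag each season so rows stay distinguishable
    return df

# Concatenate multiple seasons into one coherent dataset.
seasons = pd.concat(
    [parse_ratings(SAMPLE_HTML, yr) for yr in (2018, 2019)],
    ignore_index=True)
```

In the real workflow, `SAMPLE_HTML` would instead be the body of a `requests.get()` response for each season's page.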

Last time, I posted Part 2 of a blog trilogy on data programming with Python. That article revolved on showcasing NumPy, a comprehensive library created in 2005 that extends the Python core to accommodate "large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays." In addition to introducing a wealth of highly-performant new data structures and mathematical functions, NumPy changed the data programming metaphor in Python from procedural to specification.

Part 1 demonstrated basic data programming with Python. There, I resurrected scripts written 10 years ago that deployed core Python data structures, functions, and looping-like code to assemble a Python list for analyzing stock market returns. Fun, but a lot more work than performing the same tasks in Part 2.

This Part 3 post replicates the work done in Parts 1 and 2 using the even more productive Pandas library. In Pandas, core Python data structures such as lists/dictionaries and functionals like list comprehensions serve mainly to feed the Pandas beast.

Since the Part 3 code is simpler than the NumPy of Part 2, and much less involved than the list processing of Part 1, I've added a few graphs at the end, implemented with the productive Seaborn statistical data visualization library, built on top of Python mainstay matplotlib. Seaborn's grown by leaps and bounds recently and is now a legitimate competitor to R's ggplot2 for statistical graphics.

In the analysis that follows, I focus on performance of the Russell 3000 index, a Wilshire 5000-like portfolio for "measuring the market". I first download two files -- a year-to-date and a history -- that provide final daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data using Pandas to get to the desired end state.

The technology used for all three articles revolves on JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

See the remainder of the blog here.
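The combine/sort/dedupe/percent-change pipeline this series describes reduces to a short method chain in Pandas. A minimal sketch on made-up numbers -- the frames and column names below are stand-ins for the downloaded Russell files, not their actual layout:

```python
import pandas as pd

# Toy stand-ins for the history and year-to-date files (invented values).
hist = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-03", "2005-01-04"]),
    "nodiv": [100.0, 101.0], "withdiv": [100.0, 101.2]})
ytd = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-04", "2005-01-05"]),
    "nodiv": [101.0, 102.5], "withdiv": [101.2, 102.9]})

# Combine the files, order by date, drop the overlap, compute daily returns.
r3000 = (pd.concat([hist, ytd])
         .sort_values("date")
         .drop_duplicates("date")
         .reset_index(drop=True))
r3000["pctchg"] = r3000["withdiv"].pct_change()
```

The first row's return is NaN by construction, since there is no prior day to compare against.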


Last week I posted the first of a three-part series on basic data programming with Python. For that article, I resurrected scripts written 10 years ago that deployed core Python data structures and functions to assemble a Python list for analyzing stock market returns. While it was fun refreshing and modernizing that code, I'm now pretty spoiled working with advanced libraries like NumPy and Pandas that make data programming tasks much simpler.

This second post revolves on a brief showcasing of NumPy, a comprehensive library created in 2005 that extends the Python core to accommodate "large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays." In addition to introducing a wealth of highly-performant new data structures and mathematical functions, NumPy changed the data programming metaphor in Python from procedural to specification. In Part 1, I detail looping-like code for building the final lists; in Part 2, I pretty much simply invoke array functions and structure subscripting to complete the tasks.

Though I'm the first to acknowledge not being a NumPy expert, I had little trouble figuring out what to do with the help of Stack Overflow. Indeed, those familiar with the relatively recent Pandas library for data analysis will readily adapt to the foundational NumPy programming style. Core Python structures such as lists, dictionaries, comprehensions, and iterables serve primarily to feed the NumPy/Pandas beasts.

For the analysis that follows, I focus on performance of the Russell 3000 index, a competitor to the S&P 500 and Wilshire 5000 for "measuring the market". I first download two files -- a year-to-date and a history -- that provide final daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data using NumPy to get to the desired end state.

The technology used for all three articles revolves on JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Read the remainder of the blog here.
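The procedural-to-specification shift mentioned above is easiest to see in the daily percent-change computation: instead of a loop over adjacent days, one vectorized expression. The index levels here are invented for illustration:

```python
import numpy as np

# Invented end-of-day index levels, oldest to newest.
levels = np.array([100.0, 101.2, 102.9, 101.8])

# Daily percent change: ratio of each day to the prior day, minus one.
# One array expression replaces an explicit loop over adjacent elements.
pctchg = levels[1:] / levels[:-1] - 1.0

# Sanity check: compounding the daily returns recovers the final level.
assert np.isclose(levels[0] * np.prod(1.0 + pctchg), levels[-1])
```

The same slicing idiom extends to any derived series -- log returns, rolling differences, and the like -- without writing a loop.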


I had an interesting discussion with one of my son's friends at a neighborhood gathering over the holidays. He's just reached the halfway point of a Chicago-area Master's in Analytics program and wanted to pick my brain on the state of the discipline.

Of the four major program foci of business, data, computation, and algorithms, he acknowledged he liked computation best, with Python in the lead against R and SAS for his attention. I was impressed with his understanding of Python, especially given that he'd had no programming experience outside Excel before starting the curriculum.

After a while, we got to chatting about NumPy and Pandas, the workhorses of Python data programming. My son's friend was using Pandas a bit now, but hadn't been exposed to NumPy per se. And while he noted the productivity benefits of working with such libraries, I don't think he quite appreciated the magnitude of relief provided for day-to-day data programming challenges. He seemed smitten with the power he'd discovered with core Python. Actually, he sounded a lot like me when I was first exposed to Pandas almost 10 years ago -- and when I first saw R as a SAS programmer back in 2000. As our conversation progressed, I just smiled, fully confident his admiration would grow over time.

Our discussion did whet my appetite for the vanilla Python data programming I've done in the past. So I just had to dig up some code I'd written "BP" -- before Pandas. Following a pretty exhaustive search, I found scripts from 2010. The topic was stock market performance. The work stream entailed wrangling CSV files from the investment benchmark company Russell FTSE website pertaining to the performance of its many market indexes. Just about all the work was completed using core Python libraries and simple data structures -- lists and dictionaries.

As I modernized the old code a bit, my appreciation for Pandas/NumPy did nothing but grow. Much more looping-like code in vanilla Python. And, alas, lists aren't dataframes. On the other hand, with Pandas: Array orientation/functions? Check. Routinized missing data handling? Check. Tables/dataframes as core data structures? Check. Handling complex input files? Check. Powerful query capability? Check. Easy updating/variable creation? Check. Joins and group by? Check. Et al.? Check.

For the analysis that follows, I focus on performance of the Russell 3000 index, a competitor to the S&P 500 and Wilshire 5000 for "measuring the market". I first download two files -- a year-to-date and a history -- that provide final daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested.

Once the file data are in memory, they're munged and combined into a single Python multi-variable list. The combined data are then sorted by date, at which point duplicates are deleted. After that, I compute daily percent change variables from the index levels, ultimately producing index performance statistics. At the end, I write the list to a CSV file.

My take? Even though the code here is no more than intermediate-level, data programming in Python without Pandas seems antediluvian now. The array orientations of both Pandas and NumPy make this work so much simpler than the looping idioms of vanilla Python. Indeed, even though I programmed with Fortran, PL/I, and C in the past, I've become quite lazy in the past few years.

This is the first of three blogs pretty much doing the same tasks with the Russell 3000 data. The second uses NumPy, and the final, Pandas.

The technology used for the three articles revolves on JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Read the remainder of the blog here.
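To make the contrast concrete, here is roughly what the sort/dedupe/percent-change portion of the vanilla workflow looks like in core Python. The records are fabricated for illustration; the original scripts work against the downloaded Russell CSVs:

```python
import csv
import io

# Stand-in for the munged multi-variable list described above:
# [date, index name, level without dividends, level with dividends].
records = [
    ["2005-01-04", "Russell 3000", 101.0, 101.2],
    ["2005-01-03", "Russell 3000", 100.0, 100.0],
    ["2005-01-04", "Russell 3000", 101.0, 101.2],  # overlap with YTD file
    ["2005-01-05", "Russell 3000", 102.5, 102.9],
]

# Sort by date, then delete duplicates -- looping idioms Pandas replaces.
records.sort(key=lambda r: r[0])
deduped = []
for rec in records:
    if not deduped or rec[0] != deduped[-1][0]:
        deduped.append(rec)

# Compute daily percent change from the with-dividends level.
deduped[0].append(None)  # no prior day for the first record
for prev, cur in zip(deduped, deduped[1:]):
    cur.append(cur[3] / prev[3] - 1.0)

# Write the final list to CSV (in-memory here for illustration).
buf = io.StringIO()
csv.writer(buf).writerows(deduped)
```

Every step above is one line in Pandas (`sort_values`, `drop_duplicates`, `pct_change`, `to_csv`), which is the point of the checklist.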


Like most Chicago football fans, I was pretty distraught after the Bears lost last Sunday's playoff game courtesy of a missed field goal at the end -- a kick that first hit the goalpost and then the crossbar before ultimately failing miserably. While most local fans were grief-stricken like me, some were irrationally inconsolable, demanding the scalp, and worse, of the Chicago kicker.

Ever the stats and randomness guy, I tried to talk my rabid friends off the ledge -- with little success. The 40+ yard attempt was certainly not a gimme, even for the very best kickers, I opined. What if the gods had allowed the kick to bounce through; would the kicker then have been a hero? Why not blame the defense for allowing the late touchdown? Even with the final miss, the Bears kicker made 3 of 4 attempts for the game -- certainly a credible day's work. Would there be such venom if he'd missed the kick with 5 minutes left rather than 5 seconds? Alas, it seemed there was no convincing the hard core, especially regarding a guy who'll almost assuredly not be with the team next year.

The kicker hit on 77% of his attempts this season and has made 84% over his career. These numbers compare favorably with the much-beloved kicker of the Super Bowl-winning team of 1986. Indeed, the 1986 kicker made just 50% of attempts in the 40-49 yard range that season, and only slightly more than 50% of such attempts for his career. So he'd have been just 50-50 to make the same kick this year's guy missed.

Comparing kickers of different eras is tricky, though. Performance, and hence standards, change over time. What was "par" 30 years ago may be bogey today.

What's a grief-stricken, data-driven Bears fan to do? Find data to analyze, of course. It didn't take long to get started. ESPN has clean field goal numbers from 2002-2018 by team and kick length category. Though I'd ultimately like to dig back even further in NFL history, I thought this'd be a convenient point of departure. The remainder of this article revolves on scraping and wrangling the ESPN data, followed by several visuals that show 17-year trends.

For the analysis that follows, the technology used is JupyterLab 0.32.1 with Microsoft Open R 3.4.4. As always, the R work is driven by the data.table package and tidyverse ecosystem.

Read the entire blog here.
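The blog's wrangling is done in R with data.table and the tidyverse; for consistency with the Python posts above, here is a comparable sketch of the era-comparison computation in Pandas. The counts are invented, and the table shape is only assumed to resemble ESPN's kick-length breakdown:

```python
import pandas as pd

# Invented made/attempted counts by season and distance bucket, shaped
# like the ESPN field goal data described above (the blog uses R, not this).
fg = pd.DataFrame({
    "season":   [2002, 2002, 2018, 2018],
    "distance": ["40-49", "50+", "40-49", "50+"],
    "made":     [60, 10, 85, 30],
    "att":      [100, 30, 100, 45]})

# Percent made by season and kick length -- the raw input to a trend plot
# showing how "par" for kickers has moved over the 17 seasons.
fg["pct"] = fg["made"] / fg["att"]
trend = fg.pivot(index="season", columns="distance", values="pct")
```

A season-by-bucket table like `trend` is what makes cross-era comparisons honest: a 40-49 yard make rate that was par in 2002 can be well below par in 2018.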
