In my many years as a data scientist, I've spent more time doing forecast work than any other type of predictive modeling. As often as not, the challenges have involved forecasting demand for an organization's many products/lines of business a year or more out, based on five or more years of actuals, generally at daily granularity. A difficult task indeed, and one for which the business's accuracy expectations are seldom met.

One thing I've learned about forecasting is not to be a slave to any modeling technique, choosing predictive integrity over model fidelity. And I've become adept at scrambling -- adapting to early forecasting results with appropriate model changes to better predict an ever-evolving future. It turns out that both economists and meteorologists operate in that mode, with a name, "nowcasting", for how they modify early predictions based on experience during the forecast period. Meteorologists are constantly revising their weather forecasts, and economists update their annual GDP projections quite often as inputs evolve.

Formally, nowcasting is the "prediction of the present, the very near future and the very recent past. Crucial in this process is to use timely monthly information in order to nowcast key economic variables ... the nowcasting process goes beyond the simple production of an early estimate as it essentially requires the assessment of the impact of new data on the subsequent forecast revisions for the target variable."

I've written several blogs over the years on crime in my home city of Chicago, especially after the disturbing uptrend in 2016. I continue to download Chicago crime data daily to look at the frequencies of homicides and violent crimes. The trends are in the right direction, though the pace is not nearly fast enough. After the disastrous 2016, I've been in forecast mode for 2017, 2018, and now 2019.
My approach is one of nowcasting -- starting with predictions for 2019 based on the available data from 2001-2018, then revising those forecasts based on the daily experience as 2019 progresses. It turns out, not surprisingly, that year-to-date experience is quite helpful in forecasting final annual counts. Knowing the number of violent crimes between 1/1/2018 and 2/28/2018 was a big help in predicting the final 2018 violent crime frequencies, and knowing the counts through 6/30/2018 was even more valuable.

The remainder of this blog examines how the first four months of frequencies for homicide and violent crimes can assist in forecasting final annual 2019 numbers. I explore the relationships between year-to-date and final counts for homicides and violent crimes in Chicago from 2001-2018, then attempt to forecast 2019's final frequencies. I'll continue to do the analytics as 2019 progresses, hopefully nowcasting more accurate (and declining) crime figures over time.

The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0, and Microsoft R 3.4.4 with ggplot and rmagic. The cumulative daily Chicago crime file from 2001 through to-date 2019 (a week in arrears) drives the analysis. Data munging is done with Python/Pandas; the crime frequency dataframes are then fed to R for visualization with ggplot.

Find the entire blog here.
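The idea of using year-to-date counts to predict final annual counts can be sketched with a simple linear fit. The figures below are hypothetical stand-ins, not the actual Chicago numbers, and the one-variable regression is just an illustration of the relationship the blog explores:

```python
import numpy as np
import pandas as pd

# Hypothetical Jan-Apr year-to-date and final annual violent-crime counts;
# the real figures come from the daily Chicago crime download.
hist = pd.DataFrame({
    "year": range(2013, 2019),
    "ytd_apr": [8200, 8100, 8000, 9200, 8900, 8500],
    "final": [24500, 24200, 24000, 28000, 26800, 25400],
})

# Fit final counts as a linear function of the April year-to-date counts
slope, intercept = np.polyfit(hist["ytd_apr"], hist["final"], 1)

# Nowcast 2019's final count from its (hypothetical) Jan-Apr figure
ytd_2019 = 8300
forecast_2019 = slope * ytd_2019 + intercept
print(round(forecast_2019))
```

As the year progresses, re-running the fit with later year-to-date cutoffs (June, September, ...) tightens the forecast, which is the nowcasting loop in miniature.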

A little less than a year ago, I posted a blog on generating multivariate frequencies with the Python Pandas data management library, at the same time showcasing Python/R graphics interoperability. For my data science work, the ability to produce multidimensional frequency counts quickly is sine qua non for subsequent data analysis.

Pandas provides a method, value_counts(), for computing frequencies of a single dataframe attribute. By default, value_counts() excludes missing (NaN) values, though they're included with the dropna=False option. As noted in that blog, however, the multivariate case is more problematic: "Alas, value_counts() works on single attributes only, so to handle the multi-variable case, the programmer must dig into Pandas's powerful split-apply-combine groupby functions. There is a problem with this though: by default, these groupby functions automatically delete NA's from consideration, even as it's generally the case with frequencies that NA counts are desirable. What's the Pandas developer to do?

There are several work-arounds that can be deployed. The first is to convert all groupby "dimension" vars to string, in so doing preserving NA's. That's a pretty ugly and inefficient band-aid, however. The second is to use the fillna() function to replace NA's with a designated "missing" value such as 999.999, and then to replace the 999.999 later in the chain with NA after the computations are completed. I'd gone with the string conversion option when first I considered frequencies in Pandas. This time, though, I looked harder at the fillna-replace option, generally finding it the lesser of two evils."

Subsequent to that posting, I detailed a freqs1 function for single-attribute dataframe frequencies and freqsdf for the multivariate case. The functions have worked pretty well for me on Pandas numeric, string, and date datatypes. Simple versions of these functions are included below.

Last summer I started experimenting with the Pandas categorical datatype.
Much like factors in R, "A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales." The categorical datatype can be quite useful in many instances, both in signaling to analytic functions a specific role for the attribute, and in saving memory by storing integers instead of costlier representations such as strings.

Unfortunately, as I started to incorporate categorical attributes in my work, I found that my trusty freqsdf no longer worked, tripped up by the internal representation of the new datatype. So it was back to the drawing board to expand freqsdf functionality to handle categorical data. In the remainder of this blog, I present proofs of concept for two competing functions, freqscat and freqsC, that purport to satisfy all datatypes of freqsdf plus categorical attributes. Hopefully useful extensions, these functions should be seen as POC only.

The cells that follow exercise Pandas frequencies options for historical Chicago crime data with over 6.8M records. I first load feather files produced from prior data munging into Pandas dataframes, then build freqscat and freqsC functions from the freqs1 and freqsdf foundations.

The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Read the entire post here.
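The NA problem and the fillna/replace workaround quoted above can be seen in a few lines. The toy frame below stands in for the crime data; the sentinel value and column names are illustrative, not the blog's actual freqsdf code:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Chicago crime data; 'type' has a missing value
df = pd.DataFrame({
    "type": ["THEFT", "BATTERY", "THEFT", np.nan, "BATTERY", "THEFT"],
    "arrest": [True, False, True, False, True, False],
})

# Single attribute: value_counts keeps NaN only with dropna=False
print(df["type"].value_counts(dropna=False))

# Multivariate: plain groupby silently drops the NaN row...
plain = df.groupby(["type", "arrest"]).size()

# ...so one workaround is fillna with a sentinel, count, then restore NaN
freqs = (df.fillna({"type": "999.999"})
           .groupby(["type", "arrest"]).size()
           .reset_index(name="count")
           .replace("999.999", np.nan))
print(freqs)
```

Note that `plain` totals only 5 observations (the NaN row is gone), while the sentinel version accounts for all 6.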

March Madness officially arrived at 6 PM CDT, Sunday 3/17/2019. 68 D1 schools -- 32 league champions and 36 at-large selections -- received invitations to this year's tournament, which starts Tuesday. Today and tomorrow, fans will work feverishly preparing their brackets. Most will use intuition or simply guess to pick game results. Others like me, though, will use analytics to outsmart chance.

One sports analytics expert I've followed closely over the years is Ken Pomeroy. Pomeroy's developed a serious stable of statistical measures for college basketball and adopted a "freemium" business model that avails some content for free while withholding advanced goodies for subscription customers. An analytics geek can get more than she bargained for with KenPom.

2019's March Madness has piqued my interest in dataset building as well as statistics. In this blog, my challenge is one of data gathering/organizing/munging rather than analytics per se. My self-assigned task is to download 18 years of college hoops data -- 2002 through 2019 -- from the KenPom site and build a coherent dataset that can be analyzed in Python/Pandas.

The code in the remainder of this notebook assembles the dataset, starting from web scraping and advancing to manipulation/wrangling in Python/Pandas. The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, bs4 (BeautifulSoup) 4.6.0, NumPy 1.14.3, and Pandas 0.23.0.

The data are first scraped from the KenPom website using the Python requests library, then "liberated" from HTML using BeautifulSoup functionality. The resulting lists are subsequently wrangled using core Python, NumPy, and Pandas. In the end, 18 years of KenPom data are concatenated in a Pandas dataframe.

The complete blog can be read here.
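The scrape-then-liberate pipeline can be sketched as follows. The table id and page markup here are assumptions about the KenPom HTML, and the fetch is shown only in comments so the parsing step stands alone:

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_ratings(html, year):
    """Liberate one season's ratings rows from KenPom-style HTML.
    The table id 'ratings-table' is an assumption about the page markup."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "ratings-table"})
    rows = table.find("tbody").find_all("tr")
    # Pull the visible text out of every cell, one list per team row
    data = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in rows]
    df = pd.DataFrame(data)
    df["year"] = year  # tag each season so the 18 frames can be concatenated
    return df

# Fetching (one requests.get per season, 2002-2019) would look roughly like:
#   import requests
#   html = requests.get("https://kenpom.com/index.php?y=2019").text
#   frames = [parse_ratings(html_for_year, y) for y in range(2002, 2020)]
#   kenpom = pd.concat(frames, ignore_index=True)
```

The per-season `year` column is what makes the final `pd.concat` a coherent 18-year dataset rather than 18 indistinguishable tables.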

Last time, I posted Part 2 of a blog trilogy on data programming with Python. That article revolved around showcasing NumPy, a comprehensive library created in 2005 that extends the Python core to accommodate "large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays." In addition to introducing a wealth of highly performant new data structures and mathematical functions, NumPy changed the data programming metaphor in Python from procedural to specification.

Part 1 demonstrated basic data programming with Python. There, I resurrected scripts written 10 years ago that deployed core Python data structures, functions, and looping-like code to assemble a Python list for analyzing stock market returns. Fun, but a lot more work than performing the same tasks in Part 2.

This Part 3 post replicates the work done in Parts 1 and 2, using the even more productive Pandas library. In Pandas, core Python data structures such as lists/dictionaries and functionals like list comprehensions serve mainly to feed the Pandas beast.

Since the Part 3 code is simpler than the NumPy of Part 2, and much less involved than the list processing of Part 1, I've added a few graphs at the end, implemented with the productive Seaborn statistical data visualization library, built on top of Python mainstay matplotlib. Seaborn's grown by leaps and bounds recently and is now a legitimate competitor to R's ggplot2 for statistical graphics.

In the analysis that follows, I focus on the performance of the Russell 3000 index, a Wilshire 5000-like portfolio for "measuring the market". I first download two files -- a year-to-date and a history -- that provide final Russell 3000 daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested.
I then wrangle the data using Pandas to get to the desired end state. The technology used for all three articles revolves around JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

See the remainder of the blog here.
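The combine/sort/dedupe/percent-change workflow described above takes only a few method calls in Pandas. The frames below are tiny hypothetical stand-ins for the two Russell downloads, with illustrative column names:

```python
import pandas as pd

# Hypothetical stand-ins for the history and year-to-date downloads;
# the real files carry index name, date, and levels with/without dividends.
hist = pd.DataFrame({
    "index": ["Russell 3000"] * 3,
    "date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04"]),
    "level": [100.0, 98.0, 101.0],
    "level_div": [120.0, 117.6, 121.3],
})
ytd = pd.DataFrame({
    "index": ["Russell 3000"] * 2,
    "date": pd.to_datetime(["2019-01-04", "2019-01-07"]),
    "level": [101.0, 102.5],
    "level_div": [121.3, 123.2],
})

# Combine, order by date, and drop the overlapping duplicate dates
r3000 = (pd.concat([hist, ytd], ignore_index=True)
           .sort_values("date")
           .drop_duplicates(subset="date")
           .reset_index(drop=True))

# Daily percent changes straight from the index levels
r3000["pctchg"] = r3000["level"].pct_change()
r3000["pctchg_div"] = r3000["level_div"].pct_change()
print(r3000)
```

The entire Part 1 work stream -- sorting, de-duplicating, deriving returns -- collapses into one method chain plus two `pct_change` calls, which is the productivity point of the trilogy.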


Last week I posted the first of a three-part series on basic data programming with Python. For that article, I resurrected scripts written 10 years ago that deployed core Python data structures and functions to assemble a Python list for analyzing stock market returns. While it was fun refreshing and modernizing that code, I'm now pretty spoiled working with advanced libraries like NumPy and Pandas that make data programming tasks much simpler.

This second post revolves around a brief showcasing of NumPy, a comprehensive library created in 2005 that extends the Python core to accommodate "large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays." In addition to introducing a wealth of highly performant new data structures and mathematical functions, NumPy changed the data programming metaphor in Python from procedural to specification. In Part 1, I detail looping-like code for building the final lists; in Part 2, I pretty much simply invoke array functions and structure subscripting to complete the tasks.

Though I'm the first to acknowledge not being a NumPy expert, I had little trouble figuring out what to do with the help of Stack Overflow. Indeed, those familiar with the relatively recent Pandas library for data analysis will readily adapt to the foundational NumPy programming style. Core Python structures such as lists, dictionaries, comprehensions, and iterables serve primarily to feed the NumPy/Pandas beasts.

For the analysis that follows, I focus on the performance of the Russell 3000 index, a competitor to the S&P 500 and Wilshire 5000 for "measuring the market". I first download two files -- a year-to-date and a history -- that provide final Russell 3000 daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested.
I then wrangle the data using NumPy to get to the desired end state. The technology used for all three articles revolves around JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Read the remainder of the blog here.
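The procedural-to-specification shift is easiest to see on the core calculation: daily percent change. With hypothetical index levels (already combined, sorted, and de-duplicated), the loop of Part 1 becomes one vectorized slice expression:

```python
import numpy as np

# Hypothetical daily index levels after combining, sorting, and de-duping
levels = np.array([100.0, 98.0, 101.0, 102.5])

# Specification style: one vectorized expression, no explicit loop --
# each element divided by its predecessor, minus one
pctchg = levels[1:] / levels[:-1] - 1

# Summary performance statistics follow directly from array methods
print(pctchg)
print(pctchg.mean(), pctchg.std())
```

The slicing idiom `levels[1:] / levels[:-1]` replaces an entire index-tracking loop, which is the essence of what Part 2 demonstrates at larger scale.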

I had an interesting discussion with one of my son's friends at a neighborhood gathering over the holidays. He's just reached the halfway point of a Chicago-area Masters in Analytics program and wanted to pick my brain on the state of the discipline. Of the four major program foci of business, data, computation, and algorithms, he acknowledged he liked computation best, with Python in the lead against R and SAS for his attention. I was impressed with his understanding of Python, especially given that he'd had no programming experience outside Excel before starting the curriculum.

After a while, we got to chatting about NumPy and Pandas, the workhorses of Python data programming. My son's friend was using Pandas a bit now, but hadn't been exposed to NumPy per se. And while he noted the productivity benefits of working with such libraries, I don't think he quite appreciated the magnitude of relief they provide for day-to-day data programming challenges. He seemed smitten with the power he'd discovered in core Python. Actually, he sounded a lot like me when I was first exposed to Pandas almost 10 years ago -- and when I first saw R as a SAS programmer back in 2000. As our conversation progressed, I just smiled, fully confident his admiration would grow over time.

Our discussion did whet my appetite for the vanilla Python data programming I've done in the past, so I just had to dig up some code I'd written "BP" -- before Pandas. Following a pretty exhaustive search, I found scripts from 2010. The topic was stock market performance. The work stream entailed wrangling CSV files from the investment benchmark company Russell FTSE website pertaining to the performance of its many market indexes. Just about all the work was completed using core Python libraries and the simple data structures lists and dictionaries.

As I modernized the old code a bit, my appreciation for Pandas/NumPy did nothing but grow. Much more looping-like code in vanilla Python.
And, alas, lists aren't dataframes. On the other hand, with Pandas: Array orientation/functions? Check. Routinized missing data handling? Check. Tables/dataframes as core data structures? Check. Handling complex input files? Check. Powerful query capability? Check. Easy updating/variable creation? Check. Joins and group by? Check. Et al.? Check.

For the analysis that follows, I focus on the performance of the Russell 3000 index, a competitor to the S&P 500 and Wilshire 5000 for "measuring the market". I first download two files -- a year-to-date and a history -- that provide final Russell 3000 daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested.

Once the file data are in memory, they're munged and combined into a single Python multi-variable list. The combined data are then sorted by date, at which point duplicates are deleted. After that, I compute daily percent change variables from the index levels, ultimately producing index performance statistics. At the end I write the list to a CSV file.

My take? Even though the code here is no more than intermediate-level, data programming in Python without Pandas seems antediluvian now. The array orientations of both Pandas and NumPy make this work so much simpler than the looping idioms of vanilla Python. Indeed, even though I programmed with Fortran, PL/I, and C in the past, I've become quite lazy in the past few years.

This is the first of three blogs pretty much doing the same tasks with the Russell 3000 data. The second uses NumPy, and the final, Pandas.

The technology used for the three articles revolves around JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Read the remainder of the blog here.
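The combine/sort/dedupe/percent-change work stream described above looks roughly like the sketch below in "BP" vanilla Python. The CSV contents are hypothetical stand-ins for the Russell downloads; note how much bookkeeping the looping idioms require compared with a dataframe method chain:

```python
import csv
from io import StringIO

# Hypothetical combined contents of the year-to-date and history files:
# index name, date, level (with a duplicate row to be deleted)
raw = """Russell 3000,2019-01-03,98.0
Russell 3000,2019-01-02,100.0
Russell 3000,2019-01-03,98.0
Russell 3000,2019-01-04,101.0"""

# Parse into a multi-variable list, converting levels to float
rows = [[name, date, float(level)]
        for name, date, level in csv.reader(StringIO(raw))]

# Sort by date, then delete duplicates -- looping idioms, not array ops
rows.sort(key=lambda r: r[1])
deduped, seen = [], set()
for r in rows:
    if r[1] not in seen:
        seen.add(r[1])
        deduped.append(r)

# Daily percent change computed pairwise with zip
pctchg = [(b[2] / a[2]) - 1 for a, b in zip(deduped, deduped[1:])]
print(pctchg)
```

Every step here -- sorting, de-duplicating, lagged division -- is hand-rolled state management that Pandas and NumPy reduce to a method call apiece, which is exactly the comparison the trilogy draws.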