After posting my most recent blog using census data to illustrate handling "large" dataframes in R exploiting fst and feather file formats, I realized I could have taken the analysis a step further. Recall my description then of "demographic information on both American households and individuals (population). The final household and population data stores are quite large for desktop computing: household consists of almost 7.5M records with 233 attributes, while population is just under 15.8M cases and 286 variables."

In addition to working with the two individual dataframes/data.tables, one could also consider the merge of household and population. For each household, there are one or more population records; each population record is, in turn, for one and only one household. Indeed, there's an attribute, serialno, that can be used to join household and population into a resulting data.table of almost 15.8M records and over 500 attributes -- consuming in excess of 32GB of memory. This is quite large for desktop R.

So, I just had to attempt to produce such a structure, despite the R in-memory limitation, and then write fst and feather files for subsequent use. Alas, the attempt to produce the supersize data.table on my 64GB Wintel notebook failed with a memory allocation error (R is not the most efficient memory manager). My response? Get the join to work on a 128GB notebook, then "save" the resulting data in fst and feather files. Once produced, transport those files to the 64GB machine to see if they can be read. It turns out that approach works quite well.

The remainder of this notebook splits attention between 128GB RAM and 64GB RAM machines, producing the joined data.table on the former and demoing access to the data from fst and feather files on the latter. Once the grand fst file is built, it can be redeployed to a smaller-memory machine and accessed like an in-memory data.table with projection and filtering.
The performance is excellent.

The technology used is JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, and R 3.6.0. The R tidyverse, data.table, fst, and feather packages are featured.

See the entire blog here.
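The post performs the join in R data.table; the one-to-many household-to-population relationship it describes can be sketched in Pandas as well. Only the serialno key comes from the post -- the toy columns and values below are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the ACS household and population tables; only the
# serialno join key comes from the post -- the other columns are invented.
household = pd.DataFrame({
    "serialno": [1, 2, 3],
    "hh_income": [52000, 81000, 39000],
})
population = pd.DataFrame({
    "serialno": [1, 1, 2, 3, 3, 3],
    "age": [34, 6, 58, 41, 40, 12],
})

# Each household row fans out to its person rows (one-to-many join), which
# is why the merged table lands near the population row count, not household's.
merged = population.merge(household, on="serialno",
                          how="left", validate="many_to_one")
print(len(merged))                 # one row per person
print(merged.columns.tolist())     # person attributes plus household attributes
```

The `validate="many_to_one"` argument makes the join fail loudly if serialno were unexpectedly duplicated on the household side -- cheap insurance before committing 32GB of memory to the real merge.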


I recently downloaded a 5-year Public Use Microsample (PUMS) from the latest release of the American Community Survey (ACS) census data. The data contain a wealth of demographic information on both American households and individuals (population). The final household and population data stores are quite large for desktop computing: household consists of almost 7.5M records with 233 attributes, while population is just under 15.8M cases and 286 variables.

In addition to enabling a wealth of demographic analyses, these census data are quite suitable for performance testing functions to ingest, munge, and deliver analytics -- that is, if your computer has enough firepower: both R and Python-Pandas constrain data structures by the size of memory. Fortunately, my Wintel notebook, with 64 GB RAM and 2 TB of disk/solid-state storage, is up to the hardware task here.

My focus with this blog is on determining how R's dataframe/data.table read and write capabilities measure up to 15+ GB of raw input. Working with data this size can often deliver more clear-cut benchmarks than smaller tests repeated and aggregated. Indeed, oftentimes, as is the case in this notebook, the analyst can experience order-of-magnitude performance differences between various approaches.

In the analyses below, I contrast the elapsed time of writing the 18 GB population dataframe to OS files using three different csv functions, R's saveRDS function with and without compression, the interoperable feather library, and the nonesuch fst library. I then, in turn, read these just-produced OS files back into dataframes/data.tables and compare timing results. Each of the seven read/write approaches produces files that are portable across disparate R platforms.
The feather package, in addition, interoperates between R and Python-Pandas -- a major benefit.

At the conclusion of the performance tests, I outline a generic approach to efficiently sourcing data from R for work in both R and Python-Pandas platforms, using functionality from a combination of the fst and feather packages. I demonstrate the approach in R and, using the nifty reticulate package, in Python-Pandas as well.

The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, and R 3.6.0. The R data.table, fst, feather, and reticulate packages are featured.

Read the entire post here.
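The post's benchmarks are written in R; the elapsed-time harness it relies on looks roughly like the following Python/Pandas analogue, here contrasting a text format (csv) with a binary one (pickle, playing the role of saveRDS). The frame is a small synthetic stand-in, not the 18 GB population table.

```python
import time
import numpy as np
import pandas as pd

# Small synthetic stand-in for the population dataframe.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(10_000, 5)),
                  columns=list("abcde"))

def timed(label, fn):
    """Run fn() once and report elapsed seconds -- the same
    elapsed-time framing the post uses for its R comparisons."""
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("csv write",    lambda: df.to_csv("pop.csv", index=False))
timed("pickle write", lambda: df.to_pickle("pop.pkl"))   # binary, like saveRDS
timed("csv read",     lambda: pd.read_csv("pop.csv"))
timed("pickle read",  lambda: pd.read_pickle("pop.pkl"))
```

Even at this toy scale the binary format usually wins; at 18 GB the gap widens into the order-of-magnitude differences the post reports.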

In my many years as a data scientist, I've spent more time doing forecast work than any other type of predictive modeling. Often as not, the challenges have involved forecasting demand for an organization's many products/lines of business a year or more out based on five or more years of actual data, generally of daily granularity. A difficult task indeed, and one for which accuracy expectations by the business are seldom met.

One thing I've learned about forecasting is not to be a slave to any modeling technique, choosing predictive integrity over model fidelity. And I've become adept at scrambling -- adapting to early forecasting results with appropriate model changes to better predict an ever-evolving future. It turns out that both economists and meteorologists are also in that mode, with a name, "nowcasting", to describe how they modify early predictions based on experience during the forecast period. Meteorologists are constantly changing their weather forecasts, and economists update their annual GDP projections quite often due to evolving inputs.

Formally, nowcasting is the "prediction of the present, the very near future and the very recent past. Crucial in this process is to use timely monthly information in order to nowcast key economic variables.....the nowcasting process goes beyond the simple production of an early estimate as it essentially requires the assessment of the impact of new data on the subsequent forecast revisions for the target variable."

I've written several blogs over the years on crime in my home city of Chicago, especially after the disturbing uptrend in 2016. I continue to download Chicago crime data daily to look at the frequencies of homicides and violent crimes. The trends are in the right direction, though the pace is not nearly fast enough.

After the disastrous 2016, I've been in forecast mode for 2017, 2018, and now 2019.
My approach is one of nowcasting -- starting with predictions for 2019 based on the available data from 2001-2018, then changing these forecasts based on daily experience as 2019 progresses. It turns out, not surprisingly, that using year-to-date experience is quite helpful in forecasting final annual counts. Knowing the number of violent crimes between 1/1/2018 and 2/28/2018 was a big help in predicting the final 2018 violent crime frequencies. And knowing the counts through 6/30/2018 was even more valuable.

The remainder of this blog examines how the first four months of frequencies for homicides and violent crimes can assist in forecasting final annual 2019 numbers. I explore the relationships between year-to-date and final counts for homicides and violent crimes in Chicago from 2001-2018, then attempt to forecast 2019's final frequencies. I'll continue to do the analytics as 2019 progresses, hopefully nowcasting more accurate (and declining) crime over time.

The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0, and Microsoft R 3.4.4 with ggplot and rmagic. The cumulative daily Chicago crime file from 2001 through to-date 2019 (a week in arrears) drives the analysis. Data munging is done with Python/Pandas. The crime frequency dataframes are then fed to R for visualization using ggplot.

Find the entire blog here.
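The core of the year-to-date-to-final relationship can be sketched as a simple least-squares fit. The counts below are invented illustrative numbers, not actual Chicago figures, and the post does not claim a plain linear model -- this is just a minimal sketch of the idea.

```python
import numpy as np

# Invented illustrative counts -- NOT actual Chicago data: violent crimes
# through April 30 (x) and the corresponding final annual totals (y),
# one pair per historical year.
ytd_apr = np.array([9800, 9200, 8700, 8400, 7900, 7600])
final   = np.array([30100, 28400, 26900, 25800, 24400, 23500])

# Fit final ~ ytd by least squares, then nowcast from a hypothetical
# current-year YTD count of 7300.
slope, intercept = np.polyfit(ytd_apr, final, 1)
nowcast = slope * 7300 + intercept
print(round(nowcast))
```

As more of the year accrues, the fit is simply re-run with a longer year-to-date window (through June, through September, ...), which is what makes the forecast a nowcast: it revises as new data arrive.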

A little less than a year ago, I posted a blog on generating multivariate frequencies with the Python Pandas data management library, at the same time showcasing Python/R graphics interoperability. For my data science work, the ability to produce multidimensional frequency counts quickly is sine qua non for subsequent data analysis.

Pandas provides a method, value_counts(), for computing frequencies of a single dataframe attribute. By default, value_counts excludes missing (NaN) values, though they're included with the dropna=False option. As noted in that blog, however, the multivariate case is more problematic: "Alas, value_counts() works on single attributes only, so to handle the multi-variable case, the programmer must dig into Pandas's powerful split-apply-combine groupby functions. There is a problem with this, though: by default, these groupby functions automatically delete NA's from consideration, even as it's generally the case with frequencies that NA counts are desirable. What's the Pandas developer to do? There are several work-arounds that can be deployed. The first is to convert all groupby "dimension" vars to string, in so doing preserving NA's. That's a pretty ugly and inefficient band-aid, however. The second is to use the fillna() function to replace NA's with a designated "missing" value such as 999.999, and then to replace the 999.999 later in the chain with NA after the computations are completed. I'd gone with the string conversion option when first I considered frequencies in Pandas. This time, though, I looked harder at the fillna-replace option, generally finding it the lesser of two evils."

Subsequent to that posting, I detailed a freqs1 function for single-attribute dataframe frequencies and freqsdf for the multivariate case. The functions have worked pretty well for me on Pandas numeric, string, and date datatypes. Simple versions of these functions are included below.

Last summer I started experimenting with the Pandas categorical datatype.
Much like factors in R, "A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales." The categorical datatype can be quite useful in many instances, both in signaling to analytic functions a specific role for the attribute, and also in saving memory by storing integers instead of more consuming representations such as string.

Unfortunately, as I started to incorporate categorical attributes in my work, I found that my trusty freqsdf no longer worked, tripped up by the internal representation of the new datatype. So it was back to the drawing board to expand freqsdf functionality to handle categorical data. In the remainder of this blog, I present proofs of concept for two competing functions, freqscat and freqsC, that purport to satisfy all datatypes of freqsdf plus categorical attributes. Hopefully useful extensions, these functions should be seen as POC only.

The cells that follow exercise Pandas frequencies options for historical Chicago crime data with over 6.8M records. I first load feather files produced from prior data munging into Pandas dataframes, then build freqscat and freqsC functions from freqs1 and freqsdf foundations.

The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

Read the entire post here.
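The fillna-sentinel workaround, extended to categoricals, can be sketched as below. The function name `freqs` and the sentinel value are hypothetical, not the post's actual freqscat/freqsC implementations; the wrinkle it shows is that a categorical column must have the sentinel added as a category before fillna will accept it.

```python
import pandas as pd

def freqs(df, cols):
    """Hypothetical multivariate frequency sketch in the spirit of the post's
    freqsdf: preserve NA's via the fillna/sentinel workaround it describes."""
    work = df[cols].copy()
    for c in cols:
        if isinstance(work[c].dtype, pd.CategoricalDtype):
            # Categoricals reject values outside their categories, so the
            # sentinel must be registered as a category before filling.
            work[c] = work[c].cat.add_categories(["<missing>"]).fillna("<missing>")
        else:
            work[c] = work[c].astype(object).where(work[c].notna(), "<missing>")
    out = (work.groupby(cols, sort=False).size()
               .reset_index(name="count"))
    # Drop the zero-count category combinations categoricals can generate.
    return out[out["count"] > 0].reset_index(drop=True)

df = pd.DataFrame({
    "type": pd.Categorical(["THEFT", "BATTERY", None, "THEFT"]),
    "arrest": [True, False, False, None],
})
print(freqs(df, ["type", "arrest"]))
```

Every NA survives into the counts, so the frequencies sum to the number of rows -- the property the default groupby behavior breaks.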

March Madness officially arrived at 6 PM CDT, Sunday 3/17/2019. 68 D1 schools -- 32 league champions and 36 at-large selections -- received invitations to this year's tournament, which starts Tuesday. Today and tomorrow, fans will work feverishly preparing their brackets. Most will use intuition or simply guess to pick game results. Others like me, though, will use analytics to outsmart chance.

One sports analytics expert I've followed closely over the years is Ken Pomeroy. Pomeroy's developed a serious stable of statistical measures for college basketball and adopted a "freemium" business model that avails some content for free while withholding advanced goodies for subscription customers. An analytics geek can get more than she bargained for with KenPom.

2019's March Madness has piqued my interest in dataset building as well as statistics. In this blog, my challenge is one of data gathering/organizing/munging rather than analytics per se. My self-assigned task is to download 18 years of college hoops data -- 2002 through 2019 -- from the KenPom site and build a coherent dataset that can be analyzed in Python/Pandas.

The code in the remainder of this notebook assembles the dataset, starting from web scraping and advancing to manipulation/wrangling in Python/Pandas. The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, bs4 (BeautifulSoup) 4.6.0, NumPy 1.14.3, and Pandas 0.23.0.

The data are first scraped from the KenPom website using the Python requests library, then "liberated" from HTML using BeautifulSoup functionality. The resulting lists are subsequently wrangled using core Python, NumPy, and Pandas. In the end, 18 years of KenPom data are concatenated in a Pandas dataframe.

The complete blog can be read here.
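The scrape-then-liberate pipeline can be sketched offline as follows. The HTML snippet, table id, and column names below are invented stand-ins for the real KenPom markup, and the network step (`requests.get(url).text`) is replaced by a literal string so the sketch is self-contained.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented stand-in for the fetched page; in the post the markup comes from
# requests.get() against the KenPom site.
html = """
<table id="ratings">
  <tr><th>Team</th><th>AdjEM</th></tr>
  <tr><td>Virginia</td><td>+34.22</td></tr>
  <tr><td>Gonzaga</td><td>+32.94</td></tr>
</table>
"""

# "Liberate" the table from HTML: one list per row, header row first.
soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in soup.find("table", id="ratings").find_all("tr")]

# Hand the lists to Pandas and fix up types.
df = pd.DataFrame(rows[1:], columns=rows[0])
df["AdjEM"] = df["AdjEM"].astype(float)   # "+34.22" parses cleanly as float
print(df)
```

Running this per season and stacking the results with `pd.concat` yields the single 18-year dataframe the post ends with.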


Last time, I posted Part 2 of a blog trilogy on data programming with Python. That article revolved around showcasing NumPy, a comprehensive library created in 2005 that extends the Python core to accommodate "large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays." In addition to introducing a wealth of highly-performant new data structures and mathematical functions, NumPy changed the data programming metaphor in Python from procedural to specification.

Part 1 demonstrated basic data programming with Python. There, I resurrected scripts written 10 years ago that deployed core Python data structures, functions, and looping-like code to assemble a Python list for analyzing stock market returns. Fun, but a lot more work than performing the same tasks in Part 2.

This Part 3 post replicates the work done in Parts 1 and 2, using the even more productive Pandas library. In Pandas, core Python data structures such as lists/dictionaries and functionals like list comprehensions serve mainly to feed the Pandas beast.

Since the Part 3 code is simpler than the NumPy of Part 2, and much less involved than the list processing of Part 1, I've added a few graphs at the end, implemented with the productive Seaborn statistical data visualization library, built on top of Python mainstay matplotlib. Seaborn's grown by leaps and bounds recently and is now a legitimate competitor to R's ggplot2 for statistical graphics.

In the analysis that follows, I focus on performance of the Russell 3000 index, a Wilshire 5000-like portfolio for "measuring the market". I first download two files -- a year-to-date file and a history file -- that provide final Russell 3000 daily index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested.
I then wrangle the data using Pandas to get to the desired end state.

The technology used for all three articles revolves on JupyterLab 0.32.1, Anaconda Python 3.6.5, NumPy 1.14.3, and Pandas 0.23.0.

See the remainder of the blog here.
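The basic Pandas wrangling on such a file -- daily and cumulative returns from the index levels -- can be sketched as below. The levels and column names are invented stand-ins for the downloaded attributes the post describes (date, level without dividends, level with dividends reinvested).

```python
import pandas as pd

# Invented index levels mimicking the downloaded file's shape.
r3000 = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-03", "2005-01-04", "2005-01-05"]),
    "level_price": [650.12, 646.55, 644.10],    # without dividends reinvested
    "level_total": [1201.30, 1194.80, 1190.41], # with dividends reinvested
})

# Daily returns from consecutive levels; the first day has no prior level.
r3000["ret_total"] = r3000["level_total"].pct_change()

# Cumulative total return over the window: last level over first, minus 1.
cum_return = r3000["level_total"].iloc[-1] / r3000["level_total"].iloc[0] - 1
print(r3000[["date", "ret_total"]])
print(f"cumulative total return: {cum_return:.4%}")
```

The same pct_change column, grouped by year, is the natural input for the Seaborn summary graphs mentioned above.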
