About a year ago, a young neighbor enrolled in an MS in Data Science program asked for my help with an R coding exercise. The challenge was to compute several new category attributes from columns in an initially loaded dataframe. His solution was to loop through each of the dataframe's rows, populating the new variables with basic if/then logic. It reminded me of how I might have coded a Fortran or PL/I program back in the day.
I of course dissuaded him from that approach, selling the merits of vectorized methods such as ifelse and data.table chaining. I also warned that one should almost never use a row-at-a-time looping idiom in R programming.
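The original exercise was in R, but the loop-versus-vectorization contrast carries over directly to Python/Pandas, the other language discussed in this post. Here is a minimal sketch with hypothetical data (the column and label names are my own, not the exercise's):

```python
import pandas as pd
import numpy as np

# Hypothetical data resembling the exercise: derive a category attribute
# from a numeric column.
df = pd.DataFrame({"score": [12, 55, 87, 40, 93]})

# Row-at-a-time loop -- the approach to avoid.
labels = []
for _, row in df.iterrows():
    labels.append("high" if row["score"] >= 50 else "low")
df["label_loop"] = labels

# Vectorized equivalent, analogous to R's ifelse.
df["label_vec"] = np.where(df["score"] >= 50, "high", "low")

print(df["label_loop"].equals(df["label_vec"]))  # → True
```

On a dataframe of any realistic size, the vectorized version is both shorter and dramatically faster than iterating row by row.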
A few months later, I pretty much had to eat those words when we discussed a new exercise in Python/Pandas to identify recession periods from a file of quarterly gross domestic product (GDP) figures. The data are organized sequentially by quarter, with a recession identified as two consecutive quarters of negative GDP growth. So, to determine a recession, the program had to work with the sorted file and have "memory" of lagged GDP changes, able to take backward glances from the current record. Once we agreed on an algorithm, he was able to code a solution -- one that loops in Pandas much as it would in R.
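The "backward glance" need not require an explicit loop, though: Pandas can shift a column to expose the lagged value. A sketch of that idea, using illustrative GDP figures I made up for the example (not actual data, and not his solution):

```python
import pandas as pd

# Hypothetical quarterly GDP levels; a recession quarter is one where
# both it and the preceding quarter show negative growth.
gdp = pd.DataFrame({
    "quarter": ["2008q1", "2008q2", "2008q3", "2008q4", "2009q1", "2009q2"],
    "gdp":     [14889.5, 14963.4, 14891.6, 14577.0, 14375.0, 14430.9],
})

# Quarter-over-quarter change.
gdp["growth"] = gdp["gdp"].diff()

# shift(1) provides the "memory" of the lagged change: a row is flagged
# when its growth and the prior row's growth are both negative.
gdp["recession"] = (gdp["growth"] < 0) & (gdp["growth"].shift(1) < 0)

print(gdp[["quarter", "recession"]])
```

With these numbers, the third and fourth negative-growth comparisons flag 2008q4 and 2009q1 as recession quarters; the NaN produced by diff() on the first row compares as False, so no special sentinel handling is needed.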
This type of control-break processing also resonated when I recently downloaded a file of closing levels of the daily S&P 500 stock index. Fueled by presidential claims that we'd just set a new S&P 500 high water mark, I was intrigued by how often this might have happened over a historical time frame. The daily levels from the Yahoo Finance file run from 1928 through the latest market closing date, so I had no shortage of data to examine.
My approach was to start from the earliest data and designate a "sentinel" or starting record, considering only the second row and beyond as candidates for high water mark closing levels. Each time a close exceeds the current "record" level, its location is saved and it becomes the new high water mark against which subsequent closes are compared.
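My actual code is in R (see the linked blog), but the sentinel logic can be sketched vectorized in Pandas: shifting the running maximum by one row leaves NaN at the sentinel, so the first record can never qualify. The dates and closing figures below are hypothetical stand-ins, and the column names are my own, not the Yahoo Finance schema:

```python
import pandas as pd

# Illustrative closing levels -- hypothetical figures, not the real data.
sp = pd.DataFrame({
    "date":  pd.to_datetime(["1928-01-03", "1928-01-04",
                             "1928-01-05", "1928-01-06"]),
    "close": [17.76, 17.72, 17.90, 17.66],
})

# A row is a new high water mark when its close exceeds the maximum of
# all prior closes. cummax().shift(1) is NaN on row one, so the sentinel
# row compares False automatically.
sp["high_water"] = sp["close"] > sp["close"].cummax().shift(1)

print(sp.loc[sp["high_water"], ["date", "close"]])
```

Here only the third row qualifies: it is the first close to exceed every close before it.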
The code identifying the 1,198 high water marks found is detailed below. There are obviously no such levels identified prior to 1928, and the closing figures analyzed are simply market prices, without adjustment for dividends.
I considered several variants of record-at-a-time looping in R -- the first a for loop over the data's rows, the second using the functional map construct. I found a lot of good ideas in this analysis.
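The two variants can be illustrated side by side. Again this is a Python sketch of the idea rather than my R code (which uses a for loop over nrows and the map construct), with made-up closing levels:

```python
from itertools import accumulate

closes = [17.76, 17.72, 17.90, 17.66, 18.05]

# Variant 1: explicit for loop with a sentinel, mirroring
# record-at-a-time processing.
high_water_loop = []
sentinel = closes[0]
for c in closes[1:]:
    if c > sentinel:
        high_water_loop.append(c)
        sentinel = c

# Variant 2: functional construct -- build the running maximum with
# accumulate, then compare each close to the max of everything before it.
running = list(accumulate(closes, max))
high_water_fun = [c for c, prev in zip(closes[1:], running[:-1]) if c > prev]

print(high_water_loop == high_water_fun)  # → True
```

Both variants find the same record levels; the functional version trades the mutable sentinel variable for a precomputed running maximum.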
The technology used is Windows 10 with JupyterLab 0.35.4, and R 3.6.0, along with R packages data.table 1.12.2, tidyverse 1.2.1, and tidyquant 0.5.8.
Read the entire blog here.