Tutorial Day at Strata Data San Jose, 2018

Tuesday of Strata Data Conference is my favorite of the four days. The calm before the storm of Wednesday and Thursday's keynotes and short presentations, Tuesday revolves around half-day training sessions that afford reasonably deep dives into technical data science topics. This year my choices were "Using R and Python for scalable data science, machine learning, and AI" in the morning and "Time series data: Architecture and use cases" in the afternoon.

I was somewhat wary going into the first session, since the presenters were from Microsoft, which markets a commercial version of open source R as well as Azure, its comprehensive set of cloud services that competes with AWS. My concern was that the technology presented would be geared toward Microsoft-only solutions that wouldn't generalize and would thus be of limited value in non-Microsoft environments. It turns out I was both right and wrong: yes, the solutions revolved around Azure and used Microsoft extensions to R and Python; but at the same time, the material was of significant value for a non-Microsoft-committed developer like myself.

The presentations were on featurization and active supervised learning for data sets in which only a small percentage of records have final "y" labels. In these cases, labeling is expensive and otherwise painful, so labels are built up over time and used to construct training data in dribs and drabs. Two interesting use cases, one in text classification and one in image classification, were comprehensively reviewed.
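To make the label-scarce setup concrete, here's a minimal Python sketch of pool-based active learning with uncertainty sampling, assuming scikit-learn. The synthetic data, model, seed size, and batch size are my own illustrative choices, not the presenters' code.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Synthetic data and parameters are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Start with a tiny seed of labeled examples; the rest form the unlabeled pool.
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.RandomState(0).choice(len(X), 50, replace=False)] = True

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])
    pool_idx = np.flatnonzero(~labeled)
    probs = model.predict_proba(X[pool_idx])
    # Uncertainty sampling: query the pool points the model is least sure about.
    uncertainty = 1 - probs.max(axis=1)
    query = pool_idx[np.argsort(uncertainty)[-25:]]
    labeled[query] = True  # in practice these would go to a human labeler

print(f"labeled {labeled.sum()} of {len(X)} examples")
```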

The development environments were Jupyter Notebook for Python/Spark and RStudio/R Markdown for R/Spark. The code presented was comprehensive, covering both the data build and the machine learning processes. While I wasn't as hands-on as others during the session, I was able to follow the thinking, and will download the code and run it on Azure later.

This was a very solid tutorial by senior data scientists from Microsoft. Not only were they knowledgeable presenters, but they also covered the large classroom well, handling live technical issues as they arose. Two thumbs up.

I had the opportunity to take a time series analysis class in grad school with the venerable George Box back in the day. And in the 40 years I’ve been in the data/statistics work world since, I’d guesstimate that 80% of my effort has been devoted to assembling data, with 80% of the remaining statistics work given to forecasting.

I must acknowledge that I sometimes use time series and forecasting interchangeably but, as confirmed this afternoon, the two are quite different. As I now appreciate, time series in its most general usage is much broader than forecasting, having to do with data assembly, management, and analytics.

So the time series session was more about the latest storage technologies such as S3 and HDFS, streaming products like Apache Flink, Apache Kafka, Apache Storm, and Spark Streaming, and access engines such as Cassandra and Spark SQL than it was about specific forecasting algorithms.
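For flavor, here's a minimal PySpark Structured Streaming sketch of the kind of pipeline the session covered: events read from a Kafka topic, bucketed into tumbling event-time windows, and counted. The broker address, topic name, and window sizes are placeholders of my own, not the instructor's examples.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> windowed counts.
# Requires the spark-sql-kafka connector on the classpath; names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

counts = (events
          .withWatermark("timestamp", "10 minutes")      # tolerate late-arriving events
          .groupBy(F.window("timestamp", "5 minutes"))   # tumbling event-time window
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```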

But that was no problem for me. With a background in R, Python, SQL, and Spark, as well as the time series data management packages in R, I was able to follow along seamlessly, especially during the first two hours, which were devoted to the logic of time series concepts like lead and lag, tumbling windows, sliding windows, session windows, event time vs. processing time, and inflection points.
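For anyone newer to those ideas, a small pandas sketch illustrates lag/lead and tumbling vs. sliding windows on a synthetic hourly series; the series, column names, and window sizes are my own choices, not material from the session.

```python
# Minimal pandas sketch of lag/lead, tumbling windows, and sliding windows
# on a synthetic hourly series.
import numpy as np
import pandas as pd

idx = pd.date_range("2018-03-06", periods=48, freq="H")
s = pd.Series(np.random.default_rng(0).normal(size=48), index=idx, name="value")

lag1 = s.shift(1)                    # lag: the prior observation
lead1 = s.shift(-1)                  # lead: the next observation
tumbling = s.resample("6H").mean()   # tumbling window: non-overlapping 6-hour buckets
sliding = s.rolling("6H").mean()     # sliding window: trailing 6 hours at every point

print(pd.DataFrame({"value": s, "lag1": lag1, "lead1": lead1,
                    "sliding_6H": sliding}).head(8))
print(tumbling.head())
```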

I did struggle with configuring the hands-on Kafka and Spark Streaming tutorials, but what the heck: I'm more on the data/analytics side of data science than on the administration/app dev side. Students implementing on streaming platforms in their current jobs hit the technical mother lode with instructor Ted Malaska. Malaska is very accomplished, if a bit disorganized at times. He was at his best discussing specific technologies like Kafka and Spark and answering participant questions. All in all a quite productive session, perhaps half a step below the first. 1.5 thumbs up.