A few months ago, I wrote a quite positive blog on the Julia analytics language, reveling in its MIT pedigree, its comprehensible structure, and its interoperability with data science stalwarts R and Python. I demonstrated some of Julia's capabilities with a skinny data set of daily stock index values, and I showed it collaborating with R's powerful ggplot graphics subsystem.
This time around, I decided to test Julia against a much meatier data set -- one that I've already examined extensively with both R and Python. I find time and again that it's critical to push analytics platforms with sizable data to uncover their strengths and weaknesses.
The data set I use here consists of Chicago crime records from 2001-2018 in a csv file posted for download each morning. At present, it consists of over 6.7M records and 20+ attributes on the what, where, and when of all crimes logged by the Chicago Police Department. My tests revolve around loading the "master" data, then enhancing it with lookup tables describing Chicago communities and the classifications of crimes. From there, I tally multi-attribute frequencies on type, location, and time -- ultimately graphing the results.
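To give a flavor of the multi-attribute tallies, here's a minimal sketch in the Julia 1.0-era DataFrames API, using a tiny made-up stand-in for the crime data (the real file has millions of rows and different column names):

```julia
using DataFrames

# Toy stand-in for the joined crime data -- columns are illustrative only.
crimes = DataFrame(
    primary_type = ["THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY"],
    year = [2017, 2018, 2017, 2017, 2018],
)

# Multi-attribute frequencies: one row per (type, year) cell.
freqs = by(crimes, [:primary_type, :year], d -> DataFrame(count = size(d, 1)))
sort!(freqs, [:primary_type, :year])
```

The `by(df, cols, f)` idiom plays the role of Pandas' `groupby(...).size()`, applying a function to each group and stacking the results.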
R, with its data.table/tidyverse data management/analysis ecosystem, and Python, with Pandas, have met the challenges with aplomb. Notebooks in both have been easy to construct and plenty fast. How would Julia, a much less mature competitor, stack up?
As an adolescent language, Julia is a bit of a moving target, making the development process somewhat slower than with R and Python. I was able to do what I needed to, though, adopting a similar development strategy of driving from dataframe/datatable packages. Stack Overflow is my best development friend, though, not surprisingly, it's a bit more helpful with R and Python than with Julia.
So what did I find? With the exception of several annoyances, such as the absence of a vectorized "in" operator, I was pretty much able to mimic in Julia the programming style I used with Python/Pandas. In fact, Julia was somewhat more facile than Pandas with "by group" processing, as its grouping functions treat missing values as a group of their own, unlike Pandas, which silently drops them.
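Both quirks are easy to demonstrate. Julia has no vectorized `in` akin to R's `%in%`, but broadcasting gets the same effect (the `Ref` wrapper keeps the set from being iterated element-wise), and grouping keeps `missing` as its own group -- a toy sketch:

```julia
using DataFrames

# Workaround for the missing vectorized "in": broadcast `in` over the vector.
types = ["THEFT", "BATTERY", "ARSON", "THEFT"]
violent = Set(["BATTERY", "ASSAULT", "HOMICIDE"])
mask = in.(types, Ref(violent))     # Bool vector, true only for "BATTERY"

# Grouping acknowledges missing values: `missing` gets its own group,
# where Pandas' groupby would drop those rows.
df = DataFrame(area = ["North", missing, "North"], n = 1:3)
by(df, :area, d -> DataFrame(rows = size(d, 1)))
```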
What disappointed me, though, was Julia's performance relative to R and Python. I think I'm being charitable in noting that the tests I ran in both R and Python run at least twice as fast as their Julia comparables. And, of course, the expectation is that Julia should be faster. So I guess I'm a bit despondent after my second date with Julia -- but not, like some, ready to give up just yet.
The code that follows first downloads the csv file from the Chicago Data Portal. It then reads the data into a Julia dataframe, joining that in turn with several lookup tables -- much as one would do with an RDBMS. I then run a number of frequency queries involving type, time, and place. Finally, and perhaps most gratifyingly, I use R's ggplot, exposed through the RCall package, to graph the frequency results. The code cells follow.
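In skeleton form, the whole flow looks something like the sketch below, written against the Julia 1.0 / DataFrames 0.14-era APIs the post uses. The URL, the lookup filename, and the column names are placeholders, not the actual portal endpoints:

```julia
using CSV, DataFrames, RCall

# Placeholder URL for the portal's daily csv export.
url = "https://data.cityofchicago.org/.../rows.csv"
fname = download(url, "chicagocrime.csv")

crimes = CSV.read(fname)                     # the "master" data
communities = CSV.read("communities.csv")    # illustrative lookup table

# RDBMS-style left join on the community-area key.
crimes = join(crimes, communities, on = :community_area, kind = :left)

# A frequency query, then off to R's ggplot via RCall.
freqs = by(crimes, :primary_type, d -> DataFrame(count = size(d, 1)))
@rput freqs
R"""
library(ggplot2)
ggplot(freqs, aes(x = reorder(primary_type, count), y = count)) +
  geom_col() + coord_flip()
"""
```

`@rput` copies the Julia dataframe into the embedded R session, so the `R"""..."""` block can treat it as an ordinary R data frame.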
The software used is Julia 1.0.0, Python 3.6.5, Microsoft Open R 3.4.4, and JupyterLab 0.32.1.
Read the entire blog here.