Recently, Mark Ginsburg, a Director and senior data scientist with our firm, wrote a historical perspective on the importance of simplicity and transparency in data analytics. The push for transparency and free/open standards has long been a key component of modern computing. Likewise, within data science, clients increasingly demand openness and transparency in how data is turned into insights. Yet we still see vendors in the marketplace pushing 'proprietary' and 'black box' approaches to analytics. Mark offers a historical perspective on why taking this route is almost always a bad idea.
Below is an excerpt from Mark's piece:
Data Analytics: Keeping it Clean with a Nod to History
By: Mark Ginsburg, Ph.D.
In the Internet’s infancy, Unix shell commands such as ‘rm’, ‘mv’ and ‘cp’ were very terse. There was a good reason for this. The poor programmers had to work on so-called ‘teletypes’ (teletypewriters), and it took physical exertion to press the keys down! To make matters worse, the devices operated very slowly. For example, the ASR-33 teletype had an input or output rate of only ten characters a second. With this in mind, shell commands and editor commands (such as Ken Thompson’s ‘ed’) were kept very much to the point.
However, this is not to say that the Unix shell, as it evolved in the 1970s and onward, was feeble. It was actually a rich environment, since developers could combine the terse shell commands with pipes (the vertical ‘|’ symbol), which chain commands together, and with redirections such as ‘>’ and ‘<’ to, for example, send output to files or read input from files.
For example, consider a command that executes ‘ls -al’ on the left to list files and directories, then pipes the output of that command (see the left-most ‘|’ pipe symbol) as the input to a grep command with the pattern ‘^d’ to the first pipe’s right. The ‘^’ character in the grep pattern is regular expression notation for the start of the line. Thus, since directory entries begin with the letter ‘d’ in the listing, this command picks up directories only. The second pipe sends the output to the ‘head’ command with a ‘-5’ parameter, which keeps only the first five lines. Finally, those five directory entries are redirected, using the ‘>’ symbol, to the file named on the right.
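The pipeline described above can be sketched roughly as follows. The original command line is not reproduced here, so this is a reconstruction from the description rather than the author's exact text: the scratch directory, the sample directory names, and the output filename ‘top_dirs.txt’ are all hypothetical placeholders.

```shell
#!/bin/sh
# Work in a scratch directory so the listing is predictable
# (an assumption made only for this demonstration).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample content: four directories and two regular files.
mkdir alpha beta gamma delta
touch notes.txt data.csv

# 1. 'ls -al' lists all entries in long format.
# 2. grep '^d' keeps only lines that start with 'd' (directories,
#    including the '.' and '..' entries).
# 3. 'head -5' keeps the first five of those lines
#    ('head -n 5' is the equivalent modern spelling).
# 4. '>' redirects the result into a file; 'top_dirs.txt' is a
#    hypothetical name, not the one used in the original article.
ls -al | grep '^d' | head -5 > top_dirs.txt

# Show what was captured.
cat top_dirs.txt
```

Note that the listing includes the ‘.’ and ‘..’ entries, which also begin with ‘d’, so they count toward the five lines that ‘head’ keeps.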
If we had to describe the programming and functional environment, we could use words like ‘minimalistic’ and ‘flexible’ and ‘functional’.
The Unix shell and protocols such as “FTP” (File Transfer Protocol), “rsh” (remote shell) and, more recently, “ssh” (Secure Shell) and “HTTP” (Hypertext Transfer Protocol, the famous Web protocol that gained traction in the 1990s) grew in importance as the early Internet added compute nodes and interconnections. However, there was friction between the scientific community that used the nodes for research and nascent commercial interests. In fact, circa 1981 the National Science Foundation (NSF) enacted an “Acceptable Use Policy” (AUP) on its nationwide backbone to ban activities not in support of research or education. So, for a while (the NSFNET Backbone was eventually defunded in 1995), relatively pure research and education projects flowed over the various interconnections. The scientific community used the flexible and minimalistic shell environment to piece together a wide array of software, ranging from astrophysics to chemistry to biology and all areas in between.
As software and hardware became more sophisticated, research communities and their laboratories could accomplish more and more with computing power. However, in the laboratory too there was friction between scientific inquiry and commercial motives. A very well-known example came when a printer vendor would no longer supply the printer's source code to Richard Stallman’s MIT lab. This meant Stallman and his peers could no longer modify the printer software to do what they needed. This motivated Stallman to launch an initiative of worldwide importance, the Free Software movement.
On the project page of Stallman’s GNU operating system (launched in 1983), we read these important principles about free software:
“Free software” means software that respects users' freedom and community. Roughly, it means that the users have the freedom to run, copy, distribute, study, change and improve the software. Thus, “free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer”.
In a very real sense, scientific research has been greatly aided by both the free software movement and the minimalistic but flexible shell environment.
Notable early free software successes include Linux (actually placed under the GNU General Public License, or GPL, only in 1992), the Debian Linux distribution (1993), which explicitly committed to the Free Software Foundation (FSF) principles, and the Apache HTTP server. More recently, the MySQL database and the PHP Web scripting language have joined this list.
In any large enterprise, there are also conflicts between scientific inquiry and commercial interests. In the Big Data space, commercial interests can introduce to the unwary [read more...]