Contributed by Avi Yashchin and Joseph Lee. They took NYC Data Science Academy's 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on their second class project (due in the 4th week of the program).
Our project is centered on the development of an open source workbench that provides data scientists with automated tools for exploratory analysis and model selection. The full stack is built in R, a statistical programming language. Before getting into the low-level details, let's take a step back and think about the trending term "Data Science."
It seems that everywhere you turn these days there’s someone starting a “Data Science” company. Are you a PhD dropout from Berkeley? Start a Data Science company. Are you a programmer who knows how to use MongoDB? Start a Data Science company. Did you study English at Yale? That’s right - Data Science. The good people at cbinsights.com made a chart about venture investment in AI over the last 5 years. What gives, and why now?
There’s been a lot of attention on data science platforms and workbenches that attempt to improve the data scientist’s workflow or allow non-data-scientists to perform data science through an immersive user interface. We’re going to show you how to build your very own, open source, machine learning workbench in R. Please steal it.
Everyone has seen some form of the chart below showing computer processing power rising exponentially. While the fast CPU has driven many innovations, the inexpensive CPU is not the critical factor in machine learning.

CPU Prices:

Hard Disk Space:

Memory Prices:

[Figures: the collapse in prices of hard disk space, memory, and network capacity; the negative trend between year and memory price.]

Network Prices:

It’s not just CPUs that are dropping in price, but *every part of the PC*. Distributed machine learning algorithms depend as much on memory, network speed, and, to a smaller degree, hard disk speed as they do on CPU speed. It’s the aggregation of multiple exponential trends that is democratizing access. Many people think that, despite these price drops, tools like AWS are still “too expensive.” This couldn’t be further from the truth, so let's explore AWS pricing.

Old model: Buying “Big Iron”
Modern Alternative - Renting from “The Cloud”
Here’s an AWS pricing list as of 11/20/15. The critical factor here is the $0.126-per-hour pricing. Assuming that you live in the Northeastern corridor, California, or the Midwest, and assuming that your computer draws at least 1 kW, renting server space from Amazon is less expensive than just paying for your computer's electricity in your home state. I live in NYC, pay my own electricity, and was able to save money by moving my computation demand onto AWS.

[Figure source: http://goanuj.freeshell.org/e/index.html]

With AWS, you can get a server with up to 244 GB of main memory and up to 40 CPUs; no more limitations from hardware and computation time for your R-based analyses. While hardware price reductions are nice, we will see that machine learning software prices have collapsed even further.
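A quick back-of-the-envelope check of that claim. The AWS rate is the one quoted above; the workstation power draw and residential electricity rate are assumed figures, not quotes for any particular state:

```r
# Back-of-the-envelope cost comparison for one month of continuous use.
aws_rate <- 0.126                 # USD per instance-hour, from the AWS list above
hours    <- 24 * 30               # hours in a 30-day month
aws_monthly <- aws_rate * hours   # 0.126 * 720 = 90.72 USD

power_kw  <- 1                    # assumed draw of a home workstation, in kW
elec_rate <- 0.18                 # assumed residential USD/kWh (varies by state)
home_monthly <- power_kw * hours * elec_rate  # 1 * 720 * 0.18 = 129.60 USD

home_monthly > aws_monthly        # electricity alone can exceed the rental
```

At lower electricity rates the comparison tightens, but the point stands: for always-on workloads the rental is in the same ballpark as the power bill.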
Open source machine learning libraries have been a revolution for the machine learning community. The once obscure and specialized topics of machine learning and statistical learning can now be leveraged by a much larger demographic, globally. Machine learning workbenches have been implemented in both academia and industry. The following is a short list of popular platforms.

Old Software Pricing:

SAS Enterprise Miner ($140,000 for the first year)
IBM SPSS Statistics ($51,000 per user)
Alteryx Server ($58,500 per user)
H2O.ai, Dataiku DSS, etc. ($10,000 per user per year)

You get the idea. The worst part about the SAS/IBM/H2O prices is that much of the software that runs these machine learning libraries is open source to begin with. These companies have a business model of taking freely available open source tools, building a GUI on top of the system, and charging tens of thousands of dollars per year for support. Weka’s software design is centered on Java. Dataiku DSS uses primarily Java, Python, and a kernel design compatible with multiple languages, including R. H2O is Java and R. IBM’s software is built on SPSS, an older system that was originally punch-card based. Most of these expensive products have proprietary model formats and data cleaning requirements, making interoperability and portability of code a near impossibility.

New Software Pricing:

Scikit-Learn - Python libraries for machine learning (Free)
Weka - Java libraries for machine learning (Free)
TensorFlow (Google) - open source machine learning (Free)
FAIR (Facebook) - open source machine learning (Free)
R, Python, Spark, Hadoop, caret (Free)

The machine learning servers and tools that used to be exclusively the domain of hedge funds, Fortune 1000 companies, and large drug manufacturers are now accessible to anyone.
The data science workbench tool that we built in about a week is meant to illustrate how easy it is to duplicate the features of the more expensive institutional packages using completely free software. Our app is also only a few hundred lines of code, something that should be relatively easy for an enterprise to maintain.
[embed width="560" height="315"]https://www.youtube.com/watch?v=_TRu2qHcxKs[/embed] Let’s now delve into the app we made. As we mentioned before, we wanted to make an open-source application in keeping with the growth of data science startups. Using a combination of Shiny, caret, and other great open source tools, we made a fairly workable platform that can perform basic data analysis, preprocessing, modeling, and validation. We spent only three days developing this app, so there are surely many bugs and glitches in the code. Keep in mind that our main intention was to create a functional prototype to showcase a small fraction of the creative possibilities available to us through the open source community.
We used RStudio as the main IDE for our app; the R console works fine as well. This blog will give a general overview of our development process. The code is available here if you wish to play around with it and learn more about our full stack. To begin, we created two blank R files, ui.r and server.r, within a new project directory or folder. You can name this folder anything. For the ui.r file you will need to install and load the packages below.
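The original post embedded its package chunk as a code block; as a minimal sketch, assuming the shinydashboard-based layout described later, ui.r needs roughly:

```r
# ui.r -- packages for the interface layer
# (assumed list, not the original post's exact chunk)
library(shiny)           # core reactive web framework
library(shinydashboard)  # dashboard layout, icons, and widgets for the UI
```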
In the server.r file, you will need to install and load the following packages as well.
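Again as an assumed sketch, based on the features described later in the post (caret modeling and lattice feature plots), server.r needs something like:

```r
# server.r -- packages for the server logic
# (assumed list, inferred from the app's features; not the original chunk)
library(shiny)
library(caret)    # unified training interface for KNN, LogitBoost, gbm, nnet
library(lattice)  # feature plots for the graphing sub-feature
```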
Once these packages are loaded and declared in their respective files, we can proceed with the UI phase. For those new to Shiny and R, we recommend playing around with these tutorials for an introduction to Shiny and a better understanding of the relationship between the ui.R and server.R files. You can find the link to the main Shiny page here.
In order to make a visually appealing and straightforward interface design, we implemented the shinydashboard package, which has very straightforward documentation that can be found here. The shiny package comes with a large pool of high quality icons, CSS themes, and other Bootstrap-quality elements. I recommend following the shiny dashboard tutorials and then cross-referencing what you learn with our code to get the most out of this blog post.

The server.r code was trickier than the UI for a couple of reasons: reactive features are a must for creating an interactive Shiny app. The main Shiny blog does a fantastic job explaining dynamic and reactive scripting in R, so I will leave the explanation to them in this link. Essentially, reactive functions in Shiny mean that you are creating smaller “pseudo-functions” that automatically receive user input when the user interacts with features such as a check-box or slider. We wanted reactive functionality to allow users to customize their tuning parameters for the modeling part of the app.

Due to the size of the project, we won't be going into the details of the code. However, if enough requests are made, we may consider creating a tutorial blog post that goes more in depth into the development of our app. You will find the main files in this blog post below. Again, feel free to access our GitHub if you wish to play around with our app and source code. UI.R
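The reactive pattern described above can be sketched in a few lines. This is a hypothetical minimal example, not a piece of our app: a slider feeds a reactive "pseudo-function" that re-runs automatically whenever the user moves the control.

```r
library(shiny)

ui <- fluidPage(
  # A tuning-parameter control, as in the modeling part of the app
  sliderInput("k", "Number of neighbors (k):", min = 1, max = 15, value = 5),
  verbatimTextOutput("choice")
)

server <- function(input, output) {
  # reactive() builds the "pseudo-function"; it re-evaluates on any
  # change to input$k, and everything that calls it updates in turn.
  tuning <- reactive({ paste("Model will use k =", input$k) })
  output$choice <- renderText({ tuning() })
}

# shinyApp(ui, server)  # launch from an interactive R session
```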
Feel free to visit our GitHub to access server.R.
Our app has three main parts. The first part is the data preparation feature. You can upload almost any CSV file and the app will automatically perform missingness analysis. We mainly used the iris.csv data set as the main test set. The second part is the analysis feature, which contains three sub-features: preprocessing, feature graphing, and modeling. For the preprocessing sub-feature, we kept the options minimal and included cross validation, PCA, and ICA options that can be activated through Shiny widgets. The feature graphing is a simple graphing application that produces a lattice plot of all features in the data set. For the modeling sub-feature, we made four algorithms available for the user to choose and tune: KNN, LogitBoost, gradient boosting machines, and neural networks. All of the modeling algorithms come from the caret package, which is a fantastic package to familiarize yourself with if you want to pursue more machine learning applications in R. The third part of the app is the results feature. After selecting the models to run on the uploaded data set, you can see and compare results from the different models on one page. The demo video above provides a visual overview of how to operate the Shiny application. Feel free to reach out to us if you have any questions or comments about the code. Again, this is completely open source, so please take it and play around with it!
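As a sketch of the caret calls behind the modeling and results features, here is the same workflow on the iris data the post tests with. The tuning values and fold count are illustrative, not the app's defaults:

```r
library(caret)

data(iris)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross validation

# KNN, one of the four algorithms the app exposes; the slider-style
# tuning the UI offers corresponds to a tuneGrid like this one.
knn_fit <- train(Species ~ ., data = iris, method = "knn",
                 trControl = ctrl, tuneGrid = data.frame(k = c(3, 5, 7)))

# Preprocessing options such as PCA are requested the same way
knn_pca <- train(Species ~ ., data = iris, method = "knn",
                 trControl = ctrl, preProcess = "pca")

# Compare resampled accuracy across fitted models, as the results page does
print(summary(resamples(list(knn = knn_fit, knn_pca = knn_pca))))
```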
We are living in an exciting era of data science. Looking at the open source resources available to any data enthusiast, it's easy to see how so many startups are gaining traction in industry. Everything you need to start a data science startup is only a few keystrokes away. For fun, we loaded our app onto an AWS instance and compared its runtime with a local instance of the app. We found that our app was roughly 10x faster on an AWS instance than on my local machine (MacBook Air, 8 GB RAM, 512 GB SSD). This shows that anyone can create a budget-friendly data science startup as long as you are creative and determined to see it through. Thank you for reading our blog post, and as a bonus, please enjoy our deep learning art below!