
Contributed by Avi Yashchin and Joseph Lee. They attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on their second class project (due in the 4th week of the program).

Collaborators:

Joseph Lee

Avi Yashchin

 

Introduction

Our project centers on the development of an open source workbench that provides data scientists with automated tools for exploratory analysis and model selection.  The full stack is written in R, a statistical programming language. Before getting into the low-level details, let's take a step back and think about the trending term “Data Science.”

Part1:

The Startup Phenomenon:  Making the next Netflix of Machine Learning, Uber of Data Modeling, or Chipotle of Data?

It seems that everywhere you turn these days there’s someone starting a “Data Science” company.  Are you a PhD dropout from Berkeley? Start a Data Science company. Are you a programmer who knows how to use MongoDB? Start a Data Science company.  Did you study English at Yale? That’s right - Data Science.  The good people at cbinsights.com made a chart about Venture Investment in AI over the last 5 years:

[Figure: image01]

What gives, and why now?

  1. Better Hardware: The advance of Moore’s Law has radically reduced Memory, Networking, and Data Storage costs.
  2. Better Software: Open Source tools provide free tools to train models of any flavor.  Previous statistics packages used to require expensive contracts with private companies.
  3. Better Algorithms: Feature Engineering used to be a large part of the data science experience.  New Algorithms are able to learn the most predictive features automatically, without transformations or assumptions about the correct distribution of data needed.
  4. Education and Training: You can learn Data Science in R and Python for free on Coursera, and learn Big Data tools like Apache Spark for free on edX.  Seriously.

There’s been a lot of attention on data science platforms and workbenches that attempt to improve the data scientist’s workflow or allow non-data-scientists to perform data science through an immersive user interface.  We’re going to show you how to build your very own, open source, machine learning workbench in R.  Please steal it.

The Hardware Perspective 

Everyone has seen some form of the chart below, with computer processing power rising exponentially.  While the fast CPU has driven many innovations, the inexpensive CPU is not the critical factor in Machine Learning.

CPU Prices: [Figure: The collapse in prices of Hard Disk Space, Memory, and Network Capacity.]

Hard Disk Space: [Figure: Hard Drive Cost Chart]

Memory Prices: [Figure: Negative trend between the year and memory price. Source: http://www.slideshare.net/IPExpo/slideshare1-0950-pathfinder]

Network Prices: [Figure. Source: https://mentaleffort.wordpress.com/tag/technical-debt/]

It’s not just CPUs that are dropping in price, but *every part of the PC*.  Distributed Machine Learning algorithms depend on memory, network speed, and to a smaller degree hard disk speed, as much as they rely on CPU speed.  It’s the aggregation of multiple exponential trends that is democratizing access.  Many people think, despite these price drops, that tools like AWS are still “too expensive.”  This couldn’t be further from the truth, so I want to explore the AWS pricing.

Old model:  Buying “Big Iron”

  • Buy or Rent Mainframe Computers from IBM, Unisys, BMC, etc
  • Buy a large network of custom servers, only use them for a few days a year while modeling. Try to sell excess server time to, umm ... Pixar?  Maybe the Weather Channel.
  • Huge excess capacity and large upfront costs.

Modern Alternative - Renting from “The Cloud”

  • AWS Reserved instances cost less than 13 cents per hour.  Spot instances can be acquired for pennies per hour.
  • Train your models, then *shut down your servers*.
  • No excess capacity, no upfront costs.

Here’s an AWS pricing list as of 11/20/15.  The critical factor here is the $0.126 Per Hour Pricing.

[Figure: Amazon Pricing List]

Assuming that you live in the Northeast corridor, California, or the Midwest, and assuming that your computer draws at least 1 kW (that is, 1 kWh for every hour it runs), renting server space from Amazon is less expensive than just paying for computer electricity in your home state.  I live in NYC, pay for my own electricity, and I was able to save money by moving my computation demand onto AWS.

[Figure. Source: http://goanuj.freeshell.org/e/index.html]

With AWS, you can get a server with up to 244 GB of main memory and up to 40 CPUs; your R-based analyses are no longer limited by hardware and computation time.  While hardware price reductions are nice, we will see that Machine Learning software prices have collapsed even further.
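The break-even arithmetic behind that comparison can be sketched in a few lines of R. Only the $0.126 hourly price comes from the pricing list above; the power draw and the electricity rate are assumptions chosen for illustration, so substitute the numbers from your own utility bill.

```r
# Only aws_price_per_hour comes from the post; the other two inputs are
# ASSUMPTIONS for illustration.
aws_price_per_hour <- 0.126  # USD per hour, from the AWS pricing list
power_draw_kw      <- 1.0    # assumed power draw of the home machine, in kW
electricity_rate   <- 0.20   # assumed retail electricity price, USD per kWh

# Energy used in one hour is power (kW) times 1 hour, so the hourly cost is:
home_cost_per_hour <- power_draw_kw * electricity_rate

# Electricity price at which running at home costs the same as renting.
breakeven_rate <- aws_price_per_hour / power_draw_kw

cat(sprintf("Home electricity cost: $%.3f per hour\n", home_cost_per_hour))
cat(sprintf("AWS instance cost:     $%.3f per hour\n", aws_price_per_hour))
cat(sprintf("Break-even rate:       $%.3f per kWh\n", breakeven_rate))
```

Under these assumptions, renting wins whenever the local electricity rate is above the break-even rate. A machine that draws less power raises the break-even rate proportionally.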

The Software Perspective

Open Source machine learning libraries have been a revolution for the machine learning community.  The once obscure and specialized topics of machine learning and statistical learning can now be leveraged by a much larger demographic, globally. Machine learning workbenches have been implemented in academia and industry.  The following is a short list of popular platforms.

Old Software Pricing:

  • SAS Enterprise Miner ($140,000 for the first year)
  • IBM SPSS Statistics ($51,000 per user)
  • Alteryx Server ($58,500 per user)
  • H2o.ai, Dataiku DSS, etc ($10,000 per user per year)

You get the idea. The worst part about the SAS/IBM/H2o prices is that much of the software that runs these machine learning libraries is open source to begin with.  These companies have a business model of taking freely available open source tools, building a GUI on top of the system, and charging tens of thousands of dollars per year for support. Weka’s software design is centered on Java. Dataiku DSS uses primarily Java, Python, and a kernel design compatible with multiple languages including R.  H2o is Java and R.  IBM’s software is built on SPSS, an older programming language that was originally punch-card based. Most of these expensive products have proprietary model formats and data cleaning requirements, making interoperability and portability of code nearly impossible.

New Software Pricing:

  • Scikit-Learn - Python libraries for Machine Learning (Free)
  • Weka - Java libraries for Machine Learning (Free)
  • TensorFlow (Google) Open Source Machine Learning (Free)
  • FAIR (Facebook) Open Source Machine Learning (Free)
  • R, Python, Spark, Hadoop, Caret (Free)

The Machine Learning servers and tools that used to be exclusively the domain of Hedge Funds, Fortune 1000 companies, and large drug manufacturers are now accessible to anyone.

The data science workbench tool that we built over a week is meant to illustrate how easy it is to duplicate the features of the more expensive institutional packages, using completely free software.  Our code is also only a few hundred lines long, something that should be relatively easy for an enterprise to maintain.

The Algorithms Perspective

Our project is focused on the full stack implementation of R in order to not only explore its computational nuances in a data science setting, but also explore how R’s UI capabilities can contribute to a positive workflow and user experience.

Old design paradigms for Machine Learning required a developer to learn many different modeling packages, usually written by different people, with inconsistencies in how models are specified and predictions are made. In the past, each row of the table below was a completely different workflow.  Each Model Class had its own input, format, and tuning parameters. Running a single data set through multiple models used to require hundreds of lines of code.

Model Class   Package   Native predict Syntax
lda           MASS      predict(obj) (no options needed)
glm           stats     predict(obj, type = "response")
gbm           gbm       predict(obj, type = "response", n.trees)
mda           mda       predict(obj, type = "posterior")
rpart         rpart     predict(obj, type = "prob")
Weka          RWeka     predict(obj, type = "probability")
LogitBoost    caTools   predict(obj, type = "raw", nIter)

However, there is an R-based Open Source Alternative - Caret, which has standardized model tuning.  We’ve been calling this the “Scikit-Learn” for R.  The Caret function syntax is a dream to work with, and anyone can create, tune, and compare the results from multiple models with ease.  Caret is 100% free.

Caret Homepage

The Open Source Alternative - Shiny

Shiny is not the only new tool for computer visualizations, but it is a fully functional web app development package that can turn R code directly into an interactive frame without the need to know JavaScript or HTML.  The Shiny package is compatible with many other interfaces including Google Viz, Tableau, matplotlib, bokeh (and a ton of others).  With R and Shiny, you can set up a webserver and provide visualization tools to your BI teams in real time.  Did we mention this tool is free?
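To make the earlier table of predict() calls concrete, here is a small sketch that uses only packages shipped with a standard R installation (stats, MASS, rpart). The two-class subset of iris and the choice of two predictors are our own assumptions for illustration; the point is only that each package hands back class probabilities through a different call and in a different shape.

```r
library(MASS)   # lda(); a recommended package that ships with R
library(rpart)  # rpart(); also a recommended package

# Two-class subset of iris so that glm() can fit a logistic regression.
two_class <- droplevels(subset(iris, Species != "setosa"))

fit_lda   <- lda(Species ~ Sepal.Length + Sepal.Width, data = two_class)
fit_glm   <- glm(Species ~ Sepal.Length + Sepal.Width, data = two_class,
                 family = binomial)
fit_rpart <- rpart(Species ~ Sepal.Length + Sepal.Width, data = two_class)

# Same question ("what are the class probabilities?"), three different calls.
p_lda   <- predict(fit_lda, two_class)$posterior           # matrix inside a list
p_glm   <- predict(fit_glm, two_class, type = "response")  # vector, P(2nd level)
p_rpart <- predict(fit_rpart, two_class, type = "prob")    # matrix

# Manual glue code is needed to line the three results up side by side.
prob_virginica <- cbind(
  lda   = p_lda[, "virginica"],
  glm   = p_glm,
  rpart = p_rpart[, "virginica"]
)
print(head(prob_virginica))
```

With caret, models fitted through train() share one predict() method that accepts type = "prob", which removes the glue code in the last step.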
We’re broke students, and our classroom is at WeWork, where we get free coffee and beer.  Shiny and R fit right in.  

Part2:

Our Data Science App

Demo video: https://www.youtube.com/watch?v=_TRu2qHcxKs

Let’s now delve into the app we made.  As we mentioned before, we wanted to make an open-source application in keeping with the growth of data science startups.  Using a combination of Shiny, caret, and other great open source tools, we made a fairly workable platform that can perform basic data analysis, preprocessing, modeling, and validation.  We spent only three days developing this app, so there are surely many bugs and glitches in the code.  Keep in mind that our main intention was to create a functional prototype to showcase a small fraction of the creative possibilities available to us through the open source community.

Brief Tutorial

We used RStudio as the main IDE for our app.  The R console works fine as well.  This blog will give a general overview of our development process.  The code is available here if you wish to play around with it and learn more about our full stack.

To begin, we created two blank R files, ui.r and server.r, within a new project directory or folder.  You can name this folder anything.  For the ui.r file you will need to install and load the packages below.


library(shiny)
library(shinyIncubator)
library(shinydashboard)

In the server.r file, you will need to install and load the following packages as well.


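# Shiny and the modeling framework
library(shiny)
library(caret)
# Model packages
library(e1071)
library(randomForest)
library(nnet)
library(glmnet)
library(gbm)
# Missing-data imputation and visualization
library(mice)
library(VIM)
# ICA preprocessing and descriptive statistics
library(fastICA)
library(pastecs)
# Charts and supporting functions
library(googleVis)
library(PASWR)
# Multicore parallel back end
library(doMC)
# Helper functions kept in the project directory
source("helpers.R")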
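Before loading this list, it can help to check which packages are not yet installed. The helper below is a hypothetical convenience function of our own (missing_packages is not part of the app); the package names are taken from the calls above. Two caveats, stated as our understanding rather than guarantees: shinyIncubator has been distributed through GitHub rather than CRAN, and doMC is not available for Windows.

```r
# Package names taken from the require()/library() calls in server.r.
needed <- c("shiny", "caret", "e1071", "randomForest", "nnet", "glmnet",
            "gbm", "mice", "VIM", "fastICA", "pastecs", "googleVis",
            "PASWR", "doMC")

# Hypothetical helper: returns the subset of pkgs that is not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```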

Once these packages are loaded and declared in their respective files, we can proceed with the UI phase. For those new to Shiny and R, we recommend working through the introductory Shiny tutorials to gain a better understanding of the relationship between the ui.R and server.R files.  You can find the link here to the main Shiny page.

The UI & Server Code

In order to make a visually appealing and straightforward interface design, we used the shiny dashboard package.  This package has very straightforward documentation that can be found here.  The shiny package comes with a large pool of high quality icons, CSS themes, and other Bootstrap elements.  I recommend following the shiny dashboard tutorials and then cross-referencing your learning with our code to get the most out of this blog post.

The server.r code was trickier than the UI for a couple of reasons.  Reactive shiny features are a must for creating an interactive shiny app.  The main shiny blog does a fantastic job explaining dynamic and reactive scripting in R, so I will leave the explanation to them in this link.  Essentially, reactive functions in shiny are smaller “pseudo-functions” that automatically receive user input when the user interacts with features such as a check-box or slider.  We wanted reactive functionality to allow users to customize their tuning parameters for the modeling part of the app.

Due to the size of the project, we won't be going into the details of the code.  However, if enough requests are made, we may consider creating a tutorial blog post to go more in depth into the development of our app.  You will find the main files in this blog post below.  Again, feel free to access our github if you wish to play around with our app and source code.

UI.R


require(shiny);require(shinyIncubator);
library(shinydashboard);
require(pastecs);

shinyUI(dashboardPage(
skin = "blue",

dashboardHeader(title = "ML Explorer Beta 1.0"),
dashboardSidebar(
sidebarMenu(
sidebarSearchForm(textId = "searchText", buttonId = "searchButton",
label = "Search..."),
menuItem("Summary", tabName = "summary", icon = icon("dashboard")),
#menuItem("Upload", tabName = "dataupload", icon = icon("upload")),
#menuItem("Database", tabName = "database", icon = icon("database")),
menuItem("Data Preparation", tabName = "datapreparation", icon = icon("wrench")),
menuItem("Analysis", tabName = "analysis", icon = icon("cogs"),
menuSubItem("Train & Validation",icon = icon("cog"), tabName = "trainvalidation"),
menuSubItem("Features",icon = icon("cog"), tabName = "features"),
menuSubItem("Algorithm",icon = icon("cog"), tabName = "algorithm")
),

menuItem("Results", tabName = "results", icon = icon("dashboard")),
menuItem("About", tabName = "about", icon = icon("info")),
menuItem("Code", tabName = "code", icon = icon("code"))
)

),
dashboardBody(

tabItems(
# First tab content
tabItem(tabName = "summary",
fluidRow(

box(title = "How To Use", status = "primary", solidHeader = TRUE,
collapsible = TRUE, width = 8,
h4("Step 1: Upload Dataset"),
h5("Ideally any csv file is useable. It is recommended to perform cleaning and munging methods prior to the upload though. We intend to apply data munging/cleaning methods in this app in the near future."),
h4("Step 2: Analyze Data"),
h5("Current version allows the user to perform basic missing analysis."),
h4("Step 3: Choose Pre-processing Methods"),
h5("Basic K-Cross Validation Methods are applicable. "),
h4("Step 4: Choose Model"),
h5("Choose from a selection of machine learning models to run. Selected parameters for each corresponding model are available to tune and manipulate."),
h4("Step 5: Run Application"),
h5("Once the model(s) have been executed, the results for each model can be viewed in the results tab for analysis."),
imageOutput("image2"))),
fluidRow(
box(title = "Libraries/Dependencies",status = "primary", solidHeader = TRUE,
collapsible = TRUE, width = 8,
h4("- The caret package was used for the backend machine learning algorithms."),
h4("- Shiny Dashboard was used for the front end development."),
h4("- The application is compatible with AWS for server usage.")))),

######################################
# Data Preparation Tab Contents
######################################

# Second tab content
tabItem(tabName = "datapreparation",
fluidPage(
tabBox(
id = "datapreptab",


tabPanel(h4("Data"),

fileInput('rawInputFile','Upload Data File',accept=c('text/csv', 'text/comma-separated-values,text/plain', '.csv')),
uiOutput("labelSelectUI"),
checkboxInput('headerUI','Header',TRUE),
radioButtons('sepUI','Separator',c(Comma=',',Semicolon=';',Tab='\t'),','),
radioButtons('quoteUI','Quote',c(None='','Double Quote'='"','Single Quote'="'"),'"')),

tabPanel(h4("Data Analysis"), verbatimTextOutput("textmissing"), dataTableOutput("colmissing")),

tabPanel(h4("View Data"), dataTableOutput("pre.data"))),
infoBoxOutput("missingBox"))),

##################################################################################
#### Training/Splitting Tab Set Contents
##################################################################################

tabItem(tabName = "trainvalidation",

radioButtons("crossFoldTypeUI","Cross Validation Type",c("K-Fold CV"='cv',"Repeated KFold CV"="repeatedcv"),"cv"),
numericInput("foldsUI","Number of Folds(k)",5),
# The condition is a JavaScript expression, so the value must be quoted
conditionalPanel(condition="input.crossFoldTypeUI == 'repeatedcv'",
numericInput("repeatUI","Number of Repeats",5)),
uiOutput("CVTypeUI"),
radioButtons("preprocessingUI","Pre-processing Type",c('No Preprocessing'="",'PCA'="pca",'ICA'="ica"),""),
uiOutput("ppUI")
),

##################################################################################
#### Algorithm Tab Set Contents
##################################################################################

tabItem(tabName = "algorithm",
fluidRow(
box(title = "K- Nearest Neighbor", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,

checkboxInput("KNNmodelSelectionUI", "On/Off", value = FALSE),
h4("KNN is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression."),
uiOutput("KNNmodelParametersUI"),
tags$hr()
)
),

fluidRow(
box(title = "Boosted Logistic Regression", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
checkboxInput("LGRmodelSelectionUI", "On/Off", value = FALSE),
h4("LogitBoost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The original framework uses the AdaBoost method in the context of logistic regression."),
uiOutput("LGRmodelParametersUI"),
tags$hr()
)
),

fluidRow(
box(title = "Gradient Boosting Method", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
checkboxInput("GBMmodelSelectionUI", "On/Off", value = FALSE),
h4("Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function."),
uiOutput("GBMmodelParametersUI"),
tags$hr()
)
),


fluidRow(
box(title = "Neural Network", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
checkboxInput("modelSelectionUI", "On/Off", value = FALSE),
h4("Artificial Neural Networks are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown."),
uiOutput("modelParametersUI"),
tags$hr()
)
),


uiOutput("dummyTagUI"),
uiOutput("GBMdummyTagUI"),
uiOutput("KNNdummyTagUI"),
uiOutput("LGRdummyTagUI"),
actionButton("runAnalysisUI", " Run", icon = icon("play"))),

############################################

tabItem(tabName = "features",
fluidPage(plotOutput("caretPlotUI", width = "950px", height = "750px"))),

##################################################################################
#### Results Tab Set Contents
##################################################################################

tabItem(tabName = "results",
fluidRow(
box(title = "K-Nearest Neighbor", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,

tabBox(
tabPanel("Best Results",tableOutput("KNNbestResultsUI")),
tabPanel("Train Results",tableOutput("KNNtrainResultsUI")),
tabPanel("Accuracy Plot",plotOutput("KNNfinalPlotUI")))
)
),

fluidRow(
box(title = "Logistic Regression", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
tabBox(

tabPanel("Best Results",tableOutput("LGRbestResultsUI")),
tabPanel("Train Results",tableOutput("LGRtrainResultsUI")),
tabPanel("Accuracy Plot",plotOutput("LGRfinalPlotUI"))
)
)
),

fluidRow(
box(title = "Gradient Boosting Method", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
tabBox(

tabPanel("Best Results",tableOutput("GBMbestResultsUI")),
tabPanel("Train Results",tableOutput("GBMtrainResultsUI")),
tabPanel("Accuracy Plot",plotOutput("GBMfinalPlotUI")))
)
),


fluidRow(
box(title = "Neural Network", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
tabBox(

tabPanel("Best Results",tableOutput("bestResultsUI")),
tabPanel("Train Results",tableOutput("trainResultsUI")),
tabPanel("Accuracy Plot",plotOutput("finalPlotUI"))
)
)
)),

############################################

tabItem(tabName = "about",
fluidRow(
box(title = "Contact", status = "primary", solidHeader = TRUE,
collapsible = TRUE, width = 8,
h4("Joseph Lee"),
h5("Programmer, Full Stack Developer"),
h4("Avi Yashchin"),
h5("Co-programmer, Server Architect"),
h4("NYC Data Science Academy")
)),

fluidRow(
box(title = "Beta 1.0", status = "primary", solidHeader = TRUE,
collapsible = TRUE, width = 8,
h4("Version 1.0 Notes"),
h5("- Next version iteration will focus on data munging and cleaning as well as implementing more UI functions for feature engineering.
The version should work relatively well with clean data."),
h5("-Data that is not clean with mislabeled levels and factors will likely break the application or produce highly inaccurate results."),
h5("-Current version only uses Accuracy metric, next version will ideally incorporate ROC evaluation"),
h5("-Next version will incorporate using test data to produce prediction data for kaggle competitions")
))),


tabItem(tabName = "code",
fluidRow(
box(title = "Code", status = "primary", solidHeader = TRUE,
collapsible = TRUE, width = 8,
h5("The code is open source and available at the github link: [To Be Posted Soon]"))))
))


))

 

The SERVER.R file is available on our github.
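Since SERVER.R is not reproduced here, the reactive pattern described earlier can be sketched as follows. This is not our actual server code: the input and output IDs KNNmodelSelectionUI and KNNmodelParametersUI come from UI.R above, while knnMaxK, knn_grid, and the slider range are hypothetical names and values chosen for illustration.

```r
library(shiny)

# Hypothetical server function illustrating the reactive pattern.
server <- function(input, output, session) {

  # Fills the uiOutput("KNNmodelParametersUI") placeholder from UI.R.
  # The slider is only rendered while the KNN On/Off box is checked.
  output$KNNmodelParametersUI <- renderUI({
    if (!isTRUE(input$KNNmodelSelectionUI)) return(NULL)
    sliderInput("knnMaxK", "Largest k to try", min = 1, max = 25, value = 9)
  })

  # Reactive expression: re-evaluated automatically whenever the slider
  # value changes. Falls back to 9 until the slider exists.
  knn_grid <- reactive({
    max_k <- if (is.null(input$knnMaxK)) 9 else input$knnMaxK
    data.frame(k = seq(1, max_k, by = 2))  # odd values of k up to max_k
  })

  # Anything that calls knn_grid() picks up the latest user setting.
  output$KNNtrainResultsUI <- renderTable(knn_grid())
}
```

The design choice here is to keep the tuning grid in one reactive expression, so every output that depends on it stays in sync without any explicit event handling.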

   

How To Use Our App

Our app has three main parts.  The first part is the data preparation feature.  You can upload almost any csv file and the app will automatically perform missing analysis.  We mainly used the iris.csv data set as the main test set.

The second part is the analysis feature.  The analysis feature contains three sub-features: preprocessing, feature graphing, and modeling.  For the pre-processing sub-feature, we kept the options minimal and included cross validation, PCA, and ICA options that can be activated through shiny widgets.  The feature graphing is a simple graphing application that produces a lattice plot of all data features in the data set.  For the modeling sub-feature, we made four algorithms available for the user to choose and tune: KNN, logit boost, gradient boosting method, and neural networks.  All of the modeling algorithms come from the caret package, which is a fantastic package to familiarize yourself with if you want to pursue more machine learning applications in R.

The third part of the app is the results feature.  After selecting the models to use for the uploaded data set, you can see and compare results from the different models on one page.

The demo video above provides a visual overview of how to operate the shiny application.  Feel free to reach out to us if you have any questions or comments about the code.  Again, this is completely open source so please take it and play around with it!
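For readers who want to try the same workflow outside the app, here is a minimal sketch of the steps described above: a missing-value check, repeated k-fold cross validation, PCA preprocessing, and a tuned KNN model, all on the built-in iris data. It is an illustration rather than an excerpt from our server code; the grid of k values and the seed are assumptions chosen for the example.

```r
library(caret)

data(iris)

# Missing-value check, similar in spirit to the app's Data Analysis tab.
missing_per_column <- colSums(is.na(iris))
print(missing_per_column)

set.seed(42)  # arbitrary seed so the resampling is reproducible

# Repeated k-fold CV with 5 folds and 5 repeats (the UI defaults).
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

# KNN with PCA preprocessing; the k values are an illustrative choice.
knn_fit <- train(
  Species ~ .,
  data       = iris,
  method     = "knn",
  preProcess = "pca",
  trControl  = ctrl,
  tuneGrid   = data.frame(k = c(3, 5, 7, 9))
)

print(knn_fit$bestTune)  # the k with the best resampled accuracy
print(knn_fit$results)   # Accuracy and Kappa for every k in the grid
```

Swapping method = "knn" for another caret method name (for example "gbm" or "nnet", with a matching tuneGrid) reuses the same resampling and preprocessing setup.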

Conclusion

We are living in an exciting era of data science.  Looking at the open source resources available to any data enthusiast from these different perspectives, it's easy to see how so many startups are gaining traction in industry.  Everything you need to start a data science startup is only a few keystrokes away.  For fun, we loaded our app onto an AWS instance and compared its computational runtime with a local instance of the app.  We found that our app ran 10x faster on an AWS instance than on my local machine (MacBook Air, 8 GB RAM, 512 GB SSD).  This only shows that anyone can create a budget-friendly data science startup as long as they are creative and determined to see it through.  Thank you for reading our blog post, and as a bonus please enjoy our deep learning art below!

Bonus Deep Learning Art

WE HAD SO MUCH EXTRA AWS HORSEPOWER WE MADE DEEP LEARNING ART USING PICTURES OF JOE’S CAT:

[Figures: cat1, cat2, cat3]

