
Deploying Predictive Models

Over the last decade, we have seen tremendous interest in the application of data mining and statistical algorithms, first in research and science and, more recently, across various industries. This interest has led the data science community to develop a myriad of solutions.

Most of the time, data science algorithms are built standalone on platforms such as R or Python. To build a data-driven product, or to use these algorithms for real-time predictions, it is essential that they be integrated into, or ported over to, the application stack.

Let’s say your data science team has built an amazingly accurate model in R using a package with a built-in algorithm, and you are ready to put it to work. However, your application servers run on Java, and this particular package is not available in Java. An R-Java bridge would be a maintenance problem. One solution is to create a Java version of the R package and rebuild your algorithm with it. This is, however, tedious when all we need is for the algorithm to predict outcomes.

To address the portability of algorithms, the Data Mining Group (DMG) has developed two vendor-neutral standards for exporting data science algorithms: the Predictive Model Markup Language (PMML) and the Portable Format for Analytics (PFA).

Predictive Model Markup Language (PMML)

The Predictive Model Markup Language (PMML) is the de facto standard language used to represent predictive analytic models. It allows predictive solutions to be easily shared between PMML-compliant applications without the need for custom coding; that is, a solution may be developed in one application and deployed directly on another.

Traditionally, after building a model, the data science team had to write a document describing the entire solution. This document was then passed to the IT engineering team, which would recode it into the production environment to make the solution operational and scalable. With PMML, that double effort is no longer required: the predictive solution as a whole (data transformations + predictive model) is simply represented as a PMML file, which is then used as is for production deployment. Today, packages and libraries exist in all the major data mining tools (R, Python, etc.) to generate a PMML file for a model built within that tool.
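
For instance, a scikit-learn model can be exported to PMML in a few lines. The following is a minimal sketch, assuming the open-source sklearn2pmml package; the estimator choice and file name are illustrative only:

   # Minimal sketch: export a scikit-learn model to PMML
   # (assumes the open-source sklearn2pmml package is installed).
   from sklearn.datasets import load_iris
   from sklearn.tree import DecisionTreeClassifier
   from sklearn2pmml import sklearn2pmml
   from sklearn2pmml.pipeline import PMMLPipeline

   X, y = load_iris(return_X_y=True)

   # Wrap the estimator in a PMMLPipeline so that any pre-processing
   # steps travel with the model into the PMML file.
   pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier())])
   pipeline.fit(X, y)

   # Serialize the whole pipeline as a vendor-neutral PMML document.
   sklearn2pmml(pipeline, "DecisionTreeIris.pmml")

Note that sklearn2pmml delegates to the JPMML converter under the hood, so a Java runtime is required at export time.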

Working with PMML

PMML incorporates data pre-processing and data post-processing as well as the predictive model itself. The elements supported by PMML are given at the DMG website (http://dmg.org/pmml/v4-2-1/GeneralStructure.html).

The structure of a PMML file follows the steps commonly used to build a predictive solution, which include the following (a skeletal example is shown after the list):

  1. Header – contains general information about the PMML document, such as copyright information for the model, its description, a timestamp attribute, etc.
  2. Data Dictionary – contains definitions for all the possible fields used by the model. These include numerical, ordinal, and categorical fields.
  3. Mining Schema – defines the strategies for handling missing and outlier values.
  4. Data Transformations – define the computations required for pre-processing the raw input data into derived fields.
  5. Model Definition – defines the structure and the parameters used to build the model. PMML currently covers models such as Association Rules, Cluster Models, Decision Trees, Naïve Bayes classifiers, Neural Networks, Regression, Rulesets, Sequences, Support Vector Machines, Text Models, and Time Series models.
  6. Outputs – define the outputs expected from the model.
  7. Targets – define the post-processing steps to be applied to the model output.
  8. Model Explanation – defines the performance metrics obtained when passing test data through the model (as opposed to training data).
  9. Model Verification – defines a sample set of input data records together with expected model outputs.
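
To make this concrete, below is a heavily abridged, hand-written PMML skeleton for a toy linear regression. The element names come from the PMML 4.2 schema, but the model name, fields, and coefficients are purely illustrative:

   <PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
     <Header copyright="Example Inc." description="Illustrative skeleton"/>
     <DataDictionary numberOfFields="2">
       <DataField name="x" optype="continuous" dataType="double"/>
       <DataField name="y" optype="continuous" dataType="double"/>
     </DataDictionary>
     <RegressionModel modelName="toy_model" functionName="regression">
       <MiningSchema>
         <MiningField name="x"/>
         <MiningField name="y" usageType="target"/>
       </MiningSchema>
       <!-- y = 1.0 + 2.5 * x -->
       <RegressionTable intercept="1.0">
         <NumericPredictor name="x" coefficient="2.5"/>
       </RegressionTable>
     </RegressionModel>
   </PMML>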

Thus PMML enables the instant deployment of predictive solutions, which can be expressed in their entirety (including data pre-processing, data post-processing, and the modeling technique). It is currently supported by all of the top commercial and open-source statistical tools.
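
On the consuming side, a PMML file can be scored without any knowledge of how the model was trained. Here is a minimal sketch, assuming the open-source pypmml package and the file produced in the earlier export example; the input field names are illustrative and must match the file’s Data Dictionary:

   # Minimal sketch: score an exported PMML file in Python
   # (assumes the open-source pypmml evaluator is installed).
   from pypmml import Model

   model = Model.load("DecisionTreeIris.pmml")

   # Field names must match the DataDictionary in the PMML file;
   # the values below are illustrative.
   print(model.predict({"x1": 5.1, "x2": 3.5, "x3": 1.4, "x4": 0.2}))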

Portable Format for Analytics (PFA)

The Portable Format for Analytics (PFA) is a JSON-based predictive model interchange format. It provides a common language to help smooth the transition from development to production. PFA encapsulates a unit of data processing called a scoring engine: essentially, the “predict” method of a model.

Working with PFA

When compared with PMML, PFA adds the flexibility of arbitrary function composition, rather than a choice from a set of established models. Want to partition a space with clusters and associate a different SVM with each cluster? Want to augment decision trees so that a neural network runs at each node to decide which branch to follow? All of this, and much more, can be done comfortably in PFA. Let’s understand this a bit more.

 A PFA scoring engine is a JSON file containing model parameters and a scoring procedure. The scoring procedure transforms inputs to outputs by composing functions that range in complexity from addition to neural nets. If your “predict” method can be expressed in terms of common data science primitives (arithmetic, special functions, matrices, list/map manipulations, decision trees, nearest cluster/neighbor, and “lapply”-like functional programming), then it can be written in a few lines of PFA “code” (actually JSON). For example, a random forest is scored like this: 

   {"a.mode":
      {"a.map": [{"cell": "forest"},
         {"params": [{"tree": "TreeNode"}],
          "ret": "string",
          "do": {"model.tree.simpleTree":
             ["input", "tree"]}}]}}

Starting from the innermost function call: “model.tree.simpleTree” scores “input” against a “tree”. That call sits inside an inline user-defined function, which transforms a “TreeNode” named “tree” into a “string” by scoring it. This function is applied (“a.map”) to the list of “TreeNodes” in a data cell named “forest”, and the most common score (“a.mode”) is reported. It could be generated automatically from an R expression like this:

   a.mode(a.map(forest, function(tree) {
       model.tree.simpleTree(input, tree)
   }))

In the scenario above, the data scientist only has to express the model in PFA; the backend engineer can then plug it into a PFA implementation running in production. Thanks to a detailed conformance suite, everyone can be confident that a scoring engine that tests well in R or Python will work in Java.
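
As a small end-to-end illustration, here is a complete (if trivial) PFA document executed from Python. This is a sketch only, assuming the open-source Titus library (the Python PFA implementation from the Hadrian project); the document simply doubles its input:

   # Minimal sketch: build and run a PFA scoring engine in Python
   # (assumes the open-source Titus PFA implementation is installed).
   from titus.genpy import PFAEngine

   # A complete, trivial PFA document: takes a double, returns twice its value.
   pfa_document = '''
   {"input": "double",
    "output": "double",
    "action": {"*": ["input", 2]}}
   '''

   # fromJson returns a list of engines, since one PFA document can
   # describe multiple engine instances.
   engine, = PFAEngine.fromJson(pfa_document)
   print(engine.action(3.14))   # -> 6.28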
