Over the last decade, interest in applying data mining and statistical algorithms has grown tremendously, first in research and science and, more recently, across various industries. This has led the data science community to develop myriad solutions.
Most of the time, data science algorithms are built standalone on platforms such as R or Python. To build a data-driven product, or to use these algorithms for real-time predictions, they must be integrated into, or ported over to, the application stack.
Let’s say your data science team has built an amazingly accurate model in R, using a package with a built-in algorithm, and you are ready to put it to work. However, your application servers run on Java, and this particular package is not available in Java. An R-Java bridge would be a maintenance problem. One solution is to create a Java version of the R package and rebuild your algorithms with it. This is tedious, however, when all we need is for the algorithms to predict outcomes.
To address the portability of algorithms, the Data Mining Group (DMG) has developed two vendor-neutral standards for exporting data science algorithms: the Predictive Model Markup Language (PMML) and the Portable Format for Analytics (PFA).
Predictive Model Markup Language (PMML)
The Predictive Model Markup Language (PMML) is an XML-based format and the de facto standard for representing predictive analytic models. It allows predictive solutions to be shared easily between PMML-compliant applications without custom coding; that is, a solution may be developed in one application and deployed directly in another.
Traditionally, after building a model, the data science team had to write a document describing the entire solution. This document was then passed to the IT engineering team, which recoded it for the production environment to make the solution operational and scalable. With PMML, that double effort is no longer required, since the predictive solution as a whole (data transformations + predictive model) is represented as a PMML file that is used as is for production deployment. Packages and libraries are now available for the common data mining tools (R, Python, etc.) to generate a PMML file for a model built within that tool.
Working with PMML
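The structure of a PMML file follows the steps commonly used to build a predictive solution. It includes a header, a data dictionary describing the input fields, any data transformations, the model itself together with its mining schema, and the targets and outputs that post-process the model's result.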
Thus PMML enables the rapid deployment of predictive solutions, which can be expressed in their entirety (including data pre-processing, data post-processing, and modeling technique). It is currently supported by many of the leading commercial and open source statistical tools.
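To make the file structure more concrete, here is a minimal sketch in Python. It embeds a tiny hand-written PMML-style document for a linear regression and scores it using only the standard library. The field names, coefficients, and the `score` helper are made up for illustration; the document has not been validated against the official schema, and a real deployment would rely on a PMML-compliant scoring engine rather than hand-rolled parsing.

```python
import xml.etree.ElementTree as ET

# A deliberately tiny PMML-style document for the model y = 1.5 + 2.0 * x.
# Field names and coefficients are illustrative assumptions.
PMML_TEXT = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="toy linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Namespace prefix used in the ElementTree path expressions below.
NS = {"p": "http://www.dmg.org/PMML-4_4"}


def score(pmml_text, row):
    """Score one row against the single RegressionTable in the document.

    Hypothetical helper: it handles only an intercept plus linear
    NumericPredictor terms and ignores every other PMML feature.
    """
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        name = predictor.get("name")
        result += float(predictor.get("coefficient")) * row[name]
    return result


print(score(PMML_TEXT, {"x": 3.0}))  # 1.5 + 2.0 * 3.0 = 7.5
```

The point of the sketch is that the model travels as data: the same text could be produced by R or Python and read by a Java service, with no algorithm code shared between them.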
Portable Format for Analytics (PFA)
The Portable Format for Analytics (PFA) is a JSON-based predictive model interchange format. It provides a common language to help smooth the transition from development to production. A PFA document encapsulates one unit of data processing, called a scoring engine: in effect, the "predict" method of a model.
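As a rough illustration of what "JSON-based" means here, the sketch below holds a very small PFA-style document (read a double, add 100, emit a double) and evaluates it with a toy interpreter written in Python. The `run_toy_engine` function is a hypothetical stand-in that understands only the `"+"` function; it is not a PFA implementation and says nothing about how real engines are built.

```python
import json

# A very small PFA-style document: declared input type, declared output
# type, and an action made of one expression.
PFA_TEXT = '{"input": "double", "output": "double", "action": [{"+": ["input", 100]}]}'


def run_toy_engine(pfa_text, value):
    """Evaluate a document whose action uses only {"+": [a, b]} calls.

    Hypothetical toy evaluator for illustration; a real PFA engine
    supports a large function library and checks types.
    """
    doc = json.loads(pfa_text)

    def evaluate(expr):
        if expr == "input":                     # reference to the current input
            return value
        if isinstance(expr, (int, float)):      # numeric literal
            return expr
        if isinstance(expr, dict) and list(expr) == ["+"]:
            left, right = expr["+"]             # function call: name -> arguments
            return evaluate(left) + evaluate(right)
        raise ValueError(f"unsupported expression: {expr!r}")

    result = None
    for expr in doc["action"]:                  # the last expression is the output
        result = evaluate(expr)
    return float(result)


print(run_toy_engine(PFA_TEXT, 1.0))  # 1.0 + 100 = 101.0
```

Because the whole engine is ordinary JSON, it can be generated, stored, and versioned like any other data file.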
Working with PFA
Compared with PMML, PFA adds the flexibility of arbitrary function composition, rather than requiring you to choose from a set of established models. Want to partition a space with clusters and associate a different SVM with each cluster? Want to augment decision trees so that a neural network is evaluated at each node to decide which branch to follow? All of this, and much more, can be expressed comfortably in PFA. Let’s understand this a bit more.
A PFA scoring engine is a JSON file containing model parameters and a scoring procedure. The scoring procedure transforms inputs to outputs by composing functions that range in complexity from addition to neural nets. If your "predict" method can be expressed in terms of common data science primitives (arithmetic, special functions, matrices, list/map manipulations, decision trees, nearest cluster/neighbor, and "lapply"-like functional programming), then it can be written in a few lines of PFA "code" (actually JSON). For example, a random forest is scored like this:
{"a.mode":
  {"a.map": [{"cell": "forest"},
             {"params": [{"tree": "TreeNode"}],
              "ret": "string",
              "do": {"model.tree.simpleTree":
                     ["input", "tree"]}}]}}
Starting from the innermost function call, "model.tree.simpleTree" scores "input" against a "tree". That call is the body of an inline user-defined function that turns a "TreeNode" named "tree" into a "string" by scoring it. The function is applied to a list of "TreeNodes" from a data cell named "forest", and the most common ("a.mode") score is reported. It could be generated automatically from an R expression like this:
a.mode(a.map(forest, function(tree) {
  model.tree.simpleTree(input, tree)
}))
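For readers who think more easily in Python, the same composition can be sketched as a plain-Python analogue. This is not PFA and does not use any PFA library: the nested-dict tree layout, the fixed `<=` comparison, and the `simple_tree` and `score_forest` names are simplifying assumptions chosen to mirror the "map each tree to a score, then take the mode" shape of the snippet above.

```python
from collections import Counter


def simple_tree(datum, node):
    """Walk a tree of nested dicts until a leaf (a plain string) is reached.

    Assumed layout: each internal node has "field", "value", "pass" and
    "fail"; the test is always datum[field] <= value.
    """
    while not isinstance(node, str):
        branch = "pass" if datum[node["field"]] <= node["value"] else "fail"
        node = node[branch]
    return node


def score_forest(datum, forest):
    """Score every tree, then report the most common score."""
    votes = [simple_tree(datum, tree) for tree in forest]   # plays the role of a.map
    return Counter(votes).most_common(1)[0][0]              # plays the role of a.mode


# A made-up forest of three trees over two made-up fields.
FOREST = [
    {"field": "x", "value": 1.0, "pass": "small", "fail": "large"},
    {"field": "y", "value": 5.0, "pass": "small", "fail": "large"},
    {"field": "x", "value": 3.0, "pass": "small",
     "fail": {"field": "y", "value": 2.0, "pass": "small", "fail": "large"}},
]

print(score_forest({"x": 2.0, "y": 7.0}, FOREST))  # votes: large, large, small
```

With three trees and two classes there can be no tie; how ties are broken in a real engine is left to the PFA specification and is not modeled here.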
In the scenario above, the data scientist would only have to express the model in PFA for the backend engineer to plug it into a PFA implementation running in production. Thanks to a detailed conformance suite, everyone can be confident that a scoring engine that tests well in R or Python will work in Java.