As a data science professional, how would you respond to the question, “How familiar are you with PMML?”
Your choices are:
- “I have no idea what PMML is”
- “I have heard and read about PMML”
- “I have played around with PMML”
- “I have used PMML in my projects”
I have seen many responses to this question, and the most popular one was “I have no idea what PMML is”.
Is this surprising? Perhaps not: if we look at who responded this way, most of them were somewhat removed from the task of model deployment. One might say the response is normal, since they are responsible only for the model-building process. However, it is quite common for a model development team to also take on the task of deploying models into production. Even if they are not directly involved in the deployment, they will most likely have to guide the IT team. Herein lies the unfortunate aspect: without knowing what PMML is, the team is likely to end up with an awkward, inefficient and difficult production process that is time-consuming and usually costly. This is why people in the data science field should be aware of what PMML is and how it can be useful.
PMML stands for Predictive Model Markup Language. It was developed by the Data Mining Group (DMG), an independent, vendor-led consortium. PMML provides an open standard for representing data mining models, so models can be shared between different applications without running into proprietary formats and incompatibilities. Most major commercial and open-source data mining tools support PMML; R’s pmml package is one example.
PMML is an XML-based language which follows a very intuitive structure to describe data pre- and post-processing as well as predictive algorithms. Not only does PMML represent a wide range of statistical techniques, but it can also be used to represent input data as well as the data transformations necessary to turn raw data into meaningful features.
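To make that structure concrete, here is a minimal sketch of what a PMML document can look like and how its sections can be read with Python's standard library. The model, its field names (age, income, spend, log_income) and its coefficients are invented for illustration, and the PMML 4.4 namespace is an assumption; a file exported by a real tool will usually carry more detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document: a data dictionary (raw inputs), one
# transformation (log of income), and a linear regression model.
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative example only"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="log_income" optype="continuous" dataType="double">
      <Apply function="ln"><FieldRef field="income"/></Apply>
    </DerivedField>
  </TransformationDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="income"/>
      <MiningField name="spend" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="log_income" coefficient="3.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Namespace prefix used in the lookups below (assumes PMML 4.4).
NS = {"p": "http://www.dmg.org/PMML-4_4"}


def summarize(pmml_text):
    """Return the input fields, derived fields and predictors of the model."""
    root = ET.fromstring(pmml_text)
    data_fields = [f.get("name")
                   for f in root.findall("p:DataDictionary/p:DataField", NS)]
    derived_fields = [f.get("name")
                      for f in root.findall(
                          "p:TransformationDictionary/p:DerivedField", NS)]
    model = root.find("p:RegressionModel", NS)
    predictors = {p.get("name"): float(p.get("coefficient"))
                  for p in model.findall(
                      "p:RegressionTable/p:NumericPredictor", NS)}
    return {
        "data_fields": data_fields,
        "derived_fields": derived_fields,
        "function": model.get("functionName"),
        "predictors": predictors,
    }


summary = summarize(PMML_DOC)
print(summary)
```

The point is that the raw inputs, the feature engineering and the algorithm all travel together in one readable file, so the consuming system does not need to know which tool produced the model.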
Now suppose there were an application that could process a PMML model and execute it in any production environment (common ones these days are Hadoop, Spark, Storm, Teradata, Greenplum, Oracle, IBM PureData, AWS, Azure, etc.). We would then have a very clean and simple model deployment environment that “connects” many different modeling tools to many different production systems. I think this is one way to get every component of the analytics ecosystem to work together without obstruction. This is the reason why I titled this blog “PMML – A Good Citizen of the Analytics Ecosystem.”
Actually, this process doesn’t need to be restricted to model deployment into production. It applies equally well to model deployment into a test environment. For example, one could use open-source R to develop a model on a reasonable sample size and then apply the model, in its PMML form, against very large data sets (Big Data) in the test system.
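As a rough illustration of that hand-off, the sketch below reads a linear regression from a PMML string and applies it to a few rows, standing in for the system that consumes the model. Everything here is hypothetical: the model, the field names tenure and visits, and the coefficients. It also handles only plain numeric predictors, so it is a toy and not a substitute for a real PMML scoring engine.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used in the lookups below (assumes PMML 4.4).
NS = {"p": "http://www.dmg.org/PMML-4_4"}

# Hypothetical model, as if it had been exported from a modeling tool.
MODEL_PMML = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative example only"/>
  <DataDictionary numberOfFields="3">
    <DataField name="tenure" optype="continuous" dataType="double"/>
    <DataField name="visits" optype="continuous" dataType="double"/>
    <DataField name="revenue" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="tenure"/>
      <MiningField name="visits"/>
      <MiningField name="revenue" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="tenure" coefficient="2.0"/>
      <NumericPredictor name="visits" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Extract the intercept and coefficients of the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    # Simplification: the optional 'exponent' attribute is ignored,
    # so every predictor is treated as a plain linear term.
    coefs = {p.get("name"): float(p.get("coefficient"))
             for p in table.findall("p:NumericPredictor", NS)}
    return intercept, coefs


def score(row, intercept, coefs):
    """Apply the linear model to one row given as a dict of field values."""
    return intercept + sum(c * row[name] for name, c in coefs.items())


intercept, coefs = load_linear_model(MODEL_PMML)

# Stand-in for the (much larger) data held in the test system.
rows = [
    {"tenure": 3.0, "visits": 4.0},
    {"tenure": 0.0, "visits": 10.0},
]
scores = [score(r, intercept, coefs) for r in rows]
print(scores)
```

Because the scoring side only reads the PMML file, the model could be rebuilt in a different tool without touching this code, as long as the exported fields stay the same.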
The simplicity of the PMML approach can mean making your analytical model available to your business users in a day or less, rather than in months.
What was your answer to the question?
*** Note: the diagram is a resource shared by Zementis, Inc., a leader and specialist in PMML model deployment