Today, an increasing number of institutional clients are looking for solutions, strategies and roadmaps to implement Big Data and Predictive Analytics initiatives within their own organizations. While the exact solutions and recommendations differ from client to client, based on factors such as the industry they operate in, the size of their operations and their business model, there are common threads that apply across their needs.
While looking for these common threads, I came across an interesting white paper titled "Standards in Predictive Analytics" by James Taylor (CEO, Decision Management Solutions) in which he shares his thoughts on the subject. This blog post summarizes some of the key points the author makes, along with some of my own thoughts from my engagements with both mid- and large-sized clients, within and outside the US. I hope you find it useful.
The changing nature of Predictive Analytics today
In the past, a predictive analytic model was generated using a single proprietary tool (for example, SAS) against a sample of structured data. The model would then be applied in batch to generate scores for future use in a database or data warehouse.
Today, there is a focus on "operationalizing analytics", that is, building models and applying them in day-to-day operations, turning the organization's data into useful, actionable insight (e.g. real-time scoring) which can be used NOW to improve customer engagement, manage risk, reduce fraud, etc.
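The contrast between batch scoring and operationalized, real-time scoring can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical logistic-regression model whose coefficients (`num_claims`, `account_age_years`) were produced earlier by some modeling tool; the names and values are invented for the example.

```python
import math

# Hypothetical coefficients exported from a previously trained logistic
# regression (illustrative values only).
COEFFICIENTS = {"intercept": -2.0, "num_claims": 0.8, "account_age_years": -0.1}

def score_event(event):
    """Apply the model to one record and return a probability in (0, 1).

    Batch scoring would run this same function over a nightly extract;
    operationalizing the model means calling it inline as each event arrives.
    """
    z = COEFFICIENTS["intercept"]
    for name, coef in COEFFICIENTS.items():
        if name != "intercept":
            z += coef * event.get(name, 0.0)
    return 1.0 / (1.0 + math.exp(-z))  # logistic link -> score

# Real-time use: score a transaction the moment it arrives.
risk = score_event({"num_claims": 3, "account_age_years": 2})
```

The scoring logic is identical in both modes; what changes is where it runs and when, which is exactly the "operationalization" question.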
The biggest mistake that organizations make while trying to build an Analytics and/or Big Data strategy is focusing on the technology before understanding the business problems, opportunities and decisions that need to be addressed. In other words, it is NOT enough to just ask for greater insight. You need to make the effort to understand the desired insight and the underlying business problem that the insight will hopefully help solve. Once identified and well understood, the desired insight will naturally drive the analytics and big data requirements.
More Big Data drives more Predictive Analytics
The growth of Predictive Analytics has increasingly merged with the growth of Big Data. Increased digitization and the internet have exponentially increased the amount of data available, as well as the range of data types and the speed at which data arrives. This is commonly described as the "3 Vs" of Big Data: Volume, Variety and Velocity.
Organizations are finding that the data they need for predictive analytics is no longer all structured, and no longer confined to their databases and data warehouses. There is increasing evidence that predictive analytic models built on both structured and unstructured Big Data make a more transformative impact on the business than models built using structured data alone.
3 Core Emerging Themes in Predictive Analytics
The role of R in broadening the predictive analytic ecosystem
R is free and open source, making it appealing as a tool for learning advanced analytics. Because R is open and designed to be extensible, the number of algorithms available for it is huge, with over 5300 packages today. Ironically, this proliferation has led some to ask: are there too many R packages today? (But I digress...)
While scalability and performance have traditionally been an issue with R, commercial vendors like Revolution Analytics provide their own R implementations for big datasets that overcome these limitations.
The role of Hadoop in handling Big Data for predictive analytics
Hadoop consists of two core elements: the Hadoop Distributed File System (HDFS) and the MapReduce programming framework.
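The MapReduce pattern itself is simple enough to sketch in-process. The word-count example below is a minimal, local simulation of the two phases, not real Hadoop: an actual cluster would read input splits from HDFS, run map tasks in parallel, and shuffle/sort intermediate pairs across nodes before the reduce step.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here, (word, 1) for each word."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: aggregate values per key (Hadoop's shuffle/sort step
    would first group the pairs by key across the cluster)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["big data drives analytics", "big models need big data"]
word_counts = reduce_phase(map_phase(lines))
```

The same mapper/reducer shape is what Hadoop Streaming lets you express as stand-alone scripts reading stdin and writing stdout.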
While some newer organizations, like web 2.0 companies, are putting all their data in Hadoop, a mixed database/data warehouse/Hadoop approach is more common. In the mixed environment, Hadoop is used as a landing zone where data can be pre-processed before being moved to a data warehouse. This allows for rapid addition of new data sources to an existing environment. Hadoop is also used as an active archive: older data that in the past might have been archived inaccessibly is now available for analysis and can be used to build predictive models.
The role of PMML in moving to real-time predictive analytics
If predictive analytic models cannot be effectively operationalized and injected into operational systems, there is a risk that they will sit on the shelf and lose value. Most analytic environments are batch oriented and often only loosely attached to the production environment. There is a need to move models built in a variety of analytic tools into production environments, including workflow engines, business rules management systems, etc. PMML (Predictive Model Markup Language) has emerged as an important way to achieve this.
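To make the idea concrete, here is a deliberately simplified, PMML-style XML fragment for a logistic regression and a small evaluator for it, using only Python's standard library. Note the simplifications: a real PMML document carries an XML namespace, a DataDictionary and a MiningSchema, and would normally be consumed by a PMML-aware scoring engine rather than hand-rolled code; the field names and coefficients here are invented.

```python
import math
import xml.etree.ElementTree as ET

# Simplified PMML-style fragment (illustrative; real PMML has more structure).
PMML_DOC = """
<PMML version="4.2">
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <RegressionTable intercept="-1.5">
      <NumericPredictor name="amount" coefficient="0.002"/>
      <NumericPredictor name="num_prior_claims" coefficient="0.7"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def score(pmml_text, inputs):
    """Evaluate the regression table against one input record."""
    table = ET.fromstring(pmml_text).find(".//RegressionTable")
    z = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        z += float(pred.get("coefficient")) * inputs[pred.get("name")]
    return 1.0 / (1.0 + math.exp(-z))  # "logit" normalization

probability = score(PMML_DOC, {"amount": 500.0, "num_prior_claims": 2})
```

The point of the standard is exactly this separation: the modeling tool writes the XML, and any PMML-capable production system (rules engine, workflow engine, database) can execute it without re-implementing the model.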