Note: Many thanks to those who provided comments on my previous blog “Features Part 1: Are Features the new Data?”. And special thanks to Somil Gupta and Harsha Srivatsa, who patiently worked with me to clarify many of my thoughts on the subject. Yeah, some of my thoughts are a bit wacky, but life should be a bit wacky if we want to learn new things.
My blog “Features Part 1: Are Features the new Data?” created quite a stir, and lots of learning for me! In that blog, I proposed the “Data-to-Features-to-ML Models-to-Use Cases” topology that highlighted the need to integrate the disciplines of Data Management, Data Engineering, Feature Engineering, Data Science, and Value Engineering (Figure 1).
Figure 1: “Data-to-Features-to-Use Cases” Value Topology
Yeah, Figure 1 was sort of a mess. I erred in trying to cram too much into a single slide. So, let’s deconstruct Figure 1 into three new slides to support my original points on features:
Let’s start the deconstruction process.
The first step in deconstructing Figure 1 is simply to show the relationships between the different data and analytic constructs that help organizations optimize their key business and operational use cases (Figure 2).
Figure 2: “Data-to-Features-to-Use Case” Value Topology
As depicted in Figure 2:
Note: there is nothing to prevent the mathematical transformation (and blending) of existing features to create newer, higher-level features. That supports features as economic assets that can be shared, reused, and continuously refined, which I’ll talk more about in a future blog.
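To make the note above concrete, here is a minimal sketch of transforming and blending existing features into a newer, higher-level feature. Everything in it is an assumption for illustration: the patient fields, the `lifestyle_load` feature name, the scaling ranges, and the blending weights are hypothetical and do not come from Figure 2 or from any clinical source.

```python
# Hypothetical patient record; field names and values are illustrative only.
patient = {"weight_kg": 82.0, "height_m": 1.75, "systolic_bp": 138.0, "is_smoker": 1}


def bmi(weight_kg, height_m):
    """Feature created by a mathematical transformation of two raw data elements."""
    return weight_kg / height_m ** 2


def scale(value, low, high):
    """Min-max scale a value into [0, 1], clipping anything outside the range."""
    return min(1.0, max(0.0, (value - low) / (high - low)))


def build_features(record):
    """Return base features plus one blended, higher-level feature."""
    features = {
        "bmi": bmi(record["weight_kg"], record["height_m"]),
        "systolic_bp": record["systolic_bp"],
        "is_smoker": float(record["is_smoker"]),
    }
    # Higher-level feature blended from the existing features.
    # The ranges and the 0.4 / 0.4 / 0.2 weights are assumptions, not clinical guidance.
    features["lifestyle_load"] = (
        0.4 * scale(features["bmi"], 18.5, 40.0)
        + 0.4 * scale(features["systolic_bp"], 90.0, 180.0)
        + 0.2 * features["is_smoker"]
    )
    return features


features = build_features(patient)
print(features)
```

Because `lifestyle_load` is itself just another entry in the feature dictionary, it could be shared across models or blended again into yet another feature, which is the reuse-and-refinement idea in the note.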
Let’s now apply the value topology depicted in Figure 2 to a healthcare example.
Let’s say that you are a healthcare provider (hospital) seeking to apply data and analytics to optimize three key operational use cases: reducing unplanned hospital readmissions, improving patient satisfaction, and reducing average length of stay. We apply the “Thinking Like a Data Scientist” methodology (hint-hint-hint) to bring the domain experts into the feature engineering and ML model development processes. Together, we create Cardiac Risk and Diabetes Risk predictive scores that we can apply at the individual patient level to make patient care decisions that support those three use cases (Figure 3).
Figure 3: Healthcare Provider Example of Data-to-Features-to-Use Case Value Topology
From Figure 3, we can see the following data and feature relationships:
The final step in deconstructing Figure 1 is to define a more holistic “Data Management” discipline to activate the Data-to-Features-to-Use Case value topology (Figure 4).
Figure 4: Expanding the Discipline of Modern Data Management
I’m a firm believer – and reflect that belief in my writings and teachings – that we need to “up the stakes” in how we reinvent data management as a critical business discipline necessary to support today’s data economy (see my blog “Reframing Data Management: Data Management 2.0”). I believe that the modern data management discipline requires the blending of the following loosely coupled data and analytic practices:
Figure 5: How Design Thinking Can Fuel Feature Engineering
One other complementary discipline that I should call out as a critical component of the modern data management discipline is Design Thinking. Design Thinking creates a culture of empowerment that democratizes ideation across domain experts and the data science team in identifying those variables, metrics, and features that might be better predictors of behaviors and performance (Figure 6).
Figure 6: Design Thinking: The Empowerment and Democratization of Ideation
If features and feature engineering are a key to creating analytics that deliver relevant, meaningful, and quantifiable business outcomes, then empowering the domain experts and integrating them into the data science process early is key to success. And that’s exactly what Design Thinking (and “Thinking Like a Data Scientist”) seeks to accomplish.
My initial blog “Features Part 1: Are Features the new Data?” generated a maelstrom of comments, which motivated me to write this blog to clarify my original propositions on the “Data-to-Features-to-Use Cases” topology. While I hope that this blog continues to generate dialogue, comments, and more learning, I am also preparing to write two additional blogs on features:
So, watch this space for more on the important topic of features.
 Note: Features are input variables or data elements used by models to make predictions (e.g., (1) women (2) under 25 (3) who smoke tobacco). Feature Selection is the process of selecting a subset of relevant features (or a feature set) for use in ML model construction.
 SHapley Additive exPlanations (SHAP) supports Feature Importance, which helps you estimate how much each feature of your data contributed to an ML model's individual predictions.
 Weights and biases are the learnable parameters of some machine learning models, including neural networks. Weights control the signal (or the strength of the connection) between two neurons or nodes. In other words, a weight decides how much influence the input will have on the output.
 Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format for better decision making.
 Data munging is the process of transforming and mapping data from one "raw" data format into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
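 The notes above on Shapley-based feature importance and on weights and biases can be illustrated with a small sketch. This is not the SHAP library; it is a brute-force Shapley calculation over every feature ordering, which is only practical for a handful of features. The `predict` model, its weights and bias, the patient `instance`, and the `baseline` record are all hypothetical values chosen for illustration.

```python
from itertools import permutations


def predict(x):
    """Hypothetical linear risk model: a bias plus one weight per feature (made-up values)."""
    return 0.1 + 0.03 * x["age"] + 0.5 * x["is_smoker"] + 0.02 * x["bmi"]


def shapley_values(predict_fn, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings.

    A feature counts as "absent" while it still holds its baseline value and
    "present" once it has been switched to the value from `instance`.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict_fn(current)
        for name in order:
            current[name] = instance[name]   # switch this feature on
            updated = predict_fn(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


instance = {"age": 24, "is_smoker": 1, "bmi": 27.0}   # the patient being explained
baseline = {"age": 45, "is_smoker": 0, "bmi": 25.0}   # assumed reference patient
contributions = shapley_values(predict, instance, baseline)
print(contributions)
```

 The contributions add up to the difference between the prediction for this patient and the prediction for the baseline patient, so they explain a single prediction rather than the model's overall accuracy. For a linear model like this one, each contribution reduces to the feature's weight times its distance from the baseline value.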