*Guest blog post by Pasha Roberts, Chief Scientist, Talent Analytics @pasharoberts*

Our prior article in this venue began outlining the business value of solving “the other churn” - employee attrition. We introduced the “quantitative scissors” with a simple model of employee costs, benefits, and breakeven points. The goal was to create a robust mental model for the cost of employee attrition.

In this entry, we will extend that model to tease out the factors that underlie attrition cost. With this work we hope to streamline the first step of CRISP-DM, “Business Understanding.” By understanding the underlying structure, analysts can systematically attack the problem rather than engage in an open-ended fishing expedition.

The histogram is a useful tool to see how attrition plays out in an organization. It is easy to produce from simple HR records, and the graphic tells a deeper story than simple averages or turnover rates. Most managers are able to understand histograms with some coaching.

Figure 1

*Figure 1* shows a basic histogram of tenure. The horizontal “X” axis maps out the years of tenure in a specific role. The vertical “Y” axis shows the number of employees who had that amount of tenure. We can see a stack of early departures in the first 9 months, then another rise later.
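As a rough illustration of how such a histogram can be built from simple HR records, the sketch below bins tenure into six-month buckets. The tenure values and the bucket width are made-up assumptions for illustration, not data from this article.

```python
from collections import Counter

# Hypothetical tenure at departure, in months, for one job role (made-up values).
tenure_months = [2, 3, 3, 5, 7, 8, 14, 20, 22, 23, 25, 31]
BIN_WIDTH = 6  # assumed bucket size, in months

def tenure_histogram(months, bin_width=BIN_WIDTH):
    """Count employees per tenure bucket; each key is the bucket's first month."""
    counts = Counter((m // bin_width) * bin_width for m in months)
    return dict(sorted(counts.items()))

hist = tenure_histogram(tenure_months)
for start, count in hist.items():
    # Text-mode bar chart: one '#' per departed employee in the bucket.
    print(f"{start:>2}-{start + BIN_WIDTH - 1:>2} months | {'#' * count}")
```

Even with a dozen invented records, the early stack and the later rise show up as two separate groups of bars.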

Figure 2

Figure 3

But this is deceptive. The top “double hump” pattern is actually the sum of two simpler employee clusters. “Good Fit” individuals in *Figure 2* left the role by being promoted, or being hired away. “Bad Fit” individuals in *Figure 3* left the role because of under-performance, problems with hours, or for disliking the work. It is not difficult to classify termination codes into business-oriented clusters, or to use machine learning for data-driven clusters. Of course the real world has more nuance and ambiguity, but the patterns are there to be found.
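A business-oriented classification can be as simple as a lookup table. The termination codes below are hypothetical, since real code lists differ from one HR system to the next; the point is only that each code maps to one cluster, with a fallback for codes nobody has classified yet.

```python
from collections import Counter

# Hypothetical mapping from HR termination codes to business-oriented clusters.
# These code names are illustrative only and come from no real HR system.
CLUSTER_BY_CODE = {
    "PROMOTED": "good_fit",
    "HIRED_AWAY": "good_fit",
    "UNDER_PERFORMANCE": "bad_fit",
    "SCHEDULE_CONFLICT": "bad_fit",
    "DISLIKED_WORK": "bad_fit",
}

def classify(code):
    """Return the cluster for a termination code, or 'unclassified' if unknown."""
    return CLUSTER_BY_CODE.get(code, "unclassified")

departures = ["PROMOTED", "DISLIKED_WORK", "HIRED_AWAY", "RELOCATED"]
cluster_counts = Counter(classify(code) for code in departures)
print(dict(cluster_counts))
```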

Figure 4

The histogram also mirrors how analytics is used to predict tenure - by modeling the probability that an individual will terminate at a given tenure. It is the same kind of “survival” problem as customer churn or medical outcome research. The outcome of an analytics model will be a density curve, *Figure 4*, much like the histogram, showing the probability of termination for “Good Fit” (blue) and “Bad Fit” (brown) employees, at each point of tenure. This simplified model uses the Weibull distribution, which is popular in this class of survival analytics.
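To make the Weibull curves concrete, here is a minimal sketch of the density and cumulative distribution for two clusters. The shape and scale values are assumptions picked for illustration; they are not the parameters behind *Figure 4*.

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull density at tenure t (years); valid for t > 0."""
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def weibull_cdf(t, shape, scale):
    """Probability of having terminated by tenure t (years)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# Hypothetical (shape, scale-in-years) pairs, chosen only for illustration.
CLUSTERS = {"good_fit": (2.5, 2.0), "bad_fit": (1.5, 0.25)}

for name, (shape, scale) in CLUSTERS.items():
    density_at_3_months = weibull_pdf(0.25, shape, scale)
    gone_by_1_year = weibull_cdf(1.0, shape, scale)
    print(f"{name}: density at 3 months = {density_at_3_months:.3f}, "
          f"terminated within 1 year = {gone_by_1_year:.1%}")
```

A smaller scale pulls departures earlier, which is how the assumed “Bad Fit” curve ends up concentrated in the first months.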

Figure 5

Next we return to the cost/benefit information in *Figure 5*, which we calculated in the previous blog entry. Different inputs will shift the shape of the cost and benefit curves, but it is inevitable that employees will have *some* net cost in the beginning, then face a breakeven point, then provide net positive value to the employer. This example is tuned to a short-tenured, fast-training job role, but you can design curves to meet your specific situation.

Figure 6

We sum these costs in *Figure 6*, to make a cumulative net benefit. The plot shows the net cost or benefit accrued by an employee if they get to a specific tenure. The red region shows the net cost until breakeven, after which more tenure is pure benefit.
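The cumulative curve and its breakeven point can be sketched with a running sum. Every input here is an assumption (a one-time hiring cost, a flat salary, and a linear ramp to full productivity), expressed in units of one month of fully-ramped-up productivity; these are not the inputs behind *Figure 5* and *Figure 6*.

```python
# All inputs below are hypothetical, in units of one month of
# fully-ramped-up productivity; they are not the article's model inputs.
HIRING_COST = 0.8     # assumed one-time cost at month 0
MONTHLY_SALARY = 0.5  # employee is paid 50% of full productivity
RAMP_MONTHS = 6       # assumed linear ramp to full productivity

def cumulative_net_benefit(horizon_months):
    """Return a list whose entry m is the net benefit accrued through month m."""
    cumulative = [-HIRING_COST]
    for month in range(1, horizon_months + 1):
        benefit = min(1.0, month / RAMP_MONTHS)
        cumulative.append(cumulative[-1] + benefit - MONTHLY_SALARY)
    return cumulative

def breakeven_month(cumulative):
    """First month at which the cumulative net benefit is non-negative, else None."""
    return next((m for m, value in enumerate(cumulative) if value >= 0), None)

curve = cumulative_net_benefit(36)
print("breakeven month:", breakeven_month(curve))
```

With these assumed inputs the curve dips for the first few months, then crosses zero in month 7; different inputs will move that crossing.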

Now we have a probability at each tenure point (*Figure 4*) and a cumulative net benefit at each tenure point (*Figure 6*). Borrowing some concepts from finance, we will calculate an Expected Value at each tenure: we simply multiply the probability of reaching that tenure by the net value of that tenure. Finally, we examine our mix of employee clusters - in this model we posit 60% good-fit employees, 40% bad-fit employees. We multiply each cluster’s Expected Value curve by this good-bad ratio, to get *Figure 7*: an **Expected Cumulative Net Benefit**.

Figure 7

This is a mouthful, but it is very useful to describe the business costs of attrition. As in *Figure 4*, the blue curve represents how we expect to derive value from “Good Fit” employees, and the brown curve shows how we expect to lose value from “Bad Fit” employees. The “Bad Fit” people all leave before they break even. Some of the “Good Fit” also leave before breakeven. But most stick around well past breakeven. A few “Good Fit” folks even make it past the three-year cutoff of these graphs.

The sum of the area under both Expected Cumulative Net Benefit curves gives us the overall expectation from hiring in this entire system, from all of the prior assumptions and models. We will call this the **Expected Value of Hiring, or EVH**.
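The whole chain, from probability to EVH, can be sketched in a few lines. This is not the R model from this series: the Weibull parameters and cost inputs are assumptions, "probability" is read as the chance of exiting in a given month, employees who outlast the three-year cutoff are ignored, and the result is in months of full productivity per hire rather than the percentage of potential benefit used in this article.

```python
import math

def weibull_cdf(months, shape, scale_months):
    """Probability of having left by the given tenure (Weibull, in months)."""
    return 1.0 - math.exp(-((months / scale_months) ** shape))

def cumulative_net_benefit(horizon, hiring_cost=0.8, salary=0.5, ramp=6):
    """Net benefit accrued through each month, in months of full productivity."""
    curve = [-hiring_cost]
    for month in range(1, horizon + 1):
        curve.append(curve[-1] + min(1.0, month / ramp) - salary)
    return curve

# Hypothetical cluster parameters: (mix weight, Weibull shape, scale in months).
CLUSTERS = {"good_fit": (0.6, 2.5, 24.0), "bad_fit": (0.4, 1.5, 3.0)}
HORIZON = 36  # three-year cutoff, matching the graphs

net = cumulative_net_benefit(HORIZON)

def expected_curve(weight, shape, scale):
    """Mix-weighted expected net benefit contributed by exits in each month."""
    curve = []
    for month in range(1, HORIZON + 1):
        p_exit = weibull_cdf(month, shape, scale) - weibull_cdf(month - 1, shape, scale)
        curve.append(weight * p_exit * net[month])
    return curve

curves = {name: expected_curve(*params) for name, params in CLUSTERS.items()}
evh_by_cluster = {name: sum(curve) for name, curve in curves.items()}
evh = sum(evh_by_cluster.values())
print(evh_by_cluster, evh)
```

Under these assumptions the “Bad Fit” contribution comes out negative and the “Good Fit” contribution positive, which is the qualitative pattern the curves describe.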

The higher the EVH, the better. Below zero means you are losing money with every hire. With these inputs, our model predicts that a “Good Fit” employee will deliver an EVH of 48% of their potential benefit. Our “Bad Fit” employees are predicted to deliver –17% of their potential benefit - a loss. At our mix of 60% “Good Fit” and 40% “Bad Fit”, the **overall Expected Value of Hiring is 22% of potential employee benefit**.
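The overall figure is just a mix-weighted average of the two cluster results, which is easy to check. The function name is ours; the percentages are the ones quoted above.

```python
# Figures quoted in the text, as whole percentages.
MIX_PCT = {"good_fit": 60, "bad_fit": 40}
EVH_PCT = {"good_fit": 48, "bad_fit": -17}

def blended_evh_pct(mix_pct, evh_pct):
    """Mix-weighted average EVH, in percent of potential employee benefit."""
    return sum(mix_pct[name] * evh_pct[name] for name in mix_pct) / 100

overall = blended_evh_pct(MIX_PCT, EVH_PCT)
print(f"overall EVH: {overall:.0f}%")  # 0.6 * 48 + 0.4 * (-17) = 22
```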

Value is measured as a percentage of an employee’s fully-ramped-up productivity. This 100% is ultimately divided into **three piles**: salary, EVH, and loss. In the models above, the employee was paid 50% of their productivity, so that `100% - salary%` is up for grabs by the business to maximize. Our “Good Fit” employees yield 48% EVH, which is very close to the potential of 50%.

In dollar terms, we tend to net `EVH%/salary% * $salary` dollars of value from an average employee in this role. It is tempting to divide again for `EVH/(100% - salary%)` for some kind of efficiency metric, but this is too much abstraction for today.
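As a worked example of both formulas, take the 22% overall EVH and the 50% salary share from this model. The salary of $40,000 is a hypothetical figure for illustration only, and the function names are ours.

```python
def net_dollar_value(evh_pct, salary_pct, salary_dollars):
    """EVH% / salary% * $salary, per the formula in the text."""
    return evh_pct * salary_dollars / salary_pct

def share_of_available_margin(evh_pct, salary_pct):
    """EVH / (100% - salary%): the 'efficiency' ratio mentioned in the text."""
    return evh_pct / (100 - salary_pct)

ANNUAL_SALARY = 40_000  # hypothetical salary in dollars, for illustration only

value = net_dollar_value(evh_pct=22, salary_pct=50, salary_dollars=ANNUAL_SALARY)
efficiency = share_of_available_margin(evh_pct=22, salary_pct=50)
print(f"net value: ${value:,.0f}; efficiency: {efficiency:.0%}")
```

At that assumed salary, the formula nets $17,600 per average hire, and the business captures 44% of the margin left after salary.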

In the real world, when you reduce salary, you will reduce market demand for the role and increase turnover, while saving money in the short run.

Employment, and business in general, is not a laboratory environment. We don’t get do-overs for failed scenarios, and our ability to “try things out” is limited. Customer analytics is slightly more amenable to A/B testing, just because the relationship is thinner, and there are often many customers.

With this model, we are able to try out different approaches, so that predictive analytics can pursue the right target. We can move sliders to examine modeled outcomes, rather than hiring and firing thousands of workers. Of course “all models are wrong,” but we have found this one to be useful. The next blog entry will examine the sensitivity of the output (EVH) variable to our 9 input variables, and lay out data science inquiries into several different hiring situations.

Data Science is popularly thought of as an inductive process, and it may seem odd to lay out concepts before collecting data. In practice, the best data science is not an open-ended, free-ranging search for vague patterns. The most powerful data science is directed at a specific business problem, with a clear understanding of the underlying elements of the problem. If we listen, the data will tell us where our preconceptions of those elements are wrong, and we can continue to evolve.

That understanding is our goal, so that we can slash the cost of employee attrition, create a happier workforce, and deliver superior business ROI.

We have made this model available in R on GitHub. You can run it in the free and powerful RStudio, with interactive sliders to change inputs, recalculating EVH and new graphs on the fly. Console-based R (my workhorse) does not support the needed manipulate library. We are working on a web implementation as well.

In the spirit of collaboration and learning, we have put this code, over 500 lines of R, up on GitHub so that other researchers can download, experiment, and engage. If you don’t like our Weibull distributions, you can swap in a log-logistic or whatever you want. If you want to create a U-shaped cost curve, go ahead. You can share your progress back to us with a “pull request”, or “fork” your own variant. If you find a bug, create an “issue.” Keep us posted. GitHub can be an important resource for collaboration in quantitative research - we encourage practitioners to dig into it.

You can find instructions and code at https://github.com/talentanalytics/churn201 . We will continue to build up this model as an engine for this series. Please engage!
