Summary: If you’re running a Data Science Team you need to be thinking about efficiency and productivity. Those solutions can take the form of management and process, but there are also some new tools you should be evaluating.
In our last article we talked about some organizational principles and tips for increasing the productivity of your data science team. Those ideas were in the realm of management and process. You can tell that our profession has reached a certain level of maturity when our combined effort as a team needs to be assessed for efficiency and productivity. While good process is critically important, there are also some new tools available.
Some Tools to Consider
Data Scientists, just like any other professionals, are proud of their skills and will tend to keep doing what they know works. However, unless you're making a real effort to keep up with the industry, you may be unaware of some really interesting productivity tools introduced in the last year or two that can have a big impact on your team.
These two examples are primarily of interest to companies in finance, insurance, telecoms, healthcare, retail, manufacturing, and ecommerce, where teams are constantly developing new models and refreshing old ones in support of new products, markets, and campaigns. If your group maintains in the range of, say, 25 to 50 models (but up to thousands or more) you should examine these tools.
Once a model enters production it's not uncommon to turn over its maintenance to junior data scientists who monitor its effectiveness and manage the refresh process. And in highly regulated industries like insurance and banking you may need to report on exactly what features you're using (or are allowed to use) if your model's function is to include or exclude certain customers. This means a kind of cradle-to-grave monitoring of each model from creation through refresh to retirement.
When the number of models reaches a large enough number, attempting to manage this on a spreadsheet will break down. These new model management tools:
- Create a centralized model repository with version control.
- Allow collaboration and review at each step of the modeling process starting with creation all the way through to retirement.
- Often provide customizable workflow features allowing supervising data scientists to see the currency, definition, and function of similar models.
- Enforce a common methodology like CRISP-DM bringing governance to the process.
- Can automatically report on a model's accuracy over time, signaling when a refresh is needed.
- If you are in a regulated environment, a model manager creates an auditable process to prove compliance with internal rules and external regulation. This can even include Basel II risk model validation reports to assess the soundness of internal credit risk measurement systems, tracking down anomalies and answering regulator inquiries on demand.
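As a sketch of the cradle-to-grave monitoring described above, here is a minimal Python example of a registry entry that logs monitoring scores and flags a model for refresh when accuracy drifts below its deployment baseline. The `ModelRecord` class, the use of AUC as the metric, and the 0.05 tolerance are all illustrative assumptions, not features of any particular product.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    """One entry in a hypothetical centralized model registry."""
    name: str
    version: int
    baseline_auc: float            # accuracy at deployment time
    history: list = field(default_factory=list)

    def log_score(self, auc: float, tolerance: float = 0.05) -> bool:
        """Record a monitoring score; return True when accuracy has
        drifted more than `tolerance` below the deployment baseline."""
        self.history.append(auc)
        return (self.baseline_auc - auc) > tolerance

# Usage: a churn model deployed at AUC 0.82 slowly degrades.
record = ModelRecord(name="churn_segment_a", version=3, baseline_auc=0.82)
print(record.log_score(0.80))  # False -- still within tolerance
print(record.log_score(0.75))  # True  -- flag for refresh
```

In a real model manager this refresh signal would feed the workflow and audit features listed above rather than a simple print statement.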
These days if you are an analytics driven company in any of the aforementioned industries such as finance, insurance, telecoms, healthcare, retail, and ecommerce, you have a very large number of customers and prospects. When the target market is large it's intuitive that breaking it into an increasing number of segments will enhance accuracy, as each model is hyper-personalized to smaller and smaller segments of the market.
The challenge in growing very large numbers of segments for a single campaign, market, or product is that each of those models needs to be separately developed and the factors that make each one more accurate are likely to be somewhat different from segment to segment. In short, you can’t simply take what you learned from the first model and apply it universally to all the other segment models. This means a lot of data science labor.
Congratulations, you have become a model factory. The good news is that there are tools, typically also identified by the name ‘model factories’ that automate modeling in a highly granular environment.
It works more or less like this: define the full set of prospects and the features on which you have previously determined they should be segmented. Tell the factory modeler what level of granularity you want, and the platform will automatically generate each of these models, dozens or hundreds of them, and even tell you when further segmentation is no longer profitable.
This also means that the specific algorithm that is optimal for each segment model may differ. Your factory modeler platform should be able to run a large number of different algorithms simultaneously and identify the champion model for each segment. The champion for one segment may use a different algorithm, or even different features, than the champion for another.
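The champion-selection step just described can be sketched in a few lines of scikit-learn. The candidate algorithms, the synthetic data, and the random segment assignment below are all illustrative assumptions; a real platform would run far more candidates against real segment data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a prospect base, split into 3 fake segments.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
segments = np.random.default_rng(0).integers(0, 3, size=len(y))

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

champions = {}
for seg in np.unique(segments):
    mask = segments == seg
    # Score every candidate on this segment's data and keep the best.
    scores = {name: cross_val_score(est, X[mask], y[mask], cv=3).mean()
              for name, est in candidates.items()}
    champions[seg] = max(scores, key=scores.get)

print(champions)  # one champion algorithm name per segment
```

Note that the winning algorithm can genuinely differ from segment to segment, which is exactly why the learning from the first model can't simply be applied to all the others.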
Some additional features you should find in your factory modeling platform:
- Standardized and automatically repeatable data blending, cleansing, and prep for each segment.
- A method of collaboration among data scientists to share templates and ideas as well as a documented history of development, deployment, and refreshes.
- Automated deployment into your operational systems.
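For the first bullet, scikit-learn's `Pipeline` is one common way to make data cleansing and prep identically repeatable for every segment build and refresh; the imputation and scaling steps chosen here are assumptions for illustration, not a prescription.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_segment_pipeline():
    """Return a fresh, identically-configured prep + model pipeline."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # standardize features
        ("model", LogisticRegression(max_iter=1000)),
    ])
```

Because every call returns a freshly configured pipeline, each segment's model is prepped exactly the same way, and the whole object can be versioned alongside the model in the repository.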
The factory modeler platform can greatly increase productivity and accuracy. It will also greatly speed up learning and sharing about the customer or prospect market you are tackling so that senior and junior data scientists can all benefit from the learning.
If you happen to be fortunate enough to lead a data science group, you can also take comfort in being able to centrally observe the process and even to put control criteria in place.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: