There’s a lot of literature on learning the technical aspect of data science: statistics, machine learning, data munging, big data. This material will serve you well when starting out or working under a lead. But what about when you are ready to spread your wings and lead a project yourself or embark on a project independently? Here you need a different sort of storytelling – the type that communicates why you are working on a project, what the value is, and what you have accomplished. Without these skills, you run the risk of aimlessly seeking a solution without much to show for it. The last thing you want is to be a deer in the headlights when someone asks you what the business value of your work is. Pair the 3 Vs of big data with the 3 Ps of model development to increase the success rate of your project. Read on to learn how to detail the problem, the process, and your progress on any data science project.
In the real world, problems are often not well defined; it is up to the practitioner to define them. Compare this to many classroom settings and entry-level positions that spell out every last detail of the work. This is the equivalent of color-by-number coloring books: you are given the problem and the method, and your job is strictly execution. This can be an effective approach for learning a subject but less so for solving actual problems, where things are more open-ended.
At some point in your drawing career you graduate from this detailed instruction and move to coloring books without numbers. The problem is still given to you, but now you choose the method. Hence, you have to decide what colors to use. More importantly, a “successful” drawing is now contingent upon whether you choose good color combinations.
Finally, you outgrow coloring books altogether. What happens now? Instead of a line drawing, you are given a blank sheet of paper. It is up to you to define the problem. Here you have the greatest freedom but also the highest risk of failure.
This progression from color-by-number to an empty sheet of paper isn’t so different from the maturation of a (data) scientist. First you learn the techniques. Then you learn how to apply the techniques to problems given to you. Finally, you define the problems yourself. As your career advances, your success will be contingent on transforming blank sheets of paper into something valuable, i.e., identifying opportunities from data. Hence, your first challenge is defining the problem.
There are a number of equally valid ways to ask this question:
- What problem are you solving?
- What is the purpose of this project?
- What is your goal?
The answers need to be specific. Often they will sound like a use case or user story, which takes the form of “I want to do X because Y.” This also helps you identify who the beneficiary of the project is. If you don’t know what you are solving or who benefits, you will most certainly fail, as your project becomes indistinguishable from entertainment.
As you work through the problem definition, you may find that there are numerous people who benefit, each with a slightly different problem that can be solved by the same model or analysis. If this is the case, it’s necessary to prioritize the problems and focus on the most important one. This is particularly important when developing models. When problems are conflated, you’ll find that it’s much harder to find a solution. So this is a form of simplification.
Some of you might protest that prioritization is not part of your responsibilities. That may be true, but as you advance in your career, you will be expected to not only lead initiatives but drive new ones. That means knowing how to prioritize.
Once you know what you’re drawing and who it’s for, how do you go about creating the drawing? This is your process, or method. In data science, it usually follows the scientific method. That means you need to have a hypothesis and a way to test that hypothesis. Your model will likely be built on a number of such hypotheses.
As in acting, there are numerous methods, and no single approach trumps the others. That said, your process needs to contain at least the following elements:
- Data – where are you getting it, how complete is it, what biases exist?
- Theory – what is your high-level thesis for the model, i.e., what relationships is the model exploiting to make an inference?
- Evaluation – how do you know if your model is (in)effective?
Documenting your process in advance will clarify your thinking and make it easier for your collaborators to understand and review your approach.
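One lightweight way to make a process concrete is to record each hypothesis alongside the metric and pass/fail criterion you agreed on in advance. The sketch below is illustrative only: the `Hypothesis` class, field names, and the churn/AUC example are assumptions for demonstration, not anything prescribed by the article or by a particular library.

```python
# Sketch: documenting model hypotheses as explicit, testable checks.
# All names here are illustrative assumptions, not a standard API.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Hypothesis:
    statement: str                       # the relationship the model exploits
    metric: Callable[[Dict[str, float]], float]  # extracts an evaluation metric
    threshold: float                     # success criterion, fixed up front

    def evaluate(self, results: Dict[str, float]) -> bool:
        """Return True if the results support the hypothesis."""
        return self.metric(results) >= self.threshold


# Hypothetical example: recent user activity predicts churn, and the model
# must beat a baseline AUC of 0.65 for this line of work to count as progress.
h = Hypothesis(
    statement="Recent user activity is predictive of churn",
    metric=lambda results: results["auc"],
    threshold=0.65,
)

print(h.evaluate({"auc": 0.71}))  # criterion met: keep going
print(h.evaluate({"auc": 0.58}))  # criterion missed: revisit theory or data
```

Writing the threshold down before modeling begins is what makes a dead end recognizable later: a failed check is a signal to revisit the theory or the data rather than to keep tuning indefinitely.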
With process out of the way, the only thing left to do is the “actual work”! When you’re waist deep in data, how do you know how much progress you’ve made? In our coloring example, if 3/4 of the picture is colored, there’s 1/4 left to do. Easy. Puzzles offer another example. Here the outstanding work is not simply a function of the work completed, because puzzles have easy bits and hard bits. Whether you go for the pain early or late determines how much work is left.
The same is true of models. Not only that, but models have dead ends, like mazes. With models, it’s not always clear when you reach a dead end. So communicating how much progress you’ve made and how much work is left is a difficult exercise in its own right. This is where having a sound process comes into play. Detailing your hypotheses and your validation criteria makes it easy to know when you’ve hit a dead end. It also provides guidance on whether there are alternate paths and what work is left.
Despite data science being a team sport, it’s important for practitioners to be able to work on tasks and projects independently. Not only is this good for your career, it is good for your collaborators in the business and technology groups. Using the 3 Ps method described above will help you crystallize your thinking and improve your communication with others. Mastering this technique may even reduce the number of dead ends you encounter, quickening the journey to completion.
Have your own process? Share your method or tips you have for effectively leading data science projects in the comments.
Brian Lee Yung Rowe is an Adjunct Professor at the CUNY MS Data Analytics program and also the founder of Pez.AI, a chat-based data analysis and computing platform with a conversational AI interface. This article is reformatted for this blog. View the original on Cartesian Faith.