Summary: Management values the self-starting, data-driven, curious, and urgent characteristics that define the Citizen Data Scientist. But the path to encouraging these individuals also requires setting limits and risk procedures of a wholly new type: procedures that will protect the organization so that bad analytic conclusions don’t become bad financial outcomes for the company.
Thanks to Gartner, the term ‘Citizen Data Scientist’ has become ingrained in our professional literature, in the strategies of analytic platform developers, and even in the strategies of companies committed to advanced analytics.
Gartner also recently forecast that the ranks of the Citizen Data Scientist would grow five times more rapidly than the ranks of actual data scientists.
The pervasive rationale behind these arguments is that there are not enough data scientists to go around and/or that they’re too expensive compared to data analysts and even statisticians. As a result, most of the organizational strategies for integrating advanced analytics look like the ‘sprinkle it around’ school. We’ll take our tough-to-find and expensive-to-hire data scientists and most often use a sort of centralized or decentralized center-of-excellence structure to get the most out of them.
Coming from the analytics industry side, plenty of Advanced Analytic Platform developers have noticed this too. Happy days. That many more customers to use our software – if only we can make it simple enough for a non-data scientist to use. Therein lies the rub.
It’s no secret that in many types of model building we data scientists spend 80% of our time on blending, cleaning, transforming, and imputing bad or missing data. Of that 80%, maybe a quarter (about 20% of total time) is creative work (discovery and creating synthetic features come to mind), but the rest admittedly does not use our deepest skills. So many Advanced Analytic Platform developers are moving to automate as much of this as possible, and some that I’ve tested have been reasonably successful.
That’s good for the data scientists. Working faster on the parts of the process that require the least of our skills. But now that it’s fast and looks relatively easy, that also makes our advanced tools much more tempting for non-data scientists to use. Is that really what we want to encourage Citizen Data Scientists to do?
Citizen Data Scientists – Who They Are and Where to Find Them
This question is not as strange as it may sound. There are actually some competing definitions of who might qualify as a Citizen Data Scientist. But first there’s the issue of where they can be found at all.
When we talk about Citizen Data Scientists we’re not talking about the merely data-curious. We’re talking about the contrast between Citizen and full data scientists, specifically that an organization has both types coexisting at the same time.
If as a company you think you can rely solely on Citizen Data Scientists without any of the real pros, you haven’t really understood the nature or the benefit of advanced analytics. This is not an amateur undertaking. Maybe your Citizen Data Scientists can build a bookcase, but that doesn’t mean they should be trusted to build a house.
Mostly in Larger Companies
If you accept that Citizen Data Scientists exist only where there is some commitment to advanced analytics, this means the organizations where they’re found tend to be quite large because, unfortunately, adoption of advanced analytics among SMBs is still pretty paltry. So the conversation we should be having is how these two differently skilled resources relate and what the most efficient and beneficial use of their skills is.
There are actually two quite different definitions of the Citizen Data Scientist that I see regularly.
Data-Driven Manager: Typically this is a self-starting individual who recognizes that advanced analytics enables his path to his business goals. He is fact driven, has been actively using many types of lower level analytics like Excel and perhaps data viz, and is proactive in pursuing more and better data and more and better analytic tools. There is an urgency and curiosity here that makes waiting for help from the data scientists or the IT data analysts too slow for him.
Up-Skill Data Analyst: This person already works with data and probably carries an analyst title. They are both curious and ambitious and see that acquiring data science skills is good for the organization and particularly for them personally.
The traits of both these types are characteristics that management values: data driven, proactive, urgent, innovative, problem-solving. Beyond that, it’s the level of commitment to learning the tools that separates these Citizen Data Scientists from the merely data-curious.
Both the aspiring Citizen Data Scientist and their management may underestimate the learning curve, though with increasingly automated analytic platforms that curve appears to be getting shorter.
The second half of that commitment is that, in order to develop and retain these skills, the Citizen Data Scientist needs to consistently spend time both with this problem type and with the tools. Particularly with the Data-Driven Manager, this may be more time than they anticipated and may conflict with other demands on their time. However, if they do keep up their commitment they may become a valued asset.
Here’s the core of the risk. Just because the advanced tools now look easy and fast to use doesn’t mean a novice can use them accurately. I can think of a dozen ways a non-data scientist can get erroneous or directionally wrong results using these tools and I bet you can too. After testing some of these new highly automated analytic platforms, here are just a handful of the problems they don’t address and neither will their untrained user.
- Using the wrong data: You blended together those two fields that are both labeled ‘sales’ but one is in units and the other is in cases.
- Failure to correctly clean the data: The automated platform says it took care of that but it turns out that all those zeros weren’t really nulls, they had meaning.
- Failure to correctly transform the data: Too many potential mistakes to count here. How about this one: the automated platform correctly identified a categorical data item that had 1,000 different values, which it blew out into 1,000 unique columns, and you’ve only got 5,000 observations. Think that might overfit?
- Whoops that outlier it eliminated wasn’t a mistake, it was an entire small market segment it missed because it didn’t segment and cluster first.
- Hey, my linear regression got an r^2 of .85. Too bad the automated platform didn’t check to see that the data was curvilinear and heteroscedastic, so the result is just plain wrong.
And these are just the tall tent poles in the pantheon of how to get the wrong answer. As we all know, this just scratches the surface of things-gone-wrong.
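To make two of the problems above concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the function names and the synthetic data are invented for this example and do not come from any particular analytic platform. The first check counts how many observations stand behind each column when a categorical field is blown out, and the second fits a straight line to deliberately curved data, then counts how often the residuals change sign, a crude stand-in for the residual plot a data scientist would look at.

```python
def rows_per_category(n_rows, n_categories):
    """Average number of observations behind each blown-out column."""
    return n_rows / n_categories


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def residual_sign_changes(xs, ys, intercept, slope):
    """Count sign flips in the residuals once the points are ordered by x.

    Residuals from a well-specified line flip sign often; a long run of
    one sign followed by a long run of the other hints at curvature.
    """
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    signs = [r > 0 for r in residuals if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Synthetic, clearly curved data (an assumption for illustration): y = x^2.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope, r2 = fit_line(xs, ys)
flips = residual_sign_changes(xs, ys, intercept, slope)

print(f"R^2 = {r2:.2f}, residual sign changes = {flips} of {len(xs) - 1} possible")
print(f"rows per blown-out column = {rows_per_category(5000, 1000):.0f}")
```

On this toy data the straight line scores an R^2 of about 0.93 even though the relationship is a pure parabola, and the residuals change sign only twice in ten opportunities. The categorical example leaves about five observations per column. Neither number proves anything on its own, but both are the kind of warning sign an untrained user is unlikely to go looking for.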
Nurturing, Feeding, and Controlling the Citizen Data Scientist
Now we get to the heart of the matter. The personality traits these Citizen Data Scientists possess are exactly what management encourages. No executive group is going to slap back these self-starters and tell them to wait until the data scientist gets there. So given that we should be encouraging and supporting these folks, how do we do that and protect the organization against seriously wrong analysis and conclusions at the same time?
There are some positive things that management should be doing. These include:
- Empower broad access to data, as well as data blending, data viz and basic analytic tools. Let our Citizen Data Scientists dig in so long as it fits in their time budget.
- Recognize and reward them for their positive contributions.
- Recognize people with good analytic instincts and provide them with training and opportunities to engage at a deeper level.
- Encourage the Citizen Data Scientist to reach out for help and advice from your data science staff.
- Experiment. Start small and follow what works.
Here is where we have to make our executive team smart about the capabilities and limitations of their new analytic crew. I propose three principles of organization:
Principle 1: Encourage Data Blending and Data Viz: Encourage Citizen Data Scientists to use data and visuals to explain results. This will reinforce a valuable data-driven culture. Tools and training for blending and data viz should be made widely available.
Principle 2: Limits – If Your Citizen Data Scientists Produce Analytic Models Make Sure Their Financial Impact Is Understood and Have them Checked by an Actual Data Scientist.
If the analysis and recommendations derived by a Citizen Data Scientist point the way to a unique solution to a business problem, allow it to bloom so long as the outcome is not fiscally material to the organization. As soon as a solution becomes financially significant or represents a major departure from current practice, create a team to check results including a data scientist.
Consider whether you want to let your Citizen Data Scientists access advanced modeling tools where the risk of a wrong answer in the absence of deep data science knowledge is high, no matter how automated the platform may appear. If you do decide to make these tools available:
- Never implement an embedded predictive model created by a Citizen Data Scientist in an operational system, business process, or workflow until it has been confirmed by an actual data scientist.
- Never put an analytic application developed by a Citizen Data Scientist into production with live data sets until it has been confirmed by an actual data scientist.
A less than optimal model will underachieve the results that could have been obtained. A completely wrong model can cause the organization to take the wrong action 100% of the time. These financial risks are simply too great.
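For organizations that want to write this principle into a checklist or an approval workflow, the rule can be sketched roughly as follows. Everything here is hypothetical: the field names, the `review_gate` function, and the materiality threshold are my own placeholders, and each company would need to set its own definition of what counts as financially material.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    NEEDS_REVIEW = "needs review by a data scientist"


@dataclass
class AnalyticProposal:
    # Hypothetical fields chosen for this sketch.
    estimated_impact: float            # expected financial effect, in company currency
    major_departure: bool              # breaks with current practice?
    goes_to_production: bool           # embedded in an operational system or run on live data?
    confirmed_by_data_scientist: bool  # already checked by an actual data scientist?


def review_gate(proposal: AnalyticProposal, materiality_threshold: float) -> Decision:
    """Apply the Principle 2 limits to work produced by a Citizen Data Scientist."""
    if proposal.confirmed_by_data_scientist:
        return Decision.PROCEED
    # Nothing unconfirmed goes into an operational system or onto live data.
    if proposal.goes_to_production:
        return Decision.NEEDS_REVIEW
    # Material or unconventional recommendations get a review team first.
    if abs(proposal.estimated_impact) >= materiality_threshold or proposal.major_departure:
        return Decision.NEEDS_REVIEW
    # Small, low-risk insights are allowed to bloom.
    return Decision.PROCEED


small_idea = AnalyticProposal(5_000, False, False, False)
big_idea = AnalyticProposal(2_000_000, False, False, False)
unchecked_model = AnalyticProposal(5_000, False, True, False)

for proposal in (small_idea, big_idea, unchecked_model):
    print(review_gate(proposal, materiality_threshold=100_000).value)
```

The point of the sketch is the ordering of the tests, not the numbers. Confirmation by a data scientist unlocks everything, production use without it is always blocked, and only small, conventional findings are left free of review.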
Principle 3: Make Sure Your Data Governance and Security Protects Critical Data
Review and ensure your data governance and controls around access to data are where you need them to be, especially with regard to the inappropriate disclosure of sensitive customer, employee, or financial data which your Citizen Data Scientist may purposefully or accidentally uncover.
Phrases like ‘democratizing data’ or ‘democratizing analytics’ are very popular among the vendors of these highly simplified analytic platforms. Be aware that executive management must balance the nurturing of analytic, data-driven individuals with a concern for serious risk management.
Discovery of new analytic driven insights is exciting for everyone. You don’t realize business benefit until those insights are implemented in operations. Depending on the size of the financial and organizational impact a new analytic insight may make, it is up to executive management to exercise a type of oversight with which they’re not familiar. That is, how to continue to encourage many employees to use data and analytics while taking procedural steps to ensure that an analytic mistake doesn’t become a serious economic mistake.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: