Data science has become an integral part of the 21st-century organization. Almost all companies, whether large, medium or small, want to use data science to meet their business imperatives by
- Improving their current processes for better operational efficiency, thereby reducing costs
- Growing current business and opening new revenue streams
- Streamlining supporting functions including HR
There are many more industry applications and avenues to explore than those listed above. Even so, setting up and successfully running a data science practice is something most organizations have not yet fully achieved.
Evidently, the majority of data science projects today fail. Of the successful ones, a large percentage are not used as they should be, because they are not considered accurate enough.
There are many reasons for this, but one of the most notable is the inability to get the different teams in an organization on the same page, collaborating and working on these projects jointly.
Why this matters so much has been largely ignored. All the focus is on building something that is academically correct by the book, uses open source and looks complex enough to justify the money and effort spent.
While data scientists can solve a business problem in a way that satisfies the above criteria, what they cannot do on their own is give business context to the solution approach and intermediate results while also building objects that end users can easily understand, interpret and use. As a result, business users may accept a solution initially but will not be able to use it: they do not understand the complex objects that were created and cannot relate to the approach. When they cannot use it, they start to question its accuracy and eventually stop using it altogether.
On the other side, data scientists and data engineers need the active involvement and participation of business users so that they can maintain the business relevance of their models. Consider a scenario in which a data scientist is analyzing the data and notices a few outliers or special conditions. Most of the time, they do not know how to deal with these. They might want to suppress or retain those data points depending on the business value and context behind them. The information on what these special data points represent in a business context, and whether they can be suppressed without losing any information, must come from business.
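The outlier scenario above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: it assumes an interquartile-range rule (a common convention, not something this article specifies), and the function name, the sample figures and the "needs business review" label are all hypothetical. The point it shows is that unusual data points are flagged for business users instead of being silently dropped.

```python
from statistics import quantiles


def flag_for_business_review(values, k=1.5):
    """Return the data points that fall outside the IQR fences.

    Hypothetical helper: nothing is removed from the data. Unusual points
    are only listed so that business users can decide what they mean.
    """
    # Quartiles of the data (assumption: IQR rule with multiplier k).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [
        {"index": i, "value": v, "status": "needs business review"}
        for i, v in enumerate(values)
        if v < low or v > high
    ]


# Made-up daily order counts; the last value is unusually large.
daily_orders = [10, 12, 11, 13, 12, 11, 95]
for record in flag_for_business_review(daily_orders):
    print(record)
```

Whether a flagged point is a data entry error to suppress or a genuine event to retain is exactly the decision that should come from business.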
Similarly, active IT involvement is very important, as IT provisions the resources needed for successful project execution. Data science models are computationally intensive, so any shortfall in computing resources will lead to unacceptable turnaround times. Likewise, if the required clean data is not available at the required velocity, the project will ultimately fail.
Hence, we see why it is of utmost importance to have all stakeholders (data scientists, engineers, IT and business users) on the same page. This is easier said than done because of the diverse skill sets, bandwidth and priorities across the stakeholder group.
While management directives can help align priorities and bandwidth, the diversity of skill sets is a huge impediment. It makes it hard for stakeholders to collaborate on a common problem, share ideas, propose solutions and jointly review results at every stage to ensure that end goals are met and that the product can be easily understood and used.
Hence, apart from the various productivity and project management tools on the market, for data science projects specifically it is valuable to onboard all key stakeholders onto a standard citizen data science tool, which should have the following characteristics
- An easy-to-use, intuitive drag-and-drop graphical user interface for data exploration, data discovery and visualization
- Ability to define key data preparation tasks step by step and incrementally
- Availability of common data science algorithms out of the box, so that users who are not comfortable coding in R or Python can still get a feel for the different categories of data science problems, how they can be solved and how a solution can be deployed
- Ability to port generated code and objects to an open source data science platform, made possible by open standards such as PMML
Apart from the specific capabilities above, the following generic ones will be helpful as well
- Easy collaboration features so that users can quickly share objects across teams and send notifications and comments while working on the same project simultaneously
- A documentation interface to generate easy-to-read documents for sharing and presenting what has been done
- A health check and resource consumption monitoring agent to estimate the potential complexity of the models and hence the resources that might be needed at different stages of the project
These features will allow the stakeholders to discover, define and work on a common problem, refining it step by step and finalizing the data and solution approach. Once the work reaches an advanced stage, the objects already created can be ported to a full-fledged data science platform for data scientists and engineers to take over.
This will make business users aware of what is being built, why it is being built and how it is going to work in production.
By that time, the IT team will also have an idea of the required resources and can start provisioning the necessary infrastructure, both computing resources and data integration points.
Data scientists and engineers will also have a better idea of the comfort level of business users in terms of skills and expectations, and can create simpler, more intuitive objects accordingly. Moreover, whenever they face special conditions or want business to review their approach, they can create intuitive visualizations and objects for business users, who are now better informed and can make sense of them based on the work they have done in the citizen data science tool.
So, we have seen why having business users, IT and the data science team on the same page is so important, yet so difficult. Even so, it can be achieved by carefully selecting and deploying a citizen data science tool and letting all stakeholders get accustomed to it gradually. The idea is to inculcate data science thinking within the organization, rather than running data science projects in silos with a few skilled data scientists and engineers as the sole custodians of the data science assets.