The application of pattern recognition technology to large datasets has revolutionised the digital economy. But digital represents only 5% of GDP in OECD countries: the remaining 95% is still largely untouched by data science (DS). The larger “old economy” companies are just beginning their data journey and data science is yet to be institutionalised: outside the tech leviathans, DS is still a cottage industry, with artisan DS crafting bespoke prototypes to their own standards.

If DS is to fulfil its promise, it needs to industrialise. This blog explains what I mean by this, and proposes a number of issues which must be addressed if it is to do so.

Most DS blogs are technical: algorithms, distributed computation, visualisation etc. The rest are case studies of projects where these techniques are applied to a domain. I would like to look beyond these issues, interesting and important as they may be, to the general structure of what we are trying to do here: applying automated pattern recognition systems to real world problems in an industrial way.

To the extent that there are any articles on industrialisation, they are written by consultants and business school people who have never written a Spark job or run a regression. This leaves them somewhat empty, often substituting jargon for insight. And they often take a polarised position for effect: “Data Science will save the world”; “Data Science is dead”; etc. So this is my attempt to frame a debate on the basis of practical experience. Let me know what you think.

The core questions are as follows: How can we turn an activity into an industry? How can we build a DS framework that meets the needs of businesses and people? How can we ensure quality? How can we build sustainably? How can we be responsible? How can we as DS benefit from the value we create?

The note below highlights a number of areas. I guess there will be more. My intention is to address each idea in turn for a blog of its own. At some point.

A common definition of Data Science

Everyone seems to mean something slightly different by the term “data science”. And then they disagree about who is a “real” data scientist and who is a fake. If we can’t agree on who we are and what we do, what hope have the businesses who might want to use our skills? We also need to be able to articulate why what we do is genuinely different to the analysts, data engineers, software developers and quants who used to occupy the “brilliant nerd” space, and who think they still do.

I am not even sure if Data Scientist and Big Data really deserve capitalisation.

My view is that there is a spectrum of different activities that could validly claim to be data science, from analysts who run decision tree models on R, to data engineers with a copy of Sqoop and Oozie, to guys building Bayes networks in Spark.

A common understanding of the dimensions and boundaries of our space, and a taxonomy of roles that exist within it will enable DS to explain ourselves and our value to the world. Without it the perception of DS will revert to the lowest common denominator.

Recognising the importance of production Data Science

In my view, the value of DS is not just in the impact of analytics, but in the ability to execute analytics at an unprecedented level of detail and to deliver that content direct to users on demand. Business Analytics serves (relatively few) actionable insights to (relatively few) managers. Data science technologies allow Amazon to recommend a different subset of products directly to millions of individuals. You can’t do that manually.

A key component of the value of DS is automation, the only way to reduce the marginal cost of production low enough for this to be viable. In my team we put as much emphasis on automation as we do on pattern recognition and distributed computation. We don’t think about delivering actionable insights, but about applications which deliver actionable insights. I get the feeling that this is not universal yet.

Production is not just about delivery and latency. DS apply complex techniques to large data sets. If the results are going to be used in business critical situations, how do you know they are right? Many DS still use hard-to-test code such as SQL. Some use testable code, but haven’t adopted test frameworks. Some are yet to use version control or continuous integration. Industrialisation of DS will mean that we raise standards of quality to the level of those used in production software engineering, and it is important to realise that we aren’t there yet.
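As a rough illustration of what "testable" can mean here, the sketch below takes an aggregation that might otherwise sit inside an unchecked query and writes it as a plain Python function with a test next to it. The function name, the field names and the sample rows are all hypothetical, invented for this example; it is one possible shape, not a prescription.

```python
from collections import defaultdict


def total_spend_by_customer(transactions):
    """Sum the 'amount' field per 'customer_id'.

    Hypothetical stand-in for a GROUP BY query. The field names are
    assumptions made for this sketch, not part of any real schema.
    """
    totals = defaultdict(float)
    for row in transactions:
        # Fail loudly on data the business logic does not expect.
        if row["amount"] < 0:
            raise ValueError(
                f"negative amount for customer {row['customer_id']}"
            )
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def test_total_spend_by_customer():
    # Small, hand-checkable sample (hypothetical data).
    sample = [
        {"customer_id": "a", "amount": 10.0},
        {"customer_id": "b", "amount": 5.0},
        {"customer_id": "a", "amount": 2.5},
    ]
    assert total_spend_by_customer(sample) == {"a": 12.5, "b": 5.0}
    # An empty input should give an empty result, not an error.
    assert total_spend_by_customer([]) == {}


if __name__ == "__main__":
    test_total_spend_by_customer()
    print("all checks passed")
```

Once the logic lives in a function like this, it can be kept under version control and run automatically by a test framework or a continuous integration job on every change.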

Exploration vs Production

The recognition of production as a key element of value implies the following tension. On the one hand data science is exploratory: you are employing analytical techniques to play around with data to find out new things, and to do this you need freedom. On the other hand data science is production-oriented: you need to build robust applications which deliver consistent quality. Ultimately, the exploration can only be evaluated in production, and the exploratory phase must be adjusted in light of what you’ve learned. So you shouldn’t have two discrete phases where models are designed and then implemented. You need to design as you build and build as you design. Industrial data science must reconcile this tension by developing practical approaches to team structure and working practices, and by adopting technologies that enable exploratory scientists to develop production code without compromising intellectual momentum. In particular we need to think about how the Agile Methodology can be adapted for data science, and how to document the process. My colleagues have been working on the Agile DS manifesto which I cannot wholeheartedly endorse, but I think it is a worthwhile first attempt (see http://www.datasciencemanifesto.org).

Embracing changing technologies

Techy nerds tend to discuss technological choices ad nauseam, and DS are no exception. Each DS joins a tribe and rants at the other tribe, betraying that the decision is not as important as the implied investment following from that decision. I defend Scala/R/Python not so much because of its superiority, but because I have spent years learning that technology, and if it was not the best, I would be an idiot. “People’s Front of Judea? Splitters!”

All of this is as it should be, and good fun. But there is a more important challenge for DS as an industry. DS has been built upon rapid technological change which shows no sign of slowing down. Open source technologies are engaged in chaotic constant revolution. Even the technologies which win survive for only a few years before the world moves on. Anyone want to use Map/Reduce or Hive? Where will Spark and Yarn be in a few years? What will happen to a business’s stock of analyses built in obsolete environments?

How should a business manage the adoption of technological change? To be clear, this is not a question of choosing individual technologies that will last: they won’t. How can a business embrace technological development and yet at the same time preserve the pre-eminence of industrial production? If a firm ignores change through picking winners, long procurement and deployment cycles, it risks losing the race. If it adopts every technology that comes along it will bequeath a hotchpotch of incompatible applications.

The answer probably lies in moving away from the specifics of individual implementations towards shared design standards, in particular a move towards APIs and reactive services, and common coding/development practices. The principles of Minimum Viable Product (MVP) and the Single Responsibility Principle (SRP) will become increasingly important, as will the reality of continuous improvement and refactoring. You need to design your applications so that components can be adapted to embrace new tech. What those practices are is up to us.
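One way to picture "components that can be adapted" is to have application code depend on a small interface rather than on a particular implementation. The sketch below is a minimal illustration in Python under that assumption. Every name in it (Recommender, PopularityRecommender, HistoryAwareRecommender, homepage_items) and all the sample data are hypothetical, made up for this example.

```python
from typing import Dict, List, Protocol, Set


class Recommender(Protocol):
    """The contract that application code depends on (hypothetical)."""

    def recommend(self, user_id: str, k: int) -> List[str]:
        ...


class PopularityRecommender:
    """First implementation: the same top-k popular items for everyone."""

    def __init__(self, items_by_popularity: List[str]) -> None:
        self._items = items_by_popularity

    def recommend(self, user_id: str, k: int) -> List[str]:
        return self._items[:k]


class HistoryAwareRecommender:
    """Later implementation: skip items the user has already bought."""

    def __init__(
        self, items_by_popularity: List[str], purchases: Dict[str, Set[str]]
    ) -> None:
        self._items = items_by_popularity
        self._purchases = purchases

    def recommend(self, user_id: str, k: int) -> List[str]:
        seen = self._purchases.get(user_id, set())
        return [item for item in self._items if item not in seen][:k]


def homepage_items(engine: Recommender, user_id: str) -> List[str]:
    """Application code written against the interface only."""
    return engine.recommend(user_id, k=2)


if __name__ == "__main__":
    items = ["kettle", "lawn mower", "toaster"]
    old_engine = PopularityRecommender(items)
    new_engine = HistoryAwareRecommender(items, {"u1": {"kettle"}})
    # The calling code is unchanged when the engine is swapped.
    print(homepage_items(old_engine, "u1"))
    print(homepage_items(new_engine, "u1"))
```

The point of the design is that homepage_items never changes when the engine behind it is replaced, so a team could swap in a newer technology one component at a time rather than rewriting the whole application.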

Data Science in old organisations

Data science offers great promise to businesses, but its implementation places equally great strain on those businesses’ organisations. For digital business models to generate returns they need to do things differently, and that’s hard.

As information becomes a core component of a firm’s products, so the problems of legacy data systems are intensified. Layering Hadoop, unstructured data and machine learning on top of a failing RDBMS with an outdated schema is a recipe for disaster. But fixing legacy systems is difficult. Your firm has probably already tried and failed. DS needs to reach out beyond its comfort zone to help data architects adapt their tried-and-mistrusted designs for the modern age. IMHO the answer does not lie in grandiose data lakes, but in simplifying data warehouses and building APIs to serve de-normalised data to users. My team has been developing a Spark design to achieve this, which we can share soon.

Equally the Agile principles mentioned above may strain your firm’s current delivery architecture. Companies will fight to disintermediate and serve customers directly and to do that business units will want to control the full delivery pipeline themselves. If you want to try out a new product or a new price you need to deliver it tomorrow not in line with an existing quarterly release cycle. DS moves analytics to the front office. We need to become the product owners.

All this has unresolved implications for where DS teams sit within organisations: do they sit in IT, Analytics, or should they be embedded in the business itself? What is the balance between embedding for relevance and centralising for technical excellence? Can large organisations live with the freedoms (admin access, Linux, access to production data etc) expected and required by scientists but which are normally denied to analysts and developers? How can a business manage the potential for conflict with incumbent conventional IT and analytical functions who may want a piece of the Big Data action but perhaps don’t have the right expertise or working patterns?

In my experience, DS tend to spend too much of our energies navigating organisations inappropriately configured for innovation. As we industrialise, the question will be: how much of the change needs to come from DS and how much from the organisations themselves?

Communication with business leaders

DS are sometimes unwilling communicators. We are often more interested in finding answers and building applications than telling the world about them. And when we do try to explain what we are doing, we often bore generalists with technical details. Most DS role specs include communication skills as a key attribute, but it is often traded off against technical skill when both aren’t available in one unit.

But should all the onus be on DS? Just as business organisations must adapt to the centrality of data to their product lines, so business leaders need to engage more deeply in technical issues. Many CEOs will defer to their CIO on all matters technical, challenging only on matters of budget and delivery. Can you imagine Zuckerberg being so passive? CEOs are not idiots, but they have become lazy and over-reliant on IT advisors. This is not good enough.

Communication is a reciprocal process. DS need to become better storytellers and business leaders need to become better storylisteners.

How can DS help the boardroom understand DS issues? It should not come down to individual charisma. In my experience business leaders learn from a combination of their own experience and stories: case studies, taught in business schools and written in journals like HBR. There is a real shortage of case studies which focus on the kind of engagement that CEOs need to have with our technology. Many of these articles are glorified sales pitches, where success is reported as having flowed exclusively from the engagement of some smart consultants (who wrote the case). But in reality, CEOs need to be able to understand what different DS approaches will do to their core business. They can’t just decide on budget and then outsource. Can you imagine doing that with your other core products?

Equally, we need to find a way of making this technology exciting without resorting to hype. How can we evangelise yet keep expectations realistic? How can we stay truthful and spot Big Data frauds (you know, the guys who hang around conferences and write presentations on Big Data, but are yet to run a Map Reduce job)?

Part of this is developing a business language which embraces the technical nature of what we do. We need to find a way of communicating which does not shy away from the decisions and implications but which allows business leaders to link the technology challenges to business outcomes.

I think there is some way to go yet.

A career path for Data Scientists

If we are going to build a DS industry, we are going to need some data scientists (see taxonomy question above). Everyone knows about the supposed shortfall of analytical talent expected over the next 5 years. To meet this shortfall we need to find and develop talent, but first we will need to agree about the roles and skillsets that make up a data science team.

Firstly, what does it take to be a DS? What skills and aptitudes do they need to have? How can you identify a good one? What credentials should a DS have? In particular can you create a DS via a specialised course of learning like an MSc in Data Science (I am not so sure about this one)?

Secondly, how can businesses offer DS career paths in an organisation? DS are human capital. When they walk out of the door they take value with them. Businesses have become bad at managing technology talent, encouraged by the offshoring/outsourcing narrative in which technical talent is a commodity. In my experience, DS are remarkably long term in their outlook. Most of them want to make a difference in their organisations, and they recognise that this can take time to effect. To keep these people, non-tech businesses need to develop career paths that recognise the contributions that DS make over the long term.

The days when talent was prepared to work for nothing in bad conditions for the sake of an interesting problem are over. At tech-enlightened firms you can do cool work and put your kids through college: you don’t have to become a general manager to go up the pay scale. Businesses that don’t address this may struggle in a data-science-led world.

But it’s not just down to the businesses to change. If DS is to industrialise, we need to come to a common understanding of what is expected of the DS and what is expected of the business. If you want to make an impact with DS in a corporation, you may have to wear a shirt and turn up before 10am. It’s a tough world out there.

The economics of Data Science

Every business function has to pull its weight and data science is no exception. At least DS are in a good position to measure their impact and so make a claim on the value they generate. But perhaps a more difficult question is that of resource allocation. Where should DS resource be applied in order to generate greatest value? In particular, is it better to have lots of DS applications which perform moderately, or should resource be concentrated on a few very highly performant applications? Conventionally, businesses are thought to display diminishing marginal returns; that is, for each additional unit of effort, you get a lower return. Some people summarise this in an “80:20” rule. You don’t get much benefit from going for the final few percentages of performance.

But it is not clear that all DS applications exhibit this behaviour. In my opinion (but it would be great to measure this, somehow) the value of a personalisation engine is not in doing the easy part (I am a man therefore I like lawn mowers) but in the difficult part (Each Friday I travel to an area where there is a mosque, therefore I might want to buy a present at Eid).

As we industrialise, we will need to help businesses allocate scarce resources to DS projects.

Data Science and society

DS like to get their hands on data and make use of it for interest and profit. But we are increasingly aware of our duties to the originators of that data (in some sense the “owners” of that data). There are all sorts of issues around consent and privacy and a significant risk that data, if it gets into the wrong hands, can be used to the detriment of those people.

DS have a duty of care to protect private data, and yet we have neither a common, agreed set of values about this nor an equivalent set of practical steps to ensure that data isn’t compromised.

An industry body to promote Data Science

The standard of public debate about DS is low, with hysterical newspapers making hyperbolic claims about the danger of this new technology. This may well become more problematic as the DS industry establishes itself and the technology becomes pervasive. DS needs to find some kind of forum through which it can get its message across, and an industry body might be a good way to do this.

I guess there may be a few more Big Issues. Let me know what you think.

I will do what I can to address these issues in more depth in the coming weeks.

Tags: data, science

Comment by msbikk on May 25, 2016 at 4:52am

Great post Harry.

I think the problem of seeing DS at industry scale will be quite complex in terms of the talent needed to build it. The reasons are well stated in this article. Knowledge to build such an industry and knowledge to consume the products of DS are also critical. Personally, I think it's hard to sell DS concepts and solutions to stakeholders who lack the vision or background to see the benefits that DS can bring to their organizations, and who keep their distance from science. Maybe that is because I'm a novice in this area, enrolled to do a Masters in DS while working full time for a Microsoft Partner in the area of Analytics.

Comment by Bruno Polach on May 19, 2016 at 7:49am

Thanks for sharing your thoughts Harry, very comprehensive article, covering quite a few aspects. DS is only bound to grow in relevance - however in order to become 'pervasive', which is the term (as you remember) used a few years ago in relation to BI - pervasive across the organization, meaning people are buying into it, using it, kind of being in the same boat. There are calls for Humanization or Democratization of DS - I am a proponent of these, including story-telling, which you rightly mentioned as well. Not everyone is born to crunch data sets in Spark or Python, thus we need to make DS comprehensible to wider audiences - metaphors, comparisons, hyperboles - language, semantics = vehicles of mutual comprehension, whilst deepening the expertise of core DS specialists (hats off), who are actually able to play with the code - then we just translate (simplification, there is more than meets the eye of course) and communicate the meaning to the rest of us = popularization.

Comment by Saheed Badru on May 4, 2016 at 4:12pm

Very interesting insight Harry.
I agree with you on the points you raised about communication alignment between business managers and Data Scientists. I am new to DS and currently learning, but from my IT background I strongly believe that standardization of production DS, such as adopting the Agile approaches you mentioned earlier, will go a long way. I look forward to subsequent insights.

Thanks for this wonderful article!

Comment by Harry Powell on April 21, 2016 at 11:27pm

Thanks Don,

Great point about the interpretability of models. Sometimes a decision tree can be effective because it is easier to trust. Diving straight in with a complex neural network may improve precision but it may also frighten away your stakeholders.

On a broader point, thanks for contributing to the debate! Your two cents are worth so much more than that. DSC seems to be a site for Lurkers. Come on guys. Get involved.

H. 

Comment by Don Grust on April 21, 2016 at 9:08am

Great article about what I also believe is an important problem (or opportunity). Adding my two cents, it helps me to frame this as an adoption problem, i.e. how to bring DS beyond the early adopters to those who are more risk averse, such as the older organizations and even many business leaders you mention in your post. One aspect of this would be to do a better job of explaining the models. Black boxes may be more accurate but "trust me" may not work for the more risk averse. Thank you for your thoughtful post.
