Turning something raw into something industrially valuable has always required two things: science and engineering. The science is our attempt to explain and predict the behavior exhibited by some complex system and capture those explanations in the form of testable models. The engineering looks to mechanize those modeled concepts into usable tools that make a direct impact on society.

As we move into the information age, our definition of what we consider valuable is shifting to something more intangible. The industrial revolution showed us how machines could take over menial and repetitive tasks and elevate society to better standards of living. But in the information age the bottleneck to productivity is no longer inadequate assembly lines or slow equipment. We now look to machines to produce something that is less physical and more information-driven. This is because society has extended its human nature beyond its physical limitations, to the point where we now communicate on a planetary scale. We gain knowledge of our world through digitally curated and organized news feeds. We promote our ideas through online videos and blogs. We run our financial institutions, our governments, our hospitals, our schools and our businesses using collected and communicated data.

Producing something of value now requires the ability to move from large amounts of raw collected data to something that elevates our standard of living by allowing us to use that information in new and effective ways. But this requires more than simply scaling our existing machines into this new world of rich and varied data. This new world represents entirely novel complex systems that we need to understand. In order to produce value we need models of its behavior so we can capture those new-found concepts in our machines. If we want to turn this new "oil" into something useful we can't just find better ways of digging it out of the ground. We need to understand what makes this oil exhibit its behavior so we can mechanize that knowledge and produce real value. This requires science.

Enter the Data Scientist: a new kind of scientist charged with understanding these new complex systems being generated at scale and translating that understanding into usable tools. Virtually every domain, from particle physics to medicine, now looks at modeling complex data to make its discoveries and produce new value in that field. From traditional sciences to business enterprise, we are realizing that moving from the "oil" to the "car" will require real science to understand these phenomena and solve today's biggest challenges.

Unfortunately, along with this increased demand for 'science on data' comes an accompanying ambiguity with regard to what it means to be a data scientist. If you peruse LinkedIn you'll see heated debates about the term and articles on its meaning. Is it statistics, mathematics, computer science, machine learning, artificial intelligence, or just theoretical physicists applying their skills to the real world? You'll get someone in every crowd who throws up the air quotes around 'science', proudly informing the masses that it's all hype and that it will pass soon enough (interestingly, these people are never scientists). As the demand for understanding our data rises, we see universities offering courses and organizations offering programs and certificates, all in an attempt to fill the skills gap. Compounding the problem is the fact that eager students looking to jump on the data science bandwagon range from particle physicists to MBA graduates, increasing the risk that the domain will be diluted with such a range of skills that it becomes challenging to pin down what a data science resume actually looks like.

So we can turn to some of the popularizers of the term, like Jeff Hammerbacher, and try to get clarification, but when and why a term was coined is independent of what the title actually represents today. We can look to academic gurus in particular fields like statistics, machine learning, or scientific computing, but these represent different tools we utilize, and there is no “data science field” in the traditional sense, at least not yet. A particular tool or approach therefore cannot possibly give us the “true” definition.

We need to settle the term once and for all and focus on developing the right skills and attracting the right talent. We need a self-consistent 'guiding light' that gets us past the hype and the demand, and distills for us what it really means to be a data scientist. Only then will real ROI be available to organizations. Only then can we apply this fascinating new area to some of our world's biggest challenges and pave the way for future practitioners. It turns out the answer is right in the title.

What it means to be a Scientist

Regardless of what industries are currently fueling the demand, or what skill sets happen to be sexy today, there is one thing that cannot be argued: if you are going to use the word scientist in your title, you are going to be held accountable for it.

This has been the case throughout all of scientific history. Before our traditional fields of science laid out their foundations in self-consistent theories they were not considered what we now call science. Astronomy was mere star gazing, chemical synthesis was alchemy, biology was bird watching. What makes something scientific is attaching the phenomena you are studying to some self-consistent model that is independent of opinions or subjective interests.

If I collect butterflies and pin them to styrofoam, this does not make me a scientist. If I collect data related to those butterflies, like their physical attributes, their life spans, their flying patterns and their mating behavior, this still does not qualify me as a scientist. If, however, I take those data and produce some conceptual abstraction in the form of laws or rules that capture the behavior of that phenomenon in the form of testable models, then I am doing science. These models can be passed on to future generations of scientists so that they may "stand on the shoulders of giants" and build a career around improving our understanding of the phenomenon.

As we apply the scientific method to other areas we can of course start to debate its application. We can, for example, discuss the differences between the so-called "hard" and "soft" sciences but this is a rather antiquated notion. Virtually every scientific field has "hardened" its approach and what is considered crucial is the grounding of a field's fundamental concepts in testable models. This means our ideas are fundamentally mathematical in nature in order to provide some ground-level guarantee that there is self-consistency to our theories.

This does not restrict the sciences to the pure language of mathematics, nor does it even require every scientist to be well-versed in its application. After all, as we move away from fundamental physics we must introduce an increasing number of approximations and assumptions into our models, often to the point where what we are looking at looks nothing like a mathematically grounded theory. But regardless of the field of study there must always remain some mathematical backbone that underlies our ideas and assures us that we have a universally self-consistent approach to exploring our world. Another way of saying this is that no scientific discipline proclaims that its theories violate the laws of physics, our most fundamental science. Whatever the level of complexity or abstraction, science can test every attempt at building models against a reductionist mathematical understanding of the world.

What holds all of these disciplines accountable to the word science is the use of testable models that provide the most up-to-date rules of how we think a phenomenon produces its behavior. In all cases of science, we build testable, mathematically grounded models to explain and predict the behavior of some complex system, and it is this activity that gives us our definition of doing modern science.

What it means to be a Data Scientist

In order to attach the word science to data we must show that data can represent a complex system that exhibits behavior, and that we can explain and predict that behavior using our instruments of choice: namely, computers. These two requirements, in addition to the individuals who have traditionally entered the domain before it became popular, provide us with our best guide as to what it means to be a data scientist.

Data Representing a Complex System

We can enter into the debate about what exactly we mean by "complex" but for the sake of any practical argument we can say that any system producing unobvious behavior is complex. In other words, if some phenomenon produces behavior in a way that is not immediately obvious, it requires a simplified approximation, a model, to explain and predict how it achieves that behavior.

The data we collect from sensors, websites, detectors, and any other device is being generated from phenomena. We are organisms on a planet interacting in complex ways producing behavior that is anything but obvious. A scientist may have a preference as to what kind of phenomena they wish to study, but science doesn't 'care' what tickles your fancy; science looks to explain and predict complex systems and the data we capture and study in data science has been generated from a complex system. Its 'unobviousness' is why we look to scientists to try and figure out how that behavior was manifested because that discovery is what makes building new technology that acts on data possible.

Explaining and Predicting Behavior

Can we really build models from all these data we are collecting? Well, of course. After all, this is no different from any other science. All sciences collect data, whether from butterfly collections, particle accelerators, chemical analyses, MRI scans, or disease propagation. Data are simply recorded activity that was generated from some underlying complexity. To build a model means turning those 'recordings' into a consistent collection of testable concepts that explain and predict the activity we are observing.
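To make 'testable' a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the recordings are made-up (wing length, flight speed) pairs, the straight-line model is just one possible hypothesis, and the helper names are hypothetical. The idea is only that a model is fitted on some recordings and then has to survive a test against recordings it has never seen.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_squared_error(predict, xs, ys):
    """Average squared gap between predictions and recordings."""
    return mean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])


# Hypothetical recordings: invented wing lengths and flight speeds.
# The fixed "noise" pattern stands in for measurement error.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
wing_length = list(range(10, 26))
flight_speed = [0.5 * x + 2.0 + noise[i % 8] for i, x in enumerate(wing_length)]

# Build the model on half of the recordings, test it on the other half.
train_x, train_y = wing_length[::2], flight_speed[::2]
test_x, test_y = wing_length[1::2], flight_speed[1::2]

slope, intercept = fit_line(train_x, train_y)


def model(x):
    """The fitted hypothesis: speed grows linearly with wing length."""
    return slope * x + intercept


baseline_value = mean(train_y)


def baseline(x):
    """A 'no model' guess: always predict the average training speed."""
    return baseline_value


model_error = mean_squared_error(model, test_x, test_y)
baseline_error = mean_squared_error(baseline, test_x, test_y)

# The model is only worth keeping if it predicts unseen recordings
# better than the naive baseline does.
model_survives = model_error < baseline_error
print(f"model error: {model_error:.3f}, baseline error: {baseline_error:.3f}")
print("model survives the test:", model_survives)
```

The held-out comparison is the point of the sketch: the fitted line is a claim about the phenomenon, and the unseen recordings give that claim a chance to fail.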

Why Scientists are attracted to Data Science

So we have complex systems generating data that we analyze to build testable models that explain and predict some unobvious behavior. 'Scientifically speaking', then, we don't seem to have an issue with attaching the word science to data, and why would we? ALL of science is attached to data. In addition, the majority of individuals who have entered data science are indeed full-blown scientists looking to apply their skills outside the ivory towers; if data science weren't considered a real science, that would be a lot of scientists all of a sudden reconsidering what they find interesting and important. Keep in mind, scientific researchers have devoted at least 10 years of their lives to a degree that isn't known for its large paychecks after graduation. These are individuals who love exploring phenomena and building new models that will add to their field of study. Applying their skills outside the ivory tower means getting to see their work have a more immediate impact, something rarely if ever seen in academia. Data science is becoming an excellent alternative for trained researchers to do what they love.

So what's the problem? Why is it becoming difficult to identify what it means to be a data scientist? One word. Demand.

The Diluting Power of Demand

Science has become the "cool" kid in the real world, and it shouldn't be surprising why. If what we value in the information age is the ability to convert our new "oil" into something different, something intangible, it will require, like every industrial innovation before it, a deep understanding of the mechanisms underlying that oil's behavior. If we can understand and predict the behavior of markets, education, healthcare, government, weather, and traffic flow, just imagine the challenges we could address and the new products we could build.

The software of tomorrow isn't programming 'simple' logic into machines to produce some automated output. It is using probabilistic approaches and numerical and statistical methods to 'learn' the behavior and act accordingly. The software of tomorrow is aware of the market in which it operates and takes actions that are in line with the models sitting under its hood: models that have been built from intense research on some underlying phenomenon that the software interacts with. Science is now being called upon to be a directly involved piece of real-world products and for that reason, like never before in history, the demand for ushering in science to help enterprise compete is exploding.

As exciting as all this is, the hype for data science has attracted a lot of attention. Some of this is great. More attention means more opportunity to explore data-driven solutions to new and interesting challenges. But with it has come the usual bandwagon-jumping from vendors as they look to quickly capitalize on the excitement, leading to more confusion as people buy into the supercharged hype.

Some vendors throw data science under the analytics umbrella, confusing organizations as to the difference between data science and, say, optimization of web content for search engines. Others try force-feeding pre-built models into existing business intelligence (BI) applications, leading people to think BI and data science are the same thing. Some visualization vendors tout that their software for querying and visualizing large datasets allows you to do data science. Then there is the so-called Big Data push with platforms for working with large amounts of data, leading companies to assume that if they simply install Hadoop or Spark they are integrating data science into their operations - as if scaled computing somehow equates to scaled science.

The problem with all of the above scenarios is that there is little attention being paid to whether or not science is actually taking place. Are testable models being built using real research on the underlying complexities that an organization is attempting to understand and anticipate? Are the models employed actually mapping an algorithmic approach to the pain points of the organization, or are they simply the ones that came with a scaled recommendation engine?

No organization is immune from complexity and, as a result, no pre-built solution or pre-packaged software is going to turn an organization into an entity that competes scientifically. Doing data science takes real research by real scientists devoted to building models and integrating those models into real-world applications. There has never been a shortcut to doing science; that isn't about to change because of promoted software sitting on top of large datasets.

And so we are in need of that 'guiding light' to move us past the hype of vendors, and the diluting power of demand. We need to educate organizations that are looking for talent on what a data science resume should look like. We need to understand data science for what it is, not what we want it to be. We need to own up to the title, and lay the foundations for future generations wanting to understand and model this new and exciting data-rich world.

Owning Up to the Title

To own up to the title of data scientist means practitioners, vendors and organizations must be held accountable for using the term science, just as is expected from every other scientific discipline. What makes science such a powerful approach to discovery and prediction is the fact that its definition is fully independent of human concerns. Yes, we apply science to the areas we are interested in, and we are not immune to bias and even falsification of results. But these deviations from the practice do not survive the scientific approach. They are weeded out by the self-consistent and testable mechanisms that underlie the scientific method. There is a natural momentum to science that self-corrects, and its ability to do this is fully understandable because what survives is the truth. The truth, whether in line with our wishes or not, is simply the way the world works.

Opinions, tools of the trade, programming languages and 'best' practices come and go, but what always survives is the underlying truth that governs how complex systems operate. That 'thing' that does work in real-world settings. That concept that does explain the behavior with enough predictive accuracy to solve challenges and help organizations compete. This requires discovery, not engineered systems, business acumen, or vendor software. Those toolsets and approaches are only as powerful as the science that drives their execution and provides them their modeled behavior. It is not a product that defines data science, but an intangible ability to conduct quality research that turns raw resources into usable technology.

As a data scientist your allegiance is to science; not machine learning, statistics, database technology, or business practices. All of these are critically important but not the 'guiding light' that leads us to objective discovery. We cannot equate data science to a particular discipline or tool, because it is the investigation and the development of models that make us scientists. Tools, practices and languages can be learned, but having a passion and mind for discovery and experimentation is how science has always moved forward.

As we move forward and lay the foundations of data science we must ensure that we own up to the title of scientist. We must frame the building of new products and new technologies as something that requires real research. We must instill in organizations and practitioners alike that the only way to produce value, compete effectively, and invest in long-term solutions is to understand the complexity of the markets in which we compete.

This is the information age and what we value is the intangible ability to use information to better our lives. Our information has become a complex system and the only way to convert it into something of utility is to do real science. The ambiguity of the term data scientist is only superficial, and merely a byproduct of hype and demand. If we stick to the science, the ambiguity gives way to a solid approach to understanding behavior and building tomorrow's exciting products.

Comment by Dan K. Hansen on November 30, 2015 at 2:01am

What a good article! A long-awaited attempt at setting things straight and trying to get some order into all the mumbo jumbo.

Comment by kary bheemaiah on December 4, 2014 at 11:36pm

Great article. Really puts things into perspective. The term Data Scientist is being used today without people knowing what it means. This article does put things in context.

Comment by Qingbo Zheng on December 4, 2014 at 1:28pm

Well said!

Comment by Sarath Varma Datla on December 4, 2014 at 10:59am

Thanks @Sean for suggestions... I will check the referred post...

Comment by Sean McClure on December 4, 2014 at 10:21am

@Raymond, Thank you for the nice comment. I agree that this is a much needed message for our field and am always glad to see like-minded practitioners staying true to the science. Simpler methods using quality research practices should always be preferred over fancy algorithms that are not in line with real discovery. Machine Learning is a most excellent tool to use, but it should be used in the context of capturing real behavior and continual improvement. If the focus is on science, the right tools to use will become apparent as we attempt to solve the challenge, as your point on producing benefit explains. Thanks again.

@Sath, Thank you Sath, glad you enjoyed the article. The key to developing the required skills is to jump into real problems and solve them, as my previous article points out (http://bit.ly/1BFqFHF). Start competing in Kaggle or here on DSC. Grab their datasets and start writing R and Python to build models. The language of choice will arise naturally from the context of the problem. Whether or not you look to scale your solution towards Big Data will be a byproduct of the demands needed to solve the challenge. Focus on attempting to solve real-world problems instead of trying to engineer your future using the latest trends. You will see that what you learn are the most relevant and powerful technologies anyway. You will naturally gravitate towards many of the popular sought-after tools and techniques simply because these are sought for a reason: they work. But the key point is that you learned them solving real problems. Science isn't concerned with pedigree or hubris; it simply exposes truth. Be a part of discovering real solutions to real challenges and your resume and background will be exactly what they are supposed to be: a real-world example of a problem solver.

Comment by Sarath Varma Datla on December 4, 2014 at 9:57am

Great article, Sean... You gave a real scientist's perspective on data. What are your suggestions for young data scientists? What skills and methods require focus at an early stage of a data scientist's career?

Comment by Raymond Buhr on December 3, 2014 at 8:09pm

Really great article. One of the best I've read here at DSC. Thanks for taking the time to write this as I think it is fundamental to the core of our field, yet so often gets passed over in favor of shiny new tools.

From personal experience, some of my best work has come from simple statistics and common tools because the experimental design was really well thought out. Conversely, I have done some really interesting machine learning projects that just have not been able to provide much benefit.
