More and more often, organizations confuse team roles on a data science or "big data" project and end up assigning too many responsibilities to data scientists. For example, data scientists are often asked to do the job of a data engineer, which misallocates human capital: the data scientist spends precious time and energy finding, organizing, cleaning, sorting and moving data. The solution is to add data engineers, among others, to the data science team.
 
Data scientists should be spending their time and brainpower on applying data science and analytic results to critical business issues: helping an organization turn data into information, information into knowledge and insights, and valuable, actionable insights into better decision making and game-changing strategies.
 
Data engineers are the designers, builders and managers of the information or "big data" infrastructure. They develop the architecture that helps analyze and process data in the way the organization needs it. And they make sure those systems are performing smoothly.
 
Data science is a team sport. There are many different team roles, including: 
 
Business architects;
Data architects;
Data visualizers;
Data change agents.
 
Moreover, data scientists and data engineers are part of a bigger organizational team including business and IT leaders, middle management and front-line employees. The goal is to leverage both internal and external data - as well as structured and unstructured data - to gain competitive advantage and make better decisions. To reach this goal an organization needs to form a data science team with clear roles.
 

Comment by Elesin Fuad Olalekan on May 31, 2015 at 5:58am

Very informative post. I want to pursue a career in data engineering. Which online courses can get me started?

Comment by Sean McClure on April 14, 2015 at 8:27am

There is a very distinct difference between Data Scientist and Data Engineer, although the two fields should work very closely together, as both are required to make any data-intensive project a success. As an analogy, two things are required to make a car run: one is the science of understanding combustion, and the other is the machine that capitalizes on that science to produce motion. Any field that must convert a raw resource into something valuable always requires both science and engineering.

Scientists are concerned with building models that capture the behavior of some complex system. In data science we use collected data (a raw resource) from a domain of interest (say retail) to build models that capture the underlying patterns governing the domain's behavior. Having a model that captures the behavior means we can now engineer a system that capitalizes on that science and turns it into a working tool. Without a working model, there would be nothing to engineer, and without a tool to exploit the model, there would be no point in building models for the domain.

It is important that people understand the distinction so that they know the talent that is needed to fulfill the requirements of a project. You need a scientist because scientists know how to model complex systems and capture the behavior of various domains inside mathematical abstractions. Since mathematical abstractions can be expressed as algorithms, we can house models inside software. You also need an engineer because, to make the software work on an ongoing basis, it must sit on top of a data pipeline that continuously cleans and prepares data in a way that is amenable to retraining the model and taking action in the real world. The pipeline also needs to ingest large amounts of data and be able to distribute the scientist's model across many pieces of commodity hardware.

Neither of these roles should be confused with the data analyst who is charged with looking at data to uncover interesting stats regarding the health of a company.  This means presenting a report or dashboard to highlight interesting trends.  This has nothing to do with automating decision-making because automating decision-making requires a model. With a data analyst, the onus is still on the decision-maker to decipher what all the trends mean. 

The data scientist only looks at stats "en route" to making a model. It is the model that allows decision-making to be automated. This is why a PhD is typically required for the data scientist role, as this person must understand how to build useful models using the techniques of scientific computing and mathematics.  This is their training.  The engineer on the other hand is tasked with making sure those models can live inside real-world enterprise applications. 

The 3 roles can complement each other as follows:

The Data Analyst often understands where the data lives and how it relates to the domain. Having a data analyst work with the data scientist can be very productive. It gives the data scientist access to someone who can help define what the data is and what simple trends they have found. Knowing these simple trends can assist the data scientist in building a model that will capture the domain's behavior. It gives clues as to how the data scientist will need to clean and prepare the data to make the model accurate. It also allows the data scientist to see where domain expertise can be used to help inform the model. Having gone through the process of cleaning and preparing the data and building a working model, the data scientist can then sit with the data engineer so that the engineer understands what the data pipeline and architecture must do to make the model work at scale.

The 3 roles are very distinct, but also very complementary.

Comment by Alex Esterkin on August 7, 2013 at 5:46pm

How is "Data Scientist" different from "Data Analyst"? On your diagram, 'data mining' is in the 'Information Presentation' group. Why? 'Data Mining' is a standard term (http://en.wikipedia.org/wiki/Data_mining) and seems to be used in the wrong context here.

 

Comment by Dominic Delmolino on July 13, 2013 at 10:01am

A good data engineer sets up data in a way that enables analysis by the data scientist -- for me, this means placing the source data into structures based on the scientist's analysis requirements in a way that supports and enables efficient analysis of the data. Yes, finding, cleaning and moving data is important and can be done by junior data engineers, but a good senior data engineer provides value above and beyond simple data movement and cleansing.

Comment by Robert Lovett on July 8, 2013 at 7:19pm

How about "Data Detective?" Data Engineer with a dose of whimsy and intrigue.

Comment by Michael Jannis Pedersen on July 8, 2013 at 6:52am

So data scientists are on the strategic level and data engineers are on the tactical / operational level. I understand the distinction, but as always, communication across the planning hierarchy is an important criterion for success.

Comment by Richard Ordowich on July 8, 2013 at 6:25am

I prefer a Data Miner. I think of data as a resource (akin to a natural resource, but data is not governed by any laws of nature).

The analogy to mining carries forth. Mining requires exploration. Mining can pollute as well as provide economic and social benefit. Working in mines can be dangerous. Mining equipment is needed to increase efficiency. You can come up with nothing in mining. You can exploit the mined resource in many ways.

Comment by Djoni Darmawikarta on July 8, 2013 at 6:08am

How about data analyst, a mixture of scientist and engineer?

Comment by Richard Ordowich on July 8, 2013 at 6:01am

I will avoid the question of what the difference is between a data scientist and a data engineer and suggest that, regardless of these roles, what is lacking in most organizations is a data philosopher.

What are the principles that govern the use and exploitation of data? What are the origins of the data (truth, sense-making, etc.)? Without this role we have people who can manipulate data but lack the skills to interpret and understand it.
