
One of the main differences between a data scientist and a data engineer has to do with ETL versus DAD:

  • ETL (Extract/Transform/Load) is for data engineers, or sometimes data architects or database administrators (DBAs).
  • DAD (Discover/Access/Distill) is for data scientists.

Data engineers tend to focus on software engineering, database design, production code, and making sure data flows smoothly between the source (where it is collected) and the destination (where it is extracted and processed, with statistical summaries and output produced by data science algorithms, and eventually moved back to the source or elsewhere). Data scientists need to understand this data flow and how it is optimized, especially when working with Hadoop, but they don’t optimize the data flow itself. They optimize the data processing step: extracting value from data. They do, however, work with engineers and business people to define the metrics, design data collection schemes, and make sure data science processes integrate efficiently with the enterprise data systems (storage, data flow). This is especially true for data scientists working in small companies, and it is one reason why data scientists should be able to write code (increasingly in Python) that engineers can reuse.

Sometimes data engineers do DAD, and sometimes data scientists do ETL, but it's not common, and when they do, it's usually for internal purposes. For example, a data engineer may do a bit of statistical analysis to optimize some database processes, or a data scientist may do a bit of database management to maintain a small, local, private database of summarized information.
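As an illustration of the last case, here is a minimal sketch of a small, local database of summarized information, using Python's built-in sqlite3 module. The table name `daily_summary`, the function `build_summary_db`, and the event values are all hypothetical, chosen only for the example.

```python
import sqlite3

def build_summary_db(events, path=":memory:"):
    """Store per-day event counts and totals in a small local SQLite database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_summary ("
        "day TEXT PRIMARY KEY, n_events INTEGER, total_value REAL)"
    )
    # Summarize in memory first: day -> (count, running total).
    totals = {}
    for day, value in events:
        n, total = totals.get(day, (0, 0.0))
        totals[day] = (n + 1, total + value)
    conn.executemany(
        "INSERT OR REPLACE INTO daily_summary VALUES (?, ?, ?)",
        [(day, n, total) for day, (n, total) in totals.items()],
    )
    conn.commit()
    return conn

# Hypothetical events as (day, value) pairs; the numbers are invented.
events = [("2014-07-01", 2.0), ("2014-07-01", 3.0), ("2014-07-02", 5.0)]
conn = build_summary_db(events)
rows = conn.execute(
    "SELECT day, n_events, total_value FROM daily_summary ORDER BY day"
).fetchall()
```

Passing a file path instead of `":memory:"` would keep the summaries on disk between sessions, which is closer to the private, local database described above.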

DAD comprises:

  • Discover: Find and identify the sources of good data, and the metrics. Sometimes this means requesting that the data be created (working with data engineers and business analysts).
  • Access: Access the data, sometimes via an API, a web crawler, an Internet download, or a database query, and sometimes in-memory within a database.
  • Distill: Extract the essence from the data: the stuff that leads to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system). It involves:
      • Exploring the data (creating a data dictionary and performing exploratory analysis)
      • Cleaning (removing impurities)
      • Refining (data summarization, sometimes with multiple layers of summarization or hierarchical summarization)
      • Analyzing: statistical analyses, both automated and manual (sometimes including tasks such as experimental design, which can take place even before the Access stage). This might or might not require statistical modeling.
      • Presenting results or integrating results into some automated process
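The Distill sub-steps can be sketched as a chain of small Python functions. This is only an illustration under stated assumptions: the records (keyword and bid price), the function names, and the "highest mean" rule used as the analysis are all hypothetical stand-ins, not a method taken from the article.

```python
import statistics
from collections import defaultdict

# Hypothetical raw records: (keyword, bid price as text). Values are made up.
RAW = [
    ("shoes", "1.20"),
    ("shoes", "1.40"),
    ("shoes", None),   # missing value (an impurity)
    ("hats", "0.50"),
    ("hats", "abc"),   # unparseable value (an impurity)
    ("hats", "0.70"),
]

def explore(records):
    """Exploring: build a tiny data dictionary (rows, distinct keywords, missing values)."""
    return {
        "rows": len(records),
        "keywords": sorted({k for k, _ in records}),
        "missing": sum(1 for _, v in records if v is None),
    }

def clean(records):
    """Cleaning: drop rows whose value is missing or not numeric."""
    cleaned = []
    for key, value in records:
        try:
            cleaned.append((key, float(value)))
        except (TypeError, ValueError):
            continue
    return cleaned

def refine(records):
    """Refining: summarize the values per keyword."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {k: {"n": len(v), "mean": statistics.mean(v)} for k, v in groups.items()}

def analyze(summary):
    """Analyzing: pick the keyword with the highest mean (a stand-in for real analysis)."""
    return max(summary, key=lambda k: summary[k]["mean"])

def present(summary, best):
    """Presenting: format results for a report or a downstream process."""
    lines = [f"{k}: n={s['n']}, mean={s['mean']:.2f}" for k, s in sorted(summary.items())]
    lines.append(f"highest mean: {best}")
    return "\n".join(lines)

data_dictionary = explore(RAW)
summary = refine(clean(RAW))
report = present(summary, analyze(summary))
```

Keeping each sub-step as its own function makes it easier to hand one piece (for example, the cleaning rules) to engineers for reuse in production code.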

Data science is at the intersection of computer science, business engineering, statistics, data mining, machine learning, operations research, Six Sigma, automation, and domain expertise. It brings together a number of techniques, processes, and methodologies from different fields, together with business vision and action. Data science is about bridging the different components that contribute to business optimization at large, and eliminating the silos that slow down business efficiency. It has its own unique core, too, including (for instance) the following topics discussed in my book (listed in the “related articles” section):

  • Clustering and taxonomy creation for large datasets (chapters 2 and 4)
  • Internet topology (chapter 4)
  • Model-free confidence intervals (chapter 5)
  • Analytics as a Service, APIs (chapter 5)
  • Hadoop / Map-Reduce (chapter 5)
  • Fast feature selection (chapter 6)
  • Predictive power of a feature (chapter 6)
  • Advanced visualizations (chapter 4)
  • The curse of big data (chapter 2)
  • What Map-Reduce can't do (chapter 2)
  • Keyword correlations in big data (chapter 4)
  • Eleven features any database, SQL, or NoSQL should have (chapter 4)
  • Correlation and R-squared for big data (chapter 4)
  • Statistical modeling without models (chapter 4)
  • Linear regression on an unusual domain, hyperplane, sphere, or simplex (chapter 1)

Caveat:

Some employers are looking for Java or database developers with strong statistical knowledge. These professionals are very rare, so employers sometimes try to hire a data scientist instead, hoping he or she is strong at developing production code. If you don’t have that level of Java or database expertise, attending these interviews can be a waste of time. During your phone interview, ask upfront whether the position is for a Java developer with statistics knowledge or a statistician with strong Java skills. Sometimes the hiring manager is unsure what they really want, and you might be able to convince them to hire someone like you if you explain the added value that your expertise brings. It is easier for an employer to get a Java software engineer to learn statistics (especially using this book as training material) than the other way around.

Related articles


Comment by Qingbo Zheng on November 17, 2014 at 7:21pm

data scientists need to put their lab coats back on, drill into mathematical models and invent the next-generation k-means clustering for data engineers to use. there is a big mislabeling of job titles nowadays. the majority of data scientists' work nowadays is truly data engineering. so Dr. data scientists, stop taking data engineers' jobs.

Comment by jaap Karman on July 9, 2014 at 12:35am

Sorry I missed this when posted.
ETL with a data warehouse (DWH) was originally built to give data analysts the data they needed to do analytics. Otherwise they could disturb the operational process, not being trained in computing. That was 30 years ago. We are possibly in a technical shift in which the old constraint of computer capacity and performance is solved differently.

So do not give analytics users a DWH but a data lake. In that case ETL is killed, along with all that expensive processing (and those technicians). Everyone will be doing in-memory analytics: Discover/Access/Distill.

It would be great to see the fall of ETL and the rise of DAD, even if no new abbreviation catches on and people just keep using the ETL abbreviation all the time.

   

© 2016 Data Science Central
