One of the main differences between a data scientist and a data engineer has to do with ETL versus DAD:
Data engineers tend to focus on software engineering, data base design, production code, and making sure data is flowing smoothly between source (where it is collected) and destination (where it is extracted and processed, with statistical summaries and output produced by data science algorithms, eventually moved back to the source or elsewhere). Data scientists, while they need to understand this data flow (and how it is optimized, especially when working with Hadoop) don’t actually optimize the data flow itself, but rather the data processing step: extracting value from data. But they work with engineers and business people to define the metrics, design data collecting schemes and make sure data science processes integrate efficiently with the enterprise data systems (storage, data flow). This is especially true for data scientists working in small companies, and a reason why data scientists should be able to write code (more and more, Python) re-usable by engineers.
Sometimes data engineers do DAD, and sometimes data scientists do ETL, but it's not common, and when they do it's usually internal. For example, the data engineer may do a bit of statistical analysis to optimize some database processes, or the data scientist may do a bit of database management to manage a small, local, private database of summarized information.
DAD is comprised of:
Data science is at the intersection of computer science, business engineering, statistics, data mining, machine learning, operations research, six sigma, automation, and domain expertise. It brings together a number of techniques, processes, and methodologies from different fields, together with business vision and action. Data science is about bridging the different components that contribute to business optimization at large, and eliminating the silos that slow down business efficiency. It has its own unique core, too, including (for instance) the following topics discussed in my book (listed in the “related articles” section):
Some employers are looking for Java or database developers with strong statistical knowledge. These professionals are very rare, so instead the employers sometimes try to hire a data scientist, hoping he/she is strong in developing production code. If you don’t have that level of Java or database expertise, it can be a waste of time to attend these interviews. You should ask upfront if the position to be filled is a Java developer with statistics knowledge, or a statistician with strong Java skills, during your phone interview, though sometimes the hiring manager is unsure what he really wants, and you might be able to convince him to hire a guy like you if you tell the added value that you expertise brings. It is easier for an employer to get a Java software engineer to learn statistics (especially using this book as training material) than the other way around.