
A Self-Study List for Data Engineers and Aspiring Data Architects

This article was written by John Hammink, an American engineer, musician, artist and linguist with his own entry in Wikipedia.


With the explosion of “Big Data” over the last few years, the need for people who know how to build and manage data pipelines has grown. Unfortunately, supply has not kept up with demand, and there seems to be a shortage of engineers focused on the ingestion and management of data at scale. Part of the problem is the lack of education focused on this growing field.

As far as we know, there is currently no official curriculum or certification for becoming a Data Engineer or Data Architect. While that’s bad news for companies that need qualified engineers, it’s great news for you! Why? Because it’s basic supply and demand, and data engineers and data architects are cleaning up! In fact, we did a little research and found that the average salary for a data engineer is around $95.5k. Salaries for data architects average around $112k nationally, but the main path to this strategic, coveted position (and salary) means cutting your teeth as a data engineer and then working your way up or making a lateral job move.

What does it take to start getting a piece of this action? Don’t worry: we did research on that, too! We surveyed a pool of ‘Data Engineer’ and ‘Data Architect’ job ads on LinkedIn and Glassdoor and ranked the top 20 most sought-after job skills. They are:

Skill             Share of job ads
SQL               65%
Python            60%
Data Pipelines    55%
Data Warehouse    50%
Hadoop            45%
Hive              45%
ETL               40%
Spark             40%
AWS               30%
Redshift          30%
Java              25%
Kafka             25%
MapReduce         25%
Ruby              25%
Scala             25%
Vertica           25%
Data Quality      20%
JavaScript        20%
NoSQL             20%
Statistics        20%

Percentage of times these skills showed up in data-related job descriptions
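
A tally like this can be reproduced with a few lines of Python. The sketch below is only an illustration: the four ad texts and the short skill list are made-up placeholders, not the ads we actually surveyed, and whole-word matching is just one assumed way of counting a “mention”.

```python
import re

# Hypothetical job-ad snippets; the real survey data is not reproduced here.
ADS = [
    "Looking for a data engineer with SQL, Python and Spark experience.",
    "Must know SQL and Hadoop; Python is a plus.",
    "Build ETL jobs in Python on AWS.",
    "Strong SQL skills and Kafka experience required.",
]

# A small subset of skills to look for (assumed for illustration).
SKILLS = ["SQL", "Python", "Spark", "Hadoop", "ETL", "AWS", "Kafka"]


def skill_percentages(ads, skills):
    """Return {skill: percent of ads mentioning it}, most common first."""
    counts = {}
    for skill in skills:
        # Whole-word, case-insensitive match so "SQL" does not match "NoSQL".
        pattern = re.compile(r"\b" + re.escape(skill) + r"\b", re.IGNORECASE)
        counts[skill] = sum(1 for ad in ads if pattern.search(ad))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {skill: 100 * n / len(ads) for skill, n in ranked}


if __name__ == "__main__":
    for skill, pct in skill_percentages(ADS, SKILLS).items():
        print(f"{skill:<8}{pct:.0f}%")
```

Each ad counts a skill at most once, so the result is the share of ads that mention the skill, not the total number of mentions.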

Keep in mind that most job ads (and recruiters) lag behind the curve in mentioning new technology, because recruiters and HR typically get their information secondhand. Plus, cutting-edge technology typically goes through a period of minimal adoption while it is being tried out.

If you’re starting from the beginning with the end goal of becoming a data architect, the first step is to learn the skills of a data engineer. While this article won’t, and can’t, completely connect the dots for you, our aim is to get you thinking about this actively and looking in the right direction for yourself.

Roles in a Data Organization

So what are the roles in a data organization?

Data Engineers are the worker bees; they are the ones actually implementing the plan and working with the technology.

Managers (both Development and Project): Development managers may or may not do some of the technical work, but they help to manage the engineers. Project managers handle the logistical details and timelines to keep the project moving according to plan.

Data Architects are the visionaries. They lead the innovation and technical strategy of the product and architecture. Highly experienced and technical, they typically grow out of an engineering position. They are valuable and rare ducks, since they’ve essentially been working in this field since its beginning.

When we surveyed several ‘Data Architect’ job descriptions on Glassdoor, LinkedIn and Indeed.com, we found many similarities to the skills required of Data Engineers, so let’s focus on the differences. These include coaching and leadership, data modelling, and feasibility studies. Architects are also expected to have a firm grasp of legacy technologies. The “legacy” technologies typically mentioned include Oracle databases, Teradata, SQL Server and Vertica. In the resources below, we don’t cover many of these because there’s extensive documentation on them already.

The Data Pipeline, described

To understand what the data engineer (or architect) needs to know, it’s necessary to understand how the data pipeline works. What follows is obviously a simplified version, but it will hopefully give you a basic understanding of the pipeline.


Common programming languages are the core skills needed to grasp data engineering and pipelines generally. Among other things, Java and Scala are used to write MapReduce jobs on Hadoop, Python is a popular pick for data analysis and pipelines, and Ruby is a popular application glue across the board.

Collection and ingestion tools sit at the beginning of the pipeline. Common open-source examples are Apache Kafka, Fluentd and Embulk. This stage is where data is taken from sources (among them applications, web and server logs, and bulk uploads) and uploaded to a data store for further processing and analytics. This upload can be streaming, batch or bulk. These tools are far from the only ones; many dedicated analytics tools have SDKs that do this for a range of programming languages and development environments.
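
To make the ingestion stage concrete, here is a minimal sketch in plain Python. It does not use Kafka, Fluentd or Embulk; the log lines, the `InMemoryStore` class and the `ingest` function are all hypothetical stand-ins we made up to show the shape of the step: parse records from a source, group them, and upload them to a store.

```python
import json
from typing import Iterable, Iterator, List

# Hypothetical web-server log lines standing in for a real source.
RAW_LOGS = [
    '{"user": "a", "path": "/home", "ms": 120}',
    '{"user": "b", "path": "/cart", "ms": 340}',
    'not valid json',
    '{"user": "a", "path": "/pay", "ms": 95}',
]


def parse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed event per valid line, skipping malformed ones."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real collector would log or set aside bad records


def batches(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a stream of events into lists of at most `size` items."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


class InMemoryStore:
    """Stand-in for a data store; it just keeps rows in a list."""

    def __init__(self) -> None:
        self.rows: List[dict] = []
        self.uploads = 0

    def upload(self, rows: List[dict]) -> None:
        self.rows.extend(rows)
        self.uploads += 1


def ingest(lines: Iterable[str], store: InMemoryStore, batch_size: int = 2) -> InMemoryStore:
    """Parse the lines and upload them to the store in batches."""
    for batch in batches(parse_events(lines), batch_size):
        store.upload(batch)
    return store
```

Calling `ingest(RAW_LOGS, InMemoryStore())` uploads the valid events in groups of two. Setting `batch_size=1` sends each event on its own, which is a rough picture of streaming, while a very large `batch_size` approximates a bulk load.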

Storage and management tools are typically in the middle of the pipeline and take the form of Data Warehouses, Hadoop, Databases (both RDBMS and NoSQL), Data Marts and technologies like Amazon Redshift and Google BigQuery. Basically, this is where data goes to live so it can be accessed later.
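
As a small-scale illustration of the storage stage, the sketch below uses SQLite (bundled with Python) as a stand-in for a relational store. The table name, column names and sample events are assumptions of ours; a warehouse such as Redshift or BigQuery would use its own client and loading tools.

```python
import sqlite3

# Hypothetical events, as they might arrive from the ingestion stage.
EVENTS = [
    ("a", "/home", 120),
    ("b", "/cart", 340),
    ("a", "/pay", 95),
]


def create_store(path=":memory:"):
    """Open a SQLite database and make sure the events table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_events (
            user_id TEXT NOT NULL,
            path    TEXT NOT NULL,
            ms      INTEGER NOT NULL
        )
        """
    )
    return conn


def store_events(conn, events):
    """Insert rows in one transaction; return the table's total row count."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO page_events (user_id, path, ms) VALUES (?, ?, ?)",
            events,
        )
    return conn.execute("SELECT COUNT(*) FROM page_events").fetchone()[0]
```

Passing a file path instead of `":memory:"` to `create_store` keeps the data on disk, which is the point of this stage: the data lives somewhere so it can be queried later.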

Data processing is typically at the end of the pipeline. SQL, Hive, Spark, MapReduce, ELK Stack and Machine Learning all go into this bucket and are used to make sense of the data. Are you querying your data into a format for use in a visualization tool (like Tableau, Kibana or Chartio)? Are you formatting your data to export to another data store? Or maybe running a machine learning algorithm to detect anomalous data? Data processing tools are what you’ll use.
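
Two of these processing tasks can be sketched with Python’s standard library alone. The first function runs a SQL aggregation of the kind you might feed to a chart; the second flags unusual values with a simple standard-deviation rule, which is a basic statistical check rather than a machine learning algorithm. The sample rows, the table name and the threshold of 2.0 are assumptions chosen for illustration.

```python
import sqlite3
import statistics

# Hypothetical response times in milliseconds per request path.
ROWS = [
    ("/home", 110), ("/home", 120), ("/home", 130),
    ("/cart", 300), ("/cart", 340),
    ("/home", 125), ("/home", 115), ("/home", 900),
]


def average_ms_by_path(rows):
    """Use SQL to aggregate: average latency per path, slowest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (path TEXT, ms INTEGER)")
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT path, AVG(ms) FROM requests GROUP BY path ORDER BY AVG(ms) DESC"
    ).fetchall()
    conn.close()
    return result


def anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


if __name__ == "__main__":
    print(average_ms_by_path(ROWS))
    home_times = [ms for path, ms in ROWS if path == "/home"]
    print(anomalies(home_times))
```

At real scale the same two ideas would run in an engine such as Hive or Spark, but the questions being asked of the data stay the same.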

The point is, when you see a job ad or recruiter referring to a specific technology, make it a goal to understand what the technology is, what it does, and what part of the data pipeline it fits into.
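
One way to practice this is to keep your own lookup table and grow it as you learn. The sketch below starts from the groupings used in this article; the `stage_of` helper is a hypothetical name of ours, and the table is deliberately incomplete.

```python
# Stage assignments taken from the descriptions in this article.
PIPELINE_STAGES = {
    "Common programming languages": ["Java", "Scala", "Python", "Ruby"],
    "Collection and ingestion": ["Kafka", "Fluentd", "Embulk"],
    "Storage and management": ["Hadoop", "Redshift", "BigQuery"],
    "Data processing": ["SQL", "Hive", "Spark", "MapReduce", "ELK Stack"],
}


def stage_of(technology):
    """Return the pipeline stage for a technology, or None if not listed yet."""
    wanted = technology.strip().lower()
    for stage, names in PIPELINE_STAGES.items():
        if wanted in (name.lower() for name in names):
            return stage
    return None


if __name__ == "__main__":
    for tech in ["Kafka", "hive", "Vertica"]:
        print(tech, "->", stage_of(tech) or "not in the table yet")
```

A `None` result is the useful part: it marks a technology from a job ad that you still need to read up on and place in the pipeline.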

To read the full original article (including resources about Data Collection and Ingestion, Data Storage, Data Processing, and Common Programming Languages) click here. For more big data and data organization related articles on DSC click here.
