Data transformation 101: Process and new technologies

Data transformation involves converting data from one format into another for further processing, analysis, or integration. The data transformation process is an integral component of data management and data integration. By streamlining these processes through data transformation, companies can improve their data-driven decision-making.

However, as more and more companies adopt cloud-based data storage (IDC reports that today 67% of enterprise infrastructure is cloud-based), the data transformation process must follow suit. Consequently, many companies are searching for alternative data integration processes and data transformation tools that help improve data quality, readability, and organization company-wide.

In this article, I will explore the data transformation process, how it contributes to the broader processes of data integration, as well as new data transformation technologies.

Benefits of data transformation

From a general perspective, data transformation helps businesses take raw data (structured or unstructured) and transform it for further processing, including analysis, integration, and visualization. All teams within a company’s structure benefit from data transformation, as low-quality unmanaged data can negatively impact all facets of business operations. Some additional benefits of data transformation include:

Improved data organization and management
Increased computer and end-user accessibility
Enhanced data quality and reduced errors
Greater application compatibility and faster data processing

Data integration

Before examining the various ways to transform data, it is important to take a step back and look at the data integration process. Data integration combines multiple types of source data into integrated data; along the way, the data undergoes cleaning, transformation, analysis, and loading. Data transformation is therefore one step within data integration.

Data integration as a whole involves extraction, transformation, cleaning, and loading. Over time, data scientists have combined and rearranged these steps, consequently creating four data integration processes: batch, ETL, ELT, and real-time integration.

Batch integration

One common method is batch data integration, which involves moving batches of stored data through further transformation and loading processes. This method is mainly used for internal databases, large amounts of data, and data that is not time-sensitive.
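To make the idea concrete, here is a minimal Python sketch of batch processing. The function names (`batches`, `integrate_in_batches`), the `name` field, and the cleanup rule are my own assumptions for illustration, not part of any particular tool.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def integrate_in_batches(records, size, load):
    """Transform and load one batch at a time; return the batch count."""
    count = 0
    for chunk in batches(records, size):
        # Hypothetical transformation: tidy up the "name" field.
        load([{**r, "name": r["name"].strip().title()} for r in chunk])
        count += 1
    return count
```

Here `load` can be any callable that accepts a list of records, such as the `extend` method of a list or a function that writes to a database.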

ETL integration

ETL data processing involves data integration through extraction, transformation, and loading, in that order. ETL integration is the most common form of data integration and utilizes batch integration techniques.
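A small sketch of the three ETL stages in Python may help. It uses an in-memory SQLite database as a stand-in for the target; the sample CSV, the `payments` table, and the function names are assumptions made for this example.

```python
import csv
import io
import sqlite3

RAW_CSV = "id,amount\n1, 10.50\n2,3.2\n"  # hypothetical source extract


def extract(text):
    """Read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Convert types before anything reaches the target database."""
    return [(int(r["id"]), round(float(r["amount"]), 2)) for r in rows]


def load(conn, rows):
    """Write the already-transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)


def run_etl(conn, text):
    load(conn, transform(extract(text)))
```

The point to notice is the order: the data is cleaned in application code first, and only the finished rows are loaded.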

ELT integration

ELT data processing involves data integration through extraction, loading, and transformation: raw data is loaded into the target system first and transformed there. Similar to real-time integration, ELT applies open-source tools and cloud technology, making this method best for organizations that need to transform massive amounts of data relatively quickly.
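For comparison, here is a rough ELT sketch in Python. SQLite stands in for a cloud data warehouse, and the table names (`raw_payments`, `payments`) are hypothetical. The raw values are loaded untouched, and the target's own SQL engine performs the transformation afterwards.

```python
import sqlite3


def run_elt(conn, raw_rows):
    # Load: raw values land in the target untouched, as text.
    conn.execute("CREATE TABLE raw_payments (id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", raw_rows)
    # Transform: the target system does the work with its own SQL engine.
    conn.execute(
        """
        CREATE TABLE payments AS
        SELECT CAST(id AS INTEGER) AS id,
               CAST(TRIM(amount) AS REAL) AS amount
        FROM raw_payments
        """
    )
```

Because the raw table is kept, the transformation can be rewritten and rerun later without extracting the data again.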

Real-time integration

One of the more recent data integration methods, real-time integration, processes and transforms data upon collection and extraction. This method utilizes CDC (Change Data Capture) techniques, among others, and is helpful for data processing that requires near-instant use.
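As an illustration of the CDC idea, the sketch below applies a stream of change events to an in-memory copy of a table, transforming each row as it arrives. The event shape (`op`, `key`, `row`) and the email cleanup are assumptions of mine; real CDC tools define their own event formats.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Transform on arrival: normalize the email before storing it.
        row = dict(event["row"])
        row["email"] = row["email"].strip().lower()
        replica[key] = row
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
```

Each event is handled as soon as it is received, so the replica stays close to the current state of the source instead of waiting for the next batch run.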

These same concepts utilized in data integration have also been applied to the individual steps within the larger integration process, such as data transformation. More specifically, both batch data processing and cloud technology, utilized in real-time integration, have been crucial in developing successful data transformation processes and data transformation tools. Now, let’s take a closer look at the types of data transformation processes.

First party data (data you collect yourself about your company and your customers) is rapidly growing in value. Your ability to transform and use that data to drive decisions and strategies will increasingly become the source of competitive advantage.

– Rich Edwards, CEO of Mindspan Systems

Types of data transformation

Batch data transformation

Batch data transformation, also known as bulk data transformation, involves transforming data in groups over a period of time. Traditional batch data transformation relies on manually executed scripts written in languages such as SQL and Python and is now seen as somewhat outdated.

More specifically, batch transformation involves ETL data integration, in which the data is stored in one location and then transformed and moved in smaller batches over time. Batch data transformation is significant to many data integration processes, such as web application integration, data warehousing, and data virtualization. When applied to other data integration processes, the concepts and logistics of batch data transformation can improve the overall integration process.

Interactive data transformation

As many companies turn to cloud-based systems (IBM reports that 81% of companies use multiple cloud-based systems), end-users of the data are also looking for more versatile methods to transform it. Interactive data transformation, also referred to as real-time data transformation, uses concepts similar to those seen in real-time integration and ELT processing.

Interactive data transformation is an expansion of batch transformation. However, the steps are not necessarily linear. Gaining traction for its accessible end-user visual interface, interactive data transformation takes previously generated and inspected code to identify outliers, patterns, and errors within the data. It then sends this information to a graphical user interface for human end-users to quickly visualize trends, patterns, and more, within the data.
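To give a feel for the kind of check an interactive tool might run before showing results to a user, here is a small Python sketch that flags outliers in a column. The choice of the median absolute deviation (MAD) and the threshold of 3 are my assumptions, not a description of any specific product.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return the indexes of values that sit far from the median,
    measured in units of the median absolute deviation (MAD)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values) if abs(v - median) / mad > threshold]
```

A visual interface could highlight the flagged rows and let the user decide whether to correct, exclude, or keep them.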

Data transformation languages

In addition to the various types of data transformation, developers can also utilize a variety of transformation languages to transform formal language text into a more useful and readable output text. There are four main types of data transformation languages: macro languages, model transformation languages, low-level languages, and XML transformation languages.

Commonly used transformation languages include ATL, AWK, QVT, TXL, XQuery, and XSLT, along with patterns such as the identity transform. Ultimately, before deciding what transformation method and language to use, data scientists must consider the source of the data, the type of data being transformed, and the project’s objective.
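The identity transform is easiest to understand with an example: copy the input unchanged, then override only the parts you want to change. It is usually written in XSLT; the sketch below imitates the idea in Python with the standard library's `xml.etree.ElementTree`. The `identity` function and its `rename` parameter are hypothetical names chosen for this illustration.

```python
import xml.etree.ElementTree as ET


def identity(node, rename=None):
    """Copy an element tree unchanged, except for tags listed in `rename`."""
    rename = rename or {}
    copy = ET.Element(rename.get(node.tag, node.tag), dict(node.attrib))
    copy.text, copy.tail = node.text, node.tail
    # Recurse so that every descendant is copied with the same rule.
    copy.extend([identity(child, rename) for child in node])
    return copy
```

Calling `identity(tree)` returns an equivalent copy, while `identity(tree, {"item": "product"})` returns the same document with every `item` element renamed to `product`.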

The data transformation process

Now that I’ve covered how data transformation fits into the larger picture of data integration, I can examine the more granular steps in data transformation itself. First, it is important to note that while it’s possible to transform data manually, companies today rely on data transformation tools to partially or fully transform their data. Either way, manual and automated data transformation involve the same steps, detailed below.

1. Data discovery and parsing

The first step in the data transformation process involves data discovery and data parsing. Data discovery and data parsing are processes that involve collecting data, consolidating data, and reorganizing data for specific market insights and business intelligence.
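As a rough sketch of what discovery can look like in practice, the Python function below parses a CSV extract and profiles each column. The report format and the sample columns are assumptions I made for the example.

```python
import csv
import io


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(csv_text):
    """Summarize each column: how many values are missing and whether
    every non-empty value parses as a number."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0]) if rows else []
    report = {}
    for column in columns:
        values = [(row[column] or "").strip() for row in rows]
        present = [v for v in values if v]
        report[column] = {
            "missing": len(values) - len(present),
            "numeric": bool(present) and all(is_number(v) for v in present),
        }
    return report
```

A profile like this tells you which columns need type conversion or gap handling before you decide how to transform the data.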

2. Data mapping and translation

Once you have profiled your data and decided how you want to transform your data, you can perform data mapping and translation. Data mapping and translation refer to the process of mapping, aggregating, and filtering said data so it can be further processed. For example, in batch transformation, this step would help filter and sort the data in batches so executable code can run smoothly.
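One simple way to express a mapping is as a lookup table from source fields to target fields. The Python sketch below assumes hypothetical source field names (`cust_nm`, `rev_usd`, `cntry`) and shows mapping followed by filtering.

```python
# Hypothetical mapping: source field -> (target field, converter).
MAPPING = {
    "cust_nm": ("customer_name", str.strip),
    "rev_usd": ("revenue", float),
    "cntry": ("country", str.upper),
}


def map_record(source, mapping=MAPPING):
    """Rename and convert mapped fields; unmapped fields are dropped."""
    return {
        target: convert(source[field])
        for field, (target, convert) in mapping.items()
        if field in source
    }


def map_and_filter(records, keep):
    """Map every record, then keep only those that satisfy `keep`."""
    return [mapped for mapped in map(map_record, records) if keep(mapped)]
```

Keeping the mapping in data rather than in code makes it easier to review with the people who know what each source field means.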

3. Programming and code creation

The programming step involves code generation, in which developers work with executable languages such as SQL, Python, or R. During this stage, developers work closely with transformation technologies, also known as code generators. Code generators provide developers with a visual design environment and can run on multiple platforms, making them a favorite among developers.
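To show the principle behind code generation, here is a deliberately tiny Python sketch that builds a SQL statement from a column specification. The function name and the example table are hypothetical, and a real code generator does far more than this.

```python
def generate_select(source_table, columns):
    """Build a SELECT statement from (expression, alias) pairs.

    Identifiers are interpolated directly, so the specification must come
    from a trusted source; this is not safe for user-supplied names.
    """
    select_list = ",\n  ".join(f"{expr} AS {alias}" for expr, alias in columns)
    return f"SELECT\n  {select_list}\nFROM {source_table}"
```

For example, passing `"raw_customers"` and the pairs `("TRIM(cust_nm)", "customer_name")` and `("CAST(rev_usd AS REAL)", "revenue")` yields a two-column SELECT over `raw_customers`.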

4. Transforming the data

Now that the code is developed, it can be run against your data. Also known as code execution, this step is the last stage the data passes through before reaching human end-users.
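A minimal sketch of code execution, again using SQLite from Python as a stand-in for the real target system. The table names and the SQL statement are assumptions for the example. Running the statement inside a transaction means a failure leaves the target table unchanged.

```python
import sqlite3

TRANSFORM_SQL = """
INSERT INTO customers (customer_name, revenue)
SELECT TRIM(cust_nm), CAST(rev_usd AS REAL) FROM raw_customers
"""


def execute_transformation(conn, sql=TRANSFORM_SQL):
    """Run the transformation in one transaction; return rows written."""
    with conn:  # commits on success, rolls back if an exception is raised
        return conn.execute(sql).rowcount
```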

5. Reviewing the data

Once the code has run against the data, the output is ready for review. Similar to a quality assurance check, the purpose of this step is to make sure the data has been transformed properly. This step is iterative: end-users of the data are responsible for reporting any errors they find in the transformed data to the developers so that the code can be corrected.
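Some of this review can be automated before the data reaches end-users. The Python sketch below runs two basic checks; the checks themselves and the message format are my own assumptions about what a reasonable first pass might include.

```python
def review(source_rows, transformed_rows, required_fields):
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    if len(source_rows) != len(transformed_rows):
        problems.append(
            f"row count changed: {len(source_rows)} -> {len(transformed_rows)}"
        )
    for index, row in enumerate(transformed_rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
    return problems
```

Checks like these will not catch every error, which is why feedback from the people using the data still matters.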

Data extraction and transformation have an effect on other business activities. When data is transformed into a more readable format, data analysis can be completed more quickly and accurately than before. Not only does this have an effect on employee morale, but it also has an impact on company decision-making.

– Brian Stewart, CTO of ProsperoWeb

ETL vs. ELT

The recent advancements in big data have required businesses to look elsewhere when storing, processing, and analyzing their data. Moreover, the increasing variety of data sources has added to the strain on data warehouses. In particular, while companies acquire valuable raw data from sources such as firmographic data, employee data, and social media data, these same data types typically produce very large files. Consequently, companies have been searching for alternative methods.

This search has greatly impacted data integration processes, specifically data transformation. That is, companies have been transitioning from traditional data integration processes, such as ETL methods, to cloud-based integration processes, such as ELT and real-time integration.

In the past, many companies relied on local servers for data storage, making ETL integration the preferred method. However, due to the significant increase in digital communication and business operations in 2020, global data creation is now modeled at a CAGR (compound annual growth rate) of 23%, according to Businesswire. As a result, the upward trend in global data creation has put a strain on local servers and data storage, and many businesses are looking to cloud-based solutions instead.

What is data transformation in ETL?

ETL, which stands for extraction, transformation, and loading, is a data integration process that involves extracting data from various external sources, often from third-party data providers, transforming the data into the appropriate structure, and then loading that data into a company’s database. The ETL process is considered the most common integration process compared to ELT, ETM, and EMM transformation processes.

Data transformation within ETL occurs in the transformation step; however, it is closely linked to the extraction and loading stages. Traditionally, data transformation within the ETL method utilizes batch transformation with linear steps, including discovery, mapping, programming, code execution, and data review.

Summary

As businesses collect an increasing volume of data, many are forced to find data storage and processing solutions that can handle massive amounts of data with limited money and resources. Similarly, companies are also looking for data transformation solutions that can meet the current needs and industry standards. Companies are recognizing the future of data transformation and shifting towards utilizing cloud-based technology in processes such as ELT integration and interactive data transformation.