Getting Started with Apache Flink: First steps to Stateful Stream Processing

If you’re interested in stateful stream processing and the capabilities it provides, you may have heard of Apache Flink®. It’s well known for its stateful stream processing abilities, but getting started can feel daunting for beginners. So here, we’ll explore the basics of Apache Flink by showing you how to build your first stream processing application: a Flink pipeline that reads data from one Apache Kafka topic, processes it, and writes the results to another Kafka topic.

Setting up the Environment

The first step is to set up the environment. In this example, you will be using Redpanda, an alternative implementation of the Kafka protocol written in C++. It’s a high-performance streaming platform that provides low latency and does not require ZooKeeper. While ZooKeeper is handy for tracking the status of Kafka cluster nodes and maintaining cluster metadata such as the list of topics and their configuration, it’s not easy to run and operate. So, to get up to speed without yak shaving, we’ll use Redpanda.

To work with Redpanda, you’ll use its RPK command-line client to interact with Redpanda clusters. You’ll start by setting up a Redpanda cluster with three nodes and creating two topics, an input topic and an output topic, each with a replication factor of three and a partition count of 30.

  • Setting up the Redpanda Cluster:
  • Install the Redpanda command-line client, RPK.
  • Set up a Redpanda cluster with three nodes using RPK commands.
  • Export the Redpanda cluster URLs into an environment variable for easier use.
  • Check the running containers using docker ps.
  • Creating Kafka Topics:
  • Create two topics with a replication factor of three and a partition count of 30: input topic and output topic.
  • Use the RPK cluster metadata command to view information about the cluster and topics.
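
The webinar drives this setup with RPK on the command line. If you’d rather script the topic creation instead, the standard Kafka AdminClient works against Redpanda as well, since Redpanda speaks the Kafka protocol. Here is a minimal sketch, assuming the topics are literally named input and output and the cluster is reachable at localhost:9092 (substitute the broker addresses you exported above):

    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopics {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Assumed address; use the broker URLs of your Redpanda cluster
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // 30 partitions and a replication factor of three, matching the RPK setup above
                admin.createTopics(List.of(
                        new NewTopic("input", 30, (short) 3),
                        new NewTopic("output", 30, (short) 3)))
                    .all()
                    .get();
            }
        }
    }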

Building the Flink Project

To get up to speed quickly, you should use the Flink Quickstart archetype to generate a new Apache Maven project. The generated Maven Project Object Model (POM) file can be modified to use Java 11 as the JDK version and add Flink dependencies, including the Flink Connector Kafka.

You’ll then import the project into an IDE.

  • Creating a Flink Project:
  • Generate a new Apache Maven project using the Flink Quickstart archetype.
  • Build the project and import it into your preferred IDE.
  • Update the Maven POM file: set Java 11 as the JDK version and add the Flink Connector Kafka dependency.
  • Remove any unnecessary configuration.

Creating the Flink Job

The job definition starts with configuring checkpointing behavior so that the job can recover properly in case of failures. You’ll then set up a Kafka source to connect to the input topic and define a watermark strategy. In Flink, elements carry timestamps and the stream carries watermarks, which allow the system to track progress in event time.

A stream will be set up using the fromSource call, and you’ll then filter events based on a string and modify the data by converting messages to uppercase. The filtered data will be passed to a sink; in the first instance, a simple print sink will be used, connected to the stream by passing true to the print sink’s constructor. The Flink job can then be run in the IDE.
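
To make the shape of the job concrete, here is a minimal sketch of such a pipeline using the DataStream API. The topic name, bootstrap servers, consumer group, and filter condition are assumptions for illustration, and the no-watermarks strategy is used since this simple job does no event-time processing; the webinar’s actual code lives in the repository linked at the end of this article.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

    public class KafkaFilterJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every ten seconds so the job can recover after failures
            env.enableCheckpointing(10_000);

            // Kafka source reading from the input topic
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setTopics("input")
                    .setGroupId("flink-getting-started")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            // No event-time logic is needed here, so no watermarks are generated
            DataStream<String> messages = env.fromSource(
                    source, WatermarkStrategy.noWatermarks(), "Kafka Source");

            messages
                    // Keep only messages containing a given string (hypothetical condition)
                    .filter(message -> message.contains("Flink"))
                    // Modify the data by converting it to uppercase
                    .map(String::toUpperCase)
                    // Print sink; passing true sends the output to stderr
                    .addSink(new PrintSinkFunction<>(true));

            env.execute("kafka-filter-job");
        }
    }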

  • Defining a Flink Job:
  • Configure checkpointing to ensure proper restart behavior in case of failures.
  • Set up a Kafka source to connect to the input topic and configure it with the bootstrap servers.
  • Define a data stream with the Kafka source, a watermarking strategy, and a name.
  • Filter and modify the data using the Flink DataStream API.
  • Replace the print sink with a Kafka sink to write the data to the output topic (see the sketch after this list).
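
To write the results back to Kafka rather than printing them, the print sink from the sketch above can be swapped for a KafkaSink. Again, the topic name and broker address are assumptions:

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    // ...inside the job definition from the sketch above:

    KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                    .setTopic("output")
                    .setValueSerializationSchema(new SimpleStringSchema())
                    .build())
            .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
            .build();

    // Replaces .addSink(new PrintSinkFunction<>(true)) at the end of the pipeline
    messages
            .filter(message -> message.contains("Flink"))
            .map(String::toUpperCase)
            .sinkTo(sink);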

Deploying the Flink Job

The next step is to deploy the job in an actual Flink cluster. You can download the Flink distribution, unpack it, and start up a cluster using the provided start-cluster.sh script. Then open the Flink Web UI in the browser to see the status of all deployed jobs, restart them, and so on. You can then build the job using the Maven CLI (the package goal), which creates a self-contained JAR of the job with all its dependencies. You can then use the flink run command to upload and start the job; you must also specify the job’s main class and the path to the JAR being submitted. Finally, you can use Redpanda’s RPK topic producer and consumer to write messages to the input topic, read them from the output topic, and see them arriving and being processed in real time.

  • Running and Testing the Flink Job:
  • Run the Flink job from within your IDE.
  • Use the RPK command to produce and consume messages on the input and output topics.
  • Check the output to ensure that the data has been properly filtered and modified.
  • Deploying the Flink Job in a Flink Cluster:
  • Download the Flink distribution and start a cluster using the provided script.
  • Open the Flink Web UI in your browser to view the deployed jobs and their statuses.
  • Build the Flink job using the Maven package command and submit it using the flink run command.
  • Test the deployed job by sending more messages to the input topic and observing the output.

Using Flink SQL

Finally, you can deploy the same stream processing logic using Flink SQL. To do this, you’ll need to download the Flink SQL connector for Kafka and use the Flink SQL client to run SQL queries against your Flink sources.

Then, with SQL, you can create a table to represent the input topic, describe its structure, and query it, applying all the capabilities of SQL, such as filtering and modifying data. You can create another table to represent the output topic and insert data selected from the input table into it, which will automatically deploy a job to Flink. For database people, all this will sound very familiar. You can then use the RPK topic producer to put messages into the input topic and watch them appear on the output topic. The statements involved are sketched after the list below.

  • Using Flink SQL to Process Data:
  • Download the Flink SQL connector for Kafka and add it to the classpath.
  • Launch the Flink SQL client and create a table representing the input topic.
  • Run SQL queries against the input topic to filter and modify the data.
  • Create another table representing the output topic and write the modified data to it.
  • Monitor the Flink SQL client to observe the data being processed and written to the output topic.
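
The webinar runs these statements interactively in the Flink SQL client; the sketch below shows what they can look like, submitted here from Java via the Table API for the sake of a self-contained example. The table names, the single-column layout, the connector options, and the filter condition are assumptions; adjust them to your topics, and note that the Kafka SQL connector JAR needs to be on the classpath.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class KafkaFilterSqlJob {

        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inStreamingMode().build());

            // Table over the input topic; a single raw STRING column is assumed
            tEnv.executeSql(
                "CREATE TABLE input_messages (message STRING) WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'input'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'properties.group.id' = 'flink-sql-demo'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'raw')");

            // Table over the output topic
            tEnv.executeSql(
                "CREATE TABLE output_messages (message STRING) WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'output'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'raw')");

            // Filtering and upper-casing in SQL; submitting the INSERT deploys a Flink job
            tEnv.executeSql(
                "INSERT INTO output_messages " +
                "SELECT UPPER(message) FROM input_messages " +
                "WHERE message LIKE '%Flink%'");
        }
    }

The strings passed to executeSql are exactly the kind of statements you would type at the Flink SQL client prompt.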

Conclusion

This article uses Decodable senior software engineer Gunnar Morling’s excellent Flink intro webinar as a guide, which covers the basics of getting started with Flink. You’ll soon discover that there’s much more power under the Flink hood, and that deploying beyond a POC can be more than a little “hairy” (yak-shaving pun intended). If you want to know more, check out this GitHub example repo, where you’ll find the demo’s source code.