Practical Apache Spark in 10 minutes. Part 5 - Streaming

Spark is a powerful tool that can be applied to many interesting problems, some of which we discussed in our previous posts. Today we will consider another important application: streaming. Streaming data is data that arrives continuously as small records from different sources. Streaming technology has many use cases, such as monitoring sensors in industrial or scientific devices, checking server logs, and monitoring financial markets.

In this post, we will examine sensor temperature monitoring. Suppose our device has several sensors (1, 2, 3, 4, ...). The state of each sensor is described by the following parameters: date (dd/mm/yyyy), sensor number, state (1 - stable, 0 - critical), and temperature (degrees Celsius). This data arrives as a stream, and we want to analyze it. Streaming data can be loaded from different sources. As we don’t have a real streaming data source, we will simulate one. Kafka, Flume, or Kinesis could be used for this purpose, but the simplest streaming data simulator is Netcat.

First off, run Netcat as a data server with the following command:

nc -lk 8080

Then, we enter the sensor data into it:

21/01/2001 1 1 23
22/01/2001 2 0 20
23/01/2001 3 1 15
24/01/2001 4 1 10
25/01/2001 5 0 5
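Each line entered above follows the format date, sensor number, state, temperature. As a minimal sketch of how one such line could be turned into typed fields in plain Python (the SensorRecord type and the parse_record helper are hypothetical names used only for illustration, not part of Spark):

```python
from typing import NamedTuple


class SensorRecord(NamedTuple):
    date: str         # dd/mm/yyyy, kept as text
    sensor: int       # sensor number
    state: int        # 1 = stable, 0 = critical
    temperature: int  # degrees Celsius


def parse_record(line: str) -> SensorRecord:
    """Split one space-separated line into typed fields."""
    date, sensor, state, temperature = line.split()
    return SensorRecord(date, int(sensor), int(state), int(temperature))


record = parse_record("21/01/2001 1 1 23")
print(record)
```

The streaming examples in this post do the same splitting and int conversion inside map transformations.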

Next, we will listen to this server on port 8080 with the help of the Spark Streaming module. Note that we can send more data with nc while the streaming program is running.

In contrast to the previous posts, we will not run our example in the Spark shell. We’ll save it as a file in the Spark home directory. The folder name could be spark-2.3.0-bin-hadoop2.7 (depending on the Spark version that you have).

Counting by values

Let’s start with a simple example. We want to calculate how many records came from each sensor.

First off, we have to make some imports:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

Then, we begin the program with a common line:

if __name__ == "__main__":

The first important step in every streaming program is initializing StreamingContext. Add the following lines to your program:

sc = SparkContext(appName="PythonStreamingCountByValue")
ssc = StreamingContext(sc, 30)

In the last line, 30 is the batch duration in seconds.

Now, we should create a DStream that will receive data from Netcat by connecting to hostname:port, like localhost:8080.

ds = ssc.socketTextStream('localhost', 8080)

Note that a DStream is a continuous sequence of RDDs: each batch interval produces one RDD.
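To make the idea of a batch more concrete, here is a plain-Python sketch (no Spark required) of how incoming lines could be grouped into 30-second batches. The arrival times and the split_into_batches helper are hypothetical, and this is a simplification: real Spark Streaming also emits empty batches, which this sketch skips.

```python
from collections import defaultdict


def split_into_batches(timed_lines, batch_duration=30):
    """Group (arrival_time_in_seconds, line) pairs into consecutive batches."""
    batches = defaultdict(list)
    for arrival_time, line in timed_lines:
        # Integer division assigns each record to its batch interval
        batches[int(arrival_time // batch_duration)].append(line)
    # Return the non-empty batches in time order
    return [batches[index] for index in sorted(batches)]


# Hypothetical arrival times (seconds since the stream started)
arrivals = [
    (2, "01/02/2001 1 1 23"),
    (10, "01/02/2001 2 0 12"),
    (31, "02/02/2001 1 1 25"),
    (59, "02/02/2001 2 0 15"),
    (61, "02/02/2001 3 0 10"),
]

for batch in split_into_batches(arrivals):
    print(batch)
```

Each printed list plays the role of one RDD in the DStream, and the transformations described in this post are applied to every such batch separately.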

The next step is to split each line of our DStream into elements and select the second column (sensor number). This can be achieved with the map(func) transformation, which simply applies func to each element of the stream.

data = ds.map(lambda line: int(line.split(" ")[1]))

Then, by applying the countByValue transformation, we count how many times data came from each sensor during this batch.

data_count = data.countByValue()

Finally, let’s print the result:

data_count.pprint()

With the code above, we've defined the algorithm of our data transformation. To start the execution, we need to start StreamingContext with the following line:

ssc.start()

Then we have to set up when to stop Spark Streaming:

ssc.awaitTermination()

If we enter the following input into Netcat,

01/02/2001 1 1 23
01/02/2001 2 0 12
01/02/2001 3 1 22
02/02/2001 1 1 25
02/02/2001 2 0 15
02/02/2001 3 0 10

we will get output like the one below.

-------------------------------------------
Time: 2018-06-22 16:30:30
-------------------------------------------
(3, 2)
(1, 2)
(2, 2)

As you can see from the output, data came twice from each sensor.
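These counts can be cross-checked without Spark. The sketch below applies the same logic to the six input lines in plain Python, using collections.Counter as a stand-in for countByValue on a single batch (an illustration only, not how Spark computes it internally):

```python
from collections import Counter

# The six records entered into Netcat above, treated as one batch
batch = [
    "01/02/2001 1 1 23",
    "01/02/2001 2 0 12",
    "01/02/2001 3 1 22",
    "02/02/2001 1 1 25",
    "02/02/2001 2 0 15",
    "02/02/2001 3 0 10",
]

# Same as the map step: keep the sensor number (second column) as an int
sensor_ids = [int(line.split(" ")[1]) for line in batch]

# Same idea as countByValue: how many times each value occurs
counts = Counter(sensor_ids)
print(sorted(counts.items()))  # [(1, 2), (2, 2), (3, 2)]
```

Spark prints the pairs in no particular order, which is why the streaming output above lists sensor 3 first.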

Complete code:

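from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingCountByValue")
    ssc = StreamingContext(sc, 30)  # 30-second batches

    ds = ssc.socketTextStream('localhost', 8080)

    # Keep the sensor number (second column) and count its occurrences per batch
    data = ds.map(lambda line: int(line.split(" ")[1]))
    data_count = data.countByValue()
    data_count.pprint()

    ssc.start()
    ssc.awaitTermination()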


 To run the streaming program, execute:



Filtering

Let's move to another common task in streaming: filtering. Say we want to accept messages only from sensors in a stable state (state equal to 1). The filter transformation will help us with this. It returns only the records for which the given function returns true.

Now you may change your program or save it as another file. Whichever you choose, keep the skeleton of the program the same except for these lines:

data = ds.map(lambda line: int(line.split(" ")[1]))
data_count = data.countByValue()
data_count.pprint()

These lines are the main ones in our program (the transformations take place here and the result is printed), so in the examples below we’ll change only these lines.

Replace them with the following ones (you may want to update the appName too):

data = ds.map(lambda line: line.split(" "))\
    .map(lambda l: (l[0], int(l[1]), int(l[2]), int(l[3])))
data_filter = data.filter(lambda line: line[2] == 1)
data_filter.pprint()

The input and output for this task are shown below.

Input:

01/02/2001 1 1 23
02/02/2001 2 0 10
03/02/2001 3 1 25
02/02/2001 2 0 10
01/02/2001 1 1 22
04/02/2001 4 1 24
02/02/2001 1 1 19
05/02/2001 5 0 13
06/02/2001 6 1 26

Output:

-------------------------------------------
Time: 2018-07-05 09:08:00
-------------------------------------------
('01/02/2001', 1, 1, 23)
('03/02/2001', 3, 1, 25)
('01/02/2001', 1, 1, 22)
('04/02/2001', 4, 1, 24)
('02/02/2001', 1, 1, 19)
('06/02/2001', 6, 1, 26)
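The same parse-then-filter logic can be tried on the input above in plain Python, without a running stream. This is only a sketch of what the two map steps and the filter do to one batch; the list comprehensions stand in for the DStream transformations:

```python
# The nine input records from this section, treated as one batch
lines = [
    "01/02/2001 1 1 23",
    "02/02/2001 2 0 10",
    "03/02/2001 3 1 25",
    "02/02/2001 2 0 10",
    "01/02/2001 1 1 22",
    "04/02/2001 4 1 24",
    "02/02/2001 1 1 19",
    "05/02/2001 5 0 13",
    "06/02/2001 6 1 26",
]

# First map: split each line into its four fields
fields = [line.split(" ") for line in lines]

# Second map: convert sensor number, state, and temperature to int
records = [(f[0], int(f[1]), int(f[2]), int(f[3])) for f in fields]

# Filter: keep only records whose state (third field) equals 1
stable = [record for record in records if record[2] == 1]

for record in stable:
    print(record)
```

Six of the nine records pass the filter, matching the streaming output shown above.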

Complete code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingFiltration")
    ssc = StreamingContext(sc, 30)

    ds = ssc.socketTextStream('localhost', 8080)

    # Parse each line into (date, sensor, state, temperature)
    data = ds.map(lambda line: line.split(" "))\
        .map(lambda l: (l[0], int(l[1]), int(l[2]), int(l[3])))
    # Keep only records whose state is 1 (stable)
    data_filter = data.filter(lambda line: line[2] == 1)
    data_filter.pprint()

    ssc.start()
    ssc.awaitTermination()


The maximum temperature of the sensor

Now we’ll make our program more interesting. Let’s assume that we need to know the highest temperature reported by our sensors. To get this information, we use the reduce transformation with the max function. The max function returns the larger of two values, and the reduce transformation returns a single value calculated by aggregating the elements of each RDD with the given function (max in our case).

Again, leave the program skeleton the same. Replace the lines where the transformations take place and the result is printed with the following ones:

temperature = ds.map(lambda line: int(line.split(" ")[3]))
result = temperature.reduce(max)
result.pprint()

For the same input, you will get the following result:

-------------------------------------------
Time: 2018-06-14 12:42:00
-------------------------------------------
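As a cross-check that does not require Spark, the sketch below applies functools.reduce with max to the temperatures from the filtering input, assuming all nine records arrive within a single batch. It is a plain-Python stand-in for the reduce transformation, not Spark's own implementation:

```python
from functools import reduce

# The nine input records from the filtering example, treated as one batch
lines = [
    "01/02/2001 1 1 23",
    "02/02/2001 2 0 10",
    "03/02/2001 3 1 25",
    "02/02/2001 2 0 10",
    "01/02/2001 1 1 22",
    "04/02/2001 4 1 24",
    "02/02/2001 1 1 19",
    "05/02/2001 5 0 13",
    "06/02/2001 6 1 26",
]

# Same as the map step: keep the temperature (fourth column) as an int
temperatures = [int(line.split(" ")[3]) for line in lines]

# reduce applies max pairwise until a single value remains
highest = reduce(max, temperatures)
print(highest)  # 26
```

If the records were spread over several batches, Spark would print one maximum per batch instead.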

Complete code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingNetworkMaxTemperature")
    ssc = StreamingContext(sc, 30)

    ds = ssc.socketTextStream('localhost', 8080)
    # Keep the temperature (fourth column) and take the maximum per batch
    temperature = ds.map(lambda line: int(line.split(" ")[3]))
    result = temperature.reduce(max)
    result.pprint()

    ssc.start()
    ssc.awaitTermination()



Conclusions

Streaming data is widely used in various practical applications. One of the most popular streaming applications is receiving data from different sensors in real time. There are many tools for streaming data analysis (Flink, Storm, Kafka, Spark, Samza, Kinesis, etc.), among which Apache Spark is one of the most popular due to its convenience and simplicity. Apache Spark has a dedicated library for streaming. In this article, we've given a brief overview of Spark Streaming functionality and applied a few basic functions to our data. We have shown that with only a few lines of code it is possible to obtain useful information from a data stream, analyze it, and print it.
