Why would a data scientist use Kafka, Jupyter, Python, KSQL and TensorFlow all together in a single notebook?
There is an impedance mismatch between model development using Python and its machine learning tool stack on one side and a scalable, reliable data platform on the other. The former is what you need for quick and easy prototyping to build analytic models. The latter is what you need for data ingestion, preprocessing, model deployment and monitoring at scale, where it has to deliver low latency, high throughput, zero data loss and 24/7 availability.
This is the main reason I see in the field why companies struggle to bring analytic models into production to add business value. In practice, Python is not the best-known technology for large-scale, performant and reliable environments. However, it is a great tool for data scientists and a great client of a data platform like Apache Kafka.
Therefore, I created a project to demonstrate how this impedance mismatch can be solved. A much more detailed blog post about this topic will come on the Confluent Blog soon. In this blog post, I want to discuss and share my GitHub project:
"Making Machine Learning Simple and Scalable with Python, Jupyter No...". This project includes a complete Jupyter demo which combines:
If you want to learn more about the relation between the Apache Kafka open source ecosystem and Machine Learning, please check out these two blog posts:
Let's quickly describe these components and then take a look at the combination of them in a Jupyter notebook.
Jupyter exists to develop open-source software, open-standards, and services for inte.... Therefore, it is a great tool to build analytic models using Python and machine learning / deep learning frameworks like TensorFlow.
Using Jupyter notebooks (or similar tools like Google's Colab or Apache Zeppelin) together with Python and your favorite ML framework (TensorFlow, PyTorch, MXNet, H2O, "you-name-it") is the best and easiest way to prototype and build demos.
However, building prototypes or even sophisticated analytic models in a Jupyter notebook with Python is a different challenge than building a scalable, reliable and performant machine learning infrastructure. I always refer to the great paper Hidden Technical Debt in Machine Learning Systems for this discussion:
Think about use cases where you CANNOT go into production without large scale. For instance, connected car infrastructures, payment and fraud detection systems or global web applications with millions of users. This is where the Apache Kafka ecosystem comes into play.
Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation. It is written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency streaming platform for handling and processing real-time data feeds.
Confluent KSQL is the streaming SQL engine that enables real-time data processing against.... It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic and fault-tolerant. It supports a wide range of streaming operations, for example data filtering, transformations, aggregations, joins, windowing, and sessionization.
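To make two of these operations concrete, here is a minimal sketch of a filter and a windowed aggregation written as KSQL statements and submitted from Python through the ksql-python client. The stream, table and column names (`payments`, `large_payments`, `payments_per_card`, `card_id`, `amount`), the threshold and the server URL are assumptions for illustration and are not taken from the project.

```python
def filter_statement(source: str, target: str, min_amount: float) -> str:
    """Build a KSQL statement that keeps only payments above min_amount."""
    return (
        f"CREATE STREAM {target} AS "
        f"SELECT * FROM {source} WHERE amount > {min_amount};"
    )


def windowed_count_statement(source: str, target: str, seconds: int) -> str:
    """Build a KSQL statement that counts payments per card in tumbling windows."""
    return (
        f"CREATE TABLE {target} AS "
        f"SELECT card_id, COUNT(*) AS payment_count FROM {source} "
        f"WINDOW TUMBLING (SIZE {seconds} SECONDS) GROUP BY card_id;"
    )


def submit(statements, url="http://localhost:8088"):
    """Send the statements to a running KSQL server (assumed to listen at url)."""
    from ksql import KSQLAPI  # ksql-python; imported lazily, needs a live server

    client = KSQLAPI(url)
    return [client.ksql(statement) for statement in statements]


# Hypothetical stream and table names, used only to show the statement shape.
statements = [
    filter_statement("payments", "large_payments", 100),
    windowed_count_statement("payments", "payments_per_card", 30),
]
for statement in statements:
    print(statement)
```

Building the statements as plain strings keeps them easy to inspect in a notebook cell before `submit(statements)` sends them to the server.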
Check out these slides and video recording from my talk at Big Data Spain 2018 in Madrid if you want to learn more abo....
To solve the hidden technical debt in Machine Learning infrastructures, you can combine the benefits of ML-related tools and the Apache Kafka ecosystem:
The following diagram depicts an example of such an architecture:
If you want to get a better understanding of the relation between the Apache Kafka ecosystem and Machine Learning / Deep Learning, check out the following material:
Let's now take a look at an example which combines all these technologies (Python, Jupyter, Kafka, KSQL and TensorFlow) to build a scalable but easy-to-use environment for machine learning.
This Jupyter notebook is not meant to be perfect or to follow every coding and ML best practice. It is a simple guide to building your own notebooks where you can combine Python APIs with Kafka and KSQL.
We use a test data set of credit card payments from Kaggle as the foundation to train an unsupervised autoencoder to detect anomalies and potential fraud in payments.
The focus of this project is not just model training, but the whole Machine Learning infrastructure including data ingestion, data preprocessing, model training, model deployment and monitoring. All of this needs to be scalable, reliable and performant.
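The idea behind this approach is that an autoencoder trained mostly on normal payments learns to reproduce them well, so a payment it reconstructs poorly is a candidate anomaly. The following is a minimal sketch of that idea, not the model from the notebook: the layer sizes, activations and threshold are assumptions, and the toy numbers stand in for real model output.

```python
def build_autoencoder(n_features: int):
    """Return a small dense autoencoder; layer sizes are illustrative assumptions."""
    import tensorflow as tf  # imported lazily so the helpers below work without it

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(14, activation="tanh"),   # encoder
        tf.keras.layers.Dense(7, activation="relu"),    # bottleneck
        tf.keras.layers.Dense(14, activation="tanh"),   # decoder
        tf.keras.layers.Dense(n_features, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def reconstruction_error(original, reconstructed):
    """Mean squared error between one input row and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def flag_anomalies(errors, threshold):
    """Mark every row whose reconstruction error exceeds the threshold."""
    return [error > threshold for error in errors]


# Toy values standing in for model.predict() output; not real payment data.
rows = [[0.0, 1.0], [1.0, 1.0]]
reconstructions = [[0.0, 1.0], [3.0, 1.0]]
errors = [reconstruction_error(x, y) for x, y in zip(rows, reconstructions)]
print(flag_anomalies(errors, threshold=1.0))  # prints [False, True]
```

In practice the threshold would be chosen from the error distribution on held-out data rather than fixed by hand.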
The notebook walks you through the following steps:
Here is a screenshot of the Jupyter notebook where we use the ksql-python API to
Check out the complete Jupyter Notebook to see how to combine Kafka, KSQL, Numpy, Pandas, ... to integrate and preprocess data and then train your analytic model.
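As a rough sketch of the hand-over from Kafka to Python, the snippet below reads messages from a topic and turns them into numeric feature rows that can be passed on to Pandas or a model. It assumes JSON-encoded messages, the `confluent_kafka` client, a broker on `localhost:9092`, and hypothetical column names and consumer group; the notebook itself may use a different format and client.

```python
import json


def to_feature_row(raw: bytes, columns):
    """Decode one JSON-encoded Kafka message into an ordered list of floats."""
    record = json.loads(raw.decode("utf-8"))
    return [float(record[name]) for name in columns]


def read_rows(topic, columns, max_rows, servers="localhost:9092"):
    """Pull up to max_rows messages from a topic (needs a running broker)."""
    from confluent_kafka import Consumer  # imported lazily; assumed to be installed

    consumer = Consumer({
        "bootstrap.servers": servers,
        "group.id": "notebook-demo",       # hypothetical consumer group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    rows, empty_polls = [], 0
    try:
        # Give up after ten empty polls so a notebook cell never blocks forever.
        while len(rows) < max_rows and empty_polls < 10:
            message = consumer.poll(1.0)
            if message is None:
                empty_polls += 1
                continue
            if message.error():
                continue                   # skip messages that carry an error
            rows.append(to_feature_row(message.value(), columns))
    finally:
        consumer.close()
    return rows


# Toy message with made-up field names and values, decoded without a broker.
sample = b'{"TIME": 0, "V1": -1.36, "AMOUNT": 149.62}'
print(to_feature_row(sample, ["V1", "AMOUNT"]))  # prints [-1.36, 149.62]
```

The rows returned by `read_rows` can be loaded with `pandas.DataFrame(rows, columns=columns)` for further preprocessing.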
Yes, you can also use Pandas, scikit-learn, TensorFlow Transform, and other Python libraries in your Jupyter notebook. Please do so where it makes sense! This is not an "either ... or" question. Pick the right tool for the right problem.
The key point is that the Kafka integration and KSQL statements allow you to
Check out the complete Jupyter notebook to see a full example which combines Python, Kafka.... In my opinion, this is a great combination and valuable for both data scientists and software engineers.
I would like to get your feedback. Do you see any value in this? Or does it not make any sense in your scenarios and use cases?