The previous post, “Poor Data Management Practices,” ended with a high-level approach to one possible solution for data silos. Traditional approaches to the data silo problem can cost millions of dollars (even for a moderately sized company) and typically require a huge integration effort (e.g., data modeling, systems engineering, software design, and development). In this post, Flafka, the unofficial name for the integration of Flume as a producer for Kafka, is presented as another possible big data solution for data silos.
Without inundating you with technical jargon: Apache Flume is a distributed service that is very efficient at collecting and moving large amounts of data into Hadoop (e.g., click-stream data, security log files, and application data). Flume provides sinks into the Hadoop ecosystem such as HBase, Solr, HDFS, and Kafka, to name a few. An agent can also have multiple sources, and these built-in mechanisms allow a lot of work to be done without writing a single line of code. The data can be transformed, augmented, filtered, and aggregated during the ingest process as well. The architecture is flexible and fault tolerant, with many tuning and failover mechanisms. The available sources, channels, and sinks are listed in Figure 1:
Kafka is typically used to build real-time streaming data pipelines that reliably 1) move data between systems and applications, and 2) transform or react to streams of data. It is a publish-subscribe messaging system with client libraries for many programming languages, which makes it extremely flexible. Kafka has four core APIs: 1) Producer, 2) Consumer, 3) Streams, and 4) Connector.
Kafka and Flume have similar, and sometimes overlapping, capabilities. Kafka stands apart in its functional simplicity and messaging capabilities, and it provides more durability through replication, syncing data across multiple disks. Kafka is extremely simple to set up and configure. Simplicity is usually a good thing, but in some cases it means the complexities are left to the user. Part of Kafka’s simplicity comes from its lack of dependencies on other applications (other than Apache ZooKeeper), and, as a result, the responsibility for writing producers and consumers falls on the developer.
Fortunately, writing code for Kafka is not extremely complicated (more than 15 languages are supported, including the popular Java, Scala, and Python), and things have been made even simpler by the integration of Flume with Kafka. This integrated capability is available in Cloudera’s Distribution Including Apache Hadoop (CDH) 5.2, in Hortonworks Data Platform (HDP) 2.4, and in the open source Apache Hadoop ecosystem itself.
By using the Kafka sink, Flume can publish messages to multiple Kafka topics. A Flume agent consists of three elements: 1) a source, 2) a channel, and 3) a sink.
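As a concrete illustration, here is a minimal sketch of a Flume agent configuration that publishes log data to a Kafka topic. It assumes the Flume 1.7+ Kafka sink property names; the agent name (`a1`), log path, topic, and broker addresses are all illustrative.

```properties
# Agent "a1": tail a log file, buffer events in memory, publish to Kafka.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log file (path is illustrative).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: publish each event to a Kafka topic.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.sinks.k1.channel = c1
```

Swapping the source type (e.g., to `spooldir` or `syslogtcp`) is a configuration change only; no code is required.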
Conversely, Kafka can be used as the source for Flume:
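A sketch of the reverse direction, again assuming Flume 1.7+ property names (older releases configured the Kafka source via `zookeeperConnect` instead); topic and broker names are illustrative:

```properties
# Agent "a1": consume from a Kafka topic and deliver events to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read messages from Kafka.
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092
a1.sources.r1.kafka.topics = app-logs
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# Sink: write the consumed events into HDFS (path is illustrative).
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/app-logs/%Y-%m-%d
a1.sinks.k1.channel = c1
```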
You can probably see where I am going with this, and how this framework can serve as an integration mechanism for multiple disparate systems. The topics on a Kafka server could range from network log feeds for security monitoring, to social media feeds for monitoring consumer sentiment, to banking and credit card transactions and insurance claims. Customers, their accounts, and essentially any activity across multiple lines of business can be aggregated and reported. The best part: this framework is relatively simple and can be implemented without disrupting current system processes.
When Kafka serves as a sink for Flume, the incoming data can be processed using either a Java interceptor class or Morphlines, a configuration-driven library that can filter and transform data (with regular expressions, among other commands). Spark can also be used in this framework, since Kafka currently relies on an external stream processing system to process its messages; an obvious choice for this activity is Spark, which integrates well with Kafka and supports exactly-once processing of the messages it reads from Kafka.
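For simple filtering, Flume’s built-in interceptors avoid custom Java entirely. A sketch using the regex filtering interceptor, attached to the source of the hypothetical agent `a1` above, which keeps only ERROR and WARN lines:

```properties
# Attach a built-in regex filtering interceptor to source r1
# (agent/source names and pattern are illustrative).
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ERROR|WARN
# false = keep matching events; true would drop them instead.
a1.sources.r1.interceptors.i1.excludeEvents = false
```

Anything beyond pattern matching (enrichment, format conversion) is where a custom interceptor class or a Morphlines command chain comes in.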
As you can see, the possibilities are essentially endless in a distributed computing environment where resources are abundant. In addition to Flafka with multiple Kafka topics, you can build in multi-channel sinks, as depicted in the figure below, and you can run multiple agents.
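The multi-channel fan-out can be sketched as a single source feeding two channels, each with its own Kafka sink. This uses Flume’s replicating channel selector; topic and broker names are illustrative:

```properties
# Agent "a1": one source replicated into two channels, each
# publishing to a different Kafka topic.
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1 c2
# "replicating" copies every event to all listed channels;
# "multiplexing" would route by header value instead.
a1.sources.r1.selector.type = replicating

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic-a
a1.sinks.k1.kafka.bootstrap.servers = broker1:9092
a1.sinks.k1.channel = c1

a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.topic = topic-b
a1.sinks.k2.kafka.bootstrap.servers = broker1:9092
a1.sinks.k2.channel = c2
```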
With all of this flexibility comes the need for governance. The power of this framework is obvious, and it is amazingly simple. Before implementing it, however, there must be clear definitions of how data will flow, the structures that will be used, the data formats, and the methods of consumption. It is extremely tempting to just implement a solution and grow it with changing requirements, but without a defined enterprise information architecture in place you might end up with something worse than data silos. At least with data silos you know where to get the data specific to that silo, say banking, for example. Developed without the proper technical discipline, this architecture could become a tangled web of data.