11 Steps for understanding Confluent Kafka for Data Engineers!

Before you invest in the Confluent $CFLT IPO, you might want to learn about Kafka. ;)

Richie Bachala
5 min read · Jun 24, 2021

Apache Kafka's common benefits include building data pipelines, leveraging real-time data streams, enabling operational metrics, and integrating data across countless data sources.

The 11-step technical architecture for building Kafka-based data streams:

Cluster Reference Architecture

A Confluent Platform cluster built for high throughput and long-term scalability will have an architecture similar to the following:

This architecture example is designed to scale. Each component is given its own servers, and if any layer becomes overly loaded it can be scaled independently simply by adding nodes to that specific layer.


1. ZooKeeper:

· Manages distributed processes and cluster membership, and elects the cluster controller.

· High availability requires 5 ZooKeeper nodes, which tolerates 2 node failures (a majority of 3 must survive to keep quorum).

· The number of ZooKeeper nodes must be odd.

2. Kafka Brokers:

· The main storage and messaging component of the cluster; Brokers maintain streams of messages, called Topics.

· Topics are sharded into partitions: ordered, immutable logs of messages.

· Partitions are replicated and distributed for high availability.

· Plan for at least 4 Kafka Brokers in a cluster, with each replica of a partition placed on a separate server/node (see the topic-creation sketch after this list).

· Highly loaded Brokers should have their own nodes; otherwise they can share nodes with ZooKeeper, as long as ZooKeeper gets its own disks/controllers.
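
A minimal sketch of creating such a replicated Topic with the Java AdminClient; the broker addresses and the `orders` topic name are hypothetical placeholders:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed addresses for brokers in a local multi-broker cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread load; replication factor 3 keeps each
            // partition's replicas on 3 separate brokers for availability.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```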

3. Kafka Connect Workers:

· Stateless Workers that integrate Kafka with external systems, pulling data from sources and pushing data to targets.

· Pluggable connectors, e.g., a JDBC Source feeding records in Avro or JSON format to an Azure Data Lake Storage Gen2 Sink.

· The Kafka Connect Snowflake Sink connector maps and persists events from Topics directly into Snowflake, exposing the data to services for querying, enrichment, and analytics.

· Standalone Mode: run a File or Spooling Directory Connector on the producer machine to read files and send events to Kafka.

· Cluster Mode: Workers deployed on multiple machines discover each other, use the Brokers as a synchronization layer, and automatically load-balance and fail over work between them.

· Connectors are managed on the cluster through any Connect node (aka “Worker”) via a REST API that can configure, start, stop, pause, and resume them; see the sketch after this list. Once a connector starts, regardless of which node received the request, its parallel tasks are balanced onto the least loaded available Worker nodes.

· Because Workers are stateless, they can be deployed in containers.
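
A hedged sketch of registering a JDBC Source connector through the Connect REST API in distributed mode; the Worker host, connector name, and database settings are made-up examples:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector name and JDBC settings are hypothetical examples.
        String config = """
            {
              "name": "jdbc-orders-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://db:5432/shop",
                "mode": "incrementing",
                "incrementing.column.name": "id",
                "topic.prefix": "shop-",
                "tasks.max": "2"
              }
            }""";

        // Any Worker in the cluster can accept this request; the cluster
        // then spreads the connector's tasks across available Workers.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect-worker:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```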

4. Kafka Clients

Java client JARs that interface applications with Kafka.

· Ship alongside the Kafka Brokers but are deployed with applications as client libraries (a minimal producer sketch follows this list).

· Equivalent APIs and semantics are available in a variety of programming languages (C, C++, Python, and Go).
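
A minimal Java producer sketch to make the client model concrete; the broker address, topic, and record contents are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key picks the partition; the same key always lands on the
            // same partition, preserving per-key ordering in the Topic log.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"amount\": 19.99}"));
            producer.flush();
        }
    }
}
```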

5. Kafka Streams APIs

· An application-embedded library for building distributed stream-processing applications, including handling of late-arriving data.

· Benefits from higher CPU/core counts.

· When multiple instances of the application are deployed on multiple servers, Kafka Streams will automatically load-balance and fail over between them (see the sketch below).
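
A small Kafka Streams sketch of the embedded-library model: every instance started with the same application.id joins the same logical application, and the library divides the input partitions among instances and rebalances on failure. Topic names and the filter rule are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargeOrderFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        // All instances sharing this application.id form one logical app.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-order-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        // Hypothetical rule: route "large" order events to their own topic.
        orders.filter((key, value) -> value.contains("\"large\":true"))
              .to("large-orders");

        new KafkaStreams(builder.build(), props).start();
    }
}
```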

6. ksqlDB Server

· A SQL engine that runs continuous queries against Apache Kafka in real time.

· A command-line interface allows interactive queries against the ksqlDB server from any machine; the server can also be driven over HTTP, as sketched below.

· The ksqlDB server performs the data processing and writes results back to the target Kafka cluster.

· ksqlDB is deployed on a set of servers/nodes that form a cluster, sized by the processing capacity required (number and complexity of concurrent queries); the cluster handles load-balancing and fail-over automatically.

· Benefits from higher CPU counts, network throughput, and SSD storage.
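
A sketch of submitting a persistent query to the ksqlDB server's REST endpoint, assuming a server at ksqldb-server:8088 and a pre-existing orders stream (both hypothetical):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsqlStatement {
    public static void main(String[] args) throws Exception {
        // Persistent query: continuously filters the hypothetical "orders"
        // stream into a new "large_orders" stream inside the ksqlDB cluster.
        String body = """
            {
              "ksql": "CREATE STREAM large_orders AS SELECT * FROM orders WHERE amount > 100 EMIT CHANGES;",
              "streamsProperties": {}
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://ksqldb-server:8088/ksql"))
                .header("Content-Type", "application/vnd.ksql.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```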

7. Confluent REST Proxy

An HTTP server that provides a RESTful interface to the Kafka cluster.

· A mandatory component when clients use the RESTful HTTP protocol; deploy it on its own machine/node.

· If the application uses a native Kafka client, there is no need to deploy the REST Proxy.

· Deploy multiple REST Proxies behind a sticky load balancer so that requests for the same Consumer always reach the same Proxy (see the sketch below).
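
A minimal sketch of producing JSON records through the REST Proxy's v2 HTTP API instead of a native client; the proxy host and topic name are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduce {
    public static void main(String[] args) throws Exception {
        // One HTTP POST produces two JSON records to the "orders" topic.
        String records = """
            {"records": [
              {"key": "order-42", "value": {"amount": 19.99}},
              {"key": "order-43", "value": {"amount": 5.00}}
            ]}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://rest-proxy:8082/topics/orders"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(records))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // reports partition/offset per record
    }
}
```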

8. Confluent Schema Registry

A metadata layer with a RESTful interface for storing and retrieving Avro schemas.

· Stores a versioned history of all schemas and allows compatibility settings.

· Includes message serializers that plug into Kafka clients (see the sketch after this list).

· Handles schema storage and retrieval for Kafka messages sent in Avro format.

· Installed on its own server/node, though small installations can run alongside the REST Proxy and Connect Workers.

· High availability requires multiple Schema Registry servers/nodes.

· A multi-node Schema Registry uses a leader-follower architecture, with at most one node acting as Leader at any given time.

· Only the Leader can publish writes to the Kafka log, while all nodes can serve read requests.

· Follower nodes forward write requests to the Leader.

· Schema Registry stores its schemas in Kafka itself, so the nodes require no local storage and can be deployed in containers.
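
A sketch of a producer using Confluent's Avro serializer with Schema Registry; the registry URL, topic, and the toy Order schema are illustrative:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/looks up schemas automatically.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":" +
            "[{\"name\":\"amount\",\"type\":\"double\"}]}");
        GenericRecord order = new GenericData.Record(schema);
        order.put("amount", 19.99);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer stores a schema ID with each message; consumers
            // fetch the schema from the Registry to deserialize.
            producer.send(new ProducerRecord<>("orders", "order-42", order));
        }
    }
}
```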

9. Confluent Replicator

Manages multi-cluster deployments of Kafka.

· Provides centralized configuration of cross-cluster replication.

· Replicates Topic configuration in addition to Topic messages.

· Integrated with the Kafka Connect framework and installed on Connect nodes in the destination cluster; when there are multiple Connect Workers, Replicator should be installed on all of the Connect nodes (see the sketch below).

· A larger number of Replicator nodes provides high availability with built-in fail-over.
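
A hedged sketch of configuring Replicator as a Connect connector on the destination cluster. The connector class and config keys below are quoted from memory of Confluent's Replicator documentation and should be verified against the current docs; the hosts and topic names are made up:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterReplicator {
    public static void main(String[] args) throws Exception {
        // Class/key names are assumptions from memory; verify before use.
        String config = """
            {
              "name": "replicator-dc1-to-dc2",
              "config": {
                "connector.class": "io.confluent.connect.replicator.ReplicatorSourceConnector",
                "src.kafka.bootstrap.servers": "dc1-broker1:9092",
                "dest.kafka.bootstrap.servers": "dc2-broker1:9092",
                "topic.whitelist": "orders",
                "topic.config.sync": "true",
                "tasks.max": "4"
              }
            }""";

        // Submitted to a Connect Worker in the destination cluster.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://dc2-connect-worker:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        System.out.println(HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```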

10. Confluent Auto Data Balancing

· Optimizes resource utilization when scaling Kafka clusters.

· Can be installed on any machine in the Confluent Platform cluster that can communicate with the Brokers and ZooKeeper, but installing it on the Brokers or on the Control Center node is recommended.

· Collects load metrics and issues instructions to move partitions between Brokers.

11. Confluent Control Center

· A web-based tool for managing and monitoring data pipelines and streaming applications on Apache Kafka.

· Data-stream monitoring and alerting, with drill-down from Producer to Consumer.

· Multi-cluster monitoring and management of data replication between clusters.

· Kafka Connect configuration for adding new sources that load data from external systems.

· Runs on a single dedicated machine.

Key Features:

· BYOK Customer Master Key Encryption

· Role Based Access

· User and Application Authentication

· Monitors large and small cluster health and usage down to the Topic level

· Monitors Producers and Consumers, connecting them to Topic-level metrics

· Schema Registry provides a RESTful interface for Topic metadata shared by multiple Producers and Consumers

· High availability requires multiple Schema Registry servers

· Kafka clients use Schema Registry for schema storage and retrieval when messages are sent in Avro format

· Uses ksqlDB for real-time stream processing

· Connectors send data to cloud destinations such as the Azure Blob Storage Sink or the Snowflake Sink

· Pay only for what you use

The above architecture is how my team is looking to scale Apache Kafka and stream analytics within our enterprise, in partnership with Confluent.io.

Thanks for reading!

https://twitter.com/richiebachala

Join my team and the Confluent community of Kafka developers on the Slack channel below.

#KafkaStreams #Kafka #ApacheKafka #DataEngineering
