From Unstable to Reliable: Adding Kafka to Our System

In our previous post, we shared how we helped a client handle extreme database load by redesigning their architecture. One of the key components in that setup was Kafka—our way of decoupling the data producer from the database to avoid performance bottlenecks.

In this post, we’ll walk through how we installed and configured Kafka step by step, and explain what we paid attention to along the way.

Kafka’s Role in the System

Kafka acts as a buffer between the producer and the database. Instead of pushing data directly into the database, which introduces latency and reliability risks under load, Kafka allows the producer to hand off the data quickly while decoupling the ingestion and storage layers.

This offers several key benefits:

  • Reduces the load on the database during traffic peaks
  • Shields the database from application-side spikes
  • Enables retry mechanisms and delayed processing without data loss

It’s not just about speed—it’s about system resilience.

Step 1: Download Kafka and Set Up the Directory

We began with a clean virtual machine and downloaded Kafka directly from Apache’s official distribution:

wget https://dlcdn.apache.org/kafka/3.9.0/kafka_2.13-3.9.0.tgz
tar -xzf kafka_2.13-3.9.0.tgz
mv kafka_2.13-3.9.0 /home/<user>/kafka/

Kafka requires Java, so we made sure that was installed:

sudo apt install default-jre
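
Kafka 3.9 runs on a recent JRE (Java 11 or 17 are the commonly used versions), so a quick check confirms what the package manager installed:

java -version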

Step 2: KRaft Mode Configuration

ZooKeeper has traditionally been a core dependency in Kafka deployments. It's used to manage metadata, leader elections, and coordination between brokers. While powerful, it also introduces complexity—another service to install, monitor, and secure.

Starting with version 2.8 and becoming more stable in later releases, Kafka introduced KRaft mode (Kafka Raft Metadata mode) as an alternative. KRaft replaces ZooKeeper with an internal consensus mechanism, allowing Kafka to manage its own metadata natively.

For our setup, we chose KRaft mode. This allowed us to reduce moving parts and simplify the overall system—particularly useful in a proof-of-concept environment where we needed full visibility and control without additional overhead.

We edited the file config/kraft/server.properties with the following settings:

process.roles=broker,controller
node.id=1
controller.quorum.voters=1@<poc-kafka-1-IP>:9093,2@<poc-kafka-2-IP>:9093,3@<poc-kafka-3-IP>:9093

listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://<poc-kafka-1-IP>:9092,CONTROLLER://<poc-kafka-1-IP>:9093

inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
log.dirs=/home/<user>/kafka/logs

num.partitions=12
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
log.retention.hours=24
delete.topic.enable=true

These settings defined how the broker would communicate with clients and with other nodes in the cluster. We kept it simple, but scalable.
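
The values above are for the first node. The other two brokers in the quorum use the same file with their own node.id and advertised addresses (the IPs below are placeholders matching the voter list):

node.id=2
advertised.listeners=PLAINTEXT://<poc-kafka-2-IP>:9092,CONTROLLER://<poc-kafka-2-IP>:9093

controller.quorum.voters and the remaining settings stay identical on all three nodes.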

Step 3: Format the Storage (KRaft-Specific)

KRaft requires you to initialize the cluster storage with a UUID before the broker can start. This is a one-time operation.

We generated a UUID like this:

./bin/kafka-storage.sh random-uuid

Then used it to format the storage:

./bin/kafka-storage.sh format -t QDtk213UQxymd43TjNZN2g -c config/kraft/server.properties

This step is essential: without a formatted storage directory and cluster ID, the broker will refuse to start.
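
The format command writes a meta.properties file into the configured log.dirs, which is an easy way to verify the result. The same cluster UUID must be used when formatting storage on every node of the cluster:

cat /home/<user>/kafka/logs/meta.properties

The output should show a cluster.id matching the generated UUID and the node.id from server.properties.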

Step 4: Setting Up the Prometheus JMX Exporter

First, we downloaded the Prometheus JMX exporter Java agent:

wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar

Next, we created the exporter configuration file (jmx-exporter-config.yaml), which maps Kafka's JMX MBeans to Prometheus metrics:

lowercaseOutputName: true

rules:
# Special cases and very specific rules
- pattern : kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
  name: kafka_server_$1_$2
  type: GAUGE
  labels:
    clientId: "$3"
    topic: "$4"
    partition: "$5"
# ... (remaining rules omitted for brevity)
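
We placed the agent jar and this configuration at the paths the systemd unit below points to (the directory layout is ours; adjust it to your environment):

mkdir -p /home/<user>/kafka/kafka_2.13-3.9.0/etc
mv jmx-exporter-config.yaml /home/<user>/kafka/kafka_2.13-3.9.0/etc/
mv jmx_prometheus_javaagent-0.20.0.jar /home/<user>/kafka/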

Step 5: Running Kafka as a Systemd Service

To ensure that Kafka would start automatically and recover gracefully from crashes, we defined it as a systemd service. This allowed us to manage Kafka like any other system component—with reliable startup, controlled restarts, and predictable behavior.

Here’s the systemd service file we used (/lib/systemd/system/kafka.service):

[Unit]
Description=kafka
After=network.target

[Service]
Type=simple
ExecStart=/home/<user>/kafka/kafka_2.13-3.9.0/bin/kafka-server-start.sh /home/<user>/kafka/kafka_2.13-3.9.0/config/kraft/server.properties
ExecStop=/home/<user>/kafka/kafka_2.13-3.9.0/bin/kafka-server-stop.sh
Restart=on-failure
RestartSec=10

Environment="KAFKA_HEAP_OPTS=-Xmx10G -Xms10G"
Environment="KAFKA_OPTS=-Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.ssl=false -javaagent:/home/<user>/kafka/jmx_prometheus_javaagent-0.20.0.jar=9099:/home/<user>/kafka/kafka_2.13-3.9.0/etc/jmx-exporter-config.yaml"

StandardOutput=journal
StandardError=journal
SyslogIdentifier=kafka

LimitNOFILE=65536
LimitCORE=infinity

[Install]
WantedBy=multi-user.target

This setup also included a JMX exporter for Prometheus, which made integration with monitoring tools like PMM and Grafana seamless.

Step 6: Start and Verify Kafka

We reloaded systemd and started the service:

sudo systemctl daemon-reload
sudo systemctl enable kafka
sudo systemctl start kafka

To make sure it was working:

systemctl status kafka
netstat -lntp | grep 9092

If Kafka is listening on port 9092, the broker is up and accepting client connections.
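
As a further sanity check, the bundled CLI tools should be able to reach the broker, for example by listing topics or creating a small test topic (the topic name here is just an example):

cd /home/<user>/kafka/kafka_2.13-3.9.0
./bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
./bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic smoke-test --partitions 3 --replication-factor 2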

Step 7: Monitoring Kafka with PMM via JMX Exporter

To make Kafka observable through PMM (Percona Monitoring and Management), we first needed a way to expose its internal metrics in a format that Prometheus—and by extension, PMM—could understand.

We achieved this by including the Prometheus JMX Exporter as a Java agent in the Kafka service. This allowed us to expose metrics like broker health, topic throughput, and partition activity over an HTTP endpoint on port 9099.
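
A quick curl against the agent confirms that metrics are actually being exported before wiring anything into PMM (the exact metric names depend on the rules in the exporter config):

curl -s http://localhost:9099/metrics | grep ^kafka_server | head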

Since PMM doesn’t natively support Kafka, we added it as an external service, pulling metrics from the exposed endpoint:

pmm-admin add external --service-name kafka-metrics --listen-port=9099

This integration allowed us to view Kafka metrics in Grafana dashboards and receive alerts, just like we would for a database server.

Step 8: Verifying Kafka with a Custom Producer and Consumer

After setting up the Kafka cluster, we wanted to make sure it was working as expected. Rather than relying on canned test tools, we used a custom binary called kafka-producer-consumer—a lightweight Go application designed to push and pull data through Kafka at scale.

This allowed us to simulate realistic workloads and validate the full pipeline, from producer to consumer, with actual data flow.

🧪 Running the Producer

To simulate a burst of sensor data, we ran the producer with parameters that define how many messages and sensors to simulate:

./kafka-producer-consumer -run-producer -msg-count 100000 -sensors-count 40 -metrics-count 40 &

This command produced 100,000 messages across 40 sensors, each reporting 40 types of metrics, and sent them into Kafka.

We monitored the broker’s metrics and logs during this time to confirm Kafka was receiving and queuing the messages without errors or slowdowns.
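
To confirm the messages actually landed in the topic, partition end offsets can be compared before and after a run using the helper shipped with Kafka 3.x (the topic name below is a placeholder for whatever the producer writes to):

./bin/kafka-get-offsets.sh --bootstrap-server localhost:9092 --topic <topic>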

🧪 Running the Consumer

To validate end-to-end delivery, we ran a consumer that reads from Kafka and writes to a PostgreSQL database:

./kafka-producer-consumer --run-consumer --db postgres \
  -dsn 'host=10.117.209.18 user=consumer password=consumer port=5432 dbname=tscale sslmode=disable' >> consumer1.log 2>&1 &

This process pulled messages from Kafka and inserted them into the target database. We tailed the logs to observe insert performance and error handling:

tail -f consumer1.log
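
Beyond the application log, the broker-side view of consumption, including per-partition lag, can be checked with the consumer-groups tool (the group name is a placeholder; the actual group is set by the consumer binary):

./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <consumer-group>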

This step gave us confidence that:

  • The Kafka broker was accepting and queuing data correctly
  • Messages were being read from the correct topic and consumer group
  • The downstream system (PostgreSQL) could keep up with ingestion

Thanks for reading! Having problems with your database? Don't hesitate to reach out via our contact page.
