High-traffic database environments are full of challenges. Performance bottlenecks, data integrity issues, and system reliability concerns can quickly escalate, impacting critical business operations. Many companies delay addressing these issues, assuming that hardware upgrades are the only solution. But what if you could scale efficiently without immediately investing in expensive infrastructure changes?
This is exactly the problem one of our clients faced. Although they may eventually purchase new hardware, we showed that we could help them even before that investment by building a proof-of-concept environment. Let’s dive into how we optimized their system and ensured seamless scalability without disrupting their operations.
The Challenge: A Database Under Pressure
We recently worked with a client processing nearly 2 million data transactions per second—a staggering load for a conventional database system. Their existing setup followed a simple architecture: a producer application continuously pushing data directly into the database. This model works fine for smaller workloads, but under extreme traffic, it quickly became a liability.
The client acknowledged the problem but hesitated to move forward due to hardware constraints. Rather than waiting for an upgrade, we proposed a Proof of Concept (PoC) to showcase how they could optimize performance with their current resources.
Identifying Bottlenecks
Our first step was to analyze the system’s pain points. The primary culprit was the direct connection between the producer and the database—any slowdown in database performance immediately impacted the application, leading to cascading failures and potential data loss.
To solve this, we needed a solution that could:
- Decouple the producer from the database to relieve pressure on the database.
- Introduce scalability to handle traffic surges dynamically.
- Ensure high availability and fault tolerance without requiring drastic infrastructure changes.
The Solution: Moving Beyond Traditional Scaling
When tackling database performance issues, there are two primary scaling approaches:
- Vertical Scaling (Scaling Up): Adding more CPU, RAM, and faster storage. While effective, this approach has diminishing returns and can be costly.
- Horizontal Scaling (Scaling Out): Distributing workloads across multiple servers to improve performance and redundancy. This approach is more sustainable long-term.
Instead of relying solely on hardware upgrades, we focused on architectural improvements that would enhance scalability and resilience.
Introducing a Kafka-Based Architecture
The game-changer was Kafka—a distributed event streaming platform that acts as a buffer between the producer and the database. Here’s why this made a difference:
- Kafka absorbs traffic spikes without overwhelming the database.
- Data ingestion happens asynchronously, preventing application slowdowns.
- Consumers process data at a controlled rate, reducing the risk of database bottlenecks.
By introducing Kafka, we ensured that data ingestion remained smooth, regardless of fluctuations in incoming traffic.
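To make the pattern concrete, here is a minimal producer sketch using the confluent-kafka Python client. The broker addresses, topic name, and key choice are placeholders for the PoC environment, not the client’s actual configuration:

```python
from confluent_kafka import Producer

# Producer that buffers writes in Kafka instead of hitting the database directly.
# Broker list and topic name are placeholders.
producer = Producer({
    "bootstrap.servers": "kafka1:9092,kafka2:9092",
    "acks": "all",               # wait for all in-sync replicas before acknowledging
    "enable.idempotence": True,  # avoid duplicates when retries kick in
    "linger.ms": 5,              # small batching window to improve throughput
    "compression.type": "lz4",
})

def on_delivery(err, msg):
    if err is not None:
        # In a real deployment this would feed a retry queue or dead-letter path.
        print(f"delivery failed for key={msg.key()}: {err}")

def publish(entity_id: str, payload: bytes) -> None:
    # Keying by a stable identifier keeps related events in one partition,
    # preserving their relative order.
    producer.produce("transactions", key=entity_id, value=payload, on_delivery=on_delivery)
    producer.poll(0)  # serve delivery callbacks without blocking the caller

# On shutdown, drain anything still buffered:
# producer.flush()
```

Because the producer only has to reach Kafka, a slow or briefly unavailable database no longer stalls the application.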
Dynamic Consumer Scaling
To further optimize performance, we implemented dynamic consumer scaling, adjusting the number of consumers reading from Kafka based on real-time load (a minimal sketch follows the list below). This ensured:
- Efficient database writes without overwhelming the system.
- Automatic scaling based on demand, eliminating the need for manual intervention.
- A self-regulating system capable of handling extreme traffic fluctuations.
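As an illustration of the scaling logic, the sketch below estimates the consumer group’s total lag and derives a target consumer count. The helper names, thresholds, and per-consumer capacity are hypothetical; in practice the effective parallelism is also capped by the topic’s partition count.

```python
from confluent_kafka import Consumer, TopicPartition

def estimate_total_lag(bootstrap: str, group_id: str, topic: str) -> int:
    """Sum of (latest offset - committed offset) across all partitions."""
    consumer = Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": group_id,
        "enable.auto.commit": False,
    })
    try:
        metadata = consumer.list_topics(topic, timeout=10)
        partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]
        committed = consumer.committed(partitions, timeout=10)
        total = 0
        for tp in committed:
            low, high = consumer.get_watermark_offsets(tp, timeout=10)
            position = tp.offset if tp.offset >= 0 else low  # nothing committed yet
            total += max(high - position, 0)
        return total
    finally:
        consumer.close()

def desired_consumer_count(total_lag: int, per_consumer_capacity: int = 50_000,
                           minimum: int = 2, maximum: int = 12) -> int:
    # Scale roughly with the backlog; the hard ceiling in practice is the
    # partition count, since extra consumers beyond that sit idle.
    wanted = max(minimum, -(-total_lag // per_consumer_capacity))  # ceiling division
    return min(wanted, maximum)
```

The resulting number can feed whatever actually performs the scaling, whether that is a Kubernetes autoscaler, a process supervisor, or a simple scheduled job.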
💡 While this improved performance, it also introduced new operational complexities—trade-offs we had to manage carefully.
Challenges and Trade-offs
No architectural change comes without drawbacks. While our solution improved scalability and fault tolerance, it also introduced new complexities:
Increased Complexity
The addition of ProxySQL/HAProxy as a load balancer introduced another layer that required configuration, monitoring, and maintenance. Ensuring proper failover behavior and query routing demanded additional operational expertise.
Replication Lag
For reporting queries, read replicas introduced the risk of slightly stale data, as they naturally lag behind the master. In most cases, this delay was negligible, but for clients requiring real-time analytics, additional strategies (e.g., synchronous replication) were needed.
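For context, replication delay can be measured directly on a streaming replica. A minimal check, with placeholder connection details, might look like this:

```python
import psycopg2

# Quick lag check, run against a streaming replica. Connection details are
# placeholders; pg_last_xact_replay_timestamp() returns NULL on a primary.
with psycopg2.connect(host="replica1", dbname="appdb", user="monitor") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;")
        print("replica lag:", cur.fetchone()[0])
```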
Kafka Partitioning Challenges
Kafka’s scalability is driven by partitioning, but improper partitioning strategies can create uneven consumer loads or ordering issues for related events. Careful planning was essential to maintain the balance between throughput and message consistency.
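One simple way to spot skew is to compare per-partition message counts via the watermark offsets. The diagnostic below is a rough sketch (broker address and topic name are placeholders, and the counts only reflect currently retained messages):

```python
from confluent_kafka import Consumer, TopicPartition

# Rough skew check: a strongly uneven distribution usually means the partition
# key has too few distinct values, which leaves some consumers overloaded.
consumer = Consumer({"bootstrap.servers": "kafka1:9092", "group.id": "skew-check"})
metadata = consumer.list_topics("transactions", timeout=10)
for pid in sorted(metadata.topics["transactions"].partitions):
    low, high = consumer.get_watermark_offsets(TopicPartition("transactions", pid), timeout=10)
    print(f"partition {pid}: ~{high - low} messages retained")
consumer.close()
```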
Database Write Load
While Kafka helped smooth out traffic spikes, aggressive consumer writes after a backlog could overwhelm the master database. We implemented rate limiting and throttling to prevent sudden bursts of database writes from causing performance degradation.
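The sketch below shows the general shape of such a throttled consumer: it drains Kafka in batches but enforces a ceiling on writes per second so a backlog cannot flood the primary. The batch writer, topic name, and rate figures are placeholders:

```python
import time
from confluent_kafka import Consumer

MAX_WRITES_PER_SEC = 5_000
BATCH_SIZE = 500

def write_batch(messages) -> None:
    """Placeholder for a bulk INSERT/COPY into PostgreSQL."""

consumer = Consumer({
    "bootstrap.servers": "kafka1:9092",
    "group.id": "db-writer",
    "enable.auto.commit": False,
})
consumer.subscribe(["transactions"])

while True:
    batch = consumer.consume(num_messages=BATCH_SIZE, timeout=1.0)
    if not batch:
        continue
    started = time.monotonic()
    write_batch([m for m in batch if m.error() is None])
    consumer.commit(asynchronous=False)  # commit only after a successful write
    # Stretch each batch to at least len(batch) / MAX_WRITES_PER_SEC seconds.
    minimum_duration = len(batch) / MAX_WRITES_PER_SEC
    elapsed = time.monotonic() - started
    if elapsed < minimum_duration:
        time.sleep(minimum_duration - elapsed)
```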
Operational Overhead
Managing a distributed setup with Kafka, ProxySQL/HAProxy, and a Patroni-managed PostgreSQL cluster required skilled personnel. While the benefits outweighed the challenges, it was clear that ongoing maintenance and monitoring would be necessary to sustain peak performance.
Ensuring High Availability
Performance improvements are meaningless if the system isn’t reliable. To address this, we integrated HAProxy and Patroni to guarantee high availability:
- HAProxy acts as a load balancer, ensuring seamless failover.
- Patroni manages automatic failovers, promoting a new master if the primary node fails.
Together, these components provided a robust failover mechanism, preventing downtime and ensuring business continuity.
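For illustration, the routing logic boils down to an HAProxy backend whose health check asks each node’s Patroni REST API whether it is currently the leader. The snippet below is a minimal sketch of that well-known pattern; host names, ports, and the exact endpoint (/primary, or /master on older Patroni versions) depend on the environment:

```
# Minimal haproxy.cfg sketch: route writes only to the node whose Patroni
# REST API reports it as the current leader. Host names and ports are placeholders.
listen postgres_primary
    bind *:5000
    mode tcp
    option httpchk GET /primary          # /master on older Patroni versions
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node1 pg-node1:5432 maxconn 100 check port 8008
    server pg-node2 pg-node2:5432 maxconn 100 check port 8008
    server pg-node3 pg-node3:5432 maxconn 100 check port 8008
```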
Kafka’s Built-in Fault Tolerance
One of Kafka’s strongest advantages is its built-in data durability. Even if all database servers went down simultaneously, the data would remain safe in Kafka. Once the system recovered, database writes would resume without any data loss—ensuring complete reliability even in extreme failure scenarios.
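That durability comes from topic-level replication and acknowledgement settings rather than anything exotic. Here is a hedged example of creating such a topic with the confluent-kafka admin client; the broker address, topic name, partition count, and retention are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Replication plus min.insync.replicas (together with the producer's acks=all)
# provide the durability described above.
admin = AdminClient({"bootstrap.servers": "kafka1:9092"})

topic = NewTopic(
    "transactions",
    num_partitions=12,        # also the ceiling for consumer parallelism
    replication_factor=3,     # copies survive the loss of a broker
    config={
        "min.insync.replicas": "2",                    # acks=all still succeeds with one broker down
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep a week of data for replay
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if creation failed
    print(f"created topic {name}")
```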
The Results: Stability, Scalability, and No Data Loss
After implementing our solution, we deployed the system on virtual machines and conducted stress tests. The results:
- Seamless handling of high traffic with zero data loss.
- Automatic load balancing and failover management, eliminating single points of failure.
- Minimal added latency (milliseconds at most), a small trade-off for system stability.
This project reaffirmed a fundamental truth: High-traffic databases don’t have to rely solely on hardware upgrades. With the right architecture, even extreme workloads can be managed efficiently.
Here’s a look at the architecture we built for the Proof of Concept (PoC). Before rolling anything out to the client’s system, we set up and tested this in our own virtual machines to make sure it worked as expected. The design brings together Kafka for handling spikes in traffic, consumer services for controlled processing, and a Patroni-managed PostgreSQL cluster with HAProxy for load balancing and high availability. This setup proved to be both scalable and resilient, ensuring smooth operations even under heavy load.
Final Thoughts
Database scalability isn’t just about adding more resources—it’s about making smarter architectural choices. By leveraging Kafka, dynamic consumer scaling, and high-availability mechanisms, we helped our client achieve stability without a major infrastructure overhaul.
However, no solution is perfect. The additional complexity and operational overhead require proper planning and ongoing monitoring. The key takeaway? If your database is struggling under pressure, the solution isn’t always bigger hardware—sometimes, it’s just better architecture.
Thanks for reading! Having trouble with your database? Don’t hesitate to contact us via our contact page.