
Kafka Disaster Recovery: Active-Passive vs Active-Active Architectures

OSO Engineering
The team behind OSO Kafka Backup

Kafka disaster recovery is the practice of keeping a second, independent copy of your streaming data and cluster metadata so you can resume service after a region outage, a bad deploy, or human error. Kafka's built-in replication — replication factor 3 with min.insync.replicas=2 — survives broker failure, but it does not survive a lost region or a deleted topic. The right DR design depends on three numbers: your recovery time objective (RTO), your recovery point objective (RPO), and your budget. This guide compares active-passive and active-active architectures and gives you a framework to choose.

Key takeaway

Start with active-passive. It covers most production workloads at a fraction of the cost, with RTO in minutes and a small, bounded RPO. Reserve active-active for Tier 1 systems that cannot tolerate any downtime. The DR plan you test quarterly is the only one that works when a region actually fails.

What is Kafka disaster recovery and why does it matter?

Replication and disaster recovery answer different questions. Replication keeps a partition available when a broker dies. Disaster recovery gets your whole platform back after an event that takes the cluster with it.

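With replication factor 3 and min.insync.replicas=2, a partition keeps accepting acks=all writes with one broker down, and keeps its data with two down as long as the surviving replica was in sync. It does nothing for these: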

  • Region outage. When a cloud region goes dark, or a single availability zone if the whole cluster lives in one, every replica goes with it. All three copies are in the blast radius.
  • Logical corruption. A producer bug writes malformed events for hours. Replication copies the bad records to every replica in milliseconds.
  • Accidental deletion. An operator deletes the wrong topic. The delete propagates to all replicas immediately and permanently.
  • Compliance mandates. Financial and healthcare regulators often require a recoverable copy in a separate failure domain, regardless of uptime.

This is the distinction to write on the wall: replication is availability; disaster recovery is recoverability. For a fuller treatment of why one cannot stand in for the other, see Kafka backup strategies and the disaster recovery use cases in the docs.
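To make the availability half concrete, here is a small arithmetic sketch of what replication settings buy you. It assumes producers use acks=all and that unclean leader election is disabled; the function name and result keys are illustrative, not part of any Kafka API.

```python
def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> dict:
    """Broker losses one partition can absorb, assuming acks=all producers."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return {
        # Writes succeed while at least min.insync.replicas replicas are in sync.
        "writes_stay_available": replication_factor - min_insync_replicas,
        # An acknowledged write is on at least min.insync.replicas replicas.
        "acked_data_guaranteed": min_insync_replicas - 1,
        # If every replica was in sync, one survivor is enough to keep the data.
        "acked_data_best_case": replication_factor - 1,
    }


result = tolerated_broker_failures(replication_factor=3, min_insync_replicas=2)
print(result)
```

None of these numbers changes when the failure is a region, a bad producer, or a deleted topic, because every replica shares that fate.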

The cost of getting this wrong scales with how much your systems depend on the stream. For a batch analytics pipeline, an hour of lost events is an inconvenience. For real-time payments or event-driven microservices, the same hour can mean failed transactions and broken downstream state.

Active-passive Kafka DR — when simplicity wins

In an active-passive architecture, one cluster serves all production traffic. A standby cluster in a separate region receives a continuous copy of the data and sits idle until you need it.

The failover sequence is straightforward:

  1. Detect the failure through health checks on the primary.
  2. Promote the standby by pointing producers and consumers at it.
  3. Restore or replay any data that had not yet reached the standby.
  4. Reset consumer group offsets so applications resume at the right position.
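The four steps above can be sketched as a small orchestration script. This is a minimal illustration, not a production failover controller: the step callables are hypothetical placeholders you would replace with your own promotion, restore, and offset-reset logic, and the three-strike detection threshold is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FailureDetector:
    """Declares the primary down only after several consecutive failed checks."""
    threshold: int = 3
    consecutive_failures: int = 0

    def observe(self, healthy: bool) -> bool:
        # A single healthy check resets the count, which filters out blips.
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        return self.consecutive_failures >= self.threshold


def run_failover(steps: List[Tuple[str, Callable[[], None]]],
                 clock: Callable[[], float] = time.monotonic) -> List[Tuple[str, float]]:
    """Run the steps in order and return (step name, seconds taken) for each."""
    timings = []
    for name, action in steps:
        started = clock()
        action()  # an exception stops the failover at the failing step
        timings.append((name, clock() - started))
    return timings


if __name__ == "__main__":
    # Hypothetical no-op steps standing in for real runbook actions.
    steps = [
        ("promote standby", lambda: None),
        ("restore missing data", lambda: None),
        ("reset consumer offsets", lambda: None),
    ]
    for name, seconds in run_failover(steps):
        print(f"{name}: {seconds:.3f}s")
```

Recording a duration per step is deliberate: the sum is your measured RTO, and the breakdown shows which step to automate next.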

Pros. Lower cost — the standby does no production work and can run smaller until failover. Simpler operations. A single, well-understood recovery path that a runbook can describe end to end.

Cons. RTO is measured in minutes, not seconds, because promotion and redirection take time. RPO is greater than zero: any data written after the last sync but before the outage can be lost.
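The RPO statement above is simple arithmetic, sketched here. The worst-case formula is a simplification that assumes syncs start on a fixed interval and each one takes a known time to finish; both function names are illustrative.

```python
from datetime import datetime, timedelta


def data_loss_window(last_sync_completed: datetime, outage_started: datetime) -> timedelta:
    """Writes accepted inside this window never reached the standby."""
    if outage_started <= last_sync_completed:
        return timedelta(0)
    return outage_started - last_sync_completed


def worst_case_rpo(sync_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """A record written just after one sync starts is only safe once the next one finishes."""
    return sync_interval + sync_duration


print(data_loss_window(datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 45)))
print(worst_case_rpo(timedelta(minutes=15), timedelta(minutes=2)))
```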

Best for. Most production workloads, compliance-driven DR, and cost-conscious teams. If you are building your first Kafka DR plan, start here.

You can back the standby copy with continuous backup to object storage rather than a hot second cluster. This is the cheapest form of active-passive: you pay for storage, not for idle brokers. A minimal continuous backup config looks like this:

mode: backup
backup_id: "dr-primary-continuous"

source:
  bootstrap_servers:
    - broker-1.prod.internal:9092
    - broker-2.prod.internal:9092
  topics:
    include:
      - "orders.*"
      - "payments.*"

storage:
  backend: s3
  bucket: kafka-dr-backups
  region: us-west-2

backup:
  continuous: true
  compression: zstd
  compression_level: 3
  include_offset_headers: true # required for consumer offset reset
  consumer_group_snapshot: true # capture group offsets each cycle

Run it as a long-lived process next to the cluster:

kafka-backup backup --config dr-primary-continuous.yaml

Because include_offset_headers and consumer_group_snapshot are on, the backup preserves the original offsets and each consumer group's committed positions. When you restore into the DR cluster, applications resume where they left off instead of reprocessing from the beginning.
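To see why the preserved offsets matter, consider a restored topic whose records start again at offset 0 while the consumer group's committed position still refers to the original numbering. The sketch below illustrates the idea only. It is not how the tool implements offset reset, and the record layout is a hypothetical simplification.

```python
from typing import List, Optional, Tuple

# (offset in the restored topic, original offset carried in a record header)
RestoredRecord = Tuple[int, int]


def resume_position(restored: List[RestoredRecord],
                    committed_original_offset: int) -> Optional[int]:
    """Return the restored-topic offset of the first record the group has not consumed.

    A committed Kafka offset is the next offset to read, so the first record
    whose original offset is at or past it is where the group should resume.
    Assumes `restored` is sorted by restored offset.
    """
    for restored_offset, original_offset in restored:
        if original_offset >= committed_original_offset:
            return restored_offset
    return None  # the group had already consumed everything in the backup


# Original offsets 100..105 were restored as 0..5; the group had committed 103.
records = [(i, 100 + i) for i in range(6)]
print(resume_position(records, committed_original_offset=103))
```

Without the original offsets there is nothing to map the committed position onto, and the only safe fallback is to reprocess from the start.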

Active-active Kafka DR — when downtime is not an option

In an active-active architecture, two or more clusters serve production traffic at the same time. Bidirectional replication keeps them in sync, so a client can be redirected to a surviving cluster with little or no interruption.

This design removes the promotion step. There is no idle standby to wake up because every cluster is already live. The trade-off is that you now own two hard problems: offset synchronization across clusters and conflict resolution when the same logical entity is written in two places.

Pros. Near-zero RTO — clients fail over to an already-running cluster. No single point of failure. Traffic and reads can be distributed geographically to cut latency.

Cons. Higher cost, since every cluster is sized for production. More operational complexity. A real risk of data conflicts and duplicate processing unless your consumers are idempotent and your keys are partitioned to avoid cross-cluster writes to the same entity.
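Both mitigations named above can be sketched in a few lines. This is an illustrative sketch under assumptions: the cluster names are made up, the in-memory set stands in for a durable deduplication store, and the stable-hash routing is one possible way to give each entity a single writer cluster.

```python
import hashlib
from typing import Callable, List, Set


def owning_cluster(entity_key: str, clusters: List[str]) -> str:
    """Map each entity to exactly one writer cluster using a stable hash."""
    digest = hashlib.sha256(entity_key.encode("utf-8")).digest()
    return clusters[int.from_bytes(digest[:8], "big") % len(clusters)]


class IdempotentHandler:
    """Wraps a handler so that a redelivered event is applied only once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # a real system would use a durable store

    def handle(self, event_id: str, payload: dict) -> bool:
        if event_id in self._seen:
            return False  # duplicate, for example replayed after a failover
        self._handler(payload)
        self._seen.add(event_id)
        return True


applied = []
handler = IdempotentHandler(applied.append)
handler.handle("evt-1", {"order": 1})
handler.handle("evt-1", {"order": 1})  # ignored on redelivery
print(len(applied), owning_cluster("customer-42", ["us-east", "eu-west"]))
```

When a cluster is lost, ownership of its keys has to move to a survivor, so the routing table needs to be part of the failover runbook.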

Best for. Financial services, real-time bidding, and mission-critical event processing where a minute of downtime is unacceptable.

Active-active is a replication problem first. The same cross-cluster patterns used for geographic distribution apply directly — see Kafka replication across data centers for the offset-translation mechanics that make failover clean.
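Offset translation is easier to reason about with a toy model. The sketch below assumes you have periodic (upstream offset, downstream offset) sync points for a partition and translates a committed position conservatively. It is a simplified illustration of the idea, not MirrorMaker 2's actual algorithm or API.

```python
import bisect
from typing import List, Optional, Tuple

# (upstream offset, downstream offset) pairs, sorted by upstream offset
OffsetSync = Tuple[int, int]


def translate_offset(syncs: List[OffsetSync], committed_upstream: int) -> Optional[int]:
    """Translate a committed upstream offset to a safe downstream offset.

    Uses the nearest sync point at or before the committed position, so a
    consumer may re-read some records after failover but never skips any.
    """
    upstream_offsets = [upstream for upstream, _ in syncs]
    index = bisect.bisect_right(upstream_offsets, committed_upstream) - 1
    if index < 0:
        return None  # no sync point yet; fall back to the earliest offset
    return syncs[index][1]


syncs = [(0, 0), (100, 80), (200, 170)]
print(translate_offset(syncs, committed_upstream=150))
```

The re-read window is why idempotent consumers matter even when offsets are translated: translation bounds duplicates, it does not remove them.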

Active-passive vs active-active at a glance

| Dimension | Active-Passive | Active-Active |
| --- | --- | --- |
| Typical RTO | 5–30 minutes | Seconds |
| Typical RPO | Seconds to minutes | Near zero |
| Relative cost | Lower (idle or backup-only standby) | Higher (all clusters production-sized) |
| Operational complexity | Low | High |
| Offset handling | Reset on promotion | Continuous cross-cluster sync |
| Conflict risk | None (single writer) | Present (needs idempotency) |
| Best fit | Most workloads, compliance DR | Tier 1, zero-downtime systems |

How to design your Kafka RTO/RPO strategy

Do not pick an architecture first. Pick your RTO and RPO targets per workload, then let those numbers choose the architecture. Classify each topic or application by how much downtime and data loss the business can absorb.

| Tier | Example workloads | RPO target | RTO target | Recommended architecture |
| --- | --- | --- | --- | --- |
| Tier 1 | Payments, order capture, fraud checks | < 1 min | < 5 min | Active-active or hot active-passive |
| Tier 2 | Notifications, user activity streams | < 15 min | < 30 min | Active-passive with continuous backup |
| Tier 3 | Analytics, batch ETL feeds | < 1 hour | < 4 hours | Scheduled backup to object storage |
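The tier table can be turned into a lookup that picks the cheapest tier still meeting what the business will tolerate. The thresholds come from the table above; the function and class names are illustrative, and treating each target as an inclusive bound is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_target_minutes: float
    rto_target_minutes: float
    architecture: str


# Ordered from tightest (most expensive) to loosest (cheapest).
TIERS: List[Tier] = [
    Tier("Tier 1", 1, 5, "Active-active or hot active-passive"),
    Tier("Tier 2", 15, 30, "Active-passive with continuous backup"),
    Tier("Tier 3", 60, 240, "Scheduled backup to object storage"),
]


def classify(tolerated_rpo_minutes: float, tolerated_rto_minutes: float) -> Tier:
    """Return the cheapest tier whose targets fit inside the tolerated loss and downtime."""
    for tier in reversed(TIERS):
        if (tier.rpo_target_minutes <= tolerated_rpo_minutes
                and tier.rto_target_minutes <= tolerated_rto_minutes):
            return tier
    raise ValueError("tolerances are tighter than Tier 1; no tier in the table fits")


print(classify(tolerated_rpo_minutes=20, tolerated_rto_minutes=240).name)
```

Searching from the cheapest tier upward encodes the rule of only paying for the tier the business actually needs.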

Two rules keep this honest. First, RTO and RPO are business decisions, not engineering preferences — get sign-off from the teams that own the workload. Second, tighter targets cost more, and the curve is steep. Moving a workload from Tier 2 to Tier 1 can double its infrastructure footprint. Only pay for the tier the business actually needs.

Map the tiers to spend deliberately. Active-passive with backup to S3 is the default for Tier 2 and Tier 3, and the storage bill stays modest because compression (Zstd or LZ4) shrinks the archive and you pay for object storage rather than idle brokers. Active-active earns its cost only where a Tier 1 target leaves no other option.
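A back-of-the-envelope estimate helps when sizing the backup bucket. The sketch below is arithmetic only: the ingest rate, retention, and compression ratio in the example are placeholder inputs you should replace with measurements from your own cluster, since real compression ratios depend heavily on the payload.

```python
def archive_size_gb(ingest_mb_per_second: float,
                    retention_days: float,
                    compression_ratio: float) -> float:
    """Estimate the compressed backup size for a steady ingest rate."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    seconds_per_day = 86_400
    raw_mb = ingest_mb_per_second * seconds_per_day * retention_days
    return raw_mb / compression_ratio / 1024


# Placeholder inputs: 2 MB/s ingest, 30 days kept, 4x compression.
print(f"{archive_size_gb(2, 30, 4):.1f} GB")
```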

Implementing Kafka DR — tools and patterns

Several tools move data between clusters or regions. They are not interchangeable; each fits a different point on the cost-and-complexity curve.

  • MirrorMaker 2. The open-source standard for cross-cluster replication. It preserves consumer offsets through offset translation, which is what makes a clean failover possible. Good for active-passive and active-active alike.
  • MSK Replicator. AWS-native cross-region replication for managed MSK clusters. It removes the operational burden of running MirrorMaker yourself, at the price of AWS lock-in.
  • Cluster Linking. Confluent's managed, byte-for-byte replication feature. Low latency, low operational overhead, and offsets are preserved without translation, at the price of a commercial platform.
  • Backup to object storage. Continuous or scheduled backup to S3, GCS, or Azure Blob. This is the DR pattern that also protects against corruption and deletion, because the copy is isolated from the live cluster.

The four are not mutually exclusive. A common production shape is replication for fast failover plus independent backup for point-in-time recovery. Replication gives you a warm cluster; backup gives you a clean copy from before the incident. The alternatives comparison walks through where each tool stops short.
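The division of labor between replication and backup can be written down as a decision rule for the runbook. This is a sketch of the reasoning in this section, with illustrative names, not an exhaustive incident taxonomy.

```python
from enum import Enum


class Incident(Enum):
    BROKER_FAILURE = "broker failure"
    REGION_OUTAGE = "region outage"
    LOGICAL_CORRUPTION = "logical corruption"
    ACCIDENTAL_DELETION = "accidental deletion"


def recovery_path(incident: Incident, has_replica_cluster: bool) -> str:
    """Pick the recovery source for an incident type."""
    if incident is Incident.BROKER_FAILURE:
        return "in-cluster replication"
    if incident is Incident.REGION_OUTAGE and has_replica_cluster:
        return "fail over to replica cluster"
    # Corruption and deletion are replicated too, so only the isolated
    # backup still holds a clean copy from before the incident.
    return "restore from backup"


for incident in Incident:
    print(incident.value, "->", recovery_path(incident, has_replica_cluster=True))
```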

Restoring into the DR cluster

When you fail over, restore the backed-up data into the DR cluster and let the offset headers put consumers back in place. Validate the restore config before you run it — DR is the worst time to discover a typo:

# Dry-run: check the restore config without touching the cluster
kafka-backup validate-restore --config dr-restore.yaml

# Execute the restore into the DR cluster
kafka-backup restore --config dr-restore.yaml
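In an automated runbook you may want to chain the two commands so the restore never runs against a config that failed validation. The wrapper below is a sketch that assumes the CLI signals failure with a non-zero exit code, which you should confirm for your version; the function names are illustrative.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], int]


def run_cli(args: List[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(args).returncode


def validated_restore(config_path: str, run: Runner = run_cli) -> bool:
    """Validate the restore config, then restore only if validation passed."""
    validate = ["kafka-backup", "validate-restore", "--config", config_path]
    restore = ["kafka-backup", "restore", "--config", config_path]
    # Assumption: a non-zero exit code means the step failed.
    if run(validate) != 0:
        return False
    return run(restore) == 0
```

Passing the runner as a parameter lets you rehearse the control flow in a drill or unit test without touching a cluster.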

A restore config points at the DR cluster as the target and reads from the same bucket the primary wrote to:

mode: restore
backup_id: "dr-primary-continuous"

target:
  bootstrap_servers:
    - broker-1.dr.internal:9092
    - broker-2.dr.internal:9092

storage:
  backend: s3
  bucket: kafka-dr-backups
  region: us-west-2

Because the backup captured offset headers and consumer group snapshots, applications reconnect to the DR cluster and resume from their last committed position. That single detail is the difference between a clean failover and a day of reprocessing. For the mechanics of preserving and resetting offsets, see backup and restore of Kafka topics.

Your Kafka DR checklist

A DR plan is only real once it is written down and tested. Work through this before you call your setup production-ready:

  • Define RTO and RPO per workload tier, with business sign-off.
  • Choose a DR architecture per tier: active-passive, active-active, or hybrid.
  • Set up cross-cluster replication or continuous backup with offset preservation.
  • Automate failure detection and cluster promotion or client redirection.
  • Test DR quarterly with a full failover drill, not a tabletop review.
  • Keep recovery runbooks in a location that survives the outage — not only in the cluster you are recovering.
  • Measure actual RTO and RPO during each drill and compare to target.

The last two matter most. Runbooks stored in a wiki that depends on the failed region are worthless during the event. And an untested RTO is a guess — the first real failover almost always runs slower than the plan on paper.
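Comparing measured numbers to targets is easy to automate as part of each drill. The record below is a minimal sketch with illustrative field names; the example values are placeholders, not measurements.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class DrillResult:
    workload: str
    target_rto: timedelta
    actual_rto: timedelta
    target_rpo: timedelta
    actual_rpo: timedelta

    def misses(self) -> List[str]:
        """List each objective the drill failed to meet."""
        missed = []
        if self.actual_rto > self.target_rto:
            missed.append(f"RTO over target by {self.actual_rto - self.target_rto}")
        if self.actual_rpo > self.target_rpo:
            missed.append(f"RPO over target by {self.actual_rpo - self.target_rpo}")
        return missed


# Placeholder drill: a Tier 2 workload that restored slower than planned.
drill = DrillResult(
    workload="notifications",
    target_rto=timedelta(minutes=30),
    actual_rto=timedelta(minutes=42),
    target_rpo=timedelta(minutes=15),
    actual_rpo=timedelta(minutes=4),
)
print(drill.misses() or "all objectives met")
```

Keeping one record per drill gives you a trend line, which is more useful than any single result.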

Conclusion

Kafka disaster recovery is not one architecture. It is a match between what the business can tolerate and what you are willing to spend. Active-passive covers the large majority of workloads with a simple, cheap, well-understood recovery path. Active-active buys near-zero downtime for the few Tier 1 systems that justify its cost and complexity.

Start with active-passive backed by continuous backup to object storage. Layer replication on top for the workloads that need faster failover. Then test the whole thing on a schedule, because the DR plan you rehearse is the only one that holds when a region actually goes down. When you are ready to build the backup half, the getting started guide and the config reference have the exact settings.

Frequently asked questions

What is the best disaster recovery strategy for Kafka?

For most workloads, active-passive replication or continuous backup to object storage in a separate region is the best strategy. It gives you an RTO measured in minutes and a small, bounded RPO at low cost. Reserve active-active for Tier 1 systems that cannot tolerate any downtime, since it costs more and adds operational complexity.

What is the difference between active-passive and active-active Kafka DR?

Active-passive runs one production cluster and a standby that stays idle until failover, giving an RTO in minutes and lower cost. Active-active runs multiple clusters serving traffic at once with bidirectional replication, giving near-zero RTO but higher cost and the added problems of offset synchronization and conflict resolution.

What are realistic RTO and RPO targets for Kafka?

Set targets by workload tier. Tier 1 systems such as payments typically need an RPO under one minute and an RTO under five minutes. Tier 2 systems can accept an RPO under fifteen minutes and an RTO under thirty minutes. Tier 3 batch workloads often tolerate an RPO of up to an hour and an RTO of up to four hours.

How do you test Kafka disaster recovery?

Run a full failover drill at least quarterly. Promote the standby or redirect clients, restore backed-up data into the DR cluster, verify consumer groups resume from the correct offsets, and measure the actual RTO and RPO against your targets. A tabletop review is not enough — only a real failover exposes gaps in the runbook.

Does Kafka replication provide disaster recovery?

No. Replication provides availability by keeping partitions online when brokers fail, but it faithfully copies deletions and corrupt records to every replica and cannot survive a full region outage. Disaster recovery requires an independent copy in a separate failure domain, which is why replication and backup are complementary, not interchangeable.