Kafka Geo Replication: Multi-Region and Cross-Datacenter Patterns
Kafka geo replication copies topics between clusters in different regions or datacenters, so a regional outage does not take your streaming platform with it. In-cluster replication (RF=3) protects against broker loss inside one failure domain; geo replication protects against losing the domain itself. This guide compares the four patterns, shows what MirrorMaker 2 setup looks like over a WAN, and covers the failure mode replication cannot solve.
Start with active-passive between two regions. Measure real replication lag before promising an RPO, budget for cross-region transfer costs, and pair replication with point-in-time backups — replication propagates mistakes as faithfully as it propagates good data.
What geo replication is (and is not)
Setting replication.factor=3 puts three copies of each partition on three brokers, all in the same cluster. Rack awareness can spread those replicas across availability zones, but the cluster is still one blast radius: one control plane, one region, one set of humans with admin rights.
Geo replication runs a second (or third) Kafka cluster elsewhere and copies topics between them, cluster to cluster. The use cases:
- Disaster recovery — survive a region failure with a warm standby
- Data locality — serve consumers from the nearest region
- Compliance — keep regional data in-region while sharing what is allowed
- Migration — move workloads between datacenters or clouds without a big bang (see the migration use cases)
The metrics that govern every design below: end-to-end replication lag, cross-region latency, and network transfer cost.
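As a back-of-envelope for the third metric, the sketch below estimates monthly cross-region transfer volume and cost from average produce throughput. The compression ratio and the per-GB price are placeholder assumptions, not quotes from any provider, and the function names are hypothetical; substitute your own measurements.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an approximation


def monthly_transfer_gb(avg_mb_per_s: float, compression_ratio: float = 1.0) -> float:
    """GB sent over the inter-region link per month.

    compression_ratio is uncompressed size / compressed size (4.0 means 4:1).
    """
    if avg_mb_per_s < 0 or compression_ratio <= 0:
        raise ValueError("throughput must be >= 0 and compression_ratio > 0")
    wire_mb_per_s = avg_mb_per_s / compression_ratio
    return wire_mb_per_s * SECONDS_PER_MONTH / 1000  # decimal MB -> GB


def monthly_transfer_cost(avg_mb_per_s: float, price_per_gb: float,
                          compression_ratio: float = 1.0) -> float:
    """Monthly transfer cost in the currency of price_per_gb."""
    return monthly_transfer_gb(avg_mb_per_s, compression_ratio) * price_per_gb


if __name__ == "__main__":
    # Assumed inputs: 20 MB/s average produce rate, 4:1 compression, 0.02 per GB.
    gb = monthly_transfer_gb(20, compression_ratio=4)
    cost = monthly_transfer_cost(20, price_per_gb=0.02, compression_ratio=4)
    print(f"{gb:,.0f} GB/month, about {cost:,.0f} per month")
```

Run the same arithmetic with peak rather than average throughput when sizing the link itself; cost follows the average, capacity follows the peak.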
The four replication patterns
| Pattern | RTO | Cost | Complexity | Best for |
|---|---|---|---|---|
| Active-passive | Minutes | 2× infra | Low | DR for a single primary region |
| Active-active | Near-zero | 2× infra + conflict handling | High | Regional serving with failover both ways |
| Hub-and-spoke | Varies by spoke | Hub + N spokes | Medium | Central aggregation, regional distribution |
| Mesh | Near-zero | N× everything | Very high | Few orgs genuinely need this |
Active-passive is the honest default. One cluster serves traffic; a standby in another region receives a continuous copy. Failover means repointing clients — the hard part is offset translation, not data movement.
Active-active lets both regions produce and consume. It puts standby capacity to work, although each region still needs headroom to absorb the other's traffic on failover, and it introduces bidirectional flows, loop prevention, and topic naming discipline. Choose it when both regions must serve writes, not because idle standby feels wasteful.
Hub-and-spoke fits aggregation topologies: regional clusters replicate into a central hub for analytics, or a hub fans reference data out to the edges.
Mesh — everyone replicates to everyone — multiplies links, monitoring, and failure modes quadratically. It is listed here mostly so you can decline it deliberately.
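To make the growth concrete, here is a small sketch that counts the one-way replication flows each pattern implies. It assumes active-passive and active-active involve exactly two clusters and that hub-and-spoke uses one flow per spoke; the function name is hypothetical.

```python
def replication_flows(pattern: str, clusters: int = 2) -> int:
    """Number of one-way replication flows to deploy and monitor."""
    if clusters < 2:
        raise ValueError("geo replication needs at least two clusters")
    if pattern == "active-passive":
        return 1                          # primary -> standby
    if pattern == "active-active":
        return 2                          # one flow in each direction
    if pattern == "hub-and-spoke":
        return clusters - 1               # one flow per spoke (assumption)
    if pattern == "mesh":
        return clusters * (clusters - 1)  # every ordered pair of clusters
    raise ValueError(f"unknown pattern: {pattern}")


if __name__ == "__main__":
    for n in (2, 3, 5, 8):
        hub = replication_flows("hub-and-spoke", n)
        mesh = replication_flows("mesh", n)
        print(f"{n} clusters: hub-and-spoke={hub} flows, mesh={mesh} flows")
```

Each flow is something to configure, monitor for lag, and reason about during an incident, which is why the mesh count matters more than the raw cluster count.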
Implementing geo replication with MirrorMaker 2
MirrorMaker 2 (MM2) ships with Apache Kafka and runs on the Connect framework. A minimal active-passive setup:
clusters = primary, dr
primary.bootstrap.servers = kafka-us-east.example.com:9092
dr.bootstrap.servers = kafka-us-west.example.com:9092
# Replicate everything except internals from primary to DR
primary->dr.enabled = true
primary->dr.topics = .*
# Consumer group offset sync, so consumers can fail over
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
sync.group.offsets.interval.seconds = 60
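A typo in a cluster alias is easy to miss in a file like this. The sketch below is a pre-deployment sanity check written for illustration: it is not part of Kafka, the function names are hypothetical, and it assumes cluster aliases contain no dots. MM2 performs its own validation at startup.

```python
EXAMPLE = """
clusters = primary, dr
primary.bootstrap.servers = kafka-us-east.example.com:9092
dr.bootstrap.servers = kafka-us-west.example.com:9092
# Replicate everything except internals from primary to DR
primary->dr.enabled = true
primary->dr.topics = .*
"""


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple `key = value` lines, ignoring blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key = value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def check_mm2(props: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    clusters = [c.strip() for c in props.get("clusters", "").split(",") if c.strip()]
    if len(clusters) < 2:
        problems.append("clusters must name at least two aliases")
    for alias in clusters:
        if f"{alias}.bootstrap.servers" not in props:
            problems.append(f"missing {alias}.bootstrap.servers")
    for key in props:
        if "->" in key:
            flow = key.split(".", 1)[0]  # e.g. "primary->dr"
            source, _, target = flow.partition("->")
            for alias in (source, target):
                if alias not in clusters:
                    problems.append(f"{key}: unknown cluster alias {alias!r}")
    return problems


if __name__ == "__main__":
    print(check_mm2(parse_properties(EXAMPLE)) or "basic checks passed")
```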
Three things bite teams on real WAN links:
- Remote topic prefixes. MM2 replicates the topic orders as primary.orders on the DR cluster by default. Consumers failing over must subscribe accordingly, or you must override the replication policy.
- Offset translation. Offsets differ between source and target. MM2's checkpoints translate consumer group positions; verify translated offsets in a drill before an outage forces the issue.
- WAN tuning. Raise producer batch.size and linger.ms on the connectors, enable compression, and monitor end-to-end lag (record timestamp delta at the consumer), not just Connect task lag.
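The default naming in the first point can be sketched in a few lines. The helper names are hypothetical, and the choice to read both the local and the replicated topic after failover is an assumption about how your producers behave once they move to the DR cluster, not something MM2 decides for you.

```python
DEFAULT_SEPARATOR = "."  # MM2's default separator between source alias and topic


def remote_topic(source_alias: str, topic: str,
                 separator: str = DEFAULT_SEPARATOR) -> str:
    """Name a replicated topic gets on the target under the default policy."""
    return f"{source_alias}{separator}{topic}"


def failover_topics(topics: list[str], source_alias: str,
                    separator: str = DEFAULT_SEPARATOR) -> list[str]:
    """Topics a consumer reads on the DR cluster after failover.

    Assumption: producers that fail over write to the unprefixed topic on DR,
    while history replicated from the primary sits under the prefixed name.
    """
    result = []
    for topic in topics:
        result.append(topic)
        result.append(remote_topic(source_alias, topic, separator))
    return result


if __name__ == "__main__":
    print(failover_topics(["orders", "payments"], "primary"))
```

If you override the replication policy so that topic names are preserved, the prefixed entries disappear, but you then have to prevent replication loops some other way in bidirectional setups.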
Our MirrorMaker 2 comparison covers where MM2 shines and where it stops.
A reference architecture that holds up
The pattern we see work repeatedly for cross-datacenter DR:
- Two regions, active-passive, MM2 running in the target region (pull model — the DR site keeps working if the primary degrades)
- Dedicated replication bandwidth sized at peak produce throughput plus headroom, compressed on the wire
- Health-checked DNS failover for client bootstrap servers, with TTLs low enough to matter during an incident
- Quarterly failover drills that measure achieved RTO/RPO against targets — untested DR is a diagram, not a capability
- Cost line items reviewed explicitly: standby compute, cross-region transfer (usually the surprise), and duplicated storage
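A drill only counts if it produces numbers. The sketch below scores one drill, assuming each lag sample is consumer wall-clock time minus record timestamp on the DR cluster, and assuming a high percentile of those samples is an acceptable stand-in for achieved RPO. All names and the example figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DrillResult:
    achieved_rpo_s: float
    achieved_rto_s: float
    rpo_met: bool
    rto_met: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate_drill(lag_samples_s: list[float], outage_start_s: float,
                   clients_recovered_s: float, rpo_target_s: float,
                   rto_target_s: float, pct: float = 99) -> DrillResult:
    """Compare achieved RPO/RTO from one drill against the stated targets."""
    rpo = percentile(lag_samples_s, pct)        # data at risk when the region died
    rto = clients_recovered_s - outage_start_s  # time until clients worked again
    return DrillResult(rpo, rto, rpo <= rpo_target_s, rto <= rto_target_s)


if __name__ == "__main__":
    lags = [0.4, 0.9, 1.2, 7.5, 0.8]  # made-up end-to-end lag samples, seconds
    print(evaluate_drill(lags, outage_start_s=1000, clients_recovered_s=1420,
                         rpo_target_s=5, rto_target_s=600))
```

In this made-up run the RTO target is met but a single lag spike breaks the RPO target, which is the kind of result an average would hide.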
The failure mode replication cannot solve
Geo replication is built to copy everything, quickly. That is precisely why it cannot protect you from:
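- A producer bug writing poisoned records, which reach the standby as quickly as good records do
- An accidental deletion or retention change; MM2 syncs topic configuration by default, so a bad retention setting follows the data to the standby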
- A compliance request to reconstruct data as of last quarter
For those you need an immutable copy that lives outside both clusters and can be restored to a moment in time. That is what point-in-time backup provides: records, consumer offsets, and topic configuration in object storage, restorable to any cluster — including the DR cluster you just failed over to.
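The core idea of a point-in-time restore can be sketched as a filter over backed-up records. This is a conceptual illustration only: the record type and function are hypothetical and do not represent any backup tool's API. It also assumes record timestamps are trustworthy, which is not guaranteed when producers set them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackedUpRecord:
    topic: str
    partition: int
    offset: int
    timestamp_ms: int
    value: bytes


def restore_as_of(records: list[BackedUpRecord], cutoff_ms: int) -> list[BackedUpRecord]:
    """Records written at or before cutoff_ms, in per-partition offset order."""
    kept = [r for r in records if r.timestamp_ms <= cutoff_ms]
    return sorted(kept, key=lambda r: (r.topic, r.partition, r.offset))


if __name__ == "__main__":
    backup = [
        BackedUpRecord("orders", 0, 11, 3000, b"poisoned"),
        BackedUpRecord("orders", 0, 10, 2000, b"good-2"),
        BackedUpRecord("orders", 0, 9, 1000, b"good-1"),
    ]
    # Restore to just before the bad deploy started writing at t=3000.
    for record in restore_as_of(backup, cutoff_ms=2500):
        print(record.offset, record.value)
```

A replica cannot do this, because by the time you know the cutoff you want, both clusters already hold the records written after it.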
Mature Kafka estates run both layers: replication for availability, backups for recoverability. The best practices guide covers operating that second layer well.
Frequently asked questions
How does Kafka geo replication work?
A replication tool — most commonly MirrorMaker 2 — consumes topics from a source cluster and produces them to a target cluster in another region, continuously. Checkpoints translate consumer group offsets between clusters so consumers can fail over and resume near where they left off.
What is the difference between Kafka replication and geo replication?
Kafka replication (replication.factor) keeps copies of each partition on multiple brokers within one cluster. Geo replication copies topics between separate clusters in different regions or datacenters, protecting against the loss of an entire site rather than a single broker.
Can Kafka replicate across data centers?
Yes. MirrorMaker 2, Confluent Replicator, and Confluent Cluster Linking all replicate topics between clusters in different datacenters. Cross-datacenter links need WAN tuning: compression, larger batches, and monitoring of end-to-end lag rather than connector lag alone.
What is the best pattern for Kafka cross-region replication?
Active-passive is the right starting point for most teams: one primary cluster and a warm standby receiving a continuous copy. Active-active adds bidirectional replication and conflict handling, and is only worth the complexity when both regions must serve writes.
Does geo replication replace Kafka backups?
No. Replication copies every write to the standby as fast as the link allows, including corrupted data and accidental deletions. Backups provide immutable, point-in-time copies outside both clusters, which is what you restore from after a logical failure rather than an infrastructure one.
Related reading: OSO Kafka Backup vs MirrorMaker 2, disaster recovery use cases, and how to backup and restore Kafka topics.