Kafka Geo Replication: Multi-Region and Cross-Datacenter Patterns

OSO Engineering
The team behind OSO Kafka Backup

Kafka geo replication copies topics between clusters in different regions or datacenters, so a regional outage does not take your streaming platform with it. In-cluster replication (RF=3) protects against broker loss inside one failure domain; geo replication protects against losing the domain itself. This guide compares the four patterns, shows what MirrorMaker 2 setup looks like over a WAN, and covers the failure mode replication cannot solve.

Key takeaway

Start with active-passive between two regions. Measure real replication lag before promising an RPO, budget for cross-region transfer costs, and pair replication with point-in-time backups — replication propagates mistakes as faithfully as it propagates good data.

What geo replication is (and is not)

Setting replication.factor=3 puts three copies of each partition on three brokers — in the same cluster. Rack awareness can spread those replicas across availability zones, but the cluster is still one blast radius: one control plane, one region, one set of humans with admin rights.

Geo replication runs a second (or third) Kafka cluster elsewhere and copies topics between them, cluster to cluster. The use cases:

  • Disaster recovery — survive a region failure with a warm standby
  • Data locality — serve consumers from the nearest region
  • Compliance — keep regional data in-region while sharing what is allowed
  • Migration — move workloads between datacenters or clouds without a big bang (see the migration use cases)

The metrics that govern every design below: end-to-end replication lag, cross-region latency, and network transfer cost.
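End-to-end replication lag is the metric that an RPO promise ultimately rests on, so it helps to be concrete about how to compute it. The sketch below is a minimal illustration, not part of any Kafka API: it assumes you have sampled pairs of (record timestamp, time the record was observed on the target cluster), that records carry producer-side timestamps, and that clocks in both regions are reasonably synchronised. The function names are hypothetical.

```python
import math


def end_to_end_lag_ms(record_timestamp_ms, observed_at_ms):
    """Lag of one replicated record: time seen on the target minus time produced."""
    # Clamp at zero: clock skew between regions can make the raw delta negative.
    return max(0, observed_at_ms - record_timestamp_ms)


def nearest_rank(sorted_values, percentile):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(percentile / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def lag_summary(samples):
    """Summarise (record_timestamp_ms, observed_at_ms) pairs as p50, p99 and max lag."""
    lags = sorted(end_to_end_lag_ms(ts, seen) for ts, seen in samples)
    if not lags:
        raise ValueError("no samples to summarise")
    return {
        "p50_ms": nearest_rank(lags, 50),
        "p99_ms": nearest_rank(lags, 99),
        "max_ms": lags[-1],
    }
```

An RPO target is safer when derived from the p99 or maximum over a representative window (including peak traffic) than from the median, because the outage will not wait for a quiet moment.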

The four replication patterns

| Pattern | RTO | Cost | Complexity | Best for |
|---|---|---|---|---|
| Active-passive | Minutes | 2× infra | Low | DR for a single primary region |
| Active-active | Near-zero | 2× infra + conflict handling | High | Regional serving with failover both ways |
| Hub-and-spoke | Varies by spoke | Hub + N spokes | Medium | Central aggregation, regional distribution |
| Mesh | Near-zero | N× everything | Very high | Few orgs genuinely need this |

Active-passive is the honest default. One cluster serves traffic; a standby in another region receives a continuous copy. Failover means repointing clients — the hard part is offset translation, not data movement.

Active-active lets both regions produce and consume. It puts the second region's capacity to work instead of leaving it idle, but introduces bidirectional flows, loop prevention, and topic naming discipline. Choose it when both regions must serve writes, not because idle standby feels wasteful.

Hub-and-spoke fits aggregation topologies: regional clusters replicate into a central hub for analytics, or a hub fans reference data out to the edges.

Mesh — everyone replicates to everyone — multiplies links, monitoring, and failure modes quadratically. It is listed here mostly so you can decline it deliberately.
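To make the growth concrete, here is a rough count of one-way replication flows (each source-to-target pair you have to run and monitor) for hub-and-spoke versus a full mesh. It is a back-of-the-envelope sketch with hypothetical function names, and it assumes every cluster pair in a mesh replicates in both directions.

```python
def hub_and_spoke_flows(clusters, bidirectional=False):
    """One-way flows when one hub is connected to every other cluster."""
    if clusters < 2:
        raise ValueError("need at least a hub and one spoke")
    spokes = clusters - 1
    return spokes * (2 if bidirectional else 1)


def mesh_flows(clusters):
    """One-way flows when every cluster replicates to every other cluster."""
    if clusters < 2:
        raise ValueError("need at least two clusters")
    return clusters * (clusters - 1)


if __name__ == "__main__":
    for n in (2, 3, 5, 8):
        print(f"{n} clusters: hub-and-spoke={hub_and_spoke_flows(n)}, mesh={mesh_flows(n)}")
```

With five clusters that is 4 flows for a one-directional hub-and-spoke against 20 for a mesh, and each flow brings its own lag to watch and its own failure cases.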

Implementing geo replication with MirrorMaker 2

MirrorMaker 2 (MM2) ships with Apache Kafka and runs on the Connect framework. A minimal active-passive setup:

mm2.properties
clusters = primary, dr
primary.bootstrap.servers = kafka-us-east.example.com:9092
dr.bootstrap.servers = kafka-us-west.example.com:9092

# Replicate everything except internals from primary to DR
primary->dr.enabled = true
primary->dr.topics = .*

# Consumer group offset sync, so consumers can fail over
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
sync.group.offsets.interval.seconds = 60

Three things bite teams on real WAN links:

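  1. Remote topic prefixes. MM2 replicates orders as primary.orders on the DR cluster by default (source cluster alias, a dot, then the topic name). Consumers failing over must subscribe accordingly, or you must override the replication policy, for example with the IdentityReplicationPolicy that ships with Kafka 3.0 and later, which keeps topic names unchanged.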
  2. Offset translation. Offsets differ between source and target. MM2's checkpoints translate consumer group positions — verify translated offsets in a drill before an outage forces the issue.
  3. WAN tuning. Raise producer batch.size and linger.ms on the connectors, enable compression, and monitor end-to-end lag (record timestamp delta at the consumer), not just Connect task lag.
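The first item is the one that most often surprises application teams, so a small helper can make the renaming explicit in client code. This is a sketch under assumptions: it mirrors MM2's default naming (source alias, separator, topic) in plain Python, the function names are hypothetical, and the `identity_policy` flag stands for a deployment where the replication policy has been overridden to keep names unchanged.

```python
def remote_topic(source_alias, topic, separator="."):
    """Name a topic is expected to have on the target under MM2's default policy."""
    return f"{source_alias}{separator}{topic}"


def failover_subscription(topics, source_alias, identity_policy=False):
    """Topics a consumer should subscribe to after failing over to the DR cluster."""
    if identity_policy:
        # Names are unchanged when the replication policy preserves them.
        return list(topics)
    return [remote_topic(source_alias, t) for t in topics]


if __name__ == "__main__":
    print(failover_subscription(["orders", "payments"], "primary"))
```

Keeping this mapping in one place, rather than hard-coding prefixed names across services, makes it easier to rehearse failover and to change the policy later.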

Our MirrorMaker 2 comparison covers where MM2 shines and where it stops.

A reference architecture that holds up

The pattern we see work repeatedly for cross-datacenter DR:

  • Two regions, active-passive, MM2 running in the target region (a pull model: MM2 consumes over the WAN and produces locally, and the DR site keeps working if the primary degrades)
  • Dedicated replication bandwidth sized at peak produce throughput plus headroom, compressed on the wire
  • Health-checked DNS failover for client bootstrap servers, with TTLs low enough to matter during an incident
  • Quarterly failover drills that measure achieved RTO/RPO against targets — untested DR is a diagram, not a capability
  • Cost line items reviewed explicitly: standby compute, cross-region transfer (usually the surprise), and duplicated storage
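The bandwidth and transfer-cost items lend themselves to quick arithmetic before any procurement conversation. The sketch below is illustrative only: the headroom, compression ratio, and per-GB price are assumptions you would replace with your own measurements and your provider's actual pricing, and the function names are hypothetical.

```python
def replication_bandwidth_mbps(peak_produce_mb_per_s, headroom=0.5, compression_ratio=1.0):
    """Link size in megabits/s: peak produce rate plus headroom, after compression."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    wire_mb_per_s = peak_produce_mb_per_s * (1 + headroom) / compression_ratio
    return wire_mb_per_s * 8  # megabytes/s to megabits/s


def monthly_transfer_cost(avg_produce_mb_per_s, price_per_gb, compression_ratio=1.0, days=30):
    """Rough cross-region transfer bill for replicating the average produce rate."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    gb_per_month = avg_produce_mb_per_s / compression_ratio * 86400 * days / 1000
    return gb_per_month * price_per_gb


if __name__ == "__main__":
    # Hypothetical workload: 50 MB/s peak, 20 MB/s average, 3:1 compression.
    print(replication_bandwidth_mbps(50, headroom=0.5, compression_ratio=3.0))
    print(monthly_transfer_cost(20, price_per_gb=0.02, compression_ratio=3.0))
```

Size the link from the peak rate and the bill from the average rate; mixing the two up either starves replication during bursts or overstates the cost.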

The failure mode replication cannot solve

Geo replication is built to copy everything, quickly. That is precisely why it cannot protect you from:

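  • A producer bug writing poisoned records: copied to the standby as fast as the link allows, typically within seconds
  • An accidental destructive change, such as tombstones written to a compacted topic or a mistakenly lowered retention.ms: copied to the standby along with everything else (MM2 syncs topic configuration by default)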
  • A compliance request to reconstruct data as of last quarter

For those you need an immutable copy that lives outside both clusters and can be restored to a moment in time. That is what point-in-time backup provides: records, consumer offsets, and topic configuration in object storage, restorable to any cluster — including the DR cluster you just failed over to.
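Conceptually, "restorable to a moment in time" means replaying only the backed-up records written up to a chosen cutoff, which is exactly what a live replica cannot offer once the bad data has arrived. The sketch below illustrates the idea only; it is not how any particular backup tool is implemented, and the `BackedUpRecord` type and `records_as_of` function are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class BackedUpRecord(NamedTuple):
    offset: int
    timestamp_ms: int
    value: bytes


def records_as_of(records: Iterable[BackedUpRecord], cutoff_ms: int) -> List[BackedUpRecord]:
    """Keep only records written at or before the cutoff, preserving their order."""
    return [r for r in records if r.timestamp_ms <= cutoff_ms]


if __name__ == "__main__":
    backup = [
        BackedUpRecord(0, 1_000, b"ok"),
        BackedUpRecord(1, 2_000, b"ok"),
        BackedUpRecord(2, 3_000, b"poisoned"),
    ]
    # Restore to just before the bad deploy at t=2500 ms.
    print(records_as_of(backup, cutoff_ms=2_500))
```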

Mature Kafka estates run both layers: replication for availability, backups for recoverability. The best practices guide covers operating that second layer well.

Frequently asked questions

How does Kafka geo replication work?

A replication tool — most commonly MirrorMaker 2 — consumes topics from a source cluster and produces them to a target cluster in another region, continuously. Checkpoints translate consumer group offsets between clusters so consumers can fail over and resume near where they left off.

What is the difference between Kafka replication and geo replication?

Kafka replication (replication.factor) keeps copies of each partition on multiple brokers within one cluster. Geo replication copies topics between separate clusters in different regions or datacenters, protecting against the loss of an entire site rather than a single broker.

Can Kafka replicate across data centers?

Yes. MirrorMaker 2, Confluent Replicator, and Confluent Cluster Linking all replicate topics between clusters in different datacenters. Cross-datacenter links need WAN tuning: compression, larger batches, and monitoring of end-to-end lag rather than connector lag alone.

What is the best pattern for Kafka cross-region replication?

Active-passive is the right starting point for most teams: one primary cluster and a warm standby receiving a continuous copy. Active-active adds bidirectional replication and conflict handling, and is only worth the complexity when both regions must serve writes.

Does geo replication replace Kafka backups?

No. Replication copies every write to the standby as fast as the link allows, typically within seconds, and that includes corrupted records, unwanted tombstones, and mistaken configuration changes. Backups provide immutable, point-in-time copies outside both clusters, which is what you restore from after a logical failure rather than an infrastructure one.


Related reading: OSO Kafka Backup vs MirrorMaker 2, disaster recovery use cases, and how to backup and restore Kafka topics.