Kafka Backup Strategies: The Complete Guide to Protecting Streaming Data

July 4, 2026

The team behind OSO Kafka Backup

Kafka backup means copying topic data, consumer group offsets, and cluster metadata to independent storage — object storage or a filesystem — so you can restore it after an incident. Replication alone cannot do this: it faithfully copies deletions and corrupt records to every replica in real time. A complete Kafka backup strategy combines replication for hardware resilience with periodic or continuous backups for protection against logical errors, regional outages, and compliance obligations.

Key takeaway

Pick a strategy by failure mode, not by tool. Replication covers broker loss. Backups cover human error, bad deploys, and region-wide outages. Most production clusters need both, and the backup half is the one most teams skip.

Why replication is not a Kafka backup

Kafka's replication factor is the first thing engineers point to when backup comes up. A replication factor of 3 means every partition lives on three brokers, so losing one broker, or even two, still leaves a copy of the data (although with min.insync.replicas=2, losing two brokers blocks acks=all writes until one returns). That protection is real, but it only covers one failure mode: hardware.

Everything else passes straight through it:

Human error. An engineer deletes the wrong topic, or a producer with a bug writes garbage for six hours. Replication copies the mistake to every replica within milliseconds.
Retention. Kafka deletes data on schedule. Once a segment ages out of retention.ms, no replica holds it — and a compliance request or late-built consumer cannot get it back.
Cluster-wide failure. A bad broker upgrade, a corrupted controller quorum, or a region outage takes all replicas down together.
Ransomware and credential compromise. An attacker with cluster admin rights can destroy every replica at once. An off-cluster backup in a bucket with separate credentials survives.

This is the house rule worth writing on the wall: replication is availability, backup is recoverability. They answer different questions, and the disaster recovery use cases show how the two fit together in practice.
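
As a rough illustration of how narrow the hardware guarantee is, the arithmetic below works out how many broker losses a single partition tolerates. It is a sketch under stated assumptions: the function name and dictionary keys are hypothetical, producers are assumed to use acks=all, and min.insync.replicas=2 is an assumed setting that the text above does not specify.

```python
def broker_loss_tolerance(replication_factor: int, min_insync_replicas: int) -> dict:
    """Count the broker losses one partition tolerates (illustrative helper)."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min_insync_replicas must be between 1 and replication_factor")
    return {
        # Some replica still exists while fewer than RF brokers are lost.
        # (It only holds every acknowledged record if it was in sync.)
        "losses_tolerated_with_a_copy_left": replication_factor - 1,
        # acks=all writes need at least min.insync.replicas in-sync replicas.
        "losses_tolerated_with_acks_all_writes": replication_factor - min_insync_replicas,
    }


tolerance = broker_loss_tolerance(replication_factor=3, min_insync_replicas=2)
print(tolerance)
```

Neither number says anything about a deleted topic or a bad write: those are applied to every replica, so the tolerance for logical errors is zero at any replication factor.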

The four core Kafka backup strategies

Every production approach to Kafka backup is a variation of four strategies. They differ in what they capture, how fresh the copy is, and what a restore costs you in time.

Strategy	Typical RPO	Typical RTO	Protects against	Relative cost	Complexity
Topic-level export	Minutes to hours	Hours	Topic loss, retention expiry	Low	Low
Cluster-wide backup	Minutes	Hours	Cluster loss, metadata loss	Medium	Medium
Continuous replication	Seconds	Minutes	Region/DC failure	High	High
Point-in-time recovery	Seconds to minutes	Minutes to hours	Logical errors, bad writes	Medium	Medium

1. Topic-level export

The simplest strategy: consume selected topics and write them to object storage. The Kafka Connect S3 Sink connector is the common implementation — it streams records into S3 objects, partitioned by topic and time.

Topic-level export is cheap and easy to reason about, which makes it a good first step. Its weakness is scope: a plain sink connector captures message values but not consumer group offsets, topic configurations, or schemas. Restoring from it means re-producing records into a new topic and accepting that every consumer loses its position.

2. Cluster-wide backup

A cluster-wide backup captures the full state you would need to rebuild: topic data across all selected topics, topic configurations, and consumer group offsets. This is the difference between "we have the bytes somewhere" and "we can stand the cluster back up."

Offsets matter more than most teams expect. A restore that loses consumer positions forces every consumer group to choose between reprocessing everything and skipping to latest — both are incidents in their own right. OSO Kafka Backup captures offsets and topic configuration as part of every backup, and its consumer_group_snapshot option writes a per-backup snapshot of every group's committed offsets to storage.
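
To see why both fallbacks hurt, the sketch below counts what one consumer group faces on one partition after a restore that dropped its committed offset. The function name and the sample offsets are illustrative assumptions, not output from any tool.

```python
def lost_position_cost(earliest: int, committed: int, latest: int) -> dict:
    """Compare the two fallbacks for a group whose committed offset was lost.

    earliest:  first offset available in the restored partition
    committed: offset the group would have resumed from, had it been preserved
    latest:    log end offset (the next offset to be written)
    """
    if not earliest <= committed <= latest:
        raise ValueError("expected earliest <= committed <= latest")
    return {
        # Resetting to earliest processes everything before `committed` again.
        "reprocessed_if_reset_to_earliest": committed - earliest,
        # Resetting to latest never sees the records from `committed` onward.
        "skipped_if_reset_to_latest": latest - committed,
    }


cost = lost_position_cost(earliest=0, committed=9_500_000, latest=10_000_000)
print(cost)
```

With these sample numbers the group either replays 9.5 million records it already handled or silently drops 500,000 it never saw.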

3. Continuous replication

MirrorMaker 2, Confluent Replicator, and MSK Replicator continuously copy topics to a second cluster, usually in another datacenter or region. When the primary fails, consumers and producers fail over to the standby. This is the strongest answer to regional failure: RPO of seconds and RTO measured in minutes.

But remember the failure-mode framing — a standby cluster is still a live copy. It replicates deletions, corruption, and poison-pill records just as quickly as good data. Continuous replication belongs in a disaster recovery architecture alongside backups, not instead of them. Our MirrorMaker 2 comparison covers where replication shines and where it stops.

4. Point-in-time recovery

Point-in-time recovery (PITR) restores a topic to its state at a specific moment — before the bad deploy, before the accidental delete, before the corrupt batch landed. It requires backups that preserve offsets and timestamps, so the restore can replay exactly the records that existed in a given window and no more.
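
Conceptually, a point-in-time restore is a timestamp filter over backed-up records that keeps their original order. The following is a minimal sketch of that idea only. The Record type, the function name, and the half-open window (start inclusive, end exclusive) are assumptions for illustration and say nothing about how any particular tool treats the window boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int
    value: bytes


def records_in_window(records, start_ms, end_ms):
    """Return records with start_ms <= timestamp_ms < end_ms, in offset order."""
    if start_ms >= end_ms:
        raise ValueError("start_ms must be earlier than end_ms")
    selected = [r for r in records if start_ms <= r.timestamp_ms < end_ms]
    return sorted(selected, key=lambda r: r.offset)


backup = [
    Record(0, 1_000, b"good-1"),
    Record(1, 2_000, b"good-2"),
    Record(2, 3_000, b"corrupt"),  # the bad write landed at t=3000
]
restored = records_in_window(backup, start_ms=0, end_ms=3_000)
print([r.value for r in restored])
```

The filter only works if the backup kept each record's timestamp and offset, which is why a values-only export cannot support it.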

With OSO Kafka Backup, the restore window is expressed in Unix milliseconds:

mode: restore
backup_id: "prod-backup-20260704"

target:
  bootstrap_servers:
    - dr-broker-1:9092

storage:
  backend: s3
  bucket: company-kafka-backups
  region: us-west-2

restore:
  # Only records produced between 10:00 and 14:00 UTC
  time_window_start: 1783159200000
  time_window_end: 1783173600000

  topic_mapping:
    orders: orders_restored

  consumer_group_strategy: header-based

Restoring into a mapped topic (orders_restored) lets you validate the recovered data before cutting consumers over — a pattern the backup and restore tutorial walks through end to end.
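
The two window values in the restore config above decode to 10:00 and 14:00 UTC on 2026-07-04. Hand-computing Unix milliseconds during an incident is error-prone, so a small helper is worth keeping next to the runbook. The sketch below uses only the Python standard library; the helper name is hypothetical.

```python
from datetime import datetime, timezone


def to_unix_ms(year, month, day, hour, minute=0):
    """Convert a UTC wall-clock time to Unix epoch milliseconds."""
    moment = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    # Whole seconds first, then scale, so the result stays an exact integer.
    return int(moment.timestamp()) * 1000


window_start = to_unix_ms(2026, 7, 4, 10)
window_end = to_unix_ms(2026, 7, 4, 14)
print(window_start, window_end)  # prints 1783159200000 1783173600000
```

Passing an explicit UTC timezone matters: a naive datetime would be interpreted in the local timezone of whichever machine runs the script.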

How to choose the right Kafka backup strategy

Four questions drive the decision. Answer them per topic tier, not per cluster — a payments topic and a clickstream topic should never share a policy.

What is the RPO? How much data can you afford to lose? An RPO under a minute pushes you toward continuous backup or replication. Hours of acceptable loss make scheduled snapshots fine.
What is the RTO? How long can consumers be down? Fail-over to a warm standby is minutes; a restore from object storage is longer and scales with data volume.
What are the compliance obligations? GDPR, HIPAA, and SOX-style regimes often require retention beyond Kafka's own limits, plus proof that recovery works. That mandates off-cluster backups with immutable storage, regardless of your replication setup.
What is the budget? A standby cluster doubles your Kafka spend. Compressed backups in object storage cost a fraction of that — Zstandard compression routinely shrinks Kafka data severalfold before upload.
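
The RPO question can be turned into a per-tier rule of thumb. The sketch below encodes only what the text states (an RPO under a minute points to continuous backup or replication, while a longer RPO allows scheduled snapshots) and adds the assumption that a snapshot schedule can lose up to one full interval of data. The function name, the tier names, and the RPO values are hypothetical.

```python
def backup_mode_for_rpo(rpo_seconds: float) -> dict:
    """Map an RPO to a backup mode and, for snapshots, a maximum interval."""
    if rpo_seconds <= 0:
        raise ValueError("rpo_seconds must be positive")
    if rpo_seconds < 60:
        # Sub-minute RPO: a schedule cannot keep up, stream continuously.
        return {"mode": "continuous", "max_interval_seconds": None}
    # Worst case, everything written since the last snapshot is lost,
    # so the interval must not exceed the RPO.
    return {"mode": "scheduled-snapshot", "max_interval_seconds": rpo_seconds}


topic_tiers = {"payments": 30, "clickstream": 4 * 3600}
policies = {tier: backup_mode_for_rpo(rpo) for tier, rpo in topic_tiers.items()}
for tier, policy in policies.items():
    print(tier, policy)
```

In practice the snapshot interval should sit comfortably below the RPO, because the time a backup run takes also counts against the budget.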

The combinations that come up most often:

Profile	Recommended combination
Financial services, regulated data	PITR backups + immutable storage + quarterly restore drills
E-commerce, low tolerance for downtime	Continuous replication for fail-over + daily cluster-wide backup
IoT / high-volume telemetry	Topic-level export to object storage with lifecycle tiering
Internal platform, mixed criticality	Cluster-wide backup for all topics + replication for tier-1 only

Defense in depth is the pattern behind all of these: replication answers the availability question, backup answers the recoverability question, and the two layer cleanly because they share no infrastructure, so a failure in one does not take out the other.

Scheduling and retention: the policy layer

Whichever strategy you choose, two policies turn it from a script into a system: when backups run, and how long they live.

Scheduling. Snapshot backups pair naturally with a scheduler. Setting stop_at_current_offsets: true makes a backup capture the high watermark of every partition at start, back up to exactly that point, and exit — ideal for a nightly cron job or a Kubernetes CronJob. Continuous mode (continuous: true) replaces the schedule entirely: the process streams records to storage as they arrive and checkpoints its progress, so RPO drops from "since last night" to seconds. From v0.13.5, adding an offset_storage section makes even scheduled snapshots incremental — each run resumes from where the previous one stopped instead of re-reading from earliest.
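
The difference between a full snapshot and an incremental one comes down to where each partition's copy range starts. The sketch below illustrates the idea with plain dictionaries. It is not the tool's implementation: the function name is hypothetical, and a partition with no checkpoint is assumed to start at offset 0, whereas a real partition's earliest offset may be higher once retention has removed old segments.

```python
def plan_snapshot(high_watermarks: dict, checkpoints: dict) -> dict:
    """Return {partition: (start, end)} half-open offset ranges to copy."""
    plan = {}
    for partition, end in high_watermarks.items():
        # Resume from the previous run's end, or from offset 0 on the first run.
        start = checkpoints.get(partition, 0)
        if start < end:  # skip partitions with nothing new
            plan[partition] = (start, end)
    return plan


# First run: no checkpoints yet, so every partition is copied in full.
plan1 = plan_snapshot({0: 1000, 1: 800}, checkpoints={})
checkpoints = {partition: end for partition, (_, end) in plan1.items()}

# Second run: only partition 0 received new records since the first run.
plan2 = plan_snapshot({0: 1500, 1: 800}, checkpoints)
print(plan1)
print(plan2)
```

Capturing the high watermarks once at the start is what lets the run terminate: records that arrive during the backup are left for the next run instead of moving the finish line.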

Retention. Backup retention is a policy decision, not a storage default. Set it from two directions: the minimum your compliance regime requires (often years for financial or health data) and the maximum your privacy obligations allow. Object storage lifecycle rules do the mechanical work — keep recent backups in a hot tier for fast restore, transition older ones to infrequent-access or archive classes, and expire what nothing requires you to keep. Because backups are compressed before upload, even multi-year retention is usually a rounding error next to the cluster's own cost.
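
A retention policy of this shape reduces to a few age thresholds. The sketch below classifies a backup by age. The function name, tier labels, and default thresholds (30 days hot, archive after one year, expire after seven years) are placeholder assumptions: the expiry value in particular has to come from your own compliance minimum and privacy maximum. In production the object store's lifecycle rules would do this work, and the function is only useful for reasoning about or testing the policy.

```python
from datetime import date


def lifecycle_action(backup_date: date, today: date,
                     hot_days: int = 30,
                     archive_after_days: int = 365,
                     expire_after_days: int = 7 * 365) -> str:
    """Return the storage tier (or 'expire') for a backup of a given age."""
    if not hot_days < archive_after_days < expire_after_days:
        raise ValueError("thresholds must increase: hot < archive < expire")
    age_days = (today - backup_date).days
    if age_days < 0:
        raise ValueError("backup_date is in the future")
    if age_days >= expire_after_days:
        return "expire"
    if age_days >= archive_after_days:
        return "archive"
    if age_days >= hot_days:
        return "infrequent-access"
    return "hot"


today = date(2026, 7, 4)
for taken in (date(2026, 6, 20), date(2026, 1, 1), date(2024, 7, 4), date(2019, 1, 1)):
    print(taken, lifecycle_action(taken, today))
```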

The point of the policy layer is that it is written down and versioned. A schedule that lives in one engineer's crontab and a retention rule that lives in nobody's head are how backup systems rot.

Kafka backup tools and what they implement

Tool	Strategy	Offsets preserved	PITR	Storage targets	Licensing
Kafka Connect S3 Sink	Topic-level export	No	No	S3	Open source
MirrorMaker 2	Continuous replication	Translated, with lag	No	Second Kafka cluster	Open source (Apache)
Confluent Replicator	Continuous replication	Yes	No	Second Kafka cluster	Commercial
OSO Kafka Backup	Cluster-wide backup + PITR	Yes, built in	Yes, millisecond precision	S3, Azure Blob, GCS, filesystem	Open source (MIT)

A few notes the table cannot carry:

Kafka Connect S3 Sink is excellent at what it does — durable, well-understood archiving of topic data to S3. It simply was not designed to be a restore path, so pair it with something that is.
MirrorMaker 2 ships with Kafka, costs nothing to license, and handles offset translation between clusters. Running it well takes real operational work; see the comparison against purpose-built backup for the trade-offs.
OSO Kafka Backup is a Rust tool built for backup specifically: compressed (Zstandard or LZ4) segments in object storage, consumer offset preservation, point-in-time restore, and a Kubernetes operator that is Strimzi compatible. The core is MIT-licensed; Schema Registry sync, encryption, RBAC, and audit logging are Enterprise features.

Best practices that make any strategy work

The strategy you pick matters less than the discipline around it. Four habits separate backups that restore from backups that merely exist:

Verify by restoring. Schedule automated restore tests into a scratch cluster. The validate-restore command and dry_run: true make this cheap enough to run daily.
Monitor lag. Alert when kafka_backup_lag_records exceeds your RPO budget — a backup job that died on Friday should page you before Monday.
Back up config as code. Keep every backup.yaml in Git and ship changes through CI, so a typo cannot silently disable a nightly job.
Write the runbook now. The engineer restoring at 3 a.m. should copy-paste, not compose.

Each of these is expanded, with alert thresholds and testing cadences, in the Kafka backup best practices guide.
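
For the lag habit, note that kafka_backup_lag_records counts records while an RPO is a duration, so an alert rule needs a conversion between the two. The sketch below makes one simple assumption, that the topic's recent produce rate is known and roughly steady, and uses hypothetical function names. It is an illustration of the threshold logic, not an alerting rule shipped by any tool.

```python
def backup_lag_seconds(lag_records: int, produce_rate_per_sec: float) -> float:
    """Estimate how far behind the backup is, in seconds of produced data."""
    if produce_rate_per_sec <= 0:
        raise ValueError("produce_rate_per_sec must be positive")
    return lag_records / produce_rate_per_sec


def should_page(lag_records: int, produce_rate_per_sec: float, rpo_seconds: float) -> bool:
    """Page when the estimated lag exceeds the RPO budget."""
    return backup_lag_seconds(lag_records, produce_rate_per_sec) > rpo_seconds


# 120,000 records behind at 500 records/s is 240 s of data: over a 60 s RPO.
print(should_page(lag_records=120_000, produce_rate_per_sec=500, rpo_seconds=60))
# 10,000 records behind at the same rate is 20 s of data: within budget.
print(should_page(lag_records=10_000, produce_rate_per_sec=500, rpo_seconds=60))
```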

Your first Kafka backup in 15 minutes

Theory ends here. A snapshot backup of one topic to S3 takes one YAML file and one command.

Create backup.yaml:

mode: backup
backup_id: "first-backup-orders"

source:
  bootstrap_servers:
    - kafka:9092
  topics:
    include:
      - orders

storage:
  backend: s3
  bucket: my-kafka-backups
  region: us-west-2
  prefix: backups/production

backup:
  compression: zstd
  # Snapshot mode: capture current high watermarks, back up to them, exit
  stop_at_current_offsets: true
  consumer_group_snapshot: true

Run it:

kafka-backup backup --config backup.yaml

Then confirm the backup is really there and really restorable:

# List backups in the storage backend
kafka-backup list --config backup.yaml

# Validate integrity of what was written
kafka-backup validate --config backup.yaml --backup-id first-backup-orders
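
For a scheduled job, the three commands above can be chained so that a failure in any step stops the run. The wrapper below is a sketch: it builds exactly the command lines shown in this section and assumes that the CLI exits with a non-zero status on failure, which is not confirmed here. The function names are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess


def backup_and_verify_commands(config_path: str, backup_id: str) -> list:
    """Build the backup, list, and validate command lines from this section."""
    return [
        ["kafka-backup", "backup", "--config", config_path],
        ["kafka-backup", "list", "--config", config_path],
        ["kafka-backup", "validate", "--config", config_path, "--backup-id", backup_id],
    ]


def run_all(commands, dry_run: bool = True) -> None:
    """Print the commands, or run them in order and stop at the first failure."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(argv, check=True)


commands = backup_and_verify_commands("backup.yaml", "first-backup-orders")
run_all(commands, dry_run=True)
```

Passing each command as an argument list rather than a shell string avoids quoting problems if a config path or backup ID ever contains spaces.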

From here the upgrade path is incremental: add more topics to include, set continuous: true for streaming backup, add an offset_storage section for resumable incremental runs, and schedule restore tests. The first backup tutorial covers each step, and the configuration reference documents every option used above. If S3 is not your target, the same config works against Azure Blob, GCS, or a filesystem by switching the storage.backend.

Start before you need it

A Kafka backup strategy is insurance you design while calm and cash in while panicking — the quality of the second moment is set entirely by the first. Start with the simplest thing that covers your worst realistic failure: a scheduled cluster-wide backup to object storage. Layer replication when downtime tolerance demands it, and PITR when logical errors would hurt.

Then test the restore. A backup you have restored is a capability; anything else is a hope.

Frequently asked questions

How do you take a backup of a Kafka topic?

Consume the topic and write its records to independent storage. With OSO Kafka Backup, you declare the topic in a YAML config with an S3, Azure Blob, GCS, or filesystem storage backend and run kafka-backup backup --config backup.yaml. The backup captures records, topic configuration, and consumer group offsets.

What is the difference between Kafka replication and Kafka backup?

Replication keeps live copies of partitions on multiple brokers for availability — it protects against hardware failure but propagates deletions and corruption instantly. Backup writes a separate historical copy to independent storage for recoverability, so you can restore data after human error, logical corruption, or total cluster loss.

Can you do point-in-time recovery with Kafka?

Not with Kafka alone — the broker has no built-in restore mechanism. Backup tools that preserve timestamps and offsets add this: OSO Kafka Backup restores a topic to a specific time window using time_window_start and time_window_end in Unix milliseconds, so you can recover the state before a bad deploy or accidental delete.

How often should you back up Kafka?

Derive the frequency from your RPO. If losing an hour of data is acceptable, hourly snapshot backups suffice. If the tolerance is seconds, run continuous backup mode, which streams records to storage as they arrive. Most teams tier this: continuous for critical topics, daily snapshots for the rest.

What is the best tool for Kafka backup?

It depends on the strategy. Kafka Connect S3 Sink handles simple topic archiving, MirrorMaker 2 handles cluster-to-cluster replication, and OSO Kafka Backup handles cluster-wide backup with consumer offset preservation and millisecond point-in-time recovery to S3, Azure Blob, GCS, or a filesystem. Many production setups combine replication with a dedicated backup tool.

Ready to design yours? Take your first backup in minutes, or start from the disaster recovery architectures if replication is already in place.
