Kafka Backup Strategies: The Complete Guide to Protecting Streaming Data
Kafka backup means copying topic data, consumer group offsets, and cluster metadata to independent storage — object storage or a filesystem — so you can restore it after an incident. Replication alone cannot do this: it faithfully copies deletions and corrupt records to every replica in real time. A complete Kafka backup strategy combines replication for hardware resilience with periodic or continuous backups for protection against logical errors, regional outages, and compliance obligations.
Pick a strategy by failure mode, not by tool. Replication covers broker loss. Backups cover human error, bad deploys, and region-wide outages. Most production clusters need both, and the backup half is the one most teams skip.
Why replication is not a Kafka backup
Kafka's replication factor is the first thing engineers point to when backup comes up. A replication factor of 3 means every partition lives on three brokers, so losing one broker costs you no data, and even losing two still leaves one copy intact (although writes with acks=all may stall until enough replicas are back in sync). That protection is real, but it only covers one failure mode: hardware.
Everything else passes straight through it:
- Human error. An engineer deletes the wrong topic, or a producer with a bug writes garbage for six hours. Replication copies the mistake to every replica within milliseconds.
- Retention. Kafka deletes data on schedule. Once a segment ages out of retention.ms, no replica holds it, and a compliance request or late-built consumer cannot get it back.
- Cluster-wide failure. A bad broker upgrade, a corrupted controller quorum, or a region outage takes all replicas down together.
- Ransomware and credential compromise. An attacker with cluster admin rights can destroy every replica at once. An off-cluster backup in a bucket with separate credentials survives.
This is the house rule worth writing on the wall: replication is availability, backup is recoverability. They answer different questions, and the disaster recovery use cases show how the two fit together in practice.
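To make the retention point concrete, here is a minimal Python sketch of a time-based retention check. It is a simplified model rather than broker code: the `Segment` class and `expired_segments` function are hypothetical names, and size-based retention and the periodic nature of the cleanup check are left out. It assumes a closed segment becomes eligible for deletion once its newest record is older than retention.ms.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int
    max_timestamp_ms: int   # timestamp of the newest record in the segment
    active: bool = False    # the segment currently being written to


def expired_segments(segments, now_ms, retention_ms):
    """Return the closed segments whose newest record is older than retention_ms."""
    return [
        s for s in segments
        if not s.active and now_ms - s.max_timestamp_ms > retention_ms
    ]


now = 30 * DAY_MS
log = [
    Segment(0, now - 10 * DAY_MS),
    Segment(1000, now - 8 * DAY_MS),
    Segment(2000, now - 2 * DAY_MS),
    Segment(3000, now, active=True),
]

gone = expired_segments(log, now, retention_ms=7 * DAY_MS)
print([s.base_offset for s in gone])  # [0, 1000]
```

Every replica applies the same rule to the same segments, so a replication factor of 3 changes nothing about the outcome: the two oldest segments disappear from all three brokers.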
The four core Kafka backup strategies
Every production approach to Kafka backup is a variation of four strategies. They differ in what they capture, how fresh the copy is, and what a restore costs you in time.
| Strategy | Typical RPO | Typical RTO | Protects against | Relative cost | Complexity |
|---|---|---|---|---|---|
| Topic-level export | Minutes to hours | Hours | Topic loss, retention expiry | Low | Low |
| Cluster-wide backup | Minutes | Hours | Cluster loss, metadata loss | Medium | Medium |
| Continuous replication | Seconds | Minutes | Region/DC failure | High | High |
| Point-in-time recovery | Seconds to minutes | Minutes to hours | Logical errors, bad writes | Medium | Medium |
1. Topic-level export
The simplest strategy: consume selected topics and write them to object storage. The Kafka Connect S3 Sink connector is the common implementation — it streams records into S3 objects, partitioned by topic and time.
Topic-level export is cheap and easy to reason about, which makes it a good first step. Its weakness is scope: a plain sink connector captures message values but not consumer group offsets, topic configurations, or schemas. Restoring from it means re-producing records into a new topic and accepting that every consumer loses its position.
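As a rough illustration of what "partitioned by topic and time" looks like in a bucket, the Python sketch below builds an object key from a record batch's coordinates. The layout is only loosely modelled on time-partitioned sink output; `object_key` is a hypothetical helper, and a real connector's naming depends on its partitioner and format settings.

```python
from datetime import datetime, timezone


def object_key(topic, partition, start_offset, ts_ms, prefix="topics", ext="json"):
    """Build an illustrative object key: prefix/topic/time path/file name."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    time_path = f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}"
    # File name carries topic, partition, and the first offset in the object
    file_name = f"{topic}+{partition}+{start_offset:010d}.{ext}"
    return f"{prefix}/{topic}/{time_path}/{file_name}"


print(object_key("orders", 0, 42, 1783159200000))
# topics/orders/year=2026/month=07/day=04/hour=10/orders+0+0000000042.json
```

The key records where the data came from, but nothing in it describes consumer groups or topic configuration, which is the scope limit described above.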
2. Cluster-wide backup
A cluster-wide backup captures the full state you would need to rebuild: topic data across all selected topics, topic configurations, and consumer group offsets. This is the difference between "we have the bytes somewhere" and "we can stand the cluster back up."
Offsets matter more than most teams expect. A restore that loses consumer
positions forces every consumer group to choose between reprocessing
everything and skipping to latest — both are incidents in their own right.
OSO Kafka Backup captures offsets and topic configuration
as part of every backup, and its consumer_group_snapshot option writes a
per-backup snapshot of every group's committed offsets to storage.
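The cost of losing consumer positions is easy to put numbers on. The sketch below is plain arithmetic in Python, not part of any tool: `cost_of_lost_offsets` is a hypothetical helper, and the offsets are made-up values for a single partition.

```python
def cost_of_lost_offsets(earliest, committed, latest):
    """Compare the two fallback choices for a group whose position is gone.

    earliest:  first offset still present in the restored partition
    committed: the offset the group would have resumed from
    latest:    the log end offset of the restored partition
    """
    return {
        # Resetting to earliest replays everything already processed
        "reprocessed_if_earliest": committed - earliest,
        # Resetting to latest silently drops everything not yet processed
        "skipped_if_latest": latest - committed,
    }


print(cost_of_lost_offsets(earliest=0, committed=9_500_000, latest=10_000_000))
# {'reprocessed_if_earliest': 9500000, 'skipped_if_latest': 500000}
```

With a preserved offset, both numbers are zero: the group resumes at 9,500,000 as if nothing had happened.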
3. Continuous replication
MirrorMaker 2, Confluent Replicator, and MSK Replicator continuously copy topics to a second cluster, usually in another datacenter or region. When the primary fails, consumers and producers fail over to the standby. This is the strongest answer to regional failure: RPO of seconds and RTO measured in minutes.
But remember the failure-mode framing — a standby cluster is still a live copy. It replicates deletions, corruption, and poison-pill records just as quickly as good data. Continuous replication belongs in a disaster recovery architecture alongside backups, not instead of them. Our MirrorMaker 2 comparison covers where replication shines and where it stops.
4. Point-in-time recovery
Point-in-time recovery (PITR) restores a topic to its state at a specific moment — before the bad deploy, before the accidental delete, before the corrupt batch landed. It requires backups that preserve offsets and timestamps, so the restore can replay exactly the records that existed in a given window and no more.
With OSO Kafka Backup, the restore window is expressed in Unix milliseconds:
mode: restore
backup_id: "prod-backup-20260704"
target:
  bootstrap_servers:
    - dr-broker-1:9092
storage:
  backend: s3
  bucket: company-kafka-backups
  region: us-west-2
restore:
  # Only records produced between 10:00 and 14:00 UTC
  time_window_start: 1783159200000
  time_window_end: 1783173600000
  topic_mapping:
    orders: orders_restored
  consumer_group_strategy: header-based
Restoring into a mapped topic (orders_restored) lets you validate the
recovered data before cutting consumers over — a pattern the
backup and restore tutorial walks
through end to end.
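Unix-millisecond values are easy to get wrong by hand, usually through a timezone slip. A small Python sketch for deriving the window from UTC wall-clock times follows; `to_unix_ms` is a hypothetical helper, not part of the tool.

```python
from datetime import datetime, timezone


def to_unix_ms(year, month, day, hour=0, minute=0):
    """Convert a UTC wall-clock time to Unix milliseconds."""
    dt = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


start = to_unix_ms(2026, 7, 4, 10)  # 10:00 UTC
end = to_unix_ms(2026, 7, 4, 14)    # 14:00 UTC
print(start, end)  # 1783159200000 1783173600000
```

Passing an explicit `timezone.utc` matters: a naive `datetime` would be interpreted in the local timezone of whoever runs the script, shifting the restore window by hours.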
How to choose the right Kafka backup strategy
Four questions drive the decision. Answer them per topic tier, not per cluster — a payments topic and a clickstream topic should never share a policy.
- What is the RPO? How much data can you afford to lose? An RPO under a minute pushes you toward continuous backup or replication. Hours of acceptable loss make scheduled snapshots fine.
- What is the RTO? How long can consumers be down? Fail-over to a warm standby is minutes; a restore from object storage is longer and scales with data volume.
- What are the compliance obligations? GDPR, HIPAA, and SOX-style regimes often require retention beyond Kafka's own limits, plus proof that recovery works. That mandates off-cluster backups with immutable storage, regardless of your replication setup.
- What is the budget? A standby cluster doubles your Kafka spend. Compressed backups in object storage cost a fraction of that — Zstandard compression routinely shrinks Kafka data severalfold before upload.
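The questions above can be turned into a rough per-tier rule. The Python sketch below is one way to write that down; `suggest_strategies` and its thresholds (a 60-second RPO, a 30-minute RTO) are illustrative assumptions, not recommendations from any tool.

```python
def suggest_strategies(rpo_seconds, rto_minutes, regulated):
    """Map one topic tier's requirements to a list of candidate strategies."""
    picks = []
    # A short RTO is the only thing that justifies paying for a warm standby
    if rto_minutes < 30:
        picks.append("continuous replication")
    # RPO decides how fresh the backup copy has to be
    if rpo_seconds < 60:
        picks.append("continuous backup")
    else:
        picks.append("scheduled snapshot backup")
    # Compliance adds requirements on top, whatever the numbers say
    if regulated:
        picks.append("immutable storage and restore drills")
    return picks


print(suggest_strategies(rpo_seconds=30, rto_minutes=10, regulated=True))
print(suggest_strategies(rpo_seconds=3600, rto_minutes=240, regulated=False))
```

The first call describes something like a payments tier and returns all three layers; the second describes a clickstream tier and returns only a scheduled snapshot.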
The combinations that come up most often:
| Profile | Recommended combination |
|---|---|
| Financial services, regulated data | PITR backups + immutable storage + quarterly restore drills |
| E-commerce, low tolerance for downtime | Continuous replication for fail-over + daily cluster-wide backup |
| IoT / high-volume telemetry | Topic-level export to object storage with lifecycle tiering |
| Internal platform, mixed criticality | Cluster-wide backup for all topics + replication for tier-1 only |
Defense in depth is the pattern behind all of these: replication answers the availability question, backup answers the recoverability question, and the strategies are cheap to layer because they share nothing.
Scheduling and retention: the policy layer
Whichever strategy you choose, two policies turn it from a script into a system: when backups run, and how long they live.
Scheduling. Snapshot backups pair naturally with a scheduler. Setting
stop_at_current_offsets: true makes a backup capture the high watermark of
every partition at start, back up to exactly that point, and exit — ideal for
a nightly cron job or a Kubernetes CronJob. Continuous mode
(continuous: true) replaces the schedule entirely: the process streams
records to storage as they arrive and checkpoints its progress, so RPO drops
from "since last night" to seconds. From v0.13.5, adding an offset_storage
section makes even scheduled snapshots incremental — each run resumes from
where the previous one stopped instead of re-reading from earliest.
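The snapshot and incremental behaviour described above can be modelled in a few lines of Python. This is a toy in-memory sketch of the idea, not the tool's implementation: `snapshot_backup`, the list-based log, and the `checkpoints` dictionary are all hypothetical stand-ins.

```python
def snapshot_backup(log, checkpoints):
    """Copy each partition from its checkpoint up to the high watermark.

    log:         {partition: [records]}, index in the list = offset
    checkpoints: {partition: next offset to back up}, updated in place
    """
    # Capture every high watermark once, at start; anything produced
    # after this point belongs to the next run
    high_watermarks = {p: len(records) for p, records in log.items()}

    copied = {}
    for p, hw in high_watermarks.items():
        start = checkpoints.get(p, 0)   # no checkpoint means earliest
        copied[p] = log[p][start:hw]
        checkpoints[p] = hw             # the next run resumes here
    return copied


log = {0: ["a", "b", "c"], 1: ["x"]}
checkpoints = {}

first = snapshot_backup(log, checkpoints)
log[0].append("d")                      # produced between the two runs
second = snapshot_backup(log, checkpoints)

print(first)   # {0: ['a', 'b', 'c'], 1: ['x']}
print(second)  # {0: ['d'], 1: []}
```

Without the persisted checkpoints, the second run would start from offset 0 again and copy `a`, `b`, and `c` a second time.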
Retention. Backup retention is a policy decision, not a storage default. Set it from two directions: the minimum your compliance regime requires (often years for financial or health data) and the maximum your privacy obligations allow. Object storage lifecycle rules do the mechanical work — keep recent backups in a hot tier for fast restore, transition older ones to infrequent-access or archive classes, and expire what nothing requires you to keep. Because backups are compressed before upload, even multi-year retention is usually a rounding error next to the cluster's own cost.
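For S3, the tiering described above comes down to a single lifecycle rule. The Python sketch below only builds the rule as a dictionary, in the shape boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; it does not call AWS. The `lifecycle_rule` helper, the rule ID, and the day counts are placeholders to replace with your own policy.

```python
def lifecycle_rule(prefix, ia_after_days, archive_after_days, expire_after_days):
    """Build one S3 lifecycle rule: hot tier, infrequent access, archive, expiry."""
    if not 0 < ia_after_days < archive_after_days < expire_after_days:
        raise ValueError("tiers must be ordered: hot < infrequent < archive < expiry")
    return {
        "ID": "kafka-backup-tiering",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_after_days, "StorageClass": "GLACIER"},
        ],
        # Upper bound set by what nothing requires you to keep
        "Expiration": {"Days": expire_after_days},
    }


rule = lifecycle_rule("backups/production/", 30, 90, 7 * 365)
config = {"Rules": [rule]}
print(config)
```

Keeping the rule in code, next to the backup configuration, is what makes the retention policy reviewable instead of a setting someone once clicked in a console.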
The point of the policy layer is that it is written down and versioned. A schedule that lives in one engineer's crontab and a retention rule that lives in nobody's head are how backup systems rot.
Kafka backup tools and what they implement
| Tool | Strategy | Offsets preserved | PITR | Storage targets | Licensing |
|---|---|---|---|---|---|
| Kafka Connect S3 Sink | Topic-level export | No | No | S3 | Open source |
| MirrorMaker 2 | Continuous replication | Translated, with lag | No | Second Kafka cluster | Open source (Apache) |
| Confluent Replicator | Continuous replication | Yes | No | Second Kafka cluster | Commercial |
| OSO Kafka Backup | Cluster-wide backup + PITR | Yes, built in | Yes, millisecond precision | S3, Azure Blob, GCS, filesystem | Open source (MIT) |
A few notes the table cannot carry:
- Kafka Connect S3 Sink is excellent at what it does — durable, well-understood archiving of topic data to S3. It simply was not designed to be a restore path, so pair it with something that is.
- MirrorMaker 2 ships with Kafka, costs nothing to license, and handles offset translation between clusters. Running it well takes real operational work; see the comparison against purpose-built backup for the trade-offs.
- OSO Kafka Backup is a Rust tool built for backup specifically: compressed (Zstandard or LZ4) segments in object storage, consumer offset preservation, point-in-time restore, and a Kubernetes operator that is Strimzi compatible. The core is MIT-licensed; Schema Registry sync, encryption, RBAC, and audit logging are Enterprise features.
Best practices that make any strategy work
The strategy you pick matters less than the discipline around it. Four habits separate backups that restore from backups that merely exist:
- Verify by restoring. Schedule automated restore tests into a scratch cluster. The validate-restore command and dry_run: true make this cheap enough to run daily.
- Monitor lag. Alert when kafka_backup_lag_records exceeds your RPO budget; a backup job that died on Friday should page you before Monday.
- Back up config as code. Keep every backup.yaml in Git and ship changes through CI, so a typo cannot silently disable a nightly job.
- Write the runbook now. The engineer restoring at 3 a.m. should copy-paste, not compose.
Each of these is expanded, with alert thresholds and testing cadences, in the Kafka backup best practices guide.
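The lag alert in the list above needs a threshold in records, while RPO is stated in time. A hedged Python sketch of the conversion follows: the helper names are hypothetical, and the peak produce rate is an assumption you would measure on your own cluster.

```python
def lag_budget_records(rpo_seconds, peak_records_per_second):
    """Translate an RPO in seconds into a maximum tolerable backup lag in records."""
    return int(rpo_seconds * peak_records_per_second)


def should_page(lag_records, rpo_seconds, peak_records_per_second):
    """True when the observed backup lag no longer fits inside the RPO."""
    return lag_records > lag_budget_records(rpo_seconds, peak_records_per_second)


# Example: 5-minute RPO on a topic that peaks at 2,000 records per second
print(lag_budget_records(300, 2000))    # 600000
print(should_page(750_000, 300, 2000))  # True
print(should_page(120_000, 300, 2000))  # False
```

Sizing the budget against the peak rate rather than the average keeps the alert honest during the traffic spikes when a stalled backup hurts most.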
Your first Kafka backup in 15 minutes
Theory ends here. A snapshot backup of one topic to S3 takes one YAML file and one command.
Create backup.yaml:
mode: backup
backup_id: "first-backup-orders"
source:
  bootstrap_servers:
    - kafka:9092
  topics:
    include:
      - orders
storage:
  backend: s3
  bucket: my-kafka-backups
  region: us-west-2
  prefix: backups/production
backup:
  compression: zstd
  # Snapshot mode: capture current high watermarks, back up to them, exit
  stop_at_current_offsets: true
  consumer_group_snapshot: true
Run it:
kafka-backup backup --config backup.yaml
Then confirm the backup is really there and really restorable:
# List backups in the storage backend
kafka-backup list --config backup.yaml
# Validate integrity of what was written
kafka-backup validate --config backup.yaml --backup-id first-backup-orders
From here the upgrade path is incremental: add more topics to include, set
continuous: true for streaming backup, add an offset_storage section for
resumable incremental runs, and schedule restore tests. The
first backup tutorial covers each step, and
the configuration reference documents every option
used above. If S3 is not your target, the same config works against
Azure Blob, GCS, or a filesystem by switching the
storage.backend.
Start before you need it
A Kafka backup strategy is insurance you design while calm and cash in while panicking — the quality of the second moment is set entirely by the first. Start with the simplest thing that covers your worst realistic failure: a scheduled cluster-wide backup to object storage. Layer replication when downtime tolerance demands it, and PITR when logical errors would hurt.
Then test the restore. A backup you have restored is a capability; anything else is a hope.
Frequently asked questions
How do you take a backup of a Kafka topic?
Consume the topic and write its records to independent storage. With OSO Kafka Backup, you declare the topic in a YAML config with an S3, Azure Blob, GCS, or filesystem storage backend and run kafka-backup backup --config backup.yaml. The backup captures records, topic configuration, and consumer group offsets.
What is the difference between Kafka replication and Kafka backup?
Replication keeps live copies of partitions on multiple brokers for availability — it protects against hardware failure but propagates deletions and corruption instantly. Backup writes a separate historical copy to independent storage for recoverability, so you can restore data after human error, logical corruption, or total cluster loss.
Can you do point-in-time recovery with Kafka?
Not with Kafka alone — the broker has no built-in restore mechanism. Backup tools that preserve timestamps and offsets add this: OSO Kafka Backup restores a topic to a specific time window using time_window_start and time_window_end in Unix milliseconds, so you can recover the state before a bad deploy or accidental delete.
How often should you back up Kafka?
Derive the frequency from your RPO. If losing an hour of data is acceptable, hourly snapshot backups suffice. If the tolerance is seconds, run continuous backup mode, which streams records to storage as they arrive. Most teams tier this: continuous for critical topics, daily snapshots for the rest.
What is the best tool for Kafka backup?
It depends on the strategy. Kafka Connect S3 Sink handles simple topic archiving, MirrorMaker 2 handles cluster-to-cluster replication, and OSO Kafka Backup handles cluster-wide backup with consumer offset preservation and millisecond point-in-time recovery to S3, Azure Blob, GCS, or a filesystem. Many production setups combine replication with a dedicated backup tool.
Ready to design yours? Take your first backup in minutes, or start from the disaster recovery architectures if replication is already in place.