Kafka Backup Best Practices: 10 Rules for Production Data Protection
Kafka backup best practices come down to one principle: a backup you have not restored is a hope, not a backup. Retention deletes your data on schedule, replication copies your mistakes in real time, and neither can return a topic to the state it was in before an incident. These 10 rules turn Kafka backups from a checkbox into something you can bet an on-call shift on.
Prioritize rules 2 and 3: automated restore verification and backup lag monitoring. They catch the two failure modes that actually burn teams: backups that cannot be restored, and backups that silently stopped working.
1. Treat backup as code
Backup configuration belongs in version control next to the applications it
protects. A backup.yaml reviewed in a pull request is auditable; a config
hand-edited on a VM is a mystery six months later.
- Store backup configurations in Git
- Provision backup infrastructure with Terraform or Helm, not the console
- Ship config changes through CI so a typo cannot silently disable a nightly job
If you run Kafka on Kubernetes, the backup operator takes this further: backup schedules become custom resources that GitOps tools reconcile like any other manifest.
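The CI gate from the list above can be sketched as a small validation step. This is a minimal sketch, assuming the YAML has already been parsed into a dict; the keys `mode`, `schedule`, and `topics` and the allowed modes are hypothetical, not the actual schema of any backup tool.

```python
# Hypothetical schema: these key names are illustrative assumptions,
# not the real configuration format of any specific backup tool.
REQUIRED_KEYS = {"mode", "schedule", "topics"}
ALLOWED_MODES = {"backup", "restore"}


def validate_backup_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    present = set(config)
    for key in sorted(REQUIRED_KEYS - present):
        problems.append(f"missing required key: {key}")
    for key in sorted(present - REQUIRED_KEYS):
        # Unknown keys are often typos of real ones (e.g. "schedul").
        problems.append(f"unknown key: {key}")
    if "mode" in config and config["mode"] not in ALLOWED_MODES:
        problems.append(f"invalid mode: {config['mode']!r}")
    if "topics" in config and not config["topics"]:
        problems.append("topics list is empty; nothing would be backed up")
    return problems
```

Failing the pipeline whenever the returned list is non-empty means a misspelled key blocks the merge instead of quietly disabling the job.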
2. Verify backups automatically — trust, but verify
The most common backup failure is discovered during an outage: backups ran for months, but nobody ever restored one. Schedule automated restore tests, weekly at minimum, into a scratch cluster or isolated namespace.
Validate three things on every test:
- Record counts match between the source topic and the restored topic
- Offsets are continuous — no gaps or overlaps at segment boundaries
- Consumers can resume from restored consumer group offsets
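The offset-continuity check is the easiest of the three to get subtly wrong, so here is a rough sketch of it. It assumes each backup segment of one partition is described by a hypothetical `(first_offset, last_offset)` pair with inclusive bounds, and that the topic is neither compacted nor transactional (both legitimately leave holes in the offset sequence).

```python
def find_offset_problems(segments: list[tuple[int, int]]) -> list[str]:
    """Check (first_offset, last_offset) ranges of one partition.

    Offsets are inclusive, so each segment should start exactly one
    offset after the previous segment ended.
    """
    problems = []
    ordered = sorted(segments)
    for (_, prev_last), (next_first, _) in zip(ordered, ordered[1:]):
        expected = prev_last + 1
        if next_first > expected:
            problems.append(f"gap: offsets {expected}..{next_first - 1} missing")
        elif next_first < expected:
            problems.append(f"overlap: offsets {next_first}..{prev_last} duplicated")
    return problems


def record_count(segments: list[tuple[int, int]]) -> int:
    """Total records implied by the segment ranges, for comparing with the source."""
    return sum(last - first + 1 for first, last in segments)
```

A restore test would run this per partition and fail if any problem is reported or if `record_count` disagrees with the count taken from the source topic.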
The restore mode supports dry_run: true, which validates a backup against
the target cluster without producing a single record — cheap enough to run
daily:
mode: restore
backup_id: "prod-backup-latest"
restore:
  dry_run: true
3. Monitor backup lag and health continuously
A backup job that dies on Friday night should page someone before Monday. OSO Kafka Backup exposes Prometheus metrics for exactly this:
| Metric | What to alert on |
|---|---|
| kafka_backup_lag_records | Lag exceeding your RPO budget |
| kafka_backup_errors_total | Any sustained non-zero rate |
| kafka_backup_records_total | Rate dropping to zero mid-window |
| kafka_backup_compression_ratio | Sudden shifts (often a payload change upstream) |
The full list is in the metrics reference. Wire the lag metric to your alerting with a threshold derived from your RPO — not a number picked in a hurry.
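One way to derive that threshold is to convert the RPO from seconds into records. The sketch below is an assumption about how you might do the arithmetic, not part of the tool: the function name, the `headroom` factor, and the idea of feeding it a measured throughput are all illustrative.

```python
def lag_threshold_records(rpo_seconds: float, records_per_second: float,
                          headroom: float = 0.5) -> int:
    """Translate an RPO into a record-lag alert threshold.

    A lag of N records represents N / rate seconds of unprotected data.
    Use a conservative (low) throughput estimate: at lower rates the same
    record lag represents more seconds of data. Alerting at a fraction
    (headroom) of the RPO leaves time to react before it is breached.
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    if rpo_seconds <= 0 or records_per_second <= 0:
        raise ValueError("rpo_seconds and records_per_second must be positive")
    return int(rpo_seconds * records_per_second * headroom)
```

For a 5-minute RPO on a topic that sustains at least 2,000 records per second, `lag_threshold_records(300, 2000)` gives a threshold of 300,000 records, which alerts at half the budget.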
4. Define recovery objectives before you need them
Two numbers drive every backup decision:
- RPO (recovery point objective) — how much data you can afford to lose. This sets backup frequency, or pushes you to continuous mode.
- RTO (recovery time objective) — how long you can be down. This sets your restore method, parallelism, and where backups physically live.
Map every critical topic to an RPO/RTO tier, write the mapping down, and review it quarterly. A payments topic and a clickstream topic should not share a policy. For the architecture side of this decision, see the disaster recovery use cases.
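Writing the mapping down can be as simple as a small, reviewable data structure. The tier names, the numbers, and the topic names below are hypothetical placeholders to show the shape, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTier:
    name: str
    rpo_seconds: int  # maximum tolerable data loss
    rto_seconds: int  # maximum tolerable downtime


# Hypothetical tiers and numbers, for illustration only.
TIERS = {
    "critical": RecoveryTier("critical", rpo_seconds=60, rto_seconds=900),
    "standard": RecoveryTier("standard", rpo_seconds=3600, rto_seconds=14400),
}

# Hypothetical topic names.
TOPIC_TIERS = {
    "payments": "critical",
    "clickstream": "standard",
}


def tier_for(topic: str) -> RecoveryTier:
    """Look up a topic's tier; fail loudly for unmapped topics."""
    try:
        return TIERS[TOPIC_TIERS[topic]]
    except KeyError:
        raise KeyError(f"topic {topic!r} has no RPO/RTO tier assigned") from None
```

Raising on unmapped topics is deliberate: a new topic without a tier should show up as a failure in review, not inherit a default nobody chose.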
5. Back up offsets and metadata, not just messages
A topic restore that loses consumer group offsets forces every consumer to choose between reprocessing everything and skipping to latest — both are incidents of their own. Message data alone is roughly half a backup. Capture:
- Consumer group offsets, so processing resumes where it stopped
- Topic configurations — partitions, retention, cleanup policy
- Schemas, so downstream consumers can still deserialize what you restored
- ACLs, so security posture survives the restore
OSO Kafka Backup captures offsets and topic configuration as part of every backup, and keeps offsets consistent with the restored data during point-in-time recovery.
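Whatever tooling you use, a restore test can assert that every category in the list above is actually present. This sketch assumes a hypothetical backup manifest loaded as a dict; the section names are made up for illustration and will differ in a real tool.

```python
# Hypothetical manifest section names; a real backup tool will use its own.
REQUIRED_SECTIONS = ("messages", "consumer_offsets", "topic_configs", "schemas", "acls")


def missing_sections(manifest: dict) -> list[str]:
    """Return the required sections that are absent or empty in a manifest."""
    return [name for name in REQUIRED_SECTIONS if not manifest.get(name)]
```

An empty result means all five categories were captured; anything else names exactly what the restore would be missing.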
6. Encrypt backup data at rest and in transit
Backups concentrate months of your most valuable data into one bucket — treat them with at least the rigor of the cluster itself.
- Enable server-side encryption on the storage target (SSE-S3 or SSE-KMS on Amazon S3, service-managed keys on Azure and GCS)
- Use TLS between the backup process and both the brokers and the object store
- Keep encryption keys in a KMS with its own access policy, so a Kafka credential leak does not also expose the archive
7. Control costs without weakening the safety net
Backup storage costs are predictable and controllable — unlike the cost of losing the data.
- Compress. Backups are compressed with Zstandard or LZ4 before upload (compression: zstd in the backup config), independent of producer-side compression.
- Tier. Keep recent backups hot for fast restore; move older ones to infrequent-access or archive classes with bucket lifecycle rules.
- Expire. Retention policies should delete what compliance no longer requires — storage you forgot about is pure waste.
Storage layout details are in the storage format reference, which is what lifecycle rules operate against.
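To see how compression, tiering, and expiry interact, a back-of-the-envelope estimate helps. The function below is a simplified sketch: it assumes a steady daily volume, only two storage classes, and per-GB monthly prices that you supply from your own provider's price list; it ignores request and retrieval fees.

```python
def monthly_storage_cost(raw_gb_per_day: float, compression_ratio: float,
                         hot_days: int, retention_days: int,
                         hot_price_gb_month: float,
                         cold_price_gb_month: float) -> float:
    """Estimate steady-state monthly storage cost for tiered, compressed backups.

    compression_ratio is raw size / compressed size (4.0 means 4:1).
    Backups stay in the hot tier for hot_days, then sit in the cold tier
    until they expire at retention_days.
    """
    if compression_ratio <= 0 or retention_days < hot_days:
        raise ValueError("invalid compression ratio or retention window")
    stored_gb_per_day = raw_gb_per_day / compression_ratio
    hot_gb = stored_gb_per_day * hot_days
    cold_gb = stored_gb_per_day * (retention_days - hot_days)
    return hot_gb * hot_price_gb_month + cold_gb * cold_price_gb_month
```

Running it with a few different values of `hot_days` and `retention_days` shows quickly which lever matters most for your volume.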
8. Meet data retention regulations deliberately
If your topics carry regulated data, backups are in scope too:
- Map topics to their regimes (GDPR, HIPAA, SOX, PCI-DSS) and set backup retention to match — both minimums and maximums
- Use object-lock or immutable storage for audit-relevant backups
- Automate deletion when retention windows close; manual cleanup does not survive staff turnover
- Keep restore procedures documented — auditors ask for evidence that recovery works, not just that backups exist
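Automated deletion needs to respect both ends of the retention window. The following is a minimal sketch of that decision, assuming each backup has a known creation time; the function name and the three labels are illustrative, and the actual minimums and maximums must come from your compliance requirements.

```python
from datetime import datetime, timedelta, timezone


def retention_action(created_at: datetime, now: datetime,
                     min_days: int, max_days: int) -> str:
    """Classify a backup against a retention window.

    "locked" - younger than the minimum; must not be deleted
    "keep"   - past the minimum, not yet at the maximum
    "delete" - at or past the maximum; must be removed
    """
    if min_days > max_days:
        raise ValueError("min_days must not exceed max_days")
    age = now - created_at
    if age < timedelta(days=min_days):
        return "locked"
    if age < timedelta(days=max_days):
        return "keep"
    return "delete"
```

A scheduled job can call this for every backup and act only on the "delete" results, which keeps the policy in one reviewed place.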
9. Test disaster recovery quarterly
Backup verification (rule 2) proves the data is restorable. A DR test proves your organization can restore it: the runbook is current, the credentials work, the on-call engineer knows which cluster to target, and the application teams can validate their services afterward.
Run a full failover drill quarterly. Measure the RTO and RPO you actually achieved against the targets from rule 4, and fix the gap — in tooling or in targets — after every drill.
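Measuring the drill is simple arithmetic on three timestamps, sketched below. The function and field names are hypothetical; the definitions follow rule 4: achieved RPO is the data lost before the incident, achieved RTO is the time until service came back.

```python
from datetime import datetime, timedelta


def drill_report(incident_at: datetime, last_restored_record_at: datetime,
                 service_restored_at: datetime,
                 rpo_target: timedelta, rto_target: timedelta) -> dict:
    """Compare what a drill achieved against its RPO/RTO targets."""
    achieved_rpo = incident_at - last_restored_record_at  # data lost
    achieved_rto = service_restored_at - incident_at      # time down
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }
```

Recording this report after every drill gives you a trend line, which makes it obvious whether the gap is closing or the targets need revisiting.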
10. Write runbooks for the engineer at 3 a.m.
The person running a restore under pressure should never compose a config from memory. Good runbooks contain:
- Pre-validated, copy-paste commands: kafka-backup restore --config restore-payments.yaml, with the config already in Git
- A decision tree: partial topic restore vs. full recovery vs. point-in-time rollback
- Escalation paths and the list of application owners to notify
- Links to the dashboards from rule 3, so progress is observable
Start with the first backup tutorial as a template and extend it with your environment's specifics.
Where to start
Do not attempt all ten at once. Add backup lag alerting today (rule 3), schedule a weekly automated restore test this week (rule 2), and write the RPO/RTO map next sprint (rule 4). The rest layer on from there.
A backup you test is a backup you can trust — everything else on this list exists to make that testing routine instead of heroic.
Frequently asked questions
What are the best practices for backing up Kafka?
Version backup configuration in Git, verify restores automatically on a schedule, monitor backup lag with alerting, define RPO and RTO per topic, capture consumer offsets and metadata alongside messages, encrypt backups, control storage costs with compression and lifecycle tiers, and run quarterly disaster recovery drills.
How often should you test Kafka backups?
Run automated restore verification at least weekly, and a dry-run validation daily if your tooling supports it. Full disaster recovery drills involving failover and application teams should run quarterly.
How do you monitor Kafka backup health?
Track backup lag in records, error counts, and throughput via Prometheus metrics such as kafka_backup_lag_records and kafka_backup_errors_total. Alert when lag exceeds your RPO budget or when the error rate is sustained above zero.
What metadata should be included in Kafka backups?
Consumer group offsets, topic configurations (partition counts, retention, cleanup policy), schemas, and ACLs. Without offsets, consumers must reprocess or skip data after a restore; without configs and schemas, the restored topic may not behave like the original.
How do you reduce Kafka backup storage costs?
Compress backup data with Zstandard or LZ4 before upload, move older backups to infrequent-access or archive storage classes with lifecycle policies, and expire backups automatically once retention requirements lapse.
Ready to put these into practice? Take your first backup in minutes, or see how backup fits alongside replication in our MirrorMaker 2 comparison.