Reliability

The ability to consistently perform backup and restore operations correctly, recover from failures gracefully, and meet defined RPO and RTO targets under all conditions.

Reliability is the foundation of any backup system. A backup that cannot be restored is worse than no backup at all — it creates a false sense of safety. The Reliability pillar ensures that every backup is validated, every restore is rehearsed, and every failure scenario has a documented, tested recovery path.

Design Principles

  1. Automatically recover from failure — Use checkpoints and incremental resume so that a failed backup picks up where it left off rather than starting from scratch.
  2. Test recovery procedures, not just backup creation — A backup is only as good as its last successful restore. Validate regularly.
  3. Scale horizontally to handle partition growth — As topics gain partitions or new topics are added, the backup infrastructure must scale without manual intervention.
  4. Manage change through automation — Use GitOps workflows and Kubernetes CRDs to version, review, and roll out configuration changes predictably.
  5. Design for zero data loss — Understand and configure your RPO targets explicitly; do not rely on defaults to meet business requirements.
  6. Implement fault isolation — One partition failure should not affect the backup of other partitions. Isolate failure domains so blast radius is minimised.

Best Practices

REL-01: Backup Integrity & Validation

Every backup must be validated to confirm it can be used for a successful restore. Silent corruption, incomplete segments, or missing offsets can render a backup useless when it is needed most.

What to Validate

| Check | Description |
| --- | --- |
| Manifest completeness | All expected topics and partitions are present in the backup manifest |
| Segment integrity | Checksums match and compressed segments can be decompressed without errors |
| Offset continuity | No gaps in offset sequences within each partition |
| Time window coverage | Backup spans the expected time range without holes |
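
The checks above can be sketched in code. Below is a minimal, illustrative validator for one partition's segment chain; the `Segment` record (`base_offset`, `last_offset`, `sha256`) is a simplified assumption, not the tool's actual manifest format:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Segment:
    """One backed-up segment (illustrative shape, not the real manifest)."""
    base_offset: int
    last_offset: int
    sha256: str
    data: bytes

def validate_partition(segments: list[Segment]) -> list[str]:
    """Return a list of problems found in one partition's segment chain."""
    problems = []
    ordered = sorted(segments, key=lambda s: s.base_offset)
    for i, seg in enumerate(ordered):
        # Segment integrity: recompute the checksum over the raw bytes.
        if hashlib.sha256(seg.data).hexdigest() != seg.sha256:
            problems.append(f"checksum mismatch at base offset {seg.base_offset}")
        # Offset continuity: each segment must start right after the previous one ends.
        if i > 0 and seg.base_offset != ordered[i - 1].last_offset + 1:
            problems.append(f"offset gap before base offset {seg.base_offset}")
    return problems
```

The same two loops generalise to time-window coverage: sort by timestamp and assert each segment's window abuts the next.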

Why It Matters

Backup operations can report success (exit code 0) while producing incomplete or corrupt output. Network interruptions, storage throttling, or transient Kafka errors can result in partial writes that are only detectable through explicit validation.

Implementation

  • Run kafka-backup validate --deep after every backup operation.
  • Automate validation in your CI/CD pipeline or as a post-backup Kubernetes Job.
  • Periodically perform a full restore-to-temporary-cluster validation to confirm end-to-end recoverability.

Tip: Schedule a weekly automated restore to a temporary cluster. This catches issues that static validation cannot detect, such as schema compatibility problems or consumer group restoration failures.

Configuration

Deep validation of a backup:

kafka-backup validate \
--path s3://my-kafka-backups/production \
--backup-id 2026-03-24T00-00-00Z \
--deep

Describe backup metadata for programmatic checks:

kafka-backup describe \
--path s3://my-kafka-backups/production \
--backup-id 2026-03-24T00-00-00Z \
--format json
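
The JSON output can then be compared against the topics and partition counts you expect. The manifest shape assumed below (`topics`, `name`, `partitions`) is purely illustrative; consult the actual `kafka-backup describe` output for the real schema:

```python
import json

def missing_partitions(manifest_json: str, expected: dict[str, int]) -> list[str]:
    """Compare a backup manifest against expected partition counts per topic.

    Assumes a hypothetical manifest layout: {"topics": [{"name": ..., "partitions": [...]}]}.
    """
    manifest = json.loads(manifest_json)
    found = {t["name"]: len(t["partitions"]) for t in manifest["topics"]}
    missing = []
    for topic, count in expected.items():
        have = found.get(topic, 0)
        if have < count:
            missing.append(f"{topic}: expected {count} partitions, manifest has {have}")
    return missing
```

A non-empty result should fail the post-backup job so the gap is caught immediately, not during an incident.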

Anti-Patterns

  • Assuming success from exit code alone — A zero exit code means the process completed, not that the backup is complete or uncorrupted.
  • Never validating until a real restore is needed — Discovering corruption during a production incident turns a recoverable situation into a crisis.
  • Validating only the latest backup — Older backups may have degraded in storage; periodic re-validation catches bit rot and storage issues.

REL-02: Point-in-Time Recovery Strategy

Point-in-time recovery (PITR) allows you to restore data to a specific moment, not just the latest backup. This is critical for recovering from data corruption, accidental deletes, or application bugs that produced bad data.

What to Define

Before implementing PITR, answer these questions:

  • What granularity is needed? — Can you tolerate restoring to the nearest hour, or do you need minute-level precision?
  • What is the maximum acceptable backup window? — The gap between the last backup and the failure determines potential data loss.
  • Which topics need PITR? — Not every topic requires the same recovery granularity.

Why It Matters

| Backup Frequency | Achievable RPO | Use Case |
| --- | --- | --- |
| Continuous | < 1 minute | Payment transactions, order streams |
| Hourly | < 1 hour | User events, session data |
| Daily | < 24 hours | Logs, analytics, non-critical streams |

Implementation

Configure time windows in your restore configuration using epoch milliseconds. This allows precise recovery to the exact moment before corruption or data loss occurred.

Warning: Epoch timestamps must be in milliseconds, not seconds. A common mistake is supplying a 10-digit Unix timestamp (seconds) where a 13-digit millisecond value is expected; interpreted as milliseconds, a 10-digit value points at January 1970, so the time window matches no data.
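
A small helper makes the conversion explicit and guards against the seconds-vs-milliseconds mistake (a sketch in Python; the tool itself takes the numeric value):

```python
from datetime import datetime

def epoch_ms(iso_timestamp: str) -> int:
    """Convert an ISO-8601 UTC timestamp to epoch milliseconds (13 digits, not 10)."""
    dt = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

# A 13-digit value is milliseconds; a 10-digit value would be seconds.
assert len(str(epoch_ms("2026-03-23T15:00:00Z"))) == 13
```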

Configuration

PITR restore configuration:

restore:
  source:
    path: s3://my-kafka-backups/production
    backup_id: 2026-03-24T00-00-00Z
  target:
    bootstrap_servers:
      - kafka-restore:9092
  options:
    time_window_start: 1774278000000 # 2026-03-23T15:00:00Z
    time_window_end: 1774281600000   # 2026-03-23T16:00:00Z
  topic_mapping:
    - source: orders
      target: orders-restored
    - source: payments
      target: payments-restored

Anti-Patterns

  • Daily backups when RPO is 1 hour — The backup frequency must match or exceed the RPO requirement. A daily backup cannot deliver an hourly RPO.
  • Not understanding epoch timestamps — Misconfigured time windows restore the wrong data range, wasting time during an incident.
  • No topic-level RPO classification — Treating all topics with the same backup frequency wastes resources on low-value data and under-protects high-value data.

REL-03: Consumer Offset Recovery

Restoring messages is only half the battle. If consumer offsets are not recovered correctly, applications will either reprocess data (duplicates) or skip data (loss). Offset recovery must be planned as part of every restore operation.

What to Understand

kafka-backup supports multiple offset recovery strategies:

| Strategy | Description | Best For |
| --- | --- | --- |
| Timestamp-based | Reset offsets to a specific timestamp | PITR restores |
| Offset-based (header) | Use original offset headers embedded in backup | Exact replay |
| Group-based | Restore committed consumer group offsets | Resuming existing consumers |
| Cluster-scan | Scan target cluster to determine appropriate offsets | Cross-cluster migration |
| Manual | Specify exact offsets per partition | Surgical recovery |

Why It Matters

Incorrect offset recovery is the most common cause of post-restore issues. Applications may appear healthy but silently skip records or reprocess hours of data, causing downstream inconsistencies.

Implementation

Always use the two-phase approach: plan first, then execute.

  1. Generate a plan to review before applying changes.
  2. Review the plan to confirm offsets are correct.
  3. Execute the plan to apply offset resets.

Tip: The plan phase is non-destructive and produces a reviewable output. Always inspect the plan before executing, especially during incident recovery when mistakes are costly.

Configuration

Generate an offset reset plan:

kafka-backup offset-reset plan \
--config restore-config.yaml \
--strategy timestamp \
--timestamp 1711238400000 \
--output offset-plan.json

Review the plan:

jq '.partitions[] | {topic, partition, current_offset, new_offset}' offset-plan.json

Execute the offset reset:

kafka-backup offset-reset execute \
--plan offset-plan.json \
--config restore-config.yaml

Anti-Patterns

  • Restoring data without considering consumer offsets — Messages are restored but consumers either skip them entirely or reprocess old data.
  • Blindly resetting to earliest or latest — earliest causes full reprocessing; latest skips all restored data. Neither is appropriate without understanding the restore context.
  • Skipping the plan phase — Executing offset resets without review during a high-pressure incident leads to compounding errors.

REL-04: Disaster Recovery Planning

A disaster recovery plan defines how you will restore Kafka data when the worst happens. Without explicit RPO and RTO targets per data tier, recovery is ad-hoc and unpredictable.

What to Define

Classify topics into tiers based on business impact and assign RPO/RTO targets:

| Tier | Examples | RPO | RTO |
| --- | --- | --- | --- |
| Tier 1 — Critical | Payments, orders, financial transactions | < 1 minute | < 15 minutes |
| Tier 2 — Important | User events, session data, notifications | < 1 hour | < 1 hour |
| Tier 3 — Standard | Logs, metrics, analytics streams | < 24 hours | < 4 hours |
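
The tier targets translate directly into a backup-frequency check: worst-case data loss equals the interval between backups, so the interval must not exceed the tier's RPO. A minimal sketch with illustrative tier names and the targets from the table above:

```python
# Hypothetical tier policy mirroring the tier table; names are illustrative.
RPO_TARGETS_MS = {
    "tier1": 60_000,        # < 1 minute
    "tier2": 3_600_000,     # < 1 hour
    "tier3": 86_400_000,    # < 24 hours
}

def frequency_meets_rpo(tier: str, backup_interval_ms: int) -> bool:
    """Worst-case loss is one full backup interval, so the interval
    must be no longer than the tier's RPO target."""
    return backup_interval_ms <= RPO_TARGETS_MS[tier]
```

This is the check behind the REL-02 anti-pattern: a daily interval (86,400,000 ms) can never satisfy a one-hour RPO.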

Why It Matters

Without defined tiers, teams either over-invest in protecting low-value data or under-invest in protecting critical streams. Explicit targets drive backup frequency, storage redundancy, and recovery automation decisions.

Implementation

Document and rehearse a six-phase DR procedure:

  1. Detection — Automated alerting identifies the failure (monitoring, health checks, anomaly detection).
  2. Declaration — On-call engineer or automated runbook declares a disaster based on predefined criteria.
  3. Communication — Stakeholders are notified via defined channels (PagerDuty, Slack, status page).
  4. Execution — Restore operations are initiated following the documented runbook.
  5. Validation — Restored data is verified against integrity checks and consumer applications are confirmed healthy.
  6. Fallback — If restoration fails, execute the fallback plan (alternative backup, manual recovery, or degraded mode).

Warning: DR documentation must be accessible during an outage. If your runbooks are stored on the same infrastructure that has failed, you cannot access them when you need them most. Keep copies in at least two independent locations.

Anti-Patterns

  • No defined RPO/RTO targets — Without targets, there is no way to measure whether your backup strategy is adequate or whether recovery was successful.
  • DR plan exists but has never been tested — An untested plan is an assumption, not a plan.
  • DR documentation stored only on the failing system — Wiki on the same cloud region, runbooks in the same Kubernetes cluster, or playbooks in the same Git hosting provider.
  • Single-person DR knowledge — If only one engineer knows how to restore, your RTO depends on their availability.

REL-05: Fault Isolation & Redundancy

Backup infrastructure must be isolated from the systems it protects. A failure that takes down your Kafka cluster should not also take down your ability to restore from backup.

What to Implement

| Isolation Domain | Recommendation |
| --- | --- |
| Storage location | Different account, region, or cloud provider from the source cluster |
| Compute | Dedicated backup infrastructure, not co-located on broker nodes |
| Network | Separate failure domain; backup process reachable even if source network is degraded |
| Partition-level | Per-partition fault isolation — one partition failure does not block others |

Why It Matters

If backups are stored in the same region and account as the source cluster, a single cloud incident (region outage, account compromise, IAM misconfiguration) can destroy both production data and all backups simultaneously.

Implementation

  • Enable S3 cross-region replication to maintain backup copies in a separate region.
  • Run backup processes on dedicated infrastructure, not on Kafka broker nodes.
  • Use per-partition checkpointing so a failure in one partition allows others to continue.
  • Implement checkpoint-based resume so interrupted backups pick up where they left off.
  • For ransomware protection, maintain air-gapped backups using S3 Object Lock or equivalent immutable storage.
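
Per-partition isolation and checkpoint-based resume combine naturally: each partition tracks its own checkpoint, and a failure in one partition is recorded without interrupting the rest. The sketch below is illustrative only, not the tool's actual checkpoint mechanism:

```python
def backup_partitions(partitions, checkpoints, read_from):
    """Back up each partition independently, resuming from its own checkpoint.

    partitions:  iterable of partition ids
    checkpoints: dict of partition id -> last offset safely written
    read_from:   callable (partition, start_offset) -> list of records; may raise
    Returns (updated checkpoints, partitions that failed this run).
    """
    failed = []
    for p in partitions:
        start = checkpoints.get(p, -1) + 1  # resume just after the last checkpoint
        try:
            records = read_from(p, start)
        except Exception:
            # Isolate the fault: record it and keep backing up the other partitions.
            failed.append(p)
            continue
        if records:
            checkpoints[p] = start + len(records) - 1
    return checkpoints, failed
```

On the next run, failed partitions simply retry from their unchanged checkpoints while healthy partitions continue forward.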

Tip: Use a separate AWS account for backup storage. Even if the production account is compromised, the backup account remains isolated. Cross-account IAM roles provide secure access without shared credentials.

Configuration

Cross-region S3 storage with replication:

storage:
  type: s3
  s3:
    bucket: my-kafka-backups-dr
    region: us-west-2 # Different region from source cluster
    endpoint: ""
    force_path_style: false
backup:
  checkpoint_interval: 30s # Frequent checkpoints for resume
  per_partition_isolation: true

Anti-Patterns

  • Backups in the same region and account as the source cluster — A region-wide outage or account compromise destroys both production data and backups.
  • Running backup processes on Kafka broker nodes — Broker failure takes down both the data source and the backup process simultaneously.
  • No redundancy for backup storage — A single storage location is a single point of failure.
  • Single point of failure in backup infrastructure — One backup server, one storage bucket, one network path.

REL-06: DR Testing & Chaos Engineering

A disaster recovery plan is only reliable if it is tested regularly. Chaos engineering validates that your backup infrastructure handles real-world failure scenarios, not just ideal conditions.

What to Test

| Test Type | Frequency | Scope |
| --- | --- | --- |
| Tabletop exercise | Monthly | Walk through DR scenarios with the team; identify gaps in runbooks |
| Single topic restore | Weekly (automated) | Restore one topic to a temporary cluster and validate integrity |
| Full cluster restore | Quarterly | Restore all tiered topics and measure actual RTO |
| Failover drill | Semi-annually | Simulate primary cluster loss and execute full DR procedure |
| Chaos test | Quarterly | Inject failures into backup infrastructure and observe behaviour |

Why It Matters

Without regular testing, your DR plan degrades over time. Infrastructure changes, new topics, configuration drift, and team turnover all erode recovery capabilities. Testing keeps them current.

Implementation

Design chaos scenarios that exercise your failure modes:

| Scenario | What It Tests |
| --- | --- |
| Kill backup process mid-run | Checkpoint resume — does backup continue from last checkpoint? |
| Storage outage (revoke S3 access) | Error handling and retry logic |
| Network partition (block Kafka port) | Graceful degradation and reconnection |
| Corrupt a backup segment | Validation detection — does validate --deep catch it? |
| Full region failure | Cross-region restore from replicated backup |

Warning: Always run chaos tests in a controlled environment with clear rollback procedures. Document the blast radius of each test before executing. Never run destructive chaos tests against production backups without an isolated copy.

Implementation: Document Results

Every DR test must produce a written report including:

  • Actual RTO achieved vs target RTO
  • Actual RPO achieved vs target RPO
  • Issues encountered during the test
  • Action items with owners and deadlines
  • Pass/fail determination against defined success criteria
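
The report's core numbers can be derived mechanically from four timestamps captured during the test. A minimal sketch with illustrative field names:

```python
def dr_test_report(detected_ms, restored_ms, last_backup_ms, failure_ms,
                   rto_target_ms, rpo_target_ms):
    """Derive actual RTO/RPO from test timestamps and compare against targets.

    Field names are hypothetical; capture whatever your incident tooling records.
    """
    actual_rto = restored_ms - detected_ms    # time from detection to recovered service
    actual_rpo = failure_ms - last_backup_ms  # data written after the last backup is lost
    return {
        "actual_rto_ms": actual_rto,
        "actual_rpo_ms": actual_rpo,
        "passed": actual_rto <= rto_target_ms and actual_rpo <= rpo_target_ms,
    }
```

Storing one such record per drill gives the trend data the tip below recommends.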

Tip: Track RTO and RPO measurements over time. Trending data reveals whether your recovery capabilities are improving or degrading, and provides evidence for compliance audits.

Anti-Patterns

  • DR drills only after incidents — Reactive testing means you discover problems during real outages, not before them.
  • Testing in non-production-like environments — A DR test against a cluster with 10 partitions does not validate recovery of a production cluster with 10,000 partitions.
  • Undocumented test results — Without written records, lessons are lost, the same issues recur, and compliance auditors have no evidence of testing.
  • Never testing full-scale restore — Single-topic restores build confidence but do not validate that your infrastructure can handle a complete cluster recovery within the RTO window.

Review Questions

Use the following questions during architecture reviews to assess the reliability of your backup strategy:

  1. Are all backups validated automatically after completion using kafka-backup validate --deep?
  2. Is a full restore-to-temporary-cluster test performed at least quarterly?
  3. Are RPO and RTO targets defined and documented for every topic tier?
  4. Does the backup frequency match or exceed the RPO requirement for each tier?
  5. Is consumer offset recovery planned and tested as part of every restore procedure?
  6. Are backups stored in a different failure domain (region, account, or cloud provider) from the source cluster?
  7. Does the backup process use per-partition fault isolation and checkpoint-based resume?
  8. Is there a documented, tested DR procedure with clear escalation and communication steps?
  9. Are DR drills conducted at least quarterly, with results documented and action items tracked?
  10. Are chaos engineering scenarios used to validate backup infrastructure resilience?

Resources