Reliability

The ability to consistently perform backup and restore operations correctly, recover from failures gracefully, and meet defined RPO and RTO targets under all conditions.

Reliability is the foundation of any backup system. A backup that cannot be restored is worse than no backup at all — it creates a false sense of safety. The Reliability pillar ensures that every backup is validated, every restore is rehearsed, and every failure scenario has a documented, tested recovery path.

Design Principles

  1. Automatically recover from failure — Use checkpoints and incremental resume so that a failed backup picks up where it left off rather than starting from scratch.
  2. Test recovery procedures, not just backup creation — A backup is only as good as its last successful restore. Validate regularly.
  3. Scale horizontally to handle partition growth — As topics gain partitions or new topics are added, the backup infrastructure must scale without manual intervention.
  4. Manage change through automation — Use GitOps workflows and Kubernetes CRDs to version, review, and roll out configuration changes predictably.
  5. Design for zero data loss — Understand and configure your RPO targets explicitly; do not rely on defaults to meet business requirements.
  6. Implement fault isolation — One partition failure should not affect the backup of other partitions. Isolate failure domains so blast radius is minimised.

Best Practices

REL-01: Backup Integrity & Validation

Every backup must be validated to confirm it can be used for a successful restore. Silent corruption, incomplete segments, or missing offsets can render a backup useless when it is needed most.

What to Validate

| Check | Description |
| --- | --- |
| Manifest completeness | All expected topics and partitions are present in the backup manifest |
| Segment integrity | Checksums match and compressed segments can be decompressed without errors |
| Offset continuity | No gaps in offset sequences within each partition |
| Time window coverage | Backup spans the expected time range without holes |
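
The checks above can be sketched in code. Below is a minimal, illustrative validator for one partition's segment chain; the `Segment` record (`base_offset`, `last_offset`, `sha256`) is a simplified assumption, not the tool's actual manifest format:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Segment:
    """One backed-up segment (illustrative shape, not the real manifest)."""
    base_offset: int
    last_offset: int
    sha256: str
    data: bytes

def validate_partition(segments: list[Segment]) -> list[str]:
    """Return a list of problems found in one partition's segment chain."""
    problems = []
    ordered = sorted(segments, key=lambda s: s.base_offset)
    for i, seg in enumerate(ordered):
        # Segment integrity: recompute the checksum over the raw bytes.
        if hashlib.sha256(seg.data).hexdigest() != seg.sha256:
            problems.append(f"checksum mismatch at base offset {seg.base_offset}")
        # Offset continuity: each segment must start right after the previous one ends.
        if i > 0 and seg.base_offset != ordered[i - 1].last_offset + 1:
            problems.append(f"offset gap before base offset {seg.base_offset}")
    return problems
```

The same two loops generalise to time-window coverage: sort by timestamp and assert each segment's window abuts the next.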

Why It Matters

Backup operations can report success (exit code 0) while producing incomplete or corrupt output. Network interruptions, storage throttling, or transient Kafka errors can result in partial writes that are only detectable through explicit validation.

Implementation

  • Run kafka-backup validate --deep after every backup operation.
  • Automate validation in your CI/CD pipeline or as a post-backup Kubernetes Job.
  • Periodically perform a full restore-to-temporary-cluster validation to confirm end-to-end recoverability.

Tip: Schedule a weekly automated restore to a temporary cluster. This catches issues that static validation cannot detect, such as schema compatibility problems or consumer group restoration failures.

Configuration

Deep validation of a backup:

kafka-backup validate \
--path s3://my-kafka-backups/production \
--backup-id 2026-03-24T00-00-00Z \
--deep

Describe backup metadata for programmatic checks:

kafka-backup describe \
--path s3://my-kafka-backups/production \
--backup-id 2026-03-24T00-00-00Z \
--format json
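
The JSON output can then be compared against the topics and partition counts you expect. The manifest shape assumed below (`topics`, `name`, `partitions`) is purely illustrative; consult the actual `kafka-backup describe` output for the real schema:

```python
import json

def missing_partitions(manifest_json: str, expected: dict[str, int]) -> list[str]:
    """Compare a backup manifest against expected partition counts per topic.

    Assumes a hypothetical manifest layout: {"topics": [{"name": ..., "partitions": [...]}]}.
    """
    manifest = json.loads(manifest_json)
    found = {t["name"]: len(t["partitions"]) for t in manifest["topics"]}
    missing = []
    for topic, count in expected.items():
        have = found.get(topic, 0)
        if have < count:
            missing.append(f"{topic}: expected {count} partitions, manifest has {have}")
    return missing
```

A non-empty result should fail the post-backup job so the gap is caught immediately, not during an incident.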

Anti-Patterns

  • Assuming success from exit code alone — A zero exit code means the process completed, not that the backup is complete or uncorrupted.
  • Never validating until a real restore is needed — Discovering corruption during a production incident turns a recoverable situation into a crisis.
  • Validating only the latest backup — Older backups may have degraded in storage; periodic re-validation catches bit rot and storage issues.

REL-02: Point-in-Time Recovery Strategy

Point-in-time recovery (PITR) allows you to restore data to a specific moment, not just the latest backup. This is critical for recovering from data corruption, accidental deletes, or application bugs that produced bad data.

What to Define

Before implementing PITR, answer these questions:

  • What granularity is needed? — Can you tolerate restoring to the nearest hour, or do you need minute-level precision?
  • What is the maximum acceptable backup window? — The gap between the last backup and the failure determines potential data loss.
  • Which topics need PITR? — Not every topic requires the same recovery granularity.

Why It Matters

| Backup Frequency | Achievable RPO | Use Case |
| --- | --- | --- |
| Continuous | < 1 minute | Payment transactions, order streams |
| Hourly | < 1 hour | User events, session data |
| Daily | < 24 hours | Logs, analytics, non-critical streams |

Implementation

Configure time windows in your restore configuration using epoch milliseconds. This allows precise recovery to the exact moment before corruption or data loss occurred.

Warning: Epoch timestamps must be in milliseconds, not seconds. A common mistake is supplying a 10-digit Unix timestamp (seconds) where a 13-digit millisecond value is expected; interpreted as milliseconds, a 10-digit value points at January 1970, so the time window matches no data.
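
A small helper makes the conversion explicit and guards against the seconds-vs-milliseconds mistake (a sketch in Python; the tool itself takes the numeric value):

```python
from datetime import datetime

def epoch_ms(iso_timestamp: str) -> int:
    """Convert an ISO-8601 UTC timestamp to epoch milliseconds (13 digits, not 10)."""
    dt = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

# A 13-digit value is milliseconds; a 10-digit value would be seconds.
assert len(str(epoch_ms("2026-03-23T15:00:00Z"))) == 13
```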

Configuration

PITR restore configuration:

restore:
  source:
    path: s3://my-kafka-backups/production
    backup_id: 2026-03-24T00-00-00Z
  target:
    bootstrap_servers:
      - kafka-restore:9092
  options:
    time_window_start: 1774278000000 # 2026-03-23T15:00:00Z
    time_window_end: 1774281600000   # 2026-03-23T16:00:00Z
  topic_mapping:
    - source: orders
      target: orders-restored
    - source: payments
      target: payments-restored

Anti-Patterns

  • Daily backups when RPO is 1 hour — The backup frequency must match or exceed the RPO requirement. A daily backup cannot deliver an hourly RPO.
  • Not understanding epoch timestamps — Misconfigured time windows restore the wrong data range, wasting time during an incident.
  • No topic-level RPO classification — Treating all topics with the same backup frequency wastes resources on low-value data and under-protects high-value data.

REL-03: Consumer Offset Recovery

Restoring messages is only half the battle. If consumer offsets are not recovered correctly, applications will either reprocess data (duplicates) or skip data (loss). Offset recovery must be planned as part of every restore operation.

What to Understand

kafka-backup supports multiple offset recovery strategies:

| Strategy | Description | Best For |
| --- | --- | --- |
| Timestamp-based | Reset offsets to a specific timestamp | PITR restores |
| Offset-based (header) | Use original offset headers embedded in backup | Exact replay |
| Group-based | Restore committed consumer group offsets | Resuming existing consumers |
| Cluster-scan | Scan target cluster to determine appropriate offsets | Cross-cluster migration |
| Manual | Specify exact offsets per partition | Surgical recovery |

Why It Matters

Incorrect offset recovery is the most common cause of post-restore issues. Applications may appear healthy but silently skip records or reprocess hours of data, causing downstream inconsistencies.

Implementation

Always use the two-phase approach: plan first, then execute.

  1. Generate a plan to review before applying changes.
  2. Review the plan to confirm offsets are correct.
  3. Execute the plan to apply offset resets.

Tip: The plan phase is non-destructive and produces a reviewable output. Always inspect the plan before executing, especially during incident recovery when mistakes are costly.

Configuration

Generate an offset reset plan:

kafka-backup offset-reset plan \
--config restore-config.yaml \
--strategy timestamp \
--timestamp 1711238400000 \
--output offset-plan.json

Review the plan:

jq '.partitions[] | {topic, partition, current_offset, new_offset}' offset-plan.json

Execute the offset reset:

kafka-backup offset-reset execute \
--plan offset-plan.json \
--config restore-config.yaml

Anti-Patterns

  • Restoring data without considering consumer offsets — Messages are restored but consumers either skip them entirely or reprocess old data.
  • Blindly resetting to earliest or latest — earliest causes full reprocessing; latest skips all restored data. Neither is appropriate without understanding the restore context.
  • Skipping the plan phase — Executing offset resets without review during a high-pressure incident leads to compounding errors.

REL-04: Disaster Recovery Planning

A disaster recovery plan defines how you will restore Kafka data when the worst happens. Without explicit RPO and RTO targets per data tier, recovery is ad-hoc and unpredictable.

What to Define

Classify topics into tiers based on business impact and assign RPO/RTO targets:

| Tier | Examples | RPO | RTO |
| --- | --- | --- | --- |
| Tier 1 — Critical | Payments, orders, financial transactions | < 1 minute | < 15 minutes |
| Tier 2 — Important | User events, session data, notifications | < 1 hour | < 1 hour |
| Tier 3 — Standard | Logs, metrics, analytics streams | < 24 hours | < 4 hours |
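
The tier targets translate directly into a backup-frequency check: worst-case data loss equals the interval between backups, so the interval must not exceed the tier's RPO. A minimal sketch with illustrative tier names and the targets from the table above:

```python
# Hypothetical tier policy mirroring the tier table; names are illustrative.
RPO_TARGETS_MS = {
    "tier1": 60_000,        # < 1 minute
    "tier2": 3_600_000,     # < 1 hour
    "tier3": 86_400_000,    # < 24 hours
}

def frequency_meets_rpo(tier: str, backup_interval_ms: int) -> bool:
    """Worst-case loss is one full backup interval, so the interval
    must be no longer than the tier's RPO target."""
    return backup_interval_ms <= RPO_TARGETS_MS[tier]
```

This is the check behind the REL-02 anti-pattern: a daily interval (86,400,000 ms) can never satisfy a one-hour RPO.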

Why It Matters

Without defined tiers, teams either over-invest in protecting low-value data or under-invest in protecting critical streams. Explicit targets drive backup frequency, storage redundancy, and recovery automation decisions.

Implementation

Document and rehearse a six-phase DR procedure:

  1. Detection — Automated alerting identifies the failure (monitoring, health checks, anomaly detection).
  2. Declaration — On-call engineer or automated runbook declares a disaster based on predefined criteria.
  3. Communication — Stakeholders are notified via defined channels (PagerDuty, Slack, status page).
  4. Execution — Restore operations are initiated following the documented runbook.
  5. Validation — Restored data is verified against integrity checks and consumer applications are confirmed healthy.
  6. Fallback — If restoration fails, execute the fallback plan (alternative backup, manual recovery, or degraded mode).

Warning: DR documentation must be accessible during an outage. If your runbooks are stored on the same infrastructure that has failed, you cannot access them when you need them most. Keep copies in at least two independent locations.

Anti-Patterns

  • No defined RPO/RTO targets — Without targets, there is no way to measure whether your backup strategy is adequate or whether recovery was successful.
  • DR plan exists but has never been tested — An untested plan is an assumption, not a plan.
  • DR documentation stored only on the failing system — Wiki on the same cloud region, runbooks in the same Kubernetes cluster, or playbooks in the same Git hosting provider.
  • Single-person DR knowledge — If only one engineer knows how to restore, your RTO depends on their availability.

REL-05: Fault Isolation & Redundancy

Backup infrastructure must be isolated from the systems it protects. A failure that takes down your Kafka cluster should not also take down your ability to restore from backup.

What to Implement

| Isolation Domain | Recommendation |
| --- | --- |
| Storage location | Different account, region, or cloud provider from the source cluster |
| Compute | Dedicated backup infrastructure, not co-located on broker nodes |
| Network | Separate failure domain; backup process reachable even if source network is degraded |
| Partition-level | Per-partition fault isolation — one partition failure does not block others |

Why It Matters

If backups are stored in the same region and account as the source cluster, a single cloud incident (region outage, account compromise, IAM misconfiguration) can destroy both production data and all backups simultaneously.

Implementation

  • Enable S3 cross-region replication to maintain backup copies in a separate region.
  • Run backup processes on dedicated infrastructure, not on Kafka broker nodes.
  • Use per-partition checkpointing so a failure in one partition allows others to continue.
  • Implement checkpoint-based resume so interrupted backups pick up where they left off.
  • For ransomware protection, maintain air-gapped backups using S3 Object Lock or equivalent immutable storage.
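
Per-partition isolation and checkpoint-based resume combine naturally: each partition tracks its own checkpoint, and a failure in one partition is recorded without interrupting the rest. The sketch below is illustrative only, not the tool's actual checkpoint mechanism:

```python
def backup_partitions(partitions, checkpoints, read_from):
    """Back up each partition independently, resuming from its own checkpoint.

    partitions:  iterable of partition ids
    checkpoints: dict of partition id -> last offset safely written
    read_from:   callable (partition, start_offset) -> list of records; may raise
    Returns (updated checkpoints, partitions that failed this run).
    """
    failed = []
    for p in partitions:
        start = checkpoints.get(p, -1) + 1  # resume just after the last checkpoint
        try:
            records = read_from(p, start)
        except Exception:
            # Isolate the fault: record it and keep backing up the other partitions.
            failed.append(p)
            continue
        if records:
            checkpoints[p] = start + len(records) - 1
    return checkpoints, failed
```

On the next run, failed partitions simply retry from their unchanged checkpoints while healthy partitions continue forward.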

Tip: Use a separate AWS account for backup storage. Even if the production account is compromised, the backup account remains isolated. Cross-account IAM roles provide secure access without shared credentials.

Configuration

Cross-region S3 storage with replication:

storage:
  type: s3
  s3:
    bucket: my-kafka-backups-dr
    region: us-west-2 # Different region from source cluster
    endpoint: ""
    force_path_style: false
backup:
  checkpoint_interval: 30s # Frequent checkpoints for resume
  per_partition_isolation: true

Anti-Patterns

  • Backups in the same region and account as the source cluster — A region-wide outage or account compromise destroys both production data and backups.
  • Running backup processes on Kafka broker nodes — Broker failure takes down both the data source and the backup process simultaneously.
  • No redundancy for backup storage — A single storage location is a single point of failure.
  • Single point of failure in backup infrastructure — One backup server, one storage bucket, one network path.

REL-06: DR Testing & Chaos Engineering

A disaster recovery plan is only reliable if it is tested regularly. Chaos engineering validates that your backup infrastructure handles real-world failure scenarios, not just ideal conditions.

What to Test

| Test Type | Frequency | Scope |
| --- | --- | --- |
| Tabletop exercise | Monthly | Walk through DR scenarios with the team; identify gaps in runbooks |
| Single topic restore | Weekly (automated) | Restore one topic to a temporary cluster and validate integrity |
| Full cluster restore | Quarterly | Restore all tiered topics and measure actual RTO |
| Failover drill | Semi-annually | Simulate primary cluster loss and execute full DR procedure |
| Chaos test | Quarterly | Inject failures into backup infrastructure and observe behaviour |

Why It Matters

Without regular testing, your DR plan degrades over time. Infrastructure changes, new topics, configuration drift, and team turnover all erode recovery capabilities. Testing keeps them current.

Implementation

Design chaos scenarios that exercise your failure modes:

| Scenario | What It Tests |
| --- | --- |
| Kill backup process mid-run | Checkpoint resume — does backup continue from last checkpoint? |
| Storage outage (revoke S3 access) | Error handling and retry logic |
| Network partition (block Kafka port) | Graceful degradation and reconnection |
| Corrupt a backup segment | Validation detection — does validate --deep catch it? |
| Full region failure | Cross-region restore from replicated backup |

Warning: Always run chaos tests in a controlled environment with clear rollback procedures. Document the blast radius of each test before executing. Never run destructive chaos tests against production backups without an isolated copy.

Implementation: Document Results

Every DR test must produce a written report including:

  • Actual RTO achieved vs target RTO
  • Actual RPO achieved vs target RPO
  • Issues encountered during the test
  • Action items with owners and deadlines
  • Pass/fail determination against defined success criteria
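
The report's core numbers can be derived mechanically from four timestamps captured during the test. A minimal sketch with illustrative field names:

```python
def dr_test_report(detected_ms, restored_ms, last_backup_ms, failure_ms,
                   rto_target_ms, rpo_target_ms):
    """Derive actual RTO/RPO from test timestamps and compare against targets.

    Field names are hypothetical; capture whatever your incident tooling records.
    """
    actual_rto = restored_ms - detected_ms    # time from detection to recovered service
    actual_rpo = failure_ms - last_backup_ms  # data written after the last backup is lost
    return {
        "actual_rto_ms": actual_rto,
        "actual_rpo_ms": actual_rpo,
        "passed": actual_rto <= rto_target_ms and actual_rpo <= rpo_target_ms,
    }
```

Storing one such record per drill gives the trend data the tip below recommends.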

Tip: Track RTO and RPO measurements over time. Trending data reveals whether your recovery capabilities are improving or degrading, and provides evidence for compliance audits.

Anti-Patterns

  • DR drills only after incidents — Reactive testing means you discover problems during real outages, not before them.
  • Testing in non-production-like environments — A DR test against a cluster with 10 partitions does not validate recovery of a production cluster with 10,000 partitions.
  • Undocumented test results — Without written records, lessons are lost, the same issues recur, and compliance auditors have no evidence of testing.
  • Never testing full-scale restore — Single-topic restores build confidence but do not validate that your infrastructure can handle a complete cluster recovery within the RTO window.

Review Questions

Use the following questions during architecture reviews to assess the reliability of your backup strategy:

  1. Are all backups validated automatically after completion using kafka-backup validate --deep?
  2. Is a full restore-to-temporary-cluster test performed at least quarterly?
  3. Are RPO and RTO targets defined and documented for every topic tier?
  4. Does the backup frequency match or exceed the RPO requirement for each tier?
  5. Is consumer offset recovery planned and tested as part of every restore procedure?
  6. Are backups stored in a different failure domain (region, account, or cloud provider) from the source cluster?
  7. Does the backup process use per-partition fault isolation and checkpoint-based resume?
  8. Is there a documented, tested DR procedure with clear escalation and communication steps?
  9. Are DR drills conducted at least quarterly, with results documented and action items tracked?
  10. Are chaos engineering scenarios used to validate backup infrastructure resilience?

Resources