
Operational Excellence

"The ability to run and monitor Kafka backup workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver reliable data protection."

The Operational Excellence pillar focuses on ensuring your backup and restore operations are reliable, observable, and continuously improving. It encompasses how your team organises around backup responsibilities, automates routine tasks, monitors health, responds to incidents, and evolves practices over time.

Design Principles

  1. Perform operations as code -- Define backup schedules, retention policies, and restore procedures as version-controlled configuration. Eliminate manual, ad-hoc CLI invocations for production workloads.

  2. Make frequent, small, reversible changes -- Roll out configuration changes incrementally (e.g., one topic pattern at a time). Use canary deployments for operator upgrades and validate each change before proceeding.

  3. Refine operations procedures frequently -- Review runbooks and automation after every incident and at regular intervals. Update them to reflect current topology, tooling versions, and lessons learned.

  4. Anticipate failure -- Design backup pipelines assuming that Kafka brokers, storage backends, and network connectivity will fail. Build pre-mortems into planning and run regular disaster-recovery drills.

  5. Learn from all operational events -- Treat successful restores, slow backups, and outright failures equally as sources of insight. Conduct blameless post-incident reviews and feed findings back into automation and monitoring.

  6. Use managed services where possible -- Leverage managed object storage (S3, GCS, Azure Blob), managed Kubernetes, and managed Kafka where appropriate to reduce undifferentiated operational burden and let your team focus on backup-specific concerns.


Best Practices

OE-01: Organisation & Team Readiness

What

Establish clear ownership, on-call procedures, and team competency for Kafka backup operations.

Why

Backup systems that lack a clear owner tend to drift into a neglected state. When a restore is needed during an outage, confusion over who is responsible and what steps to follow turns a recoverable incident into a prolonged one.

Implementation Guidance

  • Assign a backup operations owner -- A named individual or team accountable for backup health, capacity planning, and restore readiness.
  • Define on-call procedures -- Include backup/restore responsibilities in your existing on-call rotation. Ensure on-call engineers have the necessary access and credentials.
  • Maintain a disaster-recovery playbook -- Document exact kafka-backup CLI commands for every recovery scenario. Store the playbook alongside your infrastructure code, not in a separate wiki.
  • Run quarterly DR drills -- Execute full and partial restores against a staging environment. Record time-to-restore and compare against your RTO targets.
  • Train all platform engineers -- Every engineer on the team should be able to execute a restore independently. Avoid single points of knowledge.
  • Define escalation paths -- Document when to escalate from on-call to the backup owner, and from the backup owner to OSO support (for Enterprise customers).
Tip

Store your DR playbook in the same Git repository as your backup configuration. This ensures the playbook is always versioned alongside the config it references.
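A layout along these lines (directory and file names illustrative) keeps the playbook, schedules, and runbooks versioned together:

```
kafka-backup-ops/
├── config/
│   ├── production-nightly.yaml   # KafkaBackupSchedule manifests
│   └── staging-nightly.yaml
├── runbooks/
│   ├── full-cluster-restore.md
│   ├── single-topic-pitr.md
│   └── offset-recovery.md
└── dr-playbook.md
```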

Anti-patterns

  • No designated owner -- Backup is "everyone's responsibility", which means it is no one's responsibility.
  • Untested playbook -- A restore procedure that has never been executed is not a procedure; it is a hope.
  • Single point of knowledge -- Only one engineer knows how to operate kafka-backup. When they are unavailable, the team is blocked.

OE-02: Backup Lifecycle Management

What

Define and automate the full lifecycle of backups: scheduling, validation, retention, and deletion.

Why

Without lifecycle automation, storage costs grow unchecked, stale backups give a false sense of security, and teams discover validation failures only when a restore is attempted during an incident.

Implementation Guidance

  • Define schedules with cron -- Use Kubernetes CronJobs or the operator's built-in scheduling to run backups at predictable intervals.
  • Automate validation -- Run kafka-backup validate --deep after every backup completes. Deep validation checks segment integrity, offset continuity, and header consistency.
  • Set retention policies per environment:
    • Development: 7 days
    • Production: 90 days
    • Compliance/Audit: 7 years (with immutable storage locks)
  • Automate deletion -- Use the operator's retention.maxAge or a scheduled job to remove expired backups. Never rely on manual cleanup.
  • Tag backups -- Apply metadata labels (environment, team, compliance tier) to every backup for filtering, reporting, and cost allocation.

Configuration Example

apiVersion: kafka-backup.osodevops.io/v1alpha1
kind: KafkaBackupSchedule
metadata:
  name: production-nightly
  namespace: kafka-backup
  labels:
    environment: production
    team: platform
    compliance-tier: standard
spec:
  schedule: "0 2 * * *"
  backupSpec:
    kafka:
      bootstrapServers: "kafka-0.kafka:9092,kafka-1.kafka:9092,kafka-2.kafka:9092"
    topics:
      include:
        - "orders-*"
        - "payments-*"
    storage:
      type: s3
      s3:
        bucket: acme-kafka-backups-prod
        region: eu-west-1
        prefix: nightly/
    compression:
      algorithm: zstd
    retention:
      maxAge: 90d
    validation:
      deep: true
      onComplete: true
Warning

Always enable validation.deep: true for production backups. Shallow validation only checks that files exist; it does not verify data integrity.
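The same manifest can serve every environment with only the retention block changed. The durations below mirror the tiers listed earlier; the 7-year tier is expressed in days to avoid assuming the operator accepts a year unit, and the immutability lock itself is configured on the storage bucket, not in this resource:

```yaml
# Development (7 days)
retention:
  maxAge: 7d

# Compliance/Audit (~7 years; pair with bucket-level object locks)
retention:
  maxAge: 2555d
```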

Anti-patterns

  • No automated deletion -- Storage costs grow linearly and old backups become a liability rather than an asset.
  • No post-backup validation -- You discover corrupt backups at the worst possible time: during a restore.
  • Same retention everywhere -- Applying production retention to development wastes storage; applying development retention to compliance data violates regulations.

OE-03: Observability & Monitoring

What

Instrument backup and restore operations with metrics, dashboards, and alerts to maintain full visibility into pipeline health.

Why

Backups are background processes. Without observability, failures go unnoticed until a restore is needed. By then, your most recent valid backup may be hours or days old -- far outside your RPO.

Implementation Guidance

  • Enable Prometheus metrics on port 8080 for all kafka-backup instances.
  • Monitor key metrics:
    • kafka_backup_lag_records -- Consumer lag per partition. Rising lag indicates the backup cannot keep pace with production throughput.
    • kafka_backup_records_total -- Total records backed up. Use the rate to track throughput.
    • kafka_backup_compression_ratio -- Compression efficiency. A sudden change may indicate a shift in message format.
    • kafka_backup_storage_write_latency_seconds -- Storage backend latency. Elevated latency degrades backup performance and may indicate storage issues.
  • Build Grafana dashboards for:
    • Overall backup health (active jobs, success/failure rates)
    • Per-topic backup status and lag
    • Storage growth trends and cost projection
    • Restore operation tracking and duration
  • Configure alerts for:
    • Backup job failure (any job that does not complete successfully)
    • Consumer lag exceeding RPO threshold
    • Storage write errors or elevated latency
    • Checkpoint staleness (no checkpoint update within expected interval)
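The lag and latency alerts above can be sketched as Prometheus alerting rules. The metric names come from the list above; the thresholds, durations, and the assumption that the latency metric is a Prometheus histogram are illustrative and should be tuned to your own RPO targets:

```yaml
groups:
  - name: kafka-backup
    rules:
      # Backup consumer lag exceeding the RPO threshold
      - alert: KafkaBackupLagHigh
        expr: max by (topic, partition) (kafka_backup_lag_records) > 100000  # illustrative threshold
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Backup lag on {{ $labels.topic }}/{{ $labels.partition }} exceeds RPO threshold"

      # Elevated storage backend write latency
      - alert: KafkaBackupStorageSlow
        expr: histogram_quantile(0.99, sum by (le) (rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 storage write latency above 2s"
```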

Configuration Example

# kafka-backup config
metrics:
  enabled: true
  port: 8080
  bind_address: "0.0.0.0"
  path: "/metrics"

# Prometheus scrape config
scrape_configs:
  - job_name: kafka-backup
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only kafka-backup pods
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: kafka-backup
        action: keep
      # Scrape each pod on the metrics port
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+)
        replacement: ${1}:8080
Tip

Set up a dedicated "Backup Health" Grafana dashboard and include it in your team's daily standup review. Catching a slow backup trend early is far cheaper than discovering a gap during an incident.

Anti-patterns

  • No monitoring at all -- You have no idea whether backups are running, succeeding, or falling behind.
  • Monitoring backup jobs but not storage -- A backup that completes but fails to write to storage is worse than a visible failure; it is a silent one.
  • No alerting -- Dashboards that no one watches provide no value. Alerts ensure the right people are notified at the right time.

OE-04: Runbooks & Automation

What

Create detailed, executable runbooks for every backup and restore scenario. Automate routine operations and keep manual procedures as copy-paste-ready CLI commands.

Why

During an outage, engineers are under pressure. Runbooks that contain exact commands -- not prose descriptions -- dramatically reduce mean time to recovery (MTTR). Automation eliminates human error from repetitive tasks.

Implementation Guidance

  • Maintain runbooks for:
    • Full cluster restore from backup
    • Single topic point-in-time recovery (PITR)
    • Consumer offset recovery
    • Backup failure investigation and remediation
    • Storage backend failover
    • Configuration change rollout
  • Each runbook must include:
    • Prerequisites (access, credentials, tooling versions)
    • Step-by-step CLI commands (copy-paste ready)
    • Validation steps after each action
    • Rollback procedure
    • Estimated time to complete
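One lightweight way to keep these fields consistent is a metadata header at the top of each runbook file; the field names below are illustrative, not a kafka-backup convention:

```yaml
runbook: single-topic-pitr-restore
estimated_time: 30m
last_drilled: 2026-01-15
prerequisites:
  - kafka-backup CLI (pinned version)
  - read access to the backup bucket
  - admin credentials for the target Kafka cluster
rollback: delete the restored target topic; no production topic is modified
```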

Example Runbook Excerpt: Single Topic PITR Restore

# Step 1: List available backups for the target topic
kafka-backup list \
  --storage s3 \
  --bucket acme-kafka-backups-prod \
  --prefix nightly/ \
  --topic orders-events \
  --from "2026-03-20T00:00:00Z" \
  --to "2026-03-23T23:59:59Z"

# Step 2: Describe the selected backup to confirm contents
kafka-backup describe \
  --storage s3 \
  --bucket acme-kafka-backups-prod \
  --backup-id backup-20260322-020000

# Step 3: Create restore configuration (restore-orders.yaml)
restore:
  kafka:
    bootstrapServers: "kafka-0.kafka:9092,kafka-1.kafka:9092,kafka-2.kafka:9092"
  storage:
    type: s3
    s3:
      bucket: acme-kafka-backups-prod
      region: eu-west-1
  topics:
    include:
      - "orders-events"
  pointInTime: "2026-03-22T14:30:00Z"
  targetTopic: "orders-events-restored"
  restoreOffsets: true

# Step 4: Execute the restore
kafka-backup restore --config restore-orders.yaml

# Step 5: Validate the restored topic
kafka-backup validate \
  --deep \
  --topic orders-events-restored \
  --kafka-bootstrap "kafka-0.kafka:9092"
Warning

Always restore to a separate target topic (e.g., orders-events-restored) first. Validate the data before swapping consumers to the restored topic. Never overwrite a production topic directly.

Anti-patterns

  • Tribal knowledge -- Restore procedures exist only in one engineer's head. They are effectively unavailable at 3 a.m. on a Sunday.
  • Prose without commands -- "Connect to the cluster and restore the topic" is not a runbook. Exact commands with exact flags are a runbook.
  • Referencing an external wiki during an outage -- If your Confluence page is behind an SSO that depends on the infrastructure you are trying to recover, your runbook is inaccessible when you need it most.

OE-05: Continuous Improvement

What

Establish feedback loops that drive ongoing improvement to backup operations, configuration, and tooling.

Why

Kafka topologies evolve, throughput changes, compliance requirements shift, and new kafka-backup releases bring performance improvements. A backup strategy that is never revisited will silently fall behind operational needs.

Implementation Guidance

  • Conduct post-incident reviews after every backup or restore incident. Document root cause, timeline, impact, and action items. Track action item completion.
  • Run monthly metrics reviews covering:
    • Backup duration trends (are backups taking longer as data volume grows?)
    • Storage growth rate and cost trajectory
    • Restore success rate and time-to-restore
    • RPO/RTO compliance percentage
  • Update configurations when the environment changes:
    • New topics or topic patterns added to Kafka
    • Significant throughput increases
    • New compliance or regulatory requirements
    • Infrastructure changes (new regions, storage tiers)
  • Benchmark against new releases -- Test new kafka-backup versions in staging. Measure throughput, compression ratio, and resource usage against your current version before upgrading production.
Tip

Add a recurring calendar event for a monthly "Backup Operations Review". Use it to walk through metrics dashboards, review open action items, and assess whether current configurations still meet requirements.

Anti-patterns

  • Set-and-forget -- Deploying a backup configuration once and never reviewing it. Environments change; configurations must follow.
  • No post-incident reviews -- Repeating the same failure because the team never analysed the first occurrence.
  • Annual-only review -- Reviewing backup strategy once a year guarantees that it is out of date for eleven months.

Review Questions

Use these questions to assess your operational maturity. For each question, rate your current state as None, Basic, Advanced, or Expert.

  1. Is there a designated owner (individual or team) accountable for Kafka backup operations?
  2. Are backup schedules, retention policies, and validation steps defined as version-controlled configuration?
  3. Do you run automated deep validation (kafka-backup validate --deep) after every backup?
  4. Are Prometheus metrics enabled and scraped for all kafka-backup instances?
  5. Do you have Grafana dashboards (or equivalent) providing visibility into backup health, lag, and storage growth?
  6. Are alerts configured for backup failures, RPO threshold breaches, and storage errors?
  7. Do you maintain copy-paste-ready runbooks for every restore scenario (full cluster, single topic PITR, offset recovery)?
  8. Have you executed a full disaster-recovery drill in the last quarter?
  9. Do you conduct post-incident reviews after every backup or restore incident?
  10. Do you review backup metrics and configurations at least monthly to ensure they match current requirements?

Resources