Restore Jobs and Retry Behavior
The operator executes every backup and restore as a Kubernetes Job. This page describes how those Jobs behave: how many times they may run, how to control retries with spec.backoffLimit, what the CR status reports, and what happens when you delete a resource.
Restores run exactly once
A KafkaRestore is one-shot. The operator creates a single Job for it and never creates another — whether the Job succeeds or fails. Because a restore appends to (or purges) the target topics, implicitly re-running one could duplicate data, so every run must be intentional:
- After a successful restore the CR reports RestoreComplete=True and is never re-executed.
- After a failed restore the CR reports Ready=False/RestoreFailed and the operator does not retry it.
- To run a restore again, delete the KafkaRestore and create a new one.
Available from operator 0.2.9. Earlier versions could re-create a completed restore Job on every 5-minute reconcile (#29).
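The delete-and-recreate step can be scripted. The following is a minimal sketch using standard kubectl commands, not an official tool: the function name rerun_restore and the manifest file name restore-orders.yaml are hypothetical, and the manifest is assumed to define the same KafkaRestore (including its metadata.namespace) that is being deleted.

```shell
# Hypothetical helper: re-run a one-shot restore by replacing its CR.
# Usage: rerun_restore <name> <namespace> <manifest-file>
rerun_restore() {
  local name="$1" namespace="$2" manifest="$3"

  # Delete the old CR and block until it is gone, so the object created
  # below is a new resource rather than an update of the old one.
  kubectl delete kafkarestore "$name" -n "$namespace" \
    --wait=true --ignore-not-found || return 1

  # Create the restore again (manifest is assumed to set the namespace).
  kubectl apply -f "$manifest"
}

# Example (assumed file name):
# rerun_restore restore-orders kafka restore-orders.yaml
```

Keeping the delete explicit preserves the intent of this section: a second run only happens because someone asked for it.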
Controlling pod retries: spec.backoffLimit
Within a CR's single Job, pod-level retries are governed by the Job's backoffLimit, which you can set on both CRDs through spec.backoffLimit. For example, on a KafkaRestore:
apiVersion: kafkabackup.com/v1alpha1
kind: KafkaRestore
metadata:
  name: restore-orders
  namespace: kafka
spec:
  strimziClusterRef:
    name: my-cluster
  backupRef:
    name: daily-backup
  backoffLimit: 2   # allow up to 2 pod retries (3 attempts total)
| CRD | Default | Rationale |
|---|---|---|
| KafkaRestore | 0 (exactly one attempt) | A retried pod re-applies a partially completed restore, which can duplicate records. Opt in deliberately if your restore is idempotent (e.g. dryRun). |
| KafkaBackup | 3 | Re-running a backup is safe; transient failures (broker restarts, network blips) are retried automatically. Applies to one-shot Jobs and scheduled CronJob runs. |
apiVersion: kafkabackup.com/v1alpha1
kind: KafkaBackup
metadata:
  name: daily-backup
  namespace: kafka
spec:
  strimziClusterRef:
    name: my-cluster
  schedule:
    cron: "0 2 * * *"
  backoffLimit: 1   # tighten scheduled runs to a single retry
  storage:
    type: s3
    s3:
      bucket: my-kafka-backups
      region: eu-west-1
spec.backoffLimit is available from operator 0.2.10 (#31). Before that, all Jobs used a fixed backoffLimit: 3.
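To see how a backoffLimit value translates into pod attempts, and to check the value on a running Job, something like the sketch below may help. Both function names are hypothetical. The second function assumes the restore's Jobs carry the kafkabackup.com/restore label that this page uses to select them, and it reads the standard Job field .spec.backoffLimit.

```shell
# Maximum pod attempts for a Job: the first run plus backoffLimit retries.
# Usage: max_attempts <backoffLimit>
max_attempts() {
  echo $(( $1 + 1 ))
}

# Hypothetical helper: print the backoffLimit of the Job(s) created for a
# KafkaRestore, selected by the kafkabackup.com/restore label.
# Usage: restore_backoff_limit <name> <namespace>
restore_backoff_limit() {
  local name="$1" namespace="$2"
  kubectl get jobs -n "$namespace" -l "kafkabackup.com/restore=$name" \
    -o jsonpath='{.items[*].spec.backoffLimit}'
}

# Example:
# limit="$(restore_backoff_limit restore-orders kafka)"
# echo "attempts allowed: $(max_attempts "$limit")"
```

With the defaults in the table, max_attempts 0 prints 1 for a KafkaRestore and max_attempts 3 prints 4 for a KafkaBackup.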
Status conditions
The operator watches its Jobs, so the CR status reflects the outcome within seconds of the Job finishing:
| Condition | Meaning |
|---|---|
| Ready=False / RestoreRunning | The restore Job is running (or waiting to start) |
| Ready=True / RestoreCompleted and RestoreComplete=True | The restore Job succeeded; status.restore carries start/completion times |
| Ready=False / RestoreFailed and Error=True | The restore Job exhausted its backoffLimit; the operator will not retry |
KafkaBackup reports the analogous BackupRunning / BackupCompleted / BackupFailed reasons, plus status.lastBackup and status.backupHistory for completed runs.
kubectl get kafkarestore restore-orders -n kafka \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
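In a script or CI step it can be useful to block until the restore reaches one of the terminal reasons in the table above. The sketch below polls the Ready condition's reason with a kubectl JSONPath filter. The function name wait_for_restore, its return codes, and the default poll count and interval are all illustrative assumptions, not operator behavior.

```shell
# Hypothetical helper: poll a KafkaRestore until its Ready condition reports
# a terminal reason. Returns 0 on RestoreCompleted, 1 on RestoreFailed, and
# 2 if neither reason is seen within the allotted polls.
# Usage: wait_for_restore <name> <namespace> [max-polls] [interval-seconds]
wait_for_restore() {
  local name="$1" namespace="$2" max_polls="${3:-60}" interval="${4:-5}"
  local reason i=0
  while [ "$i" -lt "$max_polls" ]; do
    # Read only the reason of the condition whose type is Ready.
    reason="$(kubectl get kafkarestore "$name" -n "$namespace" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}')"
    case "$reason" in
      RestoreCompleted) return 0 ;;
      RestoreFailed)    return 1 ;;
    esac
    sleep "$interval"
    i=$(( i + 1 ))
  done
  return 2
}

# Example:
# wait_for_restore restore-orders kafka && echo "restore finished"
```

Distinguishing failure from timeout matters here because a failed restore is never retried by the operator, so continuing to wait would not help.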
Cleanup on deletion
Deleting a KafkaBackup or KafkaRestore removes everything the operator created for it — Jobs, scheduled CronJobs, generated ConfigMaps, and the Jobs' pods. Deletes are issued with Background propagation so the garbage collector removes dependents; completed pods are not left behind.
Available from operator 0.2.10. Earlier versions left Completed pods orphaned after CR deletion (#30).
kubectl delete kafkarestore restore-orders -n kafka
kubectl get jobs,pods -n kafka -l kafkabackup.com/restore=restore-orders
# No resources found — Jobs and pods are garbage collected together
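Because Background propagation lets the garbage collector remove dependents asynchronously, a check run immediately after the delete may still list a Job or pod for a short time. The sketch below polls until nothing is left. The function name wait_for_cleanup and its default poll count and interval are hypothetical; the label selector is the one used in the commands above.

```shell
# Hypothetical helper: wait until no Jobs or pods labelled for the given
# restore remain. Returns 0 once none are listed, 1 if some still exist
# after the allotted polls.
# Usage: wait_for_cleanup <name> <namespace> [max-polls] [interval-seconds]
wait_for_cleanup() {
  local name="$1" namespace="$2" max_polls="${3:-30}" interval="${4:-2}"
  local remaining i=0
  while [ "$i" -lt "$max_polls" ]; do
    # -o name prints one line per matching resource and nothing on stdout
    # when there are no matches.
    remaining="$(kubectl get jobs,pods -n "$namespace" \
      -l "kafkabackup.com/restore=$name" -o name)"
    [ -z "$remaining" ] && return 0
    sleep "$interval"
    i=$(( i + 1 ))
  done
  return 1
}

# Example:
# kubectl delete kafkarestore restore-orders -n kafka
# wait_for_cleanup restore-orders kafka || echo "dependents still present"
```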