Restore Jobs and Retry Behavior

The operator executes every backup and restore as a Kubernetes Job. This page describes how those Jobs behave: how many times they may run, how to control retries with spec.backoffLimit, what the CR status reports, and what happens when you delete a resource.

Restores run exactly once

A KafkaRestore is one-shot. The operator creates a single Job for it and never creates another — whether the Job succeeds or fails. Because a restore appends to (or purges) the target topics, implicitly re-running one could duplicate data, so every run must be intentional:

  • After a successful restore the CR reports RestoreComplete=True and is never re-executed.
  • After a failed restore the CR reports Ready=False / RestoreFailed and the operator does not retry it.
  • To run a restore again, delete the KafkaRestore and create a new one.

Available from operator 0.2.9. Earlier versions could re-create a completed restore Job on every 5-minute reconcile (#29).
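
The delete-and-recreate step can be sketched as a small shell function. This is only an illustration: the function name rerun_restore, the KUBECTL override variable, and the manifest filename are assumptions and not part of the operator.

```shell
# Sketch: re-run a restore by deleting the old KafkaRestore and applying a fresh one.
# KUBECTL is a hypothetical override so the command can be swapped out (e.g. for a dry run).
KUBECTL="${KUBECTL:-kubectl}"

# Usage: rerun_restore <name> <namespace> <manifest-file>
rerun_restore() {
  # Delete the existing CR and wait until it is gone; tolerate it being absent already.
  "$KUBECTL" delete kafkarestore "$1" -n "$2" --ignore-not-found --wait=true &&
    # Only then create the new CR, which gives the operator a fresh one-shot Job to run.
    "$KUBECTL" apply -n "$2" -f "$3"
}
```

For example, rerun_restore restore-orders kafka restore-orders.yaml would replace the restore shown on this page, assuming restore-orders.yaml holds its manifest. Keeping the delete explicit preserves the intent that every restore run is deliberate.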

Controlling pod retries: spec.backoffLimit

Within a single Job, pod-level retries are governed by the Job's backoffLimit. You can set it through spec.backoffLimit on both CRDs:

apiVersion: kafkabackup.com/v1alpha1
kind: KafkaRestore
metadata:
  name: restore-orders
  namespace: kafka
spec:
  strimziClusterRef:
    name: my-cluster
  backupRef:
    name: daily-backup
  backoffLimit: 2  # allow up to 2 pod retries (3 attempts total)

| CRD | Default | Rationale |
|---|---|---|
| KafkaRestore | 0 (exactly one attempt) | A retried pod re-applies a partially completed restore, which can duplicate records. Opt in deliberately if your restore is idempotent (e.g. dryRun). |
| KafkaBackup | 3 | Re-running a backup is safe; transient failures (broker restarts, network blips) are retried automatically. Applies to one-shot Jobs and scheduled CronJob runs. |

A scheduled KafkaBackup can lower the default in the same way:

apiVersion: kafkabackup.com/v1alpha1
kind: KafkaBackup
metadata:
  name: daily-backup
  namespace: kafka
spec:
  strimziClusterRef:
    name: my-cluster
  schedule:
    cron: "0 2 * * *"
  backoffLimit: 1  # tighten scheduled runs to a single retry
  storage:
    type: s3
    s3:
      bucket: my-kafka-backups
      region: eu-west-1

spec.backoffLimit is available from operator 0.2.10 (#31). Before that, all Jobs used a fixed backoffLimit: 3.
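
To make the retry arithmetic concrete, the sketch below computes the maximum number of pod attempts for a given backoffLimit and reads the value from the Job created for a restore. The helper names and the KUBECTL variable are hypothetical; the label selector assumes the Job carries the kafkabackup.com/restore=<name> label used in the cleanup example on this page.

```shell
# KUBECTL is a hypothetical override for the kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Maximum pod attempts for a Job: the first run plus backoffLimit retries.
# Usage: max_attempts <backoffLimit>
max_attempts() {
  echo $(( $1 + 1 ))
}

# Print the backoffLimit set on the Job(s) belonging to a restore.
# Usage: job_backoff_limit <restore-name> <namespace>
job_backoff_limit() {
  "$KUBECTL" get jobs -n "$2" -l "kafkabackup.com/restore=$1" \
    -o jsonpath='{.items[*].spec.backoffLimit}'
}
```

With the defaults above, max_attempts 0 gives 1 for a KafkaRestore and max_attempts 3 gives 4 for a KafkaBackup.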

Status conditions

The operator watches its Jobs, so the CR status reflects the outcome within seconds of the Job finishing:

| Condition | Meaning |
|---|---|
| Ready=False / RestoreRunning | The restore Job is running (or waiting to start) |
| Ready=True / RestoreCompleted and RestoreComplete=True | The restore Job succeeded; status.restore carries start/completion times |
| Ready=False / RestoreFailed and Error=True | The restore Job exhausted its backoffLimit; the operator will not retry |

KafkaBackup reports the analogous BackupRunning / BackupCompleted / BackupFailed reasons, plus status.lastBackup and status.backupHistory for completed runs.

kubectl get kafkarestore restore-orders -n kafka \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
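
Building on the query above, a script could poll the conditions until the restore reaches a terminal state. The following is a minimal sketch under assumptions: the function names, the KUBECTL override, and the 5-second poll interval are illustrative, and the matching relies only on the reasons and condition types listed in the table.

```shell
# KUBECTL is a hypothetical override for the kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Map the "type=status (reason)" lines from the jsonpath query to a simple outcome.
restore_outcome() {
  case "$1" in
    *"RestoreComplete=True"*|*"(RestoreCompleted)"*) echo succeeded ;;
    *"(RestoreFailed)"*) echo failed ;;
    *"(RestoreRunning)"*) echo running ;;
    *) echo unknown ;;
  esac
}

# Poll until the restore succeeds (exit status 0) or fails (exit status 1).
# Usage: wait_for_restore <name> <namespace>
wait_for_restore() {
  while :; do
    conditions="$("$KUBECTL" get kafkarestore "$1" -n "$2" \
      -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}')"
    case "$(restore_outcome "$conditions")" in
      succeeded) return 0 ;;
      failed) return 1 ;;
    esac
    sleep 5  # still running or not yet reported; check again shortly
  done
}
```

Because a failed restore is never retried by the operator, treating RestoreFailed as terminal avoids waiting on a Job that will not run again. The loop has no timeout, so a caller may want to add one.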

Cleanup on deletion

Deleting a KafkaBackup or KafkaRestore removes everything the operator created for it — Jobs, scheduled CronJobs, generated ConfigMaps, and the Jobs' pods. Deletes are issued with Background propagation so the garbage collector removes dependents; completed pods are not left behind.

Available from operator 0.2.10. Earlier versions left Completed pods orphaned after CR deletion (#30).

kubectl delete kafkarestore restore-orders -n kafka
kubectl get jobs,pods -n kafka -l kafkabackup.com/restore=restore-orders
# No resources found — Jobs and pods are garbage collected together