Reference Architectures
Proven deployment patterns that combine the principles from every Well-Architected pillar into end-to-end, production-ready configurations you can adopt or adapt.
Each reference architecture below includes a complete topology, configuration, cost estimate, and known limitations so you can evaluate trade-offs before committing to a design. Pick the architecture closest to your constraints, then adjust RPO, RTO, and storage tiers to match your specific requirements.
Architecture Comparison
| Architecture | RPO | RTO | Complexity | Est. Monthly Cost | Best For |
|---|---|---|---|---|---|
| 1. Single-Region S3 | < 1 hr | < 4 hr | Low | ~$105 | Single-region workloads, dev/staging |
| 2. Cross-Region DR | < 15 min | < 1 hr | Medium | ~$175 | Multi-region availability, production DR |
| 3. Multi-Cloud Active-Passive | < 1 hr | < 2 hr | High | ~$265 | Cloud-provider failure protection |
| 4. Air-Gapped Compliance | < 24 hr | < 8 hr | High | ~$195 | Ransomware protection, regulatory compliance |
| 5. Kubernetes GitOps Pipeline | < 1 hr | < 2 hr | Medium | ~$105 | K8s-native teams, declarative operations |
Start with Architecture 1 to validate your backup strategy, then evolve toward cross-region or multi-cloud patterns as your availability requirements grow. Each architecture builds on the configuration patterns established in the simpler designs.
Architecture 1: Single-Region Backup to S3
Overview
The simplest production-ready pattern. A single kafka-backup deployment runs continuously inside the same region as your Kafka cluster, streaming data to an S3 bucket with versioning enabled. Prometheus scrapes the built-in metrics endpoint for alerting and dashboards.
When to Use
- Single-region Kafka deployment
- RPO < 1 hour is acceptable
- RTO < 4 hours is acceptable
- You want the lowest operational overhead and cost
- Cross-region protection is not yet a requirement
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│ AWS Region (us-east-1) │
│ │
│ ┌──────────────────────┐ ┌──────────────────────────┐ │
│ │ Kafka Cluster │ │ Kubernetes Cluster │ │
│ │ │ │ │ │
│ │ ┌─────┐ ┌─────┐ │ │ ┌──────────────────────┐ │ │
│ │ │ b-1 │ │ b-2 │ │ ───── │ │ kafka-backup │ │ │
│ │ └─────┘ └─────┘ │ │ │ (Deployment, 1 pod) │ │ │
│ │ ┌─────┐ │ │ └──────────┬───────────┘ │ │
│ │ │ b-3 │ │ │ │ │ │
│ │ └─────┘ │ │ ┌──────────┴───────────┐ │ │
│ └──────────────────────┘ │ │ Prometheus + Grafana │ │ │
│ │ └──────────────────────┘ │ │
│ └──────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ S3 Bucket │ │
│ │ (versioning on) │ │
│ │ kafka-backup/ │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Components
| Component | Purpose |
|---|---|
| Kafka cluster (3 brokers) | Source data |
| kafka-backup (K8s Deployment) | Continuous backup, 1 replica |
| S3 bucket (versioning enabled) | Backup storage, same region |
| Prometheus + Grafana | Metrics scraping, alerting, dashboards |
Configuration
backup.yaml
source:
  bootstrap_servers:
    - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
topic:
  include:
    - ".*" # back up all topics
storage:
  type: s3
  s3:
    bucket: my-org-kafka-backup
    region: us-east-1
    prefix: prod/
backup:
  compression: zstd
  segment_max_bytes: 134217728 # 128 MB
  continuous: true
  checkpoint_interval_secs: 60
metrics:
  enabled: true
  port: 9090
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-backup
  namespace: kafka-backup
  labels:
    app: kafka-backup
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-backup
  template:
    metadata:
      labels:
        app: kafka-backup
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: kafka-backup
      containers:
        - name: kafka-backup
          image: osodevops/kafka-backup:latest # pin a specific version tag in production
          args: ["backup", "--config", "/etc/kafka-backup/backup.yaml"]
          ports:
            - name: metrics
              containerPort: 9090
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
          volumeMounts:
            - name: config
              mountPath: /etc/kafka-backup
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: kafka-backup-config
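The Deployment mounts its configuration from a ConfigMap named kafka-backup-config. A minimal rollout sketch, assuming the two manifests above are saved as backup.yaml and deployment.yaml in the working directory (filenames are illustrative):

```shell
#!/usr/bin/env bash
# Roll out the backup Deployment. The dry-run | apply pattern makes the
# ConfigMap update idempotent, so re-running after a config change is safe.
set -euo pipefail

NS="kafka-backup"

if command -v kubectl >/dev/null 2>&1; then
  kubectl get namespace "$NS" >/dev/null 2>&1 || kubectl create namespace "$NS"
  kubectl create configmap kafka-backup-config \
    --from-file=backup.yaml --namespace "$NS" \
    --dry-run=client -o yaml | kubectl apply -f -
  kubectl apply -f deployment.yaml --namespace "$NS"
  kubectl rollout status deployment/kafka-backup --namespace "$NS" --timeout=120s
else
  echo "kubectl not found; run from a machine with cluster access" >&2
fi

echo "ok" > /tmp/kafka-backup-deploy.status
```

Note that updating the ConfigMap does not restart the running pod; follow a config change with `kubectl rollout restart deployment/kafka-backup -n kafka-backup` so the new settings are picked up.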
IAM
Refer to SEC-01: Identity & Access Management for the least-privilege IAM policy. The backup role requires write-only access to S3 and read-only access to Kafka.
Cost Estimate
| Item | Monthly Cost |
|---|---|
| S3 storage (~1 TB/day, 30-day retention, zstd compression) | ~$70 |
| Compute (1 pod, 2 vCPU / 2 GB) | ~$30 |
| Monitoring (Prometheus + Grafana) | ~$5 |
| Total | ~$105 |
Limitations
- No cross-region protection — a regional outage affects both source and backup
- Single storage backend — no redundancy if S3 experiences an availability event
- Restores should run in the same region as the bucket — restoring into another region incurs cross-region data transfer charges
Architecture 2: Cross-Region Disaster Recovery
Overview
Extends Architecture 1 with S3 Cross-Region Replication (CRR) to maintain a replica of all backup data in a secondary region. A standby kafka-backup instance in the DR region can restore data to a pre-provisioned DR Kafka cluster, achieving a significantly lower RTO than a cold-start approach.
When to Use
- Multi-region availability is required
- RPO < 15 minutes is required
- RTO < 1 hour is required
- You need protection against a full regional outage
- Regulatory requirements mandate geographically separated copies
Architecture Diagram
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ Primary Region │ │ DR Region │
│ (us-east-1) │ │ (us-west-2) │
│ │ │ │
│ ┌─────────┐ ┌────────────┐ │ │ ┌────────────┐ ┌─────────┐ │
│ │ Kafka │ │ kafka- │ │ │ │ kafka- │ │ DR │ │
│ │ Cluster │──│ backup │ │ │ │ backup │──│ Kafka │ │
│ │ (prod) │ │ (backup) │ │ │ │ (restore) │ │ Cluster │ │
│ └─────────┘ └─────┬──────┘ │ │ └─────┬──────┘ └─────────┘ │
│ │ │ │ │ │
│ ┌──────▼──────┐ │ S3 CRR │ ┌────▼───────┐ │
│ │ S3 Bucket │─┼────────►│ │ S3 Bucket │ │
│ │ (primary) │ │ │ │ (replica) │ │
│ └─────────────┘ │ │ └────────────┘ │
└──────────────────────────────┘ └──────────────────────────────┘
Components
| Component | Purpose |
|---|---|
| Primary Kafka cluster | Production source data |
| kafka-backup (primary region) | Continuous backup to S3 |
| S3 bucket (primary) | Primary backup storage with versioning |
| S3 Cross-Region Replication | Asynchronous replication to DR region |
| S3 bucket (DR region) | Replica backup storage |
| kafka-backup (DR region) | Standby restore instance |
| DR Kafka cluster | Pre-provisioned restore target |
Configuration
Primary Region — backup.yaml
source:
  bootstrap_servers:
    - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
topic:
  include:
    - ".*"
storage:
  type: s3
  s3:
    bucket: my-org-kafka-backup-primary
    region: us-east-1
    prefix: prod/
backup:
  compression: zstd
  segment_max_bytes: 134217728
  continuous: true
  checkpoint_interval_secs: 60
metrics:
  enabled: true
  port: 9090
S3 Cross-Region Replication
{
  "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
  "Rules": [
    {
      "ID": "kafka-backup-crr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {
        "Prefix": "prod/"
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::my-org-kafka-backup-dr",
        "StorageClass": "STANDARD_IA"
      },
      "DeleteMarkerReplication": {
        "Status": "Disabled"
      }
    }
  ]
}
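CRR requires versioning to be enabled on both buckets before S3 will accept the rule. A sketch of applying it with the AWS CLI, using the bucket and role names from the example (the rule document is written to a temp file for review even when the CLI is unavailable):

```shell
#!/usr/bin/env bash
# Enable versioning on both buckets (a CRR prerequisite), then attach the
# replication rule to the primary bucket.
set -euo pipefail

SRC="my-org-kafka-backup-primary"
DST="my-org-kafka-backup-dr"

cat > /tmp/crr.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
  "Rules": [
    {
      "ID": "kafka-backup-crr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "prod/" },
      "Destination": {
        "Bucket": "arn:aws:s3:::my-org-kafka-backup-dr",
        "StorageClass": "STANDARD_IA"
      },
      "DeleteMarkerReplication": { "Status": "Disabled" }
    }
  ]
}
EOF

if command -v aws >/dev/null 2>&1; then
  for bucket in "$SRC" "$DST"; do
    aws s3api put-bucket-versioning --bucket "$bucket" \
      --versioning-configuration Status=Enabled
  done
  aws s3api put-bucket-replication --bucket "$SRC" \
    --replication-configuration file:///tmp/crr.json
else
  echo "aws CLI not found; review /tmp/crr.json and run the commands manually" >&2
fi
```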
DR Region — restore.yaml
source:
  type: s3
  s3:
    bucket: my-org-kafka-backup-dr
    region: us-west-2
    prefix: prod/
target:
  bootstrap_servers:
    - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
topic:
  include:
    - ".*"
restore:
  from_latest: true
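During a failover, the standby instance runs a restore against the DR Kafka cluster. A sketch of the invocation, assuming the CLI mirrors the backup invocation shown in Architecture 1 (the `restore` subcommand name is inferred from that invocation, so verify it against your installed version):

```shell
#!/usr/bin/env bash
# DR failover sketch: run the standby restore against the DR Kafka cluster.
# Assumes restore.yaml (above) is mounted at the path used by the Deployment.
set -euo pipefail

if command -v kafka-backup >/dev/null 2>&1; then
  kafka-backup restore --config /etc/kafka-backup/restore.yaml
else
  echo "kafka-backup binary not found; run inside the standby pod or image" >&2
fi

echo "ok" > /tmp/kafka-backup-dr-restore.status
```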
Cost Estimate
| Item | Monthly Cost |
|---|---|
| Primary region (Architecture 1) | ~$105 |
| S3 Cross-Region Replication (transfer + storage) | ~$50 |
| DR standby compute | ~$20 |
| Total | ~$175 |
Limitations
- S3 CRR replication lag (typically seconds to minutes) adds to effective RPO
- DR Kafka cluster incurs cost even when idle
- Manual or scripted failover — not automatic unless combined with health-check automation
- Cross-region data transfer costs increase with data volume
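Because CRR lag contributes directly to your effective RPO, it is worth probing it periodically. One hedged approach: upload a marker object to the primary bucket and poll its `ReplicationStatus` (reported by `head-object` as `PENDING`, then `COMPLETED` or `FAILED` for source objects under replication):

```shell
#!/usr/bin/env bash
# Estimate CRR lag: upload a timestamped marker, then poll until S3 no
# longer reports it as PENDING. Bucket name follows the example config.
set -euo pipefail

BUCKET="my-org-kafka-backup-primary"
KEY="prod/_replication-probe"

if command -v aws >/dev/null 2>&1; then
  start=$(date +%s)
  date -u > /tmp/probe.txt
  aws s3api put-object --bucket "$BUCKET" --key "$KEY" --body /tmp/probe.txt >/dev/null
  while true; do
    status=$(aws s3api head-object --bucket "$BUCKET" --key "$KEY" \
      --query ReplicationStatus --output text)
    [ "$status" != "PENDING" ] && break
    sleep 10
  done
  echo "replication status: $status after $(( $(date +%s) - start ))s"
else
  echo "aws CLI not found; run from a machine with access to the primary bucket" >&2
fi

echo "ok" > /tmp/crr-probe.status
```

Exporting the measured lag to Prometheus alongside the kafka-backup metrics gives you a single alerting surface for the end-to-end RPO.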
Architecture 3: Multi-Cloud Active-Passive DR
Overview
Protects against an entire cloud provider outage by maintaining backup data on a secondary cloud platform. The primary backup runs on AWS with S3 storage, while a cross-cloud sync process keeps an Azure Blob Storage copy up to date. A standby kafka-backup instance on Azure can restore to an Azure-hosted Kafka cluster.
When to Use
- Cloud provider failure protection is a business requirement
- RPO < 1 hour is acceptable
- RTO < 2 hours is acceptable
- Regulatory or contractual requirements mandate multi-cloud data residency
- Your organisation already operates infrastructure on multiple cloud providers
Architecture Diagram
┌─────────────────────────────┐ ┌──────────────────────────────┐
│ AWS (us-east-1) │ │ Azure (East US) │
│ │ │ │
│ ┌─────────┐ ┌────────────┐ │ │ ┌────────────┐ ┌───────────┐│
│ │ Kafka │ │ kafka- │ │ │ │ kafka- │ │ Azure ││
│ │ Cluster │─│ backup │ │ │ │ backup │─│ Kafka ││
│ │ (prod) │ │ (backup) │ │ │ │ (restore) │ │ Cluster ││
│ └─────────┘ └─────┬──────┘ │ │ └─────┬──────┘ └───────────┘│
│ │ │ │ │ │
│ ┌──────▼──────┐ │ rclone │ ┌────▼────────────┐ │
│ │ S3 Bucket │─┼─────────►│ │ Blob Storage │ │
│ │ │ │ sync │ │ Container │ │
│ └─────────────┘ │ │ └─────────────────┘ │
└─────────────────────────────┘ └──────────────────────────────┘
Components
| Component | Purpose |
|---|---|
| AWS Kafka cluster | Production source data |
| kafka-backup (AWS) | Continuous backup to S3 |
| S3 bucket | Primary backup storage |
| Cross-cloud sync (rclone) | Scheduled sync from S3 to Azure Blob |
| Azure Blob Storage | Secondary backup storage |
| kafka-backup (Azure) | Standby restore instance |
| Azure Kafka cluster | DR restore target |
Configuration
AWS — backup.yaml
source:
  bootstrap_servers:
    - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
topic:
  include:
    - ".*"
storage:
  type: s3
  s3:
    bucket: my-org-kafka-backup
    region: us-east-1
    prefix: prod/
backup:
  compression: zstd
  segment_max_bytes: 134217728
  continuous: true
  checkpoint_interval_secs: 60
metrics:
  enabled: true
  port: 9090
Azure — restore.yaml
source:
  type: azure
  azure:
    storage_account: myorgkafkabackupdr
    container: kafka-backup
    prefix: prod/
target:
  bootstrap_servers:
    - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
topic:
  include:
    - ".*"
restore:
  from_latest: true
Cross-Cloud Sync Script
#!/usr/bin/env bash
# sync-to-azure.sh — runs on a schedule (e.g., every 15 minutes via cron or K8s CronJob)
set -euo pipefail
RCLONE_CONFIG="/etc/rclone/rclone.conf"
SOURCE="aws-s3:my-org-kafka-backup/prod/"
DEST="azure-blob:kafka-backup/prod/"
echo "[$(date -u)] Starting cross-cloud sync..."
rclone sync "$SOURCE" "$DEST" \
  --config "$RCLONE_CONFIG" \
  --transfers 16 \
  --checkers 32 \
  --fast-list \
  --log-level INFO
echo "[$(date -u)] Sync complete."
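Since `rclone sync` can succeed while new segments are still arriving, it helps to follow each sync with an integrity pass. `rclone check` compares files between the two remotes, and `--one-way` only flags objects missing or differing on the destination:

```shell
#!/usr/bin/env bash
# Post-sync verification: confirm the Azure copy matches the S3 source.
# Remote names match the sync script above.
set -euo pipefail

RCLONE_CONFIG="/etc/rclone/rclone.conf"
SOURCE="aws-s3:my-org-kafka-backup/prod/"
DEST="azure-blob:kafka-backup/prod/"

if command -v rclone >/dev/null 2>&1; then
  rclone check "$SOURCE" "$DEST" \
    --config "$RCLONE_CONFIG" \
    --one-way \
    --fast-list \
    --log-level INFO
else
  echo "rclone not found; run alongside the sync job" >&2
fi

echo "ok" > /tmp/rclone-check.status
```

When the two remotes share no common hash type, rclone falls back to size-only comparison, so check its log output to confirm which comparison was actually used.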
Cost Estimate
| Item | Monthly Cost |
|---|---|
| Primary AWS (Architecture 1) | ~$105 |
| Azure Blob Storage | ~$80 |
| Cross-cloud sync (rclone compute + egress) | ~$50 |
| DR standby compute (Azure) | ~$30 |
| Total | ~$265 |
Limitations
- Cross-cloud sync introduces complexity and a potential failure point
- Network egress costs (AWS to Azure) scale linearly with data volume
- Separate credential management for each cloud provider
- Sync lag adds to effective RPO — monitor rclone metrics closely
- Requires expertise in both AWS and Azure infrastructure
Architecture 4: Air-Gapped Compliance Backup
Overview
Provides ransomware-proof, tamper-proof backup storage for regulated industries. Backup data is written to a primary S3 bucket, then transferred to a completely isolated AWS account with S3 Object Lock (WORM — Write Once, Read Many). The air-gapped account has no VPC peering or network connectivity to the production environment, ensuring that a compromised production account cannot modify or delete backup data.
When to Use
- Ransomware protection is a top priority
- RPO < 24 hours is acceptable
- RTO < 8 hours is acceptable
- Regulatory requirements mandate immutable, tamper-proof backups (financial services, healthcare, government)
- Compliance frameworks require geographically or logically separated backup copies
- You need to demonstrate chain-of-custody for audit purposes
Architecture Diagram
┌─────────────────────────────────┐ ┌──────────────────────────────────┐
│ Production Account │ │ Air-Gapped Account │
│ │ │ (no VPC peering, no network) │
│ ┌─────────┐ ┌───────────────┐ │ │ │
│ │ Kafka │ │ kafka-backup │ │ │ ┌──────────────────────────┐ │
│ │ Cluster │──│ (continuous) │ │ │ │ S3 Bucket │ │
│ └─────────┘ └──────┬────────┘ │ │ │ (Object Lock / WORM) │ │
│ │ │ │ │ (Glacier for archive) │ │
│ ┌───────▼───────┐ │ S3 │ └──────────────────────────┘ │
│ │ S3 Bucket │─┼─Batch──│ │
│ │ (primary) │ │ or │ ┌──────────────────────────┐ │
│ └───────────────┘ │ DataSync │ │ IAM: deny all deletes │ │
│ │ │ │ MFA-protected root only │ │
│ ┌──────────────────┐ │ │ └──────────────────────────┘ │
│ │ Prometheus + │ │ │ │
│ │ Grafana │ │ │ ┌─ ─────────────────────────┐ │
│ └──────────────────┘ │ │ │ CloudTrail audit logging │ │
│ │ │ └──────────────────────────┘ │
└─────────────────────────────────┘ └──────────────────────────────────┘
Components
| Component | Purpose |
|---|---|
| Kafka cluster | Production source data |
| kafka-backup (production account) | Continuous backup to primary S3 |
| S3 bucket (primary) | Initial backup storage |
| AWS S3 Batch / DataSync | Scheduled transfer to air-gapped account |
| S3 bucket (air-gapped, Object Lock) | Immutable WORM storage |
| Glacier transition | Long-term archive for cost optimisation |
| CloudTrail (air-gapped account) | Audit logging for compliance |
Configuration
Production Account — backup.yaml
source:
  bootstrap_servers:
    - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
    - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
topic:
  include:
    - ".*"
storage:
  type: s3
  s3:
    bucket: my-org-kafka-backup-prod
    region: us-east-1
    prefix: prod/
backup:
  compression: zstd
  segment_max_bytes: 134217728
  continuous: true
  checkpoint_interval_secs: 60
metrics:
  enabled: true
  port: 9090
S3 Object Lock Configuration (Air-Gapped Account)
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {
      "Mode": "COMPLIANCE",
      "Days": 365
    }
  }
}
COMPLIANCE mode prevents anyone — including the root user — from deleting or overwriting objects before the retention period expires. Use GOVERNANCE mode if you need the ability to override with special permissions during testing.
Air-Gapped Account IAM Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllDeleteOperations",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:PutBucketPolicy",
        "s3:DeleteBucketPolicy"
      ],
      "Resource": [
        "arn:aws:s3:::my-org-kafka-backup-airgap",
        "arn:aws:s3:::my-org-kafka-backup-airgap/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111111111111:root"
        }
      }
    },
    {
      "Sid": "AllowWriteFromProductionAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::222222222222:role/kafka-backup-transfer-role"
      },
      "Action": [
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-org-kafka-backup-airgap",
        "arn:aws:s3:::my-org-kafka-backup-airgap/*"
      ]
    }
  ]
}
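Object Lock is simplest to enable at bucket creation time. A creation sketch using the AWS CLI, run with air-gapped account credentials (bucket name follows the policy above; the retention document is written to a temp file for review even when the CLI is unavailable):

```shell
#!/usr/bin/env bash
# Create the air-gapped bucket with Object Lock enabled, then attach the
# COMPLIANCE-mode default retention. Enabling Object Lock at creation also
# turns on versioning, which it requires.
set -euo pipefail

BUCKET="my-org-kafka-backup-airgap"

cat > /tmp/object-lock.json <<'EOF'
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {
      "Mode": "COMPLIANCE",
      "Days": 365
    }
  }
}
EOF

if command -v aws >/dev/null 2>&1; then
  # us-east-1 needs no LocationConstraint; other regions require one
  aws s3api create-bucket --bucket "$BUCKET" \
    --object-lock-enabled-for-bucket
  aws s3api put-object-lock-configuration --bucket "$BUCKET" \
    --object-lock-configuration file:///tmp/object-lock.json
else
  echo "aws CLI not found; review /tmp/object-lock.json and run manually" >&2
fi
```

Test the retention in GOVERNANCE mode first: COMPLIANCE-mode objects cannot be deleted by anyone until the 365-day retention expires.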
Cost Estimate
| Item | Monthly Cost |
|---|---|
| Primary backup (Architecture 1) | ~$105 |
| Air-gapped S3 storage (Glacier + Object Lock) | ~$90 |
| Total | ~$195 |
Use S3 Intelligent-Tiering or lifecycle policies to transition older backups to Glacier Deep Archive after 90 days. This can reduce air-gapped storage costs by up to 70% for long-retention requirements.
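A lifecycle rule expressing that 90-day transition, sketched with the AWS CLI (rule ID is illustrative; the bucket name follows the examples above):

```shell
#!/usr/bin/env bash
# Transition air-gapped backups to Glacier Deep Archive after 90 days.
set -euo pipefail

BUCKET="my-org-kafka-backup-airgap"

cat > /tmp/lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-after-90-days",
      "Status": "Enabled",
      "Filter": { "Prefix": "prod/" },
      "Transitions": [
        { "Days": 90, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF

if command -v aws >/dev/null 2>&1; then
  aws s3api put-bucket-lifecycle-configuration --bucket "$BUCKET" \
    --lifecycle-configuration file:///tmp/lifecycle.json
else
  echo "aws CLI not found; review /tmp/lifecycle.json and run manually" >&2
fi
```

Keep in mind that standard retrieval from Deep Archive can take up to 12 hours, so keep backups you might need within your RTO window in a warmer tier.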
Limitations
- Higher RTO due to the air gap — restoring requires transferring data back from the isolated account
- Transfer scheduling adds complexity (S3 Batch operations, DataSync jobs)
- MFA-protected root account access is required for emergency operations in the air-gapped account
- Object Lock retention cannot be shortened once set in COMPLIANCE mode
- Testing restores from the air-gapped account requires careful planning to avoid violating the air gap
Architecture 5: Kubernetes GitOps Backup Pipeline
Overview
A fully declarative, Kubernetes-native approach where backup and restore operations are managed through Custom Resource Definitions (CRDs) and reconciled by a GitOps controller such as ArgoCD or Flux. All configuration lives in a Git repository, providing version history, peer review, and automated rollout for every change.
When to Use
- Your team already operates a Kubernetes platform with GitOps tooling
- RPO < 1 hour is acceptable
- RTO < 2 hours is acceptable
- You want all backup configuration versioned, reviewed, and auditable in Git
- You need to manage backup across multiple environments (dev, staging, prod) consistently
Architecture Diagram
┌──────────────┐ ┌──────────────────────────────────────────────────────┐
│ Git Repo │ │ Kubernetes Cluster │
│ │ │ │
│ envs/ │ │ ┌──────────┐ ┌───────────────────────────────┐ │
│ └─ prod/ │────►│ │ ArgoCD │───►│ kafka-backup Operator │ │
│ ├─ app.yaml │ └──────────┘ │ │ │
│ ├─ backup.yaml │ │ ┌─────────────────────────┐ │ │
│ ├─ monitor.yaml │ │ │ KafkaBackup CR │ │ │
│ └─ restore.yaml │ │ │ (reconciles backup │ │ │
│ │ │ │ │ pods automatically) │ │ │
└──────────────┘ │ │ └───────────┬─────────────┘ │ │
│ └──────────────┼────────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ kafka-backup│ │
│ │ pods │ │
│ └──────┬──────┘ │
│ │ │
│ ┌──────────────────┐ ┌──────▼──────┐ │
│ │ Prometheus + │ │ S3 Bucket │ │
│ │ Grafana │ │ │ │
│ │ (ServiceMonitor) │ └─────────────┘ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────┘
Components
| Component | Purpose |
|---|---|
| Git repository | Single source of truth for all backup configuration |
| ArgoCD / Flux | GitOps controller, reconciles desired state |
| kafka-backup Operator | Watches KafkaBackup/KafkaRestore CRDs, manages pods |
| KafkaBackup CRD | Declarative backup specification |
| KafkaRestore CRD | Declarative restore specification |
| Prometheus ServiceMonitor | Auto-discovered metrics scraping |
| S3 bucket | Backup storage |
Configuration
Git Repository Structure
environments/
└── prod/
    ├── kustomization.yaml
    ├── argocd-application.yaml
    ├── kafka-backup-crd.yaml
    ├── kafka-restore-crd.yaml
    └── service-monitor.yaml
ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kafka-backup-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/kafka-backup-config.git
    targetRevision: main
    path: environments/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: kafka-backup
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
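Bootstrapping is a one-time `kubectl apply` of the Application; ArgoCD reconciles everything else from Git afterwards (the manifest filename is illustrative):

```shell
#!/usr/bin/env bash
# One-time bootstrap: register the Application, then inspect the first sync.
set -euo pipefail

if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f argocd-application.yaml
  # Inspect sync/health status; requires a logged-in argocd CLI
  if command -v argocd >/dev/null 2>&1; then
    argocd app get kafka-backup-prod
  fi
else
  echo "kubectl not found; apply argocd-application.yaml from a cluster-connected machine" >&2
fi

echo "ok" > /tmp/argocd-bootstrap.status
```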
KafkaBackup Custom Resource
apiVersion: kafka-backup.osodevops.io/v1alpha1
kind: KafkaBackup
metadata:
  name: prod-backup
  namespace: kafka-backup
spec:
  source:
    bootstrapServers:
      - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
      - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
      - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
  topicSelector:
    include:
      - ".*"
  storage:
    type: s3
    s3:
      bucket: my-org-kafka-backup
      region: us-east-1
      prefix: prod/
  backup:
    compression: zstd
    segmentMaxBytes: 134217728
    continuous: true
    checkpointIntervalSecs: 60
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 2Gi
KafkaRestore Custom Resource
apiVersion: kafka-backup.osodevops.io/v1alpha1
kind: KafkaRestore
metadata:
  name: prod-restore
  namespace: kafka-backup
spec:
  source:
    type: s3
    s3:
      bucket: my-org-kafka-backup
      region: us-east-1
      prefix: prod/
  target:
    bootstrapServers:
      - kafka-0.kafka-headless.kafka.svc.cluster.local:9092
      - kafka-1.kafka-headless.kafka.svc.cluster.local:9092
      - kafka-2.kafka-headless.kafka.svc.cluster.local:9092
  topicSelector:
    include:
      - ".*"
  restore:
    fromLatest: true
  # Keep the restore paused until it is needed; set to false to trigger it
  paused: true
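With the restore committed in a paused state, triggering it is a one-line change. In a strict GitOps flow you flip the flag via pull request; in an emergency you can patch the live object directly, assuming the operator registers the `kafkarestore` resource name (note that ArgoCD's selfHeal will revert the patch unless auto-sync is disabled first):

```shell
#!/usr/bin/env bash
# Emergency trigger: unpause the KafkaRestore directly on the cluster.
set -euo pipefail

cat > /tmp/unpause.json <<'EOF'
{ "spec": { "paused": false } }
EOF

if command -v kubectl >/dev/null 2>&1; then
  # Disable ArgoCD auto-sync first, or selfHeal will re-pause the restore
  kubectl patch kafkarestore prod-restore -n kafka-backup \
    --type merge -p "$(cat /tmp/unpause.json)"
else
  echo "kubectl not found; run from a cluster-connected machine" >&2
fi
```

Follow the emergency patch with a commit to Git so the repository and the cluster converge again.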
Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-backup
  namespace: kafka-backup
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: kafka-backup
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
With GitOps, every configuration change goes through a pull request. This gives you a full audit trail, peer review, and the ability to roll back any change by reverting a commit.
Cost Estimate
| Item | Monthly Cost |
|---|---|
| Base backup infrastructure (Architecture 1) | ~$105 |
| GitOps tooling (ArgoCD/Flux — typically already deployed) | ~$0 |
| Total | ~$105 |
Limitations
- Requires Kubernetes and GitOps expertise on the team
- Operator learning curve — custom resources add an abstraction layer
- CRD schema changes require careful upgrade planning
- ArgoCD/Flux must be operational for configuration changes to propagate (backup continues running if GitOps is temporarily down)
Choosing an Architecture
Use the comparison table at the top of this page as a starting point. Then consider these questions:
- What is your RPO/RTO budget? If < 15 min RPO is required, start with Architecture 2 (Cross-Region DR).
- Do you need multi-cloud protection? Architecture 3 is the only option that survives a full cloud provider outage.
- Are you in a regulated industry? Architecture 4 (Air-Gapped) provides the immutability guarantees auditors look for.
- Is your team already running GitOps? Architecture 5 adds minimal overhead and maximum auditability.
- Just getting started? Architecture 1 is the fastest path to a working, production-grade backup.
All architectures can be combined. For example, you can run Architecture 5 (GitOps) as your deployment model while using Architecture 2 (Cross-Region) as your storage topology and Architecture 4 (Air-Gapped) as an additional compliance layer.