Confluent RBAC Backup & Restore

OSO Kafka Backup Enterprise backs up Confluent Platform RBAC role bindings from the Metadata Service (MDS), ensuring your security posture is recoverable after a disaster.

The Problem

Confluent Platform customers running RBAC accumulate hundreds or thousands of role bindings across Kafka, Schema Registry, Connect, ksqlDB, and Flink. After a DR event:

Without RBAC backup: Every binding must be manually recreated. A cluster with 500+ bindings across 6 component types takes 4-8 hours and has a 15-25% human error rate.
With RBAC backup: All bindings are restored in under 5 minutes in the correct dependency order. Components start successfully on first attempt.

Quick Start

Standalone RBAC Backup

# rbac-backup-config.yaml
mode: backup
backup_id: rbac-snapshot-2026-04-06
storage:
  backend: s3
  bucket: kafka-backups
  prefix: production

enterprise:
  confluent_rbac:
    mds_url: "https://mds.prod.company.com:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}

kafka-backup rbac-backup --config rbac-backup-config.yaml

Combined Backup (Kafka Data + Schemas + RBAC)

Add a confluent_rbac block alongside your existing configuration. Kafka data, schemas, and RBAC bindings are then backed up in a single run:

# full-backup-config.yaml
mode: backup
backup_id: daily-2026-04-06
source:
  bootstrap_servers: ["kafka:9092"]
storage:
  backend: s3
  bucket: kafka-backups

enterprise:
  schema_registry:
    url: "https://schema-registry:8081"
    auth:
      type: basic
      username: ${SR_USERNAME}
      password: ${SR_PASSWORD}

  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}

kafka-backup backup --config full-backup-config.yaml

This runs Kafka data backup, Schema Registry backup, and RBAC backup in sequence.

What Gets Backed Up

Item	Description
Cluster registry	All clusters registered in MDS (Kafka, Connect, SR, ksqlDB, Flink)
Principals	All users and groups with role bindings
Cluster-scoped bindings	Roles granted at the cluster level (e.g., SystemAdmin, ClusterAdmin)
Resource-scoped bindings	Roles on specific resources (e.g., DeveloperRead on Topic:orders)
Resource patterns	Both LITERAL and PREFIXED pattern types
All predefined roles	SystemAdmin, UserAdmin, ClusterAdmin, SecurityAdmin, AuditAdmin, Operator, ResourceOwner, DeveloperRead, DeveloperWrite, DeveloperManage

Configuration Reference

MDS Authentication

MDS uses a two-phase authentication model: Basic credentials are exchanged for a JWT bearer token, which is then used for all subsequent API calls. The token is automatically refreshed when it expires.

enterprise:
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}
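
The two-phase flow described above can be sketched roughly as follows. This is an illustrative sketch, not the tool's actual implementation: the endpoint path `/security/1.0/authenticate` and the response fields `auth_token` and `expires_in` are assumptions that should be checked against your MDS version, and the function names and the 30-second refresh margin are hypothetical.

```python
import base64
import json
import time
import urllib.request

# Assumed path of the MDS token exchange endpoint; verify for your MDS version.
AUTH_PATH = "/security/1.0/authenticate"


def basic_auth_header(username: str, password: str) -> str:
    """Build the HTTP Basic header used for the initial token request."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def bearer_header(token: str) -> str:
    """Build the header sent on every subsequent MDS API call."""
    return "Bearer " + token


def needs_refresh(issued_at: float, expires_in: float, now: float, skew: float = 30.0) -> bool:
    """Return True once the token is within `skew` seconds of expiring."""
    return now >= issued_at + expires_in - skew


def fetch_token(mds_url: str, username: str, password: str) -> dict:
    """Phase 1: exchange Basic credentials for a bearer token (network call)."""
    request = urllib.request.Request(
        mds_url.rstrip("/") + AUTH_PATH,
        headers={"Authorization": basic_auth_header(username, password)},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    # Field names below are assumptions about the MDS response shape.
    return {
        "token": body["auth_token"],
        "expires_in": body["expires_in"],
        "issued_at": time.time(),
    }
```

A client built this way would call `fetch_token` once, send `bearer_header(token)` on each request, and call `fetch_token` again whenever `needs_refresh` returns True.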

Required MDS Permissions

For backup: SecurityAdmin on all scopes (read-only access to bindings)
For restore: UserAdmin on the root kafka-cluster (can create/modify bindings)

Recommended: Create a dedicated service account User:kafka-backup-rbac with these roles.

TLS Configuration

enterprise:
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}
    tls:
      ca_cert: /certs/ca.pem
      client_cert: /certs/client.pem    # For mTLS
      client_key: /certs/client-key.pem  # For mTLS

Principal Filtering

Control which principals are included in the backup using glob patterns:

enterprise:
  confluent_rbac:
    backup:
      # Include patterns (default: ["*"] = all principals)
      principals:
        - "User:app-*"
        - "Group:*-team"
      # Exclude patterns
      exclude_principals:
        - "User:kafka-backup-rbac"  # Exclude the backup service account itself
      # Filter by specific Kafka clusters (default: all discovered)
      cluster_filter:
        - "lkc-abc123"
      # Include sub-clusters: Connect, SR, ksqlDB, Flink (default: true)
      include_sub_clusters: true
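
To make the include/exclude semantics concrete, here is a minimal sketch of glob-based principal selection. It assumes that a principal is kept when it matches at least one include pattern and no exclude pattern, and it uses Python's `fnmatch` globbing as a stand-in; the tool's exact matching rules may differ. The function name `select_principals` is hypothetical.

```python
from fnmatch import fnmatchcase


def select_principals(principals, include=("*",), exclude=()):
    """Keep principals that match any include glob and no exclude glob."""
    selected = []
    for principal in principals:
        included = any(fnmatchcase(principal, pattern) for pattern in include)
        excluded = any(fnmatchcase(principal, pattern) for pattern in exclude)
        if included and not excluded:
            selected.append(principal)
    return selected


# Example input mirroring the patterns in the configuration above.
example = select_principals(
    ["User:app-orders", "User:alice", "Group:data-team", "User:kafka-backup-rbac"],
    include=["User:app-*", "Group:*-team"],
    exclude=["User:kafka-backup-rbac"],
)
```

With these patterns, `User:alice` is dropped because it matches no include pattern, and the backup service account is dropped by the exclude list.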

Connection Tuning

MDS enforces a rate limit of 15 requests per second. The backup client's rate_limit_rps setting defaults to 12 to leave headroom:

enterprise:
  confluent_rbac:
    connection:
      timeout_ms: 30000        # Request timeout (default: 30s)
      max_retries: 3           # Retry attempts on 429/5xx
      retry_backoff_ms: 1000   # Initial retry backoff (doubles per retry)
      rate_limit_rps: 12       # Max requests per second (default: 12)
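
The arithmetic behind these settings is easy to check by hand. The sketch below is illustrative only and assumes pure doubling with no jitter or cap; the helper names are hypothetical and not part of the tool.

```python
def backoff_schedule_ms(max_retries: int = 3, retry_backoff_ms: int = 1000) -> list:
    """Delay before each retry: the initial backoff, doubled per attempt."""
    return [retry_backoff_ms * (2 ** attempt) for attempt in range(max_retries)]


def min_interval_ms(rate_limit_rps: float = 12) -> float:
    """Minimum spacing between requests that keeps the client at the limit."""
    return 1000.0 / rate_limit_rps


def should_retry(status: int) -> bool:
    """Retry on 429 (rate limited) and on 5xx server errors."""
    return status == 429 or 500 <= status <= 599
```

With the defaults, a request that keeps failing waits 1 s, 2 s, and 4 s before its three retries, and requests are spaced about 83 ms apart.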

Restore Ordering

RBAC bindings have implicit dependencies, so the restore engine applies them in five tiers, lowest tier first:

Tier	Priority	Roles	Purpose
0	Bootstrap	SystemAdmin, UserAdmin (kafka-cluster scope)	Root admin access — must exist first
1	Components	Service accounts for SR, Connect, ksqlDB, Flink, C3	Components need bindings to start
2	Cluster Admin	ClusterAdmin, SecurityAdmin, AuditAdmin, Operator	Cluster-level administrative roles
3	Resource Owner	ResourceOwner	Ownership of specific resources
4	Developer	DeveloperRead, DeveloperWrite, DeveloperManage	Application-level access

This ordering ensures:

Admin accounts exist before they can delegate access
Component service accounts (Connect, Schema Registry, ksqlDB) have their required bindings before the components restart
Resource owners are established before developer access is granted
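
The tiering above can be approximated with a small classifier. This sketch makes several assumptions that the table does not settle: component service accounts are identified by a caller-supplied set (`component_principals`, hypothetical), component principals take precedence over the bootstrap check, SystemAdmin or UserAdmin on a non-root scope falls into Tier 2, and unknown roles are applied last.

```python
BOOTSTRAP_ROLES = {"SystemAdmin", "UserAdmin"}

TIER_BY_ROLE = {
    "ClusterAdmin": 2, "SecurityAdmin": 2, "AuditAdmin": 2, "Operator": 2,
    "ResourceOwner": 3,
    "DeveloperRead": 4, "DeveloperWrite": 4, "DeveloperManage": 4,
}


def tier_of(binding: dict, component_principals: set) -> int:
    """Map a binding to its restore tier (0 is applied first)."""
    # Tier 1: bindings held by component service accounts (assumed precedence).
    if binding["principal"] in component_principals:
        return 1
    role = binding["role_name"]
    clusters = binding["scope"]["clusters"]
    if role in BOOTSTRAP_ROLES:
        # Tier 0 only at the root kafka-cluster scope; Tier 2 elsewhere (assumption).
        return 0 if set(clusters) == {"kafka-cluster"} else 2
    # Unknown roles are applied last (assumption).
    return TIER_BY_ROLE.get(role, 4)


def restore_order(bindings: list, component_principals: set) -> list:
    """Sort bindings by tier; the stable sort keeps original order within a tier."""
    return sorted(bindings, key=lambda b: tier_of(b, component_principals))
```

Passing the snapshot's `bindings` array and the set of known component principals to `restore_order` yields a list that can be applied front to back.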

Storage Format

RBAC snapshots are stored in the backup storage alongside Kafka data:

{backup_id}/
  manifest.json                          # Main backup manifest
  topics/...                             # Kafka data
  schema-registry/...                    # Schema Registry backup
  enterprise/
    confluent-rbac/
      rbac-snapshot.json                 # Full RBAC snapshot (bindings + metadata)
      cluster-registry.json              # MDS cluster registry export
      rbac-metadata.json                 # Backup statistics

Snapshot Format

The rbac-snapshot.json file holds the backup metadata, the cluster registry, every captured binding, and summary statistics:

{
  "metadata": {
    "mds_url": "https://mds.prod:8090",
    "backup_timestamp": "2026-04-06T10:00:00Z",
    "backup_id": "daily-2026-04-06"
  },
  "cluster_registry": [
    {
      "clusterName": "prod-kafka",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } }
    }
  ],
  "bindings": [
    {
      "principal": "User:admin",
      "role_name": "SystemAdmin",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } },
      "binding_type": "ClusterScoped"
    },
    {
      "principal": "User:alice",
      "role_name": "DeveloperRead",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } },
      "binding_type": "ResourceScoped",
      "resource_patterns": [
        { "resourceType": "Topic", "name": "orders", "patternType": "LITERAL" }
      ]
    }
  ],
  "stats": {
    "total_bindings": 347,
    "cluster_scoped_bindings": 23,
    "resource_scoped_bindings": 324,
    "unique_principals": 42,
    "unique_roles": 9,
    "cluster_scopes_count": 5,
    "duration_ms": 21400
  }
}
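
Because the snapshot carries its own `stats` block, a quick consistency check is possible after download. The sketch below recomputes the counts that can be derived from the `bindings` array and reports any disagreement. It assumes the field names shown in the example above, and the helper names are hypothetical.

```python
import json


def load_snapshot(path: str) -> dict:
    """Read an rbac-snapshot.json file from local disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def compute_stats(snapshot: dict) -> dict:
    """Recompute the counts that can be derived from the bindings array."""
    bindings = snapshot.get("bindings", [])
    return {
        "total_bindings": len(bindings),
        "cluster_scoped_bindings": sum(
            1 for b in bindings if b.get("binding_type") == "ClusterScoped"
        ),
        "resource_scoped_bindings": sum(
            1 for b in bindings if b.get("binding_type") == "ResourceScoped"
        ),
        "unique_principals": len({b["principal"] for b in bindings}),
        "unique_roles": len({b["role_name"] for b in bindings}),
    }


def stats_mismatches(snapshot: dict) -> dict:
    """Return {field: (recorded, recomputed)} for each count that disagrees."""
    recorded = snapshot.get("stats", {})
    recomputed = compute_stats(snapshot)
    return {
        field: (recorded.get(field), value)
        for field, value in recomputed.items()
        if recorded.get(field) != value
    }
```

An empty result from `stats_mismatches` indicates that the recorded statistics agree with the bindings actually present in the file.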

Use Cases

Disaster Recovery

# Backup from production
enterprise:
  confluent_rbac:
    mds_url: "https://mds.prod:8090"
    auth:
      username: ${MDS_PROD_USER}
      password: ${MDS_PROD_PASS}

# Restore to DR site (future milestone)
# enterprise:
#   confluent_rbac:
#     restore:
#       mds_url: "https://mds.dr:8090"
#       cluster_id_mapping:
#         kafka_cluster:
#           "lkc-prod": "lkc-dr"

Compliance Audit

Periodic RBAC snapshots prove security posture continuity:

# Schedule daily RBAC snapshots
kafka-backup rbac-backup --config rbac-audit.yaml

# Compare snapshots to detect drift (future milestone)
# kafka-backup rbac-diff --backup-id snap-1 --backup-id snap-2
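
Until a built-in diff is available, drift between two downloaded snapshots can be approximated offline. This is a rough sketch that is independent of the tool: it assumes the snapshot layout shown under Snapshot Format, and it treats a binding as the combination of principal, role, scope, and resource pattern. The function names are hypothetical.

```python
import json


def binding_keys(snapshot: dict) -> set:
    """Flatten bindings into comparable (principal, role, scope, pattern) tuples."""
    keys = set()
    for binding in snapshot.get("bindings", []):
        scope = json.dumps(binding.get("scope", {}), sort_keys=True)
        # Cluster-scoped bindings have no resource patterns; use one null entry.
        patterns = binding.get("resource_patterns") or [None]
        for pattern in patterns:
            keys.add((
                binding["principal"],
                binding["role_name"],
                scope,
                json.dumps(pattern, sort_keys=True),
            ))
    return keys


def diff_snapshots(old: dict, new: dict) -> dict:
    """Report bindings added to or removed from `new` relative to `old`."""
    old_keys, new_keys = binding_keys(old), binding_keys(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
    }
```

Serializing the scope and pattern with sorted keys makes the comparison insensitive to JSON key order, so only real changes in bindings show up as drift.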

Environment Cloning

Clone security posture from production to staging with principal remapping (future milestone).

Component Service Account Bindings

The backup captures all required bindings for Confluent Platform components:

Component	Required Bindings
Schema Registry	SecurityAdmin on SR scope + ResourceOwner on `_schemas` topic and coordination group
Connect	SecurityAdmin on Connect scope + ResourceOwner on configs, offsets, status topics + consumer group
ksqlDB	SecurityAdmin on ksqlDB scope + ResourceOwner on ksqlDB cluster, command topic, processing log, consumer group prefix
Control Center	SystemAdmin on kafka-cluster
Flink	SecurityAdmin on CMF/Flink environment scope

These bindings are automatically classified as Tier 1 (ComponentService) and restored early to ensure components can start.

Troubleshooting

Issue	Cause	Solution
Authentication failed	Invalid MDS credentials	Verify `MDS_USERNAME`/`MDS_PASSWORD` env vars
403 Forbidden	Backup user lacks SecurityAdmin	Grant SecurityAdmin to the backup service account on every scope being backed up
429 Too Many Requests	MDS rate limit exceeded	Reduce `rate_limit_rps` (default 12 is usually safe)
No clusters found	MDS cluster registry is empty	Verify MDS is running and clusters are registered
Token expired errors	Long backup with token expiry	Token auto-refreshes; if persistent, check MDS token configuration
Partial backup	Some principals failed to enumerate	Check MDS logs; backup continues with available data

Requirements

Confluent Platform 5.4+ with RBAC enabled
MDS running on Confluent Server brokers
HTTP(S) access to MDS REST API (port 8090 by default)
SecurityAdmin or SystemAdmin role for the backup service account
Enterprise license with rbac feature enabled
