Confluent RBAC Backup & Restore

OSO Kafka Backup Enterprise backs up Confluent Platform RBAC role bindings from the Metadata Service (MDS), ensuring your security posture is recoverable after a disaster.

The Problem

Confluent Platform customers running RBAC accumulate hundreds or thousands of role bindings across Kafka, Schema Registry, Connect, ksqlDB, and Flink. After a DR event:

  • Without RBAC backup: Every binding must be manually recreated. A cluster with 500+ bindings across 6 component types takes 4-8 hours and has a 15-25% human error rate.
  • With RBAC backup: All bindings are restored in under 5 minutes in the correct dependency order. Components start successfully on first attempt.

Quick Start

Standalone RBAC Backup

rbac-backup-config.yaml
mode: backup
backup_id: rbac-snapshot-2026-04-06
storage:
  backend: s3
  bucket: kafka-backups
  prefix: production

enterprise:
  confluent_rbac:
    mds_url: "https://mds.prod.company.com:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}

kafka-backup rbac-backup --config rbac-backup-config.yaml

Combined Backup (Kafka Data + Schemas + RBAC)

Add a confluent_rbac block alongside your existing config, and everything backs up together:

full-backup-config.yaml
mode: backup
backup_id: daily-2026-04-06
source:
  bootstrap_servers: ["kafka:9092"]
storage:
  backend: s3
  bucket: kafka-backups

enterprise:
  schema_registry:
    url: "https://schema-registry:8081"
    auth:
      type: basic
      username: ${SR_USERNAME}
      password: ${SR_PASSWORD}

  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}

kafka-backup backup --config full-backup-config.yaml

This runs Kafka data backup, Schema Registry backup, and RBAC backup in sequence.

What Gets Backed Up

| Item | Description |
| --- | --- |
| Cluster registry | All clusters registered in MDS (Kafka, Connect, SR, ksqlDB, Flink) |
| Principals | All users and groups with role bindings |
| Cluster-scoped bindings | Roles granted at the cluster level (e.g., SystemAdmin, ClusterAdmin) |
| Resource-scoped bindings | Roles on specific resources (e.g., DeveloperRead on Topic:orders) |
| Resource patterns | Both LITERAL and PREFIXED pattern types |
| All predefined roles | SystemAdmin, UserAdmin, ClusterAdmin, SecurityAdmin, AuditAdmin, Operator, ResourceOwner, DeveloperRead, DeveloperWrite, DeveloperManage |

Configuration Reference

MDS Authentication

MDS uses a two-phase authentication model: Basic credentials are exchanged for a JWT bearer token, which is then used for all subsequent API calls. The token is automatically refreshed when it expires.
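
The two-phase flow can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the endpoint path in AUTH_PATH is an assumption based on the Confluent MDS REST API, and TokenCache, fetch_token, and refresh_margin_s are hypothetical names introduced here. The HTTP call itself is left to a caller-supplied function so the refresh logic stays easy to follow.

```python
import base64
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumed MDS token endpoint path; confirm it against your Confluent Platform version.
AUTH_PATH = "/security/1.0/authenticate"


def basic_auth_header(username: str, password: str) -> str:
    """Phase 1: encode the credentials for the HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


@dataclass
class TokenCache:
    """Phase 2: hold the bearer token and fetch a new one shortly before it expires."""

    # Hypothetical callback: performs the Basic-auth request and
    # returns (token, lifetime in seconds).
    fetch_token: Callable[[], Tuple[str, float]]
    clock: Callable[[], float] = time.monotonic
    refresh_margin_s: float = 30.0
    _token: Optional[str] = None
    _expires_at: float = 0.0

    def bearer_header(self) -> str:
        """Return the Authorization header value, refreshing the token if needed."""
        expiring = self.clock() >= self._expires_at - self.refresh_margin_s
        if self._token is None or expiring:
            token, lifetime_s = self.fetch_token()
            self._token = token
            self._expires_at = self.clock() + lifetime_s
        return f"Bearer {self._token}"
```

Every API call would then send the value of bearer_header() as its Authorization header, so the token is renewed transparently during a long backup.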

enterprise:
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}

Required MDS Permissions

  • For backup: SecurityAdmin on all scopes (read-only access to bindings)
  • For restore: UserAdmin on the root kafka-cluster (can create/modify bindings)

Recommended: Create a dedicated service account User:kafka-backup-rbac with these roles.

TLS Configuration

enterprise:
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: admin
      password: secret
    tls:
      ca_cert: /certs/ca.pem
      client_cert: /certs/client.pem    # For mTLS
      client_key: /certs/client-key.pem # For mTLS

Principal Filtering

Control which principals are included in the backup using glob patterns:

enterprise:
  confluent_rbac:
    backup:
      # Include patterns (default: ["*"] = all principals)
      principals:
        - "User:app-*"
        - "Group:*-team"
      # Exclude patterns
      exclude_principals:
        - "User:kafka-backup-rbac" # Exclude the backup service account itself
      # Filter by specific Kafka clusters (default: all discovered)
      cluster_filter:
        - "lkc-abc123"
      # Include sub-clusters: Connect, SR, ksqlDB, Flink (default: true)
      include_sub_clusters: true
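
To preview which principals a set of patterns would select, the include/exclude semantics can be approximated in a few lines. This sketch assumes shell-style globbing as implemented by Python's fnmatch module and assumes that exclude patterns take precedence over include patterns; the tool's exact matching rules may differ. The function name select_principals is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def select_principals(
    principals: Iterable[str],
    include: Sequence[str] = ("*",),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Keep principals that match any include pattern and no exclude pattern."""
    selected = []
    for principal in principals:
        included = any(fnmatchcase(principal, pattern) for pattern in include)
        excluded = any(fnmatchcase(principal, pattern) for pattern in exclude)
        if included and not excluded:
            selected.append(principal)
    return selected


if __name__ == "__main__":
    candidates = ["User:app-orders", "Group:data-team", "User:alice", "User:kafka-backup-rbac"]
    print(select_principals(
        candidates,
        include=["User:app-*", "Group:*-team"],
        exclude=["User:kafka-backup-rbac"],
    ))
```

Matching is case-sensitive here (fnmatchcase), which seems the safer default for principal names.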

Connection Tuning

MDS has a rate limit of 15 requests per second. The rate_limit_rps setting defaults to 12 requests per second to leave headroom:

enterprise:
  confluent_rbac:
    connection:
      timeout_ms: 30000      # Request timeout (default: 30s)
      max_retries: 3         # Retry attempts on 429/5xx
      retry_backoff_ms: 1000 # Initial retry backoff (doubles per retry)
      rate_limit_rps: 12     # Max requests per second (default: 12)
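
To see what these settings imply in practice, the arithmetic can be worked through with a short sketch. It assumes plain doubling without jitter and even spacing of requests, which may not match the tool's internals exactly; the function names are hypothetical.

```python
from typing import List


def backoff_schedule_ms(initial_ms: int, max_retries: int) -> List[int]:
    """Delay before each retry, doubling from the initial backoff."""
    return [initial_ms * (2 ** attempt) for attempt in range(max_retries)]


def min_request_interval_ms(rate_limit_rps: float) -> float:
    """Smallest gap between requests that keeps the client at or under the limit."""
    return 1000.0 / rate_limit_rps


def should_retry(status_code: int) -> bool:
    """Retry only on 429 (rate limited) and 5xx (server error) responses."""
    return status_code == 429 or 500 <= status_code <= 599


if __name__ == "__main__":
    print(backoff_schedule_ms(1000, 3))          # delays for the default settings
    print(round(min_request_interval_ms(12), 1)) # gap between requests in ms
```

With the defaults, a request that keeps failing waits 1000, 2000, and 4000 ms before its three retries, and requests are spaced about 83 ms apart.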

Restore Ordering

RBAC bindings have implicit dependencies. The restore engine applies bindings in 5 tiers to ensure correct ordering:

| Tier | Priority | Roles | Purpose |
| --- | --- | --- | --- |
| 0 | Bootstrap | SystemAdmin, UserAdmin (kafka-cluster scope) | Root admin access; must exist first |
| 1 | Components | Service accounts for SR, Connect, ksqlDB, Flink, C3 | Components need bindings to start |
| 2 | Cluster Admin | ClusterAdmin, SecurityAdmin, AuditAdmin, Operator | Cluster-level administrative roles |
| 3 | Resource Owner | ResourceOwner | Ownership of specific resources |
| 4 | Developer | DeveloperRead, DeveloperWrite, DeveloperManage | Application-level access |

This ordering ensures:

  • Admin accounts exist before they can delegate access
  • Component service accounts (Connect, Schema Registry, ksqlDB) have their required bindings before the components restart
  • Resource owners are established before developer access is granted
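
The tiering can be illustrated with a small classifier over bindings in the snapshot format. This is a sketch under several assumptions, not the restore engine's actual logic: the set of component service accounts is supplied by the caller (how the tool detects them is not described here), component accounts are checked before the bootstrap rule, SystemAdmin or UserAdmin on a non-Kafka scope is treated as Tier 2, and unrecognized roles fall into the last tier.

```python
from typing import AbstractSet, Dict, List

BOOTSTRAP_ROLES = {"SystemAdmin", "UserAdmin"}
CLUSTER_ADMIN_ROLES = {"ClusterAdmin", "SecurityAdmin", "AuditAdmin", "Operator"}


def is_kafka_cluster_scope(binding: Dict) -> bool:
    """True when the scope names only a Kafka cluster (no Connect, SR, ksqlDB, ...)."""
    clusters = binding.get("scope", {}).get("clusters", {})
    return set(clusters) == {"kafka-cluster"}


def restore_tier(binding: Dict, component_principals: AbstractSet[str] = frozenset()) -> int:
    """Map a binding to a restore tier (0 is applied first, 4 last)."""
    role = binding["role_name"]
    if binding["principal"] in component_principals:
        return 1  # assumption: component accounts are identified by principal name
    if role in BOOTSTRAP_ROLES and is_kafka_cluster_scope(binding):
        return 0
    if role in CLUSTER_ADMIN_ROLES or role in BOOTSTRAP_ROLES:
        return 2  # assumption: admin roles on non-Kafka scopes land here
    if role == "ResourceOwner":
        return 3
    return 4  # developer roles, plus anything unrecognized (assumption)


def order_for_restore(
    bindings: List[Dict], component_principals: AbstractSet[str] = frozenset()
) -> List[Dict]:
    """Stable sort by tier, so the original order is kept within each tier."""
    return sorted(bindings, key=lambda b: restore_tier(b, component_principals))
```

Because Python's sort is stable, bindings within one tier keep their snapshot order, which makes a restore plan reproducible from the same snapshot.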

Storage Format

RBAC snapshots are stored in the backup storage alongside Kafka data:

{backup_id}/
  manifest.json                 # Main backup manifest
  topics/...                    # Kafka data
  schema-registry/...           # Schema Registry backup
  enterprise/
    confluent-rbac/
      rbac-snapshot.json        # Full RBAC snapshot (bindings + metadata)
      cluster-registry.json     # MDS cluster registry export
      rbac-metadata.json        # Backup statistics

Snapshot Format

The rbac-snapshot.json contains the complete security posture:

{
  "metadata": {
    "mds_url": "https://mds.prod:8090",
    "backup_timestamp": "2026-04-06T10:00:00Z",
    "backup_id": "daily-2026-04-06"
  },
  "cluster_registry": [
    {
      "clusterName": "prod-kafka",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } }
    }
  ],
  "bindings": [
    {
      "principal": "User:admin",
      "role_name": "SystemAdmin",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } },
      "binding_type": "ClusterScoped"
    },
    {
      "principal": "User:alice",
      "role_name": "DeveloperRead",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } },
      "binding_type": "ResourceScoped",
      "resource_patterns": [
        { "resourceType": "Topic", "name": "orders", "patternType": "LITERAL" }
      ]
    }
  ],
  "stats": {
    "total_bindings": 347,
    "cluster_scoped_bindings": 23,
    "resource_scoped_bindings": 324,
    "unique_principals": 42,
    "unique_roles": 9,
    "cluster_scopes_count": 5,
    "duration_ms": 21400
  }
}
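
Because the snapshot is plain JSON, a downloaded copy can be sanity-checked independently of the tool. The sketch below recomputes the binding counts and compares them with the recorded stats block. It relies only on the field names shown above; the helper names are hypothetical, and the abbreviated example above would itself report mismatches since it lists only two of its 347 bindings.

```python
import json
from typing import Dict, Optional, Tuple


def load_snapshot(path: str) -> Dict:
    """Read a downloaded rbac-snapshot.json file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def recompute_stats(snapshot: Dict) -> Dict[str, int]:
    """Derive the countable stats directly from the bindings list."""
    bindings = snapshot.get("bindings", [])
    return {
        "total_bindings": len(bindings),
        "cluster_scoped_bindings": sum(
            b.get("binding_type") == "ClusterScoped" for b in bindings
        ),
        "resource_scoped_bindings": sum(
            b.get("binding_type") == "ResourceScoped" for b in bindings
        ),
        "unique_principals": len({b["principal"] for b in bindings}),
        "unique_roles": len({b["role_name"] for b in bindings}),
    }


def stats_mismatches(snapshot: Dict) -> Dict[str, Tuple[Optional[int], int]]:
    """Return {stat name: (recorded, recomputed)} for every stat that disagrees."""
    recorded = snapshot.get("stats", {})
    return {
        name: (recorded.get(name), value)
        for name, value in recompute_stats(snapshot).items()
        if recorded.get(name) != value
    }
```

An empty result from stats_mismatches(load_snapshot("rbac-snapshot.json")) suggests the file was not truncated in transit.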

Use Cases

Disaster Recovery

# Backup from production
enterprise:
  confluent_rbac:
    mds_url: "https://mds.prod:8090"
    auth:
      username: ${MDS_PROD_USER}
      password: ${MDS_PROD_PASS}

# Restore to DR site (future milestone)
# enterprise:
#   confluent_rbac:
#     restore:
#       mds_url: "https://mds.dr:8090"
#       cluster_id_mapping:
#         kafka_cluster:
#           "lkc-prod": "lkc-dr"

Compliance Audit

Periodic RBAC snapshots prove security posture continuity:

# Schedule daily RBAC snapshots
kafka-backup rbac-backup --config rbac-audit.yaml

# Compare snapshots to detect drift (future milestone)
# kafka-backup rbac-diff --backup-id snap-1 --backup-id snap-2
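
Until a built-in diff is available, drift between two snapshots can be approximated by hand. The sketch below is not the planned rbac-diff command; it simply compares the bindings arrays of two snapshot documents, using the field names from the snapshot format. The helper names binding_key and diff_bindings are hypothetical.

```python
import json
from typing import Dict, List


def binding_key(binding: Dict) -> str:
    """Canonical string for a binding, independent of key and pattern order."""
    patterns = sorted(
        json.dumps(pattern, sort_keys=True)
        for pattern in binding.get("resource_patterns", [])
    )
    return json.dumps(
        {
            "principal": binding["principal"],
            "role_name": binding["role_name"],
            "scope": binding.get("scope", {}),
            "resource_patterns": patterns,
        },
        sort_keys=True,
    )


def diff_bindings(old_snapshot: Dict, new_snapshot: Dict) -> Dict[str, List[str]]:
    """List bindings present in only one of the two snapshots."""
    old_keys = {binding_key(b) for b in old_snapshot.get("bindings", [])}
    new_keys = {binding_key(b) for b in new_snapshot.get("bindings", [])}
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
    }
```

A changed binding (for example, a pattern switched from LITERAL to PREFIXED) shows up as one removal plus one addition, which is usually what an auditor wants to see.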

Environment Cloning

Clone security posture from production to staging with principal remapping (future milestone).

Component Service Account Bindings

The backup captures all required bindings for Confluent Platform components:

| Component | Required Bindings |
| --- | --- |
| Schema Registry | SecurityAdmin on SR scope + ResourceOwner on _schemas topic and coordination group |
| Connect | SecurityAdmin on Connect scope + ResourceOwner on configs, offsets, status topics + consumer group |
| ksqlDB | SecurityAdmin on ksqlDB scope + ResourceOwner on ksqlDB cluster, command topic, processing log, consumer group prefix |
| Control Center | SystemAdmin on kafka-cluster |
| Flink | SecurityAdmin on CMF/Flink environment scope |

These bindings are automatically classified as Tier 1 (ComponentService) and restored early to ensure components can start.

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Authentication failed | Invalid MDS credentials | Verify MDS_USERNAME/MDS_PASSWORD env vars |
| 403 Forbidden | Backup user lacks SecurityAdmin | Grant SecurityAdmin on kafka-cluster scope to backup service account |
| 429 Too Many Requests | MDS rate limit exceeded | Reduce rate_limit_rps (default 12 is usually safe) |
| No clusters found | MDS cluster registry is empty | Verify MDS is running and clusters are registered |
| Token expired errors | Long backup with token expiry | Token auto-refreshes; if persistent, check MDS token configuration |
| Partial backup | Some principals failed to enumerate | Check MDS logs; backup continues with available data |

Requirements

  • Confluent Platform 5.4+ with RBAC enabled
  • MDS running on Confluent Server brokers
  • HTTP(S) access to MDS REST API (port 8090 by default)
  • SecurityAdmin or SystemAdmin role for the backup service account
  • Enterprise license with rbac feature enabled