Confluent RBAC Backup & Restore
OSO Kafka Backup Enterprise backs up Confluent Platform RBAC role bindings from the Metadata Service (MDS), ensuring your security posture is recoverable after a disaster.
The Problem
Confluent Platform customers running RBAC accumulate hundreds or thousands of role bindings across Kafka, Schema Registry, Connect, ksqlDB, and Flink. After a DR event:
- Without RBAC backup: Every binding must be manually recreated. A cluster with 500+ bindings across 6 component types takes 4-8 hours and has a 15-25% human error rate.
- With RBAC backup: All bindings are restored in under 5 minutes in the correct dependency order. Components start successfully on first attempt.
Quick Start
Standalone RBAC Backup
mode: backup
backup_id: rbac-snapshot-2026-04-06
storage:
  backend: s3
  bucket: kafka-backups
  prefix: production
enterprise:
  confluent_rbac:
    mds_url: "https://mds.prod.company.com:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}
kafka-backup rbac-backup --config rbac-backup-config.yaml
Combined Backup (Kafka Data + Schemas + RBAC)
Add confluent_rbac alongside your existing config, and a single run backs up everything together:
mode: backup
backup_id: daily-2026-04-06
source:
  bootstrap_servers: ["kafka:9092"]
storage:
  backend: s3
  bucket: kafka-backups
enterprise:
  schema_registry:
    url: "https://schema-registry:8081"
    auth:
      type: basic
      username: ${SR_USERNAME}
      password: ${SR_PASSWORD}
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}
kafka-backup backup --config full-backup-config.yaml
This runs Kafka data backup, Schema Registry backup, and RBAC backup in sequence.
What Gets Backed Up
| Item | Description |
|---|---|
| Cluster registry | All clusters registered in MDS (Kafka, Connect, SR, ksqlDB, Flink) |
| Principals | All users and groups with role bindings |
| Cluster-scoped bindings | Roles granted at the cluster level (e.g., SystemAdmin, ClusterAdmin) |
| Resource-scoped bindings | Roles on specific resources (e.g., DeveloperRead on Topic:orders) |
| Resource patterns | Both LITERAL and PREFIXED pattern types |
| All predefined roles | SystemAdmin, UserAdmin, ClusterAdmin, SecurityAdmin, AuditAdmin, Operator, ResourceOwner, DeveloperRead, DeveloperWrite, DeveloperManage |
Configuration Reference
MDS Authentication
MDS uses a two-phase authentication model: Basic credentials are exchanged for a JWT bearer token, which is then used for all subsequent API calls. The token is automatically refreshed when it expires.
enterprise:
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: ${MDS_USERNAME}
      password: ${MDS_PASSWORD}
- For backup: SecurityAdmin on all scopes (read-only access to bindings)
- For restore: UserAdmin on the root kafka-cluster (can create/modify bindings)
Recommended: Create a dedicated service account User:kafka-backup-rbac with these roles.
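The two-phase model described above (Basic credentials exchanged for a bearer token that is refreshed on expiry) can be sketched as follows. This is a minimal illustration, not the tool's implementation: `TokenCache`, the injected `fetch_token` callable, the 30-second refresh margin, and the `auth_token`/`expires_in` response fields are all assumptions made for the example. The network call is injected so the sketch runs without an MDS instance.

```python
import base64
import time
from typing import Callable, Dict, Optional


def basic_auth_header(username: str, password: str) -> str:
    """Build the HTTP Basic header used for phase one (credential exchange)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


class TokenCache:
    """Caches a bearer token and fetches a new one shortly before expiry."""

    def __init__(
        self,
        fetch_token: Callable[[], Dict],
        clock: Callable[[], float] = time.monotonic,
        refresh_margin_s: float = 30.0,
    ) -> None:
        # Hypothetical callable that performs the Basic-auth request to MDS.
        self._fetch_token = fetch_token
        self._clock = clock
        self._margin = refresh_margin_s
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def bearer_header(self) -> str:
        """Return the Authorization header for phase two (all other calls)."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            # Assumed response shape: {"auth_token": str, "expires_in": seconds}
            resp = self._fetch_token()
            self._token = resp["auth_token"]
            self._expires_at = now + float(resp["expires_in"])
        return "Bearer " + self._token
```

Refreshing slightly before the expiry time, rather than after a request fails, avoids a failed call in the middle of a long backup.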
TLS Configuration
enterprise:
  confluent_rbac:
    mds_url: "https://mds:8090"
    auth:
      username: admin
      password: secret
    tls:
      ca_cert: /certs/ca.pem
      client_cert: /certs/client.pem      # For mTLS
      client_key: /certs/client-key.pem   # For mTLS
Principal Filtering
Control which principals are included in the backup using glob patterns:
enterprise:
  confluent_rbac:
    backup:
      # Include patterns (default: ["*"] = all principals)
      principals:
        - "User:app-*"
        - "Group:*-team"
      # Exclude patterns
      exclude_principals:
        - "User:kafka-backup-rbac"  # Exclude the backup service account itself
      # Filter by specific Kafka clusters (default: all discovered)
      cluster_filter:
        - "lkc-abc123"
      # Include sub-clusters: Connect, SR, ksqlDB, Flink (default: true)
      include_sub_clusters: true
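To show how include and exclude globs could combine, here is a small sketch using Python's standard `fnmatch` module. It assumes that exclude patterns take precedence over include patterns and that matching is case-sensitive; neither is confirmed by this page, and `filter_principals` is a hypothetical helper rather than part of the tool.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def filter_principals(
    principals: Iterable[str],
    include: Sequence[str] = ("*",),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Keep principals that match any include glob and no exclude glob."""
    kept = []
    for principal in principals:
        included = any(fnmatchcase(principal, pat) for pat in include)
        excluded = any(fnmatchcase(principal, pat) for pat in exclude)
        # Assumption: an exclude match always wins over an include match.
        if included and not excluded:
            kept.append(principal)
    return kept
```

With the patterns from the configuration above, `User:app-orders` and `Group:data-team` would be kept, while `User:alice` (no include match) and `User:kafka-backup-rbac` (explicitly excluded) would be dropped.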
Connection Tuning
MDS has a rate limit of 15 requests per second. The backup client defaults to 12 requests per second (rate_limit_rps) to leave headroom:
enterprise:
  confluent_rbac:
    connection:
      timeout_ms: 30000        # Request timeout (default: 30s)
      max_retries: 3           # Retry attempts on 429/5xx
      retry_backoff_ms: 1000   # Initial retry backoff (doubles per retry)
      rate_limit_rps: 12       # Max requests per second (default: 12)
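The arithmetic behind these settings can be sketched as below. The helper names are hypothetical, and the sketch assumes that the first retry waits exactly retry_backoff_ms and that no jitter or upper cap is applied; the actual client may behave differently.

```python
from typing import List


def backoff_schedule_ms(retry_backoff_ms: int, max_retries: int) -> List[int]:
    """Delay before each retry: the initial backoff, doubled on each further attempt."""
    return [retry_backoff_ms * (2 ** attempt) for attempt in range(max_retries)]


def min_request_interval_ms(rate_limit_rps: float) -> float:
    """Minimum spacing between requests that keeps the client at or below the limit."""
    return 1000.0 / rate_limit_rps


def estimated_min_duration_s(request_count: int, rate_limit_rps: float) -> float:
    """Lower bound on run time for a backup that issues request_count requests."""
    return request_count / rate_limit_rps
```

Under these assumptions, the values shown above give retry delays of 1000, 2000, and 4000 ms, and a spacing of roughly 83 ms between requests. Lowering rate_limit_rps lengthens the backup proportionally, which is worth weighing when responding to 429 errors.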
Restore Ordering
RBAC bindings have implicit dependencies. The restore engine applies bindings in 5 tiers to ensure correct ordering:
| Tier | Priority | Roles | Purpose |
|---|---|---|---|
| 0 | Bootstrap | SystemAdmin, UserAdmin (kafka-cluster scope) | Root admin access — must exist first |
| 1 | Components | Service accounts for SR, Connect, ksqlDB, Flink, C3 | Components need bindings to start |
| 2 | Cluster Admin | ClusterAdmin, SecurityAdmin, AuditAdmin, Operator | Cluster-level administrative roles |
| 3 | Resource Owner | ResourceOwner | Ownership of specific resources |
| 4 | Developer | DeveloperRead, DeveloperWrite, DeveloperManage | Application-level access |
This ordering ensures:
- Admin accounts exist before they can delegate access
- Component service accounts (Connect, Schema Registry, ksqlDB) have their required bindings before the components restart
- Resource owners are established before developer access is granted
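The tier table can be read as a classification followed by a stable sort. The sketch below is one possible interpretation, not the restore engine's code. It assumes that component service accounts are identified by a caller-supplied set of principals, that they are checked first (so a component holding SystemAdmin still lands in Tier 1), and that bootstrap roles granted outside the plain kafka-cluster scope fall into Tier 2. The functions `restore_tier` and `restore_order` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set

BOOTSTRAP_ROLES = {"SystemAdmin", "UserAdmin"}
CLUSTER_ADMIN_ROLES = {"ClusterAdmin", "SecurityAdmin", "AuditAdmin", "Operator"}


def restore_tier(binding: Dict, component_principals: Set[str]) -> int:
    """Map a binding to a restore tier (0 is applied first, 4 last)."""
    role = binding["role_name"]
    cluster_types = set(binding["scope"]["clusters"])
    # Assumption: component service accounts are recognized by principal name.
    if binding["principal"] in component_principals:
        return 1
    if role in BOOTSTRAP_ROLES and cluster_types == {"kafka-cluster"}:
        return 0
    # Assumption: bootstrap roles on other scopes are treated as cluster admin.
    if role in BOOTSTRAP_ROLES or role in CLUSTER_ADMIN_ROLES:
        return 2
    if role == "ResourceOwner":
        return 3
    # Developer roles, plus anything unrecognized, go last.
    return 4


def restore_order(bindings: Iterable[Dict], component_principals: Set[str]) -> List[Dict]:
    """Sort by tier; sorted() is stable, so snapshot order is kept within a tier."""
    return sorted(bindings, key=lambda b: restore_tier(b, component_principals))
```

Each binding here uses the same `principal`, `role_name`, and `scope` fields that appear in the snapshot file.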
Storage Format
RBAC snapshots are stored in the backup storage alongside Kafka data:
{backup_id}/
  manifest.json                 # Main backup manifest
  topics/...                    # Kafka data
  schema-registry/...           # Schema Registry backup
  enterprise/
    confluent-rbac/
      rbac-snapshot.json        # Full RBAC snapshot (bindings + metadata)
      cluster-registry.json     # MDS cluster registry export
      rbac-metadata.json        # Backup statistics
Snapshot Format
The rbac-snapshot.json contains the complete security posture:
{
  "metadata": {
    "mds_url": "https://mds.prod:8090",
    "backup_timestamp": "2026-04-06T10:00:00Z",
    "backup_id": "daily-2026-04-06"
  },
  "cluster_registry": [
    {
      "clusterName": "prod-kafka",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } }
    }
  ],
  "bindings": [
    {
      "principal": "User:admin",
      "role_name": "SystemAdmin",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } },
      "binding_type": "ClusterScoped"
    },
    {
      "principal": "User:alice",
      "role_name": "DeveloperRead",
      "scope": { "clusters": { "kafka-cluster": "lkc-abc123" } },
      "binding_type": "ResourceScoped",
      "resource_patterns": [
        { "resourceType": "Topic", "name": "orders", "patternType": "LITERAL" }
      ]
    }
  ],
  "stats": {
    "total_bindings": 347,
    "cluster_scoped_bindings": 23,
    "resource_scoped_bindings": 324,
    "unique_principals": 42,
    "unique_roles": 9,
    "cluster_scopes_count": 5,
    "duration_ms": 21400
  }
}
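Because the snapshot records its own statistics, a downloaded file can be sanity-checked by recomputing them from the bindings list. The sketch below assumes the field names shown in the sample above; `load_snapshot`, `recompute_stats`, and `stats_mismatches` are hypothetical helpers, not commands provided by the tool. Note that the sample above is abbreviated (two bindings shown against a recorded total of 347), so it would not pass this check as printed.

```python
import json
from typing import Dict, Tuple


def load_snapshot(path: str) -> Dict:
    """Read an rbac-snapshot.json file from local disk."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def recompute_stats(snapshot: Dict) -> Dict[str, int]:
    """Derive the countable stats directly from the bindings list."""
    bindings = snapshot["bindings"]
    return {
        "total_bindings": len(bindings),
        "cluster_scoped_bindings": sum(
            1 for b in bindings if b["binding_type"] == "ClusterScoped"
        ),
        "resource_scoped_bindings": sum(
            1 for b in bindings if b["binding_type"] == "ResourceScoped"
        ),
        "unique_principals": len({b["principal"] for b in bindings}),
        "unique_roles": len({b["role_name"] for b in bindings}),
    }


def stats_mismatches(snapshot: Dict) -> Dict[str, Tuple]:
    """Return {field: (recorded, recomputed)} for every stat that disagrees."""
    recorded = snapshot.get("stats", {})
    return {
        field: (recorded.get(field), value)
        for field, value in recompute_stats(snapshot).items()
        if recorded.get(field) != value
    }
```

An empty result from `stats_mismatches` means the recorded counts agree with the bindings actually present in the file.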
Use Cases
Disaster Recovery
# Backup from production
enterprise:
confluent_rbac:
mds_url: "https://mds.prod:8090"
auth:
username: ${MDS_PROD_USER}
password: ${MDS_PROD_PASS}
# Restore to DR site (future milestone)
# enterprise:
# confluent_rbac:
# restore:
# mds_url: "https://mds.dr:8090"
# cluster_id_mapping:
# kafka_cluster:
# "lkc-prod": "lkc-dr"
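Since restore with cluster_id_mapping is a future milestone, the following only illustrates what rewriting cluster IDs in a binding's scope could look like. Everything here is an assumption: the `remap_scope` helper is hypothetical, and so is the guess that a config key such as kafka_cluster corresponds to the kafka-cluster key in the snapshot scope.

```python
import copy
from typing import Dict


def remap_scope(binding: Dict, cluster_id_mapping: Dict[str, Dict[str, str]]) -> Dict:
    """Return a copy of the binding with production cluster IDs swapped for DR IDs."""
    remapped = copy.deepcopy(binding)  # leave the snapshot data untouched
    clusters = remapped["scope"]["clusters"]
    for config_key, id_map in cluster_id_mapping.items():
        # Assumption: "kafka_cluster" in the config means "kafka-cluster" in the scope.
        scope_key = config_key.replace("_", "-")
        if scope_key in clusters:
            current_id = clusters[scope_key]
            # IDs without a mapping entry are kept as they are.
            clusters[scope_key] = id_map.get(current_id, current_id)
    return remapped
```

Working on a copy keeps the original snapshot intact, so the same backup can be replayed against more than one target site.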
Compliance Audit
Periodic RBAC snapshots prove security posture continuity:
# Schedule daily RBAC snapshots
kafka-backup rbac-backup --config rbac-audit.yaml
# Compare snapshots to detect drift (future milestone)
# kafka-backup rbac-diff --backup-id snap-1 --backup-id snap-2
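Until a built-in diff is available, two downloaded snapshot files can be compared by hand. The sketch below is not the planned rbac-diff command; it is a hypothetical helper that assumes the snapshot field names shown in the Snapshot Format section and treats a binding as the combination of principal, role, scope, and resource patterns.

```python
from typing import Dict, List, Tuple


def binding_key(binding: Dict) -> Tuple:
    """Build a hashable identity for a binding, independent of list ordering."""
    scope = tuple(sorted(binding["scope"]["clusters"].items()))
    patterns = tuple(sorted(
        (p["resourceType"], p["name"], p["patternType"])
        for p in binding.get("resource_patterns", [])
    ))
    return (binding["principal"], binding["role_name"], scope, patterns)


def diff_bindings(old: Dict, new: Dict) -> Dict[str, List[Tuple]]:
    """Report bindings present in only one of the two snapshots."""
    old_keys = {binding_key(b) for b in old["bindings"]}
    new_keys = {binding_key(b) for b in new["bindings"]}
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
    }
```

A changed role shows up as one removed and one added entry for the same principal, which is usually the drift an auditor wants to see.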
Environment Cloning
Clone security posture from production to staging with principal remapping (future milestone).
Component Service Account Bindings
The backup captures all required bindings for Confluent Platform components:
| Component | Required Bindings |
|---|---|
| Schema Registry | SecurityAdmin on SR scope + ResourceOwner on _schemas topic and coordination group |
| Connect | SecurityAdmin on Connect scope + ResourceOwner on configs, offsets, status topics + consumer group |
| ksqlDB | SecurityAdmin on ksqlDB scope + ResourceOwner on ksqlDB cluster, command topic, processing log, consumer group prefix |
| Control Center | SystemAdmin on kafka-cluster |
| Flink | SecurityAdmin on CMF/Flink environment scope |
These bindings are automatically classified as Tier 1 (ComponentService) and restored early to ensure components can start.
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Authentication failed | Invalid MDS credentials | Verify MDS_USERNAME/MDS_PASSWORD env vars |
| 403 Forbidden | Backup user lacks SecurityAdmin | Grant SecurityAdmin on kafka-cluster scope to backup service account |
| 429 Too Many Requests | MDS rate limit exceeded | Reduce rate_limit_rps (default 12 is usually safe) |
| No clusters found | MDS cluster registry is empty | Verify MDS is running and clusters are registered |
| Token expired errors | Long backup with token expiry | Token auto-refreshes; if persistent, check MDS token configuration |
| Partial backup | Some principals failed to enumerate | Check MDS logs; backup continues with available data |
Requirements
- Confluent Platform 5.4+ with RBAC enabled
- MDS running on Confluent Server brokers
- HTTP(S) access to MDS REST API (port 8090 by default)
- SecurityAdmin or SystemAdmin role for the backup service account
- Enterprise license with the rbac feature enabled