MSK ZooKeeper to KRaft Migration
`plan` and `precheck` are completely free. Run them against your production clusters today to see exactly what a migration looks like — generated runbook, cost estimate, IAM policies, and infrastructure readiness report. No signup, no trial activation.
```bash
kafka-backup migrate msk-kraft plan --config migration.yaml --format all --out-dir ./migration-plan
kafka-backup migrate msk-kraft precheck --config migration.yaml
```
Migrate your AWS MSK clusters from ZooKeeper to KRaft mode with a short coordinated producer freeze, validated offset continuity, and a cryptographically signed evidence bundle that proves the migration succeeded. Consumers resume from translated target offsets so message continuity is preserved across the switch.
Why Migrate from ZooKeeper to KRaft?
Apache Kafka 4.0 removes ZooKeeper entirely. KRaft (Kafka Raft) replaces ZooKeeper as the metadata management layer, bringing:
- Faster controller failover — seconds instead of minutes
- Simplified operations — one system to manage instead of two
- Better scalability — millions of partitions per cluster
- Reduced infrastructure — no ZooKeeper ensemble to provision, monitor, or patch
AWS MSK supports KRaft from version 3.7.x onward. ZooKeeper-mode clusters on MSK will reach end of extended support as Kafka 4.x becomes the default. The migration window is now.
Is ZooKeeper Deprecated?
Yes. ZooKeeper was deprecated in Apache Kafka 3.5 (KIP-833) and removed in Kafka 4.0. AWS MSK's latest versions already support KRaft, and new clusters should be provisioned in KRaft mode.
The AWS MSK Migration Problem
AWS MSK does not support in-place ZooKeeper-to-KRaft conversion. You must create a new KRaft cluster and move everything over:
| What needs to migrate | What happens without tooling |
|---|---|
| Topic data (every partition, every record) | Manual MirrorMaker setup, ongoing maintenance |
| Topic configurations (retention, compaction, replication) | Manual recreation, error-prone |
| Consumer group offsets | Lost — consumers restart from earliest or latest |
| ACL bindings | Manual recreation, security gaps during transition |
| Proof that migration succeeded | Nothing — hope and prayer |
The gap between "data is on the new cluster" and "consumers resume from the right place" is where migrations fail. A single incorrect offset means lost messages or reprocessed duplicates — silent data corruption that surfaces days later in downstream systems.
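Offset continuity comes down to translating each consumer group's committed source offset into the equivalent target offset. The following sketch illustrates the idea with a hypothetical per-partition offset map (the names, data, and checkpoint format are illustrative, not the tool's internals):

```python
import bisect

# Hypothetical offset map: for each (topic, partition), sorted pairs of
# (source_offset, target_offset) checkpoints recorded during copy.
offset_map = {
    ("orders", 0): [(0, 0), (1000, 980), (2000, 1960)],
}

def translate(topic, partition, source_committed):
    """Translate a committed source offset to the target cluster.

    Finds the last checkpoint at or below the committed offset and carries
    the delta forward (assumes offsets are contiguous between checkpoints).
    """
    checkpoints = offset_map[(topic, partition)]
    sources = [s for s, _ in checkpoints]
    idx = bisect.bisect_right(sources, source_committed) - 1
    src, tgt = checkpoints[idx]
    return tgt + (source_committed - src)

print(translate("orders", 0, 1500))  # 980 + 500 = 1480
```

An off-by-one here is exactly the silent-corruption failure mode described above, which is why the real translation is validated by the consumer group reconciliation check before finalization.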
How kafka-backup Enterprise Solves It
| Capability | kafka-backup Enterprise | MirrorMaker 2 | Manual |
|---|---|---|---|
| Controlled cutover | Yes (coordinated producer freeze) | Partial | No |
| Offset continuity (exact message resume) | Yes (offset-map translation) | No | No |
| ACL migration with drift handling | Yes (merge/replace/refuse) | No | Manual |
| Topic config preservation | Yes (automatic) | Partial | Manual |
| Cryptographic evidence bundle | Yes (Ed25519-signed) | No | No |
| 5-check automated validation | Yes | No | No |
| Rollback capability | Yes (pre-cutover) | No | No |
| Resume after failure | Yes (journal-based) | Restart from scratch | Restart from scratch |
| Cross-auth support (SCRAM → IAM) | Yes | No | Manual |
Migration Lifecycle
The migration runs through a deterministic 11-state machine. Every state transition is journaled and included in the final evidence bundle.
```
PLANNED → PRECHECK → TOPOLOGY_COPY → SEED → TAIL → DRAIN_READY
                                                        ↓
FINALIZED ← VALIDATING ← AWAITING_CLIENT_SWITCH ← CUTOVER
```
| Phase | State | What happens |
|---|---|---|
| Plan & Precheck | planned → precheck | Read-only analysis of both clusters. Detects blockers (incompatible versions, unreachable brokers, S3 permission issues) and warnings (cross-region egress, compacted topics, static members). |
| Topology Copy | topology_copy | Creates missing topics on target with matching partition counts and configurations. Copies ACL bindings (filtering MSK internals like User:ANONYMOUS). |
| Seed | seed | Bulk-copies all existing data through S3 — source → backup → S3 → restore → target. Builds the offset map that enables consumer group translation. |
| Tail | tail | Continuously bridges the gap between seed and cutover. Replays new records as they arrive on source. Tracks per-partition lag. |
| Drain Ready | drain_ready | All partitions within lag tolerance. Execution halts. Operator decides when to proceed. |
| Cutover | cutover | Freezes producers (via webhook or manual), publishes sentinel records, drains final records, translates all consumer group offsets, commits translated offsets on target. |
| Client Switch | awaiting_client_switch | Operator updates application configs to point to new KRaft cluster bootstrap servers. |
| Validation | validating | Runs 5 automated checks: topic parity, record counts, spot-check record equality, sentinel presence, consumer group reconciliation. |
| Finalize | finalized | Signs the evidence bundle with Ed25519 and uploads to S3. Migration complete. |
At any point before cutover, you can roll back — the source cluster is never modified.
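The journaled lifecycle can be pictured as a small state machine. This sketch uses the state names from the table above; the transition rules and journal format are assumptions for illustration, not the tool's actual implementation:

```python
# Happy-path transitions, following the lifecycle table above.
TRANSITIONS = {
    "planned": {"precheck"},
    "precheck": {"topology_copy"},
    "topology_copy": {"seed"},
    "seed": {"tail"},
    "tail": {"drain_ready"},
    "drain_ready": {"cutover"},
    "cutover": {"awaiting_client_switch"},
    "awaiting_client_switch": {"validating"},
    "validating": {"finalized"},
    "finalized": set(),
}

class MigrationJournal:
    """Append-only journal of state transitions (illustrative sketch)."""

    def __init__(self):
        self.state = "planned"
        self.entries = [("planned", "initial")]

    def advance(self, next_state, note=""):
        if next_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        self.state = next_state
        self.entries.append((next_state, note))

j = MigrationJournal()
j.advance("precheck", "no blockers found")
j.advance("topology_copy")
print(j.state)  # topology_copy
```

Because every transition is appended rather than overwritten, the journal doubles as the audit trail included in the evidence bundle.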
Authentication Matrix
kafka-backup supports every MSK authentication mode and cross-auth migration:
| Source Auth | Target Auth | Supported | Notes |
|---|---|---|---|
| IAM | IAM | Yes | Most common MSK configuration |
| SCRAM-SHA-512 | SCRAM-SHA-512 | Yes | Pre-provision SCRAM users on target |
| SCRAM-SHA-512 | IAM | Yes | Auth modernization — ACLs emitted as access-map.json |
| IAM | SCRAM-SHA-512 | Yes | |
| mTLS | IAM | Yes | |
| mTLS | mTLS | Yes | |
| PLAINTEXT | Any | Yes | Dev/test environments |
Cross-auth migration (e.g., SCRAM source → IAM target) is a first-class feature. When the target uses IAM, Kafka ACLs don't apply — instead, the tool generates an access-map.json that maps each principal's permissions to the IAM policies you need to create.
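The exact schema of access-map.json is not shown here; as a sketch of the idea, the snippet below groups Kafka ACL bindings by principal so each group can be turned into an IAM policy. The binding tuples and output structure are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative ACL bindings: (principal, operation, resource_type, resource_name).
# The real access-map.json schema may differ.
bindings = [
    ("User:orders-svc", "READ", "topic", "orders"),
    ("User:orders-svc", "READ", "group", "orders-consumers"),
    ("User:billing-svc", "WRITE", "topic", "invoices"),
]

def build_access_map(acls):
    """Group each principal's permissions so they can be mapped to IAM policies."""
    grouped = defaultdict(list)
    for principal, op, rtype, rname in acls:
        grouped[principal].append({"operation": op, "resource": f"{rtype}:{rname}"})
    return dict(grouped)

access_map = build_access_map(bindings)
print(access_map["User:orders-svc"])
```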
5-Check Automated Validation
Before finalizing, the tool runs five independent validation checks:
| Check | What it verifies | Pass criteria |
|---|---|---|
| Topic Parity | Partition counts match between source and target | All topics match |
| Counts & Offsets | Record counts within tolerance (default ±1 for sentinel) | Per-partition span difference ≤ count_tolerance |
| Spot-Check Records | Sampled records byte-equal between source and target | All samples match (compacted topics allow warnings) |
| Sentinel Presence | Cutover marker records landed on target | All sentinels found |
| Consumer Group Reconciliation | Translated offsets committed correctly on target | All offsets match expected values |
The overall outcome is PASSED, WARNING (expected drift on compacted topics), or FAILED. Failed validation blocks finalization — you must investigate and remediate before the migration can complete.
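The Counts & Offsets criterion is simple to state: every partition's record span must differ by at most the tolerance (default ±1, to allow for the sentinel record). A minimal sketch of that pass/fail rule, with illustrative data:

```python
def counts_check(source_spans, target_spans, count_tolerance=1):
    """Pass if every partition's record span is within the tolerance.

    Spans are per-partition record counts keyed by (topic, partition).
    """
    failures = []
    for key, src in source_spans.items():
        tgt = target_spans.get(key, 0)
        if abs(tgt - src) > count_tolerance:
            failures.append((key, src, tgt))
    return ("PASSED" if not failures else "FAILED", failures)

src = {("orders", 0): 10_000, ("orders", 1): 9_500}
tgt = {("orders", 0): 10_001, ("orders", 1): 9_500}  # +1 sentinel on partition 0
print(counts_check(src, tgt)[0])  # PASSED
```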
Cryptographic Evidence Bundle
Every migration produces an Ed25519-signed JSON evidence bundle uploaded to S3. This is your auditable proof that the migration succeeded:
- Complete state transition journal (every phase, with timestamps)
- Source and target cluster metadata snapshots
- Topology diff (topics created, configs applied)
- ACL plan (bindings copied, internals filtered)
- Seed and tail statistics (records, bytes, partitions)
- Full validation report with per-partition detail
- Offset translation map
- Cutover report (sentinel positions, freeze timing)
The signature is verifiable offline with the Ed25519 public key. For regulated environments, the evidence bucket supports S3 Object Lock (COMPLIANCE mode) to prevent tampering.
The evidence bundle answers: "Prove that every record made it to the new cluster and every consumer will resume from the right place." It's the difference between "we think it worked" and "here's the cryptographic proof."
Quick Start
1. Create a minimal config
```yaml
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/my-zk-cluster/abc-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/my-kraft-cluster/def-456
      auth:
        mode: iam
    backup:
      s3_bucket: my-migration-segments
      s3_prefix: migrations/
    evidence:
      s3_bucket: my-migration-evidence
      s3_prefix: evidence/
```
2. Generate the migration plan (free)
```bash
kafka-backup migrate msk-kraft plan \
  --config migration.yaml \
  --format all \
  --out-dir ./migration-plan
```
This generates:
- `plan.json` — machine-readable migration plan
- `runbook.md` — step-by-step operator runbook
- `aws-cli.sh` — AWS CLI commands for infrastructure setup
- `iam-policy-templated.json` — IAM policy template
- `iam-policy-concrete.json` — IAM policy with your ARNs filled in
- `cost-estimate.json` — estimated S3 and data transfer costs
3. Run precheck (free)
```bash
kafka-backup migrate msk-kraft precheck --config migration.yaml
```
Precheck analyzes both clusters and reports blockers, warnings, and informational findings. See the Precheck Codes Reference for remediation guidance.
4. Execute the migration (license required)
```bash
kafka-backup migrate msk-kraft execute \
  --config migration.yaml \
  --journal-dir ./journal
```
See the Production Migration Runbook for the complete step-by-step process.
Pricing and Licensing
MSK KRaft migration requires the migrations:msk-kraft feature in your enterprise license. A 14-day free trial activates automatically on first run — no signup, no credit card.
- `plan` and `precheck` are always free, even without a license
- `execute`, `cutover`, `finalize`, and other mutation commands require an active license
- Licenses are Ed25519-signed files validated offline — no license server, no phone-home
Learn more about licensing | Get a license
Frequently Asked Questions
Does Kafka still need ZooKeeper?
No. Apache Kafka 3.3+ supports KRaft mode (ZooKeeper-free). Kafka 4.0 removes ZooKeeper entirely. AWS MSK supports KRaft from version 3.7.x.
Can Kafka run without ZooKeeper?
Yes. KRaft mode replaces ZooKeeper with an internal Raft-based metadata quorum. New clusters should be provisioned in KRaft mode.
Is ZooKeeper removed from Kafka?
ZooKeeper was deprecated in Kafka 3.5 and removed in Kafka 4.0. Existing ZooKeeper-mode clusters must migrate to KRaft before upgrading to Kafka 4.x.
Is Kafka KRaft production ready?
Yes. KRaft has been production-ready since Kafka 3.3 (KIP-833). AWS MSK supports KRaft in production from version 3.7.x. Major organizations have been running KRaft in production since 2024.
What is KRaft in Kafka?
KRaft (Kafka Raft) is the consensus protocol that replaces ZooKeeper for Kafka metadata management. It uses the Raft algorithm to elect a controller and replicate metadata across the cluster, eliminating the need for a separate ZooKeeper ensemble. See KRaft Architecture for a deep dive.
How long does a migration take?
Migration time depends on data volume and network bandwidth. Rough estimates for a 3-broker cluster:
| Data volume | Seed phase | Total (including tail + cutover) |
|---|---|---|
| 10 GB | ~5 minutes | ~10 minutes |
| 100 GB | ~30 minutes | ~45 minutes |
| 1 TB | ~4 hours | ~5 hours |
| 10 TB | ~36 hours | ~40 hours |
The producer freeze window during cutover is typically under 60 seconds regardless of data volume.
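As a rough back-of-envelope, seed time scales with data volume divided by effective copy throughput; the table above implies an effective rate on the order of 50–80 MB/s for a 3-broker cluster. A quick estimator (the throughput figure is an assumption; measure your own clusters):

```python
def estimate_seed_hours(data_gb, effective_mb_per_s=60):
    """Rough seed-phase estimate: data volume / effective copy throughput."""
    seconds = (data_gb * 1024) / effective_mb_per_s
    return seconds / 3600

# Compare against the table above (100 GB ≈ 0.5 h, 1 TB ≈ 5 h):
for gb in (100, 1024, 10 * 1024):
    print(f"{gb:>6} GB ≈ {estimate_seed_hours(gb):.1f} h")
```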
Next Steps
- Production Migration Runbook — step-by-step guide
- Configuration Reference — every YAML field explained
- Precheck Codes Reference — blocker and warning remediation
- CLI Reference — all 9 migration commands
- Architecture Deep Dive — how offset continuity works
- IAM-to-IAM Example — complete worked example
- Cross-Auth Example — SCRAM to IAM migration