
MSK ZooKeeper to KRaft Migration

Try it free — no license needed

plan and precheck are completely free. Run them against your production clusters today to see exactly what a migration looks like — generated runbook, cost estimate, IAM policies, and infrastructure readiness report. No signup, no trial activation.

kafka-backup migrate msk-kraft plan --config migration.yaml --format all --out-dir ./migration-plan
kafka-backup migrate msk-kraft precheck --config migration.yaml

Migrate your AWS MSK clusters from ZooKeeper to KRaft mode with a short coordinated producer freeze, validated offset continuity, and a cryptographically signed evidence bundle that proves the migration succeeded. Consumers resume from translated target offsets so message continuity is preserved across the switch.

Why Migrate from ZooKeeper to KRaft?

Apache Kafka 4.0 removes ZooKeeper entirely. KRaft (Kafka Raft) replaces ZooKeeper as the metadata management layer, bringing:

  • Faster controller failover — seconds instead of minutes
  • Simplified operations — one system to manage instead of two
  • Better scalability — millions of partitions per cluster
  • Reduced infrastructure — no ZooKeeper ensemble to provision, monitor, or patch

AWS MSK supports KRaft from version 3.7.x onward. ZooKeeper-mode clusters on MSK will reach end of extended support as Kafka 4.x becomes the default. The migration window is now.

Is ZooKeeper Deprecated?

Yes. ZooKeeper was deprecated in Apache Kafka 3.5 (KIP-833) and removed in Kafka 4.0. AWS MSK's latest versions already support KRaft, and new clusters should be provisioned in KRaft mode.

The AWS MSK Migration Problem

AWS MSK does not support in-place ZooKeeper-to-KRaft conversion. You must create a new KRaft cluster and move everything over:

| What needs to migrate | What happens without tooling |
|---|---|
| Topic data (every partition, every record) | Manual MirrorMaker setup, ongoing maintenance |
| Topic configurations (retention, compaction, replication) | Manual recreation, error-prone |
| Consumer group offsets | Lost — consumers restart from earliest or latest |
| ACL bindings | Manual recreation, security gaps during transition |
| Proof that migration succeeded | Nothing — hope and prayer |

The gap between "data is on the new cluster" and "consumers resume from the right place" is where migrations fail. A single incorrect offset means lost messages or reprocessed duplicates — silent data corruption that surfaces days later in downstream systems.
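
To make that failure mode concrete, here is a minimal Python sketch (hypothetical topic names and offsets, not the tool's internal data structures) of why committed source offsets cannot be reused on the target and what an offset map provides:

```python
# A copy does not preserve offsets: the target assigns its own, so a consumer
# committed at source offset 1042 must not resume at target offset 1042.
# The seed phase builds a per-partition map from source to target offsets.

# Hypothetical offset map: (topic, partition, source_offset) -> target_offset
offset_map = {
    ("orders", 0, 1042): 998,  # retention/compaction can shift positions
    ("orders", 1, 511): 511,   # occasionally offsets happen to line up
}

def translate(topic: str, partition: int, committed: int) -> int:
    """Return the target offset a consumer group should resume from."""
    key = (topic, partition, committed)
    if key not in offset_map:
        raise RuntimeError(
            f"no mapping for {topic}-{partition}@{committed}: "
            "resuming blind risks lost or duplicated messages"
        )
    return offset_map[key]

assert translate("orders", 0, 1042) == 998
```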

How kafka-backup Enterprise Solves It

| Capability | kafka-backup Enterprise | MirrorMaker 2 | Manual |
|---|---|---|---|
| Controlled cutover | Yes (coordinated producer freeze) | Partial | No |
| Offset continuity (exact message resume) | Yes (offset-map translation) | No | No |
| ACL migration with drift handling | Yes (merge/replace/refuse) | No | Manual |
| Topic config preservation | Yes (automatic) | Partial | Manual |
| Cryptographic evidence bundle | Yes (Ed25519-signed) | No | No |
| 5-check automated validation | Yes | No | No |
| Rollback capability | Yes (pre-cutover) | No | No |
| Resume after failure | Yes (journal-based) | Restart from scratch | Restart from scratch |
| Cross-auth support (SCRAM → IAM) | Yes | No | Manual |

Migration Lifecycle

The migration runs through a deterministic 11-state machine. Every state transition is journaled and included in the final evidence bundle.

PLANNED → PRECHECK → TOPOLOGY_COPY → SEED → TAIL → DRAIN_READY
    → CUTOVER → AWAITING_CLIENT_SWITCH → VALIDATING → FINALIZED

| Phase | State | What happens |
|---|---|---|
| Plan & Precheck | planned, precheck | Read-only analysis of both clusters. Detects blockers (incompatible versions, unreachable brokers, S3 permission issues) and warnings (cross-region egress, compacted topics, static members). |
| Topology Copy | topology_copy | Creates missing topics on target with matching partition counts and configurations. Copies ACL bindings (filtering MSK internals like User:ANONYMOUS). |
| Seed | seed | Bulk-copies all existing data through S3 — source → backup → S3 → restore → target. Builds the offset map that enables consumer group translation. |
| Tail | tail | Continuously bridges the gap between seed and cutover. Replays new records as they arrive on source. Tracks per-partition lag. |
| Drain Ready | drain_ready | All partitions within lag tolerance. Execution halts. Operator decides when to proceed. |
| Cutover | cutover | Freezes producers (via webhook or manually), publishes sentinel records, drains final records, translates all consumer group offsets, and commits translated offsets on target. |
| Client Switch | awaiting_client_switch | Operator updates application configs to point to the new KRaft cluster's bootstrap servers. |
| Validation | validating | Runs 5 automated checks: topic parity, record counts, spot-check record equality, sentinel presence, consumer group reconciliation. |
| Finalize | finalized | Signs the evidence bundle with Ed25519 and uploads it to S3. Migration complete. |

At any point before cutover, you can roll back — the source cluster is never modified.
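
For intuition, the shape of such a journaled state machine can be sketched in a few lines of Python (the state names come from the diagram above; the journal format shown is an assumption, not the tool's actual on-disk layout):

```python
import json
import time
from enum import Enum

class State(Enum):
    PLANNED = "planned"
    PRECHECK = "precheck"
    TOPOLOGY_COPY = "topology_copy"
    SEED = "seed"
    TAIL = "tail"
    DRAIN_READY = "drain_ready"
    CUTOVER = "cutover"
    AWAITING_CLIENT_SWITCH = "awaiting_client_switch"
    VALIDATING = "validating"
    FINALIZED = "finalized"

# Allowed forward transitions, mirroring the happy path in the diagram above.
NEXT = {
    State.PLANNED: State.PRECHECK,
    State.PRECHECK: State.TOPOLOGY_COPY,
    State.TOPOLOGY_COPY: State.SEED,
    State.SEED: State.TAIL,
    State.TAIL: State.DRAIN_READY,
    State.DRAIN_READY: State.CUTOVER,
    State.CUTOVER: State.AWAITING_CLIENT_SWITCH,
    State.AWAITING_CLIENT_SWITCH: State.VALIDATING,
    State.VALIDATING: State.FINALIZED,
}

def advance(current: State, journal_path: str) -> State:
    """Advance one state and append the transition to an append-only journal."""
    nxt = NEXT[current]  # KeyError on any transition the machine does not allow
    entry = {"from": current.value, "to": nxt.value, "ts": time.time()}
    with open(journal_path, "a") as journal:
        journal.write(json.dumps(entry) + "\n")
    return nxt
```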

Authentication Matrix

kafka-backup supports every MSK authentication mode and cross-auth migration:

| Source Auth | Target Auth | Supported | Notes |
|---|---|---|---|
| IAM | IAM | Yes | Most common MSK configuration |
| SCRAM-SHA-512 | SCRAM-SHA-512 | Yes | Pre-provision SCRAM users on target |
| SCRAM-SHA-512 | IAM | Yes | Auth modernization — ACLs emitted as access-map.json |
| IAM | SCRAM-SHA-512 | Yes | |
| mTLS | IAM | Yes | |
| mTLS | mTLS | Yes | |
| PLAINTEXT | Any | Yes | Dev/test environments |

Cross-auth migration (e.g., SCRAM source → IAM target) is a first-class feature. When the target uses IAM, Kafka ACLs don't apply — instead, the tool generates an access-map.json that maps each principal's permissions to the IAM policies you need to create.
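
The exact schema of access-map.json is the tool's own, but the translation it performs can be sketched as follows. The kafka-cluster:* action names are AWS MSK's published IAM actions; the mapping table, helper function, and simplified ARN shape are illustrative assumptions:

```python
# Sketch: derive the IAM policy statement a principal needs on the KRaft
# target from one of its Kafka ACL bindings on the SCRAM source.
# The ACL -> action mapping and ARN shape below are simplified illustrations.

ACL_TO_IAM = {
    ("topic", "READ"):  ["kafka-cluster:Connect", "kafka-cluster:DescribeTopic",
                         "kafka-cluster:ReadData"],
    ("topic", "WRITE"): ["kafka-cluster:Connect", "kafka-cluster:DescribeTopic",
                         "kafka-cluster:WriteData"],
    ("group", "READ"):  ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
}

def iam_statement(resource_type: str, operation: str,
                  arn_prefix: str, resource_name: str) -> dict:
    """Build one Allow statement for a (resource, operation) ACL binding."""
    return {
        "Effect": "Allow",
        "Action": ACL_TO_IAM[(resource_type, operation)],
        "Resource": f"{arn_prefix}/{resource_type}/{resource_name}",
    }

print(iam_statement("topic", "READ",
                    "arn:aws:kafka:us-east-1:123456789012", "orders"))
```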

5-Check Automated Validation

Before finalizing, the tool runs five independent validation checks:

| Check | What it verifies | Pass criteria |
|---|---|---|
| Topic Parity | Partition counts match between source and target | All topics match |
| Counts & Offsets | Record counts within tolerance (default ±1 for the sentinel) | Per-partition span difference ≤ count_tolerance |
| Spot-Check Records | Sampled records are byte-equal between source and target | All samples match (compacted topics allow warnings) |
| Sentinel Presence | Cutover marker records landed on target | All sentinels found |
| Consumer Group Reconciliation | Translated offsets committed correctly on target | All offsets match expected values |

The overall outcome is PASSED, WARNING (expected drift on compacted topics), or FAILED. Failed validation blocks finalization — you must investigate and remediate before the migration can complete.
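
As an illustration of how one of these checks operates, the Counts & Offsets criterion reduces to simple per-partition arithmetic (function and variable names below are hypothetical):

```python
# Sketch of the "Counts & Offsets" pass criterion: the per-partition offset
# span on source and target must differ by at most count_tolerance, which
# defaults to 1 to absorb the sentinel record published at cutover.

def counts_check(source_spans: dict, target_spans: dict,
                 count_tolerance: int = 1) -> list:
    """Return the partitions whose span difference exceeds the tolerance."""
    failures = []
    for partition, src_span in source_spans.items():
        diff = abs(target_spans.get(partition, 0) - src_span)
        if diff > count_tolerance:
            failures.append((partition, diff))
    return failures

# One extra sentinel record on the target is within tolerance -> check passes.
src = {("orders", 0): 10_000, ("orders", 1): 5_000}
tgt = {("orders", 0): 10_001, ("orders", 1): 5_000}
assert counts_check(src, tgt) == []
```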

Cryptographic Evidence Bundle

Every migration produces an Ed25519-signed JSON evidence bundle uploaded to S3. This is your auditable proof that the migration succeeded:

  • Complete state transition journal (every phase, with timestamps)
  • Source and target cluster metadata snapshots
  • Topology diff (topics created, configs applied)
  • ACL plan (bindings copied, internals filtered)
  • Seed and tail statistics (records, bytes, partitions)
  • Full validation report with per-partition detail
  • Offset translation map
  • Cutover report (sentinel positions, freeze timing)

The signature is verifiable offline with the Ed25519 public key. For regulated environments, the evidence bucket supports S3 Object Lock (COMPLIANCE mode) to prevent tampering.
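
Offline verification needs nothing beyond the public key and a standard Ed25519 implementation. Here is a sketch using Python's `cryptography` package; the file layout (bundle, detached signature, raw 32-byte public key) is an assumption about how you store the artifacts:

```python
# Sketch: verify an Ed25519-signed evidence bundle offline with the
# `cryptography` package. Paths and the detached-signature layout are
# assumptions about how the artifacts are stored locally.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_bundle(bundle_path: str, sig_path: str, pubkey_path: str) -> bool:
    with open(pubkey_path, "rb") as f:
        public_key = Ed25519PublicKey.from_public_bytes(f.read())  # 32 raw bytes
    with open(bundle_path, "rb") as f:
        bundle = f.read()
    with open(sig_path, "rb") as f:
        signature = f.read()  # 64-byte detached Ed25519 signature
    try:
        public_key.verify(signature, bundle)
        return True
    except InvalidSignature:
        return False

print(verify_bundle("evidence.json", "evidence.json.sig", "evidence.pub"))
```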

For compliance teams

The evidence bundle answers: "Prove that every record made it to the new cluster and every consumer will resume from the right place." It's the difference between "we think it worked" and "here's the cryptographic proof."

Quick Start

1. Create a minimal config

migration.yaml
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/my-zk-cluster/abc-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/my-kraft-cluster/def-456
      auth:
        mode: iam
    backup:
      s3_bucket: my-migration-segments
      s3_prefix: migrations/
    evidence:
      s3_bucket: my-migration-evidence
      s3_prefix: evidence/

2. Generate the migration plan (free)

kafka-backup migrate msk-kraft plan \
--config migration.yaml \
--format all \
--out-dir ./migration-plan

This generates:

  • plan.json — machine-readable migration plan
  • runbook.md — step-by-step operator runbook
  • aws-cli.sh — AWS CLI commands for infrastructure setup
  • iam-policy-templated.json — IAM policy template
  • iam-policy-concrete.json — IAM policy with your ARNs filled in
  • cost-estimate.json — estimated S3 and data transfer costs

3. Run precheck (free)

kafka-backup migrate msk-kraft precheck --config migration.yaml

Precheck analyzes both clusters and reports blockers, warnings, and informational findings. See the Precheck Codes Reference for remediation guidance.

4. Execute the migration (license required)

kafka-backup migrate msk-kraft execute \
--config migration.yaml \
--journal-dir ./journal

See the Production Migration Runbook for the complete step-by-step process.

Pricing and Licensing

MSK KRaft migration requires the migrations:msk-kraft feature in your enterprise license. A 14-day free trial activates automatically on first run — no signup, no credit card.

  • plan and precheck are always free, even without a license
  • execute, cutover, finalize, and other mutation commands require an active license
  • Licenses are Ed25519-signed files validated offline — no license server, no phone-home

Learn more about licensing | Get a license

Frequently Asked Questions

Does Kafka still need ZooKeeper?

No. Apache Kafka 3.3+ supports KRaft mode (ZooKeeper-free). Kafka 4.0 removes ZooKeeper entirely. AWS MSK supports KRaft from version 3.7.x.

Can Kafka run without ZooKeeper?

Yes. KRaft mode replaces ZooKeeper with an internal Raft-based metadata quorum. New clusters should be provisioned in KRaft mode.

Is ZooKeeper removed from Kafka?

ZooKeeper was deprecated in Kafka 3.5 and removed in Kafka 4.0. Existing ZooKeeper-mode clusters must migrate to KRaft before upgrading to Kafka 4.x.

Is Kafka KRaft production ready?

Yes. KRaft has been production-ready since Kafka 3.3 (KIP-833). AWS MSK supports KRaft in production from version 3.7.x. Major organizations have been running KRaft in production since 2024.

What is KRaft in Kafka?

KRaft (Kafka Raft) is the consensus protocol that replaces ZooKeeper for Kafka metadata management. It uses the Raft algorithm to elect a controller and replicate metadata across the cluster, eliminating the need for a separate ZooKeeper ensemble. See KRaft Architecture for a deep dive.

How long does a migration take?

Migration time depends on data volume and network bandwidth. Rough estimates for a 3-broker cluster:

| Data volume | Seed phase | Total (including tail + cutover) |
|---|---|---|
| 10 GB | ~5 minutes | ~10 minutes |
| 100 GB | ~30 minutes | ~45 minutes |
| 1 TB | ~4 hours | ~5 hours |
| 10 TB | ~36 hours | ~40 hours |

The producer freeze window during cutover is typically under 60 seconds regardless of data volume.
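
These figures imply an effective seed throughput of very roughly 35 to 85 MB/s, varying with volume and network path. You can apply the same arithmetic to your own cluster; the throughput value below is an assumption you should replace with a measured rate:

```python
# Back-of-envelope seed-phase estimate: data volume divided by effective
# throughput. Replace THROUGHPUT_MB_S with a measured value for your network.
THROUGHPUT_MB_S = 60.0  # assumed effective copy rate, source -> S3 -> target

def seed_minutes(volume_gb: float) -> float:
    """Estimated seed duration in minutes for a given data volume."""
    return (volume_gb * 1024) / THROUGHPUT_MB_S / 60

for gb in (10, 100, 1024, 10240):
    print(f"{gb:>6} GB -> ~{seed_minutes(gb):,.0f} min")
```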

Next Steps