MSK KRaft Migration Runbook
This runbook walks through a complete production migration from an AWS MSK ZooKeeper cluster to a KRaft cluster using kafka-backup Enterprise.
Prerequisites
Before starting, ensure you have:
- Source cluster: MSK Provisioned cluster in ZooKeeper mode (Kafka 2.8+)
- Target cluster: MSK Provisioned cluster in KRaft mode (Kafka 3.7+), pre-provisioned with the same or larger broker count
- S3 buckets: Two buckets in the same region as your clusters — one for migration segments, one for evidence
- IAM permissions: The migration runner needs access to both MSK clusters and both S3 buckets (see Step 2)
- kafka-backup Enterprise: Installed on a host that can reach both clusters and S3 (Installation Guide)
- License: Enterprise license with the `migrations:msk-kraft` feature, or the 14-day free trial (activates automatically)
- Network: The migration runner must have network access to both clusters' bootstrap servers
You can validate all prerequisites without a license: `plan` generates the IAM policies you need, and `precheck` verifies network connectivity, S3 access, and cluster compatibility.
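As a quick preflight, you can also confirm network and S3 reachability from the migration runner before involving the tool at all. A minimal sketch; the bootstrap hostnames are placeholders for your own clusters' values:
# Placeholder hostnames; substitute your clusters' bootstrap servers.
SOURCE_BOOTSTRAP=b-1.prod-zk.abc123.kafka.us-east-1.amazonaws.com
TARGET_BOOTSTRAP=b-1.prod-kraft.ghi456.kafka.us-east-1.amazonaws.com
# TCP reachability to both clusters' IAM listeners (MSK IAM auth uses port 9098)
for host in "$SOURCE_BOOTSTRAP" "$TARGET_BOOTSTRAP"; do
  nc -vz -w 5 "$host" 9098 || echo "cannot reach $host:9098"
done
# S3 access to both buckets
aws s3 ls s3://prod-migration-segments/ >/dev/null && echo "segments bucket OK"
aws s3 ls s3://prod-migration-evidence/ >/dev/null && echo "evidence bucket OK"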
Step 1: Create the Migration Config
Create a YAML configuration file that describes your source, target, and migration parameters.
enterprise:
msk_kraft_migration:
source:
cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-zk-cluster/abc-def-123
auth:
mode: iam
target:
cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-kraft-cluster/ghi-jkl-456
auth:
mode: iam
backup:
s3_bucket: prod-migration-segments
s3_prefix: zk-to-kraft/
evidence:
s3_bucket: prod-migration-evidence
s3_prefix: migrations/
retention: 7y
cutover:
drain_timeout: 30m
drain_max_partition_lag: 100
drain_stable_window: 30s
max_producer_freeze: 60s
producer_freeze_webhook: https://internal-api.example.com/kafka/freeze
validation:
count_tolerance: 1
spot_check_records_per_partition: 3
acl:
on_drift: merge
See the Configuration Reference for every field and its defaults.
Auth mode examples
IAM (most common):
auth:
mode: iam
SCRAM-SHA-512:
auth:
mode: scram-sha-512
username: ${KAFKA_USERNAME}
password: ${KAFKA_PASSWORD}
mTLS:
auth:
mode: mtls
keystore: /path/to/keystore.jks
keystore_password: ${KEYSTORE_PASS}
truststore: /path/to/truststore.jks
truststore_password: ${TRUSTSTORE_PASS}
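The ${...} references above are resolved from the runner's environment. A sketch of populating them from AWS Secrets Manager before invoking any kafka-backup command; the secret ID and JSON field names are assumptions about how your credentials are stored:
# Illustrative secret ID; MSK SCRAM secrets are conventionally prefixed AmazonMSK_.
export KAFKA_USERNAME=$(aws secretsmanager get-secret-value \
  --secret-id AmazonMSK_prod/scram-user \
  --query SecretString --output text | jq -r '.username')
export KAFKA_PASSWORD=$(aws secretsmanager get-secret-value \
  --secret-id AmazonMSK_prod/scram-user \
  --query SecretString --output text | jq -r '.password')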
Step 2: Generate the Migration Plan
kafka-backup migrate msk-kraft plan \
--config migration.yaml \
--format all \
--out-dir ./migration-plan
This generates six artifacts:
| File | Purpose |
|---|---|
| `plan.json` | Machine-readable migration plan (topic list, partition counts, estimated data volume) |
| `runbook.md` | Auto-generated step-by-step runbook customized to your clusters |
| `aws-cli.sh` | AWS CLI commands for any infrastructure setup needed |
| `iam-policy-templated.json` | IAM policy template with placeholder ARNs |
| `iam-policy-concrete.json` | IAM policy with your actual cluster and bucket ARNs |
| `cost-estimate.json` | Estimated S3 storage and data transfer costs |
Attach `iam-policy-concrete.json` to the IAM role running the migration. Without these permissions, `execute` will fail with S3 or MSK access errors.
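For example, to attach the generated policy inline to an existing runner role (the role and policy names here are placeholders):
aws iam put-role-policy \
  --role-name kafka-backup-migration-runner \
  --policy-name msk-kraft-migration \
  --policy-document file://migration-plan/iam-policy-concrete.json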
Example plan summary from a production IAM migration:
source=oso-msk-prod metadata=ZOOKEEPER kafka=3.9.x brokers=3
target=oso-msk-prod-kraft metadata=KRAFT kafka=3.9.x.kraft brokers=3
topics_to_create=36
partitions_to_create=306
estimated_seed_bytes=86128640
skipped_internal=["__amazon_msk_canary","__consumer_offsets"]
sample topics:
cdc.postgres.public.line_items partitions=6 rf=3
streams.orders partitions=6 rf=3
cdc.mongo.catalog.reviews partitions=6 rf=3
kb-bench-source partitions=96 rf=3
Step 3: Run Precheck
kafka-backup migrate msk-kraft precheck --config migration.yaml
Precheck performs read-only analysis of both clusters and reports:
- Blockers (B-codes): Must be resolved before migration can proceed
- Warnings (W-codes): Proceed with awareness
- Info (I-codes): Informational, no action needed
Common blockers and their fixes:
| Code | Issue | Fix |
|---|---|---|
| B03 | Source is not ZK mode | Verify source ARN points to a ZooKeeper-mode cluster |
| B04 | Target is not KRaft mode | Provision target in KRaft mode |
| B07/B08 | S3 bucket not reachable | Create buckets or fix IAM permissions |
| B09/B10 | Kafka brokers not reachable | Check security groups and bootstrap servers |
| B11/B12 | Target message size too small | Increase target `message.max.bytes` |
See the Precheck Codes Reference for all codes with detailed remediation.
Example precheck output with no blockers:
W04 warn: could not verify target message-size floor (target broker DescribeConfigs returned no message.max.bytes or replica.fetch.max.bytes (dynamic-config only on this broker)) — ensure target `message.max.bytes` and `replica.fetch.max.bytes` ≥ largest source topic's effective max.message.bytes
W03 info: KMS key ARN set on backup channel — CMK access is not verified by this precheck phase; ensure the caller has kms:Encrypt/Decrypt/GenerateDataKey
I01 info: target is IAM-auth — ACLs will be emitted as access-map.json for customer IaC to translate to IAM policies (tool does not apply IAM)
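If precheck runs in CI, you can gate the pipeline on its outcome. This sketch assumes precheck exits non-zero when blockers are present; confirm that behavior for your version before relying on it:
if kafka-backup migrate msk-kraft precheck --config migration.yaml; then
  echo "precheck passed: no blockers"
else
  echo "precheck reported blockers; resolve B-codes before execute" >&2
  exit 1
fi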
Step 4: Execute the Migration
kafka-backup migrate msk-kraft execute \
--config migration.yaml \
--journal-dir ./journal
Execute runs through these phases automatically:
- Precheck — re-verifies cluster compatibility
- Topology Copy — creates topics and ACLs on target
- Seed — bulk-copies all existing data through S3
- Tail — continuously replicates new records until lag is within tolerance
- Drain Ready — halts and waits for you to proceed
The command blocks until `drain_ready` and then exits with code 0. This is your signal that the data is caught up and you can proceed to cutover.
Example journal excerpt at drain-ready:
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0 records_replayed=0 bytes_replayed=0
Seed phase duration depends on data volume: for 100 GB, expect roughly 30 minutes. Tail converges quickly for steady-state workloads. The entire execute phase typically completes in under an hour for clusters under 500 GB.
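Because execute blocks until `drain_ready` and exits 0, you can chain a notification to kick off cutover coordination. The Slack webhook URL below is a placeholder:
kafka-backup migrate msk-kraft execute \
  --config migration.yaml \
  --journal-dir ./journal \
&& curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text":"Migration is drain_ready; schedule the cutover window"}'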
Monitor progress
In another terminal, check migration status:
kafka-backup migrate msk-kraft status \
--config migration.yaml \
--migration-id <ID> \
--journal-dir ./journal
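To poll continuously instead of re-running by hand, wrap status in a loop (the 30-second interval is arbitrary):
while true; do
  kafka-backup migrate msk-kraft status \
    --config migration.yaml \
    --migration-id <ID> \
    --journal-dir ./journal
  sleep 30
done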
Step 5: Coordinate the Cutover
When execute completes with drain_ready, coordinate with your application teams:
- Schedule a maintenance window — the producer freeze is typically under 60 seconds, but applications should be prepared
- Prepare client configs — have new bootstrap servers ready to deploy (K8s ConfigMap, SSM Parameter, Consul KV, etc.)
Producer freeze options
Webhook (recommended): Configure `cutover.producer_freeze_webhook` in your config. The tool sends one POST request to freeze producers and a second POST to unfreeze them.
Manual TTY: If no webhook is configured, the tool prompts in the terminal. You manually confirm when producers are frozen.
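Before the real cutover it is worth confirming the freeze endpoint is reachable. Use a non-mutating request so you do not trigger an actual freeze; whether the endpoint accepts OPTIONS depends on your internal API:
# Reachability/auth check only; does not exercise the tool's actual freeze POST.
curl -sS -o /dev/null -w 'freeze endpoint: HTTP %{http_code}\n' \
  -X OPTIONS https://internal-api.example.com/kafka/freeze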
Run cutover
kafka-backup migrate msk-kraft cutover \
--config migration.yaml \
--migration-id <ID> \
--journal-dir ./journal
Cutover performs:
- Freezes producers (webhook or manual)
- Publishes sentinel records to every partition
- Drains final records from source to target
- Snapshots all consumer group offsets from source
- Translates offsets using the offset map (source → target)
- Commits translated offsets on target
- Verifies target log-start offsets have not advanced past the copied data
- Logs `READY_FOR_CLIENT_SWITCH`
Example cutover-ready output:
2026-04-25T06:40:46.854037Z cutover -> awaiting_client_switch READY_FOR_CLIENT_SWITCH: groups_translated=0 offsets_committed=0 warnings=0
Step 6: Switch Clients
After cutover completes, update your applications to point to the new KRaft cluster:
# Example: update Kubernetes ConfigMap
kubectl patch configmap kafka-config -p '{"data":{"bootstrap.servers":"b-1.kraft.abc123.kafka.us-east-1.amazonaws.com:9098,b-2.kraft.abc123.kafka.us-east-1.amazonaws.com:9098"}}'
# Roll deployments
kubectl rollout restart deployment/order-service deployment/analytics-service
Consumers resume from the translated target offsets, so message continuity is preserved across the switch.
Verify consumer resume
Spot-check a consumer group to confirm it reads from the expected position:
kafka-consumer-groups.sh \
--bootstrap-server <target-bootstrap> \
--group <consumer-group> \
--describe
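With the stock Apache Kafka tooling, --describe prints one row per partition including a LAG column. A quick way to confirm the group is caught up is to sum that column; the field position can differ across Kafka versions, so treat the awk below as a sketch:
kafka-consumer-groups.sh \
  --bootstrap-server <target-bootstrap> \
  --group <consumer-group> \
  --describe \
| awk 'NR > 1 && $6 ~ /^[0-9]+$/ { total += $6 } END { print "total lag:", total }'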
Step 7: Acknowledge the Client Switch
Once all clients are running against the target:
kafka-backup migrate msk-kraft cutover-ack \
--config migration.yaml \
--migration-id <ID> \
--journal-dir ./journal
This moves the migration to the `validating` state.
Step 8: Finalize
kafka-backup migrate msk-kraft finalize \
--config migration.yaml \
--migration-id <ID> \
--journal-dir ./journal
Finalize runs the 5-check validation suite:
- Topic parity — partition counts match
- Counts & offsets — record counts within tolerance
- Spot-check records — sampled records are byte-equal
- Sentinel presence — cutover markers landed
- Consumer group reconciliation — translated offsets committed correctly
On success, the Ed25519-signed evidence bundle is uploaded to S3.
Example successful post-finalize verification:
partitions_checked=306
target_behind_or_missing=0
earliest_partitions_checked=306
earliest_mismatches=0
latest_partitions_checked=306
latest_mismatches=0
Step 9: Verify the Evidence Bundle
Download and inspect the evidence:
aws s3 cp \
s3://prod-migration-evidence/migrations/<MIGRATION_ID>/evidence.json \
./evidence.json
# Compute a digest of the signed payload (optional; full Ed25519 verification is sketched below)
cat evidence.json | jq -r '.bundle_json' | sha256sum
# Check validation outcome
cat evidence.json | jq -r '.bundle_json' | jq -r '.validation.overall'
# Check individual validation outcomes
cat evidence.json | jq -r '.bundle_json' | jq '{
topic_parity: .validation.topic_parity.outcome,
counts_and_offsets: .validation.counts_and_offsets.outcome,
spot_check_records: .validation.spot_check_records.outcome,
sentinel_presence: .validation.sentinel_presence.outcome,
consumer_group_reconciliation: .validation.consumer_group_reconciliation.outcome
}'
The evidence bundle contains the complete migration journal, cluster snapshots, topology diff, ACL plan, seed/tail statistics, cutover report, and validation results. Share it with your compliance team.
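To verify the Ed25519 signature itself rather than just reading outcomes, OpenSSL 1.1.1+ supports raw Ed25519 verification. The field names (bundle_json, signature), the base64 signature encoding, and the availability of the publisher's public key in PEM form are all assumptions; adjust to your actual bundle layout:
# Assumed layout: .bundle_json is the signed payload, .signature a base64 Ed25519 signature.
jq -r '.bundle_json' evidence.json > bundle.json
jq -r '.signature' evidence.json | base64 -d > bundle.sig
# ed25519-public.pem: the publisher's verification key, obtained out of band.
openssl pkeyutl -verify -pubin -inkey ed25519-public.pem \
  -rawin -in bundle.json -sigfile bundle.sig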
Rollback Procedure
Once cutover commits translated offsets to the target, rollback is no longer available. The source cluster remains untouched throughout — if you need to abort post-cutover, point applications back to the source cluster manually.
To rollback a migration before cutover:
kafka-backup migrate msk-kraft rollback \
--config migration.yaml \
--migration-id <ID> \
--journal-dir ./journal
Rollback:
- Unfreezes producers (if frozen)
- Marks the migration as `rolled_back` in the journal
- Uploads a rollback report to the evidence bucket
- Does not delete topics/data on the target (manual cleanup)
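If you need to clean up a partially seeded target after a rollback, the standard topic tooling works; double-check you are pointed at the target cluster, not the source. The topic name is taken from the sample plan above:
# List what the migration created, then delete selectively.
kafka-topics.sh --bootstrap-server <target-bootstrap> --list
kafka-topics.sh --bootstrap-server <target-bootstrap> --delete --topic streams.orders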
Resume After Failure
If the migration fails mid-execution (network error, timeout, crash):
kafka-backup migrate msk-kraft resume \
--config migration.yaml \
--migration-id <ID> \
--journal-dir ./journal
Resume reads the journal to find the last successful state and re-enters execution from there. The offset-map sidecar and tail checkpoints are persisted in S3, so no data is re-transferred.
If the resume fingerprint doesn't match (the config changed between runs), use `--force-restart` to override, but note this may cause duplicate records on the target during the seed phase.
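For example, to restart deliberately after a config change, accepting possible seed-phase duplicates on the target:
kafka-backup migrate msk-kraft resume \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal \
  --force-restart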
Timeline Expectations
Rough estimates for a 3-broker MSK cluster:
| Data volume | Seed phase | Tail convergence | Cutover window | Total |
|---|---|---|---|---|
| 10 GB | ~5 min | ~2 min | < 30s | ~10 min |
| 100 GB | ~30 min | ~5 min | < 30s | ~45 min |
| 1 TB | ~4 hrs | ~15 min | < 60s | ~5 hrs |
| 10 TB | ~36 hrs | ~30 min | < 60s | ~40 hrs |
Factors that affect timing:
- Network bandwidth between runner, MSK, and S3
- Partition count — more partitions = more parallelism in seed
- Message size — large messages consume bandwidth faster
- Producer throughput during tail — high-throughput topics take longer to converge
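The table implies an effective seed throughput of roughly 57-80 MB/s (100 GB in about 30 minutes works out to about 57 MB/s). A back-of-envelope estimate for your own volume:
# Estimated seed minutes = data_gb * 1024 MB / throughput (MB/s) / 60
data_gb=250
throughput_mbs=57
echo "$(( data_gb * 1024 / throughput_mbs / 60 )) minutes"   # prints: 74 minutes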
Next Steps
- Configuration Reference — tune cutover, validation, and seed parameters
- Monitoring Guide — what to watch during migration
- Troubleshooting — common errors and fixes
- Precheck Codes — all precheck findings explained