MSK KRaft Migration Runbook

This runbook walks through a complete production migration from an AWS MSK ZooKeeper cluster to a KRaft cluster using kafka-backup Enterprise.

Prerequisites

Before starting, ensure you have:

  • Source cluster: MSK Provisioned cluster in ZooKeeper mode (Kafka 2.8+)
  • Target cluster: MSK Provisioned cluster in KRaft mode (Kafka 3.7+), pre-provisioned with the same or larger broker count
  • S3 buckets: Two buckets in the same region as your clusters — one for migration segments, one for evidence
  • IAM permissions: The migration runner needs access to both MSK clusters and both S3 buckets (see Step 2)
  • kafka-backup Enterprise: Installed on a host that can reach both clusters and S3 (Installation Guide)
  • License: Enterprise license with migrations:msk-kraft feature, or the 14-day free trial (activates automatically)
  • Network: The migration runner must have network access to both clusters' bootstrap servers
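The network prerequisite can be confirmed before involving the tool at all. A minimal sketch using bash's built-in /dev/tcp (the bootstrap string below is a placeholder; substitute your clusters' values):

```shell
# Placeholder bootstrap string -- substitute your source (and target) values.
SOURCE_BOOTSTRAP="b-1.zk.abc123.kafka.us-east-1.amazonaws.com:9098,b-2.zk.abc123.kafka.us-east-1.amazonaws.com:9098"

# Split "host:port,host:port" into one "host port" pair per line.
split_bootstrap() {
  echo "$1" | tr ',' '\n' | awk -F: '{print $1, $2}'
}

# Probe each broker port with bash's /dev/tcp (2-second timeout per host).
split_bootstrap "$SOURCE_BOOTSTRAP" | while read -r host port; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OK   $host:$port"
  else
    echo "FAIL $host:$port"
  fi
done
```

Run this from the migration runner host for both the source and target bootstrap strings; any FAIL line points at a security group or routing gap to fix before Step 1.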
Pre-flight with free tools

You can validate all prerequisites without a license. plan generates the IAM policies you need, and precheck verifies network connectivity, S3 access, and cluster compatibility — all free.

Step 1: Create the Migration Config

Create a YAML configuration file that describes your source, target, and migration parameters.

migration.yaml
```yaml
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-zk-cluster/abc-def-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-kraft-cluster/ghi-jkl-456
      auth:
        mode: iam
    backup:
      s3_bucket: prod-migration-segments
      s3_prefix: zk-to-kraft/
    evidence:
      s3_bucket: prod-migration-evidence
      s3_prefix: migrations/
      retention: 7y
    cutover:
      drain_timeout: 30m
      drain_max_partition_lag: 100
      drain_stable_window: 30s
      max_producer_freeze: 60s
      producer_freeze_webhook: https://internal-api.example.com/kafka/freeze
    validation:
      count_tolerance: 1
      spot_check_records_per_partition: 3
    acl:
      on_drift: merge
```

See the Configuration Reference for every field and its defaults.

Auth mode examples

IAM (most common):

```yaml
auth:
  mode: iam
```

SCRAM-SHA-512:

```yaml
auth:
  mode: scram-sha-512
  username: ${KAFKA_USERNAME}
  password: ${KAFKA_PASSWORD}
```

mTLS:

```yaml
auth:
  mode: mtls
  keystore: /path/to/keystore.jks
  keystore_password: ${KEYSTORE_PASS}
  truststore: /path/to/truststore.jks
  truststore_password: ${TRUSTSTORE_PASS}
```
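For SCRAM, MSK stores credentials in AWS Secrets Manager as a JSON document with username and password fields. A sketch of wiring that into the ${KAFKA_USERNAME}/${KAFKA_PASSWORD} interpolation above; the inline sample stands in for a real `aws secretsmanager get-secret-value` call, and the secret name is an example (MSK requires the AmazonMSK_ prefix):

```shell
# Stand-in for:
#   aws secretsmanager get-secret-value --secret-id AmazonMSK_prod \
#     --query SecretString --output text
SECRET_JSON='{"username":"app-producer","password":"s3cr3t"}'

# Export the variables migration.yaml interpolates via ${KAFKA_USERNAME}/${KAFKA_PASSWORD}.
export KAFKA_USERNAME="$(echo "$SECRET_JSON" | jq -r .username)"
export KAFKA_PASSWORD="$(echo "$SECRET_JSON" | jq -r .password)"
```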

Step 2: Generate the Migration Plan

```shell
kafka-backup migrate msk-kraft plan \
  --config migration.yaml \
  --format all \
  --out-dir ./migration-plan
```

This generates six artifacts:

| File | Purpose |
| --- | --- |
| plan.json | Machine-readable migration plan (topic list, partition counts, estimated data volume) |
| runbook.md | Auto-generated step-by-step runbook customized to your clusters |
| aws-cli.sh | AWS CLI commands for any infrastructure setup needed |
| iam-policy-templated.json | IAM policy template with placeholder ARNs |
| iam-policy-concrete.json | IAM policy with your actual cluster and bucket ARNs |
| cost-estimate.json | Estimated S3 storage and data transfer costs |

Review the IAM policy

Attach iam-policy-concrete.json to the IAM role running the migration. Without these permissions, execute will fail with S3 or MSK access errors.
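A cautious way to do that: validate the JSON, then print the attach command for review before running it. The role and policy names below are placeholders, and the sketch writes a stand-in policy file so it runs as-is:

```shell
# Stand-in for ./migration-plan/iam-policy-concrete.json so the sketch runs as-is;
# in practice, point POLICY at the generated file.
POLICY=./iam-policy-concrete.json
printf '%s\n' '{"Version": "2012-10-17", "Statement": []}' > "$POLICY"

# Validate the JSON, then print (not run) the attach command for review.
# "migration-runner" and the policy name are placeholders.
if jq -e .Version "$POLICY" >/dev/null; then
  echo aws iam put-role-policy \
    --role-name migration-runner \
    --policy-name kafka-backup-msk-migration \
    --policy-document "file://$POLICY"
fi
```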

Example plan summary from a production IAM migration:

```
source=oso-msk-prod metadata=ZOOKEEPER kafka=3.9.x brokers=3
target=oso-msk-prod-kraft metadata=KRAFT kafka=3.9.x.kraft brokers=3
topics_to_create=36
partitions_to_create=306
estimated_seed_bytes=86128640
skipped_internal=["__amazon_msk_canary","__consumer_offsets"]

sample topics:
  cdc.postgres.public.line_items partitions=6 rf=3
  streams.orders partitions=6 rf=3
  cdc.mongo.catalog.reviews partitions=6 rf=3
  kb-bench-source partitions=96 rf=3
```

Step 3: Run Precheck

```shell
kafka-backup migrate msk-kraft precheck --config migration.yaml
```

Precheck performs read-only analysis of both clusters and reports:

  • Blockers (B-codes): Must be resolved before migration can proceed
  • Warnings (W-codes): Proceed with awareness
  • Info (I-codes): Informational, no action needed

Common blockers and their fixes:

| Code | Issue | Fix |
| --- | --- | --- |
| B03 | Source is not ZK mode | Verify source ARN points to a ZooKeeper-mode cluster |
| B04 | Target is not KRaft mode | Provision target in KRaft mode |
| B07/B08 | S3 bucket not reachable | Create buckets or fix IAM permissions |
| B09/B10 | Kafka brokers not reachable | Check security groups and bootstrap servers |
| B11/B12 | Target message size too small | Increase target `message.max.bytes` |

See the Precheck Codes Reference for all codes with detailed remediation.
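Precheck results can also be gated in automation. A minimal sketch that counts blocker lines in saved output; it assumes codes appear at the start of each line, as in the example precheck output shown in this step, and the heredoc stands in for piping the real run through tee:

```shell
# Stand-in for: kafka-backup migrate msk-kraft precheck --config migration.yaml | tee precheck.log
cat > precheck.log <<'EOF'
W04 warn: could not verify target message-size floor
I01 info: target is IAM-auth
EOF

# Count lines that start with a B-code; fail the pipeline if any exist.
blockers=$(grep -c '^B[0-9][0-9]' precheck.log || true)
echo "blockers=$blockers"
```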

Example precheck output with no blockers:

```
W04 warn: could not verify target message-size floor (target broker DescribeConfigs returned no message.max.bytes or replica.fetch.max.bytes (dynamic-config only on this broker)) — ensure target `message.max.bytes` and `replica.fetch.max.bytes` ≥ largest source topic's effective max.message.bytes
W03 info: KMS key ARN set on backup channel — CMK access is not verified by this precheck phase; ensure the caller has kms:Encrypt/Decrypt/GenerateDataKey
I01 info: target is IAM-auth — ACLs will be emitted as access-map.json for customer IaC to translate to IAM policies (tool does not apply IAM)
```

Step 4: Execute the Migration

```shell
kafka-backup migrate msk-kraft execute \
  --config migration.yaml \
  --journal-dir ./journal
```

Execute runs through these phases automatically:

  1. Precheck — re-verifies cluster compatibility
  2. Topology Copy — creates topics and ACLs on target
  3. Seed — bulk-copies all existing data through S3
  4. Tail — continuously replicates new records until lag is within tolerance
  5. Drain Ready — halts and waits for you to proceed

The command blocks until drain_ready and then exits with code 0. This is your signal that the data is caught up and you can proceed to cutover.

Example journal excerpt at drain-ready:

```
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0 records_replayed=0 bytes_replayed=0
```
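The journal lines follow a `timestamp from -> to ...` shape, so the current state is recoverable with a one-liner. A sketch against a copy of the excerpt; point the awk at your real journal file:

```shell
# Stand-in journal -- in practice, read the file under ./journal.
cat > journal.log <<'EOF'
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0
EOF

# Current state = right-hand side of the last "from -> to" transition.
state=$(awk '$3 == "->" {s = $4} END {print s}' journal.log)
echo "state=$state"
```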
Expected timeline

Seed phase duration depends on data volume. For 100 GB of data, expect ~30 minutes for seed. Tail converges quickly for steady-state workloads. The entire execute phase typically completes in under an hour for clusters under 500 GB.

Monitor progress

In another terminal, check migration status:

```shell
kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Step 5: Coordinate the Cutover

When execute completes with drain_ready, coordinate with your application teams:

  1. Schedule a maintenance window — the producer freeze is typically under 60 seconds, but applications should be prepared
  2. Prepare client configs — have new bootstrap servers ready to deploy (K8s ConfigMap, SSM Parameter, Consul KV, etc.)

Producer freeze options

Webhook (recommended): Configure cutover.producer_freeze_webhook in your config. The tool sends a POST request to freeze producers and a second POST to unfreeze them.

Manual TTY: If no webhook is configured, the tool prompts in the terminal. You manually confirm when producers are frozen.
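Before the real window, it is worth smoke-testing the freeze endpoint. A hedged sketch: the request body here is purely illustrative, since this runbook does not specify the payload the tool actually POSTs, and the URL is a placeholder.

```shell
# Placeholder URL -- substitute your real cutover.producer_freeze_webhook value.
FREEZE_URL="https://internal-api.example.com/kafka/freeze"

# -f fails on HTTP errors; --max-time keeps the check snappy.
# The JSON body is illustrative only -- match your webhook's actual contract.
if curl -fsS --max-time 5 -X POST "$FREEZE_URL" -d '{"action":"freeze"}' >/dev/null 2>&1; then
  echo "freeze endpoint reachable"
else
  echo "freeze endpoint NOT reachable -- fix before the cutover window"
fi
```

Remember to send the matching unfreeze afterwards if your test endpoint actually toggles state.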

Run cutover

```shell
kafka-backup migrate msk-kraft cutover \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Cutover performs:

  1. Freezes producers (webhook or manual)
  2. Publishes sentinel records to every partition
  3. Drains final records from source to target
  4. Snapshots all consumer group offsets from source
  5. Translates offsets using the offset map (source → target)
  6. Commits translated offsets on target
  7. Verifies target log-start offsets have not advanced past the copied data
  8. Logs READY_FOR_CLIENT_SWITCH

Example cutover-ready output:

```
2026-04-25T06:40:46.854037Z cutover -> awaiting_client_switch READY_FOR_CLIENT_SWITCH: groups_translated=0 offsets_committed=0 warnings=0
```

Step 6: Switch Clients

After cutover completes, update your applications to point to the new KRaft cluster:

```shell
# Example: update Kubernetes ConfigMap
kubectl patch configmap kafka-config -p '{"data":{"bootstrap.servers":"b-1.kraft.abc123.kafka.us-east-1.amazonaws.com:9098,b-2.kraft.abc123.kafka.us-east-1.amazonaws.com:9098"}}'

# Roll deployments
kubectl rollout restart deployment/order-service deployment/analytics-service
```

Consumers resume from the translated target offsets, so message continuity is preserved across the switch.

Verify consumer resume

Spot-check a consumer group to confirm it reads from the expected position:

```shell
kafka-consumer-groups.sh \
  --bootstrap-server <target-bootstrap> \
  --group <consumer-group> \
  --describe
```
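To check every partition at once, the describe output can be piped through awk. A sketch against sample output (columns abridged; real output carries extra trailing columns such as CONSUMER-ID, which does not affect the lag field position):

```shell
# Sample output; in practice:
#   kafka-consumer-groups.sh --bootstrap-server <target> --group <group> --describe > groups.txt
cat > groups.txt <<'EOF'
GROUP          TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-service  streams.orders  0          1500            1500            0
order-service  streams.orders  1          1498            1503            5
EOF

# LAG is the sixth column; report the worst partition.
max_lag=$(awk 'NR > 1 && $6 > m {m = $6} END {print m + 0}' groups.txt)
echo "max_lag=$max_lag"
```

A small, steadily shrinking max lag is expected right after the switch; a large or growing one suggests a consumer is not yet repointed at the target.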

Step 7: Acknowledge the Client Switch

Once all clients are running against the target:

```shell
kafka-backup migrate msk-kraft cutover-ack \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

This moves the migration to the validating state.

Step 8: Finalize

```shell
kafka-backup migrate msk-kraft finalize \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Finalize runs the 5-check validation suite:

  1. Topic parity — partition counts match
  2. Counts & offsets — record counts within tolerance
  3. Spot-check records — sampled records are byte-equal
  4. Sentinel presence — cutover markers landed
  5. Consumer group reconciliation — translated offsets committed correctly

On success, the Ed25519-signed evidence bundle is uploaded to S3.

Example successful post-finalize verification:

```
partitions_checked=306
target_behind_or_missing=0
earliest_partitions_checked=306
earliest_mismatches=0
latest_partitions_checked=306
latest_mismatches=0
```

Step 9: Verify the Evidence Bundle

Download and inspect the evidence:

```shell
aws s3 cp \
  s3://prod-migration-evidence/migrations/<MIGRATION_ID>/evidence.json \
  ./evidence.json

# Compute the digest of the signed payload (full Ed25519 signature
# verification additionally requires the signing public key)
jq -r '.bundle_json' evidence.json | sha256sum

# Check the overall validation outcome
jq -r '.bundle_json' evidence.json | jq -r '.validation.overall'

# Check individual validation outcomes
jq -r '.bundle_json' evidence.json | jq '{
  topic_parity: .validation.topic_parity.outcome,
  counts_and_offsets: .validation.counts_and_offsets.outcome,
  spot_check_records: .validation.spot_check_records.outcome,
  sentinel_presence: .validation.sentinel_presence.outcome,
  consumer_group_reconciliation: .validation.consumer_group_reconciliation.outcome
}'
```

The evidence bundle contains the complete migration journal, cluster snapshots, topology diff, ACL plan, seed/tail statistics, cutover report, and validation results. Share it with your compliance team.

Rollback Procedure

Rollback is only available before cutover completes

Once cutover commits translated offsets to the target, rollback is no longer available. The source cluster remains untouched throughout — if you need to abort post-cutover, point applications back to the source cluster manually.

To roll back a migration before cutover:

```shell
kafka-backup migrate msk-kraft rollback \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Rollback:

  • Unfreezes producers (if frozen)
  • Marks the migration as rolled_back in the journal
  • Uploads a rollback report to the evidence bucket
  • Does not delete topics/data on the target (manual cleanup)
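If you do need to clean up the target after a rollback, the plan artifact lists what was created. A cautious sketch that prints delete commands for review rather than executing them; the plan.json shape shown here is an assumption, so inspect your real file's schema first:

```shell
# Stand-in for ./migration-plan/plan.json -- the real schema may differ.
cat > plan.json <<'EOF'
{"topics": [{"name": "streams.orders"}, {"name": "cdc.postgres.public.line_items"}]}
EOF

# Print (do not run) one delete per migration-created topic, TARGET cluster only.
# Double-check the bootstrap string: running these against the source is destructive.
jq -r '.topics[].name' plan.json | while read -r topic; do
  echo "kafka-topics.sh --bootstrap-server \$TARGET_BOOTSTRAP --delete --topic $topic"
done
```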

Resume After Failure

If the migration fails mid-execution (network error, timeout, crash):

```shell
kafka-backup migrate msk-kraft resume \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Resume reads the journal to find the last successful state and re-enters execution from there. The offset-map sidecar and tail checkpoints are persisted in S3, so no data is re-transferred.

If the resume fingerprint doesn't match (config changed between runs), use --force-restart to override — but note this may cause duplicate records on the target in the seed phase.

Timeline Expectations

Rough estimates for a 3-broker MSK cluster:

| Data volume | Seed phase | Tail convergence | Cutover window | Total |
| --- | --- | --- | --- | --- |
| 10 GB | ~5 min | ~2 min | < 30s | ~10 min |
| 100 GB | ~30 min | ~5 min | < 30s | ~45 min |
| 1 TB | ~4 hrs | ~15 min | < 60s | ~5 hrs |
| 10 TB | ~36 hrs | ~30 min | < 60s | ~40 hrs |

Factors that affect timing:

  • Network bandwidth between runner, MSK, and S3
  • Partition count — more partitions = more parallelism in seed
  • Message size — large messages consume bandwidth faster
  • Producer throughput during tail — high-throughput topics take longer to converge
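The seed rows in the table above follow from simple arithmetic: seed time is roughly bytes × 8 / effective throughput. Assuming ~500 Mbit/s end to end (an assumption; measure your own runner-to-S3 path), 100 GB lands right around the table's ~30 minutes:

```shell
# 100 GB through an assumed ~500 Mbit/s effective pipeline.
est_min=$(awk 'BEGIN {
  bytes = 100 * 1024^3            # 100 GB in bytes
  mbps  = 500                     # assumed effective throughput, Mbit/s
  printf "%.0f", bytes * 8 / (mbps * 1e6) / 60
}')
echo "seed_estimate_min=${est_min}"
```

Plug in your own volume and measured bandwidth to sanity-check the table before scheduling the maintenance window.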

Next Steps