MSK KRaft Migration Runbook

This runbook walks through a complete production migration from an AWS MSK ZooKeeper cluster to a KRaft cluster using kafka-backup Enterprise.

Prerequisites

Before starting, ensure you have:

  • Source cluster: MSK Provisioned cluster in ZooKeeper mode (Kafka 2.8+)
  • Target cluster: MSK Provisioned cluster in KRaft mode (Kafka 3.7+), pre-provisioned with the same or larger broker count
  • S3 buckets: Two buckets in the same region as your clusters — one for migration segments, one for evidence
  • IAM permissions: The migration runner needs access to both MSK clusters and both S3 buckets (see Step 2)
  • kafka-backup Enterprise: Installed on a host that can reach both clusters and S3 (Installation Guide)
  • License: Enterprise license with migrations:msk-kraft feature, or the 14-day free trial (activates automatically)
  • Network: The migration runner must have network access to both clusters' bootstrap servers
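The network prerequisite can be confirmed before involving the tool at all. A minimal sketch using bash's built-in /dev/tcp (the bootstrap string below is a placeholder; substitute your clusters' values):

```shell
# Placeholder bootstrap string -- substitute your source (and target) values.
SOURCE_BOOTSTRAP="b-1.zk.abc123.kafka.us-east-1.amazonaws.com:9098,b-2.zk.abc123.kafka.us-east-1.amazonaws.com:9098"

# Split "host:port,host:port" into one "host port" pair per line.
split_bootstrap() {
  echo "$1" | tr ',' '\n' | awk -F: '{print $1, $2}'
}

# Probe each broker port with bash's /dev/tcp (2-second timeout per host).
split_bootstrap "$SOURCE_BOOTSTRAP" | while read -r host port; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OK   $host:$port"
  else
    echo "FAIL $host:$port"
  fi
done
```

Run this from the migration runner host for both the source and target bootstrap strings; any FAIL line points at a security group or routing gap to fix before Step 1.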
Pre-flight with free tools

You can validate all prerequisites without a license. plan generates the IAM policies you need, and precheck verifies network connectivity, S3 access, and cluster compatibility — all free.

Step 1: Create the Migration Config

Create a YAML configuration file that describes your source, target, and migration parameters.

migration.yaml
```yaml
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-zk-cluster/abc-def-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-kraft-cluster/ghi-jkl-456
      auth:
        mode: iam
    backup:
      s3_bucket: prod-migration-segments
      s3_prefix: zk-to-kraft/
    evidence:
      s3_bucket: prod-migration-evidence
      s3_prefix: migrations/
      retention: 7y
    cutover:
      drain_timeout: 30m
      drain_max_partition_lag: 100
      drain_stable_window: 30s
      max_producer_freeze: 60s
      producer_freeze_webhook: https://internal-api.example.com/kafka/freeze
    validation:
      count_tolerance: 1
      spot_check_records_per_partition: 3
    acl:
      on_drift: merge
```

See the Configuration Reference for every field and its defaults.

Auth mode examples

IAM (most common):

```yaml
auth:
  mode: iam
```

SCRAM-SHA-512:

```yaml
auth:
  mode: scram-sha-512
  username: ${KAFKA_USERNAME}
  password: ${KAFKA_PASSWORD}
```

mTLS:

```yaml
auth:
  mode: mtls
  keystore: /path/to/keystore.jks
  keystore_password: ${KEYSTORE_PASS}
  truststore: /path/to/truststore.jks
  truststore_password: ${TRUSTSTORE_PASS}
```
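For SCRAM, MSK stores credentials in AWS Secrets Manager as a JSON document with username and password fields. A sketch of wiring that into the ${KAFKA_USERNAME}/${KAFKA_PASSWORD} interpolation above; the inline sample stands in for a real `aws secretsmanager get-secret-value` call, and the secret name is an example (MSK requires the AmazonMSK_ prefix):

```shell
# Stand-in for:
#   aws secretsmanager get-secret-value --secret-id AmazonMSK_prod \
#     --query SecretString --output text
SECRET_JSON='{"username":"app-producer","password":"s3cr3t"}'

# Export the variables migration.yaml interpolates via ${KAFKA_USERNAME}/${KAFKA_PASSWORD}.
export KAFKA_USERNAME="$(echo "$SECRET_JSON" | jq -r .username)"
export KAFKA_PASSWORD="$(echo "$SECRET_JSON" | jq -r .password)"
```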

Step 2: Generate the Migration Plan

```shell
kafka-backup migrate msk-kraft plan \
  --config migration.yaml \
  --format all \
  --out-dir ./migration-plan
```

This generates six artifacts:

| File | Purpose |
| --- | --- |
| plan.json | Machine-readable migration plan (topic list, partition counts, estimated data volume) |
| runbook.md | Auto-generated step-by-step runbook customized to your clusters |
| aws-cli.sh | AWS CLI commands for any infrastructure setup needed |
| iam-policy-templated.json | IAM policy template with placeholder ARNs |
| iam-policy-concrete.json | IAM policy with your actual cluster and bucket ARNs |
| cost-estimate.json | Estimated S3 storage and data transfer costs |

Review the IAM policy

Attach iam-policy-concrete.json to the IAM role running the migration. Without these permissions, execute will fail with S3 or MSK access errors.
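A cautious way to do that: validate the JSON, then print the attach command for review before running it. The role and policy names below are placeholders, and the sketch writes a stand-in policy file so it runs as-is:

```shell
# Stand-in for ./migration-plan/iam-policy-concrete.json so the sketch runs as-is;
# in practice, point POLICY at the generated file.
POLICY=./iam-policy-concrete.json
printf '%s\n' '{"Version": "2012-10-17", "Statement": []}' > "$POLICY"

# Validate the JSON, then print (not run) the attach command for review.
# "migration-runner" and the policy name are placeholders.
if jq -e .Version "$POLICY" >/dev/null; then
  echo aws iam put-role-policy \
    --role-name migration-runner \
    --policy-name kafka-backup-msk-migration \
    --policy-document "file://$POLICY"
fi
```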

Example plan summary from a production IAM migration:

```
source=oso-msk-prod metadata=ZOOKEEPER kafka=3.9.x brokers=3
target=oso-msk-prod-kraft metadata=KRAFT kafka=3.9.x.kraft brokers=3
topics_to_create=36
partitions_to_create=306
estimated_seed_bytes=86128640
skipped_internal=["__amazon_msk_canary","__consumer_offsets"]

sample topics:
  cdc.postgres.public.line_items partitions=6 rf=3
  streams.orders partitions=6 rf=3
  cdc.mongo.catalog.reviews partitions=6 rf=3
  kb-bench-source partitions=96 rf=3
```

Step 3: Run Precheck

```shell
kafka-backup migrate msk-kraft precheck --config migration.yaml
```

Precheck performs read-only analysis of both clusters and reports:

  • Blockers (B-codes): Must be resolved before migration can proceed
  • Warnings (W-codes): Proceed with awareness
  • Info (I-codes): Informational, no action needed

Common blockers and their fixes:

| Code | Issue | Fix |
| --- | --- | --- |
| B03 | Source is not ZK mode | Verify source ARN points to a ZooKeeper-mode cluster |
| B04 | Target is not KRaft mode | Provision target in KRaft mode |
| B07/B08 | S3 bucket not reachable | Create buckets or fix IAM permissions |
| B09/B10 | Kafka brokers not reachable | Check security groups and bootstrap servers |
| B11/B12 | Target message size too small | Increase target `message.max.bytes` |

See the Precheck Codes Reference for all codes with detailed remediation.
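Precheck results can also be gated in automation. A minimal sketch that counts blocker lines in saved output; it assumes codes appear at the start of each line, as in the example precheck output shown in this step, and the heredoc stands in for piping the real run through tee:

```shell
# Stand-in for: kafka-backup migrate msk-kraft precheck --config migration.yaml | tee precheck.log
cat > precheck.log <<'EOF'
W04 warn: could not verify target message-size floor
I01 info: target is IAM-auth
EOF

# Count lines that start with a B-code; fail the pipeline if any exist.
blockers=$(grep -c '^B[0-9][0-9]' precheck.log || true)
echo "blockers=$blockers"
```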

Example precheck output with no blockers:

```
W04 warn: could not verify target message-size floor (target broker DescribeConfigs returned no message.max.bytes or replica.fetch.max.bytes (dynamic-config only on this broker)) — ensure target `message.max.bytes` and `replica.fetch.max.bytes` ≥ largest source topic's effective max.message.bytes
W03 info: KMS key ARN set on backup channel — CMK access is not verified by this precheck phase; ensure the caller has kms:Encrypt/Decrypt/GenerateDataKey
I01 info: target is IAM-auth — ACLs will be emitted as access-map.json for customer IaC to translate to IAM policies (tool does not apply IAM)
```

Step 4: Execute the Migration

```shell
kafka-backup migrate msk-kraft execute \
  --config migration.yaml \
  --journal-dir ./journal
```

Execute runs through these phases automatically:

  1. Precheck — re-verifies cluster compatibility
  2. Topology Copy — creates topics and ACLs on target
  3. Seed — bulk-copies all existing data through S3
  4. Tail — continuously replicates new records until lag is within tolerance
  5. Drain Ready — halts and waits for you to proceed

The command blocks until drain_ready and then exits with code 0. This is your signal that the data is caught up and you can proceed to cutover.

Example journal excerpt at drain-ready:

```
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0 records_replayed=0 bytes_replayed=0
```
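The journal lines follow a `timestamp from -> to ...` shape, so the current state is recoverable with a one-liner. A sketch against a copy of the excerpt; point the awk at your real journal file:

```shell
# Stand-in journal -- in practice, read the file under ./journal.
cat > journal.log <<'EOF'
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0
EOF

# Current state = right-hand side of the last "from -> to" transition.
state=$(awk '$3 == "->" {s = $4} END {print s}' journal.log)
echo "state=$state"
```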
Expected timeline

Seed phase duration depends on data volume. For 100 GB of data, expect ~30 minutes for seed. Tail converges quickly for steady-state workloads. The entire execute phase typically completes in under an hour for clusters under 500 GB.

Monitor progress

In another terminal, check migration status:

```shell
kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Step 5: Coordinate the Cutover

When execute completes with drain_ready, coordinate with your application teams:

  1. Schedule a maintenance window — the producer freeze is typically under 60 seconds, but applications should be prepared
  2. Prepare client configs — have new bootstrap servers ready to deploy (K8s ConfigMap, SSM Parameter, Consul KV, etc.)

Producer freeze options

Webhook (recommended): Configure cutover.producer_freeze_webhook in your config. The tool sends a POST request to freeze producers and a second POST to unfreeze them.

Manual TTY: If no webhook is configured, the tool prompts in the terminal. You manually confirm when producers are frozen.
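Before the real window, it is worth smoke-testing the freeze endpoint. A hedged sketch: the request body here is purely illustrative, since this runbook does not specify the payload the tool actually POSTs, and the URL is a placeholder.

```shell
# Placeholder URL -- substitute your real cutover.producer_freeze_webhook value.
FREEZE_URL="https://internal-api.example.com/kafka/freeze"

# -f fails on HTTP errors; --max-time keeps the check snappy.
# The JSON body is illustrative only -- match your webhook's actual contract.
if curl -fsS --max-time 5 -X POST "$FREEZE_URL" -d '{"action":"freeze"}' >/dev/null 2>&1; then
  echo "freeze endpoint reachable"
else
  echo "freeze endpoint NOT reachable -- fix before the cutover window"
fi
```

Remember to send the matching unfreeze afterwards if your test endpoint actually toggles state.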

Run cutover

```shell
kafka-backup migrate msk-kraft cutover \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Cutover performs:

  1. Freezes producers (webhook or manual)
  2. Publishes sentinel records to every partition
  3. Drains final records from source to target
  4. Snapshots all consumer group offsets from source
  5. Translates offsets using the offset map (source → target)
  6. Commits translated offsets on target
  7. Verifies target log-start offsets have not advanced past the copied data
  8. Logs READY_FOR_CLIENT_SWITCH

Example cutover-ready output:

```
2026-04-25T06:40:46.854037Z cutover -> awaiting_client_switch READY_FOR_CLIENT_SWITCH: groups_translated=0 offsets_committed=0 warnings=0
```

Step 6: Switch Clients

After cutover completes, update your applications to point to the new KRaft cluster:

```shell
# Example: update Kubernetes ConfigMap
kubectl patch configmap kafka-config -p '{"data":{"bootstrap.servers":"b-1.kraft.abc123.kafka.us-east-1.amazonaws.com:9098,b-2.kraft.abc123.kafka.us-east-1.amazonaws.com:9098"}}'

# Roll deployments
kubectl rollout restart deployment/order-service deployment/analytics-service
```

Consumers resume from the translated target offsets, so message continuity is preserved across the switch.

Verify consumer resume

Spot-check a consumer group to confirm it reads from the expected position:

```shell
kafka-consumer-groups.sh \
  --bootstrap-server <target-bootstrap> \
  --group <consumer-group> \
  --describe
```
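To check every partition at once, the describe output can be piped through awk. A sketch against sample output (columns abridged; real output carries extra trailing columns such as CONSUMER-ID, which does not affect the lag field position):

```shell
# Sample output; in practice:
#   kafka-consumer-groups.sh --bootstrap-server <target> --group <group> --describe > groups.txt
cat > groups.txt <<'EOF'
GROUP          TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-service  streams.orders  0          1500            1500            0
order-service  streams.orders  1          1498            1503            5
EOF

# LAG is the sixth column; report the worst partition.
max_lag=$(awk 'NR > 1 && $6 > m {m = $6} END {print m + 0}' groups.txt)
echo "max_lag=$max_lag"
```

A small, steadily shrinking max lag is expected right after the switch; a large or growing one suggests a consumer is not yet repointed at the target.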

Step 7: Acknowledge the Client Switch

Once all clients are running against the target:

```shell
kafka-backup migrate msk-kraft cutover-ack \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

This moves the migration to the validating state.

Step 8: Finalize

```shell
kafka-backup migrate msk-kraft finalize \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Finalize runs the 5-check validation suite:

  1. Topic parity — partition counts match
  2. Counts & offsets — record counts within tolerance
  3. Spot-check records — sampled records are byte-equal
  4. Sentinel presence — cutover markers landed
  5. Consumer group reconciliation — translated offsets committed correctly

On success, the Ed25519-signed evidence bundle is uploaded to S3.

Example successful post-finalize verification:

```
partitions_checked=306
target_behind_or_missing=0
earliest_partitions_checked=306
earliest_mismatches=0
latest_partitions_checked=306
latest_mismatches=0
```

Step 9: Verify the Evidence Bundle

Download and inspect the evidence:

```shell
aws s3 cp \
  s3://prod-migration-evidence/migrations/<MIGRATION_ID>/evidence.json \
  ./evidence.json

# Compute the digest of the signed payload (full Ed25519 signature
# verification additionally requires the signing public key)
jq -r '.bundle_json' evidence.json | sha256sum

# Check the overall validation outcome
jq -r '.bundle_json' evidence.json | jq -r '.validation.overall'

# Check individual validation outcomes
jq -r '.bundle_json' evidence.json | jq '{
  topic_parity: .validation.topic_parity.outcome,
  counts_and_offsets: .validation.counts_and_offsets.outcome,
  spot_check_records: .validation.spot_check_records.outcome,
  sentinel_presence: .validation.sentinel_presence.outcome,
  consumer_group_reconciliation: .validation.consumer_group_reconciliation.outcome
}'
```

The evidence bundle contains the complete migration journal, cluster snapshots, topology diff, ACL plan, seed/tail statistics, cutover report, and validation results. Share it with your compliance team.

Rollback Procedure

Rollback is only available before cutover completes

Once cutover commits translated offsets to the target, rollback is no longer available. The source cluster remains untouched throughout — if you need to abort post-cutover, point applications back to the source cluster manually.

To roll back a migration before cutover:

```shell
kafka-backup migrate msk-kraft rollback \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Rollback:

  • Unfreezes producers (if frozen)
  • Marks the migration as rolled_back in the journal
  • Uploads a rollback report to the evidence bucket
  • Does not delete topics/data on the target (manual cleanup)
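If you do need to clean up the target after a rollback, the plan artifact lists what was created. A cautious sketch that prints delete commands for review rather than executing them; the plan.json shape shown here is an assumption, so inspect your real file's schema first:

```shell
# Stand-in for ./migration-plan/plan.json -- the real schema may differ.
cat > plan.json <<'EOF'
{"topics": [{"name": "streams.orders"}, {"name": "cdc.postgres.public.line_items"}]}
EOF

# Print (do not run) one delete per migration-created topic, TARGET cluster only.
# Double-check the bootstrap string: running these against the source is destructive.
jq -r '.topics[].name' plan.json | while read -r topic; do
  echo "kafka-topics.sh --bootstrap-server \$TARGET_BOOTSTRAP --delete --topic $topic"
done
```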

Resume After Failure

If the migration fails mid-execution (network error, timeout, crash):

```shell
kafka-backup migrate msk-kraft resume \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```

Resume reads the journal to find the last successful state and re-enters execution from there. The offset-map sidecar and tail checkpoints are persisted in S3, so no data is re-transferred.

If the resume fingerprint doesn't match (config changed between runs), use --force-restart to override — but note this may cause duplicate records on the target in the seed phase.

Timeline Expectations

Rough estimates for a 3-broker MSK cluster:

| Data volume | Seed phase | Tail convergence | Cutover window | Total |
| --- | --- | --- | --- | --- |
| 10 GB | ~5 min | ~2 min | < 30s | ~10 min |
| 100 GB | ~30 min | ~5 min | < 30s | ~45 min |
| 1 TB | ~4 hrs | ~15 min | < 60s | ~5 hrs |
| 10 TB | ~36 hrs | ~30 min | < 60s | ~40 hrs |

Factors that affect timing:

  • Network bandwidth between runner, MSK, and S3
  • Partition count — more partitions = more parallelism in seed
  • Message size — large messages consume bandwidth faster
  • Producer throughput during tail — high-throughput topics take longer to converge
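The seed rows in the table above follow from simple arithmetic: seed time is roughly bytes × 8 / effective throughput. Assuming ~500 Mbit/s end to end (an assumption; measure your own runner-to-S3 path), 100 GB lands right around the table's ~30 minutes:

```shell
# 100 GB through an assumed ~500 Mbit/s effective pipeline.
est_min=$(awk 'BEGIN {
  bytes = 100 * 1024^3            # 100 GB in bytes
  mbps  = 500                     # assumed effective throughput, Mbit/s
  printf "%.0f", bytes * 8 / (mbps * 1e6) / 60
}')
echo "seed_estimate_min=${est_min}"
```

Plug in your own volume and measured bandwidth to sanity-check the table before scheduling the maintenance window.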

Next Steps