MSK KRaft Migration Troubleshooting
Precheck Failures
Precheck reports blockers that must be resolved before migration can proceed. See the Precheck Codes Reference for every code with detailed remediation.
Most common blockers:
| Code | Issue | Quick fix |
|---|---|---|
| B09/B10 | Kafka brokers unreachable | Check security groups, VPC peering, bootstrap servers |
| B07/B08 | S3 bucket unreachable | Create bucket or fix IAM policy |
| B11/B12 | Target message size too small | Increase message.max.bytes on target |
Execute Phase Errors
S3 Access Denied
Error: AccessDenied when calling PutObject on s3://bucket/prefix/...
Cause: The migration runner's IAM role lacks S3 permissions.
Fix: Use the IAM policy generated by plan --format iam-policy. Ensure the role has s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject on both the segments and evidence buckets.
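If you want to confirm the runner role's access before rerunning, a quick boto3 smoke test like the sketch below exercises the four required actions. The bucket names and prefix are placeholders for your segments and evidence buckets.

```python
# Hypothetical smoke test: exercise the four S3 actions the runner needs.
# Bucket names and prefix below are placeholders.
import boto3
from botocore.exceptions import ClientError

def check_bucket(bucket: str, prefix: str) -> None:
    s3 = boto3.client("s3")
    try:
        s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)  # s3:ListBucket
        key = f"{prefix}/permission-check.tmp"
        s3.put_object(Bucket=bucket, Key=key, Body=b"ok")            # s3:PutObject
        s3.get_object(Bucket=bucket, Key=key)                        # s3:GetObject
        s3.delete_object(Bucket=bucket, Key=key)                     # s3:DeleteObject
        print(f"{bucket}: OK")
    except ClientError as err:
        print(f"{bucket}: {err.response['Error']['Code']}")

for bucket in ("my-segments-bucket", "my-evidence-bucket"):  # placeholders
    check_bucket(bucket, "permission-check")
```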
Source Cluster Timeout During Seed
Error: Timeout waiting for metadata from source cluster
Cause: Source cluster is under heavy load or network path is slow.
Fix:
- Reduce seed.max_concurrent_partitions (default 4) to lower source cluster load
- Verify network bandwidth between runner and source
- Check source broker CPU and memory in CloudWatch
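To separate a slow network path from an overloaded source cluster, it can help to time a plain metadata request from the runner host. A minimal sketch using confluent-kafka; the bootstrap servers and security settings are placeholders.

```python
# Hypothetical check: time a metadata request to the source cluster from the
# runner host. Bootstrap servers and security settings are placeholders.
import time
from confluent_kafka.admin import AdminClient

admin = AdminClient({
    "bootstrap.servers": "b-1.source.kafka.eu-west-2.amazonaws.com:9098",  # placeholder
    # add security.protocol / SASL settings matching your cluster
})

start = time.monotonic()
metadata = admin.list_topics(timeout=30)
elapsed = time.monotonic() - start
print(f"{len(metadata.topics)} topics from {len(metadata.brokers)} brokers "
      f"in {elapsed:.1f}s")
```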
KRaft Admin-Routing Flake (B12 pattern)
Error: DescribeConfigs failed on target broker: NOT_CONTROLLER
Cause: KRaft clusters route admin requests to the controller. Metadata may briefly point to a non-controller broker after elections.
Fix: Re-run execute. The retry will typically succeed. If persistent, check target cluster health in the MSK console.
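To confirm the flake has cleared without rerunning the whole phase, a standalone DescribeConfigs probe with a short retry loop can help. A sketch with confluent-kafka; the broker id and bootstrap servers are placeholders.

```python
# Hypothetical probe: retry DescribeConfigs against a target broker. It usually
# succeeds once controller metadata settles after an election. Values are placeholders.
import time
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "target-bootstrap:9098"})  # placeholder
resource = ConfigResource(ConfigResource.Type.BROKER, "1")           # placeholder broker id

for attempt in range(1, 6):
    future = admin.describe_configs([resource])[resource]
    try:
        configs = future.result(timeout=10)
        print(f"attempt {attempt}: ok ({len(configs)} configs)")
        break
    except Exception as err:  # e.g. NOT_CONTROLLER right after an election
        print(f"attempt {attempt}: {err}")
        time.sleep(2)
```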
Tail Phase Issues
Lag Not Converging
The tail phase reports per-partition lag. If lag is not decreasing:
Cause 1: Source producer throughput exceeds replication speed.
- Reduce source producer throughput during migration, or
- Increase runner resources (CPU, network bandwidth)
Cause 2: Compacted topics with high churn.
- Compaction deletes records between seed and tail, causing apparent drift
- This is expected — precheck warning W07 flags compacted topics
Cause 3: Large messages consuming bandwidth.
- Check message.max.bytes on source topics
- Consider scheduling migration during off-peak hours
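To see which partitions are falling behind independently of the tool's status output, you can compare the replication consumer group's committed offsets against the source high watermarks. A sketch with confluent-kafka; the group id, topic, and bootstrap servers are placeholders, since the group name the runner uses may differ in your setup.

```python
# Hypothetical lag check: committed offsets of the replication consumer group vs
# source high watermarks. Group id, topic, and servers are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "source-bootstrap:9098",  # placeholder
    "group.id": "msk-migration-runner",            # placeholder group id
    "enable.auto.commit": False,
})

topic = "orders"                                   # placeholder topic
meta = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"{tp.topic}[{tp.partition}] lag={lag}")

consumer.close()
```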
Drain Timeout
Error: drain_timeout exceeded — lag not within threshold after 30m
Fix: Increase cutover.drain_timeout in the config, or raise the cutover.drain_max_partition_lag threshold to tolerate more residual lag at cutover.
Cutover Failures
Manual Producer Freeze in a Non-Interactive Shell
cutover failed: Cutover failed: producer-freeze freeze: stdin is not a TTY — configure cutover.producer_freeze_webhook or run in an interactive shell
Cause: No cutover.producer_freeze_webhook is configured and the cutover command is running without an interactive terminal, so the tool cannot prompt for manual freeze confirmation.
Fix: Configure cutover.producer_freeze_webhook, or rerun cutover from an interactive shell where the operator can confirm the producer freeze.
Producer Freeze Webhook Timeout
Error: producer freeze webhook did not respond within max_producer_freeze
Cause: The webhook URL is unreachable or took too long to respond.
Fix:
- Verify the webhook URL is accessible from the migration runner
- Increase cutover.max_producer_freeze if producers need more time to drain
- Check webhook service health and logs
Producer Freeze Webhook Error
Error: producer freeze webhook returned HTTP 500
Cause: The freeze endpoint returned a non-2xx status.
Fix: Check the webhook service logs. The tool expects HTTP 2xx for success. On failure, producers are automatically unfrozen (best-effort) and the migration enters failed state. Resume after fixing the webhook.
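As an illustration of the 2xx contract, a minimal freeze endpoint performs the freeze and only then returns success, so cutover.max_producer_freeze has to cover however long that takes. The sketch below uses Flask; the request payload fields and the freeze_producers() helper are assumptions, not the tool's documented contract.

```python
# Hypothetical freeze webhook: perform the freeze, then return 2xx. The payload
# fields and the freeze_producers() helper are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def freeze_producers(migration_id: str) -> None:
    # Placeholder: pause producer deployments, flip a feature flag, stop jobs, ...
    pass

@app.route("/freeze", methods=["POST"])
def freeze():
    payload = request.get_json(silent=True) or {}
    migration_id = payload.get("migration_id", "unknown")
    try:
        freeze_producers(migration_id)
    except Exception as err:
        # A non-2xx response tells the tool the freeze failed; it will unfreeze
        # (best-effort) and mark the migration failed.
        return jsonify({"status": "error", "detail": str(err)}), 500
    # 2xx signals success; keep handling time within cutover.max_producer_freeze.
    return jsonify({"status": "frozen", "migration_id": migration_id}), 200

if __name__ == "__main__":
    app.run(port=8080)
```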
Offset Commit Failed
Error: failed to commit translated offsets: CoordinatorNotAvailable
Cause: The target cluster's consumer group coordinator is unavailable.
Fix: Check target cluster health. Resume the migration — the cutover phase is idempotent and will retry the offset commit.
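A quick way to check whether group coordination on the target is healthy is to confirm that every __consumer_offsets partition has an elected leader. A sketch with confluent-kafka; the bootstrap servers are a placeholder.

```python
# Hypothetical health check: verify every __consumer_offsets partition on the
# target has a leader before resuming. Bootstrap servers are a placeholder.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "target-bootstrap:9098"})  # placeholder
meta = admin.list_topics("__consumer_offsets", timeout=10)

offsets_topic = meta.topics["__consumer_offsets"]
leaderless = [p.id for p in offsets_topic.partitions.values() if p.leader < 0]
print(f"{len(offsets_topic.partitions)} partitions, "
      f"{len(leaderless)} without a leader: {leaderless}")
```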
Validation Failures
Count Mismatch (counts_and_offsets)
Validation FAILED: counts_and_offsets: 3 partitions exceed count_tolerance=1
Cause 1: Records were produced to the target after cutover but before finalize. The validation compares source snapshot to current target state.
Fix: Run finalize immediately after cutover-ack, before any applications write to the target. If this already happened, the extra records are from your applications — increase validation.count_tolerance or accept this as expected.
Cause 2: Compacted topics have different retention on source and target.
Fix: Compacted topic drift is expected and reported as W07. The validation downgrades count mismatches on compacted topics to warnings.
Cause 3: A delete-retention topic aged out restored records on the target before the final switch. This can happen when source records are restored with their original CreateTime timestamps and the target topic has finite retention.ms.
Fix: Temporarily extend topic retention for the migration window, rerun the repair/replay step for the affected partition, and finalize again. Recent versions also run a cutover log-start guard before READY_FOR_CLIENT_SWITCH; if truncation is detected there, the client switch is blocked.
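Extending retention for the migration window can be done with an AlterConfigs call and reverted afterwards. A sketch with confluent-kafka; the topic name and retention value are placeholders, and AlterConfigs replaces the topic's dynamic overrides, so carry over any other per-topic settings you need to keep.

```python
# Hypothetical retention bump: raise retention.ms on the affected topic for the
# migration window, then restore the original value after finalize. Values are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "target-bootstrap:9098"})  # placeholder

topic = "orders"  # placeholder topic
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    topic,
    set_config={"retention.ms": "2592000000"},  # 30 days for the migration window
)

# AlterConfigs replaces the topic's dynamic overrides; include any other settings
# (cleanup.policy, segment.ms, ...) that must be preserved.
admin.alter_configs([resource])[resource].result()
print(f"retention.ms raised on {topic}; remember to restore it after finalize")
```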
Spot-Check Record Mismatch
Validation FAILED: spot_check_records: 1 mismatched on orders:0
Cause: A sampled record differs between source and target. This can happen with compacted topics where records are deleted between seed and tail.
Fix: Check if the topic uses cleanup.policy=compact. If so, this is expected — the validation reports it as a warning, not a failure, for compacted topics. For non-compacted topics, investigate the specific offset in the validation report.
Sentinel Not Found
Validation FAILED: sentinel_presence: partition 3 sentinel not found
Cause: The cutover sentinel record was not found at the expected offset on the target.
Fix: This is rare. Check the cutover log for the expected sentinel offsets. The sentinel may have been compacted away if the topic has aggressive compaction. Resume finalize — the check will retry.
FAQ
Can finalize succeed after a validation mismatch is repaired?
Yes. finalize writes evidence for the validation attempt. If validation surfaces a repairable issue, fix the affected topic or partition, rerun the verification, and finalize again. The successful terminal state is still finalized.
Example repair/finalize journal path from an IAM migration:
2026-04-25T06:47:45.748014Z awaiting_client_switch -> validating operator confirmed client cutover
2026-04-25T07:13:25.030284Z validating -> failed validation failed: counts_and_offsets: 1 partition(s) exceed count_tolerance=1 (evidence at s3://oso-msk-prod-kraft-migration-evidence-510999144577-eu-west-2/migrations/d0efd3c5-5308-48ff-b5a6-5ea8d4b24708-3--efffc490-733e-46c4-9d0b-57418c0a3906-3--20260425T061502Z/evidence.json)
2026-04-25T07:19:09.528774Z failed -> finalized evidence signed + uploaded to s3://oso-msk-prod-kraft-migration-evidence-510999144577-eu-west-2/migrations/d0efd3c5-5308-48ff-b5a6-5ea8d4b24708-3--efffc490-733e-46c4-9d0b-57418c0a3906-3--20260425T061502Z/evidence.json (validation=PASSED)
The validation details identified one affected partition:
overall=FAILED
counts_and_offsets: 1 partition(s) exceed count_tolerance=1
spot_check_records: 1050 samples compared, 1049 matched, 1 mismatched, 0 skipped
cdc.mongo.catalog.reviews/2 source_span=330 target_span=295 diff=35
fetch cdc.mongo.catalog.reviews/2@0: Kafka error: Broker returned error code 1
The source topic configuration showed finite delete retention:
cleanup.policy=delete
message.timestamp.type=CreateTime
retention.ms=604800000
segment.ms=604800000
After repair, independent source/target comparisons passed:
partitions_checked=306
target_behind_or_missing=0
earliest_partitions_checked=306
earliest_mismatches=0
latest_partitions_checked=306
latest_mismatches=0
Resume Errors
Fingerprint Mismatch
Error: resume fingerprint mismatch — config may have changed between runs
Cause: The migration config (source ARN, target ARN, bucket names) changed between the original execute and the resume attempt.
Fix:
- If the config change was intentional, use --force-restart to override (accepts the risk of duplicates)
- If unintentional, restore the original config and retry
Journal Corruption
Error: failed to parse journal entry at line 5
Cause: The journal file was manually edited or partially written during a crash.
Fix: If using local journal (--journal-dir), check the journal.jsonl file. Remove the incomplete last line if it was mid-write during a crash. The migration will resume from the last complete entry.
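A small script can verify which entry is incomplete and trim it safely. A sketch assuming the journal is newline-delimited JSON; back up the file first and adjust the path to your --journal-dir.

```python
# Hypothetical journal repair: keep only the leading run of valid JSON lines and
# back up the original file first. The path is a placeholder for your --journal-dir.
import json
import shutil
from pathlib import Path

journal = Path("journal-dir/journal.jsonl")  # placeholder path
shutil.copy2(journal, journal.with_suffix(".jsonl.bak"))

good_lines = []
for lineno, line in enumerate(journal.read_text().splitlines(), start=1):
    try:
        json.loads(line)
        good_lines.append(line)
    except json.JSONDecodeError:
        print(f"dropping incomplete entry at line {lineno}")
        break  # anything after a torn write is suspect

journal.write_text("\n".join(good_lines) + "\n")
print(f"kept {len(good_lines)} complete entries")
```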
Evidence Upload Failures
S3 Object Lock Error
Error: PutObject failed: InvalidRequest — bucket does not have Object Lock enabled
Cause: The config specifies evidence.retention but the bucket was not created with Object Lock.
Fix: Object Lock cannot be added retroactively. Either:
- Create a new bucket with Object Lock enabled and update the config
- Remove evidence.retention from the config (evidence uploads without retention)
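For the first option, the replacement bucket has to be created with Object Lock enabled from the start. A boto3 sketch; the bucket name, region, and default retention are placeholders.

```python
# Hypothetical bucket creation: Object Lock is requested at creation time.
# Bucket name, region, and retention period are placeholders.
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")  # placeholder region

s3.create_bucket(
    Bucket="my-migration-evidence-objectlock",     # placeholder name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    ObjectLockEnabledForBucket=True,               # also turns on versioning
)

# Optional: a default retention so evidence objects are locked automatically.
s3.put_object_lock_configuration(
    Bucket="my-migration-evidence-objectlock",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)
```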
Evidence Signing Key Not Found
Error: signing key not found at path: /etc/kafka-backup/evidence-signing.key
Cause: evidence.signing_key_path points to a file that doesn't exist.
Fix: Either provide the Ed25519 signing key at the configured path, or remove signing_key_path from the config to use the built-in demo key.
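Generating a key with the Python cryptography package is straightforward; the on-disk format the tool expects is not specified here, so the PEM encoding in the sketch below is an assumption to verify against the Configuration Reference.

```python
# Hypothetical key generation: write an Ed25519 private key to the configured
# path. The PEM/PKCS8 encoding is an assumption; confirm what the tool expects.
from pathlib import Path
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519

key = ed25519.Ed25519PrivateKey.generate()
pem = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

path = Path("/etc/kafka-backup/evidence-signing.key")  # the configured path
path.write_bytes(pem)
path.chmod(0o600)
print(f"wrote Ed25519 private key to {path}")
```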
Common Patterns
"Works in Dev, Fails in Prod"
Typical causes:
- IAM permissions: Dev uses broad admin policies; prod has scoped permissions. Use the generated iam-policy-concrete.json from plan --format iam-policy.
- Security groups: Dev runner is in the same VPC; prod runner needs VPC peering or PrivateLink to reach MSK brokers.
- Cross-account: Source and target in different AWS accounts need cross-account IAM roles and resource policies.
- KMS encryption: Prod S3 bucket uses KMS — the runner needs kms:GenerateDataKey and kms:Decrypt.
Migration Stuck in "Tailing"
The tail phase runs indefinitely until lag converges. If it seems stuck:
- Check status output for per-partition lag
- Identify the lagging partitions
- Check if those partitions have high producer throughput
- Consider reducing source throughput or increasing runner resources
- If acceptable, increase cutover.drain_max_partition_lag to allow more lag before declaring drain-ready
Cutover Succeeded But Consumers See Old Data
Cause: Client applications have cached the old bootstrap servers or are connecting to the source cluster via a DNS alias that hasn't been updated.
Fix: Verify that application configs point to the target cluster's bootstrap servers. Check DNS TTLs if using CNAME-based routing. Force-restart consumer applications to pick up the new config.
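If clients reach the cluster through a CNAME, you can check where the alias currently points and how long stale answers may stay cached. A sketch using dnspython; the alias name is a placeholder.

```python
# Hypothetical DNS check: see where the bootstrap alias points and its TTL.
# The alias name is a placeholder.
import dns.resolver

alias = "kafka-bootstrap.internal.example.com"  # placeholder CNAME
answer = dns.resolver.resolve(alias, "CNAME")

for record in answer:
    print(f"{alias} -> {record.target} (TTL {answer.rrset.ttl}s)")
```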
Next Steps
- Precheck Codes Reference — all precheck findings
- Configuration Reference — tuning parameters
- Monitoring Guide — what to watch during migration