MSK KRaft Migration Troubleshooting
Precheck Failures
Precheck reports blockers that must be resolved before migration can proceed. See the Precheck Codes Reference for every code with detailed remediation.
Most common blockers:
| Code | Issue | Quick fix |
|---|---|---|
| B09/B10 | Kafka brokers unreachable | Check security groups, VPC peering, bootstrap servers |
| B07/B08 | S3 bucket unreachable | Create bucket or fix IAM policy |
| B11/B12 | Target message size too small | Increase message.max.bytes on target |
Execute Phase Errors
S3 Access Denied
Error: AccessDenied when calling PutObject on s3://bucket/prefix/...
Cause: The migration runner's IAM role lacks S3 permissions.
Fix: Use the IAM policy generated by plan --format iam-policy. Ensure the role has s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject on both the segments and evidence buckets.
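If you want to confirm the runner role's access before rerunning, a quick boto3 smoke test like the sketch below exercises the four required actions. The bucket names and prefix are placeholders for your segments and evidence buckets.

```python
# Hypothetical smoke test: exercise the four S3 actions the runner needs.
# Bucket names and prefix below are placeholders.
import boto3
from botocore.exceptions import ClientError

def check_bucket(bucket: str, prefix: str) -> None:
    s3 = boto3.client("s3")
    try:
        s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)  # s3:ListBucket
        key = f"{prefix}/permission-check.tmp"
        s3.put_object(Bucket=bucket, Key=key, Body=b"ok")            # s3:PutObject
        s3.get_object(Bucket=bucket, Key=key)                        # s3:GetObject
        s3.delete_object(Bucket=bucket, Key=key)                     # s3:DeleteObject
        print(f"{bucket}: OK")
    except ClientError as err:
        print(f"{bucket}: {err.response['Error']['Code']}")

for bucket in ("my-segments-bucket", "my-evidence-bucket"):  # placeholders
    check_bucket(bucket, "permission-check")
```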
Source Cluster Timeout During Seed
Error: Timeout waiting for metadata from source cluster
Cause: Source cluster is under heavy load or network path is slow.
Fix:
- Reduce seed.max_concurrent_partitions (default 4) to lower source cluster load
- Verify network bandwidth between runner and source
- Check source broker CPU and memory in CloudWatch
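To separate a slow network path from an overloaded source cluster, it can help to time a plain metadata request from the runner host. A minimal sketch using confluent-kafka; the bootstrap servers and security settings are placeholders.

```python
# Hypothetical check: time a metadata request to the source cluster from the
# runner host. Bootstrap servers and security settings are placeholders.
import time
from confluent_kafka.admin import AdminClient

admin = AdminClient({
    "bootstrap.servers": "b-1.source.kafka.eu-west-2.amazonaws.com:9098",  # placeholder
    # add security.protocol / SASL settings matching your cluster
})

start = time.monotonic()
metadata = admin.list_topics(timeout=30)
elapsed = time.monotonic() - start
print(f"{len(metadata.topics)} topics from {len(metadata.brokers)} brokers "
      f"in {elapsed:.1f}s")
```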
KRaft Admin-Routing Flake (B12 pattern)
Error: DescribeConfigs failed on target broker: NOT_CONTROLLER
Cause: KRaft clusters route admin requests to the controller. Metadata may briefly point to a non-controller broker after elections.
Fix: Re-run execute. The retry will typically succeed. If persistent, check target cluster health in the MSK console.
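To confirm the flake has cleared without rerunning the whole phase, a standalone DescribeConfigs probe with a short retry loop can help. A sketch with confluent-kafka; the broker id and bootstrap servers are placeholders.

```python
# Hypothetical probe: retry DescribeConfigs against a target broker. It usually
# succeeds once controller metadata settles after an election. Values are placeholders.
import time
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "target-bootstrap:9098"})  # placeholder
resource = ConfigResource(ConfigResource.Type.BROKER, "1")           # placeholder broker id

for attempt in range(1, 6):
    future = admin.describe_configs([resource])[resource]
    try:
        configs = future.result(timeout=10)
        print(f"attempt {attempt}: ok ({len(configs)} configs)")
        break
    except Exception as err:  # e.g. NOT_CONTROLLER right after an election
        print(f"attempt {attempt}: {err}")
        time.sleep(2)
```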
Tail Phase Issues
Lag Not Converging
The tail phase reports per-partition lag. If lag is not decreasing:
Cause 1: Source producer throughput exceeds replication speed.
- Reduce source producer throughput during migration, or
- Increase runner resources (CPU, network bandwidth)
Cause 2: Compacted topics with high churn.
- Compaction deletes records between seed and tail, causing apparent drift
- This is expected — precheck warning W07 flags compacted topics
Cause 3: Large messages consuming bandwidth.
- Check message.max.bytes on source topics
- Consider scheduling migration during off-peak hours
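To see which partitions are falling behind independently of the tool's status output, you can compare the replication consumer group's committed offsets against the source high watermarks. A sketch with confluent-kafka; the group id, topic, and bootstrap servers are placeholders, since the group name the runner uses may differ in your setup.

```python
# Hypothetical lag check: committed offsets of the replication consumer group vs
# source high watermarks. Group id, topic, and servers are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "source-bootstrap:9098",  # placeholder
    "group.id": "msk-migration-runner",            # placeholder group id
    "enable.auto.commit": False,
})

topic = "orders"                                   # placeholder topic
meta = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"{tp.topic}[{tp.partition}] lag={lag}")

consumer.close()
```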
Drain Timeout
Error: drain_timeout exceeded — lag not within threshold after 30m
Fix: Increase cutover.drain_timeout in the config, or raise the cutover.drain_max_partition_lag threshold to tolerate more residual lag at cutover.
Cutover Failures
Manual Producer Freeze in a Non-Interactive Shell
cutover failed: Cutover failed: producer-freeze freeze: stdin is not a TTY — configure cutover.producer_freeze_webhook or run in an interactive shell
Cause: No cutover.producer_freeze_webhook is configured and the cutover command is running without an interactive terminal, so the tool cannot prompt for manual freeze confirmation.
Fix: Configure cutover.producer_freeze_webhook, or rerun cutover from an interactive shell where the operator can confirm the producer freeze.
Producer Freeze Webhook Timeout
Error: producer freeze webhook did not respond within max_producer_freeze
Cause: The webhook URL is unreachable or took too long to respond.
Fix:
- Verify the webhook URL is accessible from the migration runner
- Increase cutover.max_producer_freeze if producers need more time to drain
- Check webhook service health and logs
Producer Freeze Webhook Error
Error: producer freeze webhook returned HTTP 500
Cause: The freeze endpoint returned a non-2xx status.
Fix: Check the webhook service logs. The tool expects HTTP 2xx for success. On failure, producers are automatically unfrozen (best-effort) and the migration enters failed state. Resume after fixing the webhook.
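As an illustration of the 2xx contract, a minimal freeze endpoint performs the freeze and only then returns success, so cutover.max_producer_freeze has to cover however long that takes. The sketch below uses Flask; the request payload fields and the freeze_producers() helper are assumptions, not the tool's documented contract.

```python
# Hypothetical freeze webhook: perform the freeze, then return 2xx. The payload
# fields and the freeze_producers() helper are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def freeze_producers(migration_id: str) -> None:
    # Placeholder: pause producer deployments, flip a feature flag, stop jobs, ...
    pass

@app.route("/freeze", methods=["POST"])
def freeze():
    payload = request.get_json(silent=True) or {}
    migration_id = payload.get("migration_id", "unknown")
    try:
        freeze_producers(migration_id)
    except Exception as err:
        # A non-2xx response tells the tool the freeze failed; it will unfreeze
        # (best-effort) and mark the migration failed.
        return jsonify({"status": "error", "detail": str(err)}), 500
    # 2xx signals success; keep handling time within cutover.max_producer_freeze.
    return jsonify({"status": "frozen", "migration_id": migration_id}), 200

if __name__ == "__main__":
    app.run(port=8080)
```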
Offset Commit Failed
Error: failed to commit translated offsets: CoordinatorNotAvailable
Cause: The target cluster's consumer group coordinator is unavailable.
Fix: Check target cluster health. Resume the migration — the cutover phase is idempotent and will retry the offset commit.
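A quick way to check whether group coordination on the target is healthy is to confirm that every __consumer_offsets partition has an elected leader. A sketch with confluent-kafka; the bootstrap servers are a placeholder.

```python
# Hypothetical health check: verify every __consumer_offsets partition on the
# target has a leader before resuming. Bootstrap servers are a placeholder.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "target-bootstrap:9098"})  # placeholder
meta = admin.list_topics("__consumer_offsets", timeout=10)

offsets_topic = meta.topics["__consumer_offsets"]
leaderless = [p.id for p in offsets_topic.partitions.values() if p.leader < 0]
print(f"{len(offsets_topic.partitions)} partitions, "
      f"{len(leaderless)} without a leader: {leaderless}")
```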
Validation Failures
Count Mismatch (counts_and_offsets)
Validation FAILED: counts_and_offsets: 3 partitions exceed count_tolerance=1
Cause 1: Records were produced to the target after cutover but before finalize. The validation compares source snapshot to current target state.
Fix: Run finalize immediately after cutover-ack, before any applications write to the target. If this already happened, the extra records are from your applications — increase validation.count_tolerance or accept this as expected.
Cause 2: Compacted topics have different retention on source and target.
Fix: Compacted topic drift is expected and reported as W07. The validation downgrades count mismatches on compacted topics to warnings.
Cause 3: A delete-retention topic aged out restored records on the target before the final switch. This can happen when source records are restored with their original CreateTime timestamps and the target topic has finite retention.ms.
Fix: Temporarily extend topic retention for the migration window, rerun the repair/replay step for the affected partition, and finalize again. Recent versions also run a cutover log-start guard before READY_FOR_CLIENT_SWITCH; if truncation is detected there, the client switch is blocked.
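Extending retention for the migration window can be done with an AlterConfigs call and reverted afterwards. A sketch with confluent-kafka; the topic name and retention value are placeholders, and AlterConfigs replaces the topic's dynamic overrides, so carry over any other per-topic settings you need to keep.

```python
# Hypothetical retention bump: raise retention.ms on the affected topic for the
# migration window, then restore the original value after finalize. Values are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "target-bootstrap:9098"})  # placeholder

topic = "orders"  # placeholder topic
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    topic,
    set_config={"retention.ms": "2592000000"},  # 30 days for the migration window
)

# AlterConfigs replaces the topic's dynamic overrides; include any other settings
# (cleanup.policy, segment.ms, ...) that must be preserved.
admin.alter_configs([resource])[resource].result()
print(f"retention.ms raised on {topic}; remember to restore it after finalize")
```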
Spot-Check Record Mismatch
Validation FAILED: spot_check_records: 1 mismatched on orders:0
Cause: A sampled record differs between source and target. This can happen with compacted topics where records are deleted between seed and tail.
Fix: Check if the topic uses cleanup.policy=compact. If so, this is expected — the validation reports it as a warning, not a failure, for compacted topics. For non-compacted topics, investigate the specific offset in the validation report.
Sentinel Not Found
Validation FAILED: sentinel_presence: partition 3 sentinel not found
Cause: The cutover sentinel record was not found at the expected offset on the target.
Fix: This is rare. Check the cutover log for the expected sentinel offsets. The sentinel may have been compacted away if the topic has aggressive compaction. Resume finalize — the check will retry.
FAQ
Can finalize succeed after a validation mismatch is repaired?
Yes. finalize writes evidence for the validation attempt. If validation surfaces a repairable issue, fix the affected topic or partition, rerun the verification, and finalize again. The successful terminal state is still finalized.
Example repair/finalize journal path from an IAM migration:
2026-04-25T06:47:45.748014Z awaiting_client_switch -> validating operator confirmed client cutover
2026-04-25T07:13:25.030284Z validating -> failed validation failed: counts_and_offsets: 1 partition(s) exceed count_tolerance=1 (evidence at s3://oso-msk-prod-kraft-migration-evidence-510999144577-eu-west-2/migrations/d0efd3c5-5308-48ff-b5a6-5ea8d4b24708-3--efffc490-733e-46c4-9d0b-57418c0a3906-3--20260425T061502Z/evidence.json)
2026-04-25T07:19:09.528774Z failed -> finalized evidence signed + uploaded to s3://oso-msk-prod-kraft-migration-evidence-510999144577-eu-west-2/migrations/d0efd3c5-5308-48ff-b5a6-5ea8d4b24708-3--efffc490-733e-46c4-9d0b-57418c0a3906-3--20260425T061502Z/evidence.json (validation=PASSED)
The validation details identified one affected partition:
overall=FAILED
counts_and_offsets: 1 partition(s) exceed count_tolerance=1
spot_check_records: 1050 samples compared, 1049 matched, 1 mismatched, 0 skipped
cdc.mongo.catalog.reviews/2 source_span=330 target_span=295 diff=35
fetch cdc.mongo.catalog.reviews/2@0: Kafka error: Broker returned error code 1
The source topic configuration showed finite delete retention:
cleanup.policy=delete
message.timestamp.type=CreateTime
retention.ms=604800000
segment.ms=604800000
After repair, independent source/target comparisons passed:
partitions_checked=306
target_behind_or_missing=0
earliest_partitions_checked=306
earliest_mismatches=0
latest_partitions_checked=306
latest_mismatches=0
Resume Errors
Fingerprint Mismatch
Error: resume fingerprint mismatch — config may have changed between runs
Cause: The migration config (source ARN, target ARN, bucket names) changed between the original execute and the resume attempt.
Fix:
- If the config change was intentional, use --force-restart to override (accepts the risk of duplicates)
- If unintentional, restore the original config and retry
Journal Corruption
Error: failed to parse journal entry at line 5
Cause: The journal file was manually edited or partially written during a crash.
Fix: If using local journal (--journal-dir), check the journal.jsonl file. Remove the incomplete last line if it was mid-write during a crash. The migration will resume from the last complete entry.
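A small script can verify which entry is incomplete and trim it safely. A sketch assuming the journal is newline-delimited JSON; back up the file first and adjust the path to your --journal-dir.

```python
# Hypothetical journal repair: keep only the leading run of valid JSON lines and
# back up the original file first. The path is a placeholder for your --journal-dir.
import json
import shutil
from pathlib import Path

journal = Path("journal-dir/journal.jsonl")  # placeholder path
shutil.copy2(journal, journal.with_suffix(".jsonl.bak"))

good_lines = []
for lineno, line in enumerate(journal.read_text().splitlines(), start=1):
    try:
        json.loads(line)
        good_lines.append(line)
    except json.JSONDecodeError:
        print(f"dropping incomplete entry at line {lineno}")
        break  # anything after a torn write is suspect

journal.write_text("\n".join(good_lines) + "\n")
print(f"kept {len(good_lines)} complete entries")
```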
Evidence Upload Failures
S3 Object Lock Error
Error: PutObject failed: InvalidRequest — bucket does not have Object Lock enabled
Cause: The config specifies evidence.retention but the bucket was not created with Object Lock.
Fix: Object Lock cannot be added retroactively. Either:
- Create a new bucket with Object Lock enabled and update the config
- Remove evidence.retention from the config (evidence uploads without retention)
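For the first option, the replacement bucket has to be created with Object Lock enabled from the start. A boto3 sketch; the bucket name, region, and default retention are placeholders.

```python
# Hypothetical bucket creation: Object Lock is requested at creation time.
# Bucket name, region, and retention period are placeholders.
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")  # placeholder region

s3.create_bucket(
    Bucket="my-migration-evidence-objectlock",     # placeholder name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    ObjectLockEnabledForBucket=True,               # also turns on versioning
)

# Optional: a default retention so evidence objects are locked automatically.
s3.put_object_lock_configuration(
    Bucket="my-migration-evidence-objectlock",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)
```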
Evidence Signing Key Not Found
Error: signing key not found at path: /etc/kafka-backup/evidence-signing.key
Cause: evidence.signing_key_path points to a file that doesn't exist.
Fix: Either provide the Ed25519 signing key at the configured path, or remove signing_key_path from the config to use the built-in demo key.
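Generating a key with the Python cryptography package is straightforward; the on-disk format the tool expects is not specified here, so the PEM encoding in the sketch below is an assumption to verify against the Configuration Reference.

```python
# Hypothetical key generation: write an Ed25519 private key to the configured
# path. The PEM/PKCS8 encoding is an assumption; confirm what the tool expects.
from pathlib import Path
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ed25519

key = ed25519.Ed25519PrivateKey.generate()
pem = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

path = Path("/etc/kafka-backup/evidence-signing.key")  # the configured path
path.write_bytes(pem)
path.chmod(0o600)
print(f"wrote Ed25519 private key to {path}")
```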
Common Patterns
"Works in Dev, Fails in Prod"
Typical causes:
- IAM permissions: Dev uses broad admin policies; prod has scoped permissions. Use the generated iam-policy-concrete.json from plan --format iam-policy.
- Security groups: Dev runner is in the same VPC; prod runner needs VPC peering or PrivateLink to reach MSK brokers.
- Cross-account: Source and target in different AWS accounts need cross-account IAM roles and resource policies.
- KMS encryption: Prod S3 bucket uses KMS — the runner needs kms:GenerateDataKey and kms:Decrypt.
Migration Stuck in "Tailing"
The tail phase runs indefinitely until lag converges. If it seems stuck:
- Check status output for per-partition lag
- Identify the lagging partitions
- Check if those partitions have high producer throughput
- Consider reducing source throughput or increasing runner resources
- If acceptable, increase cutover.drain_max_partition_lag to allow more lag before declaring drain-ready
Cutover Succeeded But Consumers See Old Data
Cause: Client applications have cached the old bootstrap servers or are connecting to the source cluster via a DNS alias that hasn't been updated.
Fix: Verify that application configs point to the target cluster's bootstrap servers. Check DNS TTLs if using CNAME-based routing. Force-restart consumer applications to pick up the new config.
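If clients reach the cluster through a CNAME, you can check where the alias currently points and how long stale answers may stay cached. A sketch using dnspython; the alias name is a placeholder.

```python
# Hypothetical DNS check: see where the bootstrap alias points and its TTL.
# The alias name is a placeholder.
import dns.resolver

alias = "kafka-bootstrap.internal.example.com"  # placeholder CNAME
answer = dns.resolver.resolve(alias, "CNAME")

for record in answer:
    print(f"{alias} -> {record.target} (TTL {answer.rrset.ttl}s)")
```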
Next Steps
- Precheck Codes Reference — all precheck findings
- Configuration Reference — tuning parameters
- Monitoring Guide — what to watch during migration