# MSK KRaft Migration Monitoring

## What to Monitor
During a migration, track these key signals:
| Phase | Monitor | Healthy signal |
|---|---|---|
| Seed | S3 write throughput, source cluster CPU | Steady throughput, CPU < 70% |
| Tail | Per-partition lag, records replayed | Lag decreasing toward 0 |
| Drain Ready | Partition lag stability | All partitions below threshold for `drain_stable_window` |
| Cutover | Producer freeze duration, offset commit | Freeze < `max_producer_freeze`, commits succeed |
| Validation | Check outcomes | All 5 checks PASSED or WARNING |
## Using the Status Command
Check migration progress at any time:
```bash
kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```
The status output includes:
- Current state (e.g., `tail`, `drain_ready`)
- Per-partition lag (records behind)
- Total lag across all partitions
- Time in current state
- Next expected action
Run status in a watch loop for continuous monitoring:
```bash
watch -n 10 'kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal'
```
## Key Metrics by Phase

### Seed Phase
- Duration: Proportional to data volume: ~30 min for 100 GB, ~4 h for 1 TB.
- Source cluster CPU: Should stay below 70%. If higher, reduce `seed.max_concurrent_partitions` (see the sketch below).
- S3 write throughput: Monitor via CloudWatch S3 metrics. Expect sustained writes for the duration.
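To confirm whether the source cluster is CPU-bound during seed, you can pull the per-broker CPU metrics MSK publishes to CloudWatch. A minimal sketch using the AWS CLI; the cluster name and broker ID are placeholders for your environment, and `date -d` assumes GNU date:

```bash
# Average user CPU for broker 1 over the last hour. MSK publishes
# per-broker CPU metrics (CpuUser, CpuSystem) in the AWS/Kafka namespace.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka \
  --metric-name CpuUser \
  --dimensions Name="Cluster Name",Value=my-source-cluster Name="Broker ID",Value=1 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```

If user plus system CPU trends above 70%, lower `seed.max_concurrent_partitions` and resume.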
### Tail Phase
- Per-partition lag: The most important metric. All partitions must drop below `drain_max_partition_lag` for drain to succeed.
- Lag trend: Decreasing lag = healthy. Flat or increasing = source throughput exceeds replication speed (see the sketch after this list).
- Records replayed: A steadily increasing count = healthy. Plateaus may indicate network issues.
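For the lag trend, a small polling loop is often easier to eyeball than raw `watch` output. This is a sketch, not part of the CLI: it assumes the status output includes a `Total lag` line (adjust the pattern to your version's actual output) and that `MIGRATION_ID` is set in the environment:

```bash
#!/usr/bin/env bash
# Poll total lag every 60s and warn when it stops decreasing.
prev=""
while true; do
  cur=$(kafka-backup migrate msk-kraft status \
          --config migration.yaml \
          --migration-id "$MIGRATION_ID" \
          --journal-dir ./journal \
        | grep -i 'total lag' | grep -oE '[0-9]+' | head -n 1)
  echo "$(date -u +%H:%M:%SZ) total_lag=${cur:-unknown}"
  if [[ -n "$prev" && -n "$cur" && "$cur" -ge "$prev" ]]; then
    echo "WARN: lag not decreasing (was ${prev}, now ${cur})"
  fi
  prev="$cur"
  sleep 60
done
```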
### Cutover Phase
- Producer freeze duration: Should be well under `max_producer_freeze` (default 60s). If it approaches the limit, producers may need more time to drain.
- Sentinel delivery: All partitions should receive sentinels within seconds.
- Offset translation: Consumer group offsets should translate without errors (spot-check below).
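To spot-check offset translation, describe a consumer group directly on the target with the stock Kafka tooling. The bootstrap address, group name, and `client.properties` (matching the target's auth, e.g. IAM on port 9098) are placeholders:

```bash
# Every partition the group owns should show a concrete CURRENT-OFFSET
# (not "-") after translation.
kafka-consumer-groups.sh \
  --bootstrap-server target-broker-1:9098 \
  --command-config client.properties \
  --describe \
  --group my-consumer-group
```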
## Alerting Recommendations
| Condition | Action |
|---|---|
| Execute phase running > 2x expected duration | Investigate: check source CPU, S3 throughput, network |
| Tail lag not decreasing for 10+ minutes | Check source producer throughput, runner resources |
| Cutover producer freeze approaching timeout | Check webhook health, increase `max_producer_freeze` |
| Validation check FAILED | Keep clients on the target only if the issue is understood; repair the affected topic/partition and retry finalize |
| Migration in `failed` state | Check logs, resolve the issue, resume |
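The `failed`-state condition is easy to automate off the journal, since each line is JSON with a `to` field. A minimal cron-able sketch; `MIGRATION_ID` is a placeholder, and the `echo` stands in for your real alerting hook:

```bash
#!/usr/bin/env bash
# Exit non-zero if the most recent journal transition entered "failed".
last_state=$(tail -n 1 "./journal/${MIGRATION_ID}/journal.jsonl" | jq -r '.to')
if [[ "$last_state" == "failed" ]]; then
  echo "ALERT: migration ${MIGRATION_ID} is in failed state"
  exit 1
fi
```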
## Interpreting the Journal

The journal (`journal.jsonl`) records every state transition:

```bash
cat ./journal/<MIGRATION_ID>/journal.jsonl | jq .
```
Each entry contains:

- `from`/`to`: State transition
- `at`: UTC timestamp
- `reason`: Phase-specific summary (records processed, lag status, etc.)
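Because every line carries these fields, `jq` can flatten the history into one row per transition (the `// ""` guards entries that have no reason):

```bash
# Timestamp, from-state, to-state, reason for each transition.
jq -r '[.at, .from, .to, (.reason // "")] | @tsv' \
  ./journal/<MIGRATION_ID>/journal.jsonl
```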
### Healthy journal progression

This excerpt is from a 3-broker IAM migration with 36 topics and 306 partitions:
```text
2026-04-25T06:15:06.445707Z null -> planned planned | resume_fp=c27b080b829c382a
2026-04-25T06:15:06.458133Z planned -> precheck
2026-04-25T06:15:06.464229Z precheck -> topology_copy
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0 records_replayed=0 bytes_replayed=0
2026-04-25T06:40:46.854037Z cutover -> awaiting_client_switch READY_FOR_CLIENT_SWITCH: groups_translated=0 offsets_committed=0 warnings=0
2026-04-25T06:47:45.748014Z awaiting_client_switch -> validating operator confirmed client cutover
```
### Failure and recovery

```json
{"from": "seed", "to": "failed", "at": "2026-04-24T12:00:00Z", "reason": "S3 PutObject AccessDenied"}
{"from": "failed", "to": "seed", "at": "2026-04-24T12:30:00Z", "reason": "resumed from seed"}
{"from": "seed", "to": "tail", "at": "2026-04-24T16:00:00Z", "reason": "seed complete"}
```
## Troubleshooting During Migration

### Migration seems stuck
- Run `status` to see the current state and lag
- Check `RUST_LOG=debug` output for the last operation (example below)
- If in `tail`, check per-partition lag; a single lagging partition can hold up drain-ready
- If in `seed`, check S3 write throughput and source cluster health
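For the debug output, set `RUST_LOG` on whichever command you are inspecting and capture it to a file, for example a verbose one-off status check:

```bash
# Verbose status check, captured to a file for later inspection.
RUST_LOG=debug kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal 2>&1 | tee status-debug.log
```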
### Migration failed
- Read the journal's last entry for the failure reason (one-liner below)
- Check the full log output (redirect to a file with `2>&1 | tee migration.log`)
- Resolve the issue (permissions, network, cluster health)
- Run `resume`; it re-enters the migration from the last successful state
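The one-liner for the first step:

```bash
# Show the most recent transition, including the failure reason.
tail -n 1 ./journal/<MIGRATION_ID>/journal.jsonl | jq .
```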
### Want to abort
- Before cutover: Run `rollback` to abort cleanly (sketch below). The source is untouched.
- After cutover: The source is untouched, but consumer group offsets have been committed on the target. Point applications back to the source if needed.
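A pre-cutover abort presumably mirrors the `status` invocation; this is a sketch, so confirm the exact flags against the CLI Reference:

```bash
# Cleanly abort before cutover; the source cluster is left untouched.
kafka-backup migrate msk-kraft rollback \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal
```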
## Next Steps
- Production Runbook — step-by-step guide
- Troubleshooting — detailed error reference
- CLI Reference — status command options