MSK KRaft Migration Monitoring

What to Monitor

During a migration, track these key signals:

Phase | Monitor | Healthy signal
--- | --- | ---
Seed | S3 write throughput, source cluster CPU | Steady throughput, CPU < 70%
Tail | Per-partition lag, records replayed | Lag decreasing toward 0
Drain Ready | Partition lag stability | All partitions below threshold for drain_stable_window
Cutover | Producer freeze duration, offset commit | Freeze < max_producer_freeze, commits succeed
Validation | Check outcomes | All 5 checks PASSED or WARNING

Using the Status Command

Check migration progress at any time:

kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal

The status output includes:

  • Current state (e.g., tail, drain_ready)
  • Per-partition lag (records behind)
  • Total lag across all partitions
  • Time in current state
  • Next expected action

Run status in a watch loop for continuous monitoring:

watch -n 10 'kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal'

Key Metrics by Phase

Seed Phase

  • Duration: Proportional to data volume. ~30 min for 100GB, ~4h for 1TB.
  • Source cluster CPU: Should stay below 70%. If higher, reduce seed.max_concurrent_partitions.
  • S3 write throughput: Monitor via CloudWatch S3 metrics (see the sketch after this list). Expect sustained writes for the duration of the seed.
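
If the bucket has S3 request metrics enabled, CloudWatch exposes upload volume directly. A minimal sketch using the AWS CLI, assuming GNU date and a request-metrics filter named EntireBucket (the bucket name is a placeholder):

# Sum of bytes uploaded per minute over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name BytesUploaded \
  --dimensions Name=BucketName,Value=<BACKUP_BUCKET> Name=FilterId,Value=EntireBucket \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Sum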

Tail Phase

  • Per-partition lag: The most important metric. Every partition must drop to or below drain_max_partition_lag before drain can succeed.
  • Lag trend: Decreasing lag is healthy; flat or increasing lag means source produce throughput exceeds replication speed (see the sketch after this list).
  • Records replayed: A steadily increasing count is healthy. Plateaus may indicate network issues.
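
One way to check whether the source is outrunning replication is to sample end offsets twice and diff. A rough sketch using kafka-get-offsets.sh (ships with Kafka 3.x; the topic name orders and $SOURCE_BROKERS are placeholders, and you may need --command-config for MSK auth, with flag support varying by Kafka version):

# Sum end offsets for one topic, wait, sample again, and estimate records/sec
before=$(kafka-get-offsets.sh --bootstrap-server "$SOURCE_BROKERS" --topic orders --time -1 \
  | awk -F: '{sum += $3} END {print sum}')
sleep 60
after=$(kafka-get-offsets.sh --bootstrap-server "$SOURCE_BROKERS" --topic orders --time -1 \
  | awk -F: '{sum += $3} END {print sum}')
echo "orders produce rate: $(( (after - before) / 60 )) records/sec"

If this rate is consistently higher than the throughput implied by the records-replayed counter, lag will never drain.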

Cutover Phase

  • Producer freeze duration: Should stay well under max_producer_freeze (default 60s); if it approaches the limit, producers may need more time to drain (see the config sketch after this list).
  • Sentinel delivery: All partitions should receive sentinels within seconds.
  • Offset translation: Consumer group offsets should translate without errors.
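
The knobs referenced in this section live in migration.yaml. A sketch with illustrative values; the key names come from this page, but the exact nesting is an assumption:

seed:
  max_concurrent_partitions: 8    # lower this if source CPU exceeds 70% during seed
drain_max_partition_lag: 100      # records; every partition must stay at or below this
drain_stable_window: 60s          # how long lag must hold below the threshold before drain-ready
max_producer_freeze: 60s          # upper bound on the producer freeze at cutover (default 60s)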

Alerting Recommendations

Condition | Action
--- | ---
Execute phase running > 2x expected duration | Investigate: check source CPU, S3 throughput, network
Tail lag not decreasing for 10+ minutes | Check source producer throughput, runner resources
Cutover producer freeze approaching timeout | Check webhook health, increase max_producer_freeze
Validation check FAILED | Keep clients on the target only if the issue is understood; repair the affected topic/partition and retry finalize
Migration in failed state | Check logs, resolve issue, resume
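
The last two rows are easy to script, because the journal records every transition. A hedged sketch that alerts when the newest journal entry lands in the failed state (journal schema as described in the next section; the mail address is a placeholder for your paging hook):

# Page when the most recent journal entry shows a transition into "failed"
last=$(tail -n 1 ./journal/<MIGRATION_ID>/journal.jsonl | jq -r '.to')
if [ "$last" = "failed" ]; then
  reason=$(tail -n 1 ./journal/<MIGRATION_ID>/journal.jsonl | jq -r '.reason')
  echo "migration failed: $reason" | mail -s "kafka-backup migration failed" oncall@example.com
fi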

Interpreting the Journal

The journal (journal.jsonl) records every state transition:

cat ./journal/<MIGRATION_ID>/journal.jsonl | jq .

Each entry contains:

  • from / to: State transition
  • at: UTC timestamp
  • reason: Phase-specific summary (records processed, lag status, etc.)
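
Because every entry carries a UTC timestamp, the journal doubles as a phase-duration log. A jq sketch (assumes jq 1.6+; fractional seconds are stripped before fromdateiso8601) that prints the seconds spent in each state:

jq -s -r '
  [.[] | .at |= (sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601)]
  | [range(1; length) as $i | {state: .[$i-1].to, secs: (.[$i].at - .[$i-1].at)}]
  | .[] | "\(.state)\t\(.secs)s"
' ./journal/<MIGRATION_ID>/journal.jsonl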

Healthy journal progression

This excerpt is from a 3-broker IAM migration with 36 topics and 306 partitions:

2026-04-25T06:15:06.445707Z null -> planned planned | resume_fp=c27b080b829c382a
2026-04-25T06:15:06.458133Z planned -> precheck
2026-04-25T06:15:06.464229Z precheck -> topology_copy
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0 records_replayed=0 bytes_replayed=0
2026-04-25T06:40:46.854037Z cutover -> awaiting_client_switch READY_FOR_CLIENT_SWITCH: groups_translated=0 offsets_committed=0 warnings=0
2026-04-25T06:47:45.748014Z awaiting_client_switch -> validating operator confirmed client cutover

Failure and recovery

{"from": "seed", "to": "failed", "at": "2026-04-24T12:00:00Z", "reason": "S3 PutObject AccessDenied"}
{"from": "failed", "to": "seed", "at": "2026-04-24T12:30:00Z", "reason": "resumed from seed"}
{"from": "seed", "to": "tail", "at": "2026-04-24T16:00:00Z", "reason": "seed complete"}

Troubleshooting During Migration

Migration seems stuck

  1. Run status to see the current state and lag
  2. Check RUST_LOG=debug output for the last operation (see the sketch after this list)
  3. If in tail, check per-partition lag — one lagging partition can hold up drain-ready
  4. If in seed, check S3 write throughput and source cluster health
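
For step 2, rerun status with debug logging. A sketch, assuming the tool honors RUST_LOG the way most Rust CLIs do:

RUST_LOG=debug kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal 2>&1 | tee status-debug.log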

Migration failed

  1. Read the journal's last entry for the failure reason
  2. Check the full log output (redirect to file with 2>&1 | tee migration.log)
  3. Resolve the issue (permissions, network, cluster health)
  4. Run resume — it re-enters from the last successful state (see the sketch below)
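
A sketch of the resume invocation, assuming it takes the same flags as status:

kafka-backup migrate msk-kraft resume \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal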

Want to abort

  • Before cutover: Run rollback to cleanly abort (see the sketch below). Source is untouched.
  • After cutover: The source is untouched but offsets are committed on target. Point applications back to source if needed.
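
A sketch of the pre-cutover abort, again assuming the rollback subcommand takes the same flags as status:

kafka-backup migrate msk-kraft rollback \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal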

Next Steps