MSK KRaft Migration Monitoring

What to Monitor

During a migration, track these key signals:

Phase | Monitor | Healthy signal
--- | --- | ---
Seed | S3 write throughput, source cluster CPU | Steady throughput, CPU < 70%
Tail | Per-partition lag, records replayed | Lag decreasing toward 0
Drain Ready | Partition lag stability | All partitions below threshold for drain_stable_window
Cutover | Producer freeze duration, offset commit | Freeze < max_producer_freeze, commits succeed
Validation | Check outcomes | All 5 checks PASSED or WARNING

Using the Status Command

Check migration progress at any time:

kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal

The status output includes:

  • Current state (e.g., tail, drain_ready)
  • Per-partition lag (records behind)
  • Total lag across all partitions
  • Time in current state
  • Next expected action

Run status in a watch loop for continuous monitoring:

watch -n 10 'kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal'

Key Metrics by Phase

Seed Phase

  • Duration: Proportional to data volume. ~30 min for 100GB, ~4h for 1TB.
  • Source cluster CPU: Should stay below 70%. If higher, reduce seed.max_concurrent_partitions.
  • S3 write throughput: Monitor via CloudWatch S3 metrics (see the sketch after this list). Expect sustained writes for the duration of the seed.
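
If the bucket has S3 request metrics enabled, CloudWatch exposes upload volume directly. A minimal sketch using the AWS CLI, assuming GNU date and a request-metrics filter named EntireBucket (the bucket name is a placeholder):

# Sum of bytes uploaded per minute over the last hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name BytesUploaded \
  --dimensions Name=BucketName,Value=<BACKUP_BUCKET> Name=FilterId,Value=EntireBucket \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Sum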

Tail Phase

  • Per-partition lag: The most important metric. Every partition must drop to or below drain_max_partition_lag before drain can succeed.
  • Lag trend: Decreasing lag is healthy; flat or increasing lag means source produce throughput exceeds replication speed (see the sketch after this list).
  • Records replayed: A steadily increasing count is healthy. Plateaus may indicate network issues.
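
One way to check whether the source is outrunning replication is to sample end offsets twice and diff. A rough sketch using kafka-get-offsets.sh (ships with Kafka 3.x; the topic name orders and $SOURCE_BROKERS are placeholders, and you may need --command-config for MSK auth, with flag support varying by Kafka version):

# Sum end offsets for one topic, wait, sample again, and estimate records/sec
before=$(kafka-get-offsets.sh --bootstrap-server "$SOURCE_BROKERS" --topic orders --time -1 \
  | awk -F: '{sum += $3} END {print sum}')
sleep 60
after=$(kafka-get-offsets.sh --bootstrap-server "$SOURCE_BROKERS" --topic orders --time -1 \
  | awk -F: '{sum += $3} END {print sum}')
echo "orders produce rate: $(( (after - before) / 60 )) records/sec"

If this rate is consistently higher than the throughput implied by the records-replayed counter, lag will never drain.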

Cutover Phase

  • Producer freeze duration: Should stay well under max_producer_freeze (default 60s); if it approaches the limit, producers may need more time to drain (see the config sketch after this list).
  • Sentinel delivery: All partitions should receive sentinels within seconds.
  • Offset translation: Consumer group offsets should translate without errors.
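
The knobs referenced in this section live in migration.yaml. A sketch with illustrative values; the key names come from this page, but the exact nesting is an assumption:

seed:
  max_concurrent_partitions: 8    # lower this if source CPU exceeds 70% during seed
drain_max_partition_lag: 100      # records; every partition must stay at or below this
drain_stable_window: 60s          # how long lag must hold below the threshold before drain-ready
max_producer_freeze: 60s          # upper bound on the producer freeze at cutover (default 60s)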

Alerting Recommendations

Condition | Action
--- | ---
Execute phase running > 2x expected duration | Investigate: check source CPU, S3 throughput, network
Tail lag not decreasing for 10+ minutes | Check source producer throughput, runner resources
Cutover producer freeze approaching timeout | Check webhook health, increase max_producer_freeze
Validation check FAILED | Keep clients on the target only if the issue is understood; repair the affected topic/partition and retry finalize
Migration in failed state | Check logs, resolve issue, resume
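
The last two rows are easy to script, because the journal records every transition. A hedged sketch that alerts when the newest journal entry lands in the failed state (journal schema as described in the next section; the mail address is a placeholder for your paging hook):

# Page when the most recent journal entry shows a transition into "failed"
last=$(tail -n 1 ./journal/<MIGRATION_ID>/journal.jsonl | jq -r '.to')
if [ "$last" = "failed" ]; then
  reason=$(tail -n 1 ./journal/<MIGRATION_ID>/journal.jsonl | jq -r '.reason')
  echo "migration failed: $reason" | mail -s "kafka-backup migration failed" oncall@example.com
fi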

Interpreting the Journal

The journal (journal.jsonl) records every state transition:

cat ./journal/<MIGRATION_ID>/journal.jsonl | jq .

Each entry contains:

  • from / to: State transition
  • at: UTC timestamp
  • reason: Phase-specific summary (records processed, lag status, etc.)
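
Because every entry carries a UTC timestamp, the journal doubles as a phase-duration log. A jq sketch (assumes jq 1.6+; fractional seconds are stripped before fromdateiso8601) that prints the seconds spent in each state:

jq -s -r '
  [.[] | .at |= (sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601)]
  | [range(1; length) as $i | {state: .[$i-1].to, secs: (.[$i].at - .[$i-1].at)}]
  | .[] | "\(.state)\t\(.secs)s"
' ./journal/<MIGRATION_ID>/journal.jsonl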

Healthy journal progression

This excerpt is from a 3-broker IAM migration with 36 topics and 306 partitions:

2026-04-25T06:15:06.445707Z null -> planned planned | resume_fp=c27b080b829c382a
2026-04-25T06:15:06.458133Z planned -> precheck
2026-04-25T06:15:06.464229Z precheck -> topology_copy
2026-04-25T06:15:07.512121Z topology_copy -> seed
2026-04-25T06:25:30.777840Z seed -> tail
2026-04-25T06:26:19.079610Z tail -> drain_ready drain ready: max_partition_lag=0 records_replayed=0 bytes_replayed=0
2026-04-25T06:40:46.854037Z cutover -> awaiting_client_switch READY_FOR_CLIENT_SWITCH: groups_translated=0 offsets_committed=0 warnings=0
2026-04-25T06:47:45.748014Z awaiting_client_switch -> validating operator confirmed client cutover

Failure and recovery

{"from": "seed", "to": "failed", "at": "2026-04-24T12:00:00Z", "reason": "S3 PutObject AccessDenied"}
{"from": "failed", "to": "seed", "at": "2026-04-24T12:30:00Z", "reason": "resumed from seed"}
{"from": "seed", "to": "tail", "at": "2026-04-24T16:00:00Z", "reason": "seed complete"}

Troubleshooting During Migration

Migration seems stuck

  1. Run status to see the current state and lag
  2. Check RUST_LOG=debug output for the last operation (see the sketch after this list)
  3. If in tail, check per-partition lag — one lagging partition can hold up drain-ready
  4. If in seed, check S3 write throughput and source cluster health
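
For step 2, rerun status with debug logging. A sketch, assuming the tool honors RUST_LOG the way most Rust CLIs do:

RUST_LOG=debug kafka-backup migrate msk-kraft status \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal 2>&1 | tee status-debug.log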

Migration failed

  1. Read the journal's last entry for the failure reason
  2. Check the full log output (redirect to file with 2>&1 | tee migration.log)
  3. Resolve the issue (permissions, network, cluster health)
  4. Run resume — it re-enters from the last successful state (see the sketch below)
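
A sketch of the resume invocation, assuming it takes the same flags as status:

kafka-backup migrate msk-kraft resume \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal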

Want to abort

  • Before cutover: Run rollback to cleanly abort (see the sketch below). Source is untouched.
  • After cutover: The source is untouched but offsets are committed on target. Point applications back to source if needed.
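
A sketch of the pre-cutover abort, again assuming the rollback subcommand takes the same flags as status:

kafka-backup migrate msk-kraft rollback \
  --config migration.yaml \
  --migration-id <ID> \
  --journal-dir ./journal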

Next Steps