MSK KRaft Migration Architecture

This page covers the internal architecture of the MSK ZK→KRaft migration pipeline — how data flows, how offsets are translated, and how the evidence bundle is constructed.

Design Philosophy

Three principles guide the architecture:

  1. No in-place migration. S3 is the replication channel between source and target. The source cluster is never modified — it remains a safe rollback target until you decommission it.

  2. Deterministic state machine. Every phase transition is journaled. The migration can be resumed from any point, rolled back before cutover, or audited post-facto from the evidence bundle.

  3. Offset continuity over offset identity. Target offsets will differ from source offsets (compaction, replication timing). What matters is that consumers resume from the same message — the offset map translates between the two.

State Machine

The migration runs through 11 primary states, plus failed (reachable from any state), with deterministic transitions:

FORWARD PATH

  PLANNED ──► PRECHECK ──► TOPOLOGY_COPY ──► SEED ──► TAIL ──► DRAIN_READY
                                                                    │
  FINALIZED ◄── VALIDATING ◄── AWAITING_CLIENT_SWITCH ◄── CUTOVER ◄─┘

ROLLBACK (pre-cutover only)

  any state from PLANNED through DRAIN_READY ──► ROLLED_BACK

FAILED ◄── (reachable from any state)
| State | Serde name | Description |
| --- | --- | --- |
| Planned | planned | Fresh migration, no mutations yet |
| Precheck | precheck | Read-only cluster analysis |
| TopologyCopy | topology_copy | Topics and ACLs created on target |
| Seed | seed | Bulk data transfer via S3 |
| Tail | tail | Continuous replication bridging seed to cutover |
| DrainReady | drain_ready | Lag converged, awaiting operator |
| Cutover | cutover | Producer freeze, sentinel, offset translation |
| AwaitingClientSwitch | awaiting_client_switch | Operator switching applications |
| Validating | validating | 5-check validation running |
| Finalized | finalized | Evidence signed and uploaded (terminal) |
| RolledBack | rolled_back | Migration aborted (terminal) |
| Failed | failed | Phase failed (resumable) |

Resume logic

On resume, the system reads the journal and dispatches based on the tip state:

  • Planned through Tail: Re-enter execute from the last non-failed state
  • DrainReady / AwaitingClientSwitch: Return "waiting for operator" (exit code 10)
  • Failed: Walk backward (max 4 hops) to find the last non-failed state; re-enter from there
  • Finalized / RolledBack: No-op (terminal)

A resume fingerprint (SHA256 of source ARN, target ARN, bucket names) is embedded in the first journal entry. On resume, the fingerprint is verified — this detects silent config changes between crash and resume.
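The fingerprint check can be sketched as follows. The exact canonicalization (field ordering, separators) is an assumption here; the real tool defines its own.

```python
import hashlib

def resume_fingerprint(source_arn: str, target_arn: str, buckets: list[str]) -> str:
    """Illustrative resume fingerprint: SHA-256 over the migration's identity fields.

    Bucket names are sorted so the fingerprint is order-insensitive; a NUL
    separator prevents adjacent fields from colliding when concatenated.
    """
    h = hashlib.sha256()
    for part in [source_arn, target_arn, *sorted(buckets)]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()

# On resume: recompute and compare against the fingerprint stored in the
# first journal entry. Any drift in source, target, or buckets is detected.
fp1 = resume_fingerprint("arn:aws:kafka:...:cluster/src",
                         "arn:aws:kafka:...:cluster/dst",
                         ["evidence", "segments"])
fp2 = resume_fingerprint("arn:aws:kafka:...:cluster/src",
                         "arn:aws:kafka:...:cluster/other",
                         ["evidence", "segments"])
assert fp1 != fp2  # changed target ARN -> different fingerprint
```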

Data Flow: Seed Phase

The seed phase performs a bulk copy of all source data through S3:

 Source Cluster              S3                 Target Cluster
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│              │      │              │      │              │
│  Topic A:0   │─────►│  segments/   │─────►│  Topic A:0   │
│  Topic A:1   │ read │   A/0/...    │ write│  Topic A:1   │
│  Topic B:0   │      │   A/1/...    │      │  Topic B:0   │
│  ...         │      │   B/0/...    │      │  ...         │
│              │      │              │      │              │
└──────────────┘      │  offset-map  │      └──────────────┘
                      │  .json       │
                      └──────────────┘
  1. BackupEngine reads source partitions and writes S3 segments (up to segment_max_bytes each)
  2. RestoreEngine reads segments from S3 and produces to target partitions
  3. Offset map sidecar records the mapping: for partition P, source offsets [src_first, src_last] landed at target offsets [tgt_first, tgt_last]

The seed phase runs max_concurrent_partitions transfers in parallel (default 4).

The current artifact layout stores the migration offset map as:

s3://<evidence_bucket>/<evidence_prefix>/<migration_id>/offset-map.json
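A minimal sketch of what the sidecar could contain. The field names here are illustrative, not the tool's actual schema; only the per-partition `[src_first, src_last] → [tgt_first, tgt_last]` mapping is described by this page.

```python
import json

# Hypothetical offset-map.json contents (field names are illustrative).
# One entry per partition records where the copied source range landed
# on the target cluster.
offset_map = {
    "migration_id": "mig-example",
    "partitions": {
        "orders/0": {"src_first": 100, "src_last": 999, "tgt_first": 0, "tgt_last": 899},
        "orders/1": {"src_first": 0, "src_last": 449, "tgt_first": 0, "tgt_last": 449},
    },
}

# Round-trip through JSON as the sidecar would be stored in S3.
restored = json.loads(json.dumps(offset_map))
p0 = restored["partitions"]["orders/0"]
# Invariant: the copied span has the same length on both sides.
assert p0["src_last"] - p0["src_first"] == p0["tgt_last"] - p0["tgt_first"]
```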

Data Flow: Tail Phase

After seed, the tail phase bridges the gap between the seed snapshot and real-time:

 Source Cluster         Tail Bridge          Target Cluster
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Consumer    │─────►│  Producer    │─────►│              │
│  (from seed  │      │  (to target) │      │ New records  │
│  end offset) │      │              │      │  appended    │
└──────────────┘      │  Updates     │      └──────────────┘
                      │  offset-map  │
                      └──────────────┘

Tail is a direct consumer→producer bridge:

  • Consumes from source starting at the seed end offsets
  • Produces to target, updating the offset map continuously
  • Tracks per-partition lag (source LEO - consumer position)
  • Declares drain_ready when all partitions are within drain_max_partition_lag for drain_stable_window

Tail progress is checkpointed to S3 — on resume, it continues from the checkpoint without re-streaming from seed offsets.

Cutover Protocol

Cutover is the critical phase that ensures offset continuity:

1. Producer Freeze

Producers on the source cluster are frozen (via webhook or manual confirmation). This creates a stable "end of source" boundary.

2. Sentinel Records

A sentinel record is published to every partition on the source:

Header: x-kbe-cutover-sentinel = <migration_id>
Key: __kbe_sentinel__
Value: {"migration_id": "<id>", "partition": <p>, "timestamp": "<iso8601>"}

The sentinel marks the exact boundary between "migrated data" and "nothing else."
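Assembling a sentinel from the documented shape might look like this. Only the header, key, and value payload come from the docs above; the surrounding record structure is an assumption.

```python
import json
from datetime import datetime, timezone

def sentinel_record(migration_id: str, partition: int) -> dict:
    """Build one cutover sentinel for a source partition (illustrative wrapper)."""
    return {
        "headers": {"x-kbe-cutover-sentinel": migration_id},
        "key": "__kbe_sentinel__",
        "value": json.dumps({
            "migration_id": migration_id,
            "partition": partition,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }),
    }

rec = sentinel_record("mig-example", 3)
payload = json.loads(rec["value"])
assert rec["key"] == "__kbe_sentinel__"
assert payload["partition"] == 3
```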

3. Final Drain

Tail continues until every sentinel is replicated to the target. At this point, the target has all source data up to and including the sentinel.

4. Consumer Group Snapshot

All consumer group committed offsets are fetched from the source cluster.

5. Offset Translation

For each consumer group's committed offset on each partition:

target_offset = target_first + (source_committed - source_first)

Where source_first/target_first come from the offset map sidecar.

Edge cases:

  • source_committed < source_first → ResetToStart: consumer was behind the earliest retained offset. Translated to target_first.
  • source_committed > source_last → Clamped: consumer was ahead of the migration boundary. Translated to target_last.
  • target_first missing → Skip: partition has no data on target. Logged as warning.

6. Offset Commit

Translated offsets are committed to the target cluster's consumer group coordinators via the Kafka OffsetCommit API.

7. Target Offset-Floor Guard

Before the tool logs READY_FOR_CLIENT_SWITCH, it fetches target earliest and latest offsets for every migrated partition. It blocks the client switch if the target log-start has advanced past the first copied offset or if the target end offset is behind the expected copied range. This catches retention or DeleteRecords truncation that a latest-offset-only drain check would miss.
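A sketch of the guard's two checks. Variable names are illustrative; target_log_end stands for the target's latest offset (LEO), so the last copied record lives at target_log_end - 1.

```python
def offset_floor_guard(target_log_start: int, target_log_end: int,
                       tgt_first: int, tgt_last: int) -> list[str]:
    """Return reasons to block the client switch; empty list means safe."""
    problems = []
    if target_log_start > tgt_first:
        # Retention or DeleteRecords truncated records we copied.
        problems.append("target log-start advanced past first copied offset")
    if target_log_end < tgt_last + 1:
        # Target is missing the tail of the copied range.
        problems.append("target end offset behind expected copied range")
    return problems

# Copied range [0, 899]; target retains everything: safe.
assert offset_floor_guard(0, 900, 0, 899) == []
```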

Offset Continuity

This is the core differentiator. After cutover:

Source (ZK cluster):          Target (KRaft cluster):
┌────────────────────────┐    ┌────────────────────────┐
│ Partition 0:           │    │ Partition 0:           │
│ [0] msg-A              │    │ [0] msg-A              │
│ [1] msg-B              │    │ [1] msg-B              │
│ [2] msg-C ◄── committed│    │ [2] msg-C ◄── committed│
│ [3] msg-D ◄── next msg │    │ [3] msg-D ◄── next msg │
│ [4] msg-E              │    │ [4] msg-E              │
│ [5] sentinel           │    │ [5] sentinel           │
└────────────────────────┘    └────────────────────────┘

The consumer group's committed offset points to the same message content on both clusters. When the consumer reconnects to the target, it reads msg-D next — exactly where it left off on the source.

The offset numbers may differ between source and target (due to compaction or replication timing), but the offset map ensures the translation is correct.

ACL Migration

Enumeration

All ACL bindings are fetched from the source via the DescribeAcls API.

Filtering

MSK internal bindings are filtered:

  • User:ANONYMOUS (MSK internal)
  • Bindings on __consumer_offsets, __transaction_state, and other internal topics
  • kafka-cluster:ClusterAction on cluster resource (MSK auto-manages these)
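The filtering rules above can be expressed as a predicate. The binding field names are illustrative, not a real Kafka client's types.

```python
# Internal topics whose bindings MSK manages itself (non-exhaustive).
INTERNAL_TOPICS = {"__consumer_offsets", "__transaction_state"}

def is_msk_internal(binding: dict) -> bool:
    """True if an ACL binding should be filtered out rather than migrated."""
    if binding["principal"] == "User:ANONYMOUS":
        return True                                   # MSK internal principal
    if binding["resource_type"] == "topic" and binding["resource_name"] in INTERNAL_TOPICS:
        return True                                   # internal topic binding
    if binding["resource_type"] == "cluster" and binding["operation"] == "ClusterAction":
        return True                                   # MSK auto-manages these
    return False

app_read = {"principal": "User:app", "resource_type": "topic",
            "resource_name": "orders", "operation": "Read"}
assert not is_msk_internal(app_read)                  # real bindings pass through
```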

Drift Handling

| Policy | Source-only ACLs | Target-only ACLs |
| --- | --- | --- |
| merge | Created on target | Left in place |
| replace | Created on target | Reported (not deleted) |
| refuse | Error — migration stops | Error — migration stops |

IAM Target

When the target uses IAM auth, Kafka ACLs don't apply. Instead, the tool generates access-map.json, a mapping of each principal's permissions to the equivalent IAM actions. The operator applies these via their IAM tooling.

Evidence Bundle

The evidence bundle is a JSON document signed with Ed25519:

{
  "bundle_json": "<canonical JSON payload>",
  "signature_b64": "<Ed25519 signature, base64>",
  "public_key_b64": "<Ed25519 public key, base64>"
}

The bundle_json payload contains:

| Section | Contents |
| --- | --- |
| migration_id | Unique migration identifier |
| tool_version | kafka-backup version |
| signed_at | UTC timestamp |
| config_fingerprint | Hash of the migration config |
| journal | Complete state transition history |
| source / target | Cluster metadata snapshots |
| plan | Full migration plan |
| topology | Topics created/updated, configs applied |
| acls | ACL bindings copied, internals filtered |
| seed | Records, bytes, partitions transferred |
| tail | Records replayed during tail phase |
| drain_final | Per-partition lag at finalize time |
| cutover | Sentinel positions, freeze timing, offset translations |
| validation | All 5 check outcomes with per-partition detail |

Upload

Evidence is uploaded to two keys:

  1. Attempt-scoped immutable key: s3://<evidence_bucket>/<prefix>/<migration_id>/evidence-attempts/<timestamp>-<outcome>-<hash>.json
  2. Latest alias for the migration: s3://<evidence_bucket>/<prefix>/<migration_id>/evidence.json

Each upload first tries S3 PutObject with COMPLIANCE-mode Object Lock retention derived from evidence.retention. If the bucket lacks Object Lock, the tool uploads without retention and logs a warning.

Resume and Rollback

Journal-Based State Recovery

Every state transition is appended to journal.jsonl:

{"migration_id":"...","from":"seed","to":"tail","at":"2026-04-24T10:40:00Z","reason":"seed complete: 900 records, 4 partitions"}
{"migration_id":"...","from":"tail","to":"drain_ready","at":"2026-04-24T10:41:00Z","reason":"all partitions within lag tolerance"}

On resume, the journal is loaded from S3 (or local directory), the tip state is determined, and execution re-enters from the appropriate phase.

Failed State Recovery

If the tip state is failed, the system walks backward through the journal to find the last non-failed state (capped at 4 hops to prevent oscillation). It then re-enters from that state.
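The tip-state dispatch plus bounded walk-back might look like this (a sketch over journal.jsonl lines; the from/to fields match the example entries above):

```python
import json

def resume_state(journal_lines: list[str], max_hops: int = 4) -> str:
    """Determine the state to re-enter from.

    If the tip state is not 'failed', resume there. Otherwise walk backward
    (at most max_hops entries) to the last non-failed state.
    """
    entries = [json.loads(line) for line in journal_lines if line.strip()]
    tip = entries[-1]["to"]
    if tip != "failed":
        return tip
    for entry in reversed(entries[-max_hops:]):
        if entry["from"] != "failed":
            return entry["from"]
    raise RuntimeError("no non-failed state within hop limit")

healthy = ['{"from":"seed","to":"tail"}', '{"from":"tail","to":"drain_ready"}']
assert resume_state(healthy) == "drain_ready"   # clean tip: resume there
crashed = ['{"from":"seed","to":"tail"}', '{"from":"tail","to":"failed"}']
assert resume_state(crashed) == "tail"          # re-enter the phase that failed
```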

Rollback

Available from: planned, precheck, topology_copy, seed, tail, drain_ready

Rollback:

  1. Verifies the resume fingerprint
  2. Best-effort producer unfreeze (if frozen)
  3. Uploads rollback-report.json to evidence bucket
  4. Appends a transition to rolled_back to the journal
  5. Does not delete topics/data on target (manual cleanup)

Next Steps