MSK KRaft Migration Architecture

This page covers the internal architecture of the MSK ZK→KRaft migration pipeline — how data flows, how offsets are translated, and how the evidence bundle is constructed.

Design Philosophy

Three principles guide the architecture:

  1. No in-place migration. S3 is the replication channel between source and target. The source cluster is never modified — it remains a safe rollback target until you decommission it.

  2. Deterministic state machine. Every phase transition is journaled. The migration can be resumed from any point, rolled back before cutover, or audited post-facto from the evidence bundle.

  3. Offset continuity over offset identity. Target offsets will differ from source offsets (compaction, replication timing). What matters is that consumers resume from the same message — the offset map translates between the two.

State Machine

The migration runs through 11 primary states, plus failed (reachable from any state), with deterministic transitions:

FORWARD PATH

  PLANNED ──► PRECHECK ──► TOPOLOGY_COPY ──► SEED ──► TAIL ──► DRAIN_READY
                                                                    │
  FINALIZED ◄── VALIDATING ◄── AWAITING_CLIENT_SWITCH ◄── CUTOVER ◄─┘

ROLLBACK (pre-cutover only)

  any state from PLANNED through DRAIN_READY ──► ROLLED_BACK

FAILED ◄── (reachable from any state)
| State | Serde name | Description |
| --- | --- | --- |
| Planned | planned | Fresh migration, no mutations yet |
| Precheck | precheck | Read-only cluster analysis |
| TopologyCopy | topology_copy | Topics and ACLs created on target |
| Seed | seed | Bulk data transfer via S3 |
| Tail | tail | Continuous replication bridging seed to cutover |
| DrainReady | drain_ready | Lag converged, awaiting operator |
| Cutover | cutover | Producer freeze, sentinel, offset translation |
| AwaitingClientSwitch | awaiting_client_switch | Operator switching applications |
| Validating | validating | 5-check validation running |
| Finalized | finalized | Evidence signed and uploaded (terminal) |
| RolledBack | rolled_back | Migration aborted (terminal) |
| Failed | failed | Phase failed (resumable) |

Resume logic

On resume, the system reads the journal and dispatches based on the tip state:

  • Planned through Tail: Re-enter execute from the last non-failed state
  • DrainReady / AwaitingClientSwitch: Return "waiting for operator" (exit code 10)
  • Failed: Walk backward (max 4 hops) to find the last non-failed state; re-enter from there
  • Finalized / RolledBack: No-op (terminal)

A resume fingerprint (SHA256 of source ARN, target ARN, bucket names) is embedded in the first journal entry. On resume, the fingerprint is verified — this detects silent config changes between crash and resume.
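The fingerprint check can be sketched as follows. The exact canonicalization (field ordering, separators) is an assumption here; the real tool defines its own.

```python
import hashlib

def resume_fingerprint(source_arn: str, target_arn: str, buckets: list[str]) -> str:
    """Illustrative resume fingerprint: SHA-256 over the migration's identity fields.

    Bucket names are sorted so the fingerprint is order-insensitive; a NUL
    separator prevents adjacent fields from colliding when concatenated.
    """
    h = hashlib.sha256()
    for part in [source_arn, target_arn, *sorted(buckets)]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()

# On resume: recompute and compare against the fingerprint stored in the
# first journal entry. Any drift in source, target, or buckets is detected.
fp1 = resume_fingerprint("arn:aws:kafka:...:cluster/src",
                         "arn:aws:kafka:...:cluster/dst",
                         ["evidence", "segments"])
fp2 = resume_fingerprint("arn:aws:kafka:...:cluster/src",
                         "arn:aws:kafka:...:cluster/other",
                         ["evidence", "segments"])
assert fp1 != fp2  # changed target ARN -> different fingerprint
```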

Data Flow: Seed Phase

The seed phase performs a bulk copy of all source data through S3:

 Source Cluster              S3                 Target Cluster
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│              │      │              │      │              │
│  Topic A:0   │─────►│  segments/   │─────►│  Topic A:0   │
│  Topic A:1   │ read │   A/0/...    │ write│  Topic A:1   │
│  Topic B:0   │      │   A/1/...    │      │  Topic B:0   │
│  ...         │      │   B/0/...    │      │  ...         │
│              │      │              │      │              │
└──────────────┘      │  offset-map  │      └──────────────┘
                      │  .json       │
                      └──────────────┘
  1. BackupEngine reads source partitions and writes S3 segments (up to segment_max_bytes each)
  2. RestoreEngine reads segments from S3 and produces to target partitions
  3. Offset map sidecar records the mapping: for partition P, source offsets [src_first, src_last] landed at target offsets [tgt_first, tgt_last]

The seed phase runs max_concurrent_partitions transfers in parallel (default 4).

The current artifact layout stores the migration offset map as:

s3://<evidence_bucket>/<evidence_prefix>/<migration_id>/offset-map.json
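A minimal sketch of what the sidecar could contain. The field names here are illustrative, not the tool's actual schema; only the per-partition `[src_first, src_last] → [tgt_first, tgt_last]` mapping is described by this page.

```python
import json

# Hypothetical offset-map.json contents (field names are illustrative).
# One entry per partition records where the copied source range landed
# on the target cluster.
offset_map = {
    "migration_id": "mig-example",
    "partitions": {
        "orders/0": {"src_first": 100, "src_last": 999, "tgt_first": 0, "tgt_last": 899},
        "orders/1": {"src_first": 0, "src_last": 449, "tgt_first": 0, "tgt_last": 449},
    },
}

# Round-trip through JSON as the sidecar would be stored in S3.
restored = json.loads(json.dumps(offset_map))
p0 = restored["partitions"]["orders/0"]
# Invariant: the copied span has the same length on both sides.
assert p0["src_last"] - p0["src_first"] == p0["tgt_last"] - p0["tgt_first"]
```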

Data Flow: Tail Phase

After seed, the tail phase bridges the gap between the seed snapshot and real-time:

 Source Cluster         Tail Bridge          Target Cluster
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Consumer    │─────►│  Producer    │─────►│              │
│  (from seed  │      │  (to target) │      │ New records  │
│  end offset) │      │              │      │  appended    │
└──────────────┘      │  Updates     │      └──────────────┘
                      │  offset-map  │
                      └──────────────┘

Tail is a direct consumer→producer bridge:

  • Consumes from source starting at the seed end offsets
  • Produces to target, updating the offset map continuously
  • Tracks per-partition lag (source LEO - consumer position)
  • Declares drain_ready when all partitions are within drain_max_partition_lag for drain_stable_window

Tail progress is checkpointed to S3 — on resume, it continues from the checkpoint without re-streaming from seed offsets.

Cutover Protocol

Cutover is the critical phase that ensures offset continuity:

1. Producer Freeze

Producers on the source cluster are frozen (via webhook or manual confirmation). This creates a stable "end of source" boundary.

2. Sentinel Records

A sentinel record is published to every partition on the source:

Header: x-kbe-cutover-sentinel = <migration_id>
Key: __kbe_sentinel__
Value: {"migration_id": "<id>", "partition": <p>, "timestamp": "<iso8601>"}

The sentinel marks the exact boundary between "migrated data" and "nothing else."
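Assembling a sentinel from the documented shape might look like this. Only the header, key, and value payload come from the docs above; the surrounding record structure is an assumption.

```python
import json
from datetime import datetime, timezone

def sentinel_record(migration_id: str, partition: int) -> dict:
    """Build one cutover sentinel for a source partition (illustrative wrapper)."""
    return {
        "headers": {"x-kbe-cutover-sentinel": migration_id},
        "key": "__kbe_sentinel__",
        "value": json.dumps({
            "migration_id": migration_id,
            "partition": partition,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }),
    }

rec = sentinel_record("mig-example", 3)
payload = json.loads(rec["value"])
assert rec["key"] == "__kbe_sentinel__"
assert payload["partition"] == 3
```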

3. Final Drain

Tail continues until every sentinel is replicated to the target. At this point, the target has all source data up to and including the sentinel.

4. Consumer Group Snapshot

All consumer group committed offsets are fetched from the source cluster.

5. Offset Translation

For each consumer group's committed offset on each partition:

target_offset = target_first + (source_committed - source_first)

Where source_first/target_first come from the offset map sidecar.

Edge cases:

  • source_committed < source_first → ResetToStart: consumer was behind the earliest retained offset. Translated to target_first.
  • source_committed > source_last → Clamped: consumer was ahead of the migration boundary. Translated to target_last.
  • target_first missing → Skip: partition has no data on target. Logged as warning.

6. Offset Commit

Translated offsets are committed to the target cluster's consumer group coordinators via the Kafka OffsetCommit API.

7. Target Offset-Floor Guard

Before the tool logs READY_FOR_CLIENT_SWITCH, it fetches target earliest and latest offsets for every migrated partition. It blocks the client switch if the target log-start has advanced past the first copied offset or if the target end offset is behind the expected copied range. This catches retention or DeleteRecords truncation that a latest-offset-only drain check would miss.
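A sketch of the guard's two checks. Variable names are illustrative; target_log_end stands for the target's latest offset (LEO), so the last copied record lives at target_log_end - 1.

```python
def offset_floor_guard(target_log_start: int, target_log_end: int,
                       tgt_first: int, tgt_last: int) -> list[str]:
    """Return reasons to block the client switch; empty list means safe."""
    problems = []
    if target_log_start > tgt_first:
        # Retention or DeleteRecords truncated records we copied.
        problems.append("target log-start advanced past first copied offset")
    if target_log_end < tgt_last + 1:
        # Target is missing the tail of the copied range.
        problems.append("target end offset behind expected copied range")
    return problems

# Copied range [0, 899]; target retains everything: safe.
assert offset_floor_guard(0, 900, 0, 899) == []
```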

Offset Continuity

This is the core differentiator. After cutover:

Source (ZK cluster):          Target (KRaft cluster):
┌────────────────────────┐    ┌────────────────────────┐
│ Partition 0:           │    │ Partition 0:           │
│ [0] msg-A              │    │ [0] msg-A              │
│ [1] msg-B              │    │ [1] msg-B              │
│ [2] msg-C ◄── committed│    │ [2] msg-C ◄── committed│
│ [3] msg-D ◄── next msg │    │ [3] msg-D ◄── next msg │
│ [4] msg-E              │    │ [4] msg-E              │
│ [5] sentinel           │    │ [5] sentinel           │
└────────────────────────┘    └────────────────────────┘

The consumer group's committed offset points to the same message content on both clusters. When the consumer reconnects to the target, it reads msg-D next — exactly where it left off on the source.

The offset numbers may differ between source and target (due to compaction or replication timing), but the offset map ensures the translation is correct.

ACL Migration

Enumeration

All ACL bindings are fetched from the source via the DescribeAcls API.

Filtering

MSK internal bindings are filtered:

  • User:ANONYMOUS (MSK internal)
  • Bindings on __consumer_offsets, __transaction_state, and other internal topics
  • kafka-cluster:ClusterAction on cluster resource (MSK auto-manages these)
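The filtering rules above can be expressed as a predicate. The binding field names are illustrative, not a real Kafka client's types.

```python
# Internal topics whose bindings MSK manages itself (non-exhaustive).
INTERNAL_TOPICS = {"__consumer_offsets", "__transaction_state"}

def is_msk_internal(binding: dict) -> bool:
    """True if an ACL binding should be filtered out rather than migrated."""
    if binding["principal"] == "User:ANONYMOUS":
        return True                                   # MSK internal principal
    if binding["resource_type"] == "topic" and binding["resource_name"] in INTERNAL_TOPICS:
        return True                                   # internal topic binding
    if binding["resource_type"] == "cluster" and binding["operation"] == "ClusterAction":
        return True                                   # MSK auto-manages these
    return False

app_read = {"principal": "User:app", "resource_type": "topic",
            "resource_name": "orders", "operation": "Read"}
assert not is_msk_internal(app_read)                  # real bindings pass through
```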

Drift Handling

| Policy | Source-only ACLs | Target-only ACLs |
| --- | --- | --- |
| merge | Created on target | Left in place |
| replace | Created on target | Reported (not deleted) |
| refuse | Error — migration stops | Error — migration stops |

IAM Target

When the target uses IAM auth, Kafka ACLs don't apply. Instead, the tool generates access-map.json, a mapping of each principal's permissions to the equivalent IAM actions. The operator applies these via their IAM tooling.

Evidence Bundle

The evidence bundle is a JSON document signed with Ed25519:

{
  "bundle_json": "<canonical JSON payload>",
  "signature_b64": "<Ed25519 signature, base64>",
  "public_key_b64": "<Ed25519 public key, base64>"
}

The bundle_json payload contains:

| Section | Contents |
| --- | --- |
| migration_id | Unique migration identifier |
| tool_version | kafka-backup version |
| signed_at | UTC timestamp |
| config_fingerprint | Hash of the migration config |
| journal | Complete state transition history |
| source / target | Cluster metadata snapshots |
| plan | Full migration plan |
| topology | Topics created/updated, configs applied |
| acls | ACL bindings copied, internals filtered |
| seed | Records, bytes, partitions transferred |
| tail | Records replayed during tail phase |
| drain_final | Per-partition lag at finalize time |
| cutover | Sentinel positions, freeze timing, offset translations |
| validation | All 5 check outcomes with per-partition detail |

Upload

Evidence is uploaded to two keys:

  1. Attempt-scoped immutable key: s3://<evidence_bucket>/<prefix>/<migration_id>/evidence-attempts/<timestamp>-<outcome>-<hash>.json
  2. Latest alias for the migration: s3://<evidence_bucket>/<prefix>/<migration_id>/evidence.json

Each upload first tries S3 PutObject with COMPLIANCE-mode Object Lock retention derived from evidence.retention. If the bucket lacks Object Lock, the tool uploads without retention and logs a warning.

Resume and Rollback

Journal-Based State Recovery

Every state transition is appended to journal.jsonl:

{"migration_id":"...","from":"seed","to":"tail","at":"2026-04-24T10:40:00Z","reason":"seed complete: 900 records, 4 partitions"}
{"migration_id":"...","from":"tail","to":"drain_ready","at":"2026-04-24T10:41:00Z","reason":"all partitions within lag tolerance"}

On resume, the journal is loaded from S3 (or local directory), the tip state is determined, and execution re-enters from the appropriate phase.

Failed State Recovery

If the tip state is failed, the system walks backward through the journal to find the last non-failed state (capped at 4 hops to prevent oscillation). It then re-enters from that state.
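The tip-state dispatch plus bounded walk-back might look like this (a sketch over journal.jsonl lines; the from/to fields match the example entries above):

```python
import json

def resume_state(journal_lines: list[str], max_hops: int = 4) -> str:
    """Determine the state to re-enter from.

    If the tip state is not 'failed', resume there. Otherwise walk backward
    (at most max_hops entries) to the last non-failed state.
    """
    entries = [json.loads(line) for line in journal_lines if line.strip()]
    tip = entries[-1]["to"]
    if tip != "failed":
        return tip
    for entry in reversed(entries[-max_hops:]):
        if entry["from"] != "failed":
            return entry["from"]
    raise RuntimeError("no non-failed state within hop limit")

healthy = ['{"from":"seed","to":"tail"}', '{"from":"tail","to":"drain_ready"}']
assert resume_state(healthy) == "drain_ready"   # clean tip: resume there
crashed = ['{"from":"seed","to":"tail"}', '{"from":"tail","to":"failed"}']
assert resume_state(crashed) == "tail"          # re-enter the phase that failed
```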

Rollback

Available from: planned, precheck, topology_copy, seed, tail, drain_ready

Rollback:

  1. Verifies the resume fingerprint
  2. Best-effort producer unfreeze (if frozen)
  3. Uploads rollback-report.json to evidence bucket
  4. Appends a transition to rolled_back to the journal
  5. Does not delete topics/data on target (manual cleanup)

Next Steps