MSK KRaft Migration Architecture
This page covers the internal architecture of the MSK ZK→KRaft migration pipeline — how data flows, how offsets are translated, and how the evidence bundle is constructed.
Design Philosophy
Three principles guide the architecture:
- **No in-place migration.** S3 is the replication channel between source and target. The source cluster is never modified — it remains a safe rollback target until you decommission it.
- **Deterministic state machine.** Every phase transition is journaled. The migration can be resumed from any point, rolled back before cutover, or audited post-facto from the evidence bundle.
- **Offset continuity over offset identity.** Target offsets will differ from source offsets (compaction, replication timing). What matters is that consumers resume from the same message — the offset map translates between the two.
State Machine
The migration runs through 12 states with deterministic transitions:
```
                            FORWARD PATH

PLANNED ──► PRECHECK ──► TOPOLOGY_COPY ──► SEED ──► TAIL ──► DRAIN_READY
                                                                  │
                                                                  ▼
FINALIZED ◄── VALIDATING ◄── AWAITING_CLIENT_SWITCH ◄──────── CUTOVER

ROLLBACK (pre-cutover only): any state through DRAIN_READY ──► ROLLED_BACK

FAILED ◄── (reachable from any state)
```
| State | Serde name | Description |
|---|---|---|
| Planned | `planned` | Fresh migration, no mutations yet |
| Precheck | `precheck` | Read-only cluster analysis |
| TopologyCopy | `topology_copy` | Topics and ACLs created on target |
| Seed | `seed` | Bulk data transfer via S3 |
| Tail | `tail` | Continuous replication bridging seed to cutover |
| DrainReady | `drain_ready` | Lag converged, awaiting operator |
| Cutover | `cutover` | Producer freeze, sentinel, offset translation |
| AwaitingClientSwitch | `awaiting_client_switch` | Operator switching applications |
| Validating | `validating` | 5-check validation running |
| Finalized | `finalized` | Evidence signed and uploaded (terminal) |
| RolledBack | `rolled_back` | Migration aborted (terminal) |
| Failed | `failed` | Phase failed (resumable) |
Resume logic
On resume, the system reads the journal and dispatches based on the tip state:
- Planned through Tail: Re-enter execute from the last non-failed state
- DrainReady / AwaitingClientSwitch: Return "waiting for operator" (exit code 10)
- Failed: Walk backward (max 4 hops) to find the last non-failed state; re-enter from there
- Finalized / RolledBack: No-op (terminal)
A resume fingerprint (SHA256 of source ARN, target ARN, bucket names) is embedded in the first journal entry. On resume, the fingerprint is verified — this detects silent config changes between crash and resume.
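As a minimal sketch of the fingerprint check (the exact field order and encoding are internal details of the tool; this assumes a hex SHA-256 over the identifying fields in a canonical order):

```python
import hashlib

def resume_fingerprint(source_arn: str, target_arn: str, buckets: list[str]) -> str:
    """Illustrative: hash the identifying config fields in a fixed order.
    Sorting the bucket list makes the fingerprint order-insensitive."""
    material = "\n".join([source_arn, target_arn, *sorted(buckets)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# On resume, recompute from the live config and compare with the value
# embedded in the first journal entry; a mismatch means the config was
# silently changed between crash and resume.
fp = resume_fingerprint(
    "arn:aws:kafka:us-east-1:111122223333:cluster/src/abc",
    "arn:aws:kafka:us-east-1:111122223333:cluster/tgt/def",
    ["my-seed-bucket", "my-evidence-bucket"],
)
assert len(fp) == 64  # hex-encoded SHA-256
```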
Data Flow: Seed Phase
The seed phase performs a bulk copy of all source data through S3:
```
 Source Cluster             S3               Target Cluster
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│              │      │              │      │              │
│  Topic A:0   │─────►│  segments/   │─────►│  Topic A:0   │
│  Topic A:1   │ read │  A/0/...     │ write│  Topic A:1   │
│  Topic B:0   │      │  A/1/...     │      │  Topic B:0   │
│  ...         │      │  B/0/...     │      │  ...         │
│              │      │              │      │              │
└──────────────┘      │  offset-map  │      └──────────────┘
                      │  .json       │
                      └──────────────┘
```
- BackupEngine reads source partitions and writes S3 segments (up to `segment_max_bytes` each)
- RestoreEngine reads segments from S3 and produces to target partitions
- The offset map sidecar records the mapping: for partition P, source offsets `[src_first, src_last]` landed at target offsets `[tgt_first, tgt_last]`

The seed phase runs `max_concurrent_partitions` transfers in parallel (default 4).
The current artifact layout stores the migration offset map as:
`s3://<evidence_bucket>/<evidence_prefix>/<migration_id>/offset-map.json`
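For illustration, a sidecar covering two partitions might look like the following (the field names here are illustrative, not the tool's exact schema):

```json
{
  "migration_id": "mig-2026-04-24-001",
  "partitions": {
    "A/0": { "src_first": 0,  "src_last": 899, "tgt_first": 0, "tgt_last": 899 },
    "A/1": { "src_first": 12, "src_last": 450, "tgt_first": 0, "tgt_last": 438 }
  }
}
```

Note how `A/1` starts at source offset 12 (earlier offsets were already retention-expired) but lands at target offset 0 — this is exactly the gap the offset map exists to translate.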
Data Flow: Tail Phase
After seed, the tail phase bridges the gap between the seed snapshot and real-time:
```
 Source Cluster         Tail Bridge         Target Cluster
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Consumer    │─────►│  Producer    │─────►│              │
│  (from seed  │      │  (to target) │      │  New records │
│  end offset) │      │              │      │  appended    │
└──────────────┘      │  Updates     │      └──────────────┘
                      │  offset-map  │
                      └──────────────┘
```
Tail is a direct consumer→producer bridge:
- Consumes from source starting at the seed end offsets
- Produces to target, updating the offset map continuously
- Tracks per-partition lag (source LEO - consumer position)
- Declares `drain_ready` when all partitions are within `drain_max_partition_lag` for `drain_stable_window`
Tail progress is checkpointed to S3 — on resume, it continues from the checkpoint without re-streaming from seed offsets.
Cutover Protocol
Cutover is the critical phase that ensures offset continuity:
1. Producer Freeze
Producers on the source cluster are frozen (via webhook or manual confirmation). This creates a stable "end of source" boundary.
2. Sentinel Records
A sentinel record is published to every partition on the source:
```
Header: x-kbe-cutover-sentinel = <migration_id>
Key:    __kbe_sentinel__
Value:  {"migration_id": "<id>", "partition": <p>, "timestamp": "<iso8601>"}
```
The sentinel marks the exact boundary between "migrated data" and "nothing else."
3. Final Drain
Tail continues until every sentinel is replicated to the target. At this point, the target has all source data up to and including the sentinel.
4. Consumer Group Snapshot
All consumer group committed offsets are fetched from the source cluster.
5. Offset Translation
For each consumer group's committed offset on each partition:
```
target_offset = target_first + (source_committed - source_first)
```

where `source_first` / `target_first` come from the offset map sidecar.
Edge cases:
- `source_committed < source_first` → ResetToStart: the consumer was behind the earliest retained offset. Translated to `target_first`.
- `source_committed > source_last` → Clamped: the consumer was ahead of the migration boundary. Translated to `target_last`.
- `target_first` missing → Skip: the partition has no data on target. Logged as a warning.
6. Offset Commit
Translated offsets are committed to the target cluster's consumer group coordinators via the Kafka OffsetCommit API.
7. Target Offset-Floor Guard
Before the tool logs READY_FOR_CLIENT_SWITCH, it fetches target earliest and latest offsets for every migrated partition. It blocks the client switch if the target log-start has advanced past the first copied offset or if the target end offset is behind the expected copied range. This catches retention or DeleteRecords truncation that a latest-offset-only drain check would miss.
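A simplified version of the guard's decision (hypothetical names; the real tool fetches earliest/latest offsets per partition via the admin client):

```python
def floor_guard_ok(tgt_log_start: int, tgt_end: int,
                   tgt_first_copied: int, tgt_last_copied: int) -> tuple[bool, str]:
    """Block the client switch if the target already truncated copied data,
    or if its end offset is behind the expected copied range.
    tgt_end is the target's latest (end) offset; with nothing else written
    after the copy, it should be at least tgt_last_copied + 1."""
    if tgt_log_start > tgt_first_copied:
        return False, "target log-start advanced past first copied offset"
    if tgt_end <= tgt_last_copied:
        return False, "target end offset behind expected copied range"
    return True, "ok"

assert floor_guard_ok(0, 900, 0, 899) == (True, "ok")
assert floor_guard_ok(100, 900, 0, 899)[0] is False  # retention ate the floor
assert floor_guard_ok(0, 500, 0, 899)[0] is False    # end behind copied range
```

This is why a latest-offset-only drain check is insufficient: retention or a DeleteRecords call can advance the log-start without touching the end offset at all.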
Offset Continuity
This is the core differentiator. After cutover:
```
Source (ZK cluster):               Target (KRaft cluster):
┌────────────────────────┐         ┌────────────────────────┐
│ Partition 0:           │         │ Partition 0:           │
│  [0] msg-A             │         │  [0] msg-A             │
│  [1] msg-B             │         │  [1] msg-B             │
│  [2] msg-C ◄─ committed│         │  [2] msg-C ◄─ committed│
│  [3] msg-D ◄─ next msg │         │  [3] msg-D ◄─ next msg │
│  [4] msg-E             │         │  [4] msg-E             │
│  [5] sentinel          │         │  [5] sentinel          │
└────────────────────────┘         └────────────────────────┘
```
The consumer group's committed offset points to the same message content on both clusters. When the consumer reconnects to the target, it reads msg-D next — exactly where it left off on the source.
The offset numbers may differ between source and target (due to compaction or replication timing), but the offset map ensures the translation is correct.
ACL Migration
Enumeration
All ACL bindings are fetched from the source via the DescribeAcls API.
Filtering
MSK internal bindings are filtered out:
- `User:ANONYMOUS` (MSK internal)
- Bindings on `__consumer_offsets`, `__transaction_state`, and other internal topics
- `ClusterAction` on the `kafka-cluster` cluster resource (MSK auto-manages these)
Drift Handling
| Policy | Source-only ACLs | Target-only ACLs |
|---|---|---|
| `merge` | Created on target | Left in place |
| `replace` | Created on target | Reported (not deleted) |
| `refuse` | Error — migration stops | Error — migration stops |
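The three policies reduce to a pure function over the two ACL sets. A sketch (illustrative; real ACL bindings are structured objects, not strings):

```python
def plan_acl_drift(policy: str, source: set[str], target: set[str]) -> dict:
    """Decide what to do about ACL drift between source and target."""
    missing_on_target = source - target   # source-only bindings
    extra_on_target = target - source     # target-only bindings
    if policy == "refuse" and (missing_on_target or extra_on_target):
        raise RuntimeError("ACL drift detected; refusing to continue")
    plan = {"create": sorted(missing_on_target), "report": []}
    if policy == "replace":
        plan["report"] = sorted(extra_on_target)  # reported, never deleted
    return plan

src = {"User:app1 READ topic-a", "User:app2 WRITE topic-b"}
tgt = {"User:app2 WRITE topic-b", "User:old READ topic-z"}
assert plan_acl_drift("merge", src, tgt) == {"create": ["User:app1 READ topic-a"],
                                             "report": []}
assert plan_acl_drift("replace", src, tgt)["report"] == ["User:old READ topic-z"]
```

Note that no policy ever deletes a target binding — even `replace` only reports target-only ACLs, leaving cleanup to the operator.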
IAM Target
When the target uses IAM auth, Kafka ACLs don't apply. Instead, the tool generates `access-map.json`, a mapping of each principal's permissions to the equivalent IAM actions. The operator applies these via their IAM tooling.
Evidence Bundle
The evidence bundle is a JSON document signed with Ed25519:
```json
{
  "bundle_json": "<canonical JSON payload>",
  "signature_b64": "<Ed25519 signature, base64>",
  "public_key_b64": "<Ed25519 public key, base64>"
}
```
The `bundle_json` payload contains:

| Section | Contents |
|---|---|
| `migration_id` | Unique migration identifier |
| `tool_version` | kafka-backup version |
| `signed_at` | UTC timestamp |
| `config_fingerprint` | Hash of the migration config |
| `journal` | Complete state transition history |
| `source` / `target` | Cluster metadata snapshots |
| `plan` | Full migration plan |
| `topology` | Topics created/updated, configs applied |
| `acls` | ACL bindings copied, internals filtered |
| `seed` | Records, bytes, partitions transferred |
| `tail` | Records replayed during tail phase |
| `drain_final` | Per-partition lag at finalize time |
| `cutover` | Sentinel positions, freeze timing, offset translations |
| `validation` | All 5 check outcomes with per-partition detail |
Upload
Evidence is uploaded to two keys:
- Attempt-scoped immutable key: `s3://<evidence_bucket>/<prefix>/<migration_id>/evidence-attempts/<timestamp>-<outcome>-<hash>.json`
- Latest alias for the migration: `s3://<evidence_bucket>/<prefix>/<migration_id>/evidence.json`
Each upload first tries S3 PutObject with COMPLIANCE-mode Object Lock retention derived from `evidence.retention`. If the bucket lacks Object Lock, the tool uploads without retention and logs a warning.
Resume and Rollback
Journal-Based State Recovery
Every state transition is appended to `journal.jsonl`:

```
{"migration_id":"...","from":"seed","to":"tail","at":"2026-04-24T10:40:00Z","reason":"seed complete: 900 records, 4 partitions"}
{"migration_id":"...","from":"tail","to":"drain_ready","at":"2026-04-24T10:41:00Z","reason":"all partitions within lag tolerance"}
```
On resume, the journal is loaded from S3 (or local directory), the tip state is determined, and execution re-enters from the appropriate phase.
Failed State Recovery
If the tip state is failed, the system walks backward through the journal to find the last non-failed state (capped at 4 hops to prevent oscillation). It then re-enters from that state.
Rollback
Available from: `planned`, `precheck`, `topology_copy`, `seed`, `tail`, `drain_ready`
Rollback:
- Verifies the resume fingerprint
- Best-effort producer unfreeze (if frozen)
- Uploads `rollback-report.json` to the evidence bucket
- Appends `→ rolled_back` to the journal
- Does not delete topics/data on the target (manual cleanup)
Next Steps
- MSK KRaft Migration Overview — feature summary and quick start
- Configuration Reference — every config field
- Production Runbook — step-by-step guide