MSK KRaft Migration Configuration Reference
All migration settings live under enterprise.msk_kraft_migration in your config YAML.
Minimal Configuration
The smallest valid config requires source, target, backup, and evidence:
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/source-zk/abc-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/target-kraft/def-456
      auth:
        mode: iam
    backup:
      s3_bucket: my-migration-segments
      s3_prefix: replay/
    evidence:
      s3_bucket: my-migration-evidence
      s3_prefix: migrations/
All other sections (cutover, validation, seed, topology, acl) are optional and fall back to the defaults listed in their tables below.
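As a rough illustration of the "required sections" rule above, the following Python sketch checks a config that has already been parsed into a dictionary. The function name `missing_sections` is hypothetical and is not part of the tool; the tool's own validation may differ.

```python
# Hypothetical helper: reports which required sections are absent.
REQUIRED_SECTIONS = ("source", "target", "backup", "evidence")


def missing_sections(config):
    """Return the required section names not found under enterprise.msk_kraft_migration."""
    migration = config.get("enterprise", {}).get("msk_kraft_migration", {})
    return [name for name in REQUIRED_SECTIONS if name not in migration]


complete = {
    "enterprise": {
        "msk_kraft_migration": {
            "source": {"cluster_arn": "arn:aws:kafka:us-east-1:123456789012:cluster/source-zk/abc-123",
                       "auth": {"mode": "iam"}},
            "target": {"cluster_arn": "arn:aws:kafka:us-east-1:123456789012:cluster/target-kraft/def-456",
                       "auth": {"mode": "iam"}},
            "backup": {"s3_bucket": "my-migration-segments", "s3_prefix": "replay/"},
            "evidence": {"s3_bucket": "my-migration-evidence", "s3_prefix": "migrations/"},
        }
    }
}

# Same config with the evidence section removed.
incomplete = {"enterprise": {"msk_kraft_migration": {
    k: v for k, v in complete["enterprise"]["msk_kraft_migration"].items() if k != "evidence"
}}}

print(missing_sections(complete))
print(missing_sections(incomplete))
```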
Full Annotated Example
enterprise:
  msk_kraft_migration:
    enabled: true # default: true
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-zk/abc-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-kraft/def-456
      auth:
        mode: iam
    backup:
      s3_bucket: prod-migration-segments
      s3_prefix: zk-to-kraft/
      kms_key_arn: arn:aws:kms:us-east-1:123456789012:key/mrk-abc123 # optional
    evidence:
      s3_bucket: prod-migration-evidence
      s3_prefix: migrations/
      retention: 7y # S3 Object Lock retention
      signing_key_path: /etc/kafka-backup/evidence-signing.key # optional
    cutover:
      drain_timeout: 30m # max time waiting for lag to converge
      drain_max_partition_lag: 100 # records; all partitions must be below this
      drain_stable_window: 30s # lag must stay below threshold for this long
      max_producer_freeze: 60s # max time producers are frozen during cutover
      producer_freeze_webhook: https://api.example.com/kafka/freeze
      reverse_replication_enabled: false # not yet implemented
    validation:
      count_tolerance: 1 # ±1 record allowed (sentinel offset shift)
      spot_check_records_per_partition: 3 # records sampled per partition
    seed:
      max_concurrent_partitions: 4 # parallel partition transfers
      segment_max_bytes: 33554432 # 32 MB per S3 segment
    topology:
      on_config_drift: overwrite_with_source # overwrite_with_source | keep_target | refuse
    acl:
      on_drift: merge # merge | replace | refuse
Source / Target (source, target)
Each cluster reference describes how to connect and authenticate.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| cluster_arn | string | Yes | - | AWS MSK cluster ARN. Region and account are extracted from this. |
| auth | object | Yes | - | Authentication configuration (see below). |
| local_dev | object | No | - | Dev-only override for Docker/local testing. Bypasses MSK API calls. |
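Since region and account are derived from cluster_arn, a small Python sketch can show what that extraction might look like. It relies only on the standard ARN layout (arn:partition:service:region:account:resource); the names `MskArn` and `parse_msk_arn` are hypothetical and do not come from the tool.

```python
from typing import NamedTuple


class MskArn(NamedTuple):
    region: str
    account_id: str
    cluster_name: str


def parse_msk_arn(arn):
    """Split an MSK cluster ARN into region, account ID, and cluster name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kafka":
        raise ValueError(f"not an MSK ARN: {arn!r}")
    resource = parts[5].split("/")
    # Expected resource shape: cluster/<name>/<uuid>
    if len(resource) < 2 or resource[0] != "cluster":
        raise ValueError(f"not a cluster ARN: {arn!r}")
    return MskArn(region=parts[3], account_id=parts[4], cluster_name=resource[1])


parsed = parse_msk_arn("arn:aws:kafka:us-east-1:123456789012:cluster/source-zk/abc-123")
print(parsed.region, parsed.account_id, parsed.cluster_name)
```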
Authentication modes (auth)
Authentication is configured with mode as a discriminator:
IAM
auth:
  mode: iam
Uses AWS IAM SASL/OAUTHBEARER. Credentials come from the standard AWS SDK chain (environment, instance profile, SSO, etc.).
SCRAM-SHA-512
auth:
  mode: scram-sha-512
  username: admin
  password: ${KAFKA_PASSWORD}
| Field | Type | Required | Description |
|---|---|---|---|
| username | string | Yes | SASL username |
| password | string | Yes | SASL password (supports ${ENV_VAR} interpolation) |
mTLS
auth:
  mode: mtls
  keystore: /path/to/client-keystore.jks
  keystore_password: ${KEYSTORE_PASS}
  truststore: /path/to/truststore.jks
  truststore_password: ${TRUSTSTORE_PASS}
| Field | Type | Required | Description |
|---|---|---|---|
| keystore | string | Yes | Path to JKS keystore |
| keystore_password | string | Yes | Keystore password |
| truststore | string | Yes | Path to JKS truststore |
| truststore_password | string | Yes | Truststore password |
Plain
auth:
  mode: plain
PLAINTEXT connection with no authentication. For dev/test only.
Local dev override (local_dev)
Bypasses MSK DescribeCluster and GetBootstrapBrokers API calls. Used for Docker-based testing.
local_dev:
  bootstrap: "localhost:9092,localhost:9093,localhost:9094"
  metadata_mode: "ZOOKEEPER" # or "KRAFT"
  kafka_version: "3.7.1" # default: 3.7.1
  broker_count: 3 # default: 3
  region: "local" # default: local
  disable_tls: false # default: false; set true for SASL_PLAINTEXT in Docker
| Field | Type | Default | Description |
|---|---|---|---|
| bootstrap | string | - | Comma-separated host:port list (required) |
| metadata_mode | string | - | "ZOOKEEPER" or "KRAFT" (required) |
| kafka_version | string | "3.7.1" | Kafka version string |
| broker_count | integer | 3 | Number of brokers |
| region | string | "local" | AWS region (used for S3 client) |
| disable_tls | boolean | false | Downgrade SASL_SSL to SASL_PLAINTEXT |
Backup Channel (backup)
S3 bucket used as the intermediate replication channel during seed and tail phases.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| s3_bucket | string | Yes | - | S3 bucket name |
| s3_prefix | string | Yes | - | S3 key prefix for migration data |
| kms_key_arn | string | No | - | AWS KMS key ARN for server-side encryption (SSE-KMS) |
Evidence Store (evidence)
S3 bucket for the signed migration evidence bundle.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| s3_bucket | string | Yes | - | S3 bucket name |
| s3_prefix | string | No | "migrations/" | S3 key prefix |
| retention | string | No | "7y" | S3 Object Lock retention period (e.g., "7y", "90d") |
| signing_key_path | string | No | - | Path to Ed25519 private key for evidence signing |
If the evidence bucket has S3 Object Lock enabled, the evidence bundle is uploaded in COMPLIANCE mode with the configured retention. If Object Lock is not configured, the bundle is uploaded without retention and a warning is logged.
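To make the retention setting concrete, here is a hedged Python sketch that turns a retention string into the Object Lock parameters an S3 upload would carry. `ObjectLockMode` and `ObjectLockRetainUntilDate` are the parameter names used by the S3 PutObject API; everything else (the function names, supporting only the "y" and "d" suffixes shown above, and the leap-day handling) is an assumption for illustration and not the tool's documented behavior.

```python
import re
from datetime import datetime, timedelta, timezone


def retain_until(retention, now):
    """Compute the retain-until timestamp for a value such as "7y" or "90d"."""
    match = re.fullmatch(r"(\d+)([yd])", retention)
    if match is None:
        raise ValueError(f"unsupported retention value: {retention!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "d":
        return now + timedelta(days=amount)
    try:
        return now.replace(year=now.year + amount)
    except ValueError:
        # Assumption: Feb 29 maps to Feb 28 when the target year is not a leap year.
        return now.replace(year=now.year + amount, day=28)


def object_lock_params(retention, now):
    """Build the Object Lock fields for an S3 PutObject call (COMPLIANCE mode)."""
    return {
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until(retention, now),
    }


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(object_lock_params("90d", start))
```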
Cutover Options (cutover)
Controls cutover-phase behavior: drain convergence, producer freeze, and sentinel timing.
| Field | Type | Default | Description |
|---|---|---|---|
| drain_timeout | duration | 10m | Maximum time to wait for tail lag to converge before giving up. |
| drain_max_partition_lag | integer | 1000 | Maximum per-partition lag (records) to consider "caught up". |
| drain_stable_window | duration | 60s | Lag must stay below threshold for this duration before drain is declared ready. |
| max_producer_freeze | duration | 60s | Maximum time producers remain frozen during cutover. Cutover aborts if exceeded. |
| producer_freeze_webhook | string | - | URL to POST for producer freeze/unfreeze. If omitted, the tool prompts on TTY. |
| reverse_replication_grace | duration | 1h | Grace period for reverse replication (reserved for future use). |
| reverse_replication_enabled | boolean | false | Enable reverse replication after cutover. Not yet implemented; setting it to true is a blocker (B13). |
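The interaction between drain_max_partition_lag and drain_stable_window can be sketched in Python as follows. This is an illustrative model only: the function name `drain_ready`, the sample format, and the choice to treat a lag equal to the threshold as "caught up" are assumptions, not the tool's confirmed logic.

```python
def drain_ready(samples, max_partition_lag, stable_window_seconds):
    """Return True once every partition has stayed within the lag threshold
    for at least stable_window_seconds.

    samples: time-ordered list of (timestamp_seconds, {partition: lag}).
    """
    stable_since = None
    for timestamp, lags in samples:
        if all(lag <= max_partition_lag for lag in lags.values()):
            if stable_since is None:
                stable_since = timestamp  # start of the current stable stretch
            if timestamp - stable_since >= stable_window_seconds:
                return True
        else:
            stable_since = None  # any partition over the threshold resets the window
    return False


converging = [
    (0, {0: 5000, 1: 10}),
    (10, {0: 900, 1: 10}),
    (40, {0: 800, 1: 5}),
    (70, {0: 100, 1: 0}),
]
spiking = [
    (0, {0: 900, 1: 10}),
    (30, {0: 2500, 1: 10}),  # spike resets the stable window
    (70, {0: 100, 1: 0}),
]

print(drain_ready(converging, max_partition_lag=1000, stable_window_seconds=60))
print(drain_ready(spiking, max_partition_lag=1000, stable_window_seconds=60))
```

In this model drain_timeout would simply bound how long samples are collected before the drain gives up.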
Duration values use a human-readable format, for example 30s, 5m, 1h, or 2h30m.
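For readers who need to produce or check duration strings in their own tooling, a minimal Python parser for the forms shown above might look like this. It assumes only the h, m, and s units in that order, which is all the examples demonstrate; the tool itself may accept more. The name `parse_duration` is hypothetical.

```python
import re
from datetime import timedelta

# Optional hours, minutes, and seconds components, in that order.
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_duration(text):
    """Parse strings such as "30s", "5m", "1h", or "2h30m" into a timedelta."""
    match = _DURATION.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"invalid duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


for value in ("30s", "5m", "1h", "2h30m"):
    print(value, parse_duration(value).total_seconds())
```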
Producer freeze webhook
When configured, the tool sends HTTP POST requests:
POST <webhook_url>
Content-Type: application/json

{"action": "freeze", "migration_id": "<id>", "reason": "cutover"}
The webhook must return HTTP 2xx within max_producer_freeze. On failure or timeout, cutover aborts and the migration enters failed state.
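The request described above can be reproduced with the Python standard library, which may help when testing a webhook receiver. This is a sketch, not the tool's client: the function names are hypothetical, only the "freeze" action is documented here (any other action value would be an assumption), and `urlopen`'s timeout is a socket-level timeout rather than a strict overall deadline.

```python
import json
import urllib.request


def build_freeze_request(webhook_url, migration_id, action="freeze"):
    """Build the JSON POST request with the documented payload shape."""
    body = json.dumps(
        {"action": action, "migration_id": migration_id, "reason": "cutover"}
    ).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_freeze(request, timeout_seconds):
    """Return True on an HTTP 2xx response, False on any error or timeout."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (non-2xx), and socket timeouts all derive from OSError.
        return False


request = build_freeze_request("https://api.example.com/kafka/freeze", "mig-001")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling send_freeze(request, 60) would then mirror the default max_producer_freeze of 60s.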
Validation Options (validation)
Controls the 5-check validation suite that runs during finalize.
| Field | Type | Default | Description |
|---|---|---|---|
| count_tolerance | integer | 1 | Absolute difference in record count allowed per partition. Set to 1 to account for the sentinel record. |
| spot_check_records_per_partition | integer | 3 | Number of records sampled per partition for byte-equality check. Minimum 1. |
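The count_tolerance rule amounts to a per-partition absolute-difference check. The Python sketch below illustrates that arithmetic; the function name and the (topic, partition) key format are assumptions made for the example.

```python
def count_mismatches(source_counts, target_counts, count_tolerance=1):
    """Return partitions whose record counts differ by more than count_tolerance.

    Both arguments map (topic, partition) to a record count.
    """
    mismatches = {}
    for partition in sorted(set(source_counts) | set(target_counts)):
        source = source_counts.get(partition, 0)
        target = target_counts.get(partition, 0)
        if abs(source - target) > count_tolerance:
            mismatches[partition] = (source, target)
    return mismatches


source_counts = {("orders", 0): 1000, ("orders", 1): 2000, ("orders", 2): 500}
target_counts = {("orders", 0): 1001, ("orders", 1): 1995, ("orders", 2): 500}

# Partition 0 differs by 1 (within tolerance); partition 1 differs by 5.
print(count_mismatches(source_counts, target_counts, count_tolerance=1))
```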
Seed Options (seed)
Controls the bulk data transfer (seed) phase.
| Field | Type | Default | Description |
|---|---|---|---|
| max_concurrent_partitions | integer | 4 | Number of partitions transferred in parallel. Higher values increase source cluster load. |
| segment_max_bytes | integer | 33554432 (32 MB) | Maximum S3 segment size in bytes. Larger segments reduce S3 API calls but use more memory. |
For clusters with 500+ partitions, consider increasing max_concurrent_partitions to 8-16 to reduce seed phase duration. Monitor source cluster CPU and network during seed.
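To picture what max_concurrent_partitions limits, the following Python sketch runs a stand-in transfer function over 16 partitions with at most 4 in flight. It models the bounded-parallelism idea with a thread pool; the tool's actual scheduler is not described here, and `seed_partitions` and `fake_transfer` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def seed_partitions(partitions, transfer, max_concurrent_partitions=4):
    """Run transfer(partition) for every partition, at most N at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrent_partitions) as pool:
        # pool.map keeps input order and re-raises the first failure.
        return list(pool.map(transfer, partitions))


# Stand-in transfer that records how many calls overlap.
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def fake_transfer(partition):
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # placeholder for the real copy work
    with stats_lock:
        stats["active"] -= 1
    return partition


results = seed_partitions(range(16), fake_transfer, max_concurrent_partitions=4)
print(len(results), "partitions transferred; peak concurrency", stats["peak"])
```

Raising the limit shortens the wall-clock time of the loop at the cost of more simultaneous reads, which is the trade-off the guidance above describes.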
Topology Options (topology)
Controls how topic configuration drift is handled when the target already has pre-existing topics.
| Field | Type | Default | Description |
|---|---|---|---|
| on_config_drift | enum | overwrite_with_source | How to handle config differences between source and target topics. |
Drift policies
| Policy | Behavior |
|---|---|
| overwrite_with_source | Push source config values to target. Source keys absent on target are added. Default. |
| keep_target | Keep pre-existing target values. Source keys absent on target are still added. |
| refuse | Refuse to proceed if any config drift exists. Safest for regulated workloads. |
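The three policies above can be modelled as dictionary merges. The Python sketch below is an interpretation, with two labeled assumptions: "drift" is taken to mean a key present on both sides with different values, and target-only keys are left untouched. The function name `resolve_topic_config` is hypothetical.

```python
def resolve_topic_config(source, target, policy="overwrite_with_source"):
    """Return the topic config the target would end up with under a drift policy."""
    # Assumption: drift means a shared key whose values differ.
    drift = sorted(key for key in source.keys() & target.keys() if source[key] != target[key])
    if policy == "refuse":
        if drift:
            raise RuntimeError(f"config drift on keys: {drift}")
        return {**target, **source}
    if policy == "overwrite_with_source":
        return {**target, **source}  # source values win; missing keys are added
    if policy == "keep_target":
        return {**source, **target}  # target values win; missing keys are still added
    raise ValueError(f"unknown policy: {policy!r}")


source = {"retention.ms": "604800000", "cleanup.policy": "delete"}
target = {"retention.ms": "86400000"}

print(resolve_topic_config(source, target, "overwrite_with_source"))
print(resolve_topic_config(source, target, "keep_target"))
```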
ACL Options (acl)
Controls how ACL binding drift is handled between source and target.
| Field | Type | Default | Description |
|---|---|---|---|
| on_drift | enum | merge | How to handle ACL differences between source and target. |
Drift policies
| Policy | Behavior |
|---|---|
| merge | Create missing bindings on target. Leave extra target-only bindings in place. Default. |
| replace | Create missing bindings. Surface extra target-only bindings in the report. Deletion is deferred to the operator. |
| refuse | Refuse to proceed if any target-only ACL binding exists. |
ACL bindings for MSK internal principals and resources (e.g., User:ANONYMOUS, __consumer_offsets) are automatically filtered and never copied to the target. This filtering is reported as warning W09 in precheck.
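The ACL drift policies described above reduce to set differences between source and target bindings. The Python sketch below models that planning step only; it does not apply the internal-principal filter, and the binding tuple layout and the name `plan_acl_sync` are assumptions for illustration.

```python
def plan_acl_sync(source_bindings, target_bindings, on_drift="merge"):
    """Plan ACL changes: which bindings to create and which extras to report."""
    if on_drift not in ("merge", "replace", "refuse"):
        raise ValueError(f"unknown policy: {on_drift!r}")
    source, target = set(source_bindings), set(target_bindings)
    missing = sorted(source - target)  # on source, not yet on target
    extra = sorted(target - source)    # target-only bindings
    if on_drift == "refuse" and extra:
        raise RuntimeError(f"{len(extra)} target-only ACL binding(s) found")
    return {
        "create": missing,
        # Only replace surfaces extras; deletion stays with the operator.
        "report_extra": extra if on_drift == "replace" else [],
    }


# Hypothetical binding shape: (principal, resource, operation, permission)
source_acls = [
    ("User:app", "topic:orders", "READ", "ALLOW"),
    ("User:app", "topic:orders", "WRITE", "ALLOW"),
]
target_acls = [
    ("User:app", "topic:orders", "READ", "ALLOW"),
    ("User:legacy", "topic:orders", "READ", "ALLOW"),
]

print(plan_acl_sync(source_acls, target_acls, "merge"))
print(plan_acl_sync(source_acls, target_acls, "replace"))
```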
Environment Variables
| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS credentials (standard SDK chain) |
| AWS_SECRET_ACCESS_KEY | AWS credentials (standard SDK chain) |
| AWS_SESSION_TOKEN | AWS session token for temporary credentials |
| AWS_REGION | Default AWS region (overridden by cluster ARN region) |
| AWS_ENDPOINT_URL_S3 | Custom S3 endpoint (for LocalStack or S3-compatible storage) |
| RUST_LOG | Log level control (e.g., RUST_LOG=debug) |
All credential fields in YAML support ${ENV_VAR} interpolation.
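As a rough model of ${ENV_VAR} interpolation, the Python sketch below substitutes variable references in a string. It is not the tool's implementation: the function name is hypothetical, and raising an error for an unset variable is an assumption, since the behavior for missing variables is not stated here.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value, env=None):
    """Replace every ${NAME} in value with the matching environment variable."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Assumption: an unset variable is treated as an error.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(replace, value)


example_env = {"KAFKA_PASSWORD": "s3cret", "KEYSTORE_PASS": "changeit"}
print(interpolate("${KAFKA_PASSWORD}", example_env))
print(interpolate("/path/to/client-keystore.jks", example_env))
```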
Next Steps
- Production Migration Runbook — step-by-step guide
- Precheck Codes Reference — every finding explained
- CLI Reference — command flags and options