
MSK KRaft Migration Configuration Reference

All migration settings live under enterprise.msk_kraft_migration in your config YAML.

Minimal Configuration

The smallest valid config requires source, target, backup, and evidence:

migration-minimal.yaml
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/source-zk/abc-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/target-kraft/def-456
      auth:
        mode: iam
    backup:
      s3_bucket: my-migration-segments
      s3_prefix: replay/
    evidence:
      s3_bucket: my-migration-evidence
      s3_prefix: migrations/

All other sections (cutover, validation, seed, topology, acl) are optional and fall back to the defaults listed in their sections below.
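As a rough illustration of the required fields, the sketch below checks an already-parsed config (a plain Python dict, however you load the YAML) for the paths marked "Yes" in the Required columns on this page. The function name `missing_required` and the `REQUIRED_PATHS` list are hypothetical helpers, not part of the tool.

```python
# Required dotted paths, taken from the "Required" columns on this page.
REQUIRED_PATHS = [
    "source.cluster_arn",
    "source.auth",
    "target.cluster_arn",
    "target.auth",
    "backup.s3_bucket",
    "backup.s3_prefix",
    "evidence.s3_bucket",
]


def missing_required(config: dict) -> list[str]:
    """Return the required paths absent under enterprise.msk_kraft_migration."""
    root = config.get("enterprise", {}).get("msk_kraft_migration", {})
    missing = []
    for path in REQUIRED_PATHS:
        node = root
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

An empty list means every required field is present; it says nothing about whether the values themselves are valid.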

Full Annotated Example

migration-production.yaml
enterprise:
  msk_kraft_migration:
    enabled: true # default: true

    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-zk/abc-123
      auth:
        mode: iam

    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-kraft/def-456
      auth:
        mode: iam

    backup:
      s3_bucket: prod-migration-segments
      s3_prefix: zk-to-kraft/
      kms_key_arn: arn:aws:kms:us-east-1:123456789012:key/mrk-abc123 # optional

    evidence:
      s3_bucket: prod-migration-evidence
      s3_prefix: migrations/
      retention: 7y # S3 Object Lock retention
      signing_key_path: /etc/kafka-backup/evidence-signing.key # optional

    cutover:
      drain_timeout: 30m # max time waiting for lag to converge
      drain_max_partition_lag: 100 # records; all partitions must be below this
      drain_stable_window: 30s # lag must stay below threshold for this long
      max_producer_freeze: 60s # max time producers are frozen during cutover
      producer_freeze_webhook: https://api.example.com/kafka/freeze
      reverse_replication_enabled: false # not yet implemented

    validation:
      count_tolerance: 1 # ±1 record allowed (sentinel offset shift)
      spot_check_records_per_partition: 3 # records sampled per partition

    seed:
      max_concurrent_partitions: 4 # parallel partition transfers
      segment_max_bytes: 33554432 # 32 MB per S3 segment

    topology:
      on_config_drift: overwrite_with_source # overwrite_with_source | keep_target | refuse

    acl:
      on_drift: merge # merge | replace | refuse

Source / Target (source, target)

Each cluster reference describes how to connect and authenticate.

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| cluster_arn | string | Yes | | AWS MSK cluster ARN. Region and account are extracted from this. |
| auth | object | Yes | | Authentication configuration (see below). |
| local_dev | object | No | | Dev-only override for Docker/local testing. Bypasses MSK API calls. |
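To make the "region and account are extracted" remark concrete, here is a minimal sketch that splits a cluster ARN of the shape used in the examples on this page. `MskArn` and `parse_msk_cluster_arn` are hypothetical names; the tool's own parser may validate more or less strictly.

```python
from typing import NamedTuple


class MskArn(NamedTuple):
    partition: str
    region: str
    account_id: str
    cluster_name: str
    cluster_id: str


def parse_msk_cluster_arn(arn: str) -> MskArn:
    """Split arn:<partition>:kafka:<region>:<account>:cluster/<name>/<id>."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kafka":
        raise ValueError(f"not an MSK ARN: {arn!r}")
    _, partition, _, region, account_id, resource = parts
    segments = resource.split("/")
    if len(segments) != 3 or segments[0] != "cluster":
        raise ValueError(f"not a cluster resource: {resource!r}")
    return MskArn(partition, region, account_id, segments[1], segments[2])
```

For the minimal config's source cluster, this yields region `us-east-1` and account `123456789012`.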

Authentication modes (auth)

Authentication is configured with mode as a discriminator:

IAM

auth:
  mode: iam

Uses AWS IAM SASL/OAUTHBEARER. Credentials come from the standard AWS SDK chain (environment, instance profile, SSO, etc.).

SCRAM-SHA-512

auth:
  mode: scram-sha-512
  username: admin
  password: ${KAFKA_PASSWORD}

| Field | Type | Required | Description |
|---|---|---|---|
| username | string | Yes | SASL username |
| password | string | Yes | SASL password (supports ${ENV_VAR} interpolation) |

mTLS

auth:
  mode: mtls
  keystore: /path/to/client-keystore.jks
  keystore_password: ${KEYSTORE_PASS}
  truststore: /path/to/truststore.jks
  truststore_password: ${TRUSTSTORE_PASS}

| Field | Type | Required | Description |
|---|---|---|---|
| keystore | string | Yes | Path to JKS keystore |
| keystore_password | string | Yes | Keystore password |
| truststore | string | Yes | Path to JKS truststore |
| truststore_password | string | Yes | Truststore password |

Plain

auth:
  mode: plain

PLAINTEXT connection with no authentication. For dev/test only.

Local dev override (local_dev)

Bypasses MSK DescribeCluster and GetBootstrapBrokers API calls. Used for Docker-based testing.

local_dev:
  bootstrap: "localhost:9092,localhost:9093,localhost:9094"
  metadata_mode: "ZOOKEEPER" # or "KRAFT"
  kafka_version: "3.7.1" # default: 3.7.1
  broker_count: 3 # default: 3
  region: "local" # default: local
  disable_tls: false # default: false; set true for SASL_PLAINTEXT in Docker

| Field | Type | Default | Description |
|---|---|---|---|
| bootstrap | string | | Comma-separated host:port list (required) |
| metadata_mode | string | | "ZOOKEEPER" or "KRAFT" (required) |
| kafka_version | string | "3.7.1" | Kafka version string |
| broker_count | integer | 3 | Number of brokers |
| region | string | "local" | AWS region (used for S3 client) |
| disable_tls | boolean | false | Downgrade SASL_SSL to SASL_PLAINTEXT |

Backup Channel (backup)

S3 bucket used as the intermediate replication channel during seed and tail phases.

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| s3_bucket | string | Yes | | S3 bucket name |
| s3_prefix | string | Yes | | S3 key prefix for migration data |
| kms_key_arn | string | No | | AWS KMS key ARN for server-side encryption (SSE-KMS) |

Evidence Store (evidence)

S3 bucket for the signed migration evidence bundle.

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| s3_bucket | string | Yes | | S3 bucket name |
| s3_prefix | string | No | "migrations/" | S3 key prefix |
| retention | string | No | "7y" | S3 Object Lock retention period (e.g., "7y", "90d") |
| signing_key_path | string | No | | Path to Ed25519 private key for evidence signing |

Object Lock

If the evidence bucket has S3 Object Lock enabled, the evidence bundle is uploaded in COMPLIANCE mode with the configured retention. If Object Lock is not configured, the bundle is uploaded without retention and a warning is logged.


Cutover Options (cutover)

Controls cutover phase behavior: drain convergence, producer freeze, and sentinel timing.

| Field | Type | Default | Description |
|---|---|---|---|
| drain_timeout | duration | 10m | Maximum time to wait for tail lag to converge before giving up. |
| drain_max_partition_lag | integer | 1000 | Maximum per-partition lag (records) to consider "caught up". |
| drain_stable_window | duration | 60s | Lag must stay below threshold for this duration before drain is declared ready. |
| max_producer_freeze | duration | 60s | Maximum time producers remain frozen during cutover. Cutover aborts if exceeded. |
| producer_freeze_webhook | string | | URL to POST for producer freeze/unfreeze. If omitted, the tool prompts on TTY. |
| reverse_replication_grace | duration | 1h | Grace period for reverse replication (reserved for future use). |
| reverse_replication_enabled | boolean | false | Enable reverse replication after cutover. Not yet implemented; setting to true is a blocker (B13). |
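The three drain settings interact, which a small simulation can make easier to see. This is only a sketch of one plausible reading, not the tool's actual loop: it assumes lag is sampled periodically, that "caught up" means every partition's lag is at or under the threshold, and that any sample over the threshold resets the stable window. `drain_ready_at` is a hypothetical name.

```python
def drain_ready_at(samples, max_partition_lag=1000, stable_window_s=60.0, timeout_s=600.0):
    """Return the elapsed time at which drain would be declared ready, else None.

    samples: iterable of (elapsed_seconds, {partition: lag}) in time order.
    Defaults mirror the documented defaults (1000 records, 60s, 10m).
    """
    below_since = None  # when all partitions first dropped under the threshold
    for elapsed, lags in samples:
        if elapsed > timeout_s:
            return None  # drain_timeout exceeded
        if all(lag <= max_partition_lag for lag in lags.values()):
            if below_since is None:
                below_since = elapsed
            if elapsed - below_since >= stable_window_s:
                return elapsed
        else:
            below_since = None  # assumed: one lagging partition resets the window
    return None
```

Under these assumptions, a brief lag spike on a single partition restarts the stable window, which is why a short drain_timeout combined with a long drain_stable_window can fail on bursty topics.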

Duration values use human-readable format: 30s, 5m, 1h, 2h30m.
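For reference, a minimal parser for the duration forms listed above (30s, 5m, 1h, 2h30m) might look like the following. It is a sketch that handles only the s, m, and h units; the day and year suffixes used by evidence.retention are deliberately not covered, and `parse_duration` is a hypothetical name.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+)([smh])")


def parse_duration(text: str) -> int:
    """Convert a string such as '2h30m' into a number of seconds."""
    pos = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # Something unparseable sits between two tokens.
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total
```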

Producer freeze webhook

When configured, the tool sends HTTP POST requests:

POST <webhook_url>
Content-Type: application/json

{"action": "freeze", "migration_id": "<id>", "reason": "cutover"}

The webhook must return HTTP 2xx within max_producer_freeze. On failure or timeout, cutover aborts and the migration enters the failed state.
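A receiving endpoint for the freeze request could be sketched with Python's standard library as shown here. Only the freeze payload shown above is handled; the response body, the `handle_webhook` and `serve` names, and the port are assumptions for illustration, and any other action is rejected because its payload is not documented on this page.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_webhook(body: bytes) -> tuple[int, dict]:
    """Decide the HTTP status and JSON reply for one webhook POST."""
    try:
        payload = json.loads(body)
    except ValueError:
        return 400, {"error": "invalid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "expected a JSON object"}
    if payload.get("action") == "freeze":
        # Hypothetical hook point: pause your producers here before replying.
        return 200, {"status": "frozen", "migration_id": payload.get("migration_id")}
    return 400, {"error": "unsupported action"}


class FreezeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, reply = handle_webhook(self.rfile.read(length))
        data = json.dumps(reply).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), FreezeHandler).serve_forever()
```

Because the 2xx reply must arrive within max_producer_freeze, a real handler should acknowledge only after producers have actually stopped, and should do so quickly.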


Validation Options (validation)

Controls the 5-check validation suite that runs during finalize.

| Field | Type | Default | Description |
|---|---|---|---|
| count_tolerance | integer | 1 | Absolute difference in record count allowed per partition. Set to 1 to account for the sentinel record. |
| spot_check_records_per_partition | integer | 3 | Number of records sampled per partition for byte-equality check. Minimum 1. |
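The count_tolerance check amounts to a per-partition absolute difference. The sketch below illustrates that arithmetic only; the function name and the choice to flag partitions missing on either side are assumptions, not a description of the tool's validation suite.

```python
def count_mismatches(source_counts: dict, target_counts: dict, count_tolerance: int = 1) -> dict:
    """Return {partition: (source_count, target_count)} for partitions out of tolerance."""
    mismatches = {}
    for partition in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(partition)
        dst = target_counts.get(partition)
        # Assumed: a partition present on only one side always counts as a mismatch.
        if src is None or dst is None or abs(src - dst) > count_tolerance:
            mismatches[partition] = (src, dst)
    return mismatches
```

With the default tolerance of 1, a target partition holding one extra record (such as the sentinel) passes, while a difference of two does not.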

Seed Options (seed)

Controls the bulk data transfer (seed) phase.

| Field | Type | Default | Description |
|---|---|---|---|
| max_concurrent_partitions | integer | 4 | Number of partitions transferred in parallel. Higher values increase source cluster load. |
| segment_max_bytes | integer | 33554432 (32 MB) | Maximum S3 segment size in bytes. Larger segments reduce S3 API calls but use more memory. |

Tuning for large clusters

For clusters with 500+ partitions, consider increasing max_concurrent_partitions to 8-16 to reduce seed phase duration. Monitor source cluster CPU and network during seed.


Topology Options (topology)

Controls how topic configuration drift is handled when the target already has pre-existing topics.

| Field | Type | Default | Description |
|---|---|---|---|
| on_config_drift | enum | overwrite_with_source | How to handle config differences between source and target topics. |

Drift policies

| Policy | Behavior |
|---|---|
| overwrite_with_source | Push source config values to target. Source keys absent on target are added. Default. |
| keep_target | Keep pre-existing target values. Source keys absent on target are still added. |
| refuse | Refuse to proceed if any config drift exists. Safest for regulated workloads. |
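The three on_config_drift policies can be illustrated on plain dictionaries of topic config. This is a sketch under stated assumptions: drift is taken to mean a key present on both sides with different values, target-only keys are left untouched, and refuse without drift still adds missing source keys. None of those details is confirmed by this page, and the names are hypothetical.

```python
class ConfigDriftError(Exception):
    """Raised when the refuse policy meets drifting keys."""


def resolve_topic_config(source: dict, target: dict, policy: str = "overwrite_with_source") -> dict:
    """Return the config the target topic would end up with under a policy."""
    # Assumed definition of drift: shared key, different value.
    drift = sorted(k for k in source.keys() & target.keys() if source[k] != target[k])
    if policy == "overwrite_with_source":
        return {**target, **source}  # source wins on shared keys
    if policy == "keep_target":
        return {**source, **target}  # target wins; missing source keys still added
    if policy == "refuse":
        if drift:
            raise ConfigDriftError(f"config drift on: {', '.join(drift)}")
        return {**target, **source}  # no drift, so only missing keys are added
    raise ValueError(f"unknown policy: {policy!r}")
```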

ACL Options (acl)

Controls how ACL binding drift is handled between source and target.

| Field | Type | Default | Description |
|---|---|---|---|
| on_drift | enum | merge | How to handle ACL differences between source and target. |

Drift policies

| Policy | Behavior |
|---|---|
| merge | Create missing bindings on target. Leave extra target-only bindings in place. Default. |
| replace | Create missing bindings. Surface extra target-only bindings in the report. Deletion is deferred to the operator. |
| refuse | Refuse to proceed if any target-only ACL binding exists. |

MSK internal ACLs

ACL bindings for MSK internal principals and resources (e.g., User:ANONYMOUS, __consumer_offsets) are automatically filtered and never copied to the target. This filtering is reported as warning W09 in precheck.
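The on_drift policies described above reduce to set arithmetic over bindings. The sketch below models a binding as an opaque hashable value (here a tuple) and returns what would be created and what would be reported. The function and exception names are hypothetical, and the behavior of refuse when no target-only binding exists (creating the missing ones) is an assumption.

```python
class AclDriftError(Exception):
    """Raised when the refuse policy finds target-only bindings."""


def plan_acl_sync(source: set, target: set, policy: str = "merge") -> dict:
    """Return {'create': ..., 'report': ...} for the given on_drift policy."""
    if policy not in ("merge", "replace", "refuse"):
        raise ValueError(f"unknown policy: {policy!r}")
    to_create = source - target    # bindings missing on the target
    target_only = target - source  # extra bindings found only on the target
    if policy == "refuse" and target_only:
        raise AclDriftError(f"{len(target_only)} target-only binding(s) found")
    # replace surfaces extras in the report; nothing is deleted automatically.
    report = target_only if policy == "replace" else set()
    return {"create": to_create, "report": report}
```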


Environment Variables

| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS credentials (standard SDK chain) |
| AWS_SECRET_ACCESS_KEY | AWS credentials (standard SDK chain) |
| AWS_SESSION_TOKEN | AWS session token for temporary credentials |
| AWS_REGION | Default AWS region (overridden by cluster ARN region) |
| AWS_ENDPOINT_URL_S3 | Custom S3 endpoint (for LocalStack or S3-compatible storage) |
| RUST_LOG | Log level control (e.g., RUST_LOG=debug) |

All credential fields in YAML support ${ENV_VAR} interpolation.
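As an illustration of ${ENV_VAR} interpolation, the following sketch substitutes variables in a single string value. How the tool treats an unset variable is not stated here; raising an error is an assumption of this sketch, and `interpolate` is a hypothetical name.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value: str, env=None) -> str:
    """Replace each ${NAME} in value with the matching environment variable."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Assumed behavior: fail loudly rather than substitute an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR.sub(replace, value)
```

Passing an explicit mapping as `env` keeps the function easy to test without touching the real environment.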

Next Steps