MSK KRaft Migration Configuration Reference

All migration settings live under the `enterprise.msk_kraft_migration` key in your YAML config file.

Minimal Configuration

The smallest valid config requires four sections: `source`, `target`, `backup`, and `evidence`.

migration-minimal.yaml
enterprise:
  msk_kraft_migration:
    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/source-zk/abc-123
      auth:
        mode: iam
    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/target-kraft/def-456
      auth:
        mode: iam
    backup:
      s3_bucket: my-migration-segments
      s3_prefix: replay/
    evidence:
      s3_bucket: my-migration-evidence
      s3_prefix: migrations/

All other sections (`cutover`, `validation`, `seed`, `topology`, `acl`) are optional and fall back to the defaults listed in their sections below.
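As a rough illustration of the required-section rule, the following Python sketch checks a parsed config for the four mandatory sections. The helper name `missing_sections` and the plain-dict input (as a YAML parser would produce) are assumptions made for this example, not part of the tool.

```python
# Hypothetical helper: report which required sections are absent.
REQUIRED_SECTIONS = ("source", "target", "backup", "evidence")


def missing_sections(config: dict) -> list:
    """Return the required migration sections missing from a parsed config."""
    migration = config.get("enterprise", {}).get("msk_kraft_migration", {})
    return [name for name in REQUIRED_SECTIONS if name not in migration]


# The minimal config from above, expressed as the dict a YAML parser would yield.
minimal = {
    "enterprise": {
        "msk_kraft_migration": {
            "source": {
                "cluster_arn": "arn:aws:kafka:us-east-1:123456789012:cluster/source-zk/abc-123",
                "auth": {"mode": "iam"},
            },
            "target": {
                "cluster_arn": "arn:aws:kafka:us-east-1:123456789012:cluster/target-kraft/def-456",
                "auth": {"mode": "iam"},
            },
            "backup": {"s3_bucket": "my-migration-segments", "s3_prefix": "replay/"},
            "evidence": {"s3_bucket": "my-migration-evidence", "s3_prefix": "migrations/"},
        }
    }
}
```

Calling `missing_sections(minimal)` returns an empty list; dropping a section from the dict makes its name appear in the result.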

Full Annotated Example

migration-production.yaml
enterprise:
  msk_kraft_migration:
    enabled: true                              # default: true

    source:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-zk/abc-123
      auth:
        mode: iam

    target:
      cluster_arn: arn:aws:kafka:us-east-1:123456789012:cluster/prod-kraft/def-456
      auth:
        mode: iam

    backup:
      s3_bucket: prod-migration-segments
      s3_prefix: zk-to-kraft/
      kms_key_arn: arn:aws:kms:us-east-1:123456789012:key/mrk-abc123  # optional

    evidence:
      s3_bucket: prod-migration-evidence
      s3_prefix: migrations/
      retention: 7y                            # S3 Object Lock retention
      signing_key_path: /etc/kafka-backup/evidence-signing.key  # optional

    cutover:
      drain_timeout: 30m                       # max time waiting for lag to converge
      drain_max_partition_lag: 100             # records; all partitions must be below this
      drain_stable_window: 30s                 # lag must stay below threshold for this long
      max_producer_freeze: 60s                 # max time producers are frozen during cutover
      producer_freeze_webhook: https://api.example.com/kafka/freeze
      reverse_replication_enabled: false        # not yet implemented

    validation:
      count_tolerance: 1                       # ±1 record allowed (sentinel offset shift)
      spot_check_records_per_partition: 3       # records sampled per partition

    seed:
      max_concurrent_partitions: 4             # parallel partition transfers
      segment_max_bytes: 33554432              # 32 MB per S3 segment

    topology:
      on_config_drift: overwrite_with_source   # overwrite_with_source | keep_target | refuse

    acl:
      on_drift: merge                          # merge | replace | refuse

Source / Target (`source`, `target`)

Each cluster reference describes how to connect and authenticate.

Field	Type	Required	Default	Description
`cluster_arn`	string	Yes	—	AWS MSK cluster ARN. Region and account are extracted from this.
`auth`	object	Yes	—	Authentication configuration (see below).
`local_dev`	object	No	—	Dev-only override for Docker/local testing. Bypasses MSK API calls.
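The table notes that region and account are extracted from `cluster_arn`. A minimal sketch of that extraction, assuming the standard `arn:partition:service:region:account-id:resource` layout, might look like this. The names `ClusterArn` and `parse_cluster_arn` are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class ClusterArn(NamedTuple):
    region: str
    account_id: str
    cluster_name: str


def parse_cluster_arn(arn: str) -> ClusterArn:
    """Split an MSK cluster ARN into region, account ID, and cluster name."""
    # Standard ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kafka":
        raise ValueError(f"not an MSK ARN: {arn!r}")
    # For MSK clusters the resource part is cluster/<name>/<id>.
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "cluster":
        raise ValueError(f"not a cluster resource: {parts[5]!r}")
    return ClusterArn(region=parts[3], account_id=parts[4], cluster_name=resource[1])
```

Parsing the sample source ARN from the minimal config yields region `us-east-1`, account `123456789012`, and cluster name `source-zk`.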

Authentication modes (`auth`)

The `mode` field acts as a discriminator: it selects the authentication mechanism and determines which other fields are required.

IAM

auth:
  mode: iam

Uses AWS IAM SASL/OAUTHBEARER. Credentials come from the standard AWS SDK chain (environment, instance profile, SSO, etc.).

SCRAM-SHA-512

auth:
  mode: scram-sha-512
  username: admin
  password: ${KAFKA_PASSWORD}

Field	Type	Required	Description
`username`	string	Yes	SASL username
`password`	string	Yes	SASL password (supports `${ENV_VAR}` interpolation)

mTLS

auth:
  mode: mtls
  keystore: /path/to/client-keystore.jks
  keystore_password: ${KEYSTORE_PASS}
  truststore: /path/to/truststore.jks
  truststore_password: ${TRUSTSTORE_PASS}

Field	Type	Required	Description
`keystore`	string	Yes	Path to JKS keystore
`keystore_password`	string	Yes	Keystore password
`truststore`	string	Yes	Path to JKS truststore
`truststore_password`	string	Yes	Truststore password

Plain

auth:
  mode: plain

PLAINTEXT connection with no authentication. For dev/test only.
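The four modes above can be summarized as a lookup from `mode` to its required fields. The sketch below is one way to validate an `auth` block before a run. The function name `validate_auth` and the error messages are assumptions for illustration, and the tool's own validation may differ.

```python
# Required fields per auth mode, taken from the tables above.
REQUIRED_AUTH_FIELDS = {
    "iam": (),
    "scram-sha-512": ("username", "password"),
    "mtls": ("keystore", "keystore_password", "truststore", "truststore_password"),
    "plain": (),
}


def validate_auth(auth: dict) -> None:
    """Raise ValueError if the auth block has an unknown mode or missing fields."""
    mode = auth.get("mode")
    if mode not in REQUIRED_AUTH_FIELDS:
        raise ValueError(f"unknown auth mode: {mode!r}")
    missing = [field for field in REQUIRED_AUTH_FIELDS[mode] if field not in auth]
    if missing:
        raise ValueError(f"auth mode {mode!r} is missing: {', '.join(missing)}")
```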

Local dev override (`local_dev`)

Bypasses MSK DescribeCluster and GetBootstrapBrokers API calls. Used for Docker-based testing.

local_dev:
  bootstrap: "localhost:9092,localhost:9093,localhost:9094"
  metadata_mode: "ZOOKEEPER"    # or "KRAFT"
  kafka_version: "3.7.1"        # default: 3.7.1
  broker_count: 3               # default: 3
  region: "local"               # default: local
  disable_tls: false            # default: false; set true for SASL_PLAINTEXT in Docker

Field	Type	Default	Description
`bootstrap`	string	—	Comma-separated `host:port` list (required)
`metadata_mode`	string	—	`"ZOOKEEPER"` or `"KRAFT"` (required)
`kafka_version`	string	`"3.7.1"`	Kafka version string
`broker_count`	integer	`3`	Number of brokers
`region`	string	`"local"`	AWS region (used for S3 client)
`disable_tls`	boolean	`false`	Downgrade SASL_SSL to SASL_PLAINTEXT

Backup Channel (`backup`)

S3 bucket used as the intermediate replication channel during seed and tail phases.

Field	Type	Required	Default	Description
`s3_bucket`	string	Yes	—	S3 bucket name
`s3_prefix`	string	Yes	—	S3 key prefix for migration data
`kms_key_arn`	string	No	—	AWS KMS key ARN for server-side encryption (SSE-KMS)

Evidence Store (`evidence`)

S3 bucket for the signed migration evidence bundle.

Field	Type	Required	Default	Description
`s3_bucket`	string	Yes	—	S3 bucket name
`s3_prefix`	string	No	`"migrations/"`	S3 key prefix
`retention`	string	No	`"7y"`	S3 Object Lock retention period (e.g., `"7y"`, `"90d"`)
`signing_key_path`	string	No	—	Path to Ed25519 private key for evidence signing

Object Lock

If the evidence bucket has S3 Object Lock enabled, the evidence bundle is uploaded in COMPLIANCE mode with the configured retention. If Object Lock is not configured, the bundle is uploaded without retention and a warning is logged.

Cutover Options (`cutover`)

Controls the cutover phase behavior — drain convergence, producer freeze, and sentinel timing.

Field	Type	Default	Description
`drain_timeout`	duration	`10m`	Maximum time to wait for tail lag to converge before giving up.
`drain_max_partition_lag`	integer	`1000`	Maximum per-partition lag (records) to consider "caught up".
`drain_stable_window`	duration	`60s`	Lag must stay below threshold for this duration before drain is declared ready.
`max_producer_freeze`	duration	`60s`	Maximum time producers remain frozen during cutover. Cutover aborts if exceeded.
`producer_freeze_webhook`	string	—	URL to POST for producer freeze/unfreeze. If omitted, the tool prompts on TTY.
`reverse_replication_grace`	duration	`1h`	Grace period for reverse replication (reserved for future use).
`reverse_replication_enabled`	boolean	`false`	Enable reverse replication after cutover. Not yet implemented — setting to `true` is a blocker (B13).
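One way to read the three `drain_*` settings together is sketched below: every partition's lag must sit at or under `drain_max_partition_lag` continuously for `drain_stable_window`, and this must happen before `drain_timeout` elapses. The function `drain_ready`, its sample format, and the inclusive `<=` comparison are assumptions for this example; the tool's actual convergence logic is not shown in this reference.

```python
def drain_ready(samples, max_partition_lag, stable_window_s, timeout_s):
    """Decide whether lag converged.

    samples: list of (timestamp_seconds, {partition: lag}) in time order.
    Returns True once all partitions stay at or under the threshold for
    stable_window_s, or False if timeout_s passes first.
    """
    if not samples:
        return False
    start = samples[0][0]
    stable_since = None
    for ts, lags in samples:
        if ts - start > timeout_s:
            return False  # drain_timeout elapsed before lag converged
        caught_up = all(lag <= max_partition_lag for lag in lags.values())
        if not caught_up:
            stable_since = None  # a spike on any partition resets the window
            continue
        if stable_since is None:
            stable_since = ts
        if ts - stable_since >= stable_window_s:
            return True
    return False
```

A single lag spike on any partition restarts the stable window, which is why a shorter `drain_stable_window` converges faster on busy clusters.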

Duration values use a human-readable format, for example `30s`, `5m`, `1h`, or `2h30m`.
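To make the duration format concrete, here is a small parser sketch that accepts the forms shown above. It assumes only the `s`, `m`, and `h` units that appear in this section (the `y` and `d` units used by `evidence.retention` are not handled), and `parse_duration` is a hypothetical name rather than a function the tool exposes.

```python
import re
from datetime import timedelta

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_TOKEN = re.compile(r"(\d+)([smh])")


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '30s', '5m', '1h', or '2h30m' into a timedelta."""
    # Require the whole string to be one or more <number><unit> tokens.
    if not re.fullmatch(r"(?:\d+[smh])+", text):
        raise ValueError(f"invalid duration: {text!r}")
    seconds = sum(int(n) * _UNIT_SECONDS[unit] for n, unit in _TOKEN.findall(text))
    return timedelta(seconds=seconds)
```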

Producer freeze webhook

When configured, the tool sends HTTP POST requests:

POST <webhook_url>
Content-Type: application/json

{"action": "freeze", "migration_id": "<id>", "reason": "cutover"}

The webhook must return HTTP 2xx within `max_producer_freeze`. On failure or timeout, cutover aborts and the migration enters the `failed` state.
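To exercise your own endpoint before a migration, you could replay the request shape shown above with a short script. The sketch below uses only the Python standard library. The function names and the `mig-001` style of migration ID are assumptions, and it is not the tool's own client code.

```python
import json
import urllib.request


def build_freeze_request(webhook_url: str, migration_id: str) -> urllib.request.Request:
    """Build a POST request with the freeze payload documented above."""
    body = json.dumps(
        {"action": "freeze", "migration_id": migration_id, "reason": "cutover"}
    ).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_freeze(webhook_url: str, migration_id: str, max_producer_freeze_s: float) -> bool:
    """Return True only if the endpoint answers with HTTP 2xx in time."""
    request = build_freeze_request(webhook_url, migration_id)
    try:
        # The timeout applies to each blocking socket operation, so it only
        # approximates an overall deadline.
        with urllib.request.urlopen(request, timeout=max_producer_freeze_s) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

For example, `send_freeze("https://api.example.com/kafka/freeze", "mig-001", 60)` would return `False` if the endpoint is unreachable or replies with an error status.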

Validation Options (`validation`)

Controls the 5-check validation suite that runs during finalize.

Field	Type	Default	Description
`count_tolerance`	integer	`1`	Absolute difference in record count allowed per partition. Set to 1 to account for the sentinel record.
`spot_check_records_per_partition`	integer	`3`	Number of records sampled per partition for byte-equality check. Minimum 1.
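The `count_tolerance` rule amounts to an absolute-difference check per partition. The following sketch shows that comparison under the assumption that record counts are available as `{partition: count}` mappings; `count_mismatches` is a hypothetical helper, and the other validation checks are not modelled here.

```python
def count_mismatches(source_counts: dict, target_counts: dict, count_tolerance: int = 1) -> dict:
    """Return {partition: (source_count, target_count)} for partitions out of tolerance."""
    mismatches = {}
    for partition, source_count in source_counts.items():
        # A partition missing on the target is treated as holding zero records.
        target_count = target_counts.get(partition, 0)
        if abs(source_count - target_count) > count_tolerance:
            mismatches[partition] = (source_count, target_count)
    return mismatches
```

With the default tolerance of 1, a target partition that holds one extra record (for instance the sentinel) passes, while a difference of two or more is flagged.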

Seed Options (`seed`)

Controls the bulk data transfer (seed) phase.

Field	Type	Default	Description
`max_concurrent_partitions`	integer	`4`	Number of partitions transferred in parallel. Higher values increase source cluster load.
`segment_max_bytes`	integer	`33554432` (32 MB)	Maximum S3 segment size in bytes. Larger segments reduce S3 API calls but use more memory.

Tuning for large clusters

For clusters with 500+ partitions, consider increasing `max_concurrent_partitions` to between 8 and 16 to reduce seed phase duration. Monitor source cluster CPU and network during the seed phase.
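A back-of-the-envelope calculation can help when choosing these two values. The sketch below estimates how many S3 segments a partition produces and a rough upper bound on buffered bytes. It assumes one in-flight segment buffer per concurrently transferred partition, which is a simplification for illustration and not a documented property of the tool.

```python
DEFAULT_SEGMENT_MAX_BYTES = 33_554_432  # 32 MB, the documented default


def segments_for_partition(partition_bytes: int,
                           segment_max_bytes: int = DEFAULT_SEGMENT_MAX_BYTES) -> int:
    """Number of segments needed for a partition (integer ceiling division)."""
    return (partition_bytes + segment_max_bytes - 1) // segment_max_bytes


def peak_buffer_bytes(max_concurrent_partitions: int = 4,
                      segment_max_bytes: int = DEFAULT_SEGMENT_MAX_BYTES) -> int:
    """Rough buffer bound, ASSUMING one in-flight segment per partition."""
    return max_concurrent_partitions * segment_max_bytes
```

Under these assumptions, a 10 GiB partition needs 320 segments at the default size, and raising `max_concurrent_partitions` to 16 gives a buffer bound of 512 MiB.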

Topology Options (`topology`)

Controls how topic configuration drift is handled when the target already has pre-existing topics.

Field	Type	Default	Description
`on_config_drift`	enum	`overwrite_with_source`	How to handle config differences between source and target topics.

Drift policies

Policy	Behavior
`overwrite_with_source`	Push source config values to target. Source keys absent on target are added. Default.
`keep_target`	Keep pre-existing target values. Source keys absent on target are still added.
`refuse`	Refuse to proceed if any config drift exists. Safest for regulated workloads.
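The three topic-config policies can be pictured as different ways of merging two key-value maps. The sketch below is an interpretation of the table, with two labelled assumptions: "drift" means a key present on both sides with different values, and target-only keys are left untouched. The name `resolve_topic_config` is hypothetical.

```python
def resolve_topic_config(source: dict, target: dict,
                         policy: str = "overwrite_with_source") -> dict:
    """Return the topic config the target would end up with under a policy."""
    # ASSUMPTION: drift = keys on both sides whose values differ.
    drifted = {key for key in source.keys() & target.keys() if source[key] != target[key]}
    if policy == "overwrite_with_source":
        return {**target, **source}  # source values win; missing keys are added
    if policy == "keep_target":
        return {**source, **target}  # target values win; missing keys are added
    if policy == "refuse":
        if drifted:
            raise RuntimeError(f"config drift on: {sorted(drifted)}")
        return {**source, **target}
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, if the source sets `retention.ms` to `604800000` and the target has `86400000`, `overwrite_with_source` ends with the source value, `keep_target` ends with the target value, and `refuse` raises.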

ACL Options (`acl`)

Controls how ACL binding drift is handled between source and target.

Field	Type	Default	Description
`on_drift`	enum	`merge`	How to handle ACL differences between source and target.

Drift policies

Policy	Behavior
`merge`	Create missing bindings on target. Leave extra target-only bindings in place. Default.
`replace`	Create missing bindings. Surface extra target-only bindings in the report. Deletion is deferred to the operator.
`refuse`	Refuse to proceed if any target-only ACL binding exists.
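The ACL policies differ only in how they treat bindings that exist on the target but not on the source. The sketch below models bindings as hashable tuples and uses set differences to show each policy's outcome. The tuple layout and the `plan_acl_sync` name are assumptions for illustration.

```python
def plan_acl_sync(source: set, target: set, policy: str = "merge") -> dict:
    """Return which bindings to create and which target-only bindings to report."""
    missing = source - target  # on source, not yet on target
    extra = target - source    # target-only bindings
    if policy == "merge":
        return {"create": missing, "report_extra": set()}
    if policy == "replace":
        # Extras are only surfaced; deleting them is left to the operator.
        return {"create": missing, "report_extra": extra}
    if policy == "refuse":
        if extra:
            raise RuntimeError(f"{len(extra)} target-only ACL binding(s) found")
        return {"create": missing, "report_extra": set()}
    raise ValueError(f"unknown policy: {policy!r}")
```

None of the three branches deletes anything, which mirrors the table: even `replace` leaves removal of extra bindings to the operator.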

MSK internal ACLs

ACL bindings that involve MSK-internal principals or resources (for example, the `User:ANONYMOUS` principal or the `__consumer_offsets` topic) are filtered out automatically and never copied to the target. Precheck reports this filtering as warning W09.

Environment Variables

Variable	Description
`AWS_ACCESS_KEY_ID`	AWS credentials (standard SDK chain)
`AWS_SECRET_ACCESS_KEY`	AWS credentials (standard SDK chain)
`AWS_SESSION_TOKEN`	AWS session token for temporary credentials
`AWS_REGION`	Default AWS region (overridden by cluster ARN region)
`AWS_ENDPOINT_URL_S3`	Custom S3 endpoint (for LocalStack or S3-compatible storage)
`RUST_LOG`	Log level control (e.g., `RUST_LOG=debug`)

All credential fields in YAML support `${ENV_VAR}` interpolation.
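The `${ENV_VAR}` syntax can be illustrated with a short substitution routine. This is a sketch of the general technique only: the variable-name pattern and the choice to fail on an unset variable are assumptions, and the tool's actual interpolation rules may differ.

```python
import os
import re

# ASSUMPTION: names follow the usual shell convention for environment variables.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value: str, env=None) -> str:
    """Replace each ${NAME} in value with the matching environment variable."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(replace, value)
```

Passing an explicit mapping as `env` keeps the function easy to test without touching the real process environment.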

Next Steps

Production Migration Runbook — step-by-step guide
Precheck Codes Reference — every finding explained
CLI Reference — command flags and options
