
Performance Efficiency

"Using compute, storage, and network resources efficiently to meet backup and restore throughput requirements, and maintaining that efficiency as data volumes grow."

Kafka backup workloads must keep pace with production data rates while consuming the minimum necessary resources. The Performance Efficiency pillar ensures your backup architecture is right-sized, properly tuned, and continuously benchmarked — so that growing data volumes never outstrip your ability to protect them.

Design Principles

  1. Right-size for throughput requirements — Match compute, memory, and network capacity to measured data rates rather than guessing. Over-provisioning wastes budget; under-provisioning causes backup lag.

  2. Use compression to reduce storage and network overhead — Compress backup segments to shrink storage footprint and reduce network transfer time, choosing an algorithm that balances ratio against CPU cost.

  3. Benchmark before deploying to production — Validate throughput, latency, and resource consumption under realistic load before going live. Assumptions about performance are not a substitute for measurement.

  4. Monitor performance continuously and act on trends — Track throughput, compression ratios, and resource utilisation over time. Identify degradation early and address it before it becomes an incident.

  5. Go serverless / managed where possible — Prefer managed object storage (S3, GCS, Azure Blob) over self-managed block storage. Managed services scale automatically and eliminate storage infrastructure overhead.

  6. Experiment with configuration in staging before production — Test tuning changes (segment sizes, compression levels, concurrency) in a staging environment that mirrors production topology and data characteristics.


Best Practices

PE-01: Throughput Optimisation

Performance Targets

Establish clear targets for your environment and validate them through benchmarking:

| Metric | Target | Notes |
|---|---|---|
| Throughput per partition | 100+ MB/s | Depends on network, storage backend, and message size |
| Checkpoint latency (p99) | < 100 ms | Ensures minimal data loss window on failure |
| Compression ratio | 3–5x | Typical for JSON/text payloads with zstd |
| Memory usage | < 500 MB for 4 partitions | Rust-native efficiency keeps the footprint low |

Tuning Levers

  • segment_max_bytes — Set between 128 MB and 256 MB. Larger segments reduce the number of storage API calls but increase the flush latency window.
  • fetch_max_bytes — Increase to match partition throughput. Undersized fetch buffers cause excessive round trips to brokers.
  • compression — Choose zstd for the best ratio-to-speed trade-off, or lz4 when CPU is constrained.
  • max_concurrent_partitions — Scale this to match available CPU cores and network bandwidth. Each partition backup runs in its own async task.

Implementation Guidance

  • Scale horizontally — Run multiple kafka-backup instances, each handling a subset of partitions, rather than scaling a single instance vertically.
  • Co-locate with Kafka brokers — Deploy backup instances in the same availability zone as the brokers they read from to minimise network latency and cross-AZ data transfer costs.
  • Tune incrementally — Change one parameter at a time and measure the impact. Simultaneous changes make it impossible to attribute improvements or regressions.
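The horizontal-scaling guidance above can start as simply as a static partition assignment per instance. A minimal sketch (the function and its shape are illustrative, not part of kafka-backup):

```python
def assign_partitions(partitions: list[int], instances: int) -> list[list[int]]:
    """Round-robin a topic's partitions across backup instances."""
    return [partitions[i::instances] for i in range(instances)]

# Two instances, eight partitions: each instance backs up four.
print(assign_partitions(list(range(8)), 2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Round-robin keeps load roughly even when partitions carry similar traffic; skewed topics may need throughput-weighted assignment instead.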

Configuration: High-Throughput Backup

```yaml
kafka:
  bootstrap_servers: "kafka-0:9092,kafka-1:9092,kafka-2:9092"
  fetch_max_bytes: 52428800 # 50 MB

backup:
  segment_max_bytes: 268435456 # 256 MB
  compression: zstd
  compression_level: 3
  max_concurrent_partitions: 8

storage:
  type: s3
  bucket: my-kafka-backups
  region: eu-west-1
```

> **Tip:** Start with `segment_max_bytes: 134217728` (128 MB) and increase to 256 MB only after confirming that your storage backend handles larger PUT requests without timeout issues.

Anti-patterns

  • Using default configuration for high-throughput topics — Defaults are conservative. Production workloads almost always benefit from tuning segment size, fetch size, and concurrency.
  • Deploying backup in a different region from Kafka — Cross-region data transfer adds latency, cost, and fragility.
  • Using very small segments (< 32 MB) — Creates excessive storage API calls, increasing both latency and cost.
  • No benchmarking before production — Performance assumptions based on development data volumes are unreliable.

PE-02: Compression Strategy

Algorithm Comparison

| Algorithm | Compression Ratio | Speed | CPU Usage | Best For |
|---|---|---|---|---|
| zstd | 3–5x | Fast | Moderate | General-purpose default |
| lz4 | 2–3x | Very fast | Low | Latency-sensitive workloads |
| none | 1x | Fastest | None | Pre-compressed or encrypted data |

Compression Ratios by Data Format

| Data Format | Typical Ratio (zstd) | Notes |
|---|---|---|
| JSON | 5–8x | Highly repetitive structure compresses well |
| Avro | 2–4x | Binary format, moderate compressibility |
| Protobuf | 2–3x | Compact binary, less headroom for compression |
| Already compressed | < 1.1x | No benefit — adds CPU cost for no gain |

Zstd Compression Levels

| Level Range | Speed | Ratio | Use Case |
|---|---|---|---|
| 1–3 | Fast | Moderate | Recommended for real-time backup |
| 4–9 | Balanced | Higher | Batch or off-peak backup windows |
| 10+ | Slow | Maximum | Archival, where ratio matters more than speed |

> **Tip:** Levels 1–3 offer the best throughput-to-ratio trade-off for real-time backup. Only increase beyond 3 if you have measured a meaningful improvement in your specific data and can tolerate the additional CPU overhead.

Monitoring Compression

Monitor the kafka_backup_compression_ratio metric to detect changes in data compressibility over time. A sudden drop in ratio may indicate a change in message format or the introduction of pre-compressed payloads.
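If you alert via Prometheus, a rule along these lines can surface such a drop automatically (the 2x threshold, duration, and labels are illustrative; only the metric name comes from the docs above):

```yaml
groups:
  - name: kafka-backup
    rules:
      - alert: BackupCompressionRatioDropped
        # Fires if the ratio stays below 2x for 30 minutes (threshold is illustrative)
        expr: kafka_backup_compression_ratio < 2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Backup compression ratio below 2x — check for format changes or pre-compressed payloads"
```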

Anti-patterns

  • Compressing already-compressed data — Wastes CPU cycles for negligible size reduction. Disable compression for topics containing gzip, snappy, or encrypted payloads.
  • Using maximum compression levels for latency-sensitive workloads — High zstd levels (10+) can add significant CPU time per segment, increasing flush latency.
  • Never benchmarking compression — Different data formats yield vastly different ratios. Benchmark your actual payloads to choose the right algorithm and level.

PE-03: Storage Backend Selection

Backend Comparison

| Backend | Throughput | Durability | Relative Cost | Notes |
|---|---|---|---|---|
| Amazon S3 | High | 11 9's | $$ | Most mature, widest tooling ecosystem |
| Azure Blob Storage | High | 16 9's (GRS) | $$ | Native integration with Azure workloads |
| Google Cloud Storage | High | 11 9's | $$ | Strong multi-region replication |
| MinIO | High | Depends on deployment | $ | S3-compatible, self-hosted |
| Filesystem | Very high | Depends on underlying storage | Free | No network overhead, limited durability |

Production Recommendations

  • Use object storage — S3, Azure Blob, or GCS provide the durability, scalability, and lifecycle management required for production backup data.
  • Enable versioning — Protect against accidental overwrites or deletions. Versioning also supports compliance requirements for immutable backups.
  • Use Transfer Acceleration for cross-region — When restoring data to a different region, enable S3 Transfer Acceleration or equivalent to reduce transfer time.

Development Recommendations

  • Filesystem or MinIO for local development — Avoids cloud costs and network dependencies during development and testing.
  • Mirror production configuration — Use the same segment sizes, compression settings, and directory structures so that performance characteristics transfer to production.

Reduce Latency with VPC Endpoints

Deploy VPC endpoints (AWS), private endpoints (Azure), or Private Service Connect (GCP) to keep backup traffic on the cloud provider's backbone network. This eliminates internet gateway latency and reduces data transfer costs.

Configuration Examples

Amazon S3:

```yaml
storage:
  type: s3
  bucket: my-kafka-backups
  region: eu-west-1
  # Use VPC endpoint for private network access
  endpoint: https://s3.eu-west-1.amazonaws.com
```

Azure Blob Storage:

```yaml
storage:
  type: azure
  container: my-kafka-backups
  account_name: mystorageaccount
```

Filesystem (development):

```yaml
storage:
  type: filesystem
  base_path: /var/lib/kafka-backup/data
```

Anti-patterns

  • Using filesystem storage for production — Filesystem storage lacks the durability guarantees, lifecycle management, and cross-region replication of object storage.
  • No versioning on backup buckets — A single accidental deletion or overwrite can destroy your recovery capability.
  • Storing backups on a different continent from Kafka — Intercontinental data transfer adds significant latency and cost to every backup segment write.
  • No VPC endpoints — Routing backup traffic through the public internet adds latency, increases cost, and exposes data to unnecessary network hops.

PE-04: Resource Sizing & Scaling

Sizing Guidelines

| Partition Count | CPU | Memory | Network |
|---|---|---|---|
| 1–4 | 1 vCPU | 512 MB | 1 Gbps |
| 5–16 | 2 vCPU | 1 GB | 5 Gbps |
| 17–64 | 4 vCPU | 2 GB | 10 Gbps |
| 65+ | Scale horizontally | | |

> **Tip:** OSO Kafka Backup is written in Rust and is designed to use less than 500 MB of memory for 4 partitions. The sizing table above includes headroom for bursts and garbage-free operation.
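For capacity-planning scripts, the tiers can be encoded directly. A sketch with the values copied from the table (the function itself is illustrative, not part of kafka-backup):

```python
def backup_sizing(partitions: int) -> dict:
    """Map a partition count to the sizing tier from the table above."""
    if partitions <= 0:
        raise ValueError("partition count must be positive")
    if partitions <= 4:
        return {"vcpu": 1, "memory_mb": 512, "network_gbps": 1}
    if partitions <= 16:
        return {"vcpu": 2, "memory_mb": 1024, "network_gbps": 5}
    if partitions <= 64:
        return {"vcpu": 4, "memory_mb": 2048, "network_gbps": 10}
    raise ValueError("65+ partitions: scale horizontally across multiple instances")

print(backup_sizing(12))  # {'vcpu': 2, 'memory_mb': 1024, 'network_gbps': 5}
```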

Kubernetes Resource Configuration

Set both requests and limits to ensure predictable scheduling and prevent noisy-neighbour effects:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-backup
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-backup
  template:
    metadata:
      labels:
        app: kafka-backup
    spec:
      containers:
        - name: kafka-backup
          image: ghcr.io/osodevops/kafka-backup:latest
          resources:
            requests:
              cpu: "1"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "1Gi"
          volumeMounts:
            - name: config
              mountPath: /etc/kafka-backup
      volumes:
        - name: config
          configMap:
            name: kafka-backup-config
```

Horizontal Pod Autoscaling

Scale based on backup lag to ensure partitions are covered as throughput increases:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-backup-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-backup
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: kafka_backup_lag_bytes
        target:
          type: AverageValue
          averageValue: "104857600" # 100 MB average lag
```

Anti-patterns

  • No resource limits — Without limits, a misbehaving backup instance can consume all node resources and affect other workloads.
  • No resource requests — Without requests, the Kubernetes scheduler cannot make informed placement decisions, leading to resource contention.
  • Vertical scaling only — A single large instance has a failure blast radius that affects all partitions. Horizontal scaling isolates failures.
  • Running backup on Kafka broker nodes — Backup I/O competes with broker I/O for disk and network bandwidth, degrading both.

PE-05: Network Optimisation

Co-location

Deploy backup instances in the same availability zone as the Kafka brokers they read from. This provides the lowest latency and eliminates cross-AZ data transfer charges.

> **Warning:** Cross-AZ data transfer in AWS costs $0.01/GB in each direction. For a topic producing 1 TB/day, this adds approximately $600/month in transfer costs alone.
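The arithmetic behind that figure, as a quick sketch (1 TB/day taken as 1,000 GB):

```python
gb_per_day = 1000        # 1 TB/day of topic traffic
cost_per_gb = 0.01       # USD per GB, charged in each direction
monthly_cost = gb_per_day * 30 * cost_per_gb * 2  # both transfer directions

print(f"${monthly_cost:.0f}/month")  # $600/month
```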

Private Network Paths

  • VPC endpoints — Use gateway or interface VPC endpoints for S3 and other storage services to keep traffic off the public internet.
  • Rack awareness — Configure client.rack so that Kafka directs fetch requests to the closest replica, reducing cross-rack and cross-AZ traffic.

Fetch Size Optimisation

Increase fetch_max_bytes to reduce the number of fetch round trips:

```yaml
kafka:
  fetch_max_bytes: 52428800 # 50 MB
  fetch_max_wait_ms: 500
```

> **Tip:** Larger fetch sizes amortise the per-request overhead but increase memory usage. Monitor memory consumption when tuning fetch sizes upward.
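To see why fetch size matters, compare round-trip counts at the 100 MB/s per-partition target (a back-of-the-envelope sketch, ignoring broker-side batching effects):

```python
def round_trips_per_second(throughput_mb_s: float, fetch_mb: float) -> float:
    """Broker fetches needed per second at a given fetch size."""
    return throughput_mb_s / fetch_mb

# Small 1 MB fetches vs the 50 MB setting above
print(round_trips_per_second(100, 1))   # 100.0
print(round_trips_per_second(100, 50))  # 2.0
```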

Cross-Region Restores

When restoring data to a different region:

  • Enable S3 Transfer Acceleration or equivalent service to optimise long-distance transfers.
  • Consider pre-staging backup data to the target region using storage replication before initiating the restore.

Anti-patterns

  • Backup in a different region from Kafka — Adds latency to every fetch request and segment upload, reducing throughput and increasing costs.
  • Small fetch sizes with high-throughput topics — Causes excessive round trips to brokers, wasting network capacity on request overhead.
  • Shared network without QoS — Backup traffic can saturate shared links, affecting production Kafka clients.
  • No consideration for cross-AZ costs — Multi-AZ deployments are resilient but expensive. Understand the cost trade-off and optimise placement accordingly.

PE-06: Benchmarking & Load Testing

Benchmark Suite

Use the kafka-backup-demos benchmark suite to measure performance under controlled conditions.

```bash
# Clone the benchmark suite
git clone https://github.com/osodevops/kafka-backup-demos.git
cd kafka-backup-demos

# Run the throughput benchmark
./benchmarks/run-throughput-test.sh --partitions 4 --message-size 1024 --duration 300
```
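When reading results, the headline number is sustained throughput over the whole window. A minimal helper for deriving it from raw byte counters (names and sample figures are illustrative):

```python
def throughput_mb_per_s(bytes_transferred: int, duration_s: float) -> float:
    """Sustained throughput in MB/s over the whole test window."""
    return bytes_transferred / duration_s / (1024 * 1024)

# Example: 4 partitions, 5-minute run, ~126 GiB backed up in total
total = throughput_mb_per_s(135_291_469_824, 300)
print(f"{total:.0f} MB/s total, {total / 4:.0f} MB/s per partition")
```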

Benchmark Scenarios

Run the following scenarios to build a complete performance profile:

| Scenario | Purpose |
|---|---|
| Max throughput per partition | Establish single-partition ceiling |
| Multi-partition scaling | Validate linear scaling with partition count |
| Large messages (> 1 MB) | Identify buffer and timeout issues |
| Restore speed | Measure time-to-recovery for capacity planning |
| WAN latency simulation | Understand cross-region performance impact |

Record Baselines

For each scenario, record:

  • Throughput — MB/s sustained over the test duration
  • Duration — Total time to back up or restore the test dataset
  • Resource utilisation — CPU, memory, network, and disk I/O during the test

Store baselines in version control alongside your backup configuration so they can be compared over time.
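Comparing new runs against those stored baselines can then be automated in CI. A sketch with a 10% regression tolerance (metric names and thresholds are illustrative):

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Return metrics that fell more than `tolerance` below their baseline."""
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base * (1 - tolerance)
    ]

baseline = {"throughput_mb_s": 110.0, "restore_mb_s": 95.0}
current = {"throughput_mb_s": 92.0, "restore_mb_s": 96.0}
print(regressions(baseline, current))  # ['throughput_mb_s']
```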

When to Re-run Benchmarks

  • After version upgrades of kafka-backup
  • After configuration changes to segment size, compression, or concurrency
  • After infrastructure changes such as instance type, network topology, or storage backend
  • Quarterly as a routine check against baseline drift
> **Warning:** Benchmarks run on small data volumes (< 1 GB) do not represent production performance. Use datasets that are at least 10x the segment size and run for a minimum of 5 minutes to capture steady-state behaviour.

Anti-patterns

Anti-patterns
  • No benchmarking before production — Deploying without performance validation is deploying blind. Every environment has unique characteristics that affect throughput.
  • Benchmarking with unrealistically small data volumes — Small datasets fit entirely in OS page cache, producing misleadingly high throughput numbers.
  • Benchmarking backup only, not restore — Restore performance is equally critical and often has different bottlenecks (e.g., Kafka producer throughput to the target cluster).
  • Benchmarking on different hardware than production — Results from a developer laptop do not predict production performance.

Review Questions

Use the following questions during architecture reviews to assess performance efficiency:

  1. Have you established throughput targets (MB/s per partition) and validated them through benchmarking?
  2. Is compression enabled, and have you chosen the algorithm and level based on your data format?
  3. Is the storage backend appropriate for your durability, throughput, and cost requirements?
  4. Are resource requests and limits set for all backup workloads running in Kubernetes?
  5. Are backup instances co-located with Kafka brokers in the same availability zone?
  6. Are VPC endpoints or private network paths used for storage access?
  7. Have you benchmarked restore performance in addition to backup performance?
  8. Do you re-run benchmarks after configuration changes, version upgrades, and on a quarterly schedule?
  9. Is horizontal scaling used instead of vertical scaling for high partition counts?
  10. Are you monitoring kafka_backup_compression_ratio, throughput, and resource utilisation continuously?

Resources