
Performance Efficiency

"Using compute, storage, and network resources efficiently to meet backup and restore throughput requirements, and maintaining that efficiency as data volumes grow."

Kafka backup workloads must keep pace with production data rates while consuming the minimum necessary resources. The Performance Efficiency pillar ensures your backup architecture is right-sized, properly tuned, and continuously benchmarked — so that growing data volumes never outstrip your ability to protect them.

Design Principles

  1. Right-size for throughput requirements — Match compute, memory, and network capacity to measured data rates rather than guessing. Over-provisioning wastes budget; under-provisioning causes backup lag.

  2. Use compression to reduce storage and network overhead — Compress backup segments to shrink storage footprint and reduce network transfer time, choosing an algorithm that balances ratio against CPU cost.

  3. Benchmark before deploying to production — Validate throughput, latency, and resource consumption under realistic load before going live. Assumptions about performance are not a substitute for measurement.

  4. Monitor performance continuously and act on trends — Track throughput, compression ratios, and resource utilisation over time. Identify degradation early and address it before it becomes an incident.

  5. Go serverless / managed where possible — Prefer managed object storage (S3, GCS, Azure Blob) over self-managed block storage. Managed services scale automatically and eliminate storage infrastructure overhead.

  6. Experiment with configuration in staging before production — Test tuning changes (segment sizes, compression levels, concurrency) in a staging environment that mirrors production topology and data characteristics.


Best Practices

PE-01: Throughput Optimisation

Performance Targets

Establish clear targets for your environment and validate them through benchmarking:

| Metric | Target | Notes |
|---|---|---|
| Throughput per partition | 100+ MB/s | Depends on network, storage backend, and message size |
| Checkpoint latency (p99) | < 100 ms | Ensures minimal data loss window on failure |
| Compression ratio | 3–5x | Typical for JSON/text payloads with zstd |
| Memory usage | < 500 MB for 4 partitions | Rust-native efficiency keeps the footprint low |

Tuning Levers

  • segment_max_bytes — Set between 128 MB and 256 MB. Larger segments reduce the number of storage API calls but increase the flush latency window.
  • fetch_max_bytes — Increase to match partition throughput. Undersized fetch buffers cause excessive round trips to brokers.
  • compression — Choose zstd for the best ratio-to-speed trade-off, or lz4 when CPU is constrained.
  • max_concurrent_partitions — Scale this to match available CPU cores and network bandwidth. Each partition backup runs in its own async task.

Implementation Guidance

  • Scale horizontally — Run multiple kafka-backup instances, each handling a subset of partitions, rather than scaling a single instance vertically.
  • Co-locate with Kafka brokers — Deploy backup instances in the same availability zone as the brokers they read from to minimise network latency and cross-AZ data transfer costs.
  • Tune incrementally — Change one parameter at a time and measure the impact. Simultaneous changes make it impossible to attribute improvements or regressions.
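The horizontal-scaling guidance above can start as simply as a static partition assignment per instance. A minimal sketch (the function and its shape are illustrative, not part of kafka-backup):

```python
def assign_partitions(partitions: list[int], instances: int) -> list[list[int]]:
    """Round-robin a topic's partitions across backup instances."""
    return [partitions[i::instances] for i in range(instances)]

# Two instances, eight partitions: each instance backs up four.
print(assign_partitions(list(range(8)), 2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Round-robin keeps load roughly even when partitions carry similar traffic; skewed topics may need throughput-weighted assignment instead.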

Configuration: High-Throughput Backup

```yaml
kafka:
  bootstrap_servers: "kafka-0:9092,kafka-1:9092,kafka-2:9092"
  fetch_max_bytes: 52428800 # 50 MB

backup:
  segment_max_bytes: 268435456 # 256 MB
  compression: zstd
  compression_level: 3
  max_concurrent_partitions: 8

storage:
  type: s3
  bucket: my-kafka-backups
  region: eu-west-1
```

> **Tip:** Start with `segment_max_bytes: 134217728` (128 MB) and increase to 256 MB only after confirming that your storage backend handles larger PUT requests without timeout issues.

Anti-patterns

  • Using default configuration for high-throughput topics — Defaults are conservative. Production workloads almost always benefit from tuning segment size, fetch size, and concurrency.
  • Deploying backup in a different region from Kafka — Cross-region data transfer adds latency, cost, and fragility.
  • Using very small segments (< 32 MB) — Creates excessive storage API calls, increasing both latency and cost.
  • No benchmarking before production — Performance assumptions based on development data volumes are unreliable.

PE-02: Compression Strategy

Algorithm Comparison

| Algorithm | Compression Ratio | Speed | CPU Usage | Best For |
|---|---|---|---|---|
| zstd | 3–5x | Fast | Moderate | General-purpose default |
| lz4 | 2–3x | Very fast | Low | Latency-sensitive workloads |
| none | 1x | Fastest | None | Pre-compressed or encrypted data |

Compression Ratios by Data Format

| Data Format | Typical Ratio (zstd) | Notes |
|---|---|---|
| JSON | 5–8x | Highly repetitive structure compresses well |
| Avro | 2–4x | Binary format, moderate compressibility |
| Protobuf | 2–3x | Compact binary, less headroom for compression |
| Already compressed | < 1.1x | No benefit — adds CPU cost for no gain |

Zstd Compression Levels

| Level Range | Speed | Ratio | Use Case |
|---|---|---|---|
| 1–3 | Fast | Moderate | Recommended for real-time backup |
| 4–9 | Balanced | Higher | Batch or off-peak backup windows |
| 10+ | Slow | Maximum | Archival, where ratio matters more than speed |

> **Tip:** Levels 1–3 offer the best throughput-to-ratio trade-off for real-time backup. Only increase beyond 3 if you have measured a meaningful improvement in your specific data and can tolerate the additional CPU overhead.

Monitoring Compression

Monitor the kafka_backup_compression_ratio metric to detect changes in data compressibility over time. A sudden drop in ratio may indicate a change in message format or the introduction of pre-compressed payloads.
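If you alert via Prometheus, a rule along these lines can surface such a drop automatically (the 2x threshold, duration, and labels are illustrative; only the metric name comes from the docs above):

```yaml
groups:
  - name: kafka-backup
    rules:
      - alert: BackupCompressionRatioDropped
        # Fires if the ratio stays below 2x for 30 minutes (threshold is illustrative)
        expr: kafka_backup_compression_ratio < 2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Backup compression ratio below 2x — check for format changes or pre-compressed payloads"
```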

Anti-patterns

  • Compressing already-compressed data — Wastes CPU cycles for negligible size reduction. Disable compression for topics containing gzip, snappy, or encrypted payloads.
  • Using maximum compression levels for latency-sensitive workloads — High zstd levels (10+) can add significant CPU time per segment, increasing flush latency.
  • Never benchmarking compression — Different data formats yield vastly different ratios. Benchmark your actual payloads to choose the right algorithm and level.

PE-03: Storage Backend Selection

Backend Comparison

| Backend | Throughput | Durability | Relative Cost | Notes |
|---|---|---|---|---|
| Amazon S3 | High | 11 9's | $$ | Most mature, widest tooling ecosystem |
| Azure Blob Storage | High | 16 9's (GRS) | $$ | Native integration with Azure workloads |
| Google Cloud Storage | High | 11 9's | $$ | Strong multi-region replication |
| MinIO | High | Depends on deployment | $ | S3-compatible, self-hosted |
| Filesystem | Very high | Depends on underlying storage | Free | No network overhead, limited durability |

Production Recommendations

  • Use object storage — S3, Azure Blob, or GCS provide the durability, scalability, and lifecycle management required for production backup data.
  • Enable versioning — Protect against accidental overwrites or deletions. Versioning also supports compliance requirements for immutable backups.
  • Use Transfer Acceleration for cross-region — When restoring data to a different region, enable S3 Transfer Acceleration or equivalent to reduce transfer time.

Development Recommendations

  • Filesystem or MinIO for local development — Avoids cloud costs and network dependencies during development and testing.
  • Mirror production configuration — Use the same segment sizes, compression settings, and directory structures so that performance characteristics transfer to production.

Reduce Latency with VPC Endpoints

Deploy VPC endpoints (AWS), private endpoints (Azure), or Private Service Connect (GCP) to keep backup traffic on the cloud provider's backbone network. This eliminates internet gateway latency and reduces data transfer costs.

Configuration Examples

Amazon S3:

```yaml
storage:
  type: s3
  bucket: my-kafka-backups
  region: eu-west-1
  # Use VPC endpoint for private network access
  endpoint: https://s3.eu-west-1.amazonaws.com
```

Azure Blob Storage:

```yaml
storage:
  type: azure
  container: my-kafka-backups
  account_name: mystorageaccount
```

Filesystem (development):

```yaml
storage:
  type: filesystem
  base_path: /var/lib/kafka-backup/data
```

Anti-patterns

  • Using filesystem storage for production — Filesystem storage lacks the durability guarantees, lifecycle management, and cross-region replication of object storage.
  • No versioning on backup buckets — A single accidental deletion or overwrite can destroy your recovery capability.
  • Storing backups on a different continent from Kafka — Intercontinental data transfer adds significant latency and cost to every backup segment write.
  • No VPC endpoints — Routing backup traffic through the public internet adds latency, increases cost, and exposes data to unnecessary network hops.

PE-04: Resource Sizing & Scaling

Sizing Guidelines

| Partition Count | CPU | Memory | Network |
|---|---|---|---|
| 1–4 | 1 vCPU | 512 MB | 1 Gbps |
| 5–16 | 2 vCPU | 1 GB | 5 Gbps |
| 17–64 | 4 vCPU | 2 GB | 10 Gbps |
| 65+ | Scale horizontally | | |

> **Tip:** OSO Kafka Backup is written in Rust and is designed to use less than 500 MB of memory for 4 partitions. The sizing table above includes headroom for bursts and garbage-free operation.
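For capacity-planning scripts, the tiers can be encoded directly. A sketch with the values copied from the table (the function itself is illustrative, not part of kafka-backup):

```python
def backup_sizing(partitions: int) -> dict:
    """Map a partition count to the sizing tier from the table above."""
    if partitions <= 0:
        raise ValueError("partition count must be positive")
    if partitions <= 4:
        return {"vcpu": 1, "memory_mb": 512, "network_gbps": 1}
    if partitions <= 16:
        return {"vcpu": 2, "memory_mb": 1024, "network_gbps": 5}
    if partitions <= 64:
        return {"vcpu": 4, "memory_mb": 2048, "network_gbps": 10}
    raise ValueError("65+ partitions: scale horizontally across multiple instances")

print(backup_sizing(12))  # {'vcpu': 2, 'memory_mb': 1024, 'network_gbps': 5}
```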

Kubernetes Resource Configuration

Set both requests and limits to ensure predictable scheduling and prevent noisy-neighbour effects:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-backup
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-backup
  template:
    metadata:
      labels:
        app: kafka-backup
    spec:
      containers:
        - name: kafka-backup
          image: ghcr.io/osodevops/kafka-backup:latest
          resources:
            requests:
              cpu: "1"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "1Gi"
          volumeMounts:
            - name: config
              mountPath: /etc/kafka-backup
      volumes:
        - name: config
          configMap:
            name: kafka-backup-config
```

Horizontal Pod Autoscaling

Scale based on backup lag to ensure partitions are covered as throughput increases:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-backup-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-backup
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: kafka_backup_lag_bytes
        target:
          type: AverageValue
          averageValue: "104857600" # 100 MB average lag
```

Anti-patterns

  • No resource limits — Without limits, a misbehaving backup instance can consume all node resources and affect other workloads.
  • No resource requests — Without requests, the Kubernetes scheduler cannot make informed placement decisions, leading to resource contention.
  • Vertical scaling only — A single large instance has a failure blast radius that affects all partitions. Horizontal scaling isolates failures.
  • Running backup on Kafka broker nodes — Backup I/O competes with broker I/O for disk and network bandwidth, degrading both.

PE-05: Network Optimisation

Co-location

Deploy backup instances in the same availability zone as the Kafka brokers they read from. This provides the lowest latency and eliminates cross-AZ data transfer charges.

> **Warning:** Cross-AZ data transfer in AWS costs $0.01/GB in each direction. For a topic producing 1 TB/day, this adds approximately $600/month in transfer costs alone.
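The arithmetic behind that figure, as a quick sketch (1 TB/day taken as 1,000 GB):

```python
gb_per_day = 1000        # 1 TB/day of topic traffic
cost_per_gb = 0.01       # USD per GB, charged in each direction
monthly_cost = gb_per_day * 30 * cost_per_gb * 2  # both transfer directions

print(f"${monthly_cost:.0f}/month")  # $600/month
```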

Private Network Paths

  • VPC endpoints — Use gateway or interface VPC endpoints for S3 and other storage services to keep traffic off the public internet.
  • Rack awareness — Configure client.rack so that Kafka directs fetch requests to the closest replica, reducing cross-rack and cross-AZ traffic.

Fetch Size Optimisation

Increase fetch_max_bytes to reduce the number of fetch round trips:

```yaml
kafka:
  fetch_max_bytes: 52428800 # 50 MB
  fetch_max_wait_ms: 500
```

> **Tip:** Larger fetch sizes amortise the per-request overhead but increase memory usage. Monitor memory consumption when tuning fetch sizes upward.
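To see why fetch size matters, compare round-trip counts at the 100 MB/s per-partition target (a back-of-the-envelope sketch, ignoring broker-side batching effects):

```python
def round_trips_per_second(throughput_mb_s: float, fetch_mb: float) -> float:
    """Broker fetches needed per second at a given fetch size."""
    return throughput_mb_s / fetch_mb

# Small 1 MB fetches vs the 50 MB setting above
print(round_trips_per_second(100, 1))   # 100.0
print(round_trips_per_second(100, 50))  # 2.0
```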

Cross-Region Restores

When restoring data to a different region:

  • Enable S3 Transfer Acceleration or equivalent service to optimise long-distance transfers.
  • Consider pre-staging backup data to the target region using storage replication before initiating the restore.

Anti-patterns

  • Backup in a different region from Kafka — Adds latency to every fetch request and segment upload, reducing throughput and increasing costs.
  • Small fetch sizes with high-throughput topics — Causes excessive round trips to brokers, wasting network capacity on request overhead.
  • Shared network without QoS — Backup traffic can saturate shared links, affecting production Kafka clients.
  • No consideration for cross-AZ costs — Multi-AZ deployments are resilient but expensive. Understand the cost trade-off and optimise placement accordingly.

PE-06: Benchmarking & Load Testing

Benchmark Suite

Use the kafka-backup-demos benchmark suite to measure performance under controlled conditions.

```bash
# Clone the benchmark suite
git clone https://github.com/osodevops/kafka-backup-demos.git
cd kafka-backup-demos

# Run the throughput benchmark
./benchmarks/run-throughput-test.sh --partitions 4 --message-size 1024 --duration 300
```
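When reading results, the headline number is sustained throughput over the whole window. A minimal helper for deriving it from raw byte counters (names and sample figures are illustrative):

```python
def throughput_mb_per_s(bytes_transferred: int, duration_s: float) -> float:
    """Sustained throughput in MB/s over the whole test window."""
    return bytes_transferred / duration_s / (1024 * 1024)

# Example: 4 partitions, 5-minute run, ~126 GiB backed up in total
total = throughput_mb_per_s(135_291_469_824, 300)
print(f"{total:.0f} MB/s total, {total / 4:.0f} MB/s per partition")
```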

Benchmark Scenarios

Run the following scenarios to build a complete performance profile:

| Scenario | Purpose |
|---|---|
| Max throughput per partition | Establish single-partition ceiling |
| Multi-partition scaling | Validate linear scaling with partition count |
| Large messages (> 1 MB) | Identify buffer and timeout issues |
| Restore speed | Measure time-to-recovery for capacity planning |
| WAN latency simulation | Understand cross-region performance impact |

Record Baselines

For each scenario, record:

  • Throughput — MB/s sustained over the test duration
  • Duration — Total time to back up or restore the test dataset
  • Resource utilisation — CPU, memory, network, and disk I/O during the test

Store baselines in version control alongside your backup configuration so they can be compared over time.
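Comparing new runs against those stored baselines can then be automated in CI. A sketch with a 10% regression tolerance (metric names and thresholds are illustrative):

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    """Return metrics that fell more than `tolerance` below their baseline."""
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base * (1 - tolerance)
    ]

baseline = {"throughput_mb_s": 110.0, "restore_mb_s": 95.0}
current = {"throughput_mb_s": 92.0, "restore_mb_s": 96.0}
print(regressions(baseline, current))  # ['throughput_mb_s']
```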

When to Re-run Benchmarks

  • After version upgrades of kafka-backup
  • After configuration changes to segment size, compression, or concurrency
  • After infrastructure changes such as instance type, network topology, or storage backend
  • Quarterly as a routine check against baseline drift
> **Warning:** Benchmarks run on small data volumes (< 1 GB) do not represent production performance. Use datasets that are at least 10x the segment size and run for a minimum of 5 minutes to capture steady-state behaviour.

Anti-patterns

Anti-patterns
  • No benchmarking before production — Deploying without performance validation is deploying blind. Every environment has unique characteristics that affect throughput.
  • Benchmarking with unrealistically small data volumes — Small datasets fit entirely in OS page cache, producing misleadingly high throughput numbers.
  • Benchmarking backup only, not restore — Restore performance is equally critical and often has different bottlenecks (e.g., Kafka producer throughput to the target cluster).
  • Benchmarking on different hardware than production — Results from a developer laptop do not predict production performance.

Review Questions

Use the following questions during architecture reviews to assess performance efficiency:

  1. Have you established throughput targets (MB/s per partition) and validated them through benchmarking?
  2. Is compression enabled, and have you chosen the algorithm and level based on your data format?
  3. Is the storage backend appropriate for your durability, throughput, and cost requirements?
  4. Are resource requests and limits set for all backup workloads running in Kubernetes?
  5. Are backup instances co-located with Kafka brokers in the same availability zone?
  6. Are VPC endpoints or private network paths used for storage access?
  7. Have you benchmarked restore performance in addition to backup performance?
  8. Do you re-run benchmarks after configuration changes, version upgrades, and on a quarterly schedule?
  9. Is horizontal scaling used instead of vertical scaling for high partition counts?
  10. Are you monitoring kafka_backup_compression_ratio, throughput, and resource utilisation continuously?

Resources