# Performance Issues

This guide helps diagnose and resolve slow backup and restore operations.

## Diagnosing Performance Problems

### Check Current Throughput
```bash
# Monitor backup progress in real-time
kafka-backup status --config backup.yaml --watch

# Example output:
# ================================================================
# OSO Kafka Backup - Live Status
# ================================================================
# Backup ID: production-backup    Uptime: 00:15:32
# Status: RUNNING
# ================================================================
# Progress
# |- Records:    1,234,567
# |- Bytes:      256.0 MB (compressed)
# |- Throughput: 15234 rec/s | 3.2 MB/s
# |- Lag:        45,000 records (orders-0)
# ================================================================
# Components
# |- kafka:   [OK] ok
# |- storage: [OK] ok
# ================================================================
# Compression: 3.2x ratio | Errors: 0
# ================================================================

# One-shot status (no continuous refresh)
kafka-backup status --config backup.yaml

# Custom refresh interval (5 seconds)
kafka-backup status --config backup.yaml --watch --interval 5
```
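If you are sampling the status output by hand rather than watching it, average throughput is just deltas over time. A small Python sketch (not part of the tool; the counts mirror the `Records`/`Bytes` fields shown above):

```python
# Compute average throughput from two status snapshots taken some
# seconds apart. Pure arithmetic: delta of counters over elapsed time.

def throughput(records_start: int, records_end: int,
               bytes_start: int, bytes_end: int,
               elapsed_s: float) -> tuple[float, float]:
    """Return (records/s, MB/s) averaged over the sampling window."""
    rec_rate = (records_end - records_start) / elapsed_s
    mb_rate = (bytes_end - bytes_start) / elapsed_s / (1024 * 1024)
    return rec_rate, mb_rate

# Two snapshots taken 60 s apart:
rec_s, mb_s = throughput(1_234_567, 2_148_607,
                         256 * 2**20, 448 * 2**20, 60.0)
print(f"{rec_s:.0f} rec/s, {mb_s:.1f} MB/s")  # 15234 rec/s, 3.2 MB/s
```

Averaging over a minute or more smooths out the burstiness you would see in the live `--watch` view.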
### Expected Performance

| Operation | Per Partition | With 10 Partitions |
|---|---|---|
| Backup | 50-100 MB/s | 500 MB/s - 1 GB/s |
| Restore | 75-150 MB/s | 750 MB/s - 1.5 GB/s |
If you're seeing significantly lower numbers, continue troubleshooting.
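The table above also gives you a quick way to sanity-check backup duration. A rough estimate, assuming partitions are backed up in parallel with data spread evenly (the 75 MB/s default below is the low end of the restore row; substitute your own measured rate):

```python
# Rough wall-clock estimate: aggregate throughput scales with partition
# count until another resource (network, storage, CPU) saturates.

def estimated_minutes(total_gb: float, partitions: int,
                      mb_per_s_per_partition: float = 75.0) -> float:
    """Estimate minutes to move total_gb across evenly loaded partitions."""
    aggregate_mb_s = partitions * mb_per_s_per_partition
    return (total_gb * 1024) / aggregate_mb_s / 60

# 500 GB across 10 partitions at ~75 MB/s each:
print(f"{estimated_minutes(500, 10):.1f} min")  # 11.4 min
```

If your real runs take several times this estimate, one of the bottlenecks below is likely in play.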
## Backup Performance

### Problem: Slow Kafka Consumption

**Symptoms:**
- Low records/sec
- High CPU idle
- Storage I/O is fine

**Diagnosis:**

```bash
# Check consumer lag for the backup's consumer group
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --describe --group kafka-backup-$BACKUP_ID
```
**Solutions:**

Increase fetch sizes:

```yaml
source:
  kafka_config:
    fetch.max.bytes: 104857600            # 100 MB
    max.partition.fetch.bytes: 10485760   # 10 MB per partition
    fetch.min.bytes: 1048576              # 1 MB minimum
    fetch.max.wait.ms: 500                # Wait for batches
```
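Larger fetch sizes trade memory for throughput, so it is worth sizing the consumer's worst-case buffer before raising them. A sketch of the arithmetic, using the values from the config above (an upper bound: each assigned partition can hold up to `max.partition.fetch.bytes` in flight):

```python
# Approximate worst-case fetch buffering for one consumer:
# each assigned partition may buffer up to max.partition.fetch.bytes.

def fetch_buffer_mb(partitions: int,
                    max_partition_fetch_bytes: int = 10_485_760) -> float:
    """Upper bound on per-consumer fetch buffer, in MB."""
    return partitions * max_partition_fetch_bytes / (1024 * 1024)

# Backing up 50 partitions with 10 MB per-partition fetches:
print(f"{fetch_buffer_mb(50):.0f} MB")  # 500 MB
```

If that number approaches the process heap, raise throughput via `fetch.min.bytes`/`fetch.max.wait.ms` batching instead of per-partition size.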
### Problem: Network Bottleneck

**Symptoms:**
- Throughput limited regardless of settings
- Network utilization at 100%

**Diagnosis:**

```bash
# Check network utilization
iftop -i eth0

# Test network bandwidth to a broker
iperf3 -c kafka-broker-0 -p 5201
```
**Solutions:**

Compress before transfer:

```yaml
backup:
  compression: zstd
  compression_level: 1  # Fast compression
```

Use a storage region close to the cluster:

```yaml
storage:
  backend: s3
  region: us-west-2  # Same region as Kafka
```

Enable VPC endpoints to keep traffic off the public internet:

```yaml
storage:
  backend: s3
  endpoint: https://s3.us-west-2.amazonaws.com
  use_vpc_endpoint: true
```
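Why compressing before transfer helps on a saturated link: with an N:1 compression ratio, the link carries N bytes of logical data per byte on the wire. A quick calculation, reusing the 3.2x ratio from the status output earlier:

```python
# Effective (uncompressed) throughput through a bandwidth-limited link.

def effective_mb_s(link_mb_s: float, compression_ratio: float) -> float:
    """Logical MB/s pushed through a link carrying compressed data."""
    return link_mb_s * compression_ratio

# A 125 MB/s (1 Gbit/s) link with 3.2x compression:
print(f"{effective_mb_s(125, 3.2):.0f} MB/s")  # 400 MB/s of logical data
```

This is why a fast level-1 codec usually beats sending raw data when the network, not CPU, is the bottleneck.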
### Problem: Slow Compression

**Symptoms:**
- High CPU utilization
- Compression taking most of the time

**Diagnosis:**

```bash
# Check CPU usage of the backup process
top -p $(pgrep kafka-backup)

# Check compression metrics
curl localhost:9090/metrics | grep compression
```
**Solutions:**

Use a faster codec:

```yaml
backup:
  compression: lz4  # Faster than zstd
```

Or keep zstd but reduce the level:

```yaml
backup:
  compression: zstd
  compression_level: 1  # Fastest
```
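The trade-off is easy to see empirically. The sketch below uses stdlib `zlib` as a stand-in, since `lz4`/`zstd` Python bindings may not be installed; the shape of the trade-off (lower level = faster, slightly larger output) is the same for kafka-backup's codecs:

```python
# Illustrate the compression speed/ratio trade-off with stdlib zlib.
import time
import zlib

data = b"order-id,customer,amount\n" * 100_000  # compressible sample payload

for level in (1, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: {len(data) / len(out):.1f}x ratio "
          f"in {dt * 1000:.1f} ms")
```

On a backup that is CPU-bound, a few percent of ratio is usually worth trading for a several-fold speedup.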
### Problem: Slow Storage Writes

**Symptoms:**
- Low storage throughput
- High storage latency

**Diagnosis:**

```bash
# Test S3 write speed with a 100 MB upload
dd if=/dev/zero bs=1M count=100 | aws s3 cp - s3://bucket/test-file

# Also check the S3 request metrics in CloudWatch
```
**Solutions:**

Optimize multipart uploads:

```yaml
storage:
  backend: s3
  multipart_threshold: 52428800   # 50 MB
  multipart_part_size: 10485760   # 10 MB
  max_concurrent_uploads: 10
```
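When tuning these values, two numbers matter: part count (request overhead per object) and concurrent in-flight parts (memory). A sizing sketch using the values from the config above:

```python
# Multipart-upload sizing: parts per object and peak in-flight bytes.

def multipart_stats(object_bytes: int,
                    part_size: int = 10_485_760,   # multipart_part_size
                    concurrency: int = 10) -> tuple[int, float]:
    """Return (number of parts, peak in-flight MB) for one object."""
    parts = -(-object_bytes // part_size)  # ceiling division
    in_flight_mb = min(parts, concurrency) * part_size / (1024 * 1024)
    return parts, in_flight_mb

parts, mem = multipart_stats(1 * 1024**3)  # a 1 GB backup segment
print(f"{parts} parts, ~{mem:.0f} MB in flight")  # 103 parts, ~100 MB in flight
```

Bigger parts mean fewer requests but more memory per concurrent upload; raise `max_concurrent_uploads` only if memory headroom allows.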
### Problem: Single Partition Bottleneck

**Symptoms:**
- One partition much slower than the others
- Uneven partition sizes

**Diagnosis:**

```bash
# Check on-disk partition sizes across brokers
kafka-log-dirs --bootstrap-server kafka:9092 --describe
```

**Solutions:**
- Long term: choose partition keys that distribute records more evenly
- Short term: accept a longer backup window for skewed partitions
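Why one hot partition stretches the whole run: partitions are processed in parallel, so wall-clock time tracks the largest partition, not the average. A sketch of the effect (the 75 MB/s per-partition rate is an illustrative assumption):

```python
# Wall-clock backup time is dominated by the largest partition when
# partitions are read in parallel at a fixed per-partition rate.

def backup_minutes(partition_gb: list[float],
                   mb_s_per_partition: float = 75.0) -> float:
    """Estimate minutes: largest partition / per-partition throughput."""
    return max(partition_gb) * 1024 / mb_s_per_partition / 60

balanced = [10.0] * 10        # 100 GB, evenly spread
skewed = [55.0] + [5.0] * 9   # same 100 GB, one hot partition
print(f"balanced: {backup_minutes(balanced):.1f} min, "
      f"skewed: {backup_minutes(skewed):.1f} min")
# balanced: 2.3 min, skewed: 12.5 min
```

Same total data, roughly 5x the backup window; fixing key distribution is the only durable cure.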