Metrics Reference
OSO Kafka Backup exposes Prometheus/OpenMetrics metrics for monitoring backup and restore operations.
Metrics Endpoint
When running a backup or restore, metrics are exposed at:
http://localhost:8080/metrics
The endpoint can be configured via the metrics section in your config file:
metrics:
  enabled: true
  port: 8080
  bind_address: "0.0.0.0"
  path: "/metrics"
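To make the relationship between the config keys and the resulting URL concrete, here is a minimal Python sketch. The `metrics_url` helper is hypothetical and not part of OSO Kafka Backup; it assumes the `metrics` section has already been parsed into a dict, and it uses the defaults listed in the Configuration Reference.

```python
from typing import Optional


def metrics_url(metrics_cfg: dict, host: str = "localhost") -> Optional[str]:
    """Build the scrape URL for a parsed `metrics` config section.

    Hypothetical helper for illustration. Returns None when metrics are disabled.
    """
    if not metrics_cfg.get("enabled", True):
        return None
    port = metrics_cfg.get("port", 8080)
    path = metrics_cfg.get("path", "/metrics")
    bind = metrics_cfg.get("bind_address", "0.0.0.0")
    # A wildcard bind address is not a dialable host, so substitute
    # the host name the client actually connects to.
    target = host if bind in ("0.0.0.0", "::") else bind
    return f"http://{target}:{port}{path}"


print(metrics_url({"enabled": True, "port": 8080,
                   "bind_address": "0.0.0.0", "path": "/metrics"}))
```

With the configuration shown above, this yields the default endpoint, http://localhost:8080/metrics.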
Quick Start with Docker
A complete monitoring stack is provided for local development and testing:
cd docker
docker-compose -f docker-compose.metrics.yml up -d
This starts:
- Prometheus at http://localhost:9090
- Grafana at http://localhost:3000 (admin/admin)
- Mimir at http://localhost:9009 (long-term storage)
A pre-configured Grafana dashboard is automatically provisioned.
Backup Metrics
kafka_backup_records_total
Type: Counter
Total number of records backed up.
Labels:
- backup_id - Backup identifier
Example:
kafka_backup_records_total_total{backup_id="daily-001"} 150234
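Note: the prometheus-client library appends _total to every counter, so counters are exposed with a doubled suffix (kafka_backup_records_total becomes kafka_backup_records_total_total). Use the doubled name in queries and alert rules.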
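The example lines in this reference follow the Prometheus text exposition format: a metric name, an optional label set in braces, then the value. The sketch below shows one way to split such a line in Python. `parse_sample` is a hypothetical helper, not something shipped with OSO Kafka Backup, and it deliberately ignores comment lines and does not unescape label values.

```python
import re
from typing import Dict, Tuple

# name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\S+)?$'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one exposition line into (metric name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'kafka_backup_records_total_total{backup_id="daily-001"} 150234'
)
print(name, labels, value)
```

For anything beyond a quick check, an existing parser from a Prometheus client library is likely the safer choice.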
kafka_backup_bytes_total
Type: Counter
Total bytes backed up (uncompressed).
Labels:
- backup_id - Backup identifier
Example:
kafka_backup_bytes_total_total{backup_id="daily-001"} 52428800
kafka_backup_lag_records
Type: Gauge
Number of records behind the high watermark (consumer lag).
Labels:
- topic - Topic name
- partition - Partition number
- backup_id - Backup identifier
Example:
kafka_backup_lag_records{topic="orders",partition="0",backup_id="daily-001"} 1523
kafka_backup_compression_ratio
Type: Gauge
Compression ratio (uncompressed / compressed). Higher is better.
Labels:
- algorithm - Compression algorithm (zstd, lz4, none)
- backup_id - Backup identifier
Example:
kafka_backup_compression_ratio{algorithm="zstd",backup_id="daily-001"} 11.0
kafka_backup_compressed_bytes_total
Type: Counter
Total compressed bytes written.
Labels:
- algorithm - Compression algorithm
- backup_id - Backup identifier
Example:
kafka_backup_compressed_bytes_total_total{algorithm="zstd",backup_id="daily-001"} 4762800
kafka_backup_uncompressed_bytes_total
Type: Counter
Total uncompressed bytes before compression.
Labels:
- algorithm - Compression algorithm
- backup_id - Backup identifier
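The compression ratio gauge is defined as uncompressed divided by compressed, so it can be cross-checked against the byte counters. The Python sketch below does that arithmetic using the example values from this section, under the assumption that the 52428800 uncompressed bytes and the 4762800 compressed bytes describe the same backup. The function names are hypothetical.

```python
import math


def compression_ratio(uncompressed_bytes: float, compressed_bytes: float) -> float:
    """Ratio as defined in this reference: uncompressed / compressed."""
    if compressed_bytes <= 0:
        return math.nan  # nothing written yet, so the ratio is undefined
    return uncompressed_bytes / compressed_bytes


def space_saved(uncompressed_bytes: float, compressed_bytes: float) -> float:
    """Fraction of bytes saved by compression (0.0 means no saving)."""
    if uncompressed_bytes <= 0:
        return math.nan
    return 1 - compressed_bytes / uncompressed_bytes


ratio = compression_ratio(52_428_800, 4_762_800)
saved = space_saved(52_428_800, 4_762_800)
print(f"ratio={ratio:.1f}x saved={saved:.1%}")
```

The ratio rounds to 11.0, matching the kafka_backup_compression_ratio example, which corresponds to roughly 91% less data written.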
Storage Metrics
kafka_backup_storage_write_latency_seconds
Type: Histogram
Storage write latency in seconds.
Labels:
- backend - Storage backend (filesystem, s3, azure, gcs)
- operation - Operation type (segment, manifest, checkpoint)
Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
Example:
kafka_backup_storage_write_latency_seconds_bucket{backend="filesystem",operation="segment",le="0.1"} 45
kafka_backup_storage_write_latency_seconds_sum{backend="filesystem",operation="segment"} 2.34
kafka_backup_storage_write_latency_seconds_count{backend="filesystem",operation="segment"} 50
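To show how a percentile is derived from cumulative bucket counts like these, here is a simplified Python sketch of the linear interpolation that Prometheus's histogram_quantile performs. It is an approximation for illustration, not the Prometheus implementation. Only the le="0.1" count (45) and the total count (50) come from the example above; the other bucket counts are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # No interpolation is possible into +Inf or an empty bucket.
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Documented bucket bounds; counts other than le=0.1 -> 45 are hypothetical.
BUCKETS = [(0.01, 5), (0.05, 30), (0.1, 45), (0.25, 48), (0.5, 50),
           (1, 50), (2.5, 50), (5, 50), (10, 50), (math.inf, 50)]

print("p50:", histogram_quantile(0.50, BUCKETS))
print("p99:", histogram_quantile(0.99, BUCKETS))
print("mean:", 2.34 / 50)  # _sum / _count from the example above
```

With these counts the estimate is about 0.042 s for p50 and 0.4375 s for p99. The result can never be more precise than the bucket boundaries allow.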
kafka_backup_storage_write_bytes_total
Type: Counter
Total bytes written to storage.
Labels:
- backend - Storage backend
- backup_id - Backup identifier
Example:
kafka_backup_storage_write_bytes_total_total{backend="s3",backup_id="daily-001"} 10485760
kafka_backup_storage_read_latency_seconds
Type: Histogram
Storage read latency (used during restore).
Labels:
- backend - Storage backend
- operation - Operation type
Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
kafka_backup_storage_read_bytes_total
Type: Counter
Total bytes read from storage.
Labels:
- backend - Storage backend
- backup_id - Backup identifier
Error Metrics
kafka_backup_errors_total
Type: Counter
Total number of errors encountered.
Labels:
- backup_id - Backup identifier
- error_type - Error category:
  - kafka - Kafka connection/protocol errors
  - storage - Storage backend errors
  - compression - Compression/decompression errors
  - config - Configuration errors
  - timeout - Operation timeouts
  - other - Uncategorized errors
Example:
kafka_backup_errors_total_total{backup_id="daily-001",error_type="kafka"} 2
kafka_backup_retries_total
Type: Counter
Total number of retry attempts.
Labels:
- backup_id - Backup identifier
- operation - Operation being retried
Restore Metrics
kafka_restore_records_total
Type: Counter
Total number of records restored.
Labels:
- backup_id - Source backup identifier
- restore_id - Restore operation identifier
kafka_restore_bytes_total
Type: Counter
Total bytes restored.
Labels:
- backup_id - Source backup identifier
- restore_id - Restore operation identifier
kafka_restore_progress_percent
Type: Gauge
Restore progress percentage (0-100).
Labels:
- backup_id - Source backup identifier
- restore_id - Restore operation identifier
kafka_restore_eta_seconds
Type: Gauge
Estimated time remaining for restore in seconds.
Labels:
- backup_id - Source backup identifier
- restore_id - Restore operation identifier
kafka_restore_throughput_records_per_second
Type: Gauge
Current restore throughput.
Labels:
- restore_id - Restore operation identifier
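The restore gauges can be sanity-checked against each other. The Python sketch below derives a rough ETA from the records restored so far, the progress percentage, and the current throughput. It assumes that progress is measured in records, which this reference does not state, so treat it as a cross-check rather than a description of how kafka_restore_eta_seconds is computed. The function and parameter names are hypothetical.

```python
from typing import Optional


def estimate_eta_seconds(records_restored: float,
                         progress_percent: float,
                         records_per_second: float) -> Optional[float]:
    """Rough ETA, assuming progress_percent is record-based (an assumption)."""
    if not 0 < progress_percent <= 100 or records_per_second <= 0:
        return None  # not enough information to extrapolate
    # Scale the work done so far to estimate the work still outstanding.
    remaining = records_restored * (100 - progress_percent) / progress_percent
    return remaining / records_per_second


# 750,000 records restored at 75% progress, running at 5,000 records/sec.
print(estimate_eta_seconds(750_000, 75, 5_000))
```

A large, persistent gap between this estimate and the reported gauge may simply mean the tool measures progress in bytes or segments instead.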
Grafana Dashboard
Pre-built Dashboard
The Docker monitoring stack includes a pre-configured Grafana dashboard with:
| Panel | Description |
|---|---|
| Total Records | Running count of backed up records |
| Total Bytes | Running count of backed up bytes |
| Compression Ratio | Current compression efficiency |
| Consumer Lag | Total lag across all partitions |
| Consumer Lag by Partition | Time series of lag per topic/partition |
| Storage Write Latency | p50 and p99 write latency |
| Storage I/O | Bytes per second written to storage |
Recommended PromQL Queries
Backup Throughput (records/sec)
rate(kafka_backup_records_total_total[5m])
Consumer Lag by Partition
kafka_backup_lag_records{backup_id=~"$backup_id"}
Total Consumer Lag
sum(kafka_backup_lag_records{backup_id=~"$backup_id"})
Compression Ratio
kafka_backup_compression_ratio{backup_id=~"$backup_id"}
Storage Write Latency (p99)
histogram_quantile(0.99, rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))
Storage Write Latency (p50)
histogram_quantile(0.50, rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))
Storage Write Throughput (bytes/sec)
rate(kafka_backup_storage_write_bytes_total_total[1m])
Error Rate
rate(kafka_backup_errors_total_total[5m])
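Several of the queries above rely on rate(), which turns a monotonically increasing counter into a per-second value. The Python sketch below shows the core idea, including how a counter reset after a process restart is handled. It is a simplification: the real rate() function also extrapolates to the edges of the range window, which is omitted here. The function name and sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def per_second_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Average per-second increase over (timestamp_seconds, value) samples.

    Samples must be ordered oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Three scrapes 15 seconds apart.
print(per_second_rate([(0, 100), (15, 250), (30, 400)]))
```

Because resets are absorbed this way, counters such as kafka_backup_records_total_total should generally be wrapped in rate() or increase() rather than graphed raw.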
Alert Rules
Example Prometheus alerting rules:
groups:
  - name: kafka-backup
    rules:
      - alert: BackupLagging
        expr: sum(kafka_backup_lag_records) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka backup is lagging"
          description: "Backup is {{ $value }} records behind the high watermark"
      - alert: BackupErrors
        expr: increase(kafka_backup_errors_total_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka backup errors detected"
          description: "Backup {{ $labels.backup_id }} encountered {{ $labels.error_type }} errors"
      - alert: StorageLatencyHigh
        expr: histogram_quantile(0.99, rate(kafka_backup_storage_write_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Storage write latency is high"
          description: "p99 storage latency is {{ $value }}s"
      - alert: LowCompressionRatio
        expr: kafka_backup_compression_ratio < 2
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Low compression ratio"
          description: "Compression ratio is only {{ $value }}x - data may already be compressed"
      - alert: BackupStalled
        # Match on backup_id only: the lag gauge also carries topic and partition labels
        expr: increase(kafka_backup_records_total_total[5m]) == 0 and on (backup_id) kafka_backup_lag_records > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup appears stalled"
          description: "No records processed in 5 minutes but lag exists"
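The BackupStalled rule combines a per-backup counter with a per-partition gauge, so the two sides have to be compared per backup_id. The Python sketch below spells out the intended logic on plain dictionaries: a backup counts as stalled when it processed no records in the window while any of its partitions still has lag. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def stalled_backups(records_increase: Dict[str, float],
                    lag: Dict[Tuple[str, str, int], float]) -> List[str]:
    """Return backup_ids with no progress but outstanding lag.

    records_increase maps backup_id -> records added during the window.
    lag maps (backup_id, topic, partition) -> records behind.
    """
    total_lag: Dict[str, float] = {}
    for (backup_id, _topic, _partition), value in lag.items():
        # Collapse the per-partition series down to one value per backup.
        total_lag[backup_id] = total_lag.get(backup_id, 0) + value
    return sorted(
        backup_id
        for backup_id, increase in records_increase.items()
        if increase == 0 and total_lag.get(backup_id, 0) > 0
    )


increase = {"daily-001": 0, "daily-002": 1200, "daily-003": 0}
lag = {
    ("daily-001", "orders", 0): 1523,
    ("daily-001", "orders", 1): 0,
    ("daily-002", "orders", 0): 10,
    ("daily-003", "orders", 0): 0,
}
print(stalled_backups(increase, lag))
```

Here only daily-001 qualifies: daily-002 is still making progress, and daily-003 is idle but fully caught up.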
Prometheus Scrape Configuration
Static Configuration
scrape_configs:
  - job_name: 'kafka-backup'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s
Docker Compose
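When Prometheus runs in a container and kafka-backup runs on the host, use host.docker.internal as the target. Docker Desktop resolves this name automatically; on Linux it typically has to be mapped to the host gateway with an extra_hosts entry.
scrape_configs:
  - job_name: 'kafka-backup'
    static_configs:
      - targets: ['host.docker.internal:8080']
    scrape_interval: 15s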
Kubernetes ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-backup
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: kafka-backup
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
Configuration Reference
Full metrics configuration options:
metrics:
  # Enable metrics endpoint (default: true)
  enabled: true
  # Port to listen on (default: 8080)
  port: 8080
  # Bind address (default: "0.0.0.0")
  bind_address: "0.0.0.0"
  # Metrics path (default: "/metrics")
  path: "/metrics"
Endpoints
The metrics server exposes multiple endpoints:
| Endpoint | Description |
|---|---|
| /metrics | Prometheus metrics in OpenMetrics format |
| /health | Health check (returns 200 OK if healthy) |
| / | Basic info page |
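A quick way to confirm that the metrics server is up is to request each endpoint and check for HTTP 200. The Python sketch below does this with the standard library only. The helper names are hypothetical, the base URL assumes the default port, and the `probe` parameter exists so the check can be exercised without a running server.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS = ("/metrics", "/health", "/")


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def probe_endpoints(base_url: str,
                    probe: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each documented endpoint to True if it answered 200."""
    base = base_url.rstrip("/")
    return {path: probe(base + path) == 200 for path in ENDPOINTS}


if __name__ == "__main__":
    # Assumes a backup or restore is running with the default metrics config.
    print(probe_endpoints("http://localhost:8080"))
```

In Kubernetes, the same /health path is a natural candidate for liveness and readiness probes.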
Next Steps
- Performance Tuning - Use metrics to optimize
- Error Codes - Understand error types
- Troubleshooting - Debug with metrics