Metrics Reference

OSO Kafka Backup exposes Prometheus/OpenMetrics metrics for monitoring backup and restore operations.

Metrics Endpoint

When running a backup or restore, metrics are exposed at:

http://localhost:8080/metrics

The endpoint can be configured via the metrics section in your config file:

metrics:
  enabled: true
  port: 8080
  bind_address: "0.0.0.0"
  path: "/metrics"

Quick Start with Docker

A complete monitoring stack is provided for local development and testing:

cd docker
docker-compose -f docker-compose.metrics.yml up -d

This starts the monitoring stack (Prometheus plus Grafana). A pre-configured Grafana dashboard is automatically provisioned.


Backup Metrics

kafka_backup_records_total

Type: Counter

Total number of records backed up.

Labels:

  • backup_id - Backup identifier

Example:

kafka_backup_records_total_total{backup_id="daily-001"} 150234
Note: Counter metrics have _total appended by the prometheus-client library, resulting in kafka_backup_records_total_total.
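The doubled suffix is easy to see by parsing an exposition line such as the sample above. A minimal sketch, not a full exposition-format parser (it ignores escaped quotes, timestamps, and HELP/TYPE comment lines):

```python
import re

# Parse a single Prometheus exposition line into (name, labels, value).
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_line(line: str):
    m = LINE.match(line.strip())
    raw = m.group("labels") or ""
    labels = {}
    for pair in filter(None, raw.split(",")):
        key, val = pair.split("=", 1)
        labels[key.strip()] = val.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_line(
    'kafka_backup_records_total_total{backup_id="daily-001"} 150234'
)
# name   -> "kafka_backup_records_total_total"
# labels -> {"backup_id": "daily-001"}
# value  -> 150234.0
```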

kafka_backup_bytes_total

Type: Counter

Total bytes backed up (uncompressed).

Labels:

  • backup_id - Backup identifier

Example:

kafka_backup_bytes_total_total{backup_id="daily-001"} 52428800

kafka_backup_lag_records

Type: Gauge

Number of records behind the high watermark (consumer lag).

Labels:

  • topic - Topic name
  • partition - Partition number
  • backup_id - Backup identifier

Example:

kafka_backup_lag_records{topic="orders",partition="0",backup_id="daily-001"} 1523
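The gauge is simply the distance from the backup's read position to the partition's high watermark. A sketch with assumed offsets chosen to match the sample value above:

```python
# Consumer lag per partition: high watermark minus the next offset the
# backup will consume. Both offsets below are assumed for illustration.
high_watermark = 101_523   # latest offset Kafka reports for the partition
next_offset = 100_000      # next offset the backup consumer will read

lag = high_watermark - next_offset
# lag -> 1523, matching the sample above
```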

kafka_backup_compression_ratio

Type: Gauge

Compression ratio (uncompressed / compressed). Higher is better.

Labels:

  • algorithm - Compression algorithm (zstd, lz4, none)
  • backup_id - Backup identifier

Example:

kafka_backup_compression_ratio{algorithm="zstd",backup_id="daily-001"} 11.0

kafka_backup_compressed_bytes_total

Type: Counter

Total compressed bytes written.

Labels:

  • algorithm - Compression algorithm
  • backup_id - Backup identifier

Example:

kafka_backup_compressed_bytes_total_total{algorithm="zstd",backup_id="daily-001"} 4762800

kafka_backup_uncompressed_bytes_total

Type: Counter

Total uncompressed bytes before compression.

Labels:

  • algorithm - Compression algorithm
  • backup_id - Backup identifier
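The compression_ratio gauge is the quotient of these two byte counters. Using the sample values from the examples above (and assuming the uncompressed counter equals the bytes_total sample):

```python
# Compression ratio = uncompressed bytes / compressed bytes.
uncompressed = 52_428_800   # kafka_backup_uncompressed_bytes_total (assumed)
compressed = 4_762_800      # kafka_backup_compressed_bytes_total sample

ratio = uncompressed / compressed
# round(ratio, 1) -> 11.0, matching the compression_ratio sample
```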

Storage Metrics

kafka_backup_storage_write_latency_seconds

Type: Histogram

Storage write latency in seconds.

Labels:

  • backend - Storage backend (filesystem, s3, azure, gcs)
  • operation - Operation type (segment, manifest, checkpoint)

Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

Example:

kafka_backup_storage_write_latency_seconds_bucket{backend="filesystem",operation="segment",le="0.1"} 45
kafka_backup_storage_write_latency_seconds_sum{backend="filesystem",operation="segment"} 2.34
kafka_backup_storage_write_latency_seconds_count{backend="filesystem",operation="segment"} 50
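Quantiles like the p50/p99 panels below are derived from these cumulative buckets. A minimal sketch of the interpolation PromQL's histogram_quantile performs; bucket counts other than the le="0.1" sample above are assumed, and real histogram_quantile has additional edge handling this omits:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a sorted list of (upper_bound, cumulative_count)
    pairs ending with (inf, total). Linearly interpolates inside
    the bucket where the target rank falls.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count

# Hypothetical cumulative counts for the write-latency histogram
buckets = [(0.01, 5), (0.05, 20), (0.1, 45), (0.25, 48),
           (0.5, 50), (float("inf"), 50)]
p50 = histogram_quantile(0.50, buckets)   # ~0.06 s
p99 = histogram_quantile(0.99, buckets)   # ~0.4375 s
```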

kafka_backup_storage_write_bytes_total

Type: Counter

Total bytes written to storage.

Labels:

  • backend - Storage backend
  • backup_id - Backup identifier

Example:

kafka_backup_storage_write_bytes_total_total{backend="s3",backup_id="daily-001"} 10485760

kafka_backup_storage_read_latency_seconds

Type: Histogram

Storage read latency (used during restore).

Labels:

  • backend - Storage backend
  • operation - Operation type

Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

kafka_backup_storage_read_bytes_total

Type: Counter

Total bytes read from storage.

Labels:

  • backend - Storage backend
  • backup_id - Backup identifier

Error Metrics

kafka_backup_errors_total

Type: Counter

Total number of errors encountered.

Labels:

  • backup_id - Backup identifier
  • error_type - Error category:
    • kafka - Kafka connection/protocol errors
    • storage - Storage backend errors
    • compression - Compression/decompression errors
    • config - Configuration errors
    • timeout - Operation timeouts
    • other - Uncategorized errors

Example:

kafka_backup_errors_total_total{backup_id="daily-001",error_type="kafka"} 2
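One common way such categories are assigned is by exception type. A hypothetical sketch of the taxonomy above; the tool's actual categorisation logic is internal and may differ:

```python
# Hypothetical mapping from a raised exception to the error_type label.
def classify(exc: BaseException) -> str:
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):   # checked before OSError
        return "kafka"
    if isinstance(exc, OSError):
        return "storage"
    return "other"
```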

kafka_backup_retries_total

Type: Counter

Total number of retry attempts.

Labels:

  • backup_id - Backup identifier
  • operation - Operation being retried

Restore Metrics

kafka_restore_records_total

Type: Counter

Total number of records restored.

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_bytes_total

Type: Counter

Total bytes restored.

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_progress_percent

Type: Gauge

Restore progress percentage (0-100).

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_eta_seconds

Type: Gauge

Estimated time remaining for restore in seconds.

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_throughput_records_per_second

Type: Gauge

Current restore throughput.

Labels:

  • restore_id - Restore operation identifier
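The progress and ETA gauges relate to the record counters in the obvious way. The formulas and sample values below are illustrative assumptions, not taken from the tool's source:

```python
total_records = 150_234   # records in the source backup (assumed known)
restored = 120_000        # kafka_restore_records_total so far
throughput = 5_000.0      # kafka_restore_throughput_records_per_second

progress_percent = 100.0 * restored / total_records    # ~79.9
eta_seconds = (total_records - restored) / throughput  # ~6.05
```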

Grafana Dashboard

Pre-built Dashboard

The Docker monitoring stack includes a pre-configured Grafana dashboard with:

Panel                      Description
Total Records              Running count of backed-up records
Total Bytes                Running count of backed-up bytes
Compression Ratio          Current compression efficiency
Consumer Lag               Total lag across all partitions
Consumer Lag by Partition  Time series of lag per topic/partition
Storage Write Latency      p50 and p99 write latency
Storage I/O                Bytes per second written to storage

Panel Queries

Backup Throughput (records/sec)

rate(kafka_backup_records_total_total[5m])

Consumer Lag by Partition

kafka_backup_lag_records{backup_id=~"$backup_id"}

Total Consumer Lag

sum(kafka_backup_lag_records{backup_id=~"$backup_id"})

Compression Ratio

kafka_backup_compression_ratio{backup_id=~"$backup_id"}

Storage Write Latency (p99)

histogram_quantile(0.99, rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))

Storage Write Latency (p50)

histogram_quantile(0.50, rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))

Storage Write Throughput (bytes/sec)

rate(kafka_backup_storage_write_bytes_total_total[1m])

Error Rate

rate(kafka_backup_errors_total_total[5m])
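At its core, the rate() function used in these queries is the per-second increase of a counter across a window. A two-sample sketch (real rate() also extrapolates to the window edges and handles counter resets, which this omits):

```python
# Per-second increase of a counter between two (timestamp, value) samples.
def simple_rate(older, newer):
    (t0, v0), (t1, v1) = older, newer
    return (v1 - v0) / (t1 - t0)

# Two hypothetical scrapes of kafka_backup_records_total_total, 300 s apart
records_per_sec = simple_rate((0, 150_234), (300, 165_234))
# records_per_sec -> 50.0
```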

Alert Rules

Example Prometheus alerting rules:

groups:
  - name: kafka-backup
    rules:
      - alert: BackupLagging
        expr: sum(kafka_backup_lag_records) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka backup is lagging"
          description: "Backup is {{ $value }} records behind the high watermark"

      - alert: BackupErrors
        expr: increase(kafka_backup_errors_total_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka backup errors detected"
          description: "Backup {{ $labels.backup_id }} encountered {{ $labels.error_type }} errors"

      - alert: StorageLatencyHigh
        expr: histogram_quantile(0.99, rate(kafka_backup_storage_write_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Storage write latency is high"
          description: "p99 storage latency is {{ $value }}s"

      - alert: LowCompressionRatio
        expr: kafka_backup_compression_ratio < 2
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Low compression ratio"
          description: "Compression ratio is only {{ $value }}x - data may already be compressed"

      - alert: BackupStalled
        expr: increase(kafka_backup_records_total_total[5m]) == 0 and on(backup_id) sum by (backup_id) (kafka_backup_lag_records) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup appears stalled"
          description: "No records processed in 5 minutes but lag exists"

Prometheus Scrape Configuration

Static Configuration

scrape_configs:
  - job_name: 'kafka-backup'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s

Docker Compose

When Prometheus runs inside Docker and the backup process runs on the host, use host.docker.internal to reach it:

scrape_configs:
  - job_name: 'kafka-backup'
    static_configs:
      - targets: ['host.docker.internal:8080']
    scrape_interval: 15s

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-backup
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: kafka-backup
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Configuration Reference

Full metrics configuration options:

metrics:
  # Enable metrics endpoint (default: true)
  enabled: true

  # Port to listen on (default: 8080)
  port: 8080

  # Bind address (default: "0.0.0.0")
  bind_address: "0.0.0.0"

  # Metrics path (default: "/metrics")
  path: "/metrics"

Endpoints

The metrics server exposes multiple endpoints:

Endpoint   Description
/metrics   Prometheus metrics in OpenMetrics format
/health    Health check (returns 200 OK if healthy)
/          Basic info page

Next Steps