Metrics Reference

OSO Kafka Backup exposes Prometheus/OpenMetrics metrics for monitoring backup and restore operations.

Metrics Endpoint

When running a backup or restore, metrics are exposed at:

http://localhost:8080/metrics

The endpoint can be configured via the metrics section in your config file:

metrics:
  enabled: true
  port: 8080
  bind_address: "0.0.0.0"
  path: "/metrics"

Quick Start with Docker

A complete monitoring stack is provided for local development and testing:

cd docker
docker-compose -f docker-compose.metrics.yml up -d

This starts the monitoring stack (Prometheus plus Grafana). A pre-configured Grafana dashboard is automatically provisioned.


Backup Metrics

kafka_backup_records_total

Type: Counter

Total number of records backed up.

Labels:

  • backup_id - Backup identifier

Example:

kafka_backup_records_total_total{backup_id="daily-001"} 150234
Note: Counter metrics have _total appended by the prometheus-client library, resulting in kafka_backup_records_total_total.
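The doubled suffix is easy to see by parsing an exposition line such as the sample above. A minimal sketch, not a full exposition-format parser (it ignores escaped quotes, timestamps, and HELP/TYPE comment lines):

```python
import re

# Parse a single Prometheus exposition line into (name, labels, value).
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_line(line: str):
    m = LINE.match(line.strip())
    raw = m.group("labels") or ""
    labels = {}
    for pair in filter(None, raw.split(",")):
        key, val = pair.split("=", 1)
        labels[key.strip()] = val.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_line(
    'kafka_backup_records_total_total{backup_id="daily-001"} 150234'
)
# name   -> "kafka_backup_records_total_total"
# labels -> {"backup_id": "daily-001"}
# value  -> 150234.0
```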

kafka_backup_bytes_total

Type: Counter

Total bytes backed up (uncompressed).

Labels:

  • backup_id - Backup identifier

Example:

kafka_backup_bytes_total_total{backup_id="daily-001"} 52428800

kafka_backup_lag_records

Type: Gauge

Number of records behind the high watermark (consumer lag).

Labels:

  • topic - Topic name
  • partition - Partition number
  • backup_id - Backup identifier

Example:

kafka_backup_lag_records{topic="orders",partition="0",backup_id="daily-001"} 1523
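The gauge is simply the distance from the backup's read position to the partition's high watermark. A sketch with assumed offsets chosen to match the sample value above:

```python
# Consumer lag per partition: high watermark minus the next offset the
# backup will consume. Both offsets below are assumed for illustration.
high_watermark = 101_523   # latest offset Kafka reports for the partition
next_offset = 100_000      # next offset the backup consumer will read

lag = high_watermark - next_offset
# lag -> 1523, matching the sample above
```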

kafka_backup_compression_ratio

Type: Gauge

Compression ratio (uncompressed / compressed). Higher is better.

Labels:

  • algorithm - Compression algorithm (zstd, lz4, none)
  • backup_id - Backup identifier

Example:

kafka_backup_compression_ratio{algorithm="zstd",backup_id="daily-001"} 11.0

kafka_backup_compressed_bytes_total

Type: Counter

Total compressed bytes written.

Labels:

  • algorithm - Compression algorithm
  • backup_id - Backup identifier

Example:

kafka_backup_compressed_bytes_total_total{algorithm="zstd",backup_id="daily-001"} 4762800

kafka_backup_uncompressed_bytes_total

Type: Counter

Total uncompressed bytes before compression.

Labels:

  • algorithm - Compression algorithm
  • backup_id - Backup identifier
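The compression_ratio gauge is the quotient of these two byte counters. Using the sample values from the examples above (and assuming the uncompressed counter equals the bytes_total sample):

```python
# Compression ratio = uncompressed bytes / compressed bytes.
uncompressed = 52_428_800   # kafka_backup_uncompressed_bytes_total (assumed)
compressed = 4_762_800      # kafka_backup_compressed_bytes_total sample

ratio = uncompressed / compressed
# round(ratio, 1) -> 11.0, matching the compression_ratio sample
```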

Storage Metrics

kafka_backup_storage_write_latency_seconds

Type: Histogram

Storage write latency in seconds.

Labels:

  • backend - Storage backend (filesystem, s3, azure, gcs)
  • operation - Operation type (segment, manifest, checkpoint)

Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

Example:

kafka_backup_storage_write_latency_seconds_bucket{backend="filesystem",operation="segment",le="0.1"} 45
kafka_backup_storage_write_latency_seconds_sum{backend="filesystem",operation="segment"} 2.34
kafka_backup_storage_write_latency_seconds_count{backend="filesystem",operation="segment"} 50
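Quantiles like the p50/p99 panels below are derived from these cumulative buckets. A minimal sketch of the interpolation PromQL's histogram_quantile performs; bucket counts other than the le="0.1" sample above are assumed, and real histogram_quantile has additional edge handling this omits:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a sorted list of (upper_bound, cumulative_count)
    pairs ending with (inf, total). Linearly interpolates inside
    the bucket where the target rank falls.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count

# Hypothetical cumulative counts for the write-latency histogram
buckets = [(0.01, 5), (0.05, 20), (0.1, 45), (0.25, 48),
           (0.5, 50), (float("inf"), 50)]
p50 = histogram_quantile(0.50, buckets)   # ~0.06 s
p99 = histogram_quantile(0.99, buckets)   # ~0.4375 s
```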

kafka_backup_storage_write_bytes_total

Type: Counter

Total bytes written to storage.

Labels:

  • backend - Storage backend
  • backup_id - Backup identifier

Example:

kafka_backup_storage_write_bytes_total_total{backend="s3",backup_id="daily-001"} 10485760

kafka_backup_storage_read_latency_seconds

Type: Histogram

Storage read latency (used during restore).

Labels:

  • backend - Storage backend
  • operation - Operation type

Buckets: 0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

kafka_backup_storage_read_bytes_total

Type: Counter

Total bytes read from storage.

Labels:

  • backend - Storage backend
  • backup_id - Backup identifier

Error Metrics

kafka_backup_errors_total

Type: Counter

Total number of errors encountered.

Labels:

  • backup_id - Backup identifier
  • error_type - Error category:
    • kafka - Kafka connection/protocol errors
    • storage - Storage backend errors
    • compression - Compression/decompression errors
    • config - Configuration errors
    • timeout - Operation timeouts
    • other - Uncategorized errors

Example:

kafka_backup_errors_total_total{backup_id="daily-001",error_type="kafka"} 2
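One common way such categories are assigned is by exception type. A hypothetical sketch of the taxonomy above; the tool's actual categorisation logic is internal and may differ:

```python
# Hypothetical mapping from a raised exception to the error_type label.
def classify(exc: BaseException) -> str:
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):   # checked before OSError
        return "kafka"
    if isinstance(exc, OSError):
        return "storage"
    return "other"
```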

kafka_backup_retries_total

Type: Counter

Total number of retry attempts.

Labels:

  • backup_id - Backup identifier
  • operation - Operation being retried

Restore Metrics

kafka_restore_records_total

Type: Counter

Total number of records restored.

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_bytes_total

Type: Counter

Total bytes restored.

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_progress_percent

Type: Gauge

Restore progress percentage (0-100).

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_eta_seconds

Type: Gauge

Estimated time remaining for restore in seconds.

Labels:

  • backup_id - Source backup identifier
  • restore_id - Restore operation identifier

kafka_restore_throughput_records_per_second

Type: Gauge

Current restore throughput.

Labels:

  • restore_id - Restore operation identifier
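The progress and ETA gauges relate to the record counters in the obvious way. The formulas and sample values below are illustrative assumptions, not taken from the tool's source:

```python
total_records = 150_234   # records in the source backup (assumed known)
restored = 120_000        # kafka_restore_records_total so far
throughput = 5_000.0      # kafka_restore_throughput_records_per_second

progress_percent = 100.0 * restored / total_records    # ~79.9
eta_seconds = (total_records - restored) / throughput  # ~6.05
```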

Grafana Dashboard

Pre-built Dashboard

The Docker monitoring stack includes a pre-configured Grafana dashboard with:

Panel                      Description
Total Records              Running count of backed-up records
Total Bytes                Running count of backed-up bytes
Compression Ratio          Current compression efficiency
Consumer Lag               Total lag across all partitions
Consumer Lag by Partition  Time series of lag per topic/partition
Storage Write Latency      p50 and p99 write latency
Storage I/O                Bytes per second written to storage

Panel Queries

Backup Throughput (records/sec)

rate(kafka_backup_records_total_total[5m])

Consumer Lag by Partition

kafka_backup_lag_records{backup_id=~"$backup_id"}

Total Consumer Lag

sum(kafka_backup_lag_records{backup_id=~"$backup_id"})

Compression Ratio

kafka_backup_compression_ratio{backup_id=~"$backup_id"}

Storage Write Latency (p99)

histogram_quantile(0.99, rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))

Storage Write Latency (p50)

histogram_quantile(0.50, rate(kafka_backup_storage_write_latency_seconds_bucket[5m]))

Storage Write Throughput (bytes/sec)

rate(kafka_backup_storage_write_bytes_total_total[1m])

Error Rate

rate(kafka_backup_errors_total_total[5m])
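At its core, the rate() function used in these queries is the per-second increase of a counter across a window. A two-sample sketch (real rate() also extrapolates to the window edges and handles counter resets, which this omits):

```python
# Per-second increase of a counter between two (timestamp, value) samples.
def simple_rate(older, newer):
    (t0, v0), (t1, v1) = older, newer
    return (v1 - v0) / (t1 - t0)

# Two hypothetical scrapes of kafka_backup_records_total_total, 300 s apart
records_per_sec = simple_rate((0, 150_234), (300, 165_234))
# records_per_sec -> 50.0
```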

Alert Rules

Example Prometheus alerting rules:

groups:
  - name: kafka-backup
    rules:
      - alert: BackupLagging
        expr: sum(kafka_backup_lag_records) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka backup is lagging"
          description: "Backup is {{ $value }} records behind the high watermark"

      - alert: BackupErrors
        expr: increase(kafka_backup_errors_total_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka backup errors detected"
          description: "Backup {{ $labels.backup_id }} encountered {{ $labels.error_type }} errors"

      - alert: StorageLatencyHigh
        expr: histogram_quantile(0.99, rate(kafka_backup_storage_write_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Storage write latency is high"
          description: "p99 storage latency is {{ $value }}s"

      - alert: LowCompressionRatio
        expr: kafka_backup_compression_ratio < 2
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Low compression ratio"
          description: "Compression ratio is only {{ $value }}x - data may already be compressed"

      - alert: BackupStalled
        expr: increase(kafka_backup_records_total_total[5m]) == 0 and on(backup_id) sum by (backup_id) (kafka_backup_lag_records) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Backup appears stalled"
          description: "No records processed in 5 minutes but lag exists"

Prometheus Scrape Configuration

Static Configuration

scrape_configs:
  - job_name: 'kafka-backup'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s

Docker Compose

When Prometheus runs inside Docker and the backup process runs on the host, use host.docker.internal to reach it:

scrape_configs:
  - job_name: 'kafka-backup'
    static_configs:
      - targets: ['host.docker.internal:8080']
    scrape_interval: 15s

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-backup
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: kafka-backup
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Configuration Reference

Full metrics configuration options:

metrics:
  # Enable metrics endpoint (default: true)
  enabled: true

  # Port to listen on (default: 8080)
  port: 8080

  # Bind address (default: "0.0.0.0")
  bind_address: "0.0.0.0"

  # Metrics path (default: "/metrics")
  path: "/metrics"

Endpoints

The metrics server exposes multiple endpoints:

Endpoint   Description
/metrics   Prometheus metrics in OpenMetrics format
/health    Health check (returns 200 OK if healthy)
/          Basic info page

Next Steps