Cost Optimisation
"Managing and reducing the total cost of Kafka backup operations through efficient resource utilisation, smart storage tiering, and cost-aware architecture decisions."
Backup infrastructure can become a significant and often invisible cost centre. Without active cost management, storage grows unbounded, oversized compute runs around the clock, and network transfer charges accumulate unnoticed. The Cost Optimisation pillar ensures you achieve reliable data protection at the lowest reasonable cost by continuously measuring, right-sizing, and governing every component of your backup architecture.
Design Principles
- Implement cloud financial management — Tag every backup resource, track spending against budgets, and make cost data visible to the teams that control it.
- Adopt a consumption model — Pay only for what you use. Scale compute to match backup windows, use lifecycle policies to tier storage, and avoid paying for idle capacity.
- Measure overall efficiency — Define a unit cost metric such as cost per GB backed up. Track it over time and use it to evaluate architecture changes.
- Stop spending on undifferentiated heavy lifting — Use managed object storage (S3, GCS, Azure Blob) rather than self-hosted storage. Let the cloud provider handle durability, availability, and scaling.
- Analyse and attribute expenditure — Break down backup costs by team, environment, and topic. Attribution drives accountability and surfaces optimisation opportunities.
- Right-size retention to actual business need — Not all data needs the same retention period. Match retention to compliance, operational, and business requirements rather than applying a single blanket policy.
Best Practices
CO-01: Storage Cost Management
What
Minimise storage costs through lifecycle policies, compression, deduplication, and continuous monitoring — without compromising data durability or restore capability.
Why
Storage is typically the largest component of backup cost. A single unmanaged S3 bucket can grow from manageable to expensive within months. Lifecycle policies alone can reduce long-term storage costs by 95% compared to keeping everything in the default storage class.
Implementation Guidance
AWS S3 Lifecycle Tiers
| Age | Storage Class | Approx. Cost (USD/GB/mo) | Use Case |
|---|---|---|---|
| 0–30 days | S3 Standard | $0.023 | Active backups, frequent restores |
| 31–90 days | S3 Standard-IA | $0.0125 | Infrequent access, still fast retrieval |
| 91–365 days | S3 Glacier Instant Retrieval | $0.004 | Archival with millisecond access |
| 1+ years | S3 Glacier Deep Archive | $0.00099 | Long-term compliance, rare access |
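To make the table concrete, here is a quick sketch comparing flat S3 Standard storage against the tiered layout for a fixed dataset. Prices are the indicative per-GB rates from the table above; actual pricing varies by region, and the age distribution assumed here is illustrative.

```python
# Illustrative comparison of flat vs tiered storage cost for 10 TB of
# backups spread evenly over one year of daily snapshots.
# Prices (USD/GB/month) are taken from the table above.

PRICES = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb_per_tier: dict) -> float:
    """Sum storage-at-rest cost across tiers for one month."""
    return sum(gb * PRICES[tier] for tier, gb in gb_per_tier.items())

total_gb = 10_240  # 10 TB

# Everything left in Standard, forever.
flat = monthly_cost({"STANDARD": total_gb})

# Data spread by age: 30/365 of it in Standard, 60/365 in Standard-IA,
# the remaining 275/365 in Glacier IR (first-year steady state,
# nothing old enough for Deep Archive yet).
tiered = monthly_cost({
    "STANDARD": total_gb * 30 / 365,
    "STANDARD_IA": total_gb * 60 / 365,
    "GLACIER_IR": total_gb * 275 / 365,
})

print(f"flat:   ${flat:,.2f}/month")
print(f"tiered: ${tiered:,.2f}/month")
print(f"saving: {100 * (1 - tiered / flat):.0f}%")
```

Even before anything reaches Deep Archive, tiering cuts the bill by roughly two thirds; the 95% figure quoted above applies once old data settles into the deepest tier.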
Azure Blob Storage Tiers
| Age | Access Tier | Use Case |
|---|---|---|
| 0–30 days | Hot | Active backups |
| 31–90 days | Cool | Infrequent access |
| 91–365 days | Cold | Archival with moderate retrieval time |
| 1+ years | Archive | Long-term compliance |
GCS Storage Classes
| Age | Storage Class | Use Case |
|---|---|---|
| 0–30 days | Standard | Active backups |
| 31–90 days | Nearline | Monthly access pattern |
| 91–365 days | Coldline | Quarterly access pattern |
| 1+ years | Archive | Annual access or compliance |
Compression and deduplication:
- Enable compression in `kafka-backup` for a typical 3–5x reduction in stored data size
- Use deduplication to eliminate redundant segments across incremental backups
- Automate deletion of backups beyond retention policy
Understand total cost components:
- Storage at rest (the largest component)
- API calls (PUT, GET, LIST operations)
- Data transfer (retrieval and cross-region)
- Compute (the backup process itself)
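These components can be rolled into a simple estimator. All the rates below are illustrative placeholders, not quoted prices; substitute your provider's actual pricing.

```python
# Rough monthly cost model for a backup pipeline: storage at rest,
# API calls, data transfer, and compute. All rates are illustrative
# placeholders; substitute your provider's actual pricing.

def backup_monthly_cost(
    stored_gb: float,
    put_requests: int,
    transfer_gb: float,
    compute_hours: float,
    storage_rate=0.023,      # USD per GB-month at rest
    put_rate=0.005 / 1000,   # USD per PUT request
    transfer_rate=0.01,      # USD per GB (e.g. cross-AZ)
    compute_rate=0.10,       # USD per hour for the backup instance
) -> dict:
    costs = {
        "storage": stored_gb * storage_rate,
        "api": put_requests * put_rate,
        "transfer": transfer_gb * transfer_rate,
        "compute": compute_hours * compute_rate,
    }
    costs["total"] = sum(costs.values())
    return costs

costs = backup_monthly_cost(
    stored_gb=5_000, put_requests=2_000_000, transfer_gb=3_000, compute_hours=60
)
for item, usd in costs.items():
    print(f"{item:>8}: ${usd:,.2f}")
```

Running a model like this monthly makes it obvious which component dominates, and therefore where optimisation effort pays off first.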
Enable S3 Intelligent-Tiering for buckets where access patterns are unpredictable. It automatically moves objects between tiers based on access frequency, with no retrieval fees for the frequent and infrequent access tiers.
Configuration
S3 Lifecycle Policy:
```json
{
  "Rules": [
    {
      "ID": "kafka-backup-lifecycle",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "backups/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}
```
kafka-backup compression configuration:
```yaml
backup:
  compression:
    enabled: true
    algorithm: zstd
    level: 3
storage:
  s3:
    bucket: my-kafka-backups
    region: eu-west-1
    storage-class: STANDARD
```
Anti-patterns
- All Standard, forever — Storing every backup in S3 Standard with no lifecycle policy. A 10 TB dataset costs ~$230/month in Standard vs ~$10/month in Deep Archive.
- No lifecycle policies — Relying on manual cleanup that never happens. Storage grows linearly and silently.
- No cost monitoring — Discovering a $5,000/month storage bill during quarterly budget review instead of catching it at $500.
- No compression — Storing uncompressed Kafka segments when 3–5x compression is available with minimal CPU overhead.
CO-02: Compute Right-Sizing
What
Match compute resources to actual backup workload requirements, scaling up for backup windows and down during idle periods.
Why
Backup workloads are inherently bursty. Running large instances 24/7 for a workload that peaks during a two-hour backup window wastes 90% of the compute spend.
Implementation Guidance
- Start with PE-04 sizing guidance — Use the performance efficiency pillar's sizing recommendations as a baseline, then refine based on observed utilisation.
- Monitor actual utilisation — Track CPU, memory, and network usage during backup runs. If peak utilisation is below 50%, you are over-provisioned.
- Right-size with 20–30% headroom — Allow enough capacity to handle spikes without throttling, but no more.
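The headroom rule is simple arithmetic. A sketch, using hypothetical utilisation numbers:

```python
# Derive a right-sized CPU request from observed peak usage plus
# 20-30% headroom, as described above. All numbers are hypothetical.

def right_size(observed_peak: float, headroom: float = 0.25) -> float:
    """Return an allocation sized to observed peak usage plus headroom."""
    return observed_peak * (1 + headroom)

current_request = 4.0   # CPUs currently provisioned
observed_peak = 1.6     # CPUs actually used during the backup window

recommended = right_size(observed_peak)
utilisation = observed_peak / current_request

print(f"peak utilisation: {utilisation:.0%}")  # below 50% means over-provisioned
print(f"recommended request: {recommended:.1f} CPUs (was {current_request})")
```

The same calculation applies to memory and network bandwidth; re-run it after each significant workload change.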
Kubernetes requests and limits:
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
```
- Use Spot/Preemptible instances for non-critical workloads — Validation jobs and development backups tolerate interruption. Enable checkpointing so interrupted jobs resume rather than restart.
```yaml
# Kubernetes node affinity preferring Spot capacity. The label shown is
# the one EKS managed node groups apply; other providers use their own
# (e.g. cloud.google.com/gke-spot on GKE).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
```
- Schedule scale-up during backup windows — Use Kubernetes CronJobs or cluster autoscaler profiles to add capacity before backup runs and release it afterwards.
- Leverage Rust efficiency — `kafka-backup` is written in Rust, which delivers significantly lower CPU and memory consumption than JVM-based alternatives. This translates directly into smaller instance sizes and lower compute costs.
Combine Spot instances with checkpointing for development and validation workloads. If a Spot instance is reclaimed, the job resumes from the last checkpoint rather than restarting from scratch.
Anti-patterns
- Large instances 24/7 for a 2-hour daily backup — Running an `m5.4xlarge` around the clock when an `m5.xlarge` during the backup window would suffice.
- No utilisation monitoring — Provisioning based on initial estimates and never revisiting. Workloads change; compute allocation should too.
- Ignoring Spot/Preemptible — Paying full on-demand price for fault-tolerant workloads that can run on Spot at 60–90% discount.
CO-03: Network Transfer Costs
What
Minimise data transfer charges by co-locating backup components, using private endpoints, and compressing data before transmission.
Why
Cloud providers charge for data that crosses availability zone, region, or internet boundaries. For high-throughput Kafka clusters, transfer costs can rival or exceed storage costs if not managed carefully.
Implementation Guidance
Typical transfer pricing (AWS):
| Path | Approx. Cost (USD/GB) |
|---|---|
| Same AZ (private IP) | Free |
| Cross-AZ | $0.01 |
| Cross-region | $0.02–$0.09 |
| Internet egress | $0.05–$0.12 |
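A quick sketch of how placement and compression interact, using the indicative per-GB rates from the table above and a hypothetical traffic volume:

```python
# Monthly transfer cost for a backup stream under different placements,
# using the indicative per-GB rates from the table above. Compression
# shrinks the bytes on the wire, so its saving applies proportionally.

RATES = {"same_az": 0.0, "cross_az": 0.01, "cross_region": 0.09}

def transfer_cost(gb_per_day: float, path: str, compression_ratio: float = 1.0) -> float:
    """Monthly transfer cost; compression_ratio=4.0 means 4x smaller on the wire."""
    return gb_per_day / compression_ratio * RATES[path] * 30

daily_gb = 2_000  # 2 TB/day of backup traffic (hypothetical)

for path in RATES:
    raw = transfer_cost(daily_gb, path)
    compressed = transfer_cost(daily_gb, path, compression_ratio=4.0)
    print(f"{path:>12}: ${raw:>8,.2f}/mo raw, ${compressed:>8,.2f}/mo at 4x compression")
```

The two levers are independent and multiply: same-AZ placement removes the charge entirely, while compression divides whatever charge remains.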
- Co-locate in the same AZ — Run `kafka-backup` in the same availability zone as your Kafka brokers. This eliminates cross-AZ charges for the data read path.
```yaml
# Schedule kafka-backup into the same zone as the Kafka broker pods.
# (Assumes the brokers carry the label app: kafka-broker; adjust to
# match your deployment.)
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: kafka-broker
```
- Use VPC endpoints — Access S3, GCS, or Azure Blob via private endpoints rather than public internet. This eliminates NAT gateway charges and reduces latency.
```shell
# Create an S3 gateway VPC endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --service-name com.amazonaws.eu-west-1.s3 \
  --route-table-ids rtb-abc123
```
- Backup locally, then replicate — Write backups to a storage bucket in the same region as Kafka. Use storage-native cross-region replication (e.g., S3 CRR) for DR copies — this is cheaper and more reliable than backing up directly to a remote region.
- Compression reduces transfer cost proportionally — A 4x compression ratio means 75% less data traverses the network, reducing transfer charges by the same proportion.
Anti-patterns
- Direct cross-region backup — Streaming backup data from `eu-west-1` Kafka to a `us-east-1` bucket, paying cross-region transfer on every byte.
- No VPC endpoints — Routing S3 traffic through a NAT gateway, paying $0.045/GB for NAT processing on top of transfer charges.
- No transfer cost accounting — Tracking storage costs but ignoring the transfer charges that can exceed them.
CO-04: Backup Retention Policies
What
Define retention periods per topic or data classification, automating the deletion of backups that are no longer needed.
Why
Keeping all backups indefinitely is the single largest driver of runaway storage costs. A well-designed retention policy reduces storage by 60–80% while still meeting every compliance and operational requirement.
Implementation Guidance
Retention tiers by data classification:
| Tier | Example Topics | Retention | Justification |
|---|---|---|---|
| Compliance | financial-transactions, audit-log | 7 years | Regulatory requirement (e.g., SOX, MiFID II) |
| Critical | orders, payments, customer-updates | 90 days | Operational recovery and dispute resolution |
| Standard | user-events, page-views, click-stream | 30 days | Analytics replay, short-term debugging |
| Ephemeral | logs, metrics, health-checks | 7 days | Troubleshooting only, easily regenerated |
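At steady state, stored volume per tier is just daily backup volume times retention days, which makes the savings easy to model. The daily volumes below are hypothetical:

```python
# Steady-state storage under the tiered retention above versus a
# blanket 7-year policy. Daily GB volumes per tier are hypothetical.

# (daily GB produced, retention days) per classification tier
TIERS = {
    "compliance": (400, 2555),  # 7 years
    "critical":   (200, 90),
    "standard":   (500, 30),
    "ephemeral":  (800, 7),
}

def steady_state_gb(tiers):
    """At steady state, stored data = daily volume * retention days."""
    return sum(daily * days for daily, days in tiers.values())

tiered = steady_state_gb(TIERS)
keep_all = steady_state_gb({t: (daily, 2555) for t, (daily, _) in TIERS.items()})

print(f"tiered retention: {tiered:,} GB")
print(f"blanket 7-year:   {keep_all:,} GB")
print(f"reduction:        {100 * (1 - tiered / keep_all):.0f}%")
```

With this mix the tiered policy lands in the 60–80% reduction range quoted above; the exact figure depends entirely on how your volume is distributed across tiers.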
Automate with CRD configuration:
```yaml
apiVersion: kafka-backup.io/v1
kind: BackupPolicy
metadata:
  name: tiered-retention
spec:
  policies:
    - topicPattern: "financial-*"
      retention:
        maxAge: 2555d   # 7 years
    - topicPattern: "orders|payments|customer-*"
      retention:
        maxAge: 90d
    - topicPattern: "user-events|click-*"
      retention:
        maxAge: 30d
        maxCount: 30
    - topicPattern: "logs|metrics|health-*"
      retention:
        maxAge: 7d
        maxCount: 7
```
- Use S3 lifecycle as a safety net — Even with application-level retention, set a bucket-level expiration policy as a backstop to catch anything the application misses.
- Audit retention quarterly — Review which topics are being backed up, how much storage each consumes, and whether the retention period is still appropriate.
- Document justification — Record why each retention period was chosen. Compliance requirements change; undocumented policies cannot be reviewed.
- Support legal hold — Ensure your retention automation can be overridden for specific backups when legal hold is required (e.g., litigation or regulatory investigation).
Automated deletion is irreversible. Before enabling retention policies, verify that your compliance team has signed off on the retention periods for regulated data.
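A deletion decision that respects legal hold can be sketched as follows. The `Backup` shape and field names here are hypothetical, not kafka-backup's actual data model:

```python
# Sketch of a retention decision that honours legal hold. The Backup
# dataclass here is hypothetical, not kafka-backup's actual data model.

from dataclasses import dataclass

@dataclass
class Backup:
    topic: str
    age_days: int
    legal_hold: bool = False

def should_delete(backup: Backup, max_age_days: int) -> bool:
    """Delete only when past retention AND not under legal hold."""
    if backup.legal_hold:
        return False  # hold always wins, regardless of age
    return backup.age_days > max_age_days

assert should_delete(Backup("logs", age_days=10), max_age_days=7)
assert not should_delete(Backup("logs", age_days=3), max_age_days=7)
# A held backup survives even when far past its retention window.
assert not should_delete(Backup("audit-log", age_days=3000, legal_hold=True),
                         max_age_days=2555)
```

Making the hold check the first condition, rather than an afterthought, ensures no retention rule can ever delete held data.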
Anti-patterns
- Keep everything forever — The most expensive and least compliant approach. Indefinite retention increases both cost and risk surface.
- Manual cleanup — Relying on an engineer to periodically delete old backups. This never happens consistently.
- Same retention for all topics — Applying a 7-year retention to ephemeral logs because "it's easier than classifying topics".
CO-05: Cost Visibility & Governance
What
Implement tagging, dashboards, budgets, and review processes that make backup costs transparent and actionable.
Why
You cannot optimise what you cannot see. Without visibility, backup costs are absorbed into general cloud spend, optimisation opportunities go unnoticed, and there is no accountability for cost growth.
Implementation Guidance
Tag all resources consistently:
```yaml
# Standard tagging schema
tags:
  team: platform-engineering
  environment: production
  project: kafka-backup
  cost-centre: CC-4521
  managed-by: terraform
```
Build cost dashboards that show:
- Total backup cost per month (trend over 6+ months)
- Breakdown by category: storage, compute, network
- Cost per GB backed up (efficiency metric)
- Cost by environment (production vs staging vs development)
- Month-over-month growth rate
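The efficiency metric and growth rate above are straightforward to derive from the raw inputs. A sketch with hypothetical monthly figures:

```python
# Compute the cost-per-GB efficiency metric and month-over-month growth
# from monthly spend and backed-up volume. Figures are hypothetical.

def cost_per_gb(monthly_cost_usd: float, backed_up_gb: float) -> float:
    """Unit cost: dollars spent per GB successfully backed up."""
    return monthly_cost_usd / backed_up_gb

def mom_growth(previous: float, current: float) -> float:
    """Month-over-month growth rate as a fraction."""
    return (current - previous) / previous

last_month = cost_per_gb(480.0, 60_000)
this_month = cost_per_gb(540.0, 62_000)  # volume grew ~3%, spend grew 12.5%

print(f"cost/GB: ${this_month:.4f} (was ${last_month:.4f})")
print(f"unit-cost growth: {mom_growth(last_month, this_month):+.1%}")
```

Tracking the unit cost rather than raw spend separates healthy growth (more data backed up) from genuine inefficiency (the same data costing more).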
Set alerts and budgets:
```shell
# AWS Budgets example: alert at 80% of a $500/month budget scoped to
# the project tag
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "kafka-backup-monthly",
    "BudgetLimit": {"Amount": "500", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "TagKeyValue": ["user:project$kafka-backup"]
    }
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{
      "SubscriptionType": "EMAIL",
      "Address": "platform-team@example.com"
    }]
  }]'
```
- Monthly cost review — Include backup costs in your team's monthly operational review. Compare actual spend against budget and investigate variances.
- Cost per topic — Where possible, attribute storage costs to individual topics. This surfaces topics that are disproportionately expensive and drives conversations about retention and compression.
Use AWS Cost Explorer tag filtering, Azure Cost Management scopes, or GCP billing labels to create dedicated backup cost views without building custom dashboards.
Anti-patterns
- No tagging — Backup resources are untagged, making it impossible to separate backup costs from general infrastructure spend.
- Lumped into general spend — Backup costs are not broken out, so no one knows whether the $2,000/month increase is from backups, compute, or something else entirely.
- No budget or alerts — The team discovers cost overruns during quarterly business review instead of when they happen.
Review Questions
Use these questions to evaluate the cost optimisation of your Kafka backup architecture:
- Do you have lifecycle policies configured on all backup storage buckets?
- Can you state the cost per GB of backed-up data for each environment?
- Are compute resources right-sized to actual backup workload utilisation, with no more than 30% idle headroom?
- Are you using Spot or Preemptible instances for non-critical backup workloads?
- Is `kafka-backup` co-located in the same availability zone as the Kafka brokers it reads from?
- Are VPC endpoints configured for all storage access paths?
- Do you have documented and automated retention policies for every backed-up topic?
- Are all backup resources tagged with a consistent schema (team, environment, project, cost-centre)?
- Do you have budget alerts that fire before costs exceed your planned spend?
- Is backup cost reviewed as a standing agenda item in your monthly operational review?