
Cost Optimisation

"Managing and reducing the total cost of Kafka backup operations through efficient resource utilisation, smart storage tiering, and cost-aware architecture decisions."

Backup infrastructure can become a significant and often invisible cost centre. Without active cost management, storage grows unbounded, oversized compute runs around the clock, and network transfer charges accumulate unnoticed. The Cost Optimisation pillar ensures you achieve reliable data protection at the lowest reasonable cost by continuously measuring, right-sizing, and governing every component of your backup architecture.

Design Principles

  1. Implement cloud financial management — Tag every backup resource, track spending against budgets, and make cost data visible to the teams that control it.

  2. Adopt a consumption model — Pay only for what you use. Scale compute to match backup windows, use lifecycle policies to tier storage, and avoid paying for idle capacity.

  3. Measure overall efficiency — Define a unit cost metric such as cost per GB backed up. Track it over time and use it to evaluate architecture changes.

  4. Stop spending on undifferentiated heavy lifting — Use managed object storage (S3, GCS, Azure Blob) rather than self-hosted storage. Let the cloud provider handle durability, availability, and scaling.

  5. Analyse and attribute expenditure — Break down backup costs by team, environment, and topic. Attribution drives accountability and surfaces optimisation opportunities.

  6. Right-size retention to actual business need — Not all data needs the same retention period. Match retention to compliance, operational, and business requirements rather than applying a single blanket policy.


Best Practices

CO-01: Storage Cost Management

What

Minimise storage costs through lifecycle policies, compression, deduplication, and continuous monitoring — without compromising data durability or restore capability.

Why

Storage is typically the largest component of backup cost. A single unmanaged S3 bucket can grow from manageable to expensive within months. Lifecycle policies alone can reduce long-term storage costs by 95% compared to keeping everything in the default storage class.

Implementation Guidance

AWS S3 Lifecycle Tiers

| Age | Storage Class | Approx. Cost (USD/GB/mo) | Use Case |
|---|---|---|---|
| 0–30 days | S3 Standard | $0.023 | Active backups, frequent restores |
| 31–90 days | S3 Standard-IA | $0.0125 | Infrequent access, still fast retrieval |
| 91–365 days | S3 Glacier Instant Retrieval | $0.004 | Archival with millisecond access |
| 1+ years | S3 Glacier Deep Archive | $0.00099 | Long-term compliance, rare access |
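
The savings from tiering can be sanity-checked with a few lines of arithmetic, using the per-GB prices from the table above (dataset size is illustrative; check current pricing for your region):

```python
# Approximate S3 storage prices in USD per GB-month, from the table above.
PRICES = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb: float, storage_class: str) -> float:
    """Storage-at-rest cost for one month; excludes API and retrieval fees."""
    return gb * PRICES[storage_class]

gb = 10 * 1024  # a 10 TB backup set
standard = monthly_cost(gb, "STANDARD")      # ~ $235.52/month
deep = monthly_cost(gb, "DEEP_ARCHIVE")      # ~ $10.14/month
saving = 1 - deep / standard                 # ~ 95.7% reduction
print(f"Standard ${standard:.2f}, Deep Archive ${deep:.2f}, saving {saving:.1%}")
```

This is where the ~95% lifecycle saving quoted earlier comes from: the Deep Archive price is roughly 1/23rd of the Standard price.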

Azure Blob Storage Tiers

| Age | Access Tier | Use Case |
|---|---|---|
| 0–30 days | Hot | Active backups |
| 31–90 days | Cool | Infrequent access |
| 91–365 days | Cold | Archival with moderate retrieval time |
| 1+ years | Archive | Long-term compliance |

GCS Storage Classes

| Age | Storage Class | Use Case |
|---|---|---|
| 0–30 days | Standard | Active backups |
| 31–90 days | Nearline | Monthly access pattern |
| 91–365 days | Coldline | Quarterly access pattern |
| 1+ years | Archive | Annual access or compliance |

Compression and deduplication:

  • Enable compression in kafka-backup for a typical 3–5x reduction in stored data size
  • Use deduplication to eliminate redundant segments across incremental backups
  • Automate deletion of backups beyond retention policy
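
The effect of the 3–5x compression ratio compounds directly with storage pricing. A rough sketch, assuming S3 Standard pricing (actual ratios depend on your data):

```python
# Illustrative: storage cost impact of 3-5x compression at S3 Standard
# pricing ($0.023/GB-month).
PRICE_PER_GB = 0.023

def compressed_cost(raw_gb: float, ratio: float) -> float:
    """Monthly storage cost after compression at the given ratio."""
    return (raw_gb / ratio) * PRICE_PER_GB

raw_gb = 10 * 1024  # 10 TB of uncompressed Kafka segments
for ratio in (1, 3, 5):
    print(f"{ratio}x compression: ${compressed_cost(raw_gb, ratio):,.2f}/mo")
# 1x: $235.52, 3x: $78.51, 5x: $47.10
```

Combined with lifecycle tiering, compression multiplies the saving: 4x compression plus Deep Archive is roughly a 99% reduction versus uncompressed Standard storage.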

Understand total cost components:

  • Storage at rest (the largest component)
  • API calls (PUT, GET, LIST operations)
  • Data transfer (retrieval and cross-region)
  • Compute (the backup process itself)

Tip: Enable S3 Intelligent-Tiering for buckets where access patterns are unpredictable. It automatically moves objects between tiers based on access frequency, with no retrieval fees for the frequent and infrequent access tiers.

Configuration

S3 Lifecycle Policy:

```json
{
  "Rules": [
    {
      "ID": "kafka-backup-lifecycle",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "backups/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}
```

kafka-backup compression configuration:

```yaml
backup:
  compression:
    enabled: true
    algorithm: zstd
    level: 3
storage:
  s3:
    bucket: my-kafka-backups
    region: eu-west-1
    storage-class: STANDARD
```

Anti-patterns

  • All Standard, forever — Storing every backup in S3 Standard with no lifecycle policy. A 10 TB dataset costs ~$230/month in Standard vs ~$10/month in Deep Archive.
  • No lifecycle policies — Relying on manual cleanup that never happens. Storage grows linearly and silently.
  • No cost monitoring — Discovering a $5,000/month storage bill during quarterly budget review instead of catching it at $500.
  • No compression — Storing uncompressed Kafka segments when 3–5x compression is available with minimal CPU overhead.

CO-02: Compute Right-Sizing

What

Match compute resources to actual backup workload requirements, scaling up for backup windows and down during idle periods.

Why

Backup workloads are inherently bursty. Running large instances 24/7 for a workload that peaks during a two-hour backup window wastes 90% of the compute spend.
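
The waste figure follows directly from the duty cycle. A rough comparison, using a placeholder on-demand rate (in the ballpark of an m5.4xlarge; substitute your own instance pricing):

```python
# Illustrative cost of always-on vs. window-scheduled compute for a
# 2-hour daily backup window. The hourly rate is a placeholder.
HOURLY_RATE = 0.768
HOURS_PER_MONTH = 730

always_on = HOURLY_RATE * HOURS_PER_MONTH   # instance runs 24/7
scheduled = HOURLY_RATE * 2 * 30            # only the 2 h/day backup window
waste = 1 - scheduled / always_on           # fraction of spend that is idle time

print(f"Always-on: ${always_on:.0f}/mo, scheduled: ${scheduled:.0f}/mo, "
      f"idle waste: {waste:.0%}")
```

At a 2-hour daily window, roughly 92% of an always-on instance's cost buys idle time, which is the waste the scheduling guidance below eliminates.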

Implementation Guidance

  • Start with PE-04 sizing guidance — Use the performance efficiency pillar's sizing recommendations as a baseline, then refine based on observed utilisation.
  • Monitor actual utilisation — Track CPU, memory, and network usage during backup runs. If peak utilisation is below 50%, you are over-provisioned.
  • Right-size with 20–30% headroom — Allow enough capacity to handle spikes without throttling, but no more.

Kubernetes requests and limits:

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
```

  • Use Spot/Preemptible instances for non-critical workloads — Validation jobs and development backups tolerate interruption. Enable checkpointing so interrupted jobs resume rather than restart.

```yaml
# Kubernetes node affinity preferring Spot capacity. The label below is
# the EKS managed node group label; Karpenter uses karpenter.sh/capacity-type
# and GKE uses cloud.google.com/gke-spot.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
```

  • Schedule scale-up during backup windows — Use Kubernetes CronJobs or cluster autoscaler profiles to add capacity before backup runs and release it afterwards.
  • Leverage Rust efficiency — kafka-backup is written in Rust, which delivers significantly lower CPU and memory consumption than JVM-based alternatives. This translates directly into smaller instance sizes and lower compute costs.

Tip: Combine Spot instances with checkpointing for development and validation workloads. If a Spot instance is reclaimed, the job resumes from the last checkpoint rather than restarting from scratch.

Anti-patterns

  • Large instances 24/7 for a 2-hour daily backup — Running an m5.4xlarge around the clock when an m5.xlarge during the backup window would suffice.
  • No utilisation monitoring — Provisioning based on initial estimates and never revisiting. Workloads change; compute allocation should too.
  • Ignoring Spot/Preemptible — Paying full on-demand price for fault-tolerant workloads that can run on Spot at 60–90% discount.

CO-03: Network Transfer Costs

What

Minimise data transfer charges by co-locating backup components, using private endpoints, and compressing data before transmission.

Why

Cloud providers charge for data that crosses availability zone, region, or internet boundaries. For high-throughput Kafka clusters, transfer costs can rival or exceed storage costs if not managed carefully.

Implementation Guidance

Typical transfer pricing (AWS):

| Path | Approx. Cost (USD/GB) |
|---|---|
| Same AZ (private IP) | Free |
| Cross-AZ | $0.01 |
| Cross-region | $0.02–$0.09 |
| Internet egress | $0.05–$0.12 |

  • Co-locate in the same AZ — Run kafka-backup in the same availability zone as your Kafka brokers. This eliminates cross-AZ charges for the data read path.

```yaml
# Schedule kafka-backup into the same zone as the Kafka brokers by
# requiring co-location with pods labelled app: kafka (adjust the
# selector to match your broker pods).
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: kafka
```

  • Use VPC endpoints — Access S3, GCS, or Azure Blob via private endpoints rather than public internet. This eliminates NAT gateway charges and reduces latency.

```shell
# Create an S3 VPC Gateway Endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --service-name com.amazonaws.eu-west-1.s3 \
  --route-table-ids rtb-abc123
```

  • Backup locally, then replicate — Write backups to a storage bucket in the same region as Kafka. Use storage-native cross-region replication (e.g., S3 CRR) for DR copies — this is cheaper and more reliable than backing up directly to a remote region.
  • Compression reduces transfer cost proportionally — A 4x compression ratio means 75% less data traverses the network, reducing transfer charges by the same proportion.
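
The placement and compression points combine multiplicatively. A sketch with illustrative volumes, using the per-GB rates from the table above (cross-region assumed at the low end of its range):

```python
# Illustrative monthly transfer cost for moving 5 TB/month of raw backup
# data over different paths, before and after compression.
RATES = {"same_az": 0.0, "cross_az": 0.01, "cross_region": 0.02}

def transfer_cost(raw_gb: float, path: str, compression_ratio: float = 1.0) -> float:
    """Transfer charge, assuming data is compressed before it leaves the host."""
    return (raw_gb / compression_ratio) * RATES[path]

raw_gb = 5 * 1024
print(f"Cross-region, uncompressed: ${transfer_cost(raw_gb, 'cross_region'):.2f}/mo")
print(f"Cross-region, 4x compressed: ${transfer_cost(raw_gb, 'cross_region', 4):.2f}/mo")
print(f"Same-AZ (any compression):   ${transfer_cost(raw_gb, 'same_az'):.2f}/mo")
```

Note that same-AZ placement zeroes out the read-path charge entirely, which is why co-location comes first in the guidance above.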

Anti-patterns

  • Direct cross-region backup — Streaming backup data from eu-west-1 Kafka to a us-east-1 bucket, paying cross-region transfer on every byte.
  • No VPC endpoints — Routing S3 traffic through a NAT gateway, paying $0.045/GB for NAT processing on top of transfer charges.
  • No transfer cost accounting — Tracking storage costs but ignoring the transfer charges that can exceed them.

CO-04: Backup Retention Policies

What

Define retention periods per topic or data classification, automating the deletion of backups that are no longer needed.

Why

Keeping all backups indefinitely is the single largest driver of runaway storage costs. A well-designed retention policy reduces storage by 60–80% while still meeting every compliance and operational requirement.

Implementation Guidance

Retention tiers by data classification:

| Tier | Example Topics | Retention | Justification |
|---|---|---|---|
| Compliance | financial-transactions, audit-log | 7 years | Regulatory requirement (e.g., SOX, MiFID II) |
| Critical | orders, payments, customer-updates | 90 days | Operational recovery and dispute resolution |
| Standard | user-events, page-views, click-stream | 30 days | Analytics replay, short-term debugging |
| Ephemeral | logs, metrics, health-checks | 7 days | Troubleshooting only, easily regenerated |

Automate with CRD configuration:

```yaml
apiVersion: kafka-backup.io/v1
kind: BackupPolicy
metadata:
  name: tiered-retention
spec:
  policies:
    - topicPattern: "financial-*"
      retention:
        maxAge: 2555d
    - topicPattern: "orders|payments|customer-*"
      retention:
        maxAge: 90d
    - topicPattern: "user-events|click-*"
      retention:
        maxAge: 30d
        maxCount: 30
    - topicPattern: "logs|metrics|health-*"
      retention:
        maxAge: 7d
        maxCount: 7
```

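
The matching behaviour sketched below is an assumption for illustration — each topicPattern treated as a "|"-separated list of glob patterns, first match wins; consult the kafka-backup documentation for the actual pattern semantics:

```python
import fnmatch

# Hypothetical resolver mirroring the tiered BackupPolicy above.
POLICIES = [
    ("financial-*",                2555),  # compliance: 7 years
    ("orders|payments|customer-*",   90),  # critical
    ("user-events|click-*",          30),  # standard
    ("logs|metrics|health-*",         7),  # ephemeral
]

def retention_days(topic: str, default: int = 30) -> int:
    """First matching policy wins, like an ordered policy list."""
    for pattern, days in POLICIES:
        if any(fnmatch.fnmatch(topic, alt) for alt in pattern.split("|")):
            return days
    return default

print(retention_days("financial-transactions"))  # 2555
print(retention_days("payments"))                # 90
print(retention_days("click-stream"))            # 30
print(retention_days("unclassified-topic"))      # 30 (falls through to default)
```

The ordering matters: put the longest-retention (compliance) patterns first so a topic matching two tiers is never under-retained.
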
  • Use S3 lifecycle as a safety net — Even with application-level retention, set a bucket-level expiration policy as a backstop to catch anything the application misses.
  • Audit retention quarterly — Review which topics are being backed up, how much storage each consumes, and whether the retention period is still appropriate.
  • Document justification — Record why each retention period was chosen. Compliance requirements change; undocumented policies cannot be reviewed.
  • Support legal hold — Ensure your retention automation can be overridden for specific backups when legal hold is required (e.g., litigation or regulatory investigation).

Warning: Automated deletion is irreversible. Before enabling retention policies, verify that your compliance team has signed off on the retention periods for regulated data.

Anti-patterns

  • Keep everything forever — The most expensive and least compliant approach. Indefinite retention increases both cost and risk surface.
  • Manual cleanup — Relying on an engineer to periodically delete old backups. This never happens consistently.
  • Same retention for all topics — Applying a 7-year retention to ephemeral logs because "it's easier than classifying topics".

CO-05: Cost Visibility & Governance

What

Implement tagging, dashboards, budgets, and review processes that make backup costs transparent and actionable.

Why

You cannot optimise what you cannot see. Without visibility, backup costs are absorbed into general cloud spend, optimisation opportunities go unnoticed, and there is no accountability for cost growth.

Implementation Guidance

Tag all resources consistently:

```yaml
# Standard tagging schema
tags:
  team: platform-engineering
  environment: production
  project: kafka-backup
  cost-centre: CC-4521
  managed-by: terraform
```

Build cost dashboards that show:

  • Total backup cost per month (trend over 6+ months)
  • Breakdown by category: storage, compute, network
  • Cost per GB backed up (efficiency metric)
  • Cost by environment (production vs staging vs development)
  • Month-over-month growth rate
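
The unit metrics above can be computed from whatever billing export you already have. A minimal sketch with invented sample figures:

```python
# Two dashboard metrics: cost per GB backed up, and month-over-month
# cost growth. The sample rows are invented for illustration.
monthly = [
    {"month": "2024-01", "cost_usd": 410.0, "gb_backed_up": 18_000},
    {"month": "2024-02", "cost_usd": 455.0, "gb_backed_up": 21_000},
]

def cost_per_gb(row: dict) -> float:
    return row["cost_usd"] / row["gb_backed_up"]

def mom_growth(prev: dict, curr: dict) -> float:
    return curr["cost_usd"] / prev["cost_usd"] - 1

print(f"Jan: ${cost_per_gb(monthly[0]):.4f}/GB")
print(f"Feb: ${cost_per_gb(monthly[1]):.4f}/GB")
print(f"MoM cost growth: {mom_growth(monthly[0], monthly[1]):.1%}")
```

In this sample, total spend grew ~11% while cost per GB actually fell — exactly the distinction a unit cost metric surfaces that a raw spend figure hides.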

Set alerts and budgets:

# AWS Budget example
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "kafka-backup-monthly",
"BudgetLimit": {"Amount": "500", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {
"TagKeyValue": ["user:project$kafka-backup"]
}
}' \
--notifications-with-subscribers '[{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80
},
"Subscribers": [{
"SubscriptionType": "EMAIL",
"Address": "platform-team@example.com"
}]
}]'
  • Monthly cost review — Include backup costs in your team's monthly operational review. Compare actual spend against budget and investigate variances.
  • Cost per topic — Where possible, attribute storage costs to individual topics. This surfaces topics that are disproportionately expensive and drives conversations about retention and compression.
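
Where the backup layout keys objects by topic, attribution can be as simple as grouping a bucket listing by key prefix. A sketch assuming a hypothetical backups/&lt;topic&gt;/&lt;segment&gt; layout — adapt the key parsing and pricing to your actual setup:

```python
from collections import defaultdict

# Attribute storage-at-rest cost to topics from a bucket listing.
PRICE_PER_GB = 0.023  # S3 Standard; excludes API and transfer charges

objects = [  # (key, size in bytes), as returned by an object-listing call
    ("backups/orders/000001.seg", 4 * 1024**3),
    ("backups/orders/000002.seg", 3 * 1024**3),
    ("backups/click-stream/000001.seg", 40 * 1024**3),
]

cost_by_topic = defaultdict(float)
for key, size in objects:
    topic = key.split("/")[1]  # assumes backups/<topic>/<segment> keys
    cost_by_topic[topic] += size / 1024**3 * PRICE_PER_GB

# Most expensive topics first - these drive the retention conversation.
for topic, cost in sorted(cost_by_topic.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: ${cost:.2f}/mo")
```

Even with invented numbers the pattern is visible: a single high-volume topic (click-stream here) can dwarf the rest, making it the obvious first candidate for shorter retention or heavier compression.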

Tip: Use AWS Cost Explorer tag filtering, Azure Cost Management scopes, or GCP billing labels to create dedicated backup cost views without building custom dashboards.

Anti-patterns

  • No tagging — Backup resources are untagged, making it impossible to separate backup costs from general infrastructure spend.
  • Lumped into general spend — Backup costs are not broken out, so no one knows whether the $2,000/month increase is from backups, compute, or something else entirely.
  • No budget or alerts — The team discovers cost overruns during quarterly business review instead of when they happen.

Review Questions

Use these questions to evaluate the cost optimisation of your Kafka backup architecture:

  1. Do you have lifecycle policies configured on all backup storage buckets?
  2. Can you state the cost per GB of backed-up data for each environment?
  3. Are compute resources right-sized to actual backup workload utilisation, with no more than 30% idle headroom?
  4. Are you using Spot or Preemptible instances for non-critical backup workloads?
  5. Is kafka-backup co-located in the same availability zone as the Kafka brokers it reads from?
  6. Are VPC endpoints configured for all storage access paths?
  7. Do you have documented and automated retention policies for every backed-up topic?
  8. Are all backup resources tagged with a consistent schema (team, environment, project, cost-centre)?
  9. Do you have budget alerts that fire before costs exceed your planned spend?
  10. Is backup cost reviewed as a standing agenda item in your monthly operational review?

Resources