Cost Optimisation
"Managing and reducing the total cost of Kafka backup operations through efficient resource utilisation, smart storage tiering, and cost-aware architecture decisions."
Backup infrastructure can become a significant and often invisible cost centre. Without active cost management, storage grows unbounded, oversized compute runs around the clock, and network transfer charges accumulate unnoticed. The Cost Optimisation pillar ensures you achieve reliable data protection at the lowest reasonable cost by continuously measuring, right-sizing, and governing every component of your backup architecture.
Design Principles
- Implement cloud financial management — Tag every backup resource, track spending against budgets, and make cost data visible to the teams that control it.
- Adopt a consumption model — Pay only for what you use. Scale compute to match backup windows, use lifecycle policies to tier storage, and avoid paying for idle capacity.
- Measure overall efficiency — Define a unit cost metric such as cost per GB backed up. Track it over time and use it to evaluate architecture changes.
- Stop spending on undifferentiated heavy lifting — Use managed object storage (S3, GCS, Azure Blob) rather than self-hosted storage. Let the cloud provider handle durability, availability, and scaling.
- Analyse and attribute expenditure — Break down backup costs by team, environment, and topic. Attribution drives accountability and surfaces optimisation opportunities.
- Right-size retention to actual business need — Not all data needs the same retention period. Match retention to compliance, operational, and business requirements rather than applying a single blanket policy.
Best Practices
CO-01: Storage Cost Management
What
Minimise storage costs through lifecycle policies, compression, deduplication, and continuous monitoring — without compromising data durability or restore capability.
Why
Storage is typically the largest component of backup cost. A single unmanaged S3 bucket can grow from manageable to expensive within months. Lifecycle policies alone can reduce long-term storage costs by 95% compared to keeping everything in the default storage class.
Implementation Guidance
AWS S3 Lifecycle Tiers
| Age | Storage Class | Approx. Cost (USD/GB/mo) | Use Case |
|---|---|---|---|
| 0–30 days | S3 Standard | $0.023 | Active backups, frequent restores |
| 31–90 days | S3 Standard-IA | $0.0125 | Infrequent access, still fast retrieval |
| 91–365 days | S3 Glacier Instant Retrieval | $0.004 | Archival with millisecond access |
| 1+ years | S3 Glacier Deep Archive | $0.00099 | Long-term compliance, rare access |
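To make the table concrete, here is a quick sketch comparing flat S3 Standard storage against the tiered layout for a fixed dataset. Prices are the indicative per-GB rates from the table above; actual pricing varies by region, and the age distribution assumed here is illustrative.

```python
# Illustrative comparison of flat vs tiered storage cost for 10 TB of
# backups spread evenly over one year of daily snapshots.
# Prices (USD/GB/month) are taken from the table above.

PRICES = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(gb_per_tier: dict) -> float:
    """Sum storage-at-rest cost across tiers for one month."""
    return sum(gb * PRICES[tier] for tier, gb in gb_per_tier.items())

total_gb = 10_240  # 10 TB

# Everything left in Standard, forever.
flat = monthly_cost({"STANDARD": total_gb})

# Data spread by age: 30/365 of it in Standard, 60/365 in Standard-IA,
# the remaining 275/365 in Glacier IR (first-year steady state,
# nothing old enough for Deep Archive yet).
tiered = monthly_cost({
    "STANDARD": total_gb * 30 / 365,
    "STANDARD_IA": total_gb * 60 / 365,
    "GLACIER_IR": total_gb * 275 / 365,
})

print(f"flat:   ${flat:,.2f}/month")
print(f"tiered: ${tiered:,.2f}/month")
print(f"saving: {100 * (1 - tiered / flat):.0f}%")
```

Even before anything reaches Deep Archive, tiering cuts the bill by roughly two thirds; the 95% figure quoted above applies once old data settles into the deepest tier.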
Azure Blob Storage Tiers
| Age | Access Tier | Use Case |
|---|---|---|
| 0–30 days | Hot | Active backups |
| 31–90 days | Cool | Infrequent access |
| 91–365 days | Cold | Archival with moderate retrieval time |
| 1+ years | Archive | Long-term compliance |
GCS Storage Classes
| Age | Storage Class | Use Case |
|---|---|---|
| 0–30 days | Standard | Active backups |
| 31–90 days | Nearline | Monthly access pattern |
| 91–365 days | Coldline | Quarterly access pattern |
| 1+ years | Archive | Annual access or compliance |
Compression and deduplication:
- Enable compression in `kafka-backup` for a typical 3–5x reduction in stored data size
- Use deduplication to eliminate redundant segments across incremental backups
- Automate deletion of backups beyond retention policy
Understand total cost components:
- Storage at rest (the largest component)
- API calls (PUT, GET, LIST operations)
- Data transfer (retrieval and cross-region)
- Compute (the backup process itself)
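These components can be rolled into a simple estimator. All the rates below are illustrative placeholders, not quoted prices; substitute your provider's actual pricing.

```python
# Rough monthly cost model for a backup pipeline: storage at rest,
# API calls, data transfer, and compute. All rates are illustrative
# placeholders; substitute your provider's actual pricing.

def backup_monthly_cost(
    stored_gb: float,
    put_requests: int,
    transfer_gb: float,
    compute_hours: float,
    storage_rate=0.023,      # USD per GB-month at rest
    put_rate=0.005 / 1000,   # USD per PUT request
    transfer_rate=0.01,      # USD per GB (e.g. cross-AZ)
    compute_rate=0.10,       # USD per hour for the backup instance
) -> dict:
    costs = {
        "storage": stored_gb * storage_rate,
        "api": put_requests * put_rate,
        "transfer": transfer_gb * transfer_rate,
        "compute": compute_hours * compute_rate,
    }
    costs["total"] = sum(costs.values())
    return costs

costs = backup_monthly_cost(
    stored_gb=5_000, put_requests=2_000_000, transfer_gb=3_000, compute_hours=60
)
for item, usd in costs.items():
    print(f"{item:>8}: ${usd:,.2f}")
```

Running a model like this monthly makes it obvious which component dominates, and therefore where optimisation effort pays off first.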
Enable S3 Intelligent-Tiering for buckets where access patterns are unpredictable. It automatically moves objects between tiers based on access frequency, with no retrieval fees for the frequent and infrequent access tiers.
Configuration
S3 Lifecycle Policy:
```json
{
  "Rules": [
    {
      "ID": "kafka-backup-lifecycle",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "backups/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}
```
kafka-backup compression configuration:
```yaml
backup:
  compression:
    enabled: true
    algorithm: zstd
    level: 3
storage:
  s3:
    bucket: my-kafka-backups
    region: eu-west-1
    storage-class: STANDARD
```
Anti-patterns
- All Standard, forever — Storing every backup in S3 Standard with no lifecycle policy. A 10 TB dataset costs ~$230/month in Standard vs ~$10/month in Deep Archive.
- No lifecycle policies — Relying on manual cleanup that never happens. Storage grows linearly and silently.
- No cost monitoring — Discovering a $5,000/month storage bill during quarterly budget review instead of catching it at $500.
- No compression — Storing uncompressed Kafka segments when 3–5x compression is available with minimal CPU overhead.
CO-02: Compute Right-Sizing
What
Match compute resources to actual backup workload requirements, scaling up for backup windows and down during idle periods.
Why
Backup workloads are inherently bursty. Running large instances 24/7 for a workload that peaks during a two-hour backup window wastes 90% of the compute spend.
Implementation Guidance
- Start with PE-04 sizing guidance — Use the performance efficiency pillar's sizing recommendations as a baseline, then refine based on observed utilisation.
- Monitor actual utilisation — Track CPU, memory, and network usage during backup runs. If peak utilisation is below 50%, you are over-provisioned.
- Right-size with 20–30% headroom — Allow enough capacity to handle spikes without throttling, but no more.
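The headroom rule is simple arithmetic. A sketch, using hypothetical utilisation numbers:

```python
# Derive a right-sized CPU request from observed peak usage plus
# 20-30% headroom, as described above. All numbers are hypothetical.

def right_size(observed_peak: float, headroom: float = 0.25) -> float:
    """Return an allocation sized to observed peak usage plus headroom."""
    return observed_peak * (1 + headroom)

current_request = 4.0   # CPUs currently provisioned
observed_peak = 1.6     # CPUs actually used during the backup window

recommended = right_size(observed_peak)
utilisation = observed_peak / current_request

print(f"peak utilisation: {utilisation:.0%}")  # below 50% means over-provisioned
print(f"recommended request: {recommended:.1f} CPUs (was {current_request})")
```

The same calculation applies to memory and network bandwidth; re-run it after each significant workload change.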
Kubernetes requests and limits:
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
```
- Use Spot/Preemptible instances for non-critical workloads — Validation jobs and development backups tolerate interruption. Enable checkpointing so interrupted jobs resume rather than restart.
```yaml
# Kubernetes node affinity preferring Spot capacity. The label shown is
# the one EKS managed node groups apply; other providers use their own
# (e.g. cloud.google.com/gke-spot on GKE).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
```
- Schedule scale-up during backup windows — Use Kubernetes CronJobs or cluster autoscaler profiles to add capacity before backup runs and release it afterwards.
- Leverage Rust efficiency — `kafka-backup` is written in Rust, which delivers significantly lower CPU and memory consumption than JVM-based alternatives. This translates directly into smaller instance sizes and lower compute costs.
Combine Spot instances with checkpointing for development and validation workloads. If a Spot instance is reclaimed, the job resumes from the last checkpoint rather than restarting from scratch.
Anti-patterns
- Large instances 24/7 for a 2-hour daily backup — Running an `m5.4xlarge` around the clock when an `m5.xlarge` during the backup window would suffice.
- No utilisation monitoring — Provisioning based on initial estimates and never revisiting. Workloads change; compute allocation should too.
- Ignoring Spot/Preemptible — Paying full on-demand price for fault-tolerant workloads that can run on Spot at 60–90% discount.
CO-03: Network Transfer Costs
What
Minimise data transfer charges by co-locating backup components, using private endpoints, and compressing data before transmission.
Why
Cloud providers charge for data that crosses availability zone, region, or internet boundaries. For high-throughput Kafka clusters, transfer costs can rival or exceed storage costs if not managed carefully.
Implementation Guidance
Typical transfer pricing (AWS):
| Path | Approx. Cost (USD/GB) |
|---|---|
| Same AZ (private IP) | Free |
| Cross-AZ | $0.01 |
| Cross-region | $0.02–$0.09 |
| Internet egress | $0.05–$0.12 |
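A quick sketch of how placement and compression interact, using the indicative per-GB rates from the table above and a hypothetical traffic volume:

```python
# Monthly transfer cost for a backup stream under different placements,
# using the indicative per-GB rates from the table above. Compression
# shrinks the bytes on the wire, so its saving applies proportionally.

RATES = {"same_az": 0.0, "cross_az": 0.01, "cross_region": 0.09}

def transfer_cost(gb_per_day: float, path: str, compression_ratio: float = 1.0) -> float:
    """Monthly transfer cost; compression_ratio=4.0 means 4x smaller on the wire."""
    return gb_per_day / compression_ratio * RATES[path] * 30

daily_gb = 2_000  # 2 TB/day of backup traffic (hypothetical)

for path in RATES:
    raw = transfer_cost(daily_gb, path)
    compressed = transfer_cost(daily_gb, path, compression_ratio=4.0)
    print(f"{path:>12}: ${raw:>8,.2f}/mo raw, ${compressed:>8,.2f}/mo at 4x compression")
```

The two levers are independent and multiply: same-AZ placement removes the charge entirely, while compression divides whatever charge remains.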
- Co-locate in the same AZ — Run `kafka-backup` in the same availability zone as your Kafka brokers. This eliminates cross-AZ charges for the data read path.
```yaml
# Schedule kafka-backup into the same zone as the Kafka broker pods.
# (Assumes the brokers carry the label app: kafka-broker; adjust to
# match your deployment.)
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: kafka-broker
```
- Use VPC endpoints — Access S3, GCS, or Azure Blob via private endpoints rather than public internet. This eliminates NAT gateway charges and reduces latency.
```shell
# Create an S3 gateway VPC endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --service-name com.amazonaws.eu-west-1.s3 \
  --route-table-ids rtb-abc123
```
- Backup locally, then replicate — Write backups to a storage bucket in the same region as Kafka. Use storage-native cross-region replication (e.g., S3 CRR) for DR copies — this is cheaper and more reliable than backing up directly to a remote region.
- Compression reduces transfer cost proportionally — A 4x compression ratio means 75% less data traverses the network, reducing transfer charges by the same proportion.
Anti-patterns
- Direct cross-region backup — Streaming backup data from `eu-west-1` Kafka to a `us-east-1` bucket, paying cross-region transfer on every byte.
- No VPC endpoints — Routing S3 traffic through a NAT gateway, paying $0.045/GB for NAT processing on top of transfer charges.
- No transfer cost accounting — Tracking storage costs but ignoring the transfer charges that can exceed them.
CO-04: Backup Retention Policies
What
Define retention periods per topic or data classification, automating the deletion of backups that are no longer needed.
Why
Keeping all backups indefinitely is the single largest driver of runaway storage costs. A well-designed retention policy reduces storage by 60–80% while still meeting every compliance and operational requirement.
Implementation Guidance
Retention tiers by data classification:
| Tier | Example Topics | Retention | Justification |
|---|---|---|---|
| Compliance | financial-transactions, audit-log | 7 years | Regulatory requirement (e.g., SOX, MiFID II) |
| Critical | orders, payments, customer-updates | 90 days | Operational recovery and dispute resolution |
| Standard | user-events, page-views, click-stream | 30 days | Analytics replay, short-term debugging |
| Ephemeral | logs, metrics, health-checks | 7 days | Troubleshooting only, easily regenerated |
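At steady state, stored volume per tier is just daily backup volume times retention days, which makes the savings easy to model. The daily volumes below are hypothetical:

```python
# Steady-state storage under the tiered retention above versus a
# blanket 7-year policy. Daily GB volumes per tier are hypothetical.

# (daily GB produced, retention days) per classification tier
TIERS = {
    "compliance": (400, 2555),  # 7 years
    "critical":   (200, 90),
    "standard":   (500, 30),
    "ephemeral":  (800, 7),
}

def steady_state_gb(tiers):
    """At steady state, stored data = daily volume * retention days."""
    return sum(daily * days for daily, days in tiers.values())

tiered = steady_state_gb(TIERS)
keep_all = steady_state_gb({t: (daily, 2555) for t, (daily, _) in TIERS.items()})

print(f"tiered retention: {tiered:,} GB")
print(f"blanket 7-year:   {keep_all:,} GB")
print(f"reduction:        {100 * (1 - tiered / keep_all):.0f}%")
```

With this mix the tiered policy lands in the 60–80% reduction range quoted above; the exact figure depends entirely on how your volume is distributed across tiers.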
Automate with CRD configuration:
```yaml
apiVersion: kafka-backup.io/v1
kind: BackupPolicy
metadata:
  name: tiered-retention
spec:
  policies:
    - topicPattern: "financial-*"
      retention:
        maxAge: 2555d   # 7 years
    - topicPattern: "orders|payments|customer-*"
      retention:
        maxAge: 90d
    - topicPattern: "user-events|click-*"
      retention:
        maxAge: 30d
        maxCount: 30
    - topicPattern: "logs|metrics|health-*"
      retention:
        maxAge: 7d
        maxCount: 7
```
- Use S3 lifecycle as a safety net — Even with application-level retention, set a bucket-level expiration policy as a backstop to catch anything the application misses.
- Audit retention quarterly — Review which topics are being backed up, how much storage each consumes, and whether the retention period is still appropriate.
- Document justification — Record why each retention period was chosen. Compliance requirements change; undocumented policies cannot be reviewed.
- Support legal hold — Ensure your retention automation can be overridden for specific backups when legal hold is required (e.g., litigation or regulatory investigation).
Automated deletion is irreversible. Before enabling retention policies, verify that your compliance team has signed off on the retention periods for regulated data.
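A deletion decision that respects legal hold can be sketched as follows. The `Backup` shape and field names here are hypothetical, not kafka-backup's actual data model:

```python
# Sketch of a retention decision that honours legal hold. The Backup
# dataclass here is hypothetical, not kafka-backup's actual data model.

from dataclasses import dataclass

@dataclass
class Backup:
    topic: str
    age_days: int
    legal_hold: bool = False

def should_delete(backup: Backup, max_age_days: int) -> bool:
    """Delete only when past retention AND not under legal hold."""
    if backup.legal_hold:
        return False  # hold always wins, regardless of age
    return backup.age_days > max_age_days

assert should_delete(Backup("logs", age_days=10), max_age_days=7)
assert not should_delete(Backup("logs", age_days=3), max_age_days=7)
# A held backup survives even when far past its retention window.
assert not should_delete(Backup("audit-log", age_days=3000, legal_hold=True),
                         max_age_days=2555)
```

Making the hold check the first condition, rather than an afterthought, ensures no retention rule can ever delete held data.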
Anti-patterns
- Keep everything forever — The most expensive and least compliant approach. Indefinite retention increases both cost and risk surface.
- Manual cleanup — Relying on an engineer to periodically delete old backups. This never happens consistently.
- Same retention for all topics — Applying a 7-year retention to ephemeral logs because "it's easier than classifying topics".
CO-05: Cost Visibility & Governance
What
Implement tagging, dashboards, budgets, and review processes that make backup costs transparent and actionable.
Why
You cannot optimise what you cannot see. Without visibility, backup costs are absorbed into general cloud spend, optimisation opportunities go unnoticed, and there is no accountability for cost growth.
Implementation Guidance
Tag all resources consistently:
```yaml
# Standard tagging schema
tags:
  team: platform-engineering
  environment: production
  project: kafka-backup
  cost-centre: CC-4521
  managed-by: terraform
```
Build cost dashboards that show:
- Total backup cost per month (trend over 6+ months)
- Breakdown by category: storage, compute, network
- Cost per GB backed up (efficiency metric)
- Cost by environment (production vs staging vs development)
- Month-over-month growth rate
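The efficiency metric and growth rate above are straightforward to derive from the raw inputs. A sketch with hypothetical monthly figures:

```python
# Compute the cost-per-GB efficiency metric and month-over-month growth
# from monthly spend and backed-up volume. Figures are hypothetical.

def cost_per_gb(monthly_cost_usd: float, backed_up_gb: float) -> float:
    """Unit cost: dollars spent per GB successfully backed up."""
    return monthly_cost_usd / backed_up_gb

def mom_growth(previous: float, current: float) -> float:
    """Month-over-month growth rate as a fraction."""
    return (current - previous) / previous

last_month = cost_per_gb(480.0, 60_000)
this_month = cost_per_gb(540.0, 62_000)  # volume grew ~3%, spend grew 12.5%

print(f"cost/GB: ${this_month:.4f} (was ${last_month:.4f})")
print(f"unit-cost growth: {mom_growth(last_month, this_month):+.1%}")
```

Tracking the unit cost rather than raw spend separates healthy growth (more data backed up) from genuine inefficiency (the same data costing more).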
Set alerts and budgets:
```shell
# AWS Budgets example: alert at 80% of a $500/month budget scoped to
# the project tag
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "kafka-backup-monthly",
    "BudgetLimit": {"Amount": "500", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "TagKeyValue": ["user:project$kafka-backup"]
    }
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{
      "SubscriptionType": "EMAIL",
      "Address": "platform-team@example.com"
    }]
  }]'
```
- Monthly cost review — Include backup costs in your team's monthly operational review. Compare actual spend against budget and investigate variances.
- Cost per topic — Where possible, attribute storage costs to individual topics. This surfaces topics that are disproportionately expensive and drives conversations about retention and compression.
Use AWS Cost Explorer tag filtering, Azure Cost Management scopes, or GCP billing labels to create dedicated backup cost views without building custom dashboards.
Anti-patterns
- No tagging — Backup resources are untagged, making it impossible to separate backup costs from general infrastructure spend.
- Lumped into general spend — Backup costs are not broken out, so no one knows whether the $2,000/month increase is from backups, compute, or something else entirely.
- No budget or alerts — The team discovers cost overruns during quarterly business review instead of when they happen.
Review Questions
Use these questions to evaluate the cost optimisation of your Kafka backup architecture:
- Do you have lifecycle policies configured on all backup storage buckets?
- Can you state the cost per GB of backed-up data for each environment?
- Are compute resources right-sized to actual backup workload utilisation, with no more than 30% idle headroom?
- Are you using Spot or Preemptible instances for non-critical backup workloads?
- Is `kafka-backup` co-located in the same availability zone as the Kafka brokers it reads from?
- Are VPC endpoints configured for all storage access paths?
- Do you have documented and automated retention policies for every backed-up topic?
- Are all backup resources tagged with a consistent schema (team, environment, project, cost-centre)?
- Do you have budget alerts that fire before costs exceed your planned spend?
- Is backup cost reviewed as a standing agenda item in your monthly operational review?