June 1, 2026
Datadog Bill Too High? Start With Logs, Custom Metrics, and Kubernetes Noise
A diagnostic guide to finding the real Datadog cost drivers before migrating: logs, custom metrics, high-cardinality tags, Kubernetes churn, APM traces, and duplicate cloud telemetry.
For the broader framework, see Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana.
A high Datadog bill is usually not caused by one bad dashboard or one expensive feature. It is usually caused by unmanaged telemetry volume.
Teams add logs, metrics, traces, Kubernetes integrations, cloud integrations, APM, and custom tags over time. Each addition looks harmless by itself. The bill becomes painful when the same environment starts sending high-volume logs, high-cardinality custom metrics, container churn, duplicated cloud telemetry, and broad APM tracing into the same premium observability platform.
The first move should not be a blind migration to open source. That only moves the mess somewhere else. The first move is cost attribution: identify which services, teams, environments, tags, logs, metrics, traces, and retention policies are driving usage.
Why Datadog costs grow
Datadog is useful because it combines infrastructure monitoring, logs, metrics, APM traces, dashboards, monitors, and incident context. That same integration is also why cost can grow quickly. A single noisy service can affect multiple billing dimensions at once.
The most common cost drivers are:
- log ingestion and indexing
- custom metrics
- high-cardinality tags
- Kubernetes container churn
- APM trace volume
- duplicate telemetry from cloud integrations and agents
- unused dashboards, monitors, and log-based metrics
- non-production environments treated like production
The practical question is not “why is Datadog expensive?” The practical question is “which telemetry is valuable enough to justify Datadog, and which telemetry should be reduced, sampled, archived, or moved elsewhere?”
Cost driver 1: logs
Logs are often the first place the bill goes wrong.
Datadog log cost has two main layers: ingestion (the logs submitted to Datadog) and indexing (the logs retained for search, dashboards, and monitors). Exclusion filters reduce indexed volume, but the excluded logs are still ingested, so filters are not a substitute for controlling what gets collected and forwarded in the first place.
Common log cost problems:
- DEBUG and TRACE logs left enabled in production
- HTTP 200 access logs indexed at full volume
- Kubernetes liveness and readiness probes
- ingress controller and load balancer access logs
- VPC Flow Logs, CDN logs, DNS logs, and firewall logs sent without filtering
- non-production logs kept with production retention
- logs with large unused JSON fields
- logs used as a poor substitute for metrics
Start by grouping log usage by service, source, environment, status code, index, and team. The top five producers usually explain most of the problem.
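As a rough illustration of that grouping step, the sketch below aggregates usage records by service and environment and reports each top producer's share of the total. The record fields (`service`, `env`, `bytes`) and the sample numbers are assumptions made for this example, not a Datadog export format.

```python
from collections import defaultdict

def top_log_producers(records, keys=("service", "env"), n=5):
    """Group log volume by the given keys and return the n largest groups.

    Each result is (group_key, bytes, share_of_total).
    """
    totals = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += rec["bytes"]

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (group, volume, volume / grand_total if grand_total else 0.0)
        for group, volume in ranked[:n]
    ]

# Hypothetical sample data; real input would come from your own usage export.
sample = [
    {"service": "checkout", "env": "prod", "bytes": 700},
    {"service": "ingress", "env": "prod", "bytes": 200},
    {"service": "checkout", "env": "dev", "bytes": 50},
    {"service": "ingress", "env": "prod", "bytes": 50},
]

for group, volume, share in top_log_producers(sample, n=2):
    print(group, volume, f"{share:.0%}")
```

Changing `keys` to include index, source, or status code gives the other cuts mentioned above without touching the aggregation logic.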
What to do first
Do not delete logs randomly. Classify them.
| Log type | Typical action |
|---|---|
| Production errors, HTTP 5xx, auth failures, admin actions | Keep searchable and alertable |
| HTTP 200/302 access logs, normal info logs | Sample, shorten retention, or move to cheaper storage |
| Health checks, readiness probes, repetitive DEBUG logs | Drop at the edge or archive only |
| Audit, payment, identity, firewall, security logs | Preserve according to retention and investigation needs |
If a log stream is needed only for historical lookup, it may not need to live in a hot Datadog index. If a log stream is repetitive operational noise, reduce it before it reaches Datadog.
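The classification table can be expressed as a small routing function. This is a minimal sketch, assuming each log record is a dictionary with optional `source`, `level`, `status_code`, and `path` fields; the source names and health check paths are hypothetical and would need to match your own environment.

```python
PROTECTED_SOURCES = {"audit", "payment", "identity", "firewall", "security"}
HEALTH_PATHS = {"/healthz", "/readyz", "/livez"}  # hypothetical probe paths

def classify_log(record):
    """Return one of: 'preserve', 'keep', 'drop', 'sample'."""
    source = record.get("source", "")
    level = record.get("level", "").upper()
    status = record.get("status_code")
    path = record.get("path", "")

    # Audit and security streams follow retention policy, not cost pressure.
    if source in PROTECTED_SOURCES:
        return "preserve"
    # Errors, 5xx responses, and auth failures stay searchable and alertable.
    if level in {"ERROR", "CRITICAL"}:
        return "keep"
    if status is not None and (status >= 500 or status in {401, 403}):
        return "keep"
    # Health checks and debug chatter are dropped at the edge (or archived only).
    if path in HEALTH_PATHS or level in {"DEBUG", "TRACE"}:
        return "drop"
    # Everything else (2xx/3xx access logs, routine info logs) is sampled.
    return "sample"
```

The order of the checks is the design choice that matters: protected and error paths are evaluated first, so a noisy rule further down can never drop a log that an investigation would need.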
Cost driver 2: custom metrics
Custom metrics are where small tagging mistakes become expensive.
A custom metric is not just a metric name. Datadog counts each unique combination of metric name and tag values as a separate custom metric, so a single metric name can become thousands or millions of time series when teams add uncontrolled tags.
Safe tags usually describe stable operational dimensions:
- env
- service
- region
- status_code
- customer_tier
- cluster
Dangerous tags usually describe unique or fast-changing values:
- user_id
- session_id
- request_id
- pod_uid
- container_id
- raw URL paths with IDs
- timestamps
- transaction hashes
Example: a metric named api.request.latency with 10 endpoints, 5 status codes, and 3 customer tiers can create up to 10 × 5 × 3 = 150 time series. Add a tag with 10,000 distinct user IDs, and the upper bound for the same metric becomes 1,500,000 time series. That is the usual shape of a custom metric surprise.
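The arithmetic behind that example is a product of distinct values per tag. The sketch below computes that upper bound; the tag names mirror the example, and the result is a worst case because not every combination necessarily occurs in practice.

```python
from math import prod

def max_series(tag_cardinalities):
    """Upper bound on time series: the product of distinct values per tag."""
    return prod(tag_cardinalities.values())

base = {"endpoint": 10, "status_code": 5, "customer_tier": 3}

print(max_series(base))                         # 150
print(max_series({**base, "user_id": 10_000}))  # 1500000
```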
What to do first
Use Datadog’s metric volume and custom metric governance tools to find the largest metric names by estimated cardinality. Then classify the tags:
| Tag type | Action |
|---|---|
| Stable operational tag | Usually keep |
| Rarely queried tag | Remove from indexing or drop upstream |
| Unique identifier | Move to logs/traces, not metrics |
| Debug-only dimension | Remove from production metrics |
| Business analytics dimension | Send to BI/analytics, not infrastructure monitoring |
Metrics without Limits can help reduce indexed custom metric cardinality without forcing every application team to immediately change code. That is useful as a first containment step. It is not a replacement for tag governance.
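Before changing any configuration, it can help to measure which tag keys actually carry the cardinality. The following is a sketch under stated assumptions: `observed` stands in for a sample of tag sets pulled from your own metric submissions, and the threshold of 100 distinct values is an arbitrary starting point, not a Datadog limit.

```python
from collections import defaultdict

def tag_cardinality(points):
    """Count distinct values seen per tag key across observed tag sets."""
    values = defaultdict(set)
    for tags in points:
        for key, value in tags.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}

def flag_tags(points, threshold=100):
    """Return (tag_key, distinct_count) pairs above the threshold, worst first."""
    counts = tag_cardinality(points)
    flagged = [(key, count) for key, count in counts.items() if count > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample: one stable env/service pair and 500 distinct user IDs.
observed = [
    {"env": "prod", "service": "api", "user_id": f"u{i}"} for i in range(500)
]

print(flag_tags(observed))  # [('user_id', 500)]
```

Tags that show up in this list are the ones to run through the classification table above: keep, stop indexing, or move to logs and traces.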
Cost driver 3: high-cardinality Kubernetes telemetry
Kubernetes makes cost attribution harder because infrastructure changes constantly.
Pods start and stop. Containers restart. Deployments create new replica sets. Jobs run briefly and disappear. Sidecars multiply telemetry volume. Labels and annotations can create many dimensions. A cluster can be healthy from an application perspective while still generating wasteful telemetry.
Common Kubernetes cost drivers:
- high-density nodes with many containers
- short-lived jobs and ephemeral pods
- CrashLoopBackOff churn
- sidecars collected at full detail
- labels that include unique build, pod, or deployment identifiers
- logs from kube-system and platform namespaces
- collecting metrics and logs from development namespaces the same way as production
What to do first
Review which containers and namespaces actually need Datadog collection. Exclude low-value workloads and sidecars where possible. Use separate rules for logs and metrics.
Practical candidates for exclusion or reduction:
- sandbox namespaces
- short-lived CI/test jobs
- noisy sidecars
- health check containers
- development workloads
- platform components already monitored elsewhere
The point is not to make Kubernetes invisible. The point is to stop treating every pod, sidecar, and ephemeral job as premium telemetry.
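One way to make "separate rules for logs and metrics" concrete is to write the decision down as code before translating it into Agent configuration. This is an illustrative sketch only: the namespace and sidecar names are hypothetical, and the function does not talk to Kubernetes or Datadog.

```python
LOW_VALUE_NAMESPACES = {"sandbox", "dev", "ci"}    # hypothetical namespace names
NOISY_SIDECARS = {"istio-proxy", "linkerd-proxy"}  # hypothetical container names

def collection_plan(namespace, container):
    """Decide, separately, whether to collect logs and metrics for a container."""
    if namespace in LOW_VALUE_NAMESPACES:
        # Sandbox, CI, and development workloads get no premium telemetry.
        return {"logs": False, "metrics": False}
    if container in NOISY_SIDECARS:
        # Keep sidecar metrics for basic health, but skip their verbose logs.
        return {"logs": False, "metrics": True}
    return {"logs": True, "metrics": True}

print(collection_plan("sandbox", "app"))           # {'logs': False, 'metrics': False}
print(collection_plan("payments", "istio-proxy"))  # {'logs': False, 'metrics': True}
print(collection_plan("payments", "api"))          # {'logs': True, 'metrics': True}
```

The resulting plan would then be expressed in whatever container include and exclude settings your Agent version supports; check the Agent documentation for the exact syntax rather than relying on this sketch.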
Cost driver 4: APM traces and span volume
APM is valuable when it helps engineering teams understand latency, errors, dependencies, and root cause. It becomes expensive when every successful request from every service is retained at high volume.
Trace costs usually grow because of:
- high request volume
- many microservices per request
- full retention of successful HTTP 200 traces
- tracing health checks and low-value endpoints
- duplicate instrumentation
- too many generated span metrics
- no sampling policy by service criticality
What to do first
Keep error and latency visibility. Reduce routine success noise.
A sane APM policy usually keeps:
- 100 percent of error traces
- high-latency outliers
- traces for revenue-critical workflows
- traces for new or unstable services
A sane policy usually samples or drops:
- successful health check traces
- repetitive polling endpoints
- known low-value background jobs
- high-volume successful requests with no diagnostic value
Aggressive sampling is fine only when failure paths are preserved. Dropping successful traces is different from dropping the traces that explain a production outage.
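The keep and sample lists above can be sketched as a single retention decision. This is a simplified illustration, not Datadog's sampling mechanism: the trace fields (`error`, `duration_ms`, `service`, `path`, `trace_id`), the service names, the 1000 ms latency cutoff, and the 5 percent success rate are all assumptions chosen for the example.

```python
import zlib

HEALTH_PATHS = {"/healthz", "/readyz", "/livez"}  # hypothetical probe paths
CRITICAL_SERVICES = {"checkout", "payments"}      # hypothetical service names

def keep_trace(trace, success_rate=0.05, slow_ms=1000):
    """Return True if the trace should be retained."""
    # Failure paths are always preserved, whatever else is true of the trace.
    if trace.get("error"):
        return True
    # High-latency outliers are kept for performance investigations.
    if trace.get("duration_ms", 0) >= slow_ms:
        return True
    # Revenue-critical workflows are kept in full.
    if trace.get("service") in CRITICAL_SERVICES:
        return True
    # Successful health checks carry no diagnostic value.
    if trace.get("path") in HEALTH_PATHS:
        return False
    # Deterministic sampling: the same trace_id always gets the same decision.
    bucket = zlib.crc32(str(trace["trace_id"]).encode()) % 10_000
    return bucket < success_rate * 10_000
```

Hashing the trace ID instead of calling a random number generator means every service that evaluates the same trace reaches the same decision, which avoids retaining half of a distributed trace.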
Cost driver 5: duplicate cloud telemetry
Many environments pay twice for similar infrastructure data.
A common pattern is importing cloud provider metrics through an integration while also running agents on the same hosts. If metadata, hostnames, tags, or instance IDs do not line up cleanly, teams can end up with duplicated hosts, duplicated metrics, or noisy dashboards that nobody trusts.
Examples:
- AWS CloudWatch metrics imported into Datadog and also collected by the Agent
- EC2 hosts appearing twice because metadata does not match
- container metrics collected by multiple integrations
- network metrics imported from both cloud APIs and agents
- dashboards built on both native cloud metrics and Datadog-generated metrics
What to do first
Audit the infrastructure list and usage views. Confirm whether cloud integration data and agent data are being correlated correctly. Then decide which system owns each signal.
For basic cloud utilization, native cloud monitoring may be enough. For deeper host and application context, the Datadog Agent may be justified. The waste is paying for both without a clear reason.
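A quick way to test for the "hosts appearing twice" pattern is to group an exported host list by cloud instance ID and look for IDs that map to more than one name. The sketch below assumes a list of dictionaries with `name`, `instance_id`, and `source` fields; that shape and the sample values are hypothetical.

```python
from collections import defaultdict

def find_duplicate_hosts(hosts):
    """Return instance IDs that appear under more than one host name."""
    by_instance = defaultdict(set)
    for host in hosts:
        instance_id = host.get("instance_id")
        if instance_id:
            by_instance[instance_id].add(host["name"].lower())
    return {
        instance_id: sorted(names)
        for instance_id, names in by_instance.items()
        if len(names) > 1
    }

# Hypothetical inventory: the same EC2 instance reported by the Agent and by AWS.
inventory = [
    {"name": "ip-10-0-1-12.ec2.internal", "instance_id": "i-0abc", "source": "agent"},
    {"name": "i-0abc", "instance_id": "i-0abc", "source": "aws"},
    {"name": "web-2", "instance_id": "i-0def", "source": "agent"},
]

print(find_duplicate_hosts(inventory))
# {'i-0abc': ['i-0abc', 'ip-10-0-1-12.ec2.internal']}
```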
What not to cut blindly
A high invoice creates pressure to cut fast. That is how teams create blind spots.
Do not blindly remove:
- authentication and authorization logs
- administrative action logs
- payment and transaction audit trails
- firewall and security events
- production error traces
- monitors tied to SLOs or incident response
- metrics used by autoscaling or capacity planning
- logs required for customer support investigations
A log or metric can look unused until the one incident where it becomes essential. Before removing it, check whether it is tied to a monitor, dashboard, notebook, SLO, runbook, audit process, or investigation workflow.
What to audit first
Start with attribution, not migration.
1. Top log producers
Find the highest-volume services, sources, indexes, and environments. Look for obvious patterns: access logs, health checks, non-production logs, DEBUG logs, and large JSON payloads.
2. Top custom metrics by cardinality
Identify the metrics with the highest number of time series. Look at the tags attached to those metrics. Unique identifiers should be removed from metric tags.
3. Kubernetes namespace and container volume
Group usage by namespace, workload, and container. Find development namespaces, sidecars, and ephemeral jobs that do not need full Datadog visibility.
4. APM retention and sampling
Review which spans are retained. Preserve errors and high-latency traces. Reduce successful request noise.
5. Duplicate infrastructure views
Check whether hosts, containers, cloud metrics, and agent metrics are duplicated. Fix metadata mapping before assuming the bill reflects real infrastructure size.
6. Unused telemetry
Review metrics and logs that are not queried, not shown on active dashboards, and not tied to monitors. Do not delete automatically. Put them into a deprecation review.
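The deprecation review in step 6 amounts to a reference check. As a minimal sketch, assume you have already collected which monitors, dashboards, SLOs, or runbooks mention each metric or log stream; the `references` mapping and the names in it are hypothetical.

```python
def deprecation_candidates(telemetry, references):
    """Return telemetry names with no known consumer, sorted for stable output.

    `references` maps a telemetry name to the things that use it
    (monitors, dashboards, SLOs, runbooks, audit processes, ...).
    """
    return sorted(name for name in telemetry if not references.get(name))

telemetry = ["api.request.latency", "cache.debug.hits", "legacy.batch.rows"]
references = {
    "api.request.latency": ["monitor:checkout-latency", "slo:checkout"],
    "cache.debug.hits": [],
}

# Candidates go into a review queue; nothing is deleted automatically.
print(deprecation_candidates(telemetry, references))
# ['cache.debug.hits', 'legacy.batch.rows']
```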
Practical 7-step cost review checklist
- Pull 30 days of usage data and identify the biggest cost categories.
- Group log volume by service, source, environment, index, and status code.
- Find the top custom metrics by cardinality and remove high-cardinality tags from indexing.
- Review Kubernetes namespaces, sidecars, jobs, and container collection rules.
- Preserve error traces and critical transactions, then reduce successful request trace volume.
- Audit cloud integration and agent duplication.
- Create usage monitors so the next spike is caught before the invoice arrives.
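The last checklist item is the one most often skipped. The logic of a usage monitor can be sketched as a comparison between each day and a trailing baseline. This example runs on a plain list of daily usage numbers; the 7-day window and the 1.5x factor are assumptions to tune, and in practice the same rule would be configured as a monitor on your usage data instead of run as a script.

```python
from statistics import mean

def usage_spikes(daily_usage, window=7, factor=1.5):
    """Return indexes of days whose usage exceeds factor x the trailing-window mean."""
    spikes = []
    for i in range(window, len(daily_usage)):
        baseline = mean(daily_usage[i - window:i])
        if baseline > 0 and daily_usage[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Hypothetical daily ingestion figures (any consistent unit).
usage = [100, 102, 98, 101, 99, 100, 100, 103, 180, 105]

print(usage_spikes(usage))  # [8]
```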
When migration makes sense
Migration can make sense after attribution.
Once teams know what is driving the bill, they can decide what belongs in Datadog and what should move elsewhere:
| Telemetry type | Possible destination |
|---|---|
| Static server and network monitoring | Zabbix |
| Kubernetes infrastructure metrics | Prometheus, VictoriaMetrics, Grafana |
| High-volume logs | Loki, OpenSearch, OpenObserve, object storage |
| Long-term archive | S3, GCS, Azure Blob |
| Critical APM and incident views | Datadog |
| Security and audit logs | Datadog, SIEM, or controlled archive depending on policy |
This is not a tool religion problem. Datadog can remain the premium layer for critical visibility while lower-value telemetry moves to cheaper systems.
Bottom line
A high Datadog bill is usually a telemetry governance problem before it is a vendor problem.
Start with logs, custom metrics, Kubernetes churn, APM trace volume, and duplicate cloud telemetry. Find the sources. Fix the worst offenders. Then decide what to keep, reduce, or offload.
If your Datadog bill is growing faster than the value you get from it, I can review the telemetry mix, identify the main cost drivers, and recommend what to keep, reduce, or offload.
Related guides
- Datadog Cost Reduction: What to Keep and What to Offload
- How to Reduce Datadog Log Ingestion Cost Without Losing Visibility
- Zabbix vs Datadog Cost: When Zabbix Still Makes Sense
- Datadog to Zabbix Migration: What Should Move and What Should Stay
- Datadog to Grafana Migration: Practical Path for Infrastructure Dashboards
Telemetry Audit & Consultation
Identify your cost drivers
I help enterprise engineering teams design telemetry pipelines, implement edge-routing with Vector/Fluent Bit, and offload static checks to Zabbix and Grafana - saving up to 60% on SaaS bills without losing incident visibility.
Sources
- Datadog Billing Documentation - host, log, APM, container, and usage metering concepts.
- Datadog Custom Metrics Billing - custom metric allotments, indexing, and overage concepts.
- Datadog Metrics without Limits - controlling indexed metric tags and custom metric volume.
- Datadog Custom Metrics Governance - identifying high-cardinality metrics and unused telemetry.
- Datadog Containers Billing - container usage, exclusions, and Kubernetes billing behavior.
- Datadog Trace Retention Documentation - span indexing and retention controls.
- Datadog Cloud Cost Management Documentation - cost attribution and cloud optimization views.
Written by
Tymur Chmeruk
Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro