
June 1, 2026

Datadog Bill Too High? Start With Logs, Custom Metrics, and Kubernetes Noise

A diagnostic guide to finding the real Datadog cost drivers before migrating: logs, custom metrics, high-cardinality tags, Kubernetes churn, APM traces, and duplicate cloud telemetry.

For the broader framework, see Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana.

A high Datadog bill is rarely caused by one bad dashboard or one expensive feature. The usual cause is unmanaged telemetry volume.

Teams add logs, metrics, traces, Kubernetes integrations, cloud integrations, APM, and custom tags over time. Each addition looks harmless by itself. The bill becomes painful when the same environment starts sending high-volume logs, high-cardinality custom metrics, container churn, duplicated cloud telemetry, and broad APM tracing into the same premium observability platform.

The first move should not be a blind migration to open source. That only moves the mess somewhere else. The first move is cost attribution: identify which services, teams, environments, tags, logs, metrics, traces, and retention policies are driving usage.

Why Datadog costs grow

Datadog is useful because it combines infrastructure monitoring, logs, metrics, APM traces, dashboards, monitors, and incident context. That same integration is also why cost can grow quickly. A single noisy service can affect multiple billing dimensions at once.

The most common cost drivers are:

  • log ingestion and indexing
  • custom metrics
  • high-cardinality tags
  • Kubernetes container churn
  • APM trace volume
  • duplicate telemetry from cloud integrations and agents
  • unused dashboards, monitors, and log-based metrics
  • non-production environments treated like production

The practical question is not “why is Datadog expensive?” The practical question is “which telemetry is valuable enough to justify Datadog, and which telemetry should be reduced, sampled, archived, or moved elsewhere?”

Cost driver 1: logs

Logs are often the first place the bill goes wrong.

Datadog log cost has more than one layer. Teams need to distinguish between logs that are ingested (submitted to Datadog) and logs that are indexed for search, dashboards, and monitors, because ingestion and indexing are billed separately. Exclusion filters can reduce indexed volume, but they are not a substitute for controlling what gets collected and forwarded in the first place.

Common log cost problems:

  • DEBUG and TRACE logs left enabled in production
  • HTTP 200 access logs indexed at full volume
  • Kubernetes liveness and readiness probes
  • ingress controller and load balancer access logs
  • VPC Flow Logs, CDN logs, DNS logs, and firewall logs sent without filtering
  • non-production logs kept with production retention
  • logs with large unused JSON fields
  • logs used as a poor substitute for metrics

Start by grouping log usage by service, source, environment, status code, index, and team. The top five producers usually explain most of the problem.
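The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not a Datadog API call: it assumes you have already exported usage rows into dictionaries, and the field names (`service`, `env`, `bytes`) are hypothetical.

```python
from collections import Counter

def top_log_producers(records, keys=("service", "env"), n=5):
    """Sum log bytes per group and return the top n groups with their share of the total."""
    volume = Counter()
    for rec in records:
        group = tuple(rec.get(k, "unknown") for k in keys)
        volume[group] += rec["bytes"]
    total = sum(volume.values())
    return [
        (group, size, size / total if total else 0.0)
        for group, size in volume.most_common(n)
    ]

# Hypothetical usage export: one row per service and environment.
sample = [
    {"service": "checkout", "env": "prod", "bytes": 120},
    {"service": "ingress", "env": "prod", "bytes": 700},
    {"service": "ingress", "env": "dev", "bytes": 150},
    {"service": "auth", "env": "prod", "bytes": 30},
]
top = top_log_producers(sample, n=2)
for group, size, share in top:
    print(group, size, f"{share:.0%}")
```

Changing `keys` to include source, index, or status code gives the other cuts mentioned above without changing the function.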

What to do first

Do not delete logs randomly. Classify them.

Log type | Typical action
Production errors, HTTP 5xx, auth failures, admin actions | Keep searchable and alertable
HTTP 200/302 access logs, normal info logs | Sample, shorten retention, or move to cheaper storage
Health checks, readiness probes, repetitive DEBUG logs | Drop at the edge or archive only
Audit, payment, identity, firewall, security logs | Preserve according to retention and investigation needs

If a log stream is needed only for historical lookup, it may not need to live in a hot Datadog index. If a log stream is repetitive operational noise, reduce it before it reaches Datadog.
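One way to make the classification explicit is a small rule function that runs in whatever pipeline sits in front of Datadog. The sketch below assumes structured log records with hypothetical fields (`category`, `level`, `status_code`, `path`) and hypothetical probe paths; the rules mirror the table above and would need tuning for a real environment.

```python
PRESERVE_CATEGORIES = {"audit", "payment", "identity", "firewall", "security"}
KEEP_CATEGORIES = {"auth_failure", "admin_action"}
PROBE_PATHS = {"/healthz", "/readyz", "/livez"}  # hypothetical probe endpoints

def classify_log(record):
    """Return the action for a log record: preserve, keep, drop_or_archive, or sample."""
    category = record.get("category", "")
    level = record.get("level", "INFO").upper()
    status = record.get("status_code")
    path = record.get("path", "")

    # Compliance and security streams follow retention policy, not cost pressure.
    if category in PRESERVE_CATEGORIES:
        return "preserve"
    # Errors, 5xx responses, auth failures, and admin actions stay searchable.
    if category in KEEP_CATEGORIES or level in {"ERROR", "CRITICAL"}:
        return "keep"
    if status is not None and status >= 500:
        return "keep"
    # Probes and debug chatter are dropped or archived before reaching Datadog.
    if path in PROBE_PATHS or level in {"DEBUG", "TRACE"}:
        return "drop_or_archive"
    # Routine access logs and info logs are candidates for sampling.
    return "sample"

print(classify_log({"status_code": 200, "path": "/healthz"}))
print(classify_log({"status_code": 503, "path": "/cart"}))
```

The order of the checks matters: preservation and error rules run first so that a cost rule never discards a record that policy or incident response needs.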

Cost driver 2: custom metrics

Custom metrics are where small tagging mistakes become expensive.

A custom metric is not just a metric name. It is the metric name plus the unique combinations of tags attached to it. A single metric can become thousands or millions of time series when teams add uncontrolled tags.

Safe tags usually describe stable operational dimensions:

  • env
  • service
  • region
  • status_code
  • customer_tier
  • cluster

Dangerous tags usually describe unique or fast-changing values:

  • user_id
  • session_id
  • request_id
  • pod_uid
  • container_id
  • raw URL paths with IDs
  • timestamps
  • transaction hashes

Example: a metric named api.request.latency with 10 endpoints, 5 status codes, and 3 customer tiers creates up to 10 × 5 × 3 = 150 time series. Add a user_id tag with 10,000 distinct values, and the same metric can grow to 150 × 10,000 = 1,500,000 time series. That is the usual shape of a custom metric surprise.
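The arithmetic is just a product of the number of distinct values per tag. A minimal Python sketch, using the numbers from the example (the tag names in the dictionary are illustrative):

```python
from math import prod

def max_series(tag_cardinalities):
    """Upper bound on time series: the product of distinct values per tag."""
    return prod(tag_cardinalities.values())

base = {"endpoint": 10, "status_code": 5, "customer_tier": 3}
print(max_series(base))                         # 150
print(max_series({**base, "user_id": 10_000}))  # 1500000
```

This is an upper bound. The real count depends on which tag combinations are actually emitted, but a unique identifier pushes the real number toward the bound quickly.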

What to do first

Use Datadog’s metric volume and custom metric governance tools to find the largest metric names by estimated cardinality. Then classify the tags:

Tag type | Action
Stable operational tag | Usually keep
Rarely queried tag | Remove from indexing or drop upstream
Unique identifier | Move to logs/traces, not metrics
Debug-only dimension | Remove from production metrics
Business analytics dimension | Send to BI/analytics, not infrastructure monitoring

Metrics without Limits can help reduce indexed custom metric cardinality without forcing every application team to immediately change code. That is useful as a first containment step. It is not a replacement for tag governance.
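If you can sample the points an application emits (for example from a test run or a pipeline tap), the tag doing the damage is easy to spot offline. The sketch below is an assumption-laden illustration, not a Datadog feature: it takes hypothetical `(metric_name, tags)` pairs and reports the observed series count and the distinct values per tag key.

```python
from collections import defaultdict

def tag_cardinality_report(points):
    """points: iterable of (metric_name, tags_dict) pairs.

    Returns, per metric, the number of unique tag combinations seen
    and the number of distinct values seen for each tag key.
    """
    series = defaultdict(set)
    values = defaultdict(lambda: defaultdict(set))
    for name, tags in points:
        series[name].add(frozenset(tags.items()))
        for key, value in tags.items():
            values[name][key].add(value)
    return {
        name: {
            "series": len(series[name]),
            "distinct_values": {k: len(v) for k, v in values[name].items()},
        }
        for name in series
    }

# Hypothetical sample of emitted points.
points = [
    ("api.request.latency", {"env": "prod", "user_id": "u1"}),
    ("api.request.latency", {"env": "prod", "user_id": "u2"}),
    ("api.request.latency", {"env": "prod", "user_id": "u2"}),
    ("api.request.latency", {"env": "dev", "user_id": "u3"}),
]
report = tag_cardinality_report(points)
print(report["api.request.latency"])
```

A tag key whose distinct-value count tracks the series count, as `user_id` does here, is the one to move out of metrics.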

Cost driver 3: high-cardinality Kubernetes telemetry

Kubernetes makes cost attribution harder because infrastructure changes constantly.

Pods start and stop. Containers restart. Deployments create new replica sets. Jobs run briefly and disappear. Sidecars multiply telemetry volume. Labels and annotations can create many dimensions. A cluster can be healthy from an application perspective while still generating wasteful telemetry.

Common Kubernetes cost drivers:

  • high-density nodes with many containers
  • short-lived jobs and ephemeral pods
  • CrashLoopBackOff churn
  • sidecars collected at full detail
  • labels that include unique build, pod, or deployment identifiers
  • logs from kube-system and platform namespaces
  • collecting metrics and logs from development namespaces the same way as production

What to do first

Review which containers and namespaces actually need Datadog collection. Exclude low-value workloads and sidecars where possible. Use separate rules for logs and metrics.

Practical candidates for exclusion or reduction:

  • sandbox namespaces
  • short-lived CI/test jobs
  • noisy sidecars
  • health check containers
  • development workloads
  • platform components already monitored elsewhere

The point is not to make Kubernetes invisible. The point is to stop treating every pod, sidecar, and ephemeral job as premium telemetry.
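As a sketch of what separate rules for logs and metrics can look like, the helper below builds exclusion strings for the Datadog Agent's container filtering environment variables. To my knowledge the Agent accepts space-separated `kube_namespace:`, `name:`, and `image:` regex rules in `DD_CONTAINER_EXCLUDE_LOGS` and `DD_CONTAINER_EXCLUDE_METRICS`, but treat that as an assumption and confirm against the documentation for your Agent version. The namespace and container names are hypothetical, and the helper assumes they contain no regex metacharacters.

```python
def build_exclude(namespaces=(), container_names=()):
    """Build a space-separated exclusion rule string with anchored regexes."""
    rules = [f"kube_namespace:^{ns}$" for ns in namespaces]
    rules += [f"name:^{name}$" for name in container_names]
    return " ".join(rules)

# Hypothetical policy: drop logs from sandbox, CI, and a noisy sidecar,
# but only drop metrics from the sandbox namespace.
env = {
    "DD_CONTAINER_EXCLUDE_LOGS": build_exclude(
        namespaces=["sandbox", "ci"], container_names=["istio-proxy"]
    ),
    "DD_CONTAINER_EXCLUDE_METRICS": build_exclude(namespaces=["sandbox"]),
}
for key, value in env.items():
    print(f"{key}={value}")
```

Generating the values from one reviewed list keeps the exclusion policy in version control instead of scattered across Helm values.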

Cost driver 4: APM traces and span volume

APM is valuable when it helps engineering teams understand latency, errors, dependencies, and root cause. It becomes expensive when every successful request from every service is retained at high volume.

Trace costs usually grow because of:

  • high request volume
  • many microservices per request
  • full retention of successful HTTP 200 traces
  • tracing health checks and low-value endpoints
  • duplicate instrumentation
  • too many generated span metrics
  • no sampling policy by service criticality

What to do first

Keep error and latency visibility. Reduce routine success noise.

A sane APM policy usually keeps:

  • 100 percent of error traces
  • high-latency outliers
  • traces for revenue-critical workflows
  • traces for new or unstable services

A sane policy usually samples or drops:

  • successful health check traces
  • repetitive polling endpoints
  • known low-value background jobs
  • high-volume successful requests with no diagnostic value

Aggressive sampling is fine only when failure paths are preserved. Dropping successful traces is different from dropping the traces that explain a production outage.
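The policy above can be written down as a decision function. This is a sketch of the logic only, not Datadog's sampling implementation; in practice the equivalent rules live in tracer, Agent, or retention filter configuration. The service names, resource names, thresholds, and trace fields are all hypothetical.

```python
import zlib

CRITICAL_SERVICES = {"checkout", "payments"}      # hypothetical revenue-critical services
DROP_RESOURCES = {"GET /healthz", "GET /readyz"}  # hypothetical low-value endpoints

def keep_trace(trace, latency_threshold_ms=1000, success_sample_rate=0.05):
    """Decide whether to retain a trace. Failure paths are checked first."""
    if trace.get("error"):
        return True
    if trace.get("duration_ms", 0) >= latency_threshold_ms:
        return True
    if trace.get("service") in CRITICAL_SERVICES:
        return True
    if trace.get("resource") in DROP_RESOURCES:
        return False
    # Deterministic sampling: hashing the trace ID gives the same decision
    # wherever the rule is evaluated, so traces are kept or dropped whole.
    bucket = zlib.crc32(trace["trace_id"].encode()) % 10_000
    return bucket < success_sample_rate * 10_000

print(keep_trace({"trace_id": "a1", "error": True, "resource": "GET /healthz"}))
print(keep_trace({"trace_id": "a2", "resource": "GET /healthz"}))
```

Because the error and latency checks run before the drop list, a failing health check is still retained while the successful ones are discarded.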

Cost driver 5: duplicate cloud telemetry

Many environments pay twice for similar infrastructure data.

A common pattern is importing cloud provider metrics through an integration while also running agents on the same hosts. If metadata, hostnames, tags, or instance IDs do not line up cleanly, teams can end up with duplicated hosts, duplicated metrics, or noisy dashboards that nobody trusts.

Examples:

  • AWS CloudWatch metrics imported into Datadog and also collected by the Agent
  • EC2 hosts appearing twice because metadata does not match
  • container metrics collected by multiple integrations
  • network metrics imported from both cloud APIs and agents
  • dashboards built on both native cloud metrics and Datadog-generated metrics

What to do first

Audit the infrastructure list and usage views. Confirm whether cloud integration data and agent data are being correlated correctly. Then decide which system owns each signal.

For basic cloud utilization, native cloud monitoring may be enough. For deeper host and application context, the Datadog Agent may be justified. The waste is paying for both without a clear reason.
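A quick offline check for duplicated hosts is to normalize identifiers and look for collisions. The sketch below assumes an exported host list with hypothetical fields (`name`, `source`, `instance_id`); it matches on the cloud instance ID when present and otherwise on the lowercased short hostname.

```python
from collections import defaultdict

def find_duplicate_hosts(hosts):
    """Group host entries by a normalized key and return groups with more than one entry."""
    groups = defaultdict(list)
    for host in hosts:
        key = host.get("instance_id") or host["name"].split(".")[0].lower()
        groups[key].append(host)
    return {key: entries for key, entries in groups.items() if len(entries) > 1}

# Hypothetical export mixing Agent-reported and cloud-integration hosts.
hosts = [
    {"name": "ip-10-0-1-12.ec2.internal", "source": "agent", "instance_id": "i-0abc"},
    {"name": "i-0abc", "source": "aws", "instance_id": "i-0abc"},
    {"name": "web-01.corp.example", "source": "agent"},
    {"name": "WEB-01", "source": "aws"},
    {"name": "db-01", "source": "agent"},
]
dupes = find_duplicate_hosts(hosts)
for key, entries in dupes.items():
    print(key, [e["source"] for e in entries])
```

Each collision is a candidate, not a verdict: the next step is to confirm whether the two entries are really the same machine and which source should own it.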

What not to cut blindly

A high invoice creates pressure to cut fast. That is how teams create blind spots.

Do not blindly remove:

  • authentication and authorization logs
  • administrative action logs
  • payment and transaction audit trails
  • firewall and security events
  • production error traces
  • monitors tied to SLOs or incident response
  • metrics used by autoscaling or capacity planning
  • logs required for customer support investigations

A log or metric can look unused until the one incident where it becomes essential. Before removing it, check whether it is tied to a monitor, dashboard, notebook, SLO, runbook, audit process, or investigation workflow.
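That dependency check can be partly automated. The sketch below assumes you have exported monitor and dashboard queries into a dictionary of label to query text (the labels and queries shown are hypothetical) and simply searches them for the metric name.

```python
def find_references(metric_name, assets):
    """Return the labels of assets whose query text mentions the metric name."""
    return sorted(label for label, query in assets.items() if metric_name in query)

# Hypothetical export of monitor and dashboard queries.
assets = {
    "monitor: checkout latency SLO": "avg(last_5m):avg:api.request.latency{env:prod} > 0.5",
    "dashboard: API overview": "avg:api.request.count{*} by {service}",
}
print(find_references("api.request.latency", assets))
print(find_references("cache.evictions", assets))
```

Substring matching can over-match similar metric names, which errs on the side of keeping telemetry. It also cannot see runbooks, audit processes, or ad hoc investigation habits, so an empty result is a reason to start a review, not to delete.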

What to audit first

Start with attribution, not migration.

1. Top log producers

Find the highest-volume services, sources, indexes, and environments. Look for obvious patterns: access logs, health checks, non-production logs, DEBUG logs, and large JSON payloads.

2. Top custom metrics by cardinality

Identify the metrics with the highest number of time series. Look at the tags attached to those metrics. Unique identifiers should be removed from metric tags.

3. Kubernetes namespace and container volume

Group usage by namespace, workload, and container. Find development namespaces, sidecars, and ephemeral jobs that do not need full Datadog visibility.

4. APM retention and sampling

Review which spans are retained. Preserve errors and high-latency traces. Reduce successful request noise.

5. Duplicate infrastructure views

Check whether hosts, containers, cloud metrics, and agent metrics are duplicated. Fix metadata mapping before assuming the bill reflects real infrastructure size.

6. Unused telemetry

Review metrics and logs that are not queried, not shown on active dashboards, and not tied to monitors. Do not delete automatically. Put them into a deprecation review.

Practical 7-step cost review checklist

  1. Pull 30 days of usage data and identify the biggest cost categories.
  2. Group log volume by service, source, environment, index, and status code.
  3. Find the top custom metrics by cardinality and remove high-cardinality tags from indexing.
  4. Review Kubernetes namespaces, sidecars, jobs, and container collection rules.
  5. Preserve error traces and critical transactions, then reduce successful request trace volume.
  6. Audit cloud integration and agent duplication.
  7. Create usage monitors so the next spike is caught before the invoice arrives.
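The spike check in step 7 reduces to comparing the latest day against a trailing baseline. A minimal sketch, assuming a list of daily usage values such as ingested log bytes; the 7-day window and 1.5x threshold are illustrative defaults, not recommendations from Datadog.

```python
from statistics import mean

def usage_spike(daily_values, window=7, threshold=1.5):
    """Return True when the latest day exceeds the trailing-window average by the threshold factor."""
    if len(daily_values) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(daily_values[-window - 1:-1])
    return baseline > 0 and daily_values[-1] > threshold * baseline

print(usage_spike([100] * 7 + [180]))  # True
print(usage_spike([100] * 7 + [120]))  # False
```

The same comparison can be expressed as a monitor inside the platform; the point of the sketch is the shape of the rule, a relative change against recent history rather than a fixed ceiling.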

When migration makes sense

Migration can make sense after attribution.

Once teams know what is driving the bill, they can decide what belongs in Datadog and what should move elsewhere:

Telemetry type | Possible destination
Static server and network monitoring | Zabbix
Kubernetes infrastructure metrics | Prometheus, VictoriaMetrics, Grafana
High-volume logs | Loki, OpenSearch, OpenObserve, object storage
Long-term archive | S3, GCS, Azure Blob
Critical APM and incident views | Datadog
Security and audit logs | Datadog, SIEM, or controlled archive depending on policy

This is not a tool religion problem. Datadog can remain the premium layer for critical visibility while lower-value telemetry moves to cheaper systems.

Bottom line

A high Datadog bill is usually a telemetry governance problem before it is a vendor problem.

Start with logs, custom metrics, Kubernetes churn, APM trace volume, and duplicate cloud telemetry. Find the sources. Fix the worst offenders. Then decide what to keep, reduce, or offload.

If your Datadog bill is growing faster than the value you get from it, I can review the telemetry mix, identify the main cost drivers, and recommend what to keep, reduce, or offload.


Written by

Tymur Chmeruk

Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro