June 1, 2026

Datadog Bill Too High? Start With Logs, Custom Metrics, and Kubernetes Noise

A diagnostic guide to finding the real Datadog cost drivers before migrating: logs, custom metrics, high-cardinality tags, Kubernetes churn, APM traces, and duplicate cloud telemetry.

For the broader framework, see Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana.

A high Datadog bill is rarely caused by one bad dashboard or one expensive feature. It is usually caused by unmanaged telemetry volume.

Teams add logs, metrics, traces, Kubernetes integrations, cloud integrations, APM, and custom tags over time. Each addition looks harmless by itself. The bill becomes painful when the same environment starts sending high-volume logs, high-cardinality custom metrics, container churn, duplicated cloud telemetry, and broad APM tracing into the same premium observability platform.

The first move should not be a blind migration to open source. That only moves the mess somewhere else. The first move is cost attribution: identify which services, teams, environments, tags, logs, metrics, traces, and retention policies are driving usage.

Why Datadog costs grow

Datadog is useful because it combines infrastructure monitoring, logs, metrics, APM traces, dashboards, monitors, and incident context. That same integration is also why cost can grow quickly. A single noisy service can affect multiple billing dimensions at once.

The most common cost drivers are:

log ingestion and indexing
custom metrics
high-cardinality tags
Kubernetes container churn
APM trace volume
duplicate telemetry from cloud integrations and agents
unused dashboards, monitors, and log-based metrics
non-production environments treated like production

The practical question is not “why is Datadog expensive?” The practical question is “which telemetry is valuable enough to justify Datadog, and which telemetry should be reduced, sampled, archived, or moved elsewhere?”

Cost driver 1: logs

Logs are often the first place the bill goes wrong.

Datadog log cost has more than one layer. Teams need to distinguish between logs that are ingested by Datadog and logs that are indexed for search, dashboards, and monitors, because the two are metered separately. Exclusion filters can reduce indexed volume, but excluded logs have still been ingested, so filters are not a substitute for controlling what gets collected and forwarded in the first place.

Common log cost problems:

DEBUG and TRACE logs left enabled in production
HTTP 200 access logs indexed at full volume
Kubernetes liveness and readiness probes
ingress controller and load balancer access logs
VPC Flow Logs, CDN logs, DNS logs, and firewall logs sent without filtering
non-production logs kept with production retention
logs with large unused JSON fields
logs used as a poor substitute for metrics

Start by grouping log usage by service, source, environment, status code, index, and team. The top five producers usually explain most of the problem.
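The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not a Datadog API call: the record fields (`service`, `env`, `bytes`) and the sample rows are hypothetical, and real input would come from whatever usage export your account provides.

```python
from collections import Counter

def top_log_producers(records, keys=("service", "env"), n=5):
    """Rank groups by total bytes and report each group's share of the total."""
    totals = Counter()
    for rec in records:
        group = tuple(rec.get(k, "unknown") for k in keys)
        totals[group] += rec["bytes"]
    grand_total = sum(totals.values())
    return [
        (group, size, size / grand_total if grand_total else 0.0)
        for group, size in totals.most_common(n)
    ]

# Hypothetical sample rows, for illustration only.
sample = [
    {"service": "checkout", "env": "prod", "bytes": 700},
    {"service": "checkout", "env": "prod", "bytes": 100},
    {"service": "search", "env": "prod", "bytes": 150},
    {"service": "search", "env": "dev", "bytes": 50},
]

for group, size, share in top_log_producers(sample):
    print(group, size, f"{share:.0%}")
```

Changing `keys` to `("source",)` or `("index", "status_code")` gives the other views mentioned above without touching the rest of the function.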

What to do first

Do not delete logs randomly. Classify them.

Log type	Typical action
Production errors, HTTP 5xx, auth failures, admin actions	Keep searchable and alertable
HTTP 200/302 access logs, normal info logs	Sample, shorten retention, or move to cheaper storage
Health checks, readiness probes, repetitive DEBUG logs	Drop at the edge or archive only
Audit, payment, identity, firewall, security logs	Preserve according to retention and investigation needs

If a log stream is needed only for historical lookup, it may not need to live in a hot Datadog index. If a log stream is repetitive operational noise, reduce it before it reaches Datadog.
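The classification table can be expressed as a small decision function. The sketch below is an assumption-heavy illustration: the field names (`level`, `status_code`, `path`, `category`), the health check paths, and the category labels are all hypothetical and would need to match your own log schema.

```python
# Hypothetical values; adjust to your own log schema.
HEALTH_PATHS = {"/healthz", "/readyz", "/livez"}
PROTECTED_CATEGORIES = {"audit", "payment", "identity", "firewall", "security"}

def classify_log(record):
    """Return 'keep', 'sample', or 'drop' for one parsed log record (a dict)."""
    level = str(record.get("level", "")).upper()
    status = record.get("status_code")

    # Audit and security streams follow retention policy, not volume rules.
    if record.get("category") in PROTECTED_CATEGORIES:
        return "keep"
    # Application errors stay searchable and alertable.
    if level in {"ERROR", "CRITICAL"}:
        return "keep"
    # HTTP 5xx responses and auth failures are kept as well.
    if isinstance(status, int) and (status >= 500 or status in (401, 403)):
        return "keep"
    # Successful health checks and debug chatter are dropped at the edge.
    if record.get("path") in HEALTH_PATHS or level in {"DEBUG", "TRACE"}:
        return "drop"
    # Everything else (HTTP 200/302 access logs, normal info logs) is sampled.
    return "sample"

print(classify_log({"level": "info", "status_code": 200, "path": "/cart"}))
print(classify_log({"level": "info", "status_code": 200, "path": "/healthz"}))
print(classify_log({"level": "info", "status_code": 503, "path": "/healthz"}))
```

The rule order matters: keep rules run before drop rules, so a failing health check or a DEBUG line in an audit stream is never discarded by a volume rule.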

Cost driver 2: custom metrics

Custom metrics are where small tagging mistakes become expensive.

A custom metric is not just a metric name. It is the metric name plus the unique combinations of tags attached to it. A single metric can become thousands or millions of time series when teams add uncontrolled tags.

Safe tags usually describe stable operational dimensions:

env
service
region
status_code
customer_tier
cluster

Dangerous tags usually describe unique or fast-changing values:

user_id
session_id
request_id
pod_uid
container_id
raw URL paths with IDs
timestamps
transaction hashes

Example: a metric named api.request.latency tagged with 10 endpoints, 5 status codes, and 3 customer tiers can produce up to 10 × 5 × 3 = 150 time series. Add a user_id tag with 10,000 distinct values, and the same metric can grow to as many as 1,500,000 time series. That is the usual shape of a custom metric surprise.
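The arithmetic is simply a product of distinct tag values. A minimal sketch, assuming every combination of tag values actually occurs (real series counts are often lower, because only combinations that are actually reported count):

```python
from math import prod

def max_series(tag_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())

base = {"endpoint": 10, "status_code": 5, "customer_tier": 3}

print(max_series(base))                         # 150
print(max_series({**base, "user_id": 10_000}))  # 1500000
```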

What to do first

Use Datadog’s metric volume and custom metric governance tools to find the largest metric names by estimated cardinality. Then classify the tags:

Tag type	Action
Stable operational tag	Usually keep
Rarely queried tag	Remove from indexing or drop upstream
Unique identifier	Move to logs/traces, not metrics
Debug-only dimension	Remove from production metrics
Business analytics dimension	Send to BI/analytics, not infrastructure monitoring

Metrics without Limits can help reduce indexed custom metric cardinality without forcing every application team to immediately change code. That is useful as a first containment step. It is not a replacement for tag governance.
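To decide which tags need attention, it helps to count distinct values per tag key for a single metric. The following sketch assumes you can obtain the tag sets of a metric's series as a list of dicts; that input shape and the threshold of 100 are illustrative assumptions, not Datadog defaults.

```python
from collections import defaultdict

def tag_cardinality(series_tags):
    """Count distinct values per tag key across all series of one metric."""
    values = defaultdict(set)
    for tags in series_tags:
        for key, value in tags.items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}

def risky_tags(series_tags, threshold=100):
    """Tag keys whose distinct-value count exceeds the threshold, largest first."""
    counts = tag_cardinality(series_tags)
    flagged = [key for key, count in counts.items() if count > threshold]
    return sorted(flagged, key=lambda key: -counts[key])

# Hypothetical series: a stable env tag plus a per-user identifier.
series = [{"env": "prod", "region": f"r{i % 3}", "user_id": f"u{i}"} for i in range(500)]

print(tag_cardinality(series))
print(risky_tags(series))
```

Tags flagged this way are candidates for the "Unique identifier" row of the table: move them to logs or traces rather than keeping them on the metric.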

Cost driver 3: high-cardinality Kubernetes telemetry

Kubernetes makes cost attribution harder because infrastructure changes constantly.

Pods start and stop. Containers restart. Deployments create new replica sets. Jobs run briefly and disappear. Sidecars multiply telemetry volume. Labels and annotations can create many dimensions. A cluster can be healthy from an application perspective while still generating wasteful telemetry.

Common Kubernetes cost drivers:

high-density nodes with many containers
short-lived jobs and ephemeral pods
CrashLoopBackOff churn
sidecars collected at full detail
labels that include unique build, pod, or deployment identifiers
logs from kube-system and platform namespaces
development namespaces collected the same way as production

What to do first

Review which containers and namespaces actually need Datadog collection. Exclude low-value workloads and sidecars where possible. Use separate rules for logs and metrics.

Practical candidates for exclusion or reduction:

sandbox namespaces
short-lived CI/test jobs
noisy sidecars
health check containers
development workloads
platform components already monitored elsewhere

The point is not to make Kubernetes invisible. The point is to stop treating every pod, sidecar, and ephemeral job as premium telemetry.
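One way to reason about "separate rules for logs and metrics" is to model the exclusion policy as data before translating it into Agent configuration. The sketch below is a planning aid only: the namespace patterns and container names are hypothetical, and it does not generate or validate real Datadog Agent settings.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: logs are excluded more aggressively than metrics.
EXCLUDE_LOGS = {
    "namespaces": ["sandbox-*", "ci-*", "dev-*"],
    "containers": ["istio-proxy", "healthcheck"],
}
EXCLUDE_METRICS = {
    "namespaces": ["sandbox-*", "ci-*"],
    "containers": [],
}

def _excluded(rules, namespace, container):
    """True if the namespace or container name matches any exclusion pattern."""
    return (
        any(fnmatchcase(namespace, p) for p in rules["namespaces"])
        or any(fnmatchcase(container, p) for p in rules["containers"])
    )

def collection_plan(namespace, container):
    """Decide independently whether logs and metrics should be collected."""
    return {
        "logs": not _excluded(EXCLUDE_LOGS, namespace, container),
        "metrics": not _excluded(EXCLUDE_METRICS, namespace, container),
    }

print(collection_plan("payments", "app"))
print(collection_plan("payments", "istio-proxy"))
print(collection_plan("ci-1234", "runner"))
```

Writing the policy down this way makes it reviewable: a sidecar can lose log collection while keeping metrics, and a CI namespace can lose both.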

Cost driver 4: APM traces and span volume

APM is valuable when it helps engineering teams understand latency, errors, dependencies, and root cause. It becomes expensive when every successful request from every service is retained at high volume.

Trace costs usually grow because of:

high request volume
many microservices per request
full retention of successful HTTP 200 traces
tracing health checks and low-value endpoints
duplicate instrumentation
too many generated span metrics
no sampling policy by service criticality

What to do first

Keep error and latency visibility. Reduce routine success noise.

A sane APM policy usually keeps:

100 percent of error traces
high-latency outliers
traces for revenue-critical workflows
traces for new or unstable services

A sane policy usually samples or drops:

successful health check traces
repetitive polling endpoints
known low-value background jobs
high-volume successful requests with no diagnostic value

Aggressive sampling is fine only when failure paths are preserved. Dropping successful traces is different from dropping the traces that explain a production outage.
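The keep and sample lists above can be sketched as a single decision function. This is an illustrative policy, not Datadog's sampling implementation: the trace fields, the 1,000 ms latency threshold, the 5 percent success rate, and the service and resource names are all assumptions. It also assumes the decision is made once the trace is complete, since errors and duration are only known at that point.

```python
import zlib

# Hypothetical names, for illustration only.
CRITICAL_SERVICES = frozenset({"checkout"})
DROP_RESOURCES = frozenset({"GET /healthz", "GET /poll"})

def keep_trace(trace, success_rate=0.05, slow_ms=1000):
    """Return True if a completed trace should be retained."""
    # Failure paths are always preserved.
    if trace.get("error"):
        return True
    # High-latency outliers are kept regardless of outcome.
    if trace.get("duration_ms", 0) >= slow_ms:
        return True
    # Successful health checks and polling endpoints are dropped.
    if trace.get("resource") in DROP_RESOURCES:
        return False
    # Revenue-critical workflows are kept in full.
    if trace.get("service") in CRITICAL_SERVICES:
        return True
    # Remaining successes: deterministic sampling keyed on the trace ID,
    # so the same trace always gets the same decision.
    bucket = zlib.crc32(str(trace["trace_id"]).encode()) % 10_000
    return bucket < success_rate * 10_000

print(keep_trace({"trace_id": 1, "error": True, "resource": "GET /healthz"}))
print(keep_trace({"trace_id": 2, "resource": "GET /healthz", "duration_ms": 3}))
print(keep_trace({"trace_id": 3, "service": "checkout", "duration_ms": 40}))
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible, which makes the policy easier to test and audit.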

Cost driver 5: duplicate cloud telemetry

Many environments pay twice for similar infrastructure data.

A common pattern is importing cloud provider metrics through an integration while also running agents on the same hosts. If metadata, hostnames, tags, or instance IDs do not line up cleanly, teams can end up with duplicated hosts, duplicated metrics, or noisy dashboards that nobody trusts.

Examples:

AWS CloudWatch metrics imported into Datadog and also collected by the Agent
EC2 hosts appearing twice because metadata does not match
container metrics collected by multiple integrations
network metrics imported from both cloud APIs and agents
dashboards built on both native cloud metrics and Datadog-generated metrics

What to do first

Audit the infrastructure list and usage views. Confirm whether cloud integration data and agent data are being correlated correctly. Then decide which system owns each signal.

For basic cloud utilization, native cloud monitoring may be enough. For deeper host and application context, the Datadog Agent may be justified. The waste is paying for both without a clear reason.
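A quick way to spot candidates for duplication is to normalize host names from an inventory export and look for collisions. The sketch below covers only one common mismatch (case and domain suffix differences); the row shape and the sample host names are hypothetical, and other mismatches such as differing instance IDs would need their own matching key.

```python
from collections import defaultdict

def normalize(name):
    """Lowercase the name and drop any domain suffix."""
    return name.strip().lower().split(".")[0]

def find_duplicates(hosts):
    """Group inventory rows by normalized name; return groups with more than one row."""
    groups = defaultdict(list)
    for host in hosts:
        groups[normalize(host["name"])].append(host)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

# Hypothetical inventory rows from two collection paths.
inventory = [
    {"name": "web-01", "source": "agent"},
    {"name": "WEB-01.internal.example.com", "source": "cloud-integration"},
    {"name": "db-01", "source": "agent"},
]

for key, rows in find_duplicates(inventory).items():
    print(key, [row["source"] for row in rows])
```

Each reported group is a suspect, not a confirmed duplicate. The follow-up is to check whether the platform correlates the two entries or counts them separately.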

What not to cut blindly

A high invoice creates pressure to cut fast. That is how teams create blind spots.

Do not blindly remove:

authentication and authorization logs
administrative action logs
payment and transaction audit trails
firewall and security events
production error traces
monitors tied to SLOs or incident response
metrics used by autoscaling or capacity planning
logs required for customer support investigations

A log or metric can look unused until the one incident where it becomes essential. Before removing it, check whether it is tied to a monitor, dashboard, notebook, SLO, runbook, audit process, or investigation workflow.

What to audit first

Start with attribution, not migration.

1. Top log producers

Find the highest-volume services, sources, indexes, and environments. Look for obvious patterns: access logs, health checks, non-production logs, DEBUG logs, and large JSON payloads.

2. Top custom metrics by cardinality

Identify the metrics with the highest number of time series. Look at the tags attached to those metrics. Unique identifiers should be removed from metric tags.

3. Kubernetes namespace and container volume

Group usage by namespace, workload, and container. Find development namespaces, sidecars, and ephemeral jobs that do not need full Datadog visibility.

4. APM retention and sampling

Review which spans are retained. Preserve errors and high-latency traces. Reduce successful request noise.

5. Duplicate infrastructure views

Check whether hosts, containers, cloud metrics, and agent metrics are duplicated. Fix metadata mapping before assuming the bill reflects real infrastructure size.

6. Unused telemetry

Review metrics and logs that are not queried, not shown on active dashboards, and not tied to monitors. Do not delete automatically. Put them into a deprecation review.
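The deprecation review can start from a simple filter: telemetry that has not been queried recently and has no known dependents. In this sketch, both inputs are assumptions about data you would assemble yourself: `idle_days` maps a metric name to days since it was last queried, and `references` maps a metric name to the monitors, dashboards, SLOs, or runbooks that use it.

```python
def deprecation_candidates(idle_days, references, min_idle_days=30):
    """Names that are idle and unreferenced; candidates for review, not deletion."""
    return sorted(
        name
        for name, idle in idle_days.items()
        if idle >= min_idle_days and not references.get(name)
    )

# Hypothetical inputs, for illustration only.
idle_days = {
    "app.cache.debug_hits": 90,
    "app.queue.depth": 45,
    "app.request.count": 0,
}
references = {
    "app.queue.depth": ["monitor: queue backlog", "runbook: worker scaling"],
}

print(deprecation_candidates(idle_days, references))  # ['app.cache.debug_hits']
```

Here `app.queue.depth` is idle but still backs a monitor and a runbook, so it stays out of the candidate list.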

Practical 7-step cost review checklist

1. Pull 30 days of usage data and identify the biggest cost categories.
2. Group log volume by service, source, environment, index, and status code.
3. Find the top custom metrics by cardinality and remove high-cardinality tags from indexing.
4. Review Kubernetes namespaces, sidecars, jobs, and container collection rules.
5. Preserve error traces and critical transactions, then reduce successful request trace volume.
6. Audit cloud integration and agent duplication.
7. Create usage monitors so the next spike is caught before the invoice arrives.
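The logic behind a usage monitor can be prototyped offline against the same 30 days of usage data. The sketch below flags any day whose volume exceeds a multiple of the trailing average; the 7-day window, the 1.5× factor, and the sample numbers are illustrative assumptions, not recommended thresholds.

```python
from statistics import mean

def usage_spikes(daily, window=7, factor=1.5):
    """Return (day_index, value, baseline) for days above factor x trailing mean."""
    spikes = []
    for i in range(window, len(daily)):
        baseline = mean(daily[i - window:i])
        if baseline > 0 and daily[i] > factor * baseline:
            spikes.append((i, daily[i], baseline))
    return spikes

# Hypothetical daily ingested volume in GB.
daily_gb = [100, 100, 100, 100, 100, 100, 100, 100, 180, 100]

for day, value, baseline in usage_spikes(daily_gb):
    print(f"day {day}: {value} GB vs trailing average {baseline:.0f} GB")
```

Running this per service or per index, rather than on the account total, points directly at the producer responsible for a spike.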

When migration makes sense

Migration can make sense after attribution.

Once teams know what is driving the bill, they can decide what belongs in Datadog and what should move elsewhere:

Telemetry type	Possible destination
Static server and network monitoring	Zabbix
Kubernetes infrastructure metrics	Prometheus, VictoriaMetrics, Grafana
High-volume logs	Loki, OpenSearch, OpenObserve, object storage
Long-term archive	S3, GCS, Azure Blob
Critical APM and incident views	Datadog
Security and audit logs	Datadog, SIEM, or controlled archive depending on policy

This is not a tool religion problem. Datadog can remain the premium layer for critical visibility while lower-value telemetry moves to cheaper systems.

Bottom line

A high Datadog bill is usually a telemetry governance problem before it is a vendor problem.

Start with logs, custom metrics, Kubernetes churn, APM trace volume, and duplicate cloud telemetry. Find the sources. Fix the worst offenders. Then decide what to keep, reduce, or offload.

If your Datadog bill is growing faster than the value you get from it, I can review the telemetry mix, identify the main cost drivers, and recommend what to keep, reduce, or offload.

Telemetry Audit & Consultation

Identify your cost drivers

I help enterprise engineering teams design telemetry pipelines, implement edge-routing with Vector/Fluent Bit, and offload static checks to Zabbix and Grafana - saving up to 60% on SaaS bills without losing incident visibility.

Sources

Datadog Billing Documentation - host, log, APM, container, and usage metering concepts.
Datadog Custom Metrics Billing - custom metric allotments, indexing, and overage concepts.
Datadog Metrics without Limits - controlling indexed metric tags and custom metric volume.
Datadog Custom Metrics Governance - identifying high-cardinality metrics and unused telemetry.
Datadog Containers Billing - container usage, exclusions, and Kubernetes billing behavior.
Datadog Trace Retention Documentation - span indexing and retention controls.
Datadog Cloud Cost Management Documentation - cost attribution and cloud optimization views.

Written by

Tymur Chmeruk

Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro
