June 1, 2026

How to Reduce Datadog Log Ingestion Cost Without Losing Visibility

A practical log reduction strategy for classifying high-value logs, sampling noisy streams, filtering at the edge, and routing lower-value telemetry to cheaper storage.

For the broader framework, see Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana.

Datadog is useful when teams need fast access to production logs, metrics, traces, alerts, and incident context in one place. The cost problem usually starts when every log line is treated as equally valuable.

A good log reduction strategy does not begin by deleting data. It begins by classifying logs by operational value, deciding which logs need hot search, which logs only need archive retention, and which logs should never reach Datadog in the first place.

The practical goal is simple: keep critical production and security visibility in Datadog, but stop paying premium observability prices for repetitive, low-value noise.

Why Datadog Log Costs Grow

Datadog log billing has two separate concepts that matter for cost planning: ingestion and indexing. Datadog charges for ingested logs based on the total number of gigabytes submitted to the Datadog Logs service. It also charges for indexed log events based on the number of log events submitted for indexing under the selected retention policy.

That split matters. Index exclusion filters act at the indexing stage, after a log has already been submitted to Datadog. They reduce searchable (indexed) volume, but the ingestion charge for that log still applies. To reduce ingestion cost, the log has to be filtered, sampled, or shortened before it is sent.
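
To make the split concrete, here is a minimal sketch that models the bill as two separate line items. The rates are hypothetical placeholders, not Datadog list prices; substitute the figures from your own contract. It shows that an exclusion filter lowers the indexing line while the ingestion line stays the same.

```python
# Hypothetical rates for illustration only; replace with your contract figures.
INGEST_RATE_PER_GB = 0.10        # assumed USD per ingested GB
INDEX_RATE_PER_MILLION = 1.70    # assumed USD per million indexed events


def estimate_monthly_cost(ingested_gb: float, indexed_events: int) -> dict:
    """Return ingestion, indexing, and total cost as separate line items."""
    ingestion = ingested_gb * INGEST_RATE_PER_GB
    indexing = (indexed_events / 1_000_000) * INDEX_RATE_PER_MILLION
    return {
        "ingestion": round(ingestion, 2),
        "indexing": round(indexing, 2),
        "total": round(ingestion + indexing, 2),
    }


# Same ingested volume; an exclusion filter removes 60% of events from the index.
before = estimate_monthly_cost(ingested_gb=5_000, indexed_events=4_000_000_000)
after_exclusion = estimate_monthly_cost(ingested_gb=5_000, indexed_events=1_600_000_000)

print(before)
print(after_exclusion)
```

Only a change to `ingested_gb` moves the ingestion line, which is why the later sections push filtering upstream.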

The usual cost drivers are not mysterious:

  • debug logging left enabled in production
  • health check and readiness probe logs
  • HTTP 200 access logs from web servers, load balancers, and ingress controllers
  • Kubernetes control plane and container churn
  • VPC flow logs, CDN logs, DNS logs, firewall logs, and service mesh logs
  • logs with large unused fields
  • duplicated telemetry already available in another system
  • logs used as a substitute for metrics

Datadog’s own guidance describes several ways log volume becomes wasteful: debug logs, error loops, unnecessary performance data inside logs, extra fields that are never used, and log streams that are not all equal in value.
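
Unused fields are one of the easier drivers to measure. The sketch below trims an event to an allowlist and compares the approximate encoded size before and after. The field names and the allowlist are hypothetical; the real list should come from what your team actually queries and alerts on.

```python
import json

# Hypothetical allowlist: fields the team actually searches or alerts on.
KEEP_FIELDS = frozenset(
    {"timestamp", "service", "env", "status", "message", "trace_id"}
)


def trim_fields(event: dict, keep=KEEP_FIELDS) -> dict:
    """Return a copy of the event with only allowlisted top-level fields."""
    return {key: value for key, value in event.items() if key in keep}


def size_in_bytes(event: dict) -> int:
    """Approximate wire size as the length of the compact UTF-8 JSON encoding."""
    return len(json.dumps(event, separators=(",", ":")).encode("utf-8"))


event = {
    "timestamp": "2026-06-01T12:00:00Z",
    "service": "checkout",
    "env": "prod",
    "status": "info",
    "message": "order placed",
    "trace_id": "abc123",
    "request_headers": {"user-agent": "Mozilla/5.0", "accept": "*/*"},
    "debug_context": "x" * 500,
}

trimmed = trim_fields(event)
print(size_in_bytes(event), "->", size_in_bytes(trimmed))
```

Running this against a sample of real events gives a rough estimate of how much ingestion volume is spent on fields nobody reads.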

The First Rule: Do Not Cut Blindly

Random log cuts create two predictable problems:

  1. Incident response gets worse. A team may save money on a monthly bill and then lose hours during a production incident because the one log stream that explained the failure was excluded.
  2. Security and compliance evidence can disappear. Authentication logs, authorization failures, administrative changes, payment-related audit events, and privileged access events may be needed for investigations or audits.

The correct question is not “which logs can we delete?” The correct question is “which logs need hot search, which need cheaper retention, and which should be reduced before they reach Datadog?”

Classify Logs into Three Tiers

A simple three-tier model is enough for most environments.

Tier 1: Keep hot and searchable

These logs need to stay searchable in Datadog or another hot search platform:

  • production errors
  • HTTP 5xx responses
  • failed authentication attempts
  • privileged access and configuration changes
  • payment, identity, and security events
  • database connection failures
  • logs directly tied to active monitors or incident workflows

These logs are expensive to lose. They should have clear retention rules, ownership, alerting, and access controls.

Tier 2: Sample, reduce, or move to warm storage

These logs are useful, but usually not useful at full volume:

  • HTTP 200 and 302 access logs
  • normal application info logs
  • successful transaction logs
  • high-volume web server logs
  • load balancer access logs
  • standard service mesh traffic logs

For these streams, keeping 100 percent of events in hot search is often wasteful. A common pattern is to index a sample, generate metrics from the stream, and route the full raw feed to cheaper storage for later investigation.
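
One way to sample without breaking investigations is to make the decision deterministic per request rather than random per line. The sketch below assumes each log carries a request ID (a hypothetical `request_id` value); hashing it means every line from the same request gets the same keep-or-drop decision.

```python
import hashlib


def keep_in_sample(request_id: str, sample_percent: int) -> bool:
    """Deterministically decide whether logs for this request stay in hot search.

    Hashing the ID instead of calling random() means every log line that
    shares a request ID gets the same decision, so sampled requests stay complete.
    """
    if not 0 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < sample_percent


# Keep roughly 10% of successful requests in the hot index.
request_ids = [f"req-{n}" for n in range(10_000)]
kept = [rid for rid in request_ids if keep_in_sample(rid, 10)]
print(len(kept), "of", len(request_ids), "requests kept")
```

The same function can run in any component that sees the request ID, so the application, the router, and the archive job agree on which requests were sampled.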

Tier 3: Drop, archive, or bypass Datadog

These logs often provide little value in hot search:

  • Kubernetes liveness and readiness probes
  • repetitive health checks
  • verbose DEBUG and TRACE logs
  • non-production debug noise
  • heartbeat messages
  • duplicate telemetry already collected elsewhere

Some of this data can be dropped. Some should be archived to object storage if the organization wants forensic retention. The key point is that low-value logs should be reduced at the edge whenever possible, not after they have already driven ingestion volume.
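
The three tiers can be sketched as a single classification function. This is an illustration of the model above, not a complete policy: the field names (`level`, `http_status`, `path`, `category`) and the category values are assumptions about how events are parsed in a given environment.

```python
# Hypothetical probe paths and category labels; adjust to your own log schema.
HEALTH_PATHS = {"/healthz", "/ready", "/livez"}
TIER1_CATEGORIES = {"auth_failure", "privileged_access", "payment", "security"}


def classify_tier(event: dict) -> int:
    """Return 1 (keep hot), 2 (sample or warm storage), or 3 (drop or archive)."""
    level = str(event.get("level", "")).upper()
    status = event.get("http_status")
    category = event.get("category")

    # Tier 1 is checked first so an error on a health endpoint is never dropped.
    if level in {"ERROR", "FATAL", "CRITICAL"}:
        return 1
    if isinstance(status, int) and status >= 500:
        return 1
    if category in TIER1_CATEGORIES:
        return 1

    # Tier 3: probes, debug and trace output, heartbeats.
    if event.get("path") in HEALTH_PATHS:
        return 3
    if level in {"DEBUG", "TRACE"} or category == "heartbeat":
        return 3

    # Tier 2: everything else, such as normal info logs and 2xx/3xx access logs.
    return 2


print(classify_tier({"level": "info", "http_status": 200, "path": "/healthz"}))
print(classify_tier({"level": "info", "http_status": 200, "path": "/api/orders"}))
print(classify_tier({"level": "error", "path": "/healthz"}))
```

The order of the checks is the design choice that matters: high-value conditions are evaluated before any drop rule.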

What to Do Inside Datadog First

Before deploying new infrastructure, use the controls already available in Datadog.

Review top log producers

Start with a seven-day usage review. Group logs by service, environment, source, status code, and team. Identify the top five producers by volume. The goal is to find patterns, not to argue about individual log lines.

Useful questions:

  • Which services produce the most logs?
  • Which indexes are rarely queried?
  • Which logs are mostly HTTP 2xx or health checks?
  • Which logs are generated by non-production environments?
  • Which teams can explain their log volume?

Datadog recommends usage monitoring, index query review, and exclusion filters for high-volume logs. Log Patterns can also be used to find repetitive log lines that are good exclusion candidates.
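
If the usage data can be exported, the grouping step is a few lines of code. The sketch below assumes a list of usage records with hypothetical `service`, `env`, and `bytes` fields; it is not tied to any particular Datadog export format.

```python
from collections import Counter


def top_producers(events, key="service", n=5):
    """Sum approximate bytes per value of `key` and return the n largest."""
    volume = Counter()
    for event in events:
        volume[event.get(key, "unknown")] += event.get("bytes", 0)
    return volume.most_common(n)


# Hypothetical usage records for illustration.
usage = [
    {"service": "ingress-nginx", "env": "prod", "bytes": 900},
    {"service": "checkout", "env": "prod", "bytes": 300},
    {"service": "ingress-nginx", "env": "staging", "bytes": 400},
    {"service": "auth", "env": "prod", "bytes": 150},
]

print(top_producers(usage, key="service", n=2))
print(top_producers(usage, key="env", n=2))
```

Running the same function with different keys (service, environment, source, status code) answers most of the questions in the list above.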

Segment indexes

Do not send all logs into one catch-all index with one retention policy. Separate indexes by value and use case.

A practical structure:

| Index | Example content | Suggested treatment |
| --- | --- | --- |
| Production critical | errors, auth, payment, security events | hot search, alerting, normal retention |
| Production operational | access logs, info logs, normal traffic | sampled or shorter retention |
| Non-production | dev and staging logs | short retention or heavy exclusion |
| Archive-only | low-value historical data | route to archive or lower-cost tier |

Datadog index filters are evaluated in order, and logs enter the first index whose filter they match. That means index order matters. Put specific indexes above broad catch-all filters.
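
The effect of ordering is easy to see in a small simulation of first-match routing. The index names and filter functions here are hypothetical stand-ins for real index filters.

```python
def pick_index(event, indexes):
    """Return the name of the first index whose filter matches, or None."""
    for name, matches in indexes:
        if matches(event):
            return name
    return None


def is_critical(event):
    return event.get("env") == "prod" and event.get("status") == "error"


def catch_all(event):
    return True


error_event = {"env": "prod", "status": "error", "service": "checkout"}

# Hypothetical index names; only the order differs between the two lists.
broad_first = [("main", catch_all), ("prod-critical", is_critical)]
specific_first = [("prod-critical", is_critical), ("main", catch_all)]

print(pick_index(error_event, broad_first))
print(pick_index(error_event, specific_first))
```

With the catch-all listed first, the production error never reaches the critical index, so it inherits whatever retention and exclusion rules the catch-all has.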

Use exclusion filters carefully

Exclusion filters are useful for controlling indexing. Datadog documents that excluded logs are discarded from indexes, but can still flow through Live Tail, generate metrics, and be archived.

Good exclusion candidates:

  • @http.status_code:[200 TO 299] for high-volume access logs
  • status:debug in production
  • /healthz, /ready, and similar health probes
  • non-production logs not tied to incidents

Do not use exclusion filters as a substitute for governance. Create owners, change control, and usage alerts. Otherwise one team can accidentally re-enable a noisy stream and recreate the bill problem.

Convert repetitive logs into metrics

Some logs exist only because teams want a count, rate, or trend. In those cases, generate a metric and reduce the raw log volume.

Examples:

  • count successful logins by region
  • count payment failures by processor
  • count API responses by status code
  • count queue processing results

Once the metric is validated, the raw informational log can often be sampled, shortened, or excluded from hot indexing.
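
The underlying idea is plain aggregation: collapse many log lines into a handful of counters. The sketch below shows that reduction locally, using hypothetical `service` and `http_status` fields; in practice the counting would be done by a log-based metric or by the routing layer.

```python
from collections import Counter


def count_by(events, *fields):
    """Count events grouped by the given fields; missing fields become 'unknown'."""
    counts = Counter()
    for event in events:
        counts[tuple(event.get(field, "unknown") for field in fields)] += 1
    return counts


# Hypothetical access log events.
access_logs = [
    {"service": "api", "http_status": 200},
    {"service": "api", "http_status": 200},
    {"service": "api", "http_status": 502},
    {"service": "web", "http_status": 200},
]

responses = count_by(access_logs, "service", "http_status")
for (service, status), total in sorted(responses.items()):
    print(service, status, total)
```

Four log lines become three counters here; at production volume the ratio is what makes the raw stream a candidate for sampling.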

Where Edge Filtering Matters

In-platform controls help, but edge filtering is where larger cost reductions become possible. If noisy logs never leave the host, node, or cluster, they do not become Datadog ingestion volume.

This is where Vector, Fluent Bit, Datadog Observability Pipelines, or another telemetry router fits.

A practical routing policy looks like this:

| Log type | Destination |
| --- | --- |
| ERROR, FATAL, HTTP 5xx, auth failures | Datadog hot index |
| sampled HTTP 2xx access logs | Datadog or warm log store |
| full raw access logs | S3, GCS, Azure Blob, Loki, OpenSearch, or OpenObserve |
| health checks and probes | drop or archive only |
| non-production DEBUG logs | short retention or local archive |

This keeps Datadog focused on high-value operational visibility while preserving lower-value data somewhere cheaper.

Vector and Fluent Bit as routing layers

Vector and Fluent Bit are common choices for log routing. Both can collect logs, parse fields, filter events, sample streams, add metadata, and send different classes of logs to different destinations.

For example, a Kubernetes cluster could route app logs this way:

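# Simplified routing example.
# Assumes each event already carries parsed .level, .status, and .path fields.
sources:
  app_logs:
    type: kubernetes_logs

transforms:
  route_logs:
    type: route
    inputs: [app_logs]
    route:
      # to_int(...) ?? 0 keeps the comparison valid when .status is missing or not numeric.
      critical: '.level == "ERROR" || (to_int(.status) ?? 0) >= 500'
      noisy: '.path == "/healthz" || .path == "/ready" || .level == "DEBUG"'
    # Events that match neither route go to route_logs._unmatched
    # and are dropped unless a sink consumes that output.

sinks:
  datadog_critical:
    type: datadog_logs
    inputs: [route_logs.critical]
    default_api_key: "${DATADOG_API_KEY}"  # read from the environment
  s3_archive:
    type: aws_s3
    inputs: [route_logs.noisy]
    bucket: "example-log-archive"  # placeholder bucket name
    region: "us-east-1"            # placeholder region
    encoding:
      codec: json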

This is not a production-ready configuration. It is the pattern: classify at the edge, send critical logs to Datadog, and send noisy logs to cheaper storage.

Choosing lower-cost destinations

There is no universal replacement for Datadog Logs. The right backend depends on query behavior.

Object storage

Object storage such as S3, GCS, or Azure Blob is usually the cheapest default archive for raw logs. It is useful for long-term retention, audit evidence, and rare forensic retrieval. It is not a good hot investigation interface by itself.

Grafana Loki

Loki is useful for high-volume logs when teams usually query by labels such as cluster, namespace, pod, service, or environment. It avoids full-text indexing of every log line, which can reduce storage overhead. The tradeoff is slower broad text search across large time windows.

OpenSearch

OpenSearch is useful when teams need full-text search, flexible filtering, and exploratory log analysis. The tradeoff is operational cost: cluster sizing, shard management, JVM tuning, storage performance, and upgrades.

OpenObserve or columnar backends

Columnar, object-storage-backed systems can be attractive for lower-cost log analytics. Treat vendor savings claims as benchmarks, not guarantees. The real cost depends on volume, query patterns, retention, cloud storage, compute, and the engineering time required to operate the system.

Risks to Handle Before Cutting Volume

Loss of incident context

If traces stay in Datadog but logs move elsewhere, engineers may lose one-click correlation. That does not make offload impossible, but it means trace IDs, service names, environment tags, and request IDs must be preserved across systems.
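
A simple guard in the routing layer or in CI can catch events that would break correlation after offload. The field names below are hypothetical; use whatever names your tracing and logging libraries actually emit.

```python
# Hypothetical field names; match them to your own log schema.
REQUIRED_FIELDS = ("trace_id", "request_id", "service", "env")


def missing_correlation_fields(event: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required fields that are absent or empty in the event."""
    return [field for field in required if not event.get(field)]


complete = {
    "trace_id": "4bf92f3577b34da6",
    "request_id": "req-1001",
    "service": "checkout",
    "env": "prod",
    "message": "payment authorized",
}
stripped = {"service": "checkout", "env": "prod", "message": "payment authorized"}

print(missing_correlation_fields(complete))
print(missing_correlation_fields(stripped))
```

Any event that returns a non-empty list should be fixed at the source before its stream is moved out of Datadog.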

Compliance gaps

Do not reduce authentication, authorization, administrative, or payment audit logs without confirming retention and access requirements. A cheap log strategy that fails an audit is not cheap.

Pipeline backpressure

If the destination is unavailable, the routing layer must buffer safely. Enable disk-backed buffers where possible. Memory-only buffers lose queued events if the process restarts, and they can fill quickly during network problems or backend throttling.
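
Routers such as Vector and Fluent Bit ship their own buffering, so nobody should hand-roll this in production. The sketch below only illustrates what a disk-backed buffer does: events are persisted before they are considered queued, and undelivered events stay on disk until the destination accepts them. The class name and file layout are hypothetical.

```python
import json
import os
import tempfile


class DiskBuffer:
    """Append-only JSON Lines buffer that survives process restarts."""

    def __init__(self, path: str):
        self.path = path

    def append(self, event: dict) -> None:
        # Flush and fsync so the event is on disk before it counts as buffered.
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event) + "\n")
            handle.flush()
            os.fsync(handle.fileno())

    def drain(self, send) -> int:
        """Replay buffered events through send(); keep the ones that fail."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, encoding="utf-8") as handle:
            events = [json.loads(line) for line in handle if line.strip()]
        remaining = []
        sent = 0
        for index, event in enumerate(events):
            try:
                send(event)
                sent += 1
            except ConnectionError:
                # Destination still unavailable: keep this event and all later ones.
                remaining = events[index:]
                break
        # Atomically rewrite the file with only the undelivered events.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            for event in remaining:
                handle.write(json.dumps(event) + "\n")
        os.replace(tmp_path, self.path)
        return sent


def unavailable(event):
    raise ConnectionError("destination throttled")


with tempfile.TemporaryDirectory() as workdir:
    buffer = DiskBuffer(os.path.join(workdir, "pending.jsonl"))
    buffer.append({"service": "checkout", "message": "archived access log"})

    delivered = []
    print(buffer.drain(unavailable))       # destination down: event stays on disk
    print(buffer.drain(delivered.append))  # destination back: event is replayed
```

A memory-only queue would behave the same way until the process restarts, at which point everything still queued is gone.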

Poor ownership

Log cost reduction fails when nobody owns the policy. Each major service should have an owner, a retention class, a sampling rule, and a review schedule.

Practical 30-Day Action Plan

Week 1: Measure

  • Pull seven days of log usage.
  • Identify top services, sources, indexes, status codes, and environments.
  • Find indexes that are rarely queried.
  • Identify debug, health check, access log, and non-production noise.

Week 2: Reduce indexing

  • Create or reorder indexes by value.
  • Add exclusion filters for obvious low-value logs.
  • Sample high-volume HTTP 2xx logs.
  • Set usage monitors and alerts on indexed volume.

Week 3: Move reduction upstream

  • Add edge filtering with Vector, Fluent Bit, or Observability Pipelines.
  • Drop or sample health checks and repetitive info logs before Datadog ingestion.
  • Archive full raw streams to object storage where needed.

Week 4: Validate

  • Run incident simulations against the new log policy.
  • Verify security and audit retention.
  • Confirm trace IDs and request IDs still connect logs across systems.
  • Review usage before and after changes.
  • Document ownership and change control.

Cost Reduction Checklist

  • Identify top five log producers by volume.
  • Separate production, non-production, security, and archive-only streams.
  • Confirm which logs are needed for active alerts.
  • Add exclusion filters for repetitive low-value logs.
  • Sample high-volume access logs instead of indexing everything.
  • Generate metrics from repetitive informational logs.
  • Route full raw logs to object storage when retention is needed.
  • Use edge filtering to reduce ingestion, not just indexing.
  • Preserve trace IDs, request IDs, service names, and environment tags.
  • Enable pipeline buffering and retry behavior.
  • Review log policy monthly.

Conclusion

Datadog log cost reduction is not a tool replacement project. It is a telemetry classification project.

Keep Datadog for the logs that need hot search, alerting, correlation, and incident response. Reduce, sample, or offload the repetitive streams that rarely help during an outage. The cleanest savings usually come from moving filtering upstream, before low-value logs become Datadog ingestion volume.

I help infrastructure and security teams reduce Datadog log costs by classifying log value, routing noisy logs to lower-cost storage, and keeping the data needed for incidents, security, and compliance.

Written by

Tymur Chmeruk

Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro