June 1, 2026

Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana

A practical framework for reducing observability spend by keeping critical telemetry in Datadog and moving lower-value infrastructure, logs, and metrics to Zabbix, Grafana, Prometheus, Loki, OpenSearch, or object storage.

Datadog is often worth the money for critical application observability. The problem starts when every log line, every low-value metric, every static server check, and every Kubernetes label lands in the same premium SaaS billing path.

A useful Datadog cost reduction project is usually not a full replacement. It is a classification exercise: keep the telemetry that needs Datadog’s correlation, APM, SLOs, incident workflows, and managed platform. Move lower-value infrastructure monitoring, noisy logs, and high-cardinality metric streams to cheaper systems such as Zabbix, Grafana, Prometheus, Loki, OpenSearch, OpenObserve, VictoriaMetrics, or object storage.

The goal is not to make monitoring cheaper by making it worse. The goal is to stop using the most expensive layer for data that does not need it.

Why Datadog Bills Grow

Datadog costs tend to grow for four technical reasons.

First, hosts and containers scale. Datadog meters host counts hourly and bills on the high-water mark of the lower 99 percent of those hourly readings, so the top 1 percent of usage hours in a month is excluded. Custom metrics are also counted hourly and then averaged over the month. In dynamic environments, especially Kubernetes, the monitored footprint can grow faster than expected if teams do not control where agents run and what they collect.
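As a rough illustration of the host billing rule described above, the sketch below estimates a billable host count from hourly readings. It assumes the number of discarded hours is 1 percent of the month rounded down. The function name and the sample month are hypothetical, and Datadog's exact rounding may differ.

```python
def billable_hosts(hourly_counts: list[int]) -> int:
    """Estimate billable hosts from hourly host counts.

    Assumption: the top 1 percent of hours (rounded down) is discarded
    and the maximum of the remaining hours is what gets billed.
    """
    if not hourly_counts:
        return 0
    ordered = sorted(hourly_counts)
    drop = len(ordered) // 100  # how many of the highest hours to discard
    kept = ordered[: len(ordered) - drop]
    return kept[-1]


# Hypothetical 720-hour month: 40 hosts most of the time, a 30-hour batch
# window at 65 hosts, and a 5-hour autoscaling burst at 200 hosts.
month = [40] * 685 + [65] * 30 + [200] * 5
print(billable_hosts(month))  # prints 65
```

The five-hour burst is forgiven, but the 30-hour batch window sets the bill. Sustained agent sprawl matters more than brief spikes.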

Second, log volume grows without governance. Application logs, access logs, Kubernetes logs, cloud load balancer logs, VPC flow logs, and debug streams can quickly become the largest observability data source. Datadog Log Management meters submitted (ingested) volume separately from indexed, searchable events. Reducing indexing helps, but upstream routing is still needed when the raw submitted volume is the problem.

Third, custom metrics multiply through tags. In Datadog, a custom metric is not just a metric name. It is a unique combination of the metric name, host, and tag values. Tags such as user_id, request_id, pod_uid, container_id, full URL path, tenant_id, or build SHA can turn one metric into thousands or millions of series.
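The multiplication is easy to underestimate, so here is a small sketch of the arithmetic. The tag names and cardinalities are made-up examples. The result is an upper bound, since only combinations that actually report data count as series.

```python
from math import prod


def worst_case_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric name: product of distinct tag values."""
    return prod(tag_cardinalities.values())


# Hypothetical request counter with three bounded tags.
base = {"host": 20, "endpoint": 15, "status": 5}
print(worst_case_series(base))  # prints 1500

# The same metric after someone adds a user_id tag with 10,000 distinct values.
with_user = {**base, "user_id": 10_000}
print(worst_case_series(with_user))  # prints 15000000
```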

Fourth, teams enable overlapping features without ownership. Infrastructure Monitoring, APM, logs, RUM, synthetics, database monitoring, network monitoring, cloud cost features, and security products all have value. They also create separate usage streams. Without attribution by service, team, and environment, nobody owns the bill until finance notices it.

What Should Stay in Datadog

The expensive system should keep the data where Datadog’s value is hard to reproduce.

Critical application services usually stay first. Revenue-impacting APIs, customer-facing applications, payment flows, login paths, and services with strict SLOs benefit from Datadog’s correlation between metrics, traces, logs, alerts, dashboards, and incidents.

APM and distributed tracing often stay, at least during the first phase. Rebuilding trace correlation with self-hosted components can be done, but it adds engineering work and operational risk. For systems where MTTR matters more than storage cost, Datadog can remain the primary view.

High-severity production logs should stay indexed or searchable in a hot tier. Authentication failures, privilege changes, payment errors, HTTP 5xx spikes, database connection failures, application exceptions, and security-relevant events are poor candidates for blind offload.

Executive dashboards and cross-team incident views can also stay. If product, support, security, and engineering all depend on a shared Datadog dashboard during incidents, replacing it on day one is usually a bad trade.

What Can Often Move Out

The first offload candidates are boring, high-volume signals that rarely justify premium indexing or per-host SaaS monitoring.

Static infrastructure metrics are a common starting point. Long-lived VMs, bare-metal servers, network devices, firewalls, switches, routers, storage appliances, and SNMP-heavy environments often need CPU, memory, disk, interface, service, and availability monitoring. That is exactly where Zabbix is strong.

Low-value logs are another strong candidate. Health checks, HTTP 200 access logs, repetitive readiness probes, debug logs, verbose framework logs, successful polling noise, and routine Kubernetes controller chatter usually do not need to be searchable in Datadog at full volume.
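Before routing anything, it helps to measure how much of a log sample is actually noise. The sketch below is one way to do that, assuming logs are already parsed into dictionaries. The field names (`kind`, `path`, `status`, `level`, `bytes`) and the health-check paths are assumptions, not a standard schema.

```python
import re

# Hypothetical health and readiness endpoints; adjust to the real services.
HEALTH_PATHS = re.compile(r"^/(healthz?|readyz?|livez?|ping)$")


def is_low_value(record: dict) -> bool:
    """Flag debug output, health checks, and successful access logs."""
    if record.get("level", "").lower() in {"debug", "trace"}:
        return True
    if HEALTH_PATHS.match(record.get("path", "")):
        return True
    return record.get("kind") == "access" and record.get("status") == 200


def noise_share(records: list[dict]) -> float:
    """Fraction of sampled bytes that come from low-value records."""
    total = sum(r.get("bytes", 0) for r in records)
    noisy = sum(r.get("bytes", 0) for r in records if is_low_value(r))
    return noisy / total if total else 0.0


sample = [
    {"kind": "access", "path": "/healthz", "status": 200, "bytes": 300},
    {"kind": "access", "path": "/api/orders", "status": 200, "bytes": 200},
    {"kind": "access", "path": "/api/orders", "status": 500, "bytes": 400},
    {"kind": "app", "level": "DEBUG", "bytes": 100},
]
print(noise_share(sample))  # prints 0.6
```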

High-cardinality exploratory metrics should move out of Datadog early. Metrics tagged by user, tenant, request, pod UID, container ID, dynamic route, or transaction hash belong in a system designed and governed for that pattern. In many cases, they should be aggregated before they ever reach a billing-sensitive backend.
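Aggregation before ingestion can be as simple as dropping the unbounded tags and summing what remains. The sketch below assumes counter-style metrics, where summing is valid; gauges and distributions need a different rollup. The tag list and data points are hypothetical.

```python
from collections import defaultdict

# Tags assumed to be unbounded in this example.
DROP_TAGS = {"user_id", "request_id", "pod_uid", "container_id"}


def aggregate(points, drop_tags=DROP_TAGS):
    """Sum counter values after removing high-cardinality tags.

    points: iterable of (metric_name, tags_dict, value).
    Returns {(metric_name, sorted_tag_pairs): total}.
    """
    totals = defaultdict(float)
    for name, tags, value in points:
        kept = tuple(sorted((k, v) for k, v in tags.items() if k not in drop_tags))
        totals[(name, kept)] += value
    return dict(totals)


points = [
    ("checkout.requests", {"service": "checkout", "status": "200", "user_id": "u1"}, 3),
    ("checkout.requests", {"service": "checkout", "status": "200", "user_id": "u2"}, 5),
    ("checkout.requests", {"service": "checkout", "status": "500", "user_id": "u3"}, 1),
]
rolled_up = aggregate(points)
print(len(points), "->", len(rolled_up))  # prints 3 -> 2
```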

Duplicate cloud telemetry should be reviewed. If AWS CloudWatch, Azure Monitor, or Google Cloud Monitoring already stores a basic metric, importing every dimension into Datadog without filters may create duplicate cost without adding much operational value.

Where Zabbix Fits

Zabbix is not a Datadog clone. It is an infrastructure monitoring platform. That is the point.

For servers, network devices, SNMP, ICMP checks, TCP service checks, VMware-style infrastructure, Linux and Windows agents, templates, triggers, proxies, and distributed polling, Zabbix can cover a large amount of classic IT monitoring without per-host SaaS licensing.

A practical Zabbix offload project usually starts with stable infrastructure:

network devices and SNMP polling
Linux and Windows VM health
disk, CPU, memory, interface, and process checks
database availability checks
internal services that do not need APM tracing
branch office or site monitoring through Zabbix proxies

The trade-off is ownership. Zabbix still needs a server, database, storage, backups, tuning, templates, trigger cleanup, upgrade discipline, and proxy design. Moving data out of Datadog saves SaaS usage, but it does not make monitoring free.

Where Grafana Fits

Grafana is the dashboard layer. It is useful when teams want one visual surface over multiple backends.

A phased architecture can show Zabbix infrastructure health, Prometheus Kubernetes metrics, Loki logs, OpenSearch logs, and even Datadog data inside Grafana during the transition. This is useful because migration does not have to be a cliff. Teams can move backends gradually while keeping dashboards usable.

Grafana works well for:

NOC dashboards
executive infrastructure views
service health boards
Zabbix-backed infrastructure panels
Prometheus/VictoriaMetrics Kubernetes metrics
Loki or OpenSearch log drilldowns

Grafana does not magically replace Datadog APM. It needs data sources. If the traces, service maps, incident workflows, and monitors still live in Datadog, Grafana should be treated as a dashboard and integration layer, not a full replacement.

Where Prometheus, Loki, OpenSearch, OpenObserve, and VictoriaMetrics Fit

Prometheus is a good fit for Kubernetes and cloud-native metrics. It is especially useful when teams want direct control over scrape targets, labels, retention, and alert rules. For long retention or very large scale, it is commonly paired with systems such as Thanos, Mimir, Cortex, or VictoriaMetrics.

VictoriaMetrics is often useful when metric volume or cardinality becomes painful. It can serve as a high-performance metrics backend, especially where Prometheus-compatible ingestion and query patterns are desired.

Loki is a reasonable target for high-volume logs when queries are usually scoped by labels such as service, namespace, environment, pod, or trace ID. It is not ideal for broad full-text search across huge windows. That trade-off is exactly why it can be cheaper to run.

OpenSearch fits when teams need full-text search and exploratory log analysis. The trade-off is operational cost: JVM tuning, shard management, storage design, hot/warm/cold tiers, backups, and cluster maintenance.

OpenObserve and similar columnar/object-storage-backed systems are worth evaluating for log and telemetry offload. Vendor benchmarks should be treated as vendor benchmarks, not universal savings guarantees. The real question is whether the tool matches the team’s query patterns, retention needs, and operational capacity.

A Practical Offload Architecture

A clean Datadog cost reduction architecture has three layers.

Layer	Purpose	Common tools
Premium observability	Critical app telemetry, APM, SLOs, incidents, high-value alerts	Datadog
Infrastructure monitoring	Stable servers, network devices, SNMP, VM health, site monitoring	Zabbix, Grafana
Bulk telemetry storage	Noisy logs, high-volume metrics, long retention, archive	Prometheus, VictoriaMetrics, Loki, OpenSearch, OpenObserve, S3

A typical flow looks like this:

Critical production errors, APM traces, SLO signals, and incident dashboards stay in Datadog.
Static infrastructure monitoring moves to Zabbix.
Grafana provides dashboards across Zabbix, Prometheus, Loki, and selected Datadog views.
Vector, Fluent Bit, or OpenTelemetry Collector routes logs and metrics before they hit expensive endpoints.
Noisy logs go to Loki, OpenSearch, OpenObserve, or object storage.
High-cardinality or exploratory metrics go to Prometheus/VictoriaMetrics or a dedicated analytics backend.
Only high-value alerts are forwarded back into the primary incident workflow.

The key move is routing before ingestion. If the data reaches Datadog first and is filtered later, the team may reduce indexing volume but still fail to control submitted data volume.
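The routing decision itself is usually a short set of ordered rules. The sketch below expresses that logic in Python for clarity; in practice it would live in a Vector, Fluent Bit, or OpenTelemetry Collector configuration. The sink names, record fields, and protected categories are assumptions chosen for illustration.

```python
from enum import Enum


class Sink(Enum):
    DATADOG = "datadog"
    BULK = "bulk_logs"          # e.g. Loki, OpenSearch, OpenObserve
    ARCHIVE = "object_storage"  # e.g. S3


# Categories assumed to need both a hot copy and a retained copy.
PROTECTED = {"auth", "payment", "admin", "security"}


def route(record: dict) -> set[Sink]:
    """Decide where a log record goes before it reaches any paid endpoint."""
    if record.get("category") in PROTECTED:
        return {Sink.DATADOG, Sink.ARCHIVE}
    if record.get("env") == "prod" and record.get("level") in {"error", "critical"}:
        return {Sink.DATADOG}
    if record.get("level") == "debug":
        return {Sink.ARCHIVE}
    return {Sink.BULK}


print(route({"env": "prod", "level": "error", "service": "checkout"}))
print(route({"env": "staging", "level": "info", "service": "search"}))
```

Rule order matters here. The protected-category check runs first so that a noisy but regulated stream is never downgraded by a later rule.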

What to Audit First

Start with usage attribution, not tool selection.

Review Datadog usage by product, team, service, environment, host group, log index, custom metric, and tag pattern. Identify the top producers before touching the architecture.

The first audit should answer:

Which services generate the most logs?
Which indexes are growing fastest?
Which custom metrics have the highest cardinality?
Which tags create the most unique series?
Which hosts or clusters drive infrastructure monitoring usage?
Which environments send production-grade telemetry even though they are dev, test, or staging?
Which dashboards and monitors are actually used during incidents?
Which data is required for compliance, audit, or security investigations?
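Several of these questions reduce to grouping a usage export and ranking the result. The sketch below assumes the usage data has already been exported into rows with `service`, `env`, and `log_gb` fields. Those field names and the numbers are hypothetical and do not reflect any specific Datadog export format.

```python
from collections import Counter


def top_producers(rows, key, measure, n=5):
    """Rank values of `key` by the summed `measure`, largest first."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += row[measure]
    return totals.most_common(n)


def non_prod_share(rows, measure, prod_env="prod"):
    """Fraction of the measure that comes from non-production environments."""
    total = sum(r[measure] for r in rows)
    non_prod = sum(r[measure] for r in rows if r["env"] != prod_env)
    return non_prod / total if total else 0.0


rows = [
    {"service": "checkout", "env": "prod", "log_gb": 120},
    {"service": "checkout", "env": "staging", "log_gb": 80},
    {"service": "search", "env": "prod", "log_gb": 60},
    {"service": "batch", "env": "dev", "log_gb": 40},
]
print(top_producers(rows, "service", "log_gb", n=2))  # prints [('checkout', 200), ('search', 60)]
print(non_prod_share(rows, "log_gb"))  # prints 0.4
```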

Do not start by deleting data. Start by classifying it.

Migration Risks

Partial offload can save money, but it can also damage operations if done carelessly.

The first risk is lost correlation. If traces remain in Datadog but logs move elsewhere, engineers may need to copy a trace ID from Datadog into Grafana, Loki, or OpenSearch. That is acceptable for low-priority services. It may be unacceptable for revenue-critical systems.
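One way to soften the copy-and-paste step is to generate the log query from the trace ID. The sketch below builds a LogQL query with a label selector and a line filter. The label names are assumptions about how the Loki streams are labeled, and the helper assumes values contain no double quotes.

```python
def logql_for_trace(trace_id: str, labels: dict[str, str]) -> str:
    """Build a LogQL query: stream selector plus a substring filter on the trace ID."""
    selector = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f'{{{selector}}} |= "{trace_id}"'


query = logql_for_trace("4bf92f3577b34da6", {"service": "checkout", "env": "prod"})
print(query)  # prints {env="prod", service="checkout"} |= "4bf92f3577b34da6"
```

Scoping by labels first keeps the query narrow, which matches the query pattern Loki is designed for.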

The second risk is alert gaps. A Datadog monitor may depend on a metric, tag, log event, or trace attribute that disappears during offload. Every migration candidate needs a monitor mapping and validation step.
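The mapping step can start as a plain set intersection between what each monitor reads and what is about to leave Datadog. The monitor names and signal names below are hypothetical. A real inventory would come from the monitor definitions themselves.

```python
def broken_monitors(monitors: dict[str, list[str]], offloaded: set[str]) -> dict[str, list[str]]:
    """Return each monitor that depends on a signal scheduled for offload."""
    report = {}
    for name, dependencies in monitors.items():
        missing = sorted(set(dependencies) & offloaded)
        if missing:
            report[name] = missing
    return report


monitors = {
    "checkout-5xx-rate": ["checkout.requests", "tag:status"],
    "disk-full-legacy-vm": ["system.disk.in_use"],
    "login-latency-slo": ["login.latency"],
}
offloaded = {"system.disk.in_use", "nginx.access_logs"}
print(broken_monitors(monitors, offloaded))  # prints {'disk-full-legacy-vm': ['system.disk.in_use']}
```

Any monitor in the report needs an equivalent trigger in the target system before the signal is cut over.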

The third risk is self-hosting overhead. Zabbix, Prometheus, Loki, OpenSearch, VictoriaMetrics, and OpenObserve still require capacity planning, upgrades, backup strategy, access control, monitoring, and incident response. The bill moves from SaaS to infrastructure and labor.

The fourth risk is compliance. Logs related to authentication, authorization, admin activity, data access, payment flows, and regulated systems should not be dropped because they look noisy. They need retention and access rules.

30-Day Cost Reduction Plan

Days 1-7: Usage review

Pull Datadog usage by product. Identify the top log indexes, custom metrics, hosts, containers, APM services, and environments. Build a short list of the top five cost drivers.

Days 8-14: Classification

Mark telemetry as critical, operational, or low-value. Critical data stays in Datadog. Operational data can be sampled or moved to warm storage. Low-value data can be dropped, archived, or routed elsewhere before ingestion.
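Writing the classification down as explicit rules keeps it reviewable. The sketch below is one possible rule set, not a standard. The `Stream` attributes and the thresholds for each tier are assumptions that each team would replace with its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    name: str
    revenue_critical: bool = False
    slo_bound: bool = False
    used_in_incidents: bool = False


# Default action per tier, taken from the classification step above.
ACTIONS = {
    "critical": "keep in Datadog",
    "operational": "sample or move to warm storage",
    "low-value": "drop, archive, or route elsewhere before ingestion",
}


def classify(stream: Stream) -> str:
    """Assign a tier using simple, ordered rules."""
    if stream.revenue_critical or stream.slo_bound:
        return "critical"
    if stream.used_in_incidents:
        return "operational"
    return "low-value"


for s in [
    Stream("checkout-apm", revenue_critical=True),
    Stream("k8s-node-metrics", used_in_incidents=True),
    Stream("readiness-probe-logs"),
]:
    tier = classify(s)
    print(f"{s.name}: {tier} -> {ACTIONS[tier]}")
```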

Days 15-21: Quick controls

Apply index filters, quotas, retention changes, tag cleanup, custom metric governance, and log exclusion where safe. These changes are not the final architecture, but they can stop the worst waste.

Days 22-30: Offload pilot

Pick one low-risk service or infrastructure segment. Move basic infrastructure checks to Zabbix. Route noisy logs through Vector, Fluent Bit, or OpenTelemetry Collector to Loki, OpenSearch, OpenObserve, or S3. Keep Datadog alerts for critical paths. Compare visibility before and after.

Cost Reduction Checklist

Area	Check
Logs	Top indexes by volume, retention, environment, and service
Metrics	Top custom metrics, high-cardinality tags, unused dimensions
Hosts	Static servers, duplicate monitoring, dev/test agents
Kubernetes	Pod/container churn, noisy labels, ephemeral workloads
APM	Sampling, service ownership, trace retention, unused services
Dashboards	Which views are actually used during incidents
Compliance	Logs that must be retained and searchable
Offload	Candidates for Zabbix, Grafana, Prometheus, Loki, OpenSearch, OpenObserve, S3

Conclusion

Datadog cost reduction is not a tool war. It is a telemetry routing problem.

Keep Datadog where correlation, APM, SLOs, incidents, and business-critical visibility justify the cost. Move stable infrastructure monitoring to Zabbix. Use Grafana to unify views across systems. Route high-volume logs and high-cardinality metrics into platforms designed for cheaper retention and controlled query patterns.

The cleanest architecture is usually hybrid: Datadog for premium observability, Zabbix for classic infrastructure, Grafana for dashboards, and lower-cost backends for bulk telemetry.

I help infrastructure teams reduce Datadog spend by identifying what should stay in Datadog, what can move to Zabbix/Grafana, and where logs or metrics can be safely offloaded without losing critical visibility.

Written by

Tymur Chmeruk

Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro