June 1, 2026

Datadog to Zabbix Migration: What Should Move and What Should Stay

A phased migration guide for offloading static infrastructure, network devices, and basic availability checks from Datadog to Zabbix without breaking alert coverage.

For the broader framework, see Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana.

A Datadog-to-Zabbix migration should not start as a full rip-and-replace project. That is how teams lose alerts, break dashboards, and discover too late that a managed observability platform was hiding a lot of operational complexity.

The practical goal is narrower: move the monitoring workloads that Zabbix handles well, keep Datadog where it still provides high value, and validate alert parity before removing Datadog coverage. In most environments, the best first targets are static infrastructure, network devices, SNMP polling, basic uptime checks, and long-lived Linux or Windows servers. The worst first targets are APM, distributed tracing, high-churn Kubernetes workloads, RUM, and complex application analytics.

Why teams consider moving from Datadog to Zabbix

The usual reason is cost control. Datadog is a managed SaaS platform with pricing tied to usage dimensions such as hosts, devices, containers, logs, APM, indexed spans, and custom metrics. Zabbix is open-source software with no per-host license fee, but that does not make it free in an operational sense.

Zabbix shifts the cost from a vendor invoice to internal ownership. Someone has to run the server, tune the database, manage proxies, maintain templates, handle upgrades, and troubleshoot performance when SNMP polling or item volume grows. That trade is still worth it for the right scope, especially when the alternative is paying premium SaaS pricing for static infrastructure metrics that do not need Datadog’s application-level correlation.

A good migration does not ask, ‘Can Zabbix replace Datadog?’ The better question is, ‘Which parts of Datadog are expensive but low-value enough to move, and which parts are still worth paying for?’

Best first migration candidates

Network infrastructure is usually the cleanest first move. Switches, routers, firewalls, load balancers, UPS units, and other SNMP-monitored devices are relatively static. Zabbix was built for this kind of monitoring: SNMP items, templates, low-level discovery, interface graphs, trigger thresholds, dependencies, and distributed proxies.

Static Linux and Windows servers are also good candidates. Long-lived VMs, bare-metal servers, legacy databases, internal tools, and monolithic applications usually need CPU, memory, disk, network, service checks, process checks, certificate checks, and basic log pattern alerts. Zabbix Agent 2 can cover a large part of that monitoring footprint without sending every host into a premium SaaS billing model.

Basic availability checks are another safe target. ICMP ping, TCP port checks, HTTP availability checks, TLS certificate age, service status, disk capacity, and simple process monitoring are not strong reasons to keep a workload in Datadog. They are exactly the kind of checks Zabbix can perform reliably when the templates and triggers are built correctly.

What should stay in Datadog

APM and distributed tracing should usually stay in Datadog at the beginning. Zabbix does not provide a native replacement for trace waterfalls, code-level latency analysis, automatic service maps, span analytics, or tight trace-to-log correlation. Trying to reproduce that inside Zabbix is the wrong project. It turns a cost-reduction effort into a broken observability rebuild.

High-churn Kubernetes and serverless environments should also stay out of the first Zabbix migration wave. Zabbix can monitor Kubernetes, but rapidly created and destroyed pods can create database churn, housekeeping pressure, and noisy low-level discovery behavior if the design is not carefully controlled. Datadog is stronger for ephemeral application environments where tags, traces, containers, deployments, and service ownership change constantly.

RUM, synthetic browser testing, product analytics, and business-facing dashboards should remain in Datadog or another dedicated application observability platform until the team has a proven replacement. Zabbix is an excellent infrastructure monitoring tool. It is not a full digital experience monitoring platform.
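The split described in the last two sections can be written down as a simple classification rule before any tooling is touched. The sketch below is a minimal illustration: the tag vocabulary, the `classify` function, and the sample inventory are all hypothetical and do not reflect a Datadog export format. The only design choice worth noting is that "keep" tags win over "move" tags, so a disk monitor on a Kubernetes node stays in Datadog for the first wave.

```python
from typing import Any, Dict, List

# Hypothetical tag vocabulary for illustration; not a Datadog export schema.
KEEP_TAGS = {"apm", "tracing", "rum", "synthetic-browser", "kubernetes", "serverless"}
MOVE_TAGS = {"snmp", "network-device", "ping", "tcp-port", "http-check",
             "tls-cert", "disk", "cpu", "memory", "process"}


def classify(monitor: Dict[str, Any]) -> str:
    """Place one inventoried monitor into a migration bucket."""
    tags = {str(t).lower() for t in monitor.get("tags", [])}
    if tags & KEEP_TAGS:  # application-level or high-churn scope wins
        return "keep in Datadog"
    if tags & MOVE_TAGS:  # static infrastructure and basic checks
        return "move to Zabbix"
    return "review later"  # anything unrecognized needs a human decision


# Made-up inventory entries, used only to exercise the rule.
inventory: List[Dict[str, Any]] = [
    {"name": "core-sw-01 interface errors", "tags": ["snmp", "network-device"]},
    {"name": "checkout latency p99", "tags": ["apm"]},
    {"name": "k8s node disk", "tags": ["kubernetes", "disk"]},
    {"name": "nightly batch duration", "tags": ["batch"]},
]

buckets = {m["name"]: classify(m) for m in inventory}
for name, bucket in buckets.items():
    print(f"{bucket:16} {name}")
```

In practice the "review later" bucket tends to be the most useful output, because it shows where the tagging is too thin to make a decision.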

How to map Datadog monitors to Zabbix triggers

Datadog monitors and Zabbix triggers use different mental models. A Datadog monitor usually evaluates a query over a time window and applies a threshold, anomaly rule, or composite condition. A Zabbix trigger is a logical expression evaluated against item data stored in the Zabbix database.

Do not migrate monitors by copying thresholds one-for-one. That usually creates alert flapping. For volatile metrics such as CPU, bandwidth, temperature, memory, or queue depth, Zabbix triggers should use recovery expressions. For example, an alert might fire when CPU is above 90 percent for several minutes but only recover when CPU drops below 75 percent. That gap prevents the alert from constantly switching between problem and OK states.
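The effect of a recovery gap is easy to see in a small simulation. The sketch below is plain Python, not Zabbix trigger syntax, and the CPU samples are invented; it only counts how often an alert would change state with a single 90 percent threshold versus a 90/75 problem and recovery pair.

```python
from typing import Iterable


def count_state_changes(samples: Iterable[float],
                        problem_above: float,
                        recover_below: float) -> int:
    """Count PROBLEM/OK transitions for a problem and recovery threshold pair.

    Setting recover_below equal to problem_above models a single static threshold.
    """
    in_problem = False
    changes = 0
    for value in samples:
        if not in_problem and value > problem_above:
            in_problem = True
            changes += 1
        elif in_problem and value < recover_below:
            in_problem = False
            changes += 1
    return changes


# Hypothetical 5-minute CPU averages hovering around the 90 percent line.
cpu = [85, 91, 89, 92, 88, 93, 89, 91, 74, 70]

single_threshold = count_state_changes(cpu, problem_above=90, recover_below=90)
with_recovery_gap = count_state_changes(cpu, problem_above=90, recover_below=75)
print(single_threshold, with_recovery_gap)  # prints "8 2"
```

On this series the single threshold changes state eight times, while the recovery gap produces one problem and one recovery.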

Datadog anomaly monitors need extra care. If the Datadog rule is based on seasonality or deviation from historical behavior, a static Zabbix threshold is usually wrong. Zabbix trend functions and baseline-style expressions can help, but they need tuning and validation against real historical data. Anything that was machine-learning-backed in Datadog should go through a shadow-alert period before it becomes production paging logic.
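To make the difference from a static threshold concrete, here is a minimal baseline-style check in Python. It is not Datadog's anomaly algorithm and not a Zabbix function; the `is_anomalous` helper, the multiplier `k`, and the sample history are assumptions chosen for illustration. It compares the current value with the mean and standard deviation of values seen at the same hour on previous days.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(current: float,
                 same_hour_history: Sequence[float],
                 k: float = 3.0) -> bool:
    """Flag a value more than k standard deviations from the same-hour mean."""
    if len(same_hour_history) < 2:
        return False  # not enough history to judge; stay quiet
    baseline = mean(same_hour_history)
    spread = stdev(same_hour_history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > k * spread


# Hypothetical queue depth at 09:00 on the previous five weekdays.
history = [40, 42, 38, 41, 39]

print(is_anomalous(44, history))  # within normal variation
print(is_anomalous(60, history))  # far outside the baseline
```

A static threshold of, say, 80 would never fire on the value 60 here, even though it is far outside what this metric normally does at that hour. That gap is what the shadow-alert period is meant to expose.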

Grafana should be the dashboard layer

Zabbix dashboards are usable, but they rarely match what operators and executives expect after using Datadog. For most migration projects, the better pattern is Zabbix for collection and alerting, Grafana for visualization, and Datadog retained for application observability that has not moved.

The Grafana Zabbix plugin can query Zabbix through the API, which is fine for small panels and short time ranges. For large dashboards, long time windows, or NOC screens, API-only querying can become slow. For those cases, the more robust design is to enable the plugin's direct database connection for history and trend data, using a tightly restricted read-only database user.

This is not a cosmetic detail. If the Grafana layer is slow, operators will blame the migration even if Zabbix is collecting data correctly. Dashboard performance must be part of the migration validation plan, not something handled after Datadog is already disabled.

Alert routing and escalation

Datadog makes alert routing feel simple because monitors can directly notify teams, Slack channels, PagerDuty services, or incident workflows. Zabbix separates the pieces: items collect data, triggers define problems, actions decide what happens, media types deliver notifications, and escalations define timing.

That separation is powerful but easy to misconfigure. Before disabling Datadog coverage, teams need to confirm that Zabbix actions route to the correct team, severity, channel, and escalation path. They also need to test maintenance windows, acknowledgement behavior, automatic recovery messages, and suppressed notifications.

Trigger dependencies are especially important for infrastructure migration. If a core router fails, every downstream server behind it may appear unreachable. Without dependencies, Zabbix can generate an alert storm. With dependencies, Zabbix can notify on the root problem and suppress downstream noise. This is one area where Zabbix can be very effective, but only if the dependency model is built intentionally.
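The dependency idea can be modeled in a few lines. The sketch below is a generic illustration in Python, not Zabbix's internal logic: the `root_problems` function and the topology are hypothetical, and each node is assumed to have at most one upstream parent. It returns only the failing nodes that have no failing node anywhere upstream.

```python
from typing import Dict, List, Set


def root_problems(failing: Set[str], depends_on: Dict[str, str]) -> List[str]:
    """Return failing nodes whose upstream chain contains no other failing node."""
    roots = []
    for node in sorted(failing):
        parent = depends_on.get(node)
        seen = set()  # guards against accidental dependency loops
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in failing:
                suppressed = True
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            roots.append(node)
    return roots


# Hypothetical topology: child -> the upstream device it depends on.
depends_on = {
    "switch-a": "core-router",
    "web-01": "switch-a",
    "db-01": "switch-a",
    "backup-01": "core-router",
}

outage = {"core-router", "switch-a", "web-01", "db-01"}
print(root_problems(outage, depends_on))  # prints "['core-router']"
```

Four unreachable devices collapse into one notification about the core router. A server that fails on its own, with a healthy upstream path, is still reported.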

Main migration risks

The first risk is missing alerts. Zabbix proxies can become bottlenecks if SNMP polling, discovery, preprocessing, or item volume exceeds the proxy’s capacity. Monitor the Zabbix internal queues, proxy CPU, database write latency, poller utilization, and unsupported item counts before trusting the new stack.
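A rough load estimate per proxy helps spot likely bottlenecks before the queues start growing. The sketch below adds up expected new values per second from item counts and update intervals. The proxy names and item volumes are invented, and the result is only an input to capacity planning; what a given proxy can actually sustain depends on hardware, preprocessing, and SNMP response times.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def nvps_per_proxy(item_groups: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Sum expected new values per second for each proxy.

    Each group is (proxy_name, item_count, update_interval_seconds).
    """
    totals: Dict[str, float] = defaultdict(float)
    for proxy, count, interval in item_groups:
        if interval <= 0:
            raise ValueError("update interval must be positive")
        totals[proxy] += count / interval
    return dict(totals)


# Hypothetical planned item volumes.
planned = [
    ("proxy-dc1", 12000, 60),    # interface counters polled every minute
    ("proxy-dc1", 3000, 300),    # slower inventory-style items
    ("proxy-branch", 600, 30),   # availability checks every 30 seconds
]

load = nvps_per_proxy(planned)
print(load)
```

Comparing this estimate with the values each proxy actually processes during the mirror phase shows quickly whether polling is keeping up.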

The second risk is wrong thresholds. Datadog often hides evaluation complexity behind a friendly UI. Zabbix exposes the logic directly. If teams migrate dynamic Datadog monitors into simple static thresholds, they will either miss real incidents or create noisy alerts that nobody trusts.

The third risk is ownership collapse. Datadog is friendly enough that application teams can build some of their own dashboards and monitors. Zabbix usually pushes more work back to infrastructure teams. Without ownership rules, the central monitoring team becomes a ticket queue for every application team.

The fourth risk is assuming license savings equal business savings. A poorly designed Zabbix environment can consume the savings through database tuning, proxy sprawl, broken templates, failed dashboards, and alert fatigue. The migration only works if the operational model is designed together with the technical architecture.

A practical phased migration plan

Phase 1: Inventory and classification. Export or document Datadog monitors, dashboards, host groups, network devices, tags, notification routes, and integrations. Classify each item into three buckets: move to Zabbix, keep in Datadog, or review later.

Phase 2: Build Zabbix core infrastructure. Deploy the Zabbix server, database, and proxies. Size the database seriously. SNMP and high item volume will expose weak storage immediately. Configure housekeeping, trends, history retention, templates, and proxy placement before importing large scopes.

Phase 3: Mirror static infrastructure. Deploy Zabbix Agent 2 to a small static server group and enable SNMP monitoring for a limited network segment. Keep Datadog active. Compare data quality, trigger behavior, alert timing, and missing items for at least two weeks.

Phase 4: Build Grafana dashboard parity. Recreate the Datadog dashboards that operators actually use, not every dashboard that exists. Validate panel load time, 7-day and 30-day views, NOC wallboard behavior, and permissions.

Phase 5: Shadow alerting. Route Zabbix alerts to a muted or secondary channel first. Compare Datadog and Zabbix incidents. Fix thresholds, dependencies, escalation routing, and maintenance behavior before any production cutover.

Phase 6: Scoped deprecation. Only remove Datadog coverage for a scope after Zabbix has shown stable data collection, accurate alerts, acceptable dashboards, and clear ownership. Start with network devices or static infrastructure, not applications.
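The shadow-alerting comparison in Phase 5 can be sketched as a matching exercise between two alert lists. The code below is an assumption-heavy illustration: the `Alert` record, the scope strings, and the five-minute tolerance are hypothetical, and a real comparison would read from exported incident data. It pairs alerts on the same scope that fired close together and reports what Zabbix missed and what it raised on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Alert:
    scope: str       # hypothetical key such as "core-sw-01:link"
    fired_at: float  # epoch seconds


def compare_alerts(datadog: List[Alert], zabbix: List[Alert],
                   tolerance_s: float = 300.0
                   ) -> Tuple[List[Tuple[Alert, Alert]], List[Alert], List[Alert]]:
    """Return (matched pairs, missed by Zabbix, extra in Zabbix)."""
    unmatched = list(zabbix)
    matched, missed = [], []
    for dd in sorted(datadog, key=lambda a: a.fired_at):
        candidates = [z for z in unmatched
                      if z.scope == dd.scope
                      and abs(z.fired_at - dd.fired_at) <= tolerance_s]
        if candidates:
            # Pair with the closest Zabbix alert and use it only once.
            best = min(candidates, key=lambda z: abs(z.fired_at - dd.fired_at))
            unmatched.remove(best)
            matched.append((dd, best))
        else:
            missed.append(dd)
    return matched, missed, unmatched


dd_alerts = [Alert("core-sw-01:link", 1000), Alert("db-01:disk", 5000)]
zbx_alerts = [Alert("core-sw-01:link", 1090), Alert("web-01:cpu", 7000)]

matched, missed, extra = compare_alerts(dd_alerts, zbx_alerts)
print(len(matched), [a.scope for a in missed], [a.scope for a in extra])
```

The missed list is the cutover blocker. The extra list needs review as well, because it contains either noise to tune out or real problems Datadog was not catching.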

Validation checklist before reducing Datadog usage

Before decommissioning any Datadog scope, answer these questions: Are Zabbix proxies stable under real polling load? Are SNMP timeouts and unsupported items under control? Are recovery expressions configured for volatile metrics? Are dependency chains suppressing downstream alert storms? Are Grafana dashboards fast enough for operational use? Are escalation routes tested end to end? Does every migrated monitor have an owner?

Also validate data retention. Zabbix history and trend settings must match operational needs. Keeping too much raw history can overload the database. Keeping too little can break troubleshooting and reporting. The retention design should be explicit, not inherited from defaults.
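A back-of-the-envelope storage estimate makes the retention trade-off visible. The sketch below assumes roughly 90 bytes per stored history value and per hourly trend row, which is a commonly quoted rule of thumb rather than a measured figure; real sizes depend on the database engine, value types, and indexes. The item counts and retention periods are invented.

```python
def history_bytes(values_per_second: float, days: int,
                  bytes_per_value: int = 90) -> float:
    """Estimate raw history storage; bytes_per_value is an assumed average."""
    return values_per_second * 86400 * days * bytes_per_value


def trend_bytes(numeric_items: int, days: int,
                bytes_per_row: int = 90) -> int:
    """Estimate trend storage: one aggregated row per item per hour (assumed size)."""
    return numeric_items * 24 * days * bytes_per_row


GB = 1_000_000_000

# Hypothetical environment: 200 values per second across 15,000 numeric items.
print(history_bytes(200, 14) / GB)    # 14 days of raw history, about 21.8 GB
print(history_bytes(200, 90) / GB)    # 90 days of raw history, about 140 GB
print(trend_bytes(15000, 365) / GB)   # one year of trends, about 11.8 GB
```

Under these assumptions, keeping 90 days of raw history costs more than ten times as much as keeping a full year of hourly trends. That is why short raw history with long trends is the usual starting point.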

The final test is simple: if Datadog were disabled for this exact scope today, would the operations team still detect incidents, understand severity, route alerts correctly, and troubleshoot without guessing? If the answer is not clearly yes, the scope is not ready.

Conclusion

Datadog and Zabbix are not interchangeable products. Datadog is strongest where application context, tracing, dynamic environments, and SaaS convenience matter. Zabbix is strongest where infrastructure is stable, network-heavy, SNMP-based, cost-sensitive, and operationally owned by an infrastructure team.

The correct migration strategy is selective offload. Move network devices, static servers, simple availability checks, and basic infrastructure monitoring first. Keep APM, distributed tracing, high-churn Kubernetes, RUM, and critical application visibility in Datadog until a dedicated replacement is proven.

I help teams plan Datadog-to-Zabbix offload projects by mapping existing monitors, identifying safe migration candidates, and validating alert coverage before reducing Datadog usage.

Written by

Tymur Chmeruk

Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro · [email protected]