June 1, 2026

Datadog to Grafana Migration: Practical Path for Infrastructure Dashboards

A practical guide to moving infrastructure dashboards from Datadog to Grafana-backed data sources while keeping high-value application observability in Datadog.

For the broader framework, see Datadog Cost Reduction: What to Keep in Datadog and What to Offload to Zabbix/Grafana.

Migrating from Datadog to Grafana is not a like-for-like software swap. Datadog is a managed observability platform: it collects telemetry, stores it, correlates it, alerts on it, and gives teams a polished UI. Grafana is primarily a visualization and alerting layer. It becomes useful only when it is connected to the right data sources.

That distinction matters. A team can move infrastructure dashboards from Datadog to Grafana, but it also needs to decide where metrics, logs, and traces will live after the migration. For classic infrastructure, that may mean Zabbix. For Kubernetes metrics, Prometheus or VictoriaMetrics. For logs, Loki or OpenSearch. For traces, Tempo, Jaeger, or another tracing backend.

A realistic migration keeps Datadog where it still provides high value - APM, RUM, synthetic monitoring, service maps, and fast incident correlation - while moving lower-value infrastructure dashboards and bulk telemetry to Grafana-backed data sources.

What Grafana Replaces Well

Grafana is strongest when the problem is dashboarding over already-structured telemetry. It works well for infrastructure metrics, capacity views, network graphs, Kubernetes dashboards, service health views, and NOC-style screens.

Good migration candidates include:

CPU, memory, disk, and network utilization dashboards
VM and bare-metal infrastructure dashboards
Network device dashboards backed by Zabbix or SNMP collectors
Kubernetes cluster health dashboards backed by Prometheus or VictoriaMetrics
Log-volume and error-rate dashboards backed by Loki or OpenSearch
Executive infrastructure availability dashboards

Datadog time-series widgets, query-value widgets, toplists, and basic tables usually map cleanly to Grafana panels. Panel layout is the smaller part of the job. Most of the effort goes into query translation and data-source design.

Datadog view	Grafana equivalent	Migration difficulty
Time-series graph	Time series panel	Low
Query value	Stat panel	Low
Toplist	Bar gauge or table	Low
Heatmap	Heatmap panel	Medium
Notes/documentation	Text panel	Low
Free-form screenboard	Grid dashboard	Medium to high
APM/service map	External tracing backend required	High

What Grafana Does Not Replace By Itself

Grafana does not collect or store most telemetry by default. It queries other systems. If those systems are slow, under-sized, poorly secured, or missing historical retention, Grafana will not fix that.

It also does not automatically replace Datadog’s integrated workflows. Datadog can correlate an alert, metric spike, trace, and related logs inside one managed platform. With Grafana, the team has to build that correlation deliberately across Prometheus, Loki, Tempo, OpenSearch, Zabbix, or other backends.

The following areas should not be treated as simple dashboard migrations:

APM and distributed tracing. Grafana can visualize traces, but a tracing backend such as Tempo or Jaeger must be deployed and instrumented.
RUM and synthetic monitoring. Datadog’s browser tests and user-experience monitoring are not replaced by Grafana dashboards alone.
Service maps. Dependency mapping requires consistent service metadata and trace correlation.
Machine-learning alerts. Datadog anomaly and forecast-style monitors usually need to be rebuilt as explicit queries or replaced with another statistical system.
Incident workflow. Routing, escalation, silencing, ownership, and runbook links need to be rebuilt and tested.

Choose the Data Sources First

The safest way to plan a Datadog-to-Grafana migration is to start with telemetry type, not with dashboards. Each type of signal needs a backend that fits its access pattern.

Telemetry type	Practical backend	Best fit
Classic infrastructure metrics	Zabbix	VMs, bare metal, network devices, SNMP, static servers
Kubernetes and cloud-native metrics	Prometheus or VictoriaMetrics	Containers, exporters, service metrics, high-volume time series
High-volume structured logs	Loki	Label-based searches, app logs, Kubernetes logs, lower-cost retention
Full-text audit/security logs	OpenSearch	Broad text search, ad hoc investigation, complex log queries
Traces	Tempo or Jaeger	Distributed tracing and trace-to-log correlation
Dashboards and alerting	Grafana	Visualization, unified dashboarding, alert rules

This is where many migration projects fail. They export a Datadog dashboard, import a converted copy into Grafana, and only then discover that the new backend does not have the same metric names, tags, rollups, or retention behavior.

Zabbix + Grafana Use Case

Zabbix is a strong fit when the Datadog cost problem comes from basic infrastructure and network monitoring. It is useful for routers, switches, firewalls, hypervisors, static Linux and Windows servers, storage appliances, and SNMP-heavy environments.

In this model, Zabbix handles collection and alerting for classic infrastructure, while Grafana provides a cleaner dashboard layer. Operators do not need to use the Zabbix UI for every view. Grafana can read Zabbix data through the Zabbix data source plugin and present NOC dashboards, availability summaries, and capacity views.

This architecture is most useful when:

the environment has many stable devices or VMs,
the signals are mostly CPU, memory, disk, network, SNMP, and availability,
Datadog APM is not needed for those assets,
the team wants predictable self-hosted storage and retention,
the operations team already understands infrastructure monitoring.

Zabbix is not free in the operational sense. Someone must maintain the Zabbix server, database, proxies, templates, upgrades, and backup strategy. The cost reduction comes from replacing per-host or per-device SaaS billing with infrastructure and engineering ownership.

Prometheus or VictoriaMetrics + Grafana Use Case

For Kubernetes and cloud-native metrics, Prometheus-style collection is usually the better fit. Prometheus exporters are common, Kubernetes service discovery is mature, and Grafana has strong native support for PromQL.

VictoriaMetrics can be useful when the environment needs long retention, high ingestion volume, or lower resource usage than a large Prometheus setup. It can act as long-term storage or as the primary metrics backend. The point is not that it is automatically cheaper. The point is that the team controls retention, labels, scrape intervals, and storage architecture.

Before moving Datadog metrics into Prometheus or VictoriaMetrics, review label cardinality. Tags such as user_id, session_id, pod_uid, container_id, or full request paths can create excessive series counts. Moving those labels from Datadog to Prometheus does not make the problem disappear. It simply moves the pressure from a SaaS bill to memory, disk, and query performance.
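A rough way to spot risky labels before migration is to count distinct values per label across a sample of series. The sketch below assumes the series are available as plain dictionaries of label names to values; the function names, the sample data, and the limit of 100 values are all hypothetical and should be tuned to the backend's real capacity.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count the distinct values seen for each label across a list of label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def risky_labels(series, max_values=100):
    """Return label names whose distinct-value count exceeds max_values (an assumed limit)."""
    counts = label_cardinality(series)
    return sorted(name for name, count in counts.items() if count > max_values)


# Hypothetical sample: 500 series from 3 services, each tagged with a unique session_id.
sample = [
    {"service": f"svc-{i % 3}", "env": "prod", "session_id": f"s-{i}"}
    for i in range(500)
]
```

On this sample, risky_labels(sample) flags only session_id, which is the kind of label to drop or aggregate away before it reaches Prometheus or VictoriaMetrics.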

Loki or OpenSearch + Grafana Use Case

Logs need a separate decision. Loki and OpenSearch solve different problems.

Loki is a good fit for high-volume operational logs when queries are usually scoped by labels such as cluster, namespace, service, host, or environment. It keeps storage overhead lower by indexing labels rather than every word in every log line. That makes it useful for Kubernetes app logs, ingress logs, and noisy infrastructure logs.

OpenSearch is a better fit when teams need broad full-text search, wildcard search, complex filters, and ad hoc forensic investigation. That capability costs more in compute, memory, disk, and operational management.

A practical split often looks like this:

critical production errors and security events stay in Datadog or the existing SIEM,
noisy application logs move to Loki,
audit-heavy or search-heavy logs move to OpenSearch,
long-term raw archives go to S3 or another object store,
Grafana provides the dashboard and search entry point.
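The split above can be expressed as a small routing rule. This is only a sketch: the field names (level, category, env), the level values, and the destination strings are assumptions about a log schema, not part of any specific shipper's configuration. Archiving to object storage is left out because it is usually a copy of the stream rather than a routing decision.

```python
def route_log(record):
    """Pick a destination for one log record using the split described above.

    The keys "level", "category", and "env" are assumed field names.
    """
    level = record.get("level", "").lower()
    category = record.get("category", "").lower()
    is_prod_error = level in {"error", "critical", "fatal"} and record.get("env") == "prod"

    if category == "security" or is_prod_error:
        return "datadog_or_siem"   # critical production errors and security events
    if category == "audit":
        return "opensearch"        # audit-heavy or search-heavy logs
    return "loki"                  # everything else: noisy application logs
```

In practice the same rule would live in the log pipeline itself; writing it out first makes the ownership of each log class explicit.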

Migration Steps

1. Inventory Existing Datadog Dashboards

Start by listing Datadog dashboards by owner, business purpose, usage, and linked monitors. Do not migrate dashboards that nobody uses. Do not migrate dashboards that only exist because an old project once needed them.

Classify each dashboard as:

keep in Datadog,
migrate to Grafana,
rebuild differently,
archive/delete.
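The four outcomes above can be applied consistently with a few explicit rules. The following is a minimal sketch, assuming the inventory has already been exported into records; the Dashboard fields, the signal names, and the rules themselves are hypothetical and should be adjusted to what the inventory actually contains.

```python
from dataclasses import dataclass

# Signal types this article recommends keeping in Datadog for now.
KEEP_IN_DATADOG = {"apm", "rum", "synthetics", "service_map"}


@dataclass
class Dashboard:
    name: str
    owner: str             # empty string when nobody claims the dashboard
    views_last_90d: int    # hypothetical usage figure from the inventory
    signal: str            # e.g. "infra_metrics", "k8s_metrics", "logs", "apm"
    uses_anomaly_monitors: bool = False


def classify(d: Dashboard) -> str:
    """Assign one of the four inventory outcomes using simple, assumed rules."""
    if d.views_last_90d == 0 or not d.owner:
        return "archive/delete"
    if d.signal in KEEP_IN_DATADOG:
        return "keep in Datadog"
    if d.uses_anomaly_monitors:
        # Anomaly and forecast monitors usually need to be rebuilt as explicit queries.
        return "rebuild differently"
    return "migrate to Grafana"
```

For example, classify(Dashboard("nodes", "infra", 40, "infra_metrics")) returns "migrate to Grafana", while the same dashboard with no owner is marked for archiving.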

2. Choose the Backend for Each Dashboard

Map each dashboard to the backend that should own the data. A VM dashboard may belong in Zabbix. A Kubernetes dashboard may belong in Prometheus or VictoriaMetrics. A log-volume dashboard may belong in Loki. A full-text investigation view may belong in OpenSearch.

This prevents the common mistake of treating Grafana as the database.
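One way to enforce that rule is to keep the mapping in one place and fail loudly when a dashboard has no owning backend. The sketch below follows the table in "Choose the Data Sources First"; the telemetry-type keys are hypothetical labels chosen for illustration.

```python
# Telemetry type -> backend, following the table in "Choose the Data Sources First".
BACKEND_BY_TELEMETRY = {
    "classic_infrastructure": "Zabbix",
    "kubernetes_metrics": "Prometheus or VictoriaMetrics",
    "structured_logs": "Loki",
    "full_text_logs": "OpenSearch",
    "traces": "Tempo or Jaeger",
}


def backend_for(telemetry_type):
    """Return the backend that should own the data.

    Grafana is deliberately absent from the mapping: it queries these
    systems but does not store the telemetry itself.
    """
    try:
        return BACKEND_BY_TELEMETRY[telemetry_type]
    except KeyError:
        raise ValueError(
            f"no backend chosen for telemetry type {telemetry_type!r}"
        ) from None
```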

3. Build Parallel Ingestion

Run Datadog and the new backend side by side for a validation period. Do not cut over blindly. Parallel ingestion lets the team compare dashboards, identify missing tags, and verify that rollups and retention windows behave as expected.
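Missing tags are one of the cheaper things to check during the parallel run. A minimal sketch, assuming the tag keys used by a Datadog dashboard and the label names present in the new backend have both been collected into sets; the rename map (for example kube_namespace to namespace) is a hypothetical example of a naming difference.

```python
from typing import Dict, Optional, Set


def missing_labels(
    datadog_tags: Set[str],
    backend_labels: Set[str],
    renames: Optional[Dict[str, str]] = None,
) -> Set[str]:
    """Return expected label names that the new backend does not expose.

    Datadog tag keys are first mapped through `renames` so that known
    naming differences are not reported as gaps.
    """
    renames = renames or {}
    expected = {renames.get(tag, tag) for tag in datadog_tags}
    return expected - backend_labels


# Hypothetical tag sets for one dashboard.
dd_tags = {"host", "env", "kube_namespace", "team"}
new_labels = {"host", "env", "namespace"}
```

Here missing_labels(dd_tags, new_labels, {"kube_namespace": "namespace"}) reports only team, which would need to be added at collection time before any panel grouped by team can be reproduced.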

4. Translate Queries, Not Just Panels

Panel layout is usually the easy part. Query translation is where the real migration happens.

A Datadog query like this:

avg:system.cpu.user{host:web-prod-*} by {host}.rollup(avg, 60)

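may become a PromQL-style query like this, assuming the metric was ingested under a name derived from the Datadog one (system_cpu_user) and still carries a host label: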

avg by (host) (avg_over_time(system_cpu_user{host=~"web-prod-.*"}[1m]))

The result may still differ because of rollup behavior, scrape interval, missing data handling, tag naming, and aggregation semantics. Every important dashboard needs side-by-side validation.
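For simple queries of the shape shown above, the mechanical part of the translation can be scripted. The sketch below is a hypothetical helper, not a general converter: it handles only one aggregator, comma-separated key:value filters with an optional * wildcard, an optional by clause, and an optional rollup. It assumes metric names map by replacing dots with underscores, and it does not escape other regex metacharacters in tag values.

```python
import re

# Matches: agg:metric{filters} [by {tags}][.rollup(method, seconds)]
_QUERY = re.compile(
    r"^(?P<agg>avg|sum|min|max):(?P<metric>[\w.]+)"
    r"\{(?P<filters>[^}]*)\}"
    r"(?: by \{(?P<by>[^}]*)\})?"
    r"(?:\.rollup\((?P<rollup>avg|sum|min|max),\s*(?P<seconds>\d+)\))?$"
)


def _matcher(tag_filter):
    """Turn one Datadog tag filter (key:value) into a PromQL label matcher."""
    key, value = tag_filter.split(":", 1)
    if "*" in value:
        # Wildcards become a regex match; other metacharacters are left as-is.
        return f'{key}=~"{value.replace("*", ".*")}"'
    return f'{key}="{value}"'


def translate(query):
    """Translate a simple Datadog metric query into a PromQL-style string."""
    m = _QUERY.match(query.strip())
    if m is None:
        raise ValueError(f"unsupported Datadog query: {query!r}")

    metric = m["metric"].replace(".", "_")  # assumed naming convention
    filters = [f.strip() for f in m["filters"].split(",") if f.strip() not in ("", "*")]
    selector = metric + "{" + ",".join(_matcher(f) for f in filters) + "}"

    if m["rollup"]:
        selector = f'{m["rollup"]}_over_time({selector}[{m["seconds"]}s])'

    by = f' by ({m["by"]})' if m["by"] else ""
    return f'{m["agg"]}{by} ({selector})'
```

Applied to the Datadog query above, translate produces the same PromQL expression with the window written as [60s] instead of [1m]. The output is a starting point for review, since the script knows nothing about scrape intervals, counter versus gauge semantics, or missing-data handling.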

5. Rebuild Alerts Separately

Do not assume Datadog monitors become Grafana alerts automatically. Alert evaluation windows, no-data behavior, grouping, notification routing, silencing, and escalation rules need separate testing.

Start with non-critical alerts, then move infrastructure alerts, and keep critical application alerts in Datadog until the Grafana-backed path has proven reliable.
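No-data behavior is a common source of surprises, so it helps to write down what each alert should do when samples are missing and test that explicitly. The sketch below is a simplified model of one evaluation window, not Grafana's actual evaluator; the state names are illustrative strings and the averaging rule is an assumption.

```python
from typing import List, Optional


def evaluate(
    samples: List[Optional[float]],
    threshold: float,
    no_data_state: str = "NoData",
) -> str:
    """Evaluate one alert window; None marks a missing sample.

    Returns `no_data_state` when the window has no usable samples,
    otherwise compares the average of the present samples to the threshold.
    """
    present = [s for s in samples if s is not None]
    if not present:
        return no_data_state
    average = sum(present) / len(present)
    return "Alerting" if average > threshold else "Normal"
```

For a host-down check, passing no_data_state="Alerting" makes a silent host page someone; for a batch job that only reports while running, the default may be the right choice. The decision should be recorded per alert rather than inherited from whatever the old monitor happened to do.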

6. Validate With Operators

The dashboard is not migrated until the people using it agree that it works. NOC teams, SREs, infrastructure engineers, and application owners should validate the Grafana replacement during real operating conditions.

Validation should cover:

same time range in Datadog and Grafana,
same host/service scope,
same alert threshold behavior,
same dashboard refresh performance,
acceptable query latency,
ownership and escalation path.
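The first two checks can be backed by a numeric comparison of the same query in both systems. A minimal sketch, assuming each side has been exported as a mapping of timestamp to value for the same time range and scope; the function name and the 5% relative tolerance are hypothetical choices.

```python
def compare_series(datadog, grafana, rel_tol=0.05):
    """Compare two {timestamp: value} series over the timestamps they share.

    A point is a mismatch when the relative difference exceeds rel_tol.
    """
    shared = sorted(set(datadog) & set(grafana))
    mismatches = []
    for ts in shared:
        a, b = datadog[ts], grafana[ts]
        scale = max(abs(a), abs(b), 1e-9)  # avoid division by zero
        if abs(a - b) / scale > rel_tol:
            mismatches.append(ts)
    return {
        "shared_points": len(shared),
        "missing_in_grafana": sorted(set(datadog) - set(grafana)),
        "mismatches": mismatches,
    }


# Hypothetical CPU samples keyed by seconds since the start of the window.
dd_points = {0: 50.0, 60: 52.0, 120: 80.0, 180: 55.0}
gf_points = {0: 50.5, 60: 52.0, 120: 60.0}
report = compare_series(dd_points, gf_points)
```

In this sample the report shows three shared points, one timestamp missing on the Grafana side, and one mismatch at 120 seconds. Small differences are expected because of rollups and scrape intervals, so the tolerance should be agreed with the operators who sign off on the dashboard.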

Risks and Mistakes

Treating Grafana as a Datadog Clone

Grafana is not a full Datadog replacement by itself. It is the front end for a larger observability architecture. The backend design matters more than the dashboard skin.

Moving Too Much Too Fast

The safest first candidates are infrastructure dashboards and bulk telemetry. APM, RUM, synthetic monitoring, and incident correlation should usually stay in Datadog until there is a proven replacement.

Ignoring Operational Cost

Self-hosted tools reduce vendor spend, but they add work: upgrades, storage planning, access control, backup, HA, alert routing, and performance tuning. A migration that saves license cost but consumes engineering time every week is not automatically a win.

Copying Bad Cardinality

If the existing Datadog environment has uncontrolled tags, do not copy them directly into Prometheus, VictoriaMetrics, Loki, or OpenSearch. Fix labels and ownership during migration.

Breaking Historical Comparisons

Datadog and the new backend may not retain the same history, use the same rollups, or treat missing data the same way. Teams need to decide how much historical comparison matters before disabling Datadog dashboards.

Practical Migration Checklist

Export a list of Datadog dashboards, owners, and linked monitors.
Delete or archive dashboards that no team uses.
Classify each dashboard by telemetry type.
Pick the correct backend: Zabbix, Prometheus, VictoriaMetrics, Loki, OpenSearch, Tempo, or Jaeger.
Run parallel ingestion for critical metrics.
Translate queries manually where automation is not reliable.
Validate panels against Datadog for the same time ranges.
Rebuild alerts separately and test notification routing.
Keep Datadog for high-value APM, RUM, synthetic tests, and critical service correlation until replacement workflows are proven.
Reduce Datadog coverage (hosts, metrics, dashboards, monitors) only after dashboard and alert parity is confirmed.

Conclusion

A Datadog-to-Grafana migration makes sense when the goal is to reduce cost for infrastructure dashboards and bulk telemetry. It does not make sense as a blind replacement for every Datadog feature.

Grafana works best as the presentation layer over a deliberate telemetry architecture: Zabbix for classic infrastructure, Prometheus or VictoriaMetrics for cloud-native metrics, Loki or OpenSearch for logs, and a dedicated tracing backend for traces.

The right migration is selective. Keep Datadog where it gives high operational value. Move low-value, high-volume infrastructure visibility into Grafana-backed systems where the cost model is easier to control.

I help teams move infrastructure dashboards from Datadog to Grafana-backed data sources while keeping critical Datadog views where they still make sense.

Written by

Tymur Chmeruk

Cloud Security & Infrastructure Engineer · Baltimore–Washington Metro