
Graphite/Deprecation Roadmap

From Wikitech

TL;DR

We have been using Prometheus in production for several years as it offers several benefits over Graphite. Migrating MW off Graphite ensures we stay ahead with a supported, scalable metrics platform for more effective, multidimensional metrics analysis and storage. Prometheus provides more robust data labeling, storage, and query capabilities. This initiative is fundamental in unifying our metrics, enhancing monitoring, improving MW observability, and reducing tool fragmentation.

As the WMF improves its culture around MW ecosystem sustainability, we are setting our goals to complete the migration of active, production, and in-use (by dashboards/alerts) metrics to Prometheus to enable read-only mode on the Graphite cluster by April 15th 2025.

For this exercise, we define as “in-use” any metric emitted to Graphite that maps to a dashboard panel or alert active in Grafana. See the Graphite Utilization Dashboard.

Problem Statement

We currently operate Graphite and Prometheus concurrently. We are now familiar and comfortable with the new platform and have observed enough of its benefits that we would like to make a deliberate, concerted effort to complete the migration to Prometheus and fully sunset Graphite. Completing the migration of all Graphite metrics to Prometheus is crucial for several reasons:

  • Enhanced Observability: Prometheus offers advanced capabilities, including multidimensional metrics analysis through key-value tags, robust querying, and scalable storage, which improve system monitoring and troubleshooting.
  • Resource Efficiency: Running two metrics stacks that provide overlapping functionality is inefficient and a significant waste of resources. Consolidating onto a single platform reduces operational overhead and costs.
  • Industry Alignment: Prometheus has become the industry standard for metrics, backed by a robust ecosystem and a large, active community. In contrast, Graphite, while effective in its time, is an older technology with a diminishing ecosystem and less momentum for future innovation.
  • Unified Metrics Platform: Consolidating metrics into Prometheus reduces tool fragmentation, simplifies the developer workflow, and aligns the organization with modern infrastructure practices.
  • Sustainability and Support: Prometheus is actively developed and supported, ensuring long-term scalability and reliability. This transition reduces the risks associated with relying on outdated technology like Graphite.
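To make the "key-value tags" point above concrete, here is a minimal sketch contrasting how the same measurement might be named in each system. The metric names, label keys, and helper functions are invented for illustration; they are not taken from MediaWiki, StatsLib, or any existing dashboard.

```python
# Hypothetical illustration: the metric names and label keys used with these
# helpers are invented for this sketch, not real MediaWiki or StatsLib names.


def graphite_path(prefix: str, *parts: str) -> str:
    """Graphite encodes every dimension positionally in one dotted path."""
    return ".".join((prefix, *parts))


def prometheus_sample(name: str, labels: dict[str, str]) -> str:
    """Prometheus keeps one metric name and carries dimensions as labels."""
    rendered = ",".join(
        f'{key}="{value}"' for key, value in sorted(labels.items())
    )
    return f"{name}{{{rendered}}}"


def matches(labels: dict[str, str], selector: dict[str, str]) -> bool:
    """Tiny stand-in for a label selector: every selector pair must match."""
    return all(labels.get(key) == value for key, value in selector.items())


if __name__ == "__main__":
    labels = {"wiki": "enwiki", "result": "success"}
    print(graphite_path("example", "edits", "enwiki", "success"))
    print(prometheus_sample("example_edits_total", labels))
    # Filtering on one dimension does not depend on its position in a path.
    print(matches(labels, {"wiki": "enwiki"}))
```

With the dotted path, a query has to know that the wiki sits in the third position. With labels, a query can filter or aggregate on any dimension by name, which is what the multidimensional analysis described above refers to.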

Last year, the team set out to test whether a new interface was viable and determined that long-term sustainability required us to migrate MediaWiki metrics to Prometheus using StatsLib, a new, internally developed, Prometheus-capable metrics interface. By the end of Q2, the team had successfully tested the component in production, and by the end of Q4, the migration was about 42% complete.

The team defines “in use” as any metric that maps to a panel in Grafana or an alert. The goal is to migrate these critical metrics while ensuring minimal disruption to monitoring and avoiding work migrating metrics that are “captured but not observed.”

Assumptions

  • Prometheus is tried and tested; there is added value to migrating.
  • It is assumed that maintaining a single, modern metrics platform (Prometheus) will reduce complexity, operational overhead, and costs.
  • With ongoing community support and development, Prometheus is expected to remain the industry standard for metrics.
  • Graphite’s ecosystem is assumed to continue diminishing, making long-term reliance unsustainable.
  • Existing tooling, including StatsLib and StatsD exporters, is sufficient to facilitate the migration.
  • Teams and leadership are aligned on prioritizing this migration to meet sustainability goals.
  • MW Core metrics have been migrated, and corresponding telemetry matches.
  • No blockers for technical contributors preventing them from migrating their tools.
  • Technical documentation available is sufficient for technical contributors to migrate.
  • The deprecation, migration and plan have been sufficiently communicated.
  • The majority (67% lower bound) of in-use metrics have been migrated and retired from panels/dashboards/alerts.

Additional notes

  • The focus is limited to active and critical metrics, reducing the effort to migrate legacy or unused data.
  • Successful migration will require collaboration across teams and active participation from the broader MW ecosystem.

Decision

  • Graphite will cease ingesting metrics on April 15th 2025. After that date it will continue to operate in read-only mode for historical data only.
  • Retire the Graphite hardware once support reaches end-of-life (before June 2026).

Risks and Mitigations

At this time, all risk items have been documented in a post-mortem exercise and addressed. No remaining technical unknowns should impede the migration or its conclusion. If the migration continues at its current rate, the deprecation should proceed on the scheduled date. Special cases can be considered on request.

Decision Criteria

The decision criteria used to advance this project are outlined in the following RFC: https://phabricator.wikimedia.org/T249164

Project Roadmap

Based on our project plan, we’re identifying some target milestones globally for the whole project and per-quarter goals and targets.

Global

Global metrics and goals cover the entirety of the Fiscal year. As the key result and working group are structured, teams and contributing hypotheses are expected to work on their hypothesis for three quarters and assess the impact during Q4.

Goals

  • Ensure MediaWiki platform sustainability.
  • Complete migration of metrics to Prometheus.
  • Sunset Graphite into “read-only mode” by the end of Q3.
  • Formally announce Graphite's final deprecation date/timeline one year after Q3.

Success Metrics:

  1. Migration % of dashboard panels using Graphite queries (metrics ingested and used in the last 90 days)
  2. Overall StatsLib utilization in contrast to the Graphite data source (metrics emitted in the last 90 days)
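As a rough sketch of how the first success metric could be computed, the example below counts panels by data source and reports the share that no longer queries Graphite. The `Panel` record, its field names, and the sample dashboards are hypothetical. The 67% and 90% defaults are borrowed from the lower bound and ideal target that this roadmap states for intake volume in Q3; applying them to panel counts is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Panel:
    """Hypothetical record for one Grafana panel (fields invented here)."""
    dashboard: str
    title: str
    datasource: str  # "graphite" or "prometheus" in this sketch


def migration_percent(panels: list[Panel]) -> float:
    """Percentage of panels that no longer query Graphite."""
    if not panels:
        return 0.0
    migrated = sum(1 for panel in panels if panel.datasource != "graphite")
    return 100.0 * migrated / len(panels)


def readiness(percent: float, lower_bound: float = 67.0,
              target: float = 90.0) -> str:
    """Classify progress against an assumed lower bound and ideal target."""
    if percent >= target:
        return "target met"
    if percent > lower_bound:
        return "critical mass"
    return "below lower bound"


if __name__ == "__main__":
    # Invented sample data: three of four panels already use Prometheus.
    panels = [
        Panel("Example edits", "Edit rate", "prometheus"),
        Panel("Example edits", "Save latency", "prometheus"),
        Panel("Example cache", "Hit ratio", "graphite"),
        Panel("Example cache", "Evictions", "prometheus"),
    ]
    percent = migration_percent(panels)
    print(f"{percent:.1f}% migrated: {readiness(percent)}")
```

In practice the panel list would come from an export of the dashboards in use over the last 90 days; how that export is produced is outside the scope of this sketch.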

Q1-FY2024/2025

Goals

  • [In Progress] Identify (and disable) unused MW Graphite metrics to reduce noise and narrow the set of actionable metrics to migrate.
  • Update dashboards in Grafana to use Prometheus-sourced metrics instead of Graphite-sourced metrics.
  • Update the default data source in Grafana to be Prometheus, not Graphite: https://phabricator.wikimedia.org/T269333
  • Formally announce technical deprecation of Graphite (read-only Q3, termination one year later).
    • Phabricator: https://phabricator.wikimedia.org/T228380
    • Wikitech/Docs: https://wikitech.wikimedia.org/wiki/Graphite
    • Grafana: https://grafana.wikimedia.org (under service updates)
    • wikitech-l : [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all
    • Tech-all: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all

Success Metric Targets

  • Increase migration progress (by intake) by 30%. (ended at 40% by volume)
  • Increase migration progress by 30% (9.5% in panels/dashboards converted)

Q2-FY2024/2025

Goals

Success Metric Targets

  • Increase migration progress (by intake volume) by another 20%. (76% completed) -- metric retired
  • Increase migration progress by 30% (50% in panels/dashboards converted)

Q3-FY2024/2025

Goals

Success Metric Targets

  • Increase migration progress (by intake) to 90% as the ideal target
    • > 67% lower bound is considered the minimal "critical mass" for a “read only” implementation, provided core services have been migrated

Q4-FY2024/2025

Goals

Success Metric Targets

  • Increase migration progress (by intake) to as close to 100% as possible
  • Increase migration progress to 100% (in panels/dashboards converted)

Q4-FY2025/2026

  • Retire hardware by the end of Q4 of FY 2025/2026 (June 2026).

Frequently Asked Questions

Will you be migrating historical data from Graphite to Prometheus?

We do not plan to migrate data from Graphite to Prometheus. While this is technically feasible, we don't have enough documented requests to justify it. Instead, we will run both systems in parallel for a while (1yr) so that new historical data can accumulate in Prometheus before Graphite becomes read-only.

What if I need data longer than 1 year?

We can provide Graphite files for projects interested in longer retention and work on possible backfilling alternatives for specific cases. Details are in T349521.

What if I need access to Graphite for longer?

We can provide a subset of the data in a VM running a read-only Graphite service for a discretionary period beyond the year after the hardware is sunset. Support will be limited, as we are sunsetting the technology. This workaround will be offered for as long as the current Graphite and OS deployments remain supported.
