Graphite/Deprecation Roadmap
TL;DR
We have been using Prometheus in production for several years as it offers several benefits over Graphite. Migrating MW off Graphite ensures we stay ahead with a supported, scalable metrics platform for more effective, multidimensional metrics analysis and storage. Prometheus provides more robust data labeling, storage, and query capabilities. This initiative is fundamental in unifying our metrics, enhancing monitoring, improving MW observability, and reducing tool fragmentation.
As the WMF improves its culture around MW ecosystem sustainability, our goal is to complete the migration of active, production, in-use (by dashboards/alerts) metrics to Prometheus so that the Graphite cluster can be switched to read-only mode by April 15th 2025.
For this exercise, we define as “in-use” any metric emitted to Graphite that is mapped to a dashboard panel or alert active in Grafana. See the Graphite Utilization Dashboard.
Problem Statement
We currently operate Graphite and Prometheus concurrently. We have become familiar and comfortable with the new platform and have observed enough of its benefits that we want to make a deliberate, concerted effort to complete the migration to Prometheus and fully sunset Graphite. Completing the migration of all Graphite metrics to Prometheus is crucial for several reasons:
- Enhanced Observability: Prometheus offers advanced capabilities, including multidimensional metrics analysis through key-value tags, robust querying, and scalable storage, which improve system monitoring and troubleshooting.
- Resource Efficiency: Running two metrics stacks that provide overlapping functionality is inefficient and a significant waste of resources. Consolidating onto a single platform reduces operational overhead and costs.
- Industry Alignment: Prometheus has become the industry standard for metrics, backed by a robust ecosystem and a large, active community. In contrast, Graphite, while effective in its time, is an older technology with a diminishing ecosystem and less momentum for future innovation.
- Unified Metrics Platform: Consolidating metrics into Prometheus reduces tool fragmentation, simplifies the developer workflow, and aligns the organization with modern infrastructure practices.
- Sustainability and Support: Prometheus is actively developed and supported, ensuring long-term scalability and reliability. This transition reduces the risks associated with relying on outdated technology like Graphite.
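To illustrate the "multidimensional metrics analysis through key-value tags" point from the list above, here is a minimal sketch in plain Python. It does not use any Prometheus client; the metric name, label names, and values are hypothetical and chosen only for illustration. It mimics what a PromQL aggregation such as `sum by (wiki) (...)` does: the same samples can be sliced along any label without changing how the metric is emitted.

```python
# Hypothetical samples: (labels, value). Names and numbers are made up.
samples = [
    ({"__name__": "api_requests_total", "wiki": "enwiki", "module": "query"}, 120),
    ({"__name__": "api_requests_total", "wiki": "dewiki", "module": "query"}, 30),
    ({"__name__": "api_requests_total", "wiki": "enwiki", "module": "parse"}, 50),
]


def sum_by(samples, label):
    """Sum sample values grouped by one label, ignoring all other labels."""
    totals = {}
    for labels, value in samples:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0) + value
    return totals


if __name__ == "__main__":
    print(sum_by(samples, "wiki"))    # group by wiki
    print(sum_by(samples, "module"))  # group by module, same data
```

With a dotted Graphite path, each dimension is tied to a fixed position in the metric name, so grouping by a different dimension typically means rewriting the query around that position.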
Last year, the team set out to test whether a new interface was viable and determined that long-term sustainability required us to migrate MediaWiki metrics to Prometheus, utilizing StatsLib, a new, internally developed, Prometheus-capable metrics interface. By the end of Q2, the team had successfully tested the component in production, and by the end of Q4, it had advanced about 42% along the migration.
The team defines “in use” as any metric that maps to a panel in Grafana or an alert. The goal is to migrate these critical metrics while ensuring minimal disruption to monitoring and avoiding work migrating metrics that are “captured but not observed.”
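The "in use" definition above amounts to a set intersection. The sketch below is one possible way to express it, assuming two hypothetical inputs: the names of metrics emitted to Graphite, and the names referenced by Grafana panels or alerts after any wildcards in their queries have been resolved to concrete metric names. All metric names are invented for the example.

```python
def split_in_use(emitted, referenced):
    """Split emitted metrics into in-use and captured-but-not-observed sets.

    emitted:    iterable of metric names emitted to Graphite (hypothetical input)
    referenced: iterable of metric names used by a Grafana panel or alert
    """
    emitted = set(emitted)
    referenced = set(referenced)
    in_use = emitted & referenced      # candidates for migration
    unobserved = emitted - referenced  # captured but not observed
    return in_use, unobserved


if __name__ == "__main__":
    emitted = ["mw.edit.count", "mw.job.duration", "mw.legacy.foo"]
    referenced = ["mw.edit.count", "mw.job.duration", "other.source.bar"]
    in_use, unobserved = split_in_use(emitted, referenced)
    print(sorted(in_use))
    print(sorted(unobserved))
```

Only the first set would need migrating under this definition; the second set is a candidate for being disabled instead.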
Assumptions
- Prometheus is tried and tested; there is added value to migrating.
- It is assumed that maintaining a single, modern metrics platform (Prometheus) will reduce complexity, operational overhead, and costs.
- With ongoing community support and development, Prometheus is expected to remain the industry standard for metrics.
- Graphite’s ecosystem is assumed to continue diminishing, making long-term reliance unsustainable.
- Existing tooling, including StatsLib and StatsD exporters, is sufficient to facilitate the migration.
- Teams and leadership are aligned on prioritizing this migration to meet sustainability goals.
- MW Core metrics have been migrated, and corresponding telemetry matches.
- No blockers for technical contributors preventing them from migrating their tools.
- Technical documentation available is sufficient for technical contributors to migrate.
- The deprecation, migration and plan have been sufficiently communicated.
- The majority (67% lower bound) of in-use metrics have been migrated and retired from panels/dashboards/alerts.
Additional notes
- The focus is limited to active and critical metrics, reducing the effort to migrate legacy or unused data.
- Successful migration will require collaboration across teams and active participation from the broader MW ecosystem.
Decision
- We have set April 15th 2025 as the final date to cease ingesting metrics on Graphite; after that date, it will continue to operate only in read-only mode for historical data.
- Retire the Graphite hardware once support is set to end-of-life (before June 2026).
Risks and Mitigations
At this time, all risk items have been documented in a post-mortem exercise and addressed. No more technical unknowns should impede the migration and its conclusion. If the migration continues at its current rate, the deprecation should proceed on the scheduled date. Special cases can be considered on request.
Decision Criteria
Decision criteria utilized to advance this project can be found outlined in the following RFC: https://phabricator.wikimedia.org/T249164
Project Roadmap
Based on our project plan, we’re identifying some target milestones globally for the whole project and per-quarter goals and targets.
Global
Global metrics and goals cover the entirety of the Fiscal year. As the key result and working group are structured, teams and contributing hypotheses are expected to work on their hypothesis for three quarters and assess the impact during Q4.
Goals
- Ensure MediaWiki platform sustainability.
- Complete migration of metrics to Prometheus.
- Sunset Graphite into “read-only mode” by the end of Q3
- Formally announce Graphite's final deprecation date/timeline one year after Q3.
Success Metrics:
- Migration % of dashboard panels using Graphite queries (metrics ingested used last 90d)
- Overall StatsLib utilization in contrast to the Graphite data source (metrics emitted last 90d)
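As a rough sketch of how the first success metric could be computed, the snippet below takes a list of panel records and reports the share that no longer query Graphite. The record shape (a `datasource` field per panel) and the sample data are assumptions made for illustration; they are not the actual Grafana export format or the team's real tooling.

```python
def panel_migration_pct(panels):
    """Percentage of tracked panels whose data source is no longer Graphite.

    panels: list of dicts with a hypothetical "datasource" key. The list is
    assumed to contain only panels that used Graphite at the baseline.
    """
    if not panels:
        return 0.0
    migrated = sum(1 for p in panels if p["datasource"] != "graphite")
    return 100.0 * migrated / len(panels)


if __name__ == "__main__":
    # Hypothetical panels, for illustration only.
    panels = [
        {"title": "Edit rate", "datasource": "prometheus"},
        {"title": "Job queue", "datasource": "graphite"},
        {"title": "API latency", "datasource": "graphite"},
        {"title": "Cache hits", "datasource": "graphite"},
    ]
    print(panel_migration_pct(panels))
```

The empty-list case returns 0.0 rather than raising, which is a design choice of this sketch and not something the roadmap specifies.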
Q1-FY2024/2025
Goals
- [In Progress] Identify (and disable) unused MW Graphite metrics to reduce noise and narrow down the actionable metrics to migrate.
- Update dashboards in Grafana to use Prometheus-sourced metrics instead of Graphite-source.
- Update the default data source in Grafana to be Prometheus, not Graphite https://phabricator.wikimedia.org/T269333
- Formally announce technical deprecation of Graphite (read-only Q3, termination one year later).
  - Phabricator: https://phabricator.wikimedia.org/T228380
  - Wikitech/Docs: https://wikitech.wikimedia.org/wiki/Graphite
  - Grafana: https://grafana.wikimedia.org (under service updates)
  - wikitech-l: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all
  - Tech-all: [draft] WE5.1.2 Graphite deprecation notice for wikitech-l and tech-all
Success Metric Targets
- Increase migration progress (by intake) by 30%. (ended at 40% by volume)
- Increase migration progress by 30% (9.5% in panels/dashboards converted)
Q2-FY2024/2025
Goals
- Migrate non-MW metrics producers completely off Graphite
- Continue updating dashboards to use Prometheus-sourced metrics instead of Graphite-source https://phabricator.wikimedia.org/T350592
- Implementation plan and approach to configuring Graphite as “read-only” for a year before sunset. https://phabricator.wikimedia.org/T372856
- Identify and mitigate unknown metrics/sources (should there be any).
- Establish office hours support for the rest of the organization regarding StatsLib/Migration.
Success Metric Targets
- Increase migration progress (by intake volume) by another 20%. (76% completed) -- metric retired
- Increase migration progress by 30% (50% in panels/dashboards converted)
Q3-FY2024/2025
Goals
- Continue updating dashboards to use Prometheus-sourced metrics instead of Graphite-source. https://phabricator.wikimedia.org/T350592
- Identify and mitigate unknown metrics/sources (should there be any). https://phabricator.wikimedia.org/T228380
- Continue office hours support for the rest of the organization regarding StatsLib/Migration.
- Close the tail end of the migration, identify and migrate any pending extensions/modules/sources.
- Prep for enabling “read-only” mode on Graphite by the end of the quarter https://phabricator.wikimedia.org/T372856
Success Metric Targets
- Increase migration progress (by intake) to 90% as the ideal target
- A lower bound of > 67% is considered the minimal "critical mass" for a “read-only” implementation, provided core services have been migrated
Q4-FY2024/2025
Goals
- Implementation of “read-only” mode on Graphite on April 15th 2025 https://phabricator.wikimedia.org/T372856
- Analysis and retrospective
- Updated dashboard panels.
- Sustainability intervention reports.
Success Metric Targets
- Increase migration progress (by intake) to as close to 100% as possible
- Increase migration progress to 100% (in panels/dashboards converted)
Q4-FY2025/2026
- Retire hardware by the end of Q4 of FY 2025/2026 (June 2026).
Frequently Asked Questions
Will you be migrating historical data from Graphite to Prometheus?
We do not plan to migrate data from Graphite to Prometheus. While this is technically feasible, we don't have enough documented requests. Instead, we will run both systems in parallel for a while (1yr) so that new historical data accumulates in Prometheus before Graphite becomes read-only.
What if I need data longer than 1 year?
We can also provide Graphite files for projects interested in longer retention and work on possible backfilling alternatives for specific cases. Details are in T349521.
What if I needed access to graphite for longer?
We can provide a subset of the data in a VM with a read-only Graphite service, available for a discretionary period longer than the year after the hardware is sunset. Support will be limited, as we are sunsetting the technology. This workaround will be offered for as long as the current Graphite and OS deployments are supported.
Relevant links
- T228380: Graphite technical deprecation – this is the main task tracking the overall deprecation.
- T350592: Migrate "in-use" metrics and dashboards to StatsLib – this task tracks WM-emitted Graphite metrics; once it is close to completion, the bulk of the work will be complete.