TL;DR:
The SRE Observability team asks that no new metrics be deployed to
Graphite. *On April 15th, 2025, the service will be placed in a
read-only state*, disabling new metric ingestion; details are in T228380 [1].
Please disable or migrate all existing Graphite metrics to Prometheus [2]
and retire the corresponding panels and dashboards as applicable before the
noted date.
Reminder: Sunsetting Graphite in Favor of Prometheus
The SRE Observability team has been operating Prometheus [2] in production
for several years; it offers several operational benefits over Graphite.
After a long period of observation and usage, the team has determined that
migrating MW off Graphite keeps us on a supported, scalable metrics
platform for more effective dimensional metrics analysis and storage.
Notice and Action Required
The team plans to make Graphite read-only on April 15th, 2025, and begin the
formal deprecation of Graphite in production [1].
We ask all teams and maintainers to check this dashboard [3] and the
related task T350592 [4], and to claim the metrics and dashboards in the
associated tasks or components under their care. Disable or remove any
unused metrics and dashboards first, then follow the process outlined in
the task to migrate all “in-use” metrics before April 15th. After this
date, Graphite will be read-only, and no new data will be ingested.
Graphite will continue to be available for another year to provide
historical data in read-only mode while new history is recorded in
Prometheus. Please see the tracking task T228380 [1] or the roadmap [5] for
additional details.
Why We’re Migrating from Graphite to Prometheus
We have been using Prometheus in production for several years, as it offers
several benefits over Graphite
<https://prometheus.io/docs/introduction/comparison/>. Migrating MW off
Graphite keeps us on a supported, scalable metrics platform for more
effective, multidimensional metrics analysis and storage.
Prometheus provides more robust data labeling, storage, and query
capabilities. This initiative is fundamental in unifying our metrics,
enhancing monitoring, improving MW observability, and reducing tool
fragmentation. We’re moving from Graphite to Prometheus because of critical
limitations in our current setup.
Here’s what you need to know:
- Graphite is Dropping Data: Our existing Graphite hosts have recently
been saturated by too much metrics traffic (UDP). The 1G network interfaces
on these hosts are overloaded, causing packets (and therefore data) to be
lost. We don’t know precisely how much data is dropped, but it’s enough to
be noticeable. Instead of investing in fixing this old system, we’re
focusing on migrating to Prometheus, which is more reliable. The hardware
that powers the current system will also reach its end of life and be
retired by the end of Q4 of FY 2025/2026 (June 2026).
- Prometheus Works Differently: Graphite and Prometheus use different
internal methods for data processing, sampling, and calculation, which
means that numbers on both sides won't necessarily align or match 100%;
this is expected.
- More Accurate Metrics: Prometheus handles timing metrics and counters
differently from Graphite. You may see higher counts for certain metrics,
such as timing metrics; this is expected, and the statsite architecture
documentation
<https://github.com/statsite/statsite?tab=readme-ov-file#architecture>
has the specifics if you’d like them. See also the sketch after this list.
- Compare Patterns, Not Values: If you’re comparing numbers between
Graphite and Prometheus, focus on the patterns and trends rather than
exact values. Differences in how the two systems process data mean that
exact numbers won’t always match; however, the overall trend should be the
same.
Frequently Asked Questions
- Will you be migrating historical data from Graphite to Prometheus?
We do not plan to migrate data from Graphite to Prometheus; while it is
technically feasible, we have not seen enough documented requests to
justify it. Instead, we will run both systems in parallel for a while (one
year) to allow new historical data to build up in Prometheus before
Graphite is retired.
- What if I need data for longer than one year?
We can also provide Graphite data files for projects interested in longer
retention, and work on possible backfilling alternatives for specific
cases. Details are in T349521 <https://phabricator.wikimedia.org/T349521>.
- What if I need access to Graphite for longer?
We can provide a subset of the data on a VM running a read-only Graphite
service, available for a discretionary period beyond the year after the
hardware is retired, with limited support, since we are sunsetting the
technology. This workaround will be offered only for as long as the
current Graphite and OS deployments remain supported.
Related Links:
[1] Tech debt: sunsetting of Graphite
https://phabricator.wikimedia.org/T228380
[2] Wikitech:Prometheus https://wikitech.wikimedia.org/wiki/Prometheus
[3] List of dashboards w/Graphite queries
https://grafana.wikimedia.org/d/K6DEOo5Ik/grafana-graphite-datasource-utili…
[4] EPIC: Migrate in-use metrics and dashboards to statslib
https://phabricator.wikimedia.org/T350592
[5] Graphite Deprecation Roadmap
https://wikitech.wikimedia.org/wiki/Graphite/Deprecation_Roadmap
Thank you for reading! Be safe and happy.
Best,
Leo
*Leo Mata* (he/him)
Engineering Manager - Observability
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi everyone,
TL;DR: You might not have to wait very long on CI if your patchset has
already passed its tests in a prior run with the same exact
dependencies and CI configuration.
At our WMF Developer Experience offsite back in December 2024, the
Release Engineering team discussed a few different ways we might
reduce continuous integration wait times for MediaWiki contributors.
One hypothesis was that there were a significant number of cases where
automated testing was repeated against the exact same setup: the same
MediaWiki patchset, the same dependency versions, and the same CI
configuration.
This could happen during retest scenarios, for example, or for
backports to release branches. If correct, we might be able to safely
skip redundant test execution and save developers and deployers some
serious wait time.
So we ran an experiment to prove _or disprove_ our hypothesis. If we
were right, maybe we could skip execution during those scenarios. If
we were wrong, there would be no reason to complicate our CI jobs
further.
We first rolled out a partial "success caching" implementation in
Quibble[1] that computed a SHA256 digest to represent the uniqueness
of the test run and stuck it in a memcached instance at the end of a
successful run.
The digest was computed from:
1. The job name (a good and easy proxy for overall CI setup).
2. The `HEAD^{tree}` of each of the sorted Git repos under test
(core, extensions, skins, etc.).
(Note that `HEAD^{tree}` is used over simply `HEAD` because it more
accurately represents the working tree on disk after you check out a
commit, and the `HEAD` commit is almost never unique due to our gating
system creating temporary merge commits, etc.)
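For the curious, here is a minimal sketch of the digest computation in
Python. This is not Quibble's actual code: the job name and repo paths are
illustrative, and only the overall shape (SHA256 over the job name plus
the sorted repos' tree SHAs) follows the description above.

    # Minimal sketch of the success-cache digest; not Quibble's code.
    import hashlib
    import subprocess

    def tree_sha(repo_path):
        # HEAD^{tree} identifies the checked-out working tree itself;
        # HEAD would be a throwaway merge commit created by the gating
        # system and thus almost never repeats between runs.
        return subprocess.check_output(
            ['git', 'rev-parse', 'HEAD^{tree}'],
            cwd=repo_path, text=True).strip()

    def success_digest(job_name, repo_paths):
        h = hashlib.sha256()
        h.update(job_name.encode())
        for path in sorted(repo_paths):
            h.update(tree_sha(path).encode())
        return h.hexdigest()

    # Hypothetical job name and repos under test:
    digest = success_digest('quibble-vendor-mysql-php81',
                            ['core', 'extensions/Echo', 'skins/Vector'])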
Quibble would then check its cache on subsequent runs for an identical
digest/key, and report to the console if it found a match. We let this
run for a couple of weeks, scraping the Jenkins logs, and then did
some reporting on it.
Our hypothesis was proven correct. Redundant test execution was
occurring, more so in the gate-and-submit pipelines that test
mainline-bound changes, and even more often in pipelines testing
changes against release branches (weekly train branches and long-term
releases).
You can see my summary on the task for details[2], but the most
striking numbers were:
- 6.4% of test runs in gate-and-submit (merges to mainline branches)
were redundant
- 28.3% of test runs in gate-and-submit-wmf (merges to weekly release
branches) were redundant
- 163 _hours_ of CI wall time could have been saved had we skipped
execution of redundant tests
Naturally, with such encouraging numbers, we went ahead with the final
implementation. Quibble will now exit early and successfully (see the
sketch after this list) if:
1. The patch under test has not changed from a previously successful run.
2. The extensions/skins/vendor dependencies have not changed.
3. The setup for MediaWiki and its testsuite has not changed
(database type, vendor vs. composer usage, etc.).
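In rough terms, the early exit looks like the sketch below, reusing the
digest from the earlier sketch. The memcached client (pymemcache), key
prefix, cache address, and expiry are all assumptions for illustration,
not Quibble's actual choices.

    # Illustrative early-exit logic; pymemcache and all names assumed.
    import sys
    from pymemcache.client.base import Client

    cache = Client(('cache-host.example', 11211))  # hypothetical address
    key = 'quibble-success-' + digest  # digest from the earlier sketch

    if cache.get(key) is not None:
        print('Identical setup already passed; skipping test execution.')
        sys.exit(0)

    run_tests()  # placeholder for the actual test run
    cache.set(key, b'1', expire=7 * 24 * 3600)  # remember the success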
Hopefully this leads to some pleasant surprises for folks when waiting
on CI, especially during backport deployment windows.
Thanks to everyone in Release Engineering for collaborating on the
idea, and a special thanks to Antoine Musso for thinking through the
details with me and for reviewing my Python code!
Please reply or reach out on IRC (#wikimedia-releng) if you want to
know more about it.
To pleasant surprises!
Cheers,
Dan
[1]: https://www.mediawiki.org/wiki/Continuous_integration/Quibble
[2]: https://phabricator.wikimedia.org/T383243#10584349
--
Dan Duvall
Staff Software Engineer, Release Engineering
Wikimedia Foundation
Hello everyone,
The Spring 2025 MediaWiki Users & Developers Conference will be held about
two months from now, on May 14-16 (with an optional half-day workshop on
May 13), at the NASA offices in Sandusky, Ohio, USA:
https://meza.wiki/mwplus/MediaWiki_Users_and_Developers_Spring_2025_Confere…
Everyone is encouraged to attend, to meet and to discuss MediaWiki-related
topics spanning both the Wikimedia world and usage within companies,
organizations, etc.
As the program chair, I would like to specifically encourage you all to
consider giving a talk, on any topic related to MediaWiki usage or
development. Talks can be given remotely, although priority will be given
to in-person attendees if there's an overflow. To propose a talk for this
conference, just fill out this form:
https://meza.wiki/mwplus/Form:Proposed_talk
Thank you, and I hope to see you in Sandusky!
-Yaron
Dear all,
On Wednesday March 19th 2025, the SRE team will run a planned datacentre
switchover <https://wikitech.wikimedia.org/wiki/Switch_Datacenter>, moving
all wikis from codfw to eqiad. This is an important periodic test of our
tools and procedures, to ensure the wikis will continue to be available
even in the event of major technical issues in our primary home. It also
gives all our SRE and ops teams a chance to do maintenance and upgrades on
systems in codfw that normally run 24 hours a day.
The switchover process requires a brief read-only period for all
Foundation-hosted wikis, which will start on Wednesday March 19th 2025 @
14:00 UTC <https://zonestamp.toolforge.org/1742392800>, and will last for
just a few minutes while we execute the migration as efficiently as
possible. All our public and private wikis will be continuously available
for reading, as usual, but editing will be unavailable during the process.
Users will see a notification of the upcoming maintenance, and anyone still
editing will be asked to try again in a few minutes.
If you like, you can follow along on the day in the public
#wikimedia-operations channel on IRC. To report any issues, you can reach
us in #wikimedia-sre on IRC, or file a Phabricator ticket with the
#datacenter-switchover tag (
https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=Data…;
we'll be monitoring closely for reports of trouble during and after the
switchover. The switchover and its preparation will be tracked under
https://phabricator.wikimedia.org/T385155.
On behalf of the SRE team, please excuse the disruption, and we would like
to thank everyone in various departments who are involved in planning this
work. If you have any questions, please reply directly to this email, or
follow up on Phabricator or IRC.
Kind regards,
Hugh
Hello all!
The public WDQS Split Graph endpoints have been available for ~6 months, it
is time to have a look at what has been happening and at the next steps.
We have not seen strong adoption of the new endpoints (~20 req/min for
query-scholarly [1]). However, we have identified the origin of almost 90%
of the current requests that would require migration to the split
endpoints. The large majority (~80%) are generated by a tool that is
unfinished and has been abandoned by its author. Those queries are already
broken or have no value, and will never be migrated. Unsurprisingly,
Scholia is a major user of the scholarly subgraph and has not migrated yet.
While we want to move forward, we also want to limit disruption and give
more time to the projects that need it. To ease the transition, we’ve
created a new endpoint (query-legacy-full.wikidata.org) which contains the
full Wikidata graph, but is limited in terms of performance and
availability [2]. This new endpoint can be used in place of the current
query.wikidata.org by the few projects that need the additional migration
time. It will be available until December 2025.
The next big step is to drop support for the full Wikidata graph on
query.wikidata.org [3]. This should happen around April 10. After that
step, requests to query.wikidata.org that require the full graph will fail
or return invalid results unless they are rewritten to use SPARQL
federation [4]. You can ask for help rewriting your queries [5].
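As an illustration of such a rewrite, here is a sketch in Python with the
requests library, following the graph split documentation [4]: the query
reaches from the main subgraph endpoint out to the scholarly subgraph via
the SPARQL SERVICE keyword. Treat the exact query, endpoints, and
User-Agent as examples to adapt, not a definitive recipe.

    # Hypothetical federated query against the split graph; see [4].
    import requests

    # Scholarly-article triples live in the scholarly subgraph, so the
    # query reaches over to it with SPARQL federation (SERVICE).
    query = """
    SELECT ?article ?title WHERE {
      SERVICE <https://query-scholarly.wikidata.org/sparql> {
        ?article wdt:P31 wd:Q13442814 ;  # instance of: scholarly article
                 wdt:P1476 ?title .
      }
    }
    LIMIT 10
    """

    resp = requests.get(
        'https://query-main.wikidata.org/sparql',
        params={'query': query, 'format': 'json'},
        headers={'User-Agent': 'graph-split-example/0.1 (you@example.org)'},
    )
    resp.raise_for_status()
    for row in resp.json()['results']['bindings']:
        print(row['article']['value'], row['title']['value'])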
In related news, Peter [6] has been exploring the performance of various
alternative RDF backends [7]. This is going to be invaluable when we work
on replacing Blazegraph!
Have fun!
Guillaume
[1]
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&re…
[2] https://phabricator.wikimedia.org/T384422
[3] https://phabricator.wikimedia.org/T388134
[4]
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_spli…
[5] https://www.wikidata.org/wiki/Wikidata:Request_a_query
[6] https://www.wikidata.org/wiki/User:Peter_F._Patel-Schneider
[7] https://www.wikidata.org/wiki/Wikidata:Scaling_Wikidata/Benchmarking
--
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
Deb - thanks for explaining what happened. We all make mistakes (except for
this Dennis guy, I suppose!), and it's at least good to know that Google
have not suddenly changed their minds about the Wikimedia Foundation.
However, perhaps something can be salvaged from this. There have been
various organization-specific mentorship programs that have been directly
inspired by the Google Summer of Code; here are some of them:
https://mentorship.kde.org/sok/
https://www.x.org/wiki/XorgEVoC/
https://www.summerofbitcoin.org/
What about having a "Wikimedia Summer of Code" or some such this summer? It
could more or less match what GSoC does, with the same timeframe(s), same
country-specific stipends, etc. Like these other org-specific programs, it
would piggyback on the GSoC concept, so that potential students will
already know what to expect.
Wikimedia has a big advantage over most other organizations that might wish
to do similar things, in that it already has a funding mechanism that could
be repurposed for this: the Rapid Grants program, whose $5,000 limit is
larger than all but the largest possible GSoC stipends. So in a sense,
nothing (as far as I know) would need to change officially: it would just
be a matter of putting up a wiki page telling potential mentors that they
need to apply via a Rapid Grant, as opposed to the GSoC website. (Of
course, it would be good for anyone handling the Rapid Grant applications
to know to expect a spate of technology-related ones!)
There have already been some great Wikimedia project ideas this year, and
(as with every year) there are students who are excited to specifically
work on Wikipedia- and Wikimedia-related projects. It would be a shame to
give up on these projects, and also potentially to lose momentum for
upcoming years, if the funding potential is there.
And yes, I'm aware of Outreachy, which as far as I know is still happening,
but it isn't open to most of the people who would potentially be applying
for GSoC, so I don't see it as a real substitute. Others may, however.
Any thoughts?
-Yaron
--
WikiWorks · MediaWiki Consulting · http://wikiworks.com
Hello all!
The Search Platform Team usually holds an open meeting on the first
Wednesday of each month. Come talk to us about anything related to
Wikimedia search, Wikidata Query Service (WDQS), Wikimedia Commons Query
Service (WCQS), etc.!
Feel free to add your items to the Etherpad Agenda for the next meeting.
Details for our next meeting:
Date: Wednesday, March 5, 2025
Time: 16:00-17:00 UTC / 08:00 PST / 11:00 EST / 17:00 CET
Etherpad: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours
Google Meet link: https://meet.google.com/vgj-bbeb-uyi
Join by phone: https://tel.meet/vgj-bbeb-uyi?pin=8118110806927
Have fun and see you soon!
Guillaume
--
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi everyone,
Summary
After a long series of consultations with various teams in the Wikimedia
Foundation, I find myself happy to inform you of the sunset of an API from
the current Wikimedia offerings.
That would be *recommendation-api*, the one powering the
*https://<project_domain>/api/rest_v1/#/Recommendation* family of
endpoints, e.g. https://commons.wikimedia.org/api/rest_v1/#/ or
https://en.wikipedia.org/api/rest_v1/#/Recommendation.
The codebase of that API resides in
*https://gerrit.wikimedia.org/g/mediawiki/services/recommendation-api
<https://gerrit.wikimedia.org/g/mediawiki/services/recommendation-api>*.
The removal date will be 2025-03-31.
There is some more information below as to why, what and how as well as an
estimation of the impact this is expected to have (minimal to zero). It is
also my intention to proceed with a diff <https://diff.wikimedia.org/> post
explaining our process in more detail.
Intro
Recommendation-api was used solely by the official Wikipedia Android
application. Given that the application has recently
<https://phabricator.wikimedia.org/T373611> moved away from using this
API, the SRE team wanted to remove this service from production to focus
on more impactful services. The challenge was to turn off this API
responsibly, so that we neither significantly impact users who have not
yet upgraded their apps, nor spend SRE effort supporting a service we are
moving away from for longer than we need to. A complication was that the
service was no longer owned by anyone, meaning maintenance (Node.js and
Operating System upgrades) had to be carried out by people who were not at
all acquainted with the code base.
Process
To achieve the above goals, a process was jump started by a group of
Principal Engineers in the Foundation (Jon Robson, Moriel Schottlender and
yours truly) collecting data and feedback regarding the remaining usage of
the API as well as the estimated amount of effort that would be needed to
continue maintaining the service versus sunsetting it. Since October, we
have spoken to various stakeholders in the Foundation to figure out the
best possible plan, going through a variety of possible paths. We ended up
recommending that the service be sunset on 2025-03-31, a recommendation
that was accepted.
Impact
The service is accessed only by users of the Wikipedia Android application.
Recent versions of the app no longer rely on it, so there will be zero (0)
impact on users of those versions. However, users running application
versions more than six months old at the time of removal (2025-03-31) will
see reduced functionality: the *Suggested Edits* part of the application
will no longer function. The rest of the application will continue to work
as usual. We encourage users of versions older than *r/2.7.50504-r-2024-10-01
<https://github.com/wikimedia/apps-android-wikipedia/releases/tag/r%2F2.7.50…>*
to update.
We wanted to make sure we don't break 3rd party users without giving them a
heads up. Going through access logs and metrics, we identified no valid 3rd
party users. Furthermore, we intend to review traffic/errors to that API 1
month past the cutoff date and evaluate the effectiveness of our solution.
Note
This API should not be confused with another, similarly named
recommendation-api endpoint
<https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translat…>.
The latter is powered by a different codebase and, conflicting naming
aside, has no other relation to the former or to this sunsetting.
Regards,
--
Alexandros Kosiaris
Principal Site Reliability Engineer
Wikimedia Foundation