Topic on Talk:Wikimedia Search Platform

Transparency of the Wikimedia search algorithms

Prototyperspective

Is there any information anywhere on the search algorithms of Wikimedia search?

I'm interested in how the search algorithm works and would like to suggest some changes, such as those under "Proposed solution" in this proposal from the recent Wikimedia Commons technical needs survey, so that more relevant, useful, up-to-date, high-quality media rank higher. Another reason: under "Categories and Pages" on Wikimedia Commons (WMC), search seems to show a lot of galleries (which are usually far less useful) and not the category, even when the search term matches the category name exactly, so I'd probably also like to propose a change to that.

I'd also like to propose that if the user searches for something that is also a category name, or that closely matches an existing category on the same topic (e.g. searching for "animal" when the category "Animals" exists), a hint link to the category shows up in MediaSearch. Maybe here would be a good place to propose this, if not directly in Phabricator or as another Wishlist item.

Is this the right place to ask, or should I ask here or somewhere else? I thought the search algorithms and search engine technology were open, but maybe that is wrong.

EBernhardson (WMF)

Everything is open source. If you are looking for source code related to search on Commons, I would suggest looking at the CirrusSearch and MediaSearch extensions. Some information about how media search works is found at Help:MediaSearch.

You can download the data that is found in the search indexes; for commonswiki-file the most recent dump is 2024-09-09. I don't know for sure whether that is a complete dump: our live search index for files on commonswiki is about 1.4 TB while that dump is only 100 GB, so it seems a little suspect. The live index of course has various data structures and duplicates of the data analyzed in different ways, but the ratio feels a little off. (There is a separate dumps 2.0 project which should allow us to make these more reliable.) Some basic information about how to load that into a local search engine is found in the help text of the scripts that create those dumps.

Alternatively, you can issue queries directly to a replica of the prod search indices from Wikimedia Cloud. You can also simply ask MediaWiki for the current representation of a page in the search engine with action=cirrusdump, or via the MediaWiki API with action=query&prop=cirrusdoc.
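As a rough illustration of those last two options, here is a minimal Python sketch that only builds the request URLs. The Commons entry points (`/w/index.php`, `/w/api.php`), the `format=json` parameter, the helper names and the example title `File:Example.jpg` are assumptions for illustration and are not taken from the reply above.

```python
from urllib.parse import urlencode

# Assumption: the standard MediaWiki entry points on Wikimedia Commons.
INDEX_PHP = "https://commons.wikimedia.org/w/index.php"
API_PHP = "https://commons.wikimedia.org/w/api.php"


def cirrusdump_url(title: str) -> str:
    """URL asking MediaWiki for a page's search document via action=cirrusdump."""
    return INDEX_PHP + "?" + urlencode({"title": title, "action": "cirrusdump"})


def cirrusdoc_url(title: str) -> str:
    """URL asking the MediaWiki API for the same document via prop=cirrusdoc."""
    params = {
        "action": "query",
        "prop": "cirrusdoc",
        "titles": title,
        "format": "json",  # assumption: JSON output is wanted
    }
    return API_PHP + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical example title; substitute any file page.
    print(cirrusdump_url("File:Example.jpg"))
    print(cirrusdoc_url("File:Example.jpg"))
```

Opening either URL in a browser, or fetching it with any HTTP client, should return the document as the search engine currently sees it.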

You can see how a particular search query was transformed into what we send to the search engine by appending &cirrusDumpQuery to most requests that execute a search. Similarly, you can see the backend response with &cirrusDumpResult. The response can be augmented with detailed scoring information by appending &cirrusDumpResult&cirrusExplain=verbose. cirrusExplain can be set to `raw`, `pretty`, or `verbose`, although the `pretty` version currently excludes far too much MediaSearch information to be useful.
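A small Python sketch of how these debugging parameters could be appended to an existing search URL. The helper names and the example search URL are hypothetical; only the parameter names and the three cirrusExplain values come from the text above.

```python
# Values of cirrusExplain named in the discussion.
EXPLAIN_MODES = ("raw", "pretty", "verbose")


def _append(url: str, extra: str) -> str:
    """Append extra query-string text, using '?' or '&' as needed."""
    return url + ("&" if "?" in url else "?") + extra


def dump_query_url(search_url: str) -> str:
    """Show the query that would be sent to the search engine."""
    return _append(search_url, "cirrusDumpQuery")


def dump_result_url(search_url: str, explain=None) -> str:
    """Show the backend response, optionally with scoring details."""
    if explain is None:
        return _append(search_url, "cirrusDumpResult")
    if explain not in EXPLAIN_MODES:
        raise ValueError(f"cirrusExplain must be one of {EXPLAIN_MODES}")
    return _append(search_url, "cirrusDumpResult&cirrusExplain=" + explain)


if __name__ == "__main__":
    # Assumption: an ordinary Commons search URL used as the starting point.
    url = "https://commons.wikimedia.org/w/index.php?search=animal"
    print(dump_query_url(url))
    print(dump_result_url(url, explain="verbose"))
```

The sketch does not handle URLs that contain a `#fragment`; for those, the parameters would need to go before the fragment.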

Prototyperspective

Thank you for all this helpful info! I will propose things at Help:MediaSearch. I don't think the dumps contain what I'm looking for, which is the search engine's algorithm for ranking and for deciding which results it shows. For example, I think it should make use of data on whether and where a file is used (e.g. in the Animal article on ENWP), to name one example from the linked proposal.

The link for detailed scoring information seems quite useful. I would have suggested making it better known by putting it onto e.g. the CirrusSearch help page, but apparently it's already there, albeit only in a section called API without a subsection, so it's hard to find when looking for info on the search algorithm. It seems like a good implementation of explainable AI, albeit still quite hard to make sense of. I think more detailed info and examples would be good to add to Help:MediaSearch or to a new page dedicated specifically to the search algorithm.