Talk:Wikimedia Search Platform

Transparency of the Wikimedia search algorithms

Prototyperspective (talkcontribs)

Is there any information anywhere on the search algorithms of Wikimedia search?

I'm interested in how the search algorithm works and would like to suggest some changes, such as those under "Proposed solution" in this proposal from the recent Wikimedia Commons technical needs survey, so that more relevant, useful, up-to-date, high-quality media appear further up. Another reason is that under "Categories and Pages" on WMC, the search seems to show a great many galleries (which are usually far less useful) but not the category, even when the search term matches the category name 1:1, so I'd probably also like to propose a change to that.

Then I'd also like to propose that if the user searches for something that is also a category name, or if the search detects that the query matches the same topic (e.g. searching for "animal" when the category "Animals" exists), a hint link to the category shows up in MediaSearch. Maybe this would be a good place to propose it, if not directly in Phabricator or as another Wishlist item.

Is this the right place to ask or is it here or somewhere else? I thought the search algorithms / search engine tech were open but maybe that is wrong.

EBernhardson (WMF) (talkcontribs)

Everything is open source. If you are looking for source code related to search on Commons, I would suggest looking at the CirrusSearch and MediaSearch extensions. Some information about how media search works is found at Help:MediaSearch. You can download the data that is found in the search indexes; for commonswiki-file the most recent dump is 2024-09-09. I don't know for sure whether that is a complete dump: our live search index for files on commonswiki is about 1.4TB while that dump is only 100GB, so it seems a little suspect. The live index of course has various data structures and duplicates of the data analyzed in different ways, but the ratio feels a little off (there is a separate dumps 2.0 project which should allow us to make these more reliable). Some basic information about how to load that into a local search engine is found in the help text of the scripts that create those dumps. Alternatively, you can issue queries directly to a replica of the prod search indices from Wikimedia Cloud. You can also simply ask MediaWiki for the current representation of a page in the search engine with action=cirrusdump, or via the MediaWiki API with action=query&prop=cirrusdoc.
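
As a rough illustration of the last two options, here is a minimal Python sketch that only builds the URLs and does not send any request. The host `commons.wikimedia.org`, the `/w/index.php` and `/w/api.php` paths, the page title `File:Example.jpg`, and both function names are assumptions made for this example; only `action=cirrusdump` and `action=query&prop=cirrusdoc` come from the comment above.

```python
from urllib.parse import urlencode


def cirrusdoc_api_url(title, host="commons.wikimedia.org"):
    """Build an API URL asking for a page's search-index document.

    The host and title are illustrative assumptions, not fixed values.
    """
    params = {
        "action": "query",
        "prop": "cirrusdoc",
        "titles": title,
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def cirrusdump_url(title, host="commons.wikimedia.org"):
    """Build the page-view URL variant that uses action=cirrusdump."""
    params = {"title": title, "action": "cirrusdump"}
    return f"https://{host}/w/index.php?{urlencode(params)}"


if __name__ == "__main__":
    print(cirrusdoc_api_url("File:Example.jpg"))
    print(cirrusdump_url("File:Example.jpg"))
```

Opening either URL in a browser should show the JSON document the search engine holds for that page, assuming the page exists and is indexed.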

You can see how a particular search query was transformed into what we send to the search engine by appending &cirrusDumpQuery to most URLs that execute a search request. Similarly, you can see the backend response with &cirrusDumpResult. The response can be augmented with detailed scoring information by appending &cirrusDumpResult&cirrusExplain=verbose. cirrusExplain can be set to `raw`, `pretty`, or `verbose`, although the `pretty` version currently excludes far too much information from MediaSearch to be useful.
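
A small Python sketch of how these debugging parameters might be appended to an existing search URL. The helper name `with_cirrus_debug` and the example base URL are hypothetical; the parameter names and the three `cirrusExplain` values are the ones listed above.

```python
EXPLAIN_FORMATS = ("raw", "pretty", "verbose")


def with_cirrus_debug(search_url, dump="query", explain=None):
    """Append CirrusSearch debugging parameters to a search URL.

    dump="query"  -> adds cirrusDumpQuery  (the query sent to the backend)
    dump="result" -> adds cirrusDumpResult (the raw backend response)
    explain       -> optional cirrusExplain format, used with dump="result"
    """
    if dump not in ("query", "result"):
        raise ValueError("dump must be 'query' or 'result'")
    if explain is not None:
        if explain not in EXPLAIN_FORMATS:
            raise ValueError(f"explain must be one of {EXPLAIN_FORMATS}")
        if dump != "result":
            raise ValueError("explain is only meaningful with dump='result'")

    flags = ["cirrusDumpQuery" if dump == "query" else "cirrusDumpResult"]
    if explain is not None:
        flags.append(f"cirrusExplain={explain}")

    # Use '&' if the URL already carries a query string, otherwise start one.
    separator = "&" if "?" in search_url else "?"
    return search_url + separator + "&".join(flags)


if __name__ == "__main__":
    # Hypothetical search URL, used only for illustration.
    base = "https://commons.wikimedia.org/w/index.php?search=animal"
    print(with_cirrus_debug(base))
    print(with_cirrus_debug(base, dump="result", explain="verbose"))
```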

Prototyperspective (talkcontribs)

Thank you for all this helpful info! I will propose things at Help:MediaSearch. I don't think the dumps contain what I'm looking for, which is the search engine's algorithm for ranking and for deciding which results it shows. For example, I think it should make use of data on whether and where a file is used (e.g. in the Animal article on ENWP), to name one example from the linked proposal. The link for detailed scoring information seems quite useful. I would have suggested making it better known by putting it on e.g. the CirrusSearch help page, but apparently it's already there, albeit only in a section called API without a subsection, so it's hard to find when looking for info on the search algorithm. It seems like a good implementation of explainable AI, albeit still quite hard to make sense of. I think more detailed info and examples would be good to add to Help:MediaSearch or to a new page dedicated specifically to the search algorithm.

Non-free images in search results

Sdkb (talkcontribs)

@MPham (WMF), there was recently some discussion on en-WP about whether or not to allow non-free images in search results. Many editors were hesitant because of concern that it might not be valid under fair use rules. I recently got input from WMF Legal that it is extremely likely that this would be acceptable fair use, which makes me inclined to reopen the discussion. Before I do so, I wanted to check with you, as the product lead for search, to see whether you have any thoughts about this topic that we should know going into the discussion. Cheers,

Sdkb (talkcontribs)

@MPham (WMF), I plan to launch the discussion later today. If you're intending to get back to me but just haven't had a chance, feel free to lmk and I can postpone. Also pinging @CBogen (WMF) as I'm not 100% sure who the best point person is. Cheers,

MPham (WMF) (talkcontribs)

Thanks for reaching out. I was out of office and am only getting to these messages now. I saw that you moved ahead with reopening the discussion, so thanks for moving this along! For future reference, @Sannita (WMF) has been a good point person for funneling relevant information to the right people.

Sdkb (talkcontribs)

Thanks, @MPham (WMF)! As I mentioned in the comment where I pinged you, I'm curious if there is any research or documentation about why the team added images overall to search results in the new search format. I think it'd be helpful context to link to that, as some editors seem to be describing them as merely "decorative", which to my understanding misses the fact that good images can help make search results easier to parse, making them a navigational aid. You're also welcome to comment directly in the discussion there.

CBogen (WMF) (talkcontribs)

Hi @Sdkb, thanks for this question! This is the project page for the Search Improvements project; the thumbnails were part of the Round 1 mockups at the bottom of the page. This is the Phabricator ticket that describes the thumbnail work specifically. It says that "thumbnails can provide a visual anchor on a text heavy page and assist in locating an article". We did user research that showed that images were very important to users when navigating search results. Like you said, they make search results easier to parse and offer a navigational aid. I hope this helps!

defaults for *insource: searching

SashiRolls (talkcontribs)

I've noticed that insource:"lemonde.fr" does not function the same way on en.wp as it does on fr.wp. By default on en.wp, all namespaces are searched (meaning that finding out how many entries link to lemonde.fr requires searching twice, or going to the advanced search before searching), whereas on fr.wp namespace 0 is the default.

Is there a reason for this difference? FWIW, just searching in mainspace seems like a much more logical default behavior.
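
Whatever the default turns out to be, one way to sidestep it is to state the namespace explicitly in the request. The Python sketch below builds a MediaWiki search API URL restricted to namespace 0; it assumes the standard `list=search` module with its `srsearch`, `srnamespace`, `srinfo` and `srlimit` parameters, and the function name and default host are made up for illustration. It only constructs the URL and does not send the request.

```python
from urllib.parse import urlencode


def insource_search_url(needle, host="en.wikipedia.org", namespace=0):
    """Build a search API URL for insource:"<needle>" in one namespace.

    namespace=0 is the main (article) namespace. The host is an
    illustrative default, not something prescribed by the thread.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'insource:"{needle}"',
        "srnamespace": namespace,  # explicit, so the wiki default is irrelevant
        "srinfo": "totalhits",     # ask for the hit count
        "srlimit": 1,              # only the count is of interest here
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    print(insource_search_url("lemonde.fr"))
    print(insource_search_url("lemonde.fr", host="fr.wikipedia.org"))
```

With the namespace spelled out, the same request can be pointed at en.wp and fr.wp and the counts compared directly, without a second search or a trip through the advanced search form.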

Office hours?

Ainali (talkcontribs)

Could you add your office hours to the Communications section?

DTankersley (WMF) (talkcontribs)

Done! Thanks for the suggestion!

Reply to "Office hours?"