Topic on Help talk:CirrusSearch

REST API to search?

2001:4898:80E8:C:0:0:0:1E4 (talkcontribs)

Elasticsearch supports a REST API endpoint (typically something like /_search?q=search for something). However, I can't seem to find any endpoints that MediaWiki or Wikipedia expose that allow searching via a REST API. Is this capability part of the CirrusSearch extension? I was hoping to try it out and analyze the results to see whether they would suit my needs.

Thanks!

2601:600:8300:1B70:50B8:D76E:1B82:152A (talkcontribs)

I should also note that the opensearch API won't work because it doesn't return a relevancy score at all. Even though the OpenSearch format supports a relevancy score, there is no extension for it and it isn't built into the opensearch implementation that is part of MW core.

197.218.88.234 (talkcontribs)
2601:600:8300:1B70:50B8:D76E:1B82:152A (talkcontribs)

Okay, that helps; I'm able to execute the query now. However, it looks like the score data was removed. Is there any way to return a weight, relevancy, or score anymore? Is it possible to query Elasticsearch directly for an MW site?

197.218.88.234 (talkcontribs)

Err, the page has more about the internal information; perhaps reading it more thoroughly would help.

EBernhardson (WMF) (talkcontribs)

There is an internal, undocumented query string argument, cirrusDumpResult, which you can append. This is a debug API. You are free to use it, but I can't promise it will always work, and it has no compatibility guarantees with respect to format and such:

https://www.mediawiki.org/w/api.php?action=query&list=search&srsearch=test&cirrusDumpResult
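For anyone scripting this, here is a minimal sketch of how the URL above could be assembled for arbitrary search terms. The helper name build_dump_url is made up for illustration, and since cirrusDumpResult is an unsupported debug flag, nothing here assumes anything about the format of the response; it only builds the URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this thread; swap in your own wiki.
API_ENDPOINT = "https://www.mediawiki.org/w/api.php"


def build_dump_url(search_terms, endpoint=API_ENDPOINT):
    """Build a search URL with the undocumented cirrusDumpResult flag.

    build_dump_url is a hypothetical helper name, not part of MediaWiki.
    """
    params = {"action": "query", "list": "search", "srsearch": search_terms}
    # The flag is appended without a value, as in the example URL.
    return f"{endpoint}?{urlencode(params)}&cirrusDumpResult"


if __name__ == "__main__":
    print(build_dump_url("test"))
```

Passing a different endpoint argument should make the same helper usable against an on-premises wiki, assuming it runs CirrusSearch.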

There are also plans working their way through to offer Elasticsearch within the Wikimedia Labs environment, with a full copy of the production indices and full Elasticsearch query access. If everything goes to plan and budgets are approved, this might go live sometime in the first half of 2018.

2601:600:8300:1B70:6921:732F:B07:5B2A (talkcontribs)

This is exactly what I was looking for. The other cirrus dump functions returned too much data and didn't return the score.

Will this be added to the Cirrus extension or made available in some way as part of MW? Even bringing back the score value, which was deprecated for some reason, would do the trick. Ultimately I need this for an on-premises installation in order to aggregate with an internal search engine that pulls data from multiple sources. The relevancy score is required in order to merge the results consistently and accurately.

Thank you!

EBernhardson (WMF) (talkcontribs)

We could look into bringing the score field back. I wasn't around when it was deprecated, so I can't say exactly why it was removed. According to the git history, most search backends used by MediaWiki don't have a score that can be exposed, and those that do use very arbitrary scores. For example, whenever we change how queries are built the scores change, but those changes have no meaning beyond how results for the same query compare to one another.

For example, the best score for https://www.mediawiki.org/wiki/?search=developer+summit&fulltext=1&cirrusDumpResult is 638, but the best score for https://www.mediawiki.org/wiki/?search=developer+summit+mediawiki&fulltext=1&cirrusDumpResult is 358. That certainly doesn't mean the top result for the first query is twice as good as the top result for the second query; it's just arbitrary. As another example, we changed how we build search queries about 6 months ago and the top score for the first query dropped from 1057 to the current 638. The top result (in this case) is exactly the same, but the score dropped. This isn't an indication that the result is worse; it's just an arbitrary number.
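Since the raw numbers only mean something relative to other results of the same query, one possible way to use them when merging with another engine is to rescale each result list on its own before combining. The following is only a sketch under that assumption: min-max scaling is one choice among several, normalize_scores is a hypothetical helper, and the titles and the lower scores in the sample are made up for illustration (only the top scores 638 and 1057 come from the examples above).

```python
def normalize_scores(results):
    """Rescale (title, score) pairs from ONE query into the range [0, 1].

    Hypothetical helper: min-max scaling is an assumption, not something
    CirrusSearch or Elasticsearch does for you.
    """
    if not results:
        return []
    scores = [score for _, score in results]
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All results scored the same; treat them as equally relevant.
        return [(title, 1.0) for title, _ in results]
    span = highest - lowest
    return [(title, (score - lowest) / span) for title, score in results]


if __name__ == "__main__":
    # Illustrative data: same hypothetical pages, before and after a
    # change in how queries are built.
    before = [("Page A", 1057.0), ("Page B", 400.0)]
    after = [("Page A", 638.0), ("Page B", 250.0)]
    print(normalize_scores(before)[0])
    print(normalize_scores(after)[0])
```

With this scaling the top result of each query always maps to 1.0, so a change in the raw top score (such as 1057 to 638) does not by itself change how that result ranks after merging. It does not make results from different queries or engines truly comparable.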

2001:4898:80E8:4:0:0:0:4DA (talkcontribs)

Thanks. Yeah, an arbitrary score isn't great, as I could theoretically add that after the fact :). Is the score from the unofficial cirrusDumpResult call provided by Elasticsearch, though? If it is, I would expect that it isn't arbitrary. Access to Elasticsearch's scoring results would be ideal. Any chance that unofficial call could get added to the public API and supported?

Thanks.
