Help talk:CirrusSearch/Flow

This page is an archive. Do not add new topics here.

Please ask new questions at Help talk:CirrusSearch instead.

Automatik (talkcontribs)

Hi. Is there any way to exclude redirects from search results? I want, e.g., to find entries that are not redirections and that contain some character in their title. How to do that?

TJones (WMF) (talkcontribs)

Unfortunately, there's no easy way to exclude redirects from search results.

However, depending on the scope of the task you are trying to complete and your technical ability, you could try to use the Search API to do what you need semi-automatically.

This query will give you back the top results with "English Wikipedia" in the title or a redirect:

https://en.wikipedia.org/w/api.php?action=query&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22

The default format is JSON converted to HTML so it's easy to read for a human, but hard to read for a computer. If you only have a small number of queries to deal with, and only need a limited number of results from each (up to 500—set by srlimit), you might be able to get what you need by getting these results and looking through the titles by hand.

If you need a computer to process the results for you, say, because you have many queries, you can get real JSON by adding &format=json:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srlimit=50&srsearch=intitle:%22english%20wikipedia%22

On a Unix-like command line (I'm working in Terminal on OS X) you can use curl to fetch the JSON, python to make it pretty, and grep to pull out the titles, and grep again to find the specific ones you want:

curl -s "https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=50&srsearch=intitle:%22english%20wikipedia%22" | python -m json.tool | grep "\"title\":" | grep -i "english wikipedia"

Note that the API URL is URL-encoded (spaces become %20, quotes become %22, etc.).

Results:

   "title": "English Wikipedia",
   "title": "Simple English Wikipedia",
   "title": "Notability in the English Wikipedia",

The results aren't pretty, and in this case there are only 8 results total and 3 that are not redirects. If you are searching for specific characters, you may need to do some more pre-processing before the final grep. (If you are searching for "e", everything will match, because "title" has an "e" in it, for example.) If you need to go through more than the top 500 results, you'll have to figure out how to get the API to give you additional results, etc.

It's not pretty and it's not easy, but it's a start.
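A minimal sketch of the same filtering step in Python, for anyone who would rather parse the JSON than grep it. It assumes the response has the usual `query.search[].title` shape returned by `list=search`; the function name and the sample data are made up for illustration. Because it only compares the title values, a search for a single character such as "e" no longer matches the word "title" itself.

```python
import json


def titles_containing(api_json: str, needle: str) -> list[str]:
    """Return result titles that contain `needle` (case-insensitive).

    Titles that do not contain the needle were presumably matched
    through a redirect, so they are dropped.
    """
    data = json.loads(api_json)
    hits = data.get("query", {}).get("search", [])
    needle = needle.casefold()
    return [hit["title"] for hit in hits if needle in hit["title"].casefold()]


# Illustrative sample only, not real API output.
sample = json.dumps({"query": {"search": [
    {"title": "English Wikipedia"},
    {"title": "Wikipedia"},  # stands in for a hit found via a redirect
    {"title": "Simple English Wikipedia"},
]}})

print(titles_containing(sample, "english wikipedia"))
```

The JSON text could come from the curl command above, saved to a file or piped in on standard input.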

Automatik (talkcontribs)

Thanks for this answer. It is clearly not easy or convenient, and it is pretty similar to running the query manually (then filtering visually with CTRL+F "(redirection" and picking only the results without the "(redirection" text highlighted). Developers should add an option "do not follow redirects", to avoid tedious work for all users using this functionality. I guess it is not so difficult, as this option already exists in some use cases (e.g. when displaying a page with &redirect=no).

TJones (WMF) (talkcontribs)

It is very similar to the ctrl-F solution, just more automatic! For me, somewhere around 25 to 50 queries it would be faster (or at least less boring and thus less error-prone) to go for a hacked-together semi-automatic solution.

Adding a title-only index is probably not a trivial change to make from our current state. We have a search index for intitle:, with the text from titles and redirects in it. There's no differentiation between the title and redirect text once it's in the index. I think we'd have to create another field that was title-only (and maybe a redirect-only field would be equally useful—which together would be bigger than the size of the current title index).

It's not clear to me how many people would need such an index. I'm really curious what your use case is—both to get a sense of how useful title-only search would be, and to see if there's a better clever way to get what you need.

You could open a Phabricator ticket and ask for this feature, but that certainly doesn't guarantee that it would be implemented any time soon.

Automatik (talkcontribs)

On the French Wiktionary, we use the typographic apostrophe in titles, instead of the typewriter/vertical apostrophe. I was looking for titles that use the vertical apostrophe, without being a redirection.

Moreover, I am using Windows, which is less convenient than a Unix-like command line when it comes to command-line tools (unclear documentation, no unified way to run commands, etc.).

TJones (WMF) (talkcontribs)

Ah.. that's a sensible use case. No other obvious solution comes to mind, but I'll think about it more and if I think of anything useful I'll let you know.

If you are already familiar with Unix-like commands (or want to learn), but just don't have them available because you are on Windows, you could look at Cygwin (English WP, French WP, website)—it's not an emulator or virtual machine, it just gives you versions of standard Unix commands that work on Windows. I used it about 15 years ago when I had a Windows machine for my job. I found it very useful back then, but haven't used it since.

Automatik (talkcontribs)

Thanks for the advice. However, the bash terminal from Cygwin does not work (and the solution suggested in https://superuser.com/questions/1172759/cygwin-error-failed-to-run-bin-bash-no-such-file-or-directory does not work either). Moreover, now that I have installed the program, I cannot uninstall it anymore (at least, not easily), as it does not appear in "Programs and features", and when I click "Uninstall" from a right click on the program icon, it just opens the "Programs and features" window anyway.

TJones (WMF) (talkcontribs)

Oh no! I should have known better than to suggest software I haven't used in so long—but it was so nice back in the day. I haven't used Windows in almost 15 years either, so I don't really have any helpful advice. Crap, I'm sorry!

Automatik (talkcontribs)

No worries: I "uninstalled" it by removing its folders, and re-installed it using another repository, and now it works! Thanks for the tip then. To look for more than 500 results, I added the &sroffset=500 parameter (then 1000, 1500,... until no results are found)
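The paging loop described here (raise sroffset by the page size until nothing comes back) could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the helper names are hypothetical, the User-Agent string is a placeholder, and the loop simply stops at the first response without results.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def build_url(query, offset, limit=500):
    """Build a list=search URL for one page of results."""
    params = {
        "action": "query",
        "list": "search",
        "format": "json",
        "srlimit": limit,
        "sroffset": offset,
        "srsearch": query,
    }
    return API + "?" + urlencode(params)


def fetch_page(url):
    """Fetch one page of results and decode the JSON body."""
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(url, headers={"User-Agent": "search-paging-sketch/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def all_titles(query, fetch=fetch_page, limit=500):
    """Yield result titles page by page until a page has no results."""
    offset = 0
    while True:
        hits = fetch(build_url(query, offset, limit)).get("query", {}).get("search", [])
        if not hits:
            return
        for hit in hits:
            yield hit["title"]
        offset += limit


# Example use (performs network requests):
# for title in all_titles('intitle:"english wikipedia"'):
#     print(title)
```

The `fetch` parameter is only there so the loop can be tried with canned responses before pointing it at a real wiki.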

Speravir (talkcontribs)

Oh, slightly funny: Unaware of this thread I recently opened a ticket on Phabricator: phab:T204089.

197.235.98.211 (talkcontribs)

It seems that it used to be possible to filter redirects at some point, and this was removed https://phabricator.wikimedia.org/T5174, https://phabricator.wikimedia.org/rMW52e699441edf2958701cea692a5dc3243ec3b064.

It seems developers are confused and going back and forth between removing and readding redirects to search. As the old saying goes, "clients don't know what they want". Anyway, a more sensible approach would be a degree of faceting, where it returns all results but aggregates similar properties, e.g. many pages will be in the same category, or many pages will be redirects, disambiguations, poor quality stubs, etc...

It is probably simpler to resolve this using the API, since it already has options for redirect titles. There are also at most about 10000 results, so it would probably be less challenging to filter through those. Anyway, if the search results aren't too many it is easier to include redirect title in API search results and use your favorite replace tool to clean up all those that don't match, e.g. https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=shakespeare&srlimit=500&srprop=redirecttitle . This would be easier if CSV was a valid API output format.
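As a rough illustration of that clean-up step, the Python sketch below splits results by whether they carry a redirect title and writes them out as CSV. It assumes that, with srprop=redirecttitle, a hit has a "redirecttitle" field only when it was matched through a redirect; the function names and the sample hits are invented for the example.

```python
import csv
import io


def split_by_redirect(hits):
    """Separate direct title matches from matches that came via a redirect."""
    direct, via_redirect = [], []
    for hit in hits:
        if "redirecttitle" in hit:
            via_redirect.append((hit["redirecttitle"], hit["title"]))
        else:
            direct.append(hit["title"])
    return direct, via_redirect


def to_csv(hits):
    """Render hits as CSV text with an empty cell where no redirect matched."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "redirecttitle"])
    for hit in hits:
        writer.writerow([hit["title"], hit.get("redirecttitle", "")])
    return buffer.getvalue()


# Illustrative sample only, shaped like entries of query.search.
sample_hits = [
    {"title": "William Shakespeare", "redirecttitle": "Shakespeare"},
    {"title": "Shakespeare in Love"},
]

direct, via_redirect = split_by_redirect(sample_hits)
print(direct)
print(to_csv(sample_hits))
```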

197.235.98.211 (talkcontribs)
Speravir (talkcontribs)

(Nitpicking) @IP, apparently not: User/developer debt closed phab:T90807 as declined, but with the words “If there is more of a use case than what is in this ticket, please reopen and show examples / steps to reproduce.” Well I did not reopen, because this ticket was not found in a search for older tickets, but the same user/dev debt did not close the ticket opened by me. It seems I showed some valid use cases.

197.235.98.211 (talkcontribs)

Well, it seems more sensible to formulate it as "restore ability to remove redirects from search results". This was explicitly and deliberately removed for specific reasons.

The general problem with wikis is that they attempt to cater to two sometimes conflicting groups. Pure readers, and editors. The average reader wants the best results, and doesn't even know about the existence of redirects. An editor sometimes wants worse results because they want to address a specific problem.

There are several orders of magnitude more readers than editors, and that's likely the reason it was removed. There is no doubt that such filters have their uses, although the question is whether that justifies restoring the older functionality. Also, chances are that "debt" forgot about the older ticket, or they would likely have reopened it and marked that task as a duplicate.

Speravir (talkcontribs)

Fair enough.

Fgnievinski (talkcontribs)

A partial workaround is to restrict search to Talk pages, which often are missing for redirects.

Reply to "How to exclude redirects from search results?"

Can't run UpdateSearchIndexConfig.php file

Summary by DCausse (WMF)

Solved by downgrading from php 8.4 to php 8.2

75.130.249.175 (talkcontribs)

MediaWiki 1.39:

When I run the command php UpdateSearchIndexConfig.php in the CirrusSearch/maintenance folder, I get the following error:

[930f130bf0cbf86ca7483c41] [no req] Error: Class "MediaWiki\Extension\AbuseFilter\Parser\RuleCheckerFactory" not found
Backtrace:
from /var/www/html/w/extensions/AbuseFilter/includes/ServiceWiring.php(113)
#0 /var/www/html/w/vendor/wikimedia/services/src/ServiceContainer.php(124): require()
#1 /var/www/html/w/includes/MediaWikiServices.php(447): Wikimedia\Services\ServiceContainer->loadWiringFiles()
#2 /var/www/html/w/includes/MediaWikiServices.php(285): MediaWiki\MediaWikiServices::newInstance()
#3 /var/www/html/w/includes/Hooks.php(174): MediaWiki\MediaWikiServices::getInstance()
#4 /var/www/html/w/includes/exception/MWExceptionHandler.php(807): Hooks::runner()
#5 /var/www/html/w/includes/exception/MWExceptionHandler.php(336): MWExceptionHandler::logError()
#6 /var/www/html/w/includes/AutoLoader.php(244): MWExceptionHandler::handleError()
#7 /var/www/html/w/includes/AutoLoader.php(244): require(string)
#8 /var/www/html/w/extensions/AbuseFilter/includes/ServiceWiring.php(113): AutoLoader::autoload()
#9 /var/www/html/w/vendor/wikimedia/services/src/ServiceContainer.php(124): require(string)
#10 /var/www/html/w/includes/MediaWikiServices.php(447): Wikimedia\Services\ServiceContainer->loadWiringFiles()
#11 /var/www/html/w/includes/MediaWikiServices.php(285): MediaWiki\MediaWikiServices::newInstance()
#12 /var/www/html/w/includes/Setup.php(322): MediaWiki\MediaWikiServices::getInstance()
#13 /var/www/html/w/maintenance/doMaintenance.php(83): require_once(string)
#14 /var/www/html/w/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(117): require_once(string)
#15 {main}


This file does exist in the AbuseFilter/includes/Parser folder. Does anyone know what's going on here?

PMiazga (WMF) (talkcontribs)

It's difficult to tell what could be causing this. Let me ask you a couple of questions and offer some suggestions first:

  • AbuseFilter files are autoloaded automatically thanks to composer `AutoloadNamespaces`. Please check if you have the merge-plugin enabled - Composer#Using composer-merge-plugin
  • Did you update/install anything recently? Is it a new set-up/installing new extensions you're trying to finalise, or is it something that worked before but stopped working after update?
  • I assume you already have both AbuseFilter and CirrusSearch extensions enabled (by calling wfLoadExtension() in LocalSettings file).
  • Can you specify the exact version of MediaWiki? Are you on 1.39.10? I tried to run it and it worked for me, so it may be related to a specific version or a version mismatch.
Bawolff (talkcontribs)

To confirm, does mediawiki work normally (like during web views) and do maintenance scripts in mediawiki core work fine (like e.g. view.php)? What version of AbuseFilter do you have?

75.130.249.175 (talkcontribs)

- I'm on MediaWiki 1.39.5.

- Just enabled the merge plugin, all composer json files should be correct. It should be noted that I installed the plugin using the tarball file, as I can't figure out how to install it from git and have it be compatible for MediaWiki 1.39.

- Both AbuseFilter and CirrusSearch extensions are enabled in LocalSettings.php

- I haven't installed any new plugins recently - I just did a full php re-install to see if that was the problem and it wasn't

- My wiki works normally in web view; scripts like view.php don't work because they run into the same error from AbuseFilter

DCausse (WMF) (talkcontribs)

Could you check if you have multiple versions of AbuseFilter installed?

I wonder if, with multiple versions installed, an old one gets its classes loaded (likely <= 1.36) but the new one gets its ServiceWiring file executed.

Perhaps one way to investigate would be to print the file location of one of the AbuseFilter classes:

$reflector = new \ReflectionClass( 'MediaWiki\Extension\AbuseFilter\FilterUser' );
print("FilterUser class location: " . $reflector->getFileName() . "\n");

You could perhaps put this at the very beginning of /var/www/html/w/extensions/AbuseFilter/includes/ServiceWiring.php?

Bawolff (talkcontribs)

Just to close the loop, this user reports that the issue went away after they downgraded from php 8.4 -> php 8.2

Reply to "Can't run UpdateSearchIndexConfig.php file"
Jonteemil (talkcontribs)

In the page it says that the search index will be updated, at least once a day. I've been trying to fix broken files over at Commons that have 0 x 0 px. I used the search fileh:0 filew:0 filetype:image -filemime:image/tiff to find them. Now, files I fixed weeks ago are still listed in the results. When will they go away?

DCausse (WMF) (talkcontribs)

Thanks for reporting the problem, there seems to be a problem in the way CirrusSearch is handling these edits, I filed Phab:T342562 to track and fix the issue.

Jonteemil (talkcontribs)

Okay, perfect.

Reply to "Search index update"

How to search the fields of the File information template on Commons?

Prototyperspective (talkcontribs)

Please see this thread. How to search for example for a specific string specifically in the source field?

Also how can one search for files from a specific uploader? (I'd like to check which of my video2commons uploads were imported below resolution at source.)

EBernhardson (WMF) (talkcontribs)

Unfortunately, the image description is simply an argument to a template. CirrusSearch doesn't do anything at that level and can't be that specific. Something like insource:kathmandu would require the wikitext source to have the word kathmandu in it, but it's not a great substitute.


Regarding filtering by uploader, I'm not too familiar with how the P170 there is structured, but with structured data available it seems plausible the appropriate information could be indexed. Today, though, P170 is indexed as a plain statement and does not include any context about it. The best workaround I can offer is that the Information template used on many images renders in such a way that searching for "Author <name>", with the quotes, tends to bring up only pictures from that author.

Prototyperspective (talkcontribs)
  1. I don't know why but the results for insource:"kathmandu" don't seem to show the intended results
  2. The uploader username is not in the structured data
  3. The link you shared only shows original works by that username
  4. So I will create an issue for enabling showing uploads by a particular user (please let me know if this could/should be changed in a tool other than CirrusSearch)
  5. I think the best workaround currently would be to use insource with the field name first so for example I searched for insource:"|source=[https://soundcloud.com to identify files for c:Category:Audio files from Soundcloud.com. I think easily searching fields of the File pages' Information template could be enabled by
    1. Developing some regex that searches for any content after e.g. |source=
    2. Creating some alias for it so instead of writing some complex regex query every time one can simply enter e.g. info-source:"soundcloud.com"
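To make the regex idea concrete, here is a small Python sketch that pulls the value of one Information template field out of a page's wikitext. It is only a local illustration: `info-source:` is a proposed alias that does not exist, the regex syntax accepted by insource: in CirrusSearch is not the same as Python's, and the function names and sample wikitext are invented for the example. Only the first line of a field's value is captured.

```python
import re


def field_value(wikitext, field):
    """Return the first-line value of `|field=` in the wikitext, or None."""
    pattern = re.compile(
        rf"^[ \t]*\|[ \t]*{re.escape(field)}[ \t]*=[ \t]*(.*?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(wikitext)
    return match.group(1) if match else None


def field_contains(wikitext, field, needle):
    """True if the field exists and its value contains `needle`."""
    value = field_value(wikitext, field)
    return value is not None and needle.casefold() in value.casefold()


# Illustrative wikitext only.
sample = """{{Information
|description={{en|1=A track}}
|source=[https://soundcloud.com/example/track SoundCloud]
|author=Example
}}"""

print(field_value(sample, "source"))
print(field_contains(sample, "source", "soundcloud.com"))
```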
Keith D (talkcontribs)

{{user|keith_d}}

A problem with searching the information template fields for things like author is that author also appears in the {{tl|Credit line}} template and the 2 could be different.

Prototyperspective (talkcontribs)

I first misunderstood what you were saying but understood it via your comment in your proposal. That may be an issue for other templates, but I think in that case it doesn't matter because it would also contain the same author name, so it would even be best if both fields are searched (actually it would be a problem if it doesn't search both fields).

Reply to "How to search the fields of the File information template on Commons?"

Searching talk pages that use Structured Discussions

HaeB (talkcontribs)

Is it possible to use CirrusSearch to search (the topic pages of) a particular talk page that uses Structured Discussions (like this one)? I.e. restrict search results to only topics from that talk page.

Pppery (talkcontribs)

No.

Tacsipacsi (talkcontribs)

Unfortunately, it’s not possible to search Structured Discussions pages using CirrusSearch at all, with or without constraining the search to a particular page. This is one of the many reasons for which Structured Discussions is deprecated and to be replaced with DiscussionTools.

Reply to "Searching talk pages that use Structured Discussions"

Abuse filter logs on plwikiquote

Ferien (talkcontribs)

I'm not really sure why this is occurring, but I'm pretty sure this isn't supposed to happen in abuse filter logs to this level.

Tacsipacsi (talkcontribs)

I’m pretty sure plwikiquote shouldn’t block the account from being created (filter 3). I don’t think CirrusSearch is really at fault here – it just tries to create its account on first use. (Since it doesn’t succeed, the next time also counts as the first use. And the next one. And so on.)

Ferien (talkcontribs)

Thanks, I didn't know why it was occurring or what abuse filter it was relating to as I can't understand the language.

DCausse (WMF) (talkcontribs)

Thanks, I reported phab:T373778 to have a closer look into it.

Reply to "Abuse filter logs on plwikiquote"
Beland (talkcontribs)
Pppery (talkcontribs)

That's intentionally blank, as a result of an untidy refactoring in 2015 that's not worth fixing now. This page uses Structured Discussions, which doesn't have the concept of archiving, and instead uses an infinite scroll system.

Beland (talkcontribs)

Aha! Hmm, that seems somewhat poor. There's no indication in the UI that scrolling down to the bottom of the page and staying there will show more threads, and there's no apparent facility for searching the entire history of the page? The URL I loaded was:

https://www.mediawiki.org/wiki/Help_talk:CirrusSearch#Relevance_52070

It seems like that should take me to the thread if it's on the page, but I can't tell if it is or isn't, and searching on my username doesn't really work because some threads are collapsed. -- Beland (talk) 03:40, 25 June 2024 (UTC)

Pppery (talkcontribs)

That would be Topic:S8cojikw0xzel2u8 (found via your contributions). The URL seems to have dated from way back when LiquidThreads was involved, and stopped working in 2015 when this page was migrated.

The chance of this getting fixed, realistically, is zero, since both discussion systems are deprecated and going to be removed someday.

Beland (talkcontribs)

Ah, whew, I was a bit worried these were going to be spreading to other wikis. 8)

Reply to "Archive broken?"
2001:14BA:9CD6:4200:D43C:5ABA:9AD8:104 (talkcontribs)
2001:14BA:9CD6:4200:D43C:5ABA:9AD8:104 (talkcontribs)

That ping didn't work so I'll try again: User:JWBTH

JWBTH (talkcontribs)

Done, thanks for pointing it out.

Reply to "Update en-wiki"

Automatically jump to first result

Aschroet (talkcontribs)
TheDJ (talkcontribs)

There is no such functionality. What you are seeing is title matching. If your search exactly matches the title of a page, it will take you to that page. For wikidata the title of a page is its Q id. So you can do https://www.wikidata.org/w/index.php?search=Q111351350 and it will take you to that Q id.

Multiple keyword searches

Seeker1030 (talkcontribs)

Hi, how can I search using multiple keywords? For example: Libra ascendant born in 1965. How could we search with these parameters?

Speravir (talkcontribs)

Simply by typing libra ascendant born 1965 into the search form (I assume "on" is a so-called stop word). If there are dedicated categories for a topic, you could also use the filter keyword incategory, e.g. ascendant libra incategory:"1965 births".

Reply to "Multiple keyword searches"