
Extension talk:CirrusSearch/Flow

About this board

This page is an archive. Do not add new topics here.

Please ask new questions at Extension talk:CirrusSearch instead.

Search in Japanese incorrectly parsed

2
Screaming Bell (talkcontribs)

Currently, we are having issues on Ylvapedia with Japanese searches due to the way phrases and words are parsed. Users are reporting that their search phrases are being broken up into individual words, rather than searching the full query (such as in the case of 2+ kanji/kana words). What configuration changes should we make to better support Japanese searches? For example:

https://ylvapedia.wiki/index.php?search=きのこ

Compare to:

https://ylvapedia.wiki/index.php?search=%22きのこ%22

The expected result for either search is for a page with the string "きのこ" to be the first result, rather than pages merely containing き,の, and こ.

Our software version page can be found here:

https://ylvapedia.wiki/wiki/Special:Version

TJones (WMF) (talkcontribs)

TL;DR: Looks like you have some older configuration, and possibly some older code. I'd suggest updating both if possible. Your language processing could be improved for your multi-lingual data, but I'm not sure that's all you need, so I've pinged a couple of other people who might be able to help further.

From the menu on your wiki, you are supporting English, Spanish, Portuguese, Chinese, Japanese, and Korean on one wiki. That is a lot going on!

According to your Cirrus Settings you have the CJK analyzer enabled, with no customization,[*] for your "text" field, and for the "plain" field (which is used with quotes) you have a fairly bare-bones configuration. Slightly re-ordered and reformatted:

"plain": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "standard",
    "filter": [ "lowercase" ]
},
...
"text": {
    "type": "cjk",
    "char_filter": [ "word_break_helper" ]
},

[*] Note that "word_break_helper" doesn't actually do anything in the "text" field because you are using the monolithic CJK analyzer. (That is an old bug in the configuration that was fixed quite a while ago.) Generally, if your "type" is anything but "custom", extra specifications for "char_filter", "filter", "tokenizer", etc. are ignored. It ought to throw an error, or at least a warning, but it does not. Anyway, "word_break_helper" converts a few characters (_.():) to spaces, which it is doing in the "plain" field.

For reference, the CJK analyzer in Elasticsearch 7.17 (from your versions page) also uses the standard tokenizer (which splits CJK strings into single characters). But it then uses "cjk_bigram", which puts CJK characters back together in overlapping bigrams.
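(For reference, per the Elasticsearch docs, the built-in CJK analyzer behaves roughly like this custom definition; the name here is just illustrative, and the "stop" filter stands in for CJK's own English-word stop list:)

"cjk_like": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [ "cjk_width", "lowercase", "cjk_bigram", "stop" ]
}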

So, on Japanese Wikipedia (which also uses the CJK analyzer, though further customized), when searching for きのこ (without quotes) I'd expect it to actually be looking for きの and のこ (the two overlapping bigrams from the query). The "plain" field will be looking for き, の, and こ as separate tokens (according to the "standard" tokenizer), but should look for them all in a row, because of the quotes.

And that's what I see in the detailed results dump for Japanese Wikipedia—there are lines for "text:きの" and "text:のこ" (the overlapping bigrams). (You also see "text.plain:きのこ", but that's because it's using a different tokenizer for the plain field—more on that in a minute.) On Ylvapedia, I see "all.plain:の", "all.plain:き", and "all.plain:こ" for most results, and "all:のこ" and "all:きの" for a few.
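(You can see the bigramming directly with the _analyze API; this sketch assumes Elasticsearch is reachable on localhost:9200:)

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_analyze?pretty' \
  -d '{ "analyzer": "cjk", "text": "きのこ" }'
# returns the two tokens きの and のこ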

The first unexpected thing is that the Ylvapedia results are using results from the "all" field. I also noticed, comparing Ylvapedia's Cirrus Config to Japanese Wikipedia's, that your profiles are different—I particularly noticed "CirrusSearchFullTextQueryBuilderProfile", though I'm not sure that's the most important thing. This is outside my area of expertise, because I'm usually optimizing for the config we have on Wikipedia and its sister projects. Overall, though, while you seem to be up-to-date on your software (though I'm surprised at the plain monolithic CJK config—that's pretty old), it seems like you have some old configuration files... and maybe old code, too. I also see that you have Semantic MediaWiki enabled, which is another unknown for me. I'm not sure what to suggest, but maybe @DCausse (WMF) or @EBernhardson (WMF) will have some good ideas, or at least better questions.

I also don't know what Elastic plugins you have installed (on the command line, you can do something like bin/elasticsearch-plugin list, or curl elastichost:9200/_cat/plugins). Given your multilingual setup, if you can handle the complexity, I'd suggest separate indexes for each language, like we do for Commons, but that is definitely a lot. Otherwise, I'd suggest at least installing the ICU plugin and using the "icu_tokenizer"—it has a decent dictionary for CJK and other East Asian spaceless languages, and gives more targeted results. (It isn't perfect, but any complex approach for CJK parsing is going to have some errors here and there.)
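For reference, checking and installing the plugin might look something like this (paths assume a standard Elasticsearch layout; you'd need to restart the node and rebuild your search indexes afterwards):

bin/elasticsearch-plugin list
curl -s 'http://localhost:9200/_cat/plugins'
bin/elasticsearch-plugin install analysis-icu    # provides icu_tokenizer, icu_normalizer, and icu_folding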

Maybe something like this:

"plain": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "icu_tokenizer",
    "filter": [ "lowercase" ],
    },
...
"text": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "icu_tokenizer",
    "filter": [ "cjk_width", "lowercase" ]
}

If you want to use a stop word list, you could use the one used by CJK (which is all English words), use the default English list (as used by the English analyzer), or build a custom stop word filter, combining lists for English, Spanish, and Portuguese (possibly dropping any that look like they might cause cross-language interference.. nothing really jumped out at me on a quick glance, though).

I'd personally also prefer to use "icu_normalizer" instead of "lowercase" in both (and then you also shouldn't need "cjk_width", if I recall correctly).
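Putting the stop word and "icu_normalizer" ideas together, a rough, untested sketch might look like this (the *_stop filter names are made up; "_english_", "_spanish_", and "_portuguese_" are Elasticsearch's predefined stop word lists, and "icu_normalizer" needs the ICU plugin):

"filter": {
    "english_stop": { "type": "stop", "stopwords": "_english_" },
    "spanish_stop": { "type": "stop", "stopwords": "_spanish_" },
    "portuguese_stop": { "type": "stop", "stopwords": "_portuguese_" }
},
"analyzer": {
    "text": {
        "type": "custom",
        "char_filter": [ "word_break_helper" ],
        "tokenizer": "icu_tokenizer",
        "filter": [ "icu_normalizer", "english_stop", "spanish_stop", "portuguese_stop" ]
    }
}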

Also, the "icu_normalizer" has a couple of problems with the standalone forms of handakuten and dakuten (they get regularized to the combining forms with an extra space), so you might want to copy the "cjk_charfilter" from either the Commons or Japanese configs if your editors commonly use those forms.. it seems like the combining forms are more commonly used, though.

Reply to "Search in Japanese incorrectly parsed"

Keyword recognition

TobyFu (talkcontribs)

Is there any way to have hidden keywords inside articles that are not displayed to the reader but are still picked up by the search?

Ideally, I would like these keywords to get a high weighting or priority in the search results.

Thank you very much for any idea.

DCausse (WMF) (talkcontribs)
EBernhardson (WMF) (talkcontribs)

The closest thing that can be provided in the current setup might be redirects. They are considered alternate titles, and matches on them generally push the result higher up the list.
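For example, a redirect page named after the keyword (names here are made up) would contain just:

#REDIRECT [[Actual article title]]

and a search for that keyword then pulls up the target article via the redirect.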

Reply to "Keyword recognition"

index(es) do not exist. Did you forget to run updateSearchIndexConfig

3
Summary by Spiros71

The path for the runJobs.php cronjob was not compatible with the new server

Spiros71 (talkcontribs)

I migrated to a new server (Almalinux 9, MW 1.39, ElasticSearch 7.10.2) and when I run php ForceSearchIndex.php --skipLinks --indexOnSkip I get the above error message.

Then I ran php UpdateSearchIndexConfig.php --startOver and I was able to (apparently) create an index with

php ForceSearchIndex.php --skipLinks --indexOnSkip
php ForceSearchIndex.php --skipParse

But when I checked, no autocomplete or search results appeared. Looking at the index path, the files there were very small.

DCausse (WMF) (talkcontribs)

ForceSearchIndex.php should tell you how many pages are being indexed; did you see numbers that match your wiki?

In case of errors, some index requests may be retried. Can you check the status of your Manual:Job_queue and list the number of jobs via Manual:ShowJobs.php?
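For example (run from the MediaWiki installation directory; the CirrusSearch job names can vary by version):

php maintenance/showJobs.php --group                          # job counts per type, including cirrusSearch* jobs
php maintenance/runJobs.php --type cirrusSearchElasticaWrite  # run any queued CirrusSearch write jobs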

What could explain the behavior you are seeing is that the pages are failing to be indexed and you don't notice it. Checking MW logs or elasticsearch logs might help.

Spiros71 (talkcontribs)

Thanks, David. The path for the runJobs.php cronjob was not compatible with the new server, I had to change that for the index to work.

Reply to "index(es) do not exist. Did you forget to run updateSearchIndexConfig"

Incompatible with ElasticSearch 7.17

5
Mshastchi (talkcontribs)

I installed ElasticSearch, but when running the maintenance scripts I get the error that CirrusSearch is only compatible with ElasticSearch 7.10, which has been EOL for a long time. When will the extension be updated to support the latest versions of ElasticSearch?

Kghbln (talkcontribs)

This is a guess only: probably not at all, since WMF is migrating to OpenSearch. There, I'd expect support for a currently supported version. Let's see what others think.

MetinPueye (talkcontribs)

We also have the same problem:

We have an Elasticsearch 7.17 installation.

Before the MediaWiki update, we had version 1.39, and it worked without any problems.

Now we are using MediaWiki version 1.42. According to the wiki, only Elasticsearch 7.10.x is supported.

When running the maintenance scripts, I also get the error message that the version is not supported.

Is it intentional that versions > 7.10.x do not work?

Is there perhaps a workaround?  

EBernhardson (WMF) (talkcontribs)

Unfortunately, there was a licensing change at Elastic: versions of Elasticsearch after 7.10.2 have a different license, which we chose not to move forward with. It's plausible that the compatibility check in includes/Maintenance/ConfigUtils.php could be loosened and everything would work on 7.17, but we've never tested it.

Longer term, the plan is indeed to migrate everything over to OpenSearch. This should happen over the coming months; we already have test instances running CirrusSearch with OpenSearch 1.

46.151.207.30 (talkcontribs)

Since I recently upgraded to 1.42.3 I'm in limbo. The upgrade itself took me half an hour, but I had to give up on upgrading the search engine after struggling through lots of conflicting documentation.

Please publish the opensearch branch of cirrussearch, so I can switch now and help with development.

Reply to "Incompatible with ElasticSearch 7.17"

Greek search no longer truly diacritics insensitive

9
Spiros71 (talkcontribs)

For example, go to https://en.wiktionary.org/wiki/Wiktionary:Main_Page and try inputting ανθρωπος. Two existing entries will not appear: άνθρωπος and ἄνθρωπος. The same can be seen in my recent upgrade to ElasticSearch 7.10.2 with the core ICU plugin and extra:7.10.2-wmf12. Go to https://lsj.gr/ and try inputting σιφων. The missing entries (σίφων and σίφωνας) only appear when searching with the accent, σίφων. Any advice on how to remedy this would be warmly appreciated!

TJones (WMF) (talkcontribs)

After writing up everything below, I realized I'm diagnosing the current behavior because @EBernhardson (WMF) thought this might be related to some recent work I did on diacritic folding, but now I don't think that's it. The info below might still be helpful, though. It's possible that there have been some changes to the weighting of exact prefix matches in suggestions, so I'll also invite @DCausse (WMF) to weigh in. He's more likely to remember any autocomplete changes that weren't so recent.


I believe you are talking about the drop-down list of suggestions (which we call the "autocomplete" suggestions), since ἄνθρωπος and άνθρωπος are the top two results in the full search results list for ανθρωπος.

The autocomplete search isn't truly insensitive to anything—including case, spaces, punctuation, and diacritics—in that exact matches can always be ranked a little better than inexact matches.

For case, consider the autocomplete suggestions for hun, Hun, ȟun, and hün on English Wiktionary:

  • hun: hun, hunger, hunt, hundred, hund, Hund, Hun, hung...
  • Hun: Hun, hunger, hun, hunt, hundred, hund, Hund, hung...
  • ȟun: hunger, hun, hunt, hundred, hund, Hund, Hun, hung...
  • hün: hün, Hündin, Hüne, hünkâr, Hündchen, hündür, hünnap...

I think the ȟun results are the "truest" results for h+u+n because there are no exact matches. In the other cases, exact matches (hun, Hun) become the first result, and exact prefix matches (everything starting with hün..) can also rank higher.

Note that if you add spaces to hün and search for h ü n, you get the same list as for ȟun above, because spaces can be ignored, and there are no exact matches or exact prefix matches with those spaces.

The problem with ανθρωπος is that it has many exact prefix matches (for those following along who don't read Greek, it's "anthropos", which is the beginning of way more than ten other Greek words), so they rank higher than άνθρωπος and ἄνθρωπος and push them out of the top ten suggestions. If you instead search for ἇνθρωπος (analogous to ȟun in the examples above), you get the results that I think you expect, with ἄνθρωπος and άνθρωπος as the first two suggestions because there is no exact match or exact prefix match.

Unless I'm misreading the diacritics (which is 100% possible with Greek diacritics!) it looks like σιφων does the right thing on both English Wiktionary and LSJ, presumably because there aren't as many exact prefix matches competing for space in the suggestions list.

As for remedies, it depends on what you are looking for. If you want exact prefix matches to count for less, or for diacritics to be completely ignored, I'm not sure there's anything to be done. It might help in this case, but it would cause problems in general.

If you want a solution that you, as a savvy searcher, can use in cases like this where you know or suspect that there might be relevant results that differ by diacritics but which are being swamped by exact prefix matches, you can use the space hack we used for h ü n: if you search for α ν θ ρ ω π ο ς (or less ridiculously, just α νθρωπος or ανθρωπο ς), then you get suggestions without any exact prefix re-ranking. Of course there is always the chance that you get some exact prefix matches after adding one space. If there are too many, add another space—not a great solution, but it works.

For less savvy searchers, hitting return will give you the full-text search results, which do not look for arbitrary prefix matches (though stemming matches can still be prefixes), and at least in this case, the desired results are the top two.

(Note: I've been testing in the search bar on the search results page rather than the search box at the top of the page. These are usually the same, but weird differences can occur. The only thing I've noticed today is that the two boxes seem to use different events to trigger autocomplete searches. Editing hun to hün gives different results because typing ü on my American keyboard uses dead keys, which trigger Javascript events in the big search results search box, but not in the search box at the top of the page. Historical UI cruft, that is. Sigh.)

Spiros71 (talkcontribs)

Tray, that is a very thorough and exhaustive reply as usual!

The points I am making are:

1) I can see a clear change on this from the times of the ElasticSearch 5.6 implementation, and

2) usability (for Greek and Ancient Greek)—being able to get what one is looking for with minimum effort. When it comes to Ancient Greek (polytonic), many "weird" accents/breathings are used which are not readily available on most keyboards and layouts, and users prefer to omit them (this is also typical of how Greek users search on Google, even for Modern Greek, which only has one accent). So, in the specific example, using ανθρωπος I would expect to get the two autocomplete results which are "perfect" matches (minus the diacritics, of course). But I do not get these results! A savvy user or a scholar "might" type the full diacritics version (speaking of Ancient Greek here), but the average user will be dumbfounded as they get no results at all with the no-diacritics approach. Also, yes, one could hit search and still get them, but the point of autocomplete is faster access to information.

I am not advocating a sweeping approach here for all languages, as I am not an expert, but I can see clearly the benefit for Greek and Ancient Greek.

DCausse (WMF) (talkcontribs)

Regarding ανθρωπος and άνθρωπος and ἄνθρωπος on English Wiktionary:

These two results are found at positions 11 and 12: https://en.wiktionary.org/w/api.php?action=opensearch&format=json&formatversion=2&search=%CE%B1%CE%BD%CE%B8%CF%81%CF%89%CF%80%CE%BF%CF%82&namespace=0&limit=12

Unfortunately we display only 10.

If you enter Special:Search these two should move back to the top: https://en.wiktionary.org/w/index.php?go=Go&search=%CE%B1%CE%BD%CE%B8%CF%81%CF%89%CF%80%CE%BF%CF%82&title=Special%3ASearch&ns0=1

Unfortunately, the completion search only ranks higher the one suggestion that is a perfect exact match. It does not rank suggestions that appear to be fully written titles above ones that appear to be partially written. It is something we know is not quite right, but we don't yet have a solution for it.

Another cause is also that completion prefers suggestions that match a prefix with its accents:

  • ανθρωποσφαγή is preferred over άνθρωπος when searching ανθρωπος

note that ς is just considered identical to σ here.

If this issue is quite recent, I'm not sure what could have caused it; I don't think anything changed in the software that could have directly caused this behavior. Could it be that pages added over time caused these suggestions to slip out of the 10 displayed results?

See phab:T132637 for when we first implemented diacritics folding for greek, the example query αθανατος used at the time to report the bug is still working as expected.

Spiros71 (talkcontribs)

Yes, David, you pointed very aptly to some of the culprits here:

Another cause is also that completion prefers suggestions that match a prefix with its accents:

  • ανθρωποσφαγή is preferred over άνθρωπος when searching ανθρωπος

note that ς is just considered identical to σ here

My point is that treating ς as identical to σ is something that could resolve such cases. The former is of course only used at the end of a word. And quoting from that phab issue, I concur with Tray:

French speakers usually have no trouble typing French diacritics, but they may have no idea how to type Ancient Greek polytonic diacritics—which speakers of Modern Greek may also have trouble with, just as speakers of Modern English usually don't know how to type ð, þ, æ, or ē, despite them all being used in the first few lines of Beowulf! Hwæt! (You call me a language nerd, now I gotta act like one.)

TJones (WMF) (talkcontribs)

I can see a clear change on this from the times of the ElasticSearch 5.6 implementation

Wow.. that was 5 years ago for us, so I can't recall every change that might have been relevant in that time. Not sure when it would have changed.

I understand your usability argument, but it is often the case in search engineering that optimizing for one use case breaks others. We are already ignoring the Greek diacritics for the recall phase, but the exact matches come into play for the ranking phase. It's an issue for ανθρωπος (ignoring final sigma, see below) because there are so many words without diacritics that match better.

There's been a similar complaint about overly exact case matching (T364888), but I don't think we can only ignore Greek diacritics or only ignore case for ranking, which—on English Wiktionary for example—would mean that typing an would give "exact" matches with an, àn, ån, án, än, ân, An, Ân, ãn, ān, ăn, ản, ǎn, Ấn, ấn, ẩn, and ắn. (Those are the top full-text results, though.. and I missed aN!) You could argue these results are less usable in autocomplete, since most people most of the time will not be looking for them (on English Wiktionary).

We also were tossing around ideas for improving full-title matches, which could have similar side-effects for short queries. (This also applies, albeit less voluminously, to queries longer than 2 letters, but I stopped looking for details because it's a lot of manual searching for examples since autocomplete doesn't work that way at the moment.)

There's always a trade-off, and having to fall over to full-text search is not the worst trade-off.

I've opened a ticket for the final/non-final sigma issue (T377495), though I'm not sure it will help you. It definitely makes sense on Greek-language wikis, but not as much for non-Greek wikis, like English Wiktionary. (LSJ looks to be using English as its analysis language, too.)

You should be able to set CirrusSearchICUNormalizationUnicodeSetFilter and CirrusSearchICUFoldingUnicodeSetFilter to "[^ς]" in mediawiki/extensions/CirrusSearch/extension.json in your LSJ installation to exempt ς from folding, but that would disable the ς to σ mapping everywhere (autocomplete, full-text, template lookups, etc.), and it still won't work if your language is set to Greek, because the Greek-specific lowercase filter also maps ς to σ.. everyone really wants that mapping to happen! But it's an immediate config option that might help.
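Equivalently, since those are registered config settings, you could probably override them from LocalSettings.php instead of editing extension.json (an untested sketch; a reindex would be needed afterwards):

$wgCirrusSearchICUNormalizationUnicodeSetFilter = '[^ς]';
$wgCirrusSearchICUFoldingUnicodeSetFilter = '[^ς]';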

Spiros71 (talkcontribs)

Interestingly, Tray, ανθρωπος appears not to be an issue in my case (https://ibb.co/StV1J6M). There is one other funny thing happening, though, and I'm not sure if this is up your alley (David included): I do get search results for a non-existent page, σῑ́φων (https://ibb.co/HgrzHpc).

TJones (WMF) (talkcontribs)

With respect to the autocomplete of ανθρωπος on LSJ, we know the recall portion of autocomplete gets everything we'd want, but the ranking is where things go awry. Would it make sense on LSJ for ἄνθρωπος and άνθρωπος to be much more popular? I don't recall all the factors that go into ranking the autocomplete results, but different stats on your site could lead to different rankings that could overpower the exact prefix match advantage.

As for the non-existent page for σῑ́φων, I can't reproduce it, which I think is because I'm not logged in so I don't get offers to create pages. My guess is that there is either some normalization that isn't happening (ι + ̄ + ́ vs ῑ + ́) or there's an invisible character (soft hyphens will cause this and are common enough in non-Greek contexts), etc. An example of a lack of normalization you can easily see is that searching for GrEeK LaNgUaGe on enwiki will offer to let you create that exact page.

(BTW, it's "Trey" with an "e".. cognate with τρεις, no less.)

Spiros71 (talkcontribs)

Τριάκις, thanks, Trey! Yes, I think the stats would make the difference as άνθρωπος is a very common word.

Reply to "Greek search no longer truly diacritics insensitive"

Cloned dB and search stopped working

1
Summary by Spiros71

I had to reindex.

Spiros71 (talkcontribs)

I cloned my DB, then pointed LocalSettings.php to the new DB, and this resulted in search no longer working ("An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later"). The ElasticSearch service is running OK (I also restarted it) and I can see that the indexes exist. Is there anywhere else that I should change the DB name?

MW 1.31, ElasticSearch 5.6.13

In error log I see:

2024-09-27 10:24:00 host.xxx.gr xxx_1_31_0_bkp: Search backend error during full_text search for 'σαῦρα' after 2: illegal_argument_exception: no mapping found for field [suggest]

Large page not indexed: TypeError in HtmlFormatter

213.61.173.172 (talkcontribs)

I have one large page with a large wikitable (12 columns, 3300 rows).

This page is the only large page and the only one not indexed, because of a "TypeError" in the HtmlFormatter:

api.php?action=query&format=json&prop=cirrusbuilddoc&pageids=1231&formatversion=2

{
    "error": {
        "code": "internal_api_error_TypeError",
        "info": "[8c19427845d7c9abfb9f5240] Exception caught: HtmlFormatter\\HtmlFormatter::onHtmlReady(): Argument #1 ($html) must be of type string, null given, called in /var/www/w/vendor/wikimedia/html-formatter/src/HtmlFormatter.php on line 314",
        "errorclass": "TypeError",
        "trace": "TypeError at /var/www/w/vendor/wikimedia/html-formatter/src/HtmlFormatter.php(90)\nfrom /var/www/w/vendor/wikimedia/html-formatter/src/HtmlFormatter.php(90)\n#0 /var/www/w/vendor/wikimedia/html-formatter/src/HtmlFormatter.php(314): HtmlFormatter\\HtmlFormatter->onHtmlReady()\n#1 /var/www/w/includes/content/WikiTextStructure.php(179): HtmlFormatter\\HtmlFormatter->getText()\n#2 /var/www/w/includes/content/WikiTextStructure.php(221): WikiTextStructure->extractWikitextParts()\n#3 /var/www/w/includes/content/WikitextContentHandler.php(167): WikiTextStructure->getOpeningText()\n#4 /var/www/w/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php(95): WikitextContentHandler->getDataForSearchIndex()\n#5 /var/www/w/extensions/CirrusSearch/includes/BuildDocument/ParserOutputPageProperties.php(70): CirrusSearch\\BuildDocument\\ParserOutputPageProperties->finalizeReal()\n#6 /var/www/w/extensions/CirrusSearch/includes/BuildDocument/BuildDocument.php(172): CirrusSearch\\BuildDocument\\ParserOutputPageProperties->finalize()\n#7 /var/www/w/extensions/CirrusSearch/includes/Api/QueryBuildDocument.php(58): CirrusSearch\\BuildDocument\\BuildDocument->finalize()\n#8 /var/www/w/includes/api/ApiQuery.php(671): CirrusSearch\\Api\\QueryBuildDocument->execute()\n#9 /var/www/w/includes/api/ApiMain.php(1904): ApiQuery->execute()\n#10 /var/www/w/includes/api/ApiMain.php(879): ApiMain->executeAction()\n#11 /var/www/w/includes/api/ApiMain.php(850): ApiMain->executeActionWithErrorHandling()\n#12 /var/www/w/api.php(90): ApiMain->execute()\n#13 /var/www/w/api.php(45): wfApiMain()\n#14 {main}"
    }
}


I tried it and found that when the page length (in bytes) is approximately below 793,845 bytes, indexing works without error. Above roughly 793,978 bytes I get the TypeError.

I think the page length only counts the content, so the limit seems to be about 1 MB for the whole HTML page.

Neither $wgMaxArticleSize nor $wgAPIMaxResultSize solves the issue.

I looked into the settings of PHP, the JVM, MediaWiki, and nginx, but did not find a solution.


Are there any settings to extend the limit?

DCausse (WMF) (talkcontribs)
213.61.173.172 (talkcontribs)

Thank you very much. This was exactly the issue.

I manually updated HtmlFormatter.php in my 1.39.7 installation to the new version: https://gerrit.wikimedia.org/r/c/HtmlFormatter/+/997959/2/src/HtmlFormatter.php#b306

Then I updated the index with:

php /var/www/w/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now

php /var/www/w/extensions/CirrusSearch/maintenance/ForceSearchIndex.php

Now everything is fine.

Search in files + extract and display related part of text

2
Allanext2 (talkcontribs)

I would now like to show a portion of the text related to the search, say 2-3 pages before and after the matched text, without giving the user the possibility to open the entire PDF.

How would you approach this with CirrusSearch? Are there some parameters that I can tweak? Would you recommend some API calls or hooks directly into CirrusSearch? Or would you suggest a different approach?

I've noticed that PdfHandler (with pdftotext) and TikaAllTheFiles both get the PDF content indexed.

Thank you!

DCausse (WMF) (talkcontribs)

CirrusSearch is not aware of the structure of the pdf file, so I'm not sure how I would approach this problem with CirrusSearch...

Note that MW is generally not designed to allow fine-grained access to content, so if the file is uploaded it will be viewable, and it might be hard to prevent users from viewing it.

Getting a better highlight experience for PDFs might be challenging, and Cirrus alone might not be enough: it can provide some text snippets that you could then search for again in the PDF using a library that can manipulate PDFs and reconstruct a shorter PDF on the fly (e.g. https://pymupdf.readthedocs.io/en/latest/).
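As a very rough, untested sketch of that idea with PyMuPDF (file names, the page window, and the snippet text are placeholders; this happens entirely outside CirrusSearch):

import fitz  # PyMuPDF

def excerpt_around(pdf_path, snippet, out_path, window=2):
    # Find the pages containing the snippet returned by search,
    # keep only a window of pages around them, and save a new PDF.
    doc = fitz.open(pdf_path)
    hit_pages = [page.number for page in doc if page.search_for(snippet)]
    if not hit_pages:
        return False
    first, last = hit_pages[0], hit_pages[-1]
    keep = list(range(max(0, first - window), min(len(doc), last + window + 1)))
    doc.select(keep)
    doc.save(out_path)
    return True

excerpt_around("uploaded-file.pdf", "text snippet from the search result", "excerpt.pdf")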

Reply to "Search in files + extract and display related part of text"

Search results and possible leaking of restricted content

2
Masin Al-Dujaili (WMDE) (talkcontribs)

We have an internal wiki with several namespaces which in turn have different access permissions set. Does CirrusSearch actively prevent leakage of content of pages a user has no access to?

DCausse (WMF) (talkcontribs)
Reply to "Search results and possible leaking of restricted content"

Creating index > ResponseException

6
Q2e.jua (talkcontribs)

I am trying to set up the CirrusSearch extension, but I am stuck with the following exception while running UpdateSearchIndexConfig.php:

Elastica\Exception\ResponseException from line 178 of /wiki/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php

Setup:

  • Mediawiki 1.39
  • PHP 8.1
  • ElasticSearch 7.10.1 (via Docker)

Output of extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php:

Updating cluster ...

indexing namespaces...

mw_cirrus_metastore missing, creating new metastore index.

Creating metastore index...

mw_cirrus_metastore_first

Scanning available plugins...none

Elastica\Exception\ResponseException from line 178... (full stacktrace truncated)

Ciencia Al Poder (talkcontribs)

Looks like you're using ElasticSearch 7.10.1 but the requirement is 7.10.2. I'm not sure if a minor version difference is important, though.

Q2e.jua (talkcontribs)
EBernhardson (WMF) (talkcontribs)

We would need to know what is in the ResponseException to offer a path forward. The name ResponseException itself isn't too meaningful; it mostly means Elasticsearch said no. Perhaps something in either the MediaWiki logging or the Elasticsearch logging says what exactly the problem was?
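On the MediaWiki side, a minimal LocalSettings.php sketch for surfacing more detail (the log path is just an example; CirrusSearch logs to a "CirrusSearch" channel):

$wgShowExceptionDetails = true;
$wgDebugLogGroups['CirrusSearch'] = '/var/log/mediawiki/cirrussearch.log';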

Q2e.jua (talkcontribs)

Yeah, I'm still searching for some kind of log, but could not find anything.

  • The elasticsearch docker log does not contain related information.
  • Mediawiki log: I think I need to configure the log, because there is nothing in the default installation.

My next steps:

  • Check for more logs or activate enhanced logging in MediaWiki
  • Try manually running the ElasticSearch 7.10.2 docker image, to rule out minor version differences

Q2e.jua (talkcontribs)

Update:

  • I had to manually load the 7.10.2 ElasticSearch docker image, because it was not available in the Plesk Repository.
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.10.2
  • In the end, the ElasticSearch version was not the issue. The real issue was the configuration of the ElasticSearch container: it is important to set the discovery type (see the sketch below):
discovery.type = single-node
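A sketch of running the container with that setting (the port mapping and heap size are just examples):

docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:7.10.2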

I will report this as an improvement issue to the maintainers:

  • Add a meaningful output message, or at least catch PHP exceptions and print the error message and the response message in the case of curl requests.
  • Maybe add a simple ElasticSearch setup example to the readme.