Topic on Extension talk:CirrusSearch

Search in files + extract and display related part of text

2 comments • 16:31, 24 September 2024 1 month ago


Allanext2 (talk | contribs)

I would now like to show a portion of the text related to the search, say 2 or 3 pages before and after the matched text, without making it possible to open the entire PDF.

How would you approach this with CirrusSearch? Are there parameters that I can tweak? Would you recommend API calls or hooks directly into CirrusSearch, or would you suggest a different approach?

I've noticed that PdfHandler (with pdfToText) and TikaAllTheFiles both get the PDF content indexed.

Thank you!

Edited 13:45, 28 August 2024

DCausse (WMF) (talk | contribs)

CirrusSearch is not aware of the structure of the PDF file, so I'm not sure how I would approach this problem with CirrusSearch.

Note that MW is generally not designed to allow fine-grained access to content, so if the file is uploaded it will be viewable, and it might be hard to prevent users from viewing it.

Getting a better highlighting experience for PDFs might be challenging, and Cirrus alone might not be enough. It may only provide some text snippets, which you could then search for again in the PDF using a library that can manipulate PDFs, reconstructing a shorter PDF on the fly (e.g. https://pymupdf.readthedocs.io/en/latest/).
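
As a rough illustration of that idea, here is a minimal Python sketch. It is not an existing CirrusSearch feature. It assumes the snippet comes from the standard MediaWiki search API (list=search with srprop=snippet, restricted to the File namespace), and that a separate service with read access to the original file runs PyMuPDF. The function names (build_search_url, clean_snippet, page_window, build_excerpt) and the default of 2 context pages are made up for this example.

```python
import html
import re
from urllib.parse import urlencode


def build_search_url(api_url, query, limit=5):
    """Build a MediaWiki search API URL asking for snippets in the File namespace."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srnamespace": 6,      # namespace 6 is File:
        "srprop": "snippet",
        "srlimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def clean_snippet(snippet_html):
    """Remove the highlight markup (e.g. <span class="searchmatch">) from a snippet."""
    text = re.sub(r"<[^>]+>", "", snippet_html)
    return html.unescape(text).strip()


def page_window(hit_pages, page_count, context=2):
    """Return the sorted 0-based page indices within `context` pages of any hit."""
    keep = set()
    for hit in hit_pages:
        first = max(0, hit - context)
        last = min(page_count - 1, hit + context)
        keep.update(range(first, last + 1))
    return sorted(keep)


def build_excerpt(pdf_path, phrase, out_path, context=2):
    """Write a shorter PDF holding only the pages around each match of `phrase`.

    Returns the list of page indices that were kept (empty if nothing matched).
    """
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    hits = [page.number for page in doc if page.search_for(phrase)]
    if not hits:
        doc.close()
        return []
    pages = page_window(hits, doc.page_count, context)
    doc.select(pages)                 # keep only the selected pages
    doc.save(out_path, garbage=4)     # garbage collection drops the removed pages' objects
    doc.close()
    return pages
```

A caller would fetch the JSON from build_search_url, pass each result's snippet through clean_snippet, and hand a short phrase from it to build_excerpt. Snippets are often truncated or span line breaks, so searching for a few distinctive words is likely to be more reliable than searching for the whole snippet. This also does not solve the access problem mentioned above: the original upload stays viewable unless it is kept out of reach by other means.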

16:31, 24 September 2024
