
Topic on Extension talk:CirrusSearch/Flow

Search inside uploaded documents

Xaris~mediawikiwiki (talkcontribs)

Question: Can this extension search inside documents that have been uploaded to the wiki, such as PDFs?

Ricordisamoa (talkcontribs)
NEverett (WMF) (talkcontribs)

I've just backported most of the features in Cirrus' master branch to the REL1_22 branch, including this. If you want to try it, make sure to get the new version of the Elastica plugin on its REL1_22 branch as well, and rebuild your index.
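For readers unsure what "rebuild your index" involves, here is a minimal sketch that only assembles the maintenance commands and prints them. It assumes the three-step sequence described in the CirrusSearch README (`updateSearchIndexConfig.php`, then two passes of `forceSearchIndex.php`) and a standard `extensions/CirrusSearch/maintenance` layout. The wiki root path and the helper name `rebuild_commands` are hypothetical, and the flags may differ on older branches such as REL1_22, so check the README shipped with your version.

```python
from pathlib import Path


def rebuild_commands(wiki_root: str) -> list[list[str]]:
    """Return the assumed CirrusSearch reindex commands as argument lists.

    Nothing is executed here; the caller decides whether to run them.
    """
    # Assumption: the extension lives in the default extensions/ directory.
    maint = Path(wiki_root) / "extensions" / "CirrusSearch" / "maintenance"
    return [
        # Step 1: create or update the index configuration.
        ["php", str(maint / "updateSearchIndexConfig.php")],
        # Step 2: index page content, skipping link counting.
        ["php", str(maint / "forceSearchIndex.php"), "--skipLinks", "--indexOnSkip"],
        # Step 3: second pass that fills in link data without re-parsing.
        ["php", str(maint / "forceSearchIndex.php"), "--skipParse"],
    ]


if __name__ == "__main__":
    # "/var/www/mediawiki" is a placeholder path, not taken from this thread.
    for cmd in rebuild_commands("/var/www/mediawiki"):
        print(" ".join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try on a production host; pass each list to `subprocess.run` once the paths have been verified.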

2.82.64.19 (talkcontribs)

Do we need to force-index the PDF files? I'm seeing no results from PDFs.

Nemo bis (talkcontribs)

First of all, try a null edit and wait some time (at most a few hours) for the job queue; report back if that wasn't enough.

Chris d edge (talkcontribs)

I've been working on a method that parses document files (PDFs, Word, PPT, etc.) using Tika to extract the document text, and then re-inserts the extracted text into the file_text field of the WIKI_general_first index inside Elasticsearch. On this point, I have a couple of questions: 1) Does this sound like the proper method to provide searchable text from documents in CirrusSearch? 2) Has anyone else done anything similar?

On point 2, the reason I ask is that for some documents I'm extracting text from, the resulting text can be huge (100s of MBs) and can grind the search to a halt for some queries (mostly for terms that occur rarely in the index).

Any pointers would be greatly appreciated.
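The approach described above can be sketched as follows. This is a hedged illustration, not the poster's actual code: it takes text that Tika has already extracted, collapses whitespace, caps the length, and builds a partial-document body of the form `{"doc": {...}}` as used by the Elasticsearch Update API. The cap `MAX_FILE_TEXT_CHARS` and the helper name `build_file_text_update` are assumptions introduced here, and nothing in this thread confirms that truncation resolves the slow queries.

```python
import json

# Assumed cap on indexed text; tune it for your own cluster and documents.
MAX_FILE_TEXT_CHARS = 1_000_000


def build_file_text_update(extracted_text: str,
                           max_chars: int = MAX_FILE_TEXT_CHARS) -> str:
    """Build a JSON partial-update body that sets the file_text field.

    The text is whitespace-normalized first so the cap is spent on
    content rather than on layout padding left over from extraction.
    """
    normalized = " ".join(extracted_text.split())
    truncated = normalized[:max_chars]
    # "doc" requests a partial update: only file_text is replaced.
    return json.dumps({"doc": {"file_text": truncated}})


if __name__ == "__main__":
    sample = "Quarterly   report\n\n  page 1 of 200"
    print(build_file_text_update(sample, max_chars=16))
```

The resulting string would be sent as the request body of an update call against the document in the WIKI_general_first index. Note that a later reindex by CirrusSearch may overwrite a field written this way.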

SmartK (talkcontribs)
173.164.76.121 (talkcontribs)

Hello Everyone,

I have just added the CirrusSearch extension with all its dependencies. I am not able to search through PDF, txt, and docx files. Please help!

Regards,

Dgennaro (talkcontribs)

I am also not able to index documents. I have Image Authorization configured.

Andreas Plank (talkcontribs)

I got PDF search working but not for *.doc files (PDF search works on MW 1.26.2 and MW 1.28.2). You need at least to

  1. Does anybody have a solution for searching inside *.doc files yet?
  2. Did I miss some configuration to set up?
  3. Would it need a FileHandler for doc files to get it working?
SmartK (talkcontribs)
S0ring (talkcontribs)
CtapMaddog (talkcontribs)

A new option: Extension:TikaAllTheFiles. This extension uses Tika to do content extraction (text and/or metadata), and provides the content to CirrusSearch for indexing.
