Question: Can this extension search inside documents that have been uploaded to the wiki, such as PDFs?
Topic on Extension talk:CirrusSearch/Flow
Yes, this is possible since gerrit:101252 that resolved bugzilla:6421. See also the first item in m:Tech/News/2014/01.
I've just backported most of the features in Cirrus' master branch to the REL1_22 branch, including this. If you want to try it, make sure to get the new version of the Elastica plugin on its REL1_22 branch as well, and rebuild your index.
Do we need to force index the PDF files? I'm seeing no results from PDFs.
First of all, try a null edit and wait some time (at most a few hours) for the job queue; report back if that wasn't enough.
I've been working on a method that parses document files (PDFs, Word, PPT, etc.) using Tika to extract the document text, and then re-inserts the extracted text into the file_text field of the WIKI_general_first index inside Elasticsearch. On this point, I have a couple of questions: 1) Does this sound like the proper method to provide searchable text from documents in CirrusSearch? 2) Has anyone else done anything similar?
On point 2, the reason I ask is that for some documents I'm extracting text from, the resulting text can be huge (100s of MBs) and can grind the search to a halt for some queries (mostly for terms that occur rarely in the index).
Any pointers would be greatly appreciated.
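A minimal sketch of the approach described above, assuming the text has already been extracted (for example by Tika) and that the target field is file_text as mentioned. The function names, the character cap, and the idea of truncating oversized text are illustrative assumptions, not part of CirrusSearch. The code only builds the JSON body for an Elasticsearch partial-document update; it does not send it, and the exact update URL depends on your Elasticsearch version.

```python
import json

# Hypothetical cap; pick a value that suits your index and hardware.
MAX_FILE_TEXT_CHARS = 1_000_000


def truncate_text(text, limit=MAX_FILE_TEXT_CHARS):
    """Cut text to at most `limit` characters, preferring a whitespace boundary."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last whitespace so a word is not split in half.
    last_space = max(cut.rfind(" "), cut.rfind("\n"))
    return cut[:last_space] if last_space > 0 else cut


def build_file_text_update(extracted_text, limit=MAX_FILE_TEXT_CHARS):
    """Build the JSON body of a partial-document update that sets file_text."""
    body = {"doc": {"file_text": truncate_text(extracted_text, limit)}}
    return json.dumps(body)
```

Capping the stored text is one possible way to keep very large extractions from slowing down queries; whether losing the tail of a document is acceptable depends on the wiki.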
There is a working extension for this that I have been using for years: https://www.mediawiki.org/wiki/Extension_talk:FileIndexer#Resubmit --SmartK (talk) 15:32, 26 March 2015 (UTC)
Hello Everyone,
I have just added the CirrusSearch extension with all its dependencies. I am not able to search through PDF, txt, and docx files. Please help!
Regards,
I am also not able to index documents. I have Image Authorization configured.
I got PDF search working, but not for *.doc files (PDF search works on MW 1.26.2 and MW 1.28.2). At a minimum, you need to:
- set up Elasticsearch with the „Mapper Attachments Plugin“ (Elasticsearch 1.x for branches below REL1_28, with elasticsearch-mapper-attachments version 2.7.1; Elasticsearch 2.x since REL1_28, where the plugin manager will autodetect the right version of mapper-attachments)
- activate Extension:PdfHandler
- build the index exactly as the README of Extension:CirrusSearch documents it
- Does anybody have a solution for searching inside *.doc files yet?
- Did I miss some configuration to set up?
- Would it need a FileHandler for doc files to get it working?
Does anyone know how to do this with Elasticsearch 5.5 and MediaWiki 1.29.0? (MW 1.29 now requires Elasticsearch 5 or higher.) The mapper-attachments plugin has been deprecated. Its replacement is the "ingest attachment plugin". See here: https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/ingest-attachment.html
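For orientation, here is a rough sketch of what the ingest attachment plugin expects, based on the Elastic documentation linked above: a pipeline containing an attachment processor, and documents that carry the file content base64-encoded in a field. The function names and the description string are hypothetical, and this does not show how (or whether) CirrusSearch routes uploads through such a pipeline; it only builds the two JSON bodies.

```python
import base64
import json


def build_attachment_pipeline(field="data", indexed_chars=100000):
    """Body for PUT _ingest/pipeline/<name> using the attachment processor."""
    return {
        "description": "Extract text from uploaded files",
        "processors": [
            # indexed_chars limits how many characters are extracted per file.
            {"attachment": {"field": field, "indexed_chars": indexed_chars}}
        ],
    }


def build_document(raw_bytes, field="data"):
    """Body for indexing one file; the processor expects base64-encoded content."""
    return {field: base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
```

The indexed_chars setting is worth noting for large files, since it bounds the amount of extracted text that ends up in the index.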
The Extension:ExtendedSearch of BlueSpice 3.0 offers full-text search in articles and files (e.g. Microsoft Office documents and PDF files) via Elasticsearch 6+ and the Ingest Attachment Processor Plugin.
A new option: Extension:TikaAllTheFiles. This extension uses Tika to do content extraction (text and/or metadata), and provides the content to CirrusSearch for indexing.