Topic on Extension talk:CirrusSearch/Flow

Search inside uploaded documents

13 comments • 17:04, 14 December 2021 2 years ago


Xaris~mediawikiwiki (talkcontribs)

Question: Can this extension search inside documents that have been uploaded to the wiki, such as PDFs?

Reply 09:59, 11 January 2014 10 years ago

Ricordisamoa (talkcontribs)

Yes, this is possible since gerrit:101252 that resolved bugzilla:6421. See also the first item in m:Tech/News/2014/01.

Reply 20:22, 14 January 2014 10 years ago

NEverett (WMF) (talkcontribs)

I've just backported most of the features in Cirrus' master branch to the REL1_22 branch, including this. If you want to try it, make sure to get the new version of the Elastica plugin from its REL1_22 branch as well, and rebuild your index.

Reply 20:14, 26 February 2014 10 years ago

2.82.64.19 (talkcontribs)

Do we need to force-index the PDF files? I'm seeing no results from PDFs.

Reply 12:56, 6 June 2014 10 years ago

Nemo bis (talkcontribs)

First of all, try a null edit and wait some time (a few hours at most) for the job queue; report back if that isn't enough.
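For wikis with many files, the null-edit step can be approximated through the API rather than by hand. This is only a minimal sketch: it assumes the wiki exposes api.php, and it assumes that a purge with `forcelinkupdate` queues the same update jobs as a null edit, which is not confirmed in this thread. The endpoint URL and file titles are hypothetical, and the snippet only builds the request without sending it.

```python
import urllib.parse


def build_purge_request(api_url, titles):
    """Return (url, body) for a POST that purges pages and forces a links update."""
    params = {
        "action": "purge",
        "titles": "|".join(titles),   # the MediaWiki API separates titles with "|"
        "forcelinkupdate": "1",       # assumption: behaves like a null edit for reindexing
        "format": "json",
    }
    return api_url, urllib.parse.urlencode(params).encode("utf-8")


# Hypothetical wiki URL and file names, for illustration only.
url, body = build_purge_request(
    "https://wiki.example.org/w/api.php",
    ["File:Manual.pdf", "File:Report.pdf"],
)
print(url)
print(body.decode("utf-8"))
```

The body would be sent as a POST to `url`; the job queue still has to run before any results appear in search.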

Reply 10:14, 9 June 2014 10 years ago

Chris d edge (talkcontribs)

I've been working on a method that parses document files (PDFs, Word, PPT, etc.) using Tika to extract the document text, and then re-inserts the extracted text into the file_text field of the WIKI_general_first index inside Elasticsearch. On this point, I have a couple of questions: 1) Does this sound like the proper method to provide searchable text from documents in CirrusSearch? 2) Has anyone else done anything similar?

On point 2, the reason I ask is that for some documents the extracted text can be huge (hundreds of MB) and can grind the search to a halt for some queries (mostly for terms that are rare in the index).

Any pointers would be greatly appreciated.
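One way to picture the re-insert step, together with a guard against the very large extractions mentioned above, is a partial-update body that sets only `file_text` and truncates the text first. This is a sketch under assumptions, not the method used in the post: the `{"doc": ...}` shape is the one accepted by Elasticsearch's Update API, while the cap value and the function name are hypothetical, and the Tika extraction itself is not shown.

```python
import json

# Hypothetical cap on indexed characters; tune it to your own index and hardware.
MAX_FILE_TEXT_CHARS = 1_000_000


def build_file_text_update(extracted_text, max_chars=MAX_FILE_TEXT_CHARS):
    """Build a partial-update body that sets only the file_text field."""
    truncated = extracted_text[:max_chars]  # drop everything past the cap
    return json.dumps({"doc": {"file_text": truncated}})


# Small demonstration with a tiny cap so the truncation is visible.
body = build_file_text_update("x" * 50, max_chars=10)
print(body)
```

Truncating trades recall on the tail of very long documents for a bounded field size; whether that is acceptable depends on the documents involved.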

Reply Edited 14:00, 26 March 2015 9 years ago

SmartK (talkcontribs)

There is a working extension for this that I have been using for years: https://www.mediawiki.org/wiki/Extension_talk:FileIndexer#Resubmit --SmartK (talk) 15:32, 26 March 2015 (UTC)

Reply 15:32, 26 March 2015 9 years ago

173.164.76.121 (talkcontribs)

Hello Everyone,

I have just added the CirrusSearch extension with all its dependencies. I am not able to search through PDF, txt, and docx files. Please help!

Regards,

Reply 22:51, 28 December 2016 7 years ago

Dgennaro (talkcontribs)

I am also not able to index documents. I have Image Authorization configured.

Reply 19:47, 25 January 2017 7 years ago

Andreas Plank (talkcontribs)

I got PDF search working, but not for *.doc files (PDF search works on MW 1.26.2 and MW 1.28.2). At a minimum, you need to:

set up Elasticsearch with the „Mapper Attachments Plugin“ (below REL1_28: Elasticsearch 1.x plus elasticsearch-mapper-attachments version 2.7.1; since REL1_28: Elasticsearch 2.x, where the plugin manager autodetects the right version of mapper-attachments)
activate Extension:PdfHandler
build the index exactly as the README of Extension:CirrusSearch documents it
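The index-building step can be sketched as a small helper that lists the maintenance commands in order. The script names and flags below follow the CirrusSearch README as it read around these releases; treat them as assumptions and check the README shipped with your version, since names and options may differ. The install path is hypothetical, and the snippet only prints the commands rather than running them.

```python
import os


def cirrus_index_commands(mw_path):
    """Return the index bootstrap commands, in the order they should run."""
    maint = os.path.join(mw_path, "extensions", "CirrusSearch", "maintenance")
    force = os.path.join(maint, "forceSearchIndex.php")
    return [
        # Create or update the index configuration.
        ["php", os.path.join(maint, "updateSearchIndexConfig.php")],
        # First pass: index page content, skipping link counts.
        ["php", force, "--skipLinks", "--indexOnSkip"],
        # Second pass: fill in link counts without re-parsing pages.
        ["php", force, "--skipParse"],
    ]


# Hypothetical MediaWiki install path, for illustration only.
for cmd in cirrus_index_commands("/var/www/mediawiki"):
    print(" ".join(cmd))
```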

Does anybody have a solution for searching inside *.doc files yet?
Did I miss some configuration step?
Would it need a FileHandler for doc files to get it working?

Reply 12:35, 21 July 2017 7 years ago

SmartK (talkcontribs)

Does anyone know how to do this with Elasticsearch 5.5 and MediaWiki 1.29.0? (MW 1.29 now requires Elasticsearch 5 or higher.) The mapper-attachments plugin has been deprecated. The new thing is the "ingest attachment plugin". See here: https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/ingest-attachment.html
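For orientation, the linked plugin documentation describes a two-part flow: define an ingest pipeline with an `attachment` processor, then index documents whose source field holds the base64-encoded file. The sketch below only builds those two JSON bodies. The function names and the field name `data` are illustrative, and how to wire such a pipeline into CirrusSearch is exactly the open question here, so nothing below should be read as a CirrusSearch configuration.

```python
import base64
import json


def attachment_pipeline_body(field="data"):
    """Body for PUT _ingest/pipeline/<name>: one attachment processor on `field`."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(file_bytes, field="data"):
    """Document body whose `field` carries the file as base64, as the processor expects."""
    return {field: base64.b64encode(file_bytes).decode("ascii")}


print(json.dumps(attachment_pipeline_body()))
print(json.dumps(attachment_document(b"hello")))
```

A document built this way would be indexed with the `pipeline` request parameter set to the pipeline's name, so that the processor runs at ingest time.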

Reply 14:04, 16 August 2017 7 years ago

S0ring (talkcontribs)

The Extension:ExtendedSearch of BlueSpice 3.0 offers full-text search in articles and files (i.e. Microsoft Office documents and PDF files) via Elasticsearch 6+ and the Ingest Attachment Processor Plugin.

Reply 17:25, 10 July 2019 5 years ago

CtapMaddog (talkcontribs)

A new option: Extension:TikaAllTheFiles. This extension uses Tika to do content extraction (text and/or metadata), and provides the content to CirrusSearch for indexing.

Reply 17:04, 14 December 2021 2 years ago
