Extension:TikaAllTheFiles
TikaAllTheFiles | |
---|---|
Release status | beta |
Implementation | Media, Search |
Description | Using Apache Tika, provides text and metadata extraction for thousands of file types, enabling full-text search of almost any uploaded file |
Author(s) | Matt Marjanovic (CtapMaddog) |
Maintainer(s) | Center for Transparent Analysis and Policy |
Latest version | 2.0.0 (2024-04-20) |
Compatibility policy | Master maintains backward compatibility. |
MediaWiki | 1.37+ |
PHP | 8.1+ |
Database changes | No |
Composer | centertap/tika-all-the-files |
License | GNU General Public License 3.0 or later |
Download | GitHub (README.md, RELEASE-NOTES.md) |
The TikaAllTheFiles (TATF) extension facilitates full-text search over uploaded files, by using the Apache Tika content analysis toolkit, which "detects and extracts metadata and text from over a thousand different file types".
In practical terms: if you already have Extension:CirrusSearch set up and working on your wiki, TATF will allow you to perform full-text searches over the contents of almost any uploaded file, not just PDFs.
TATF's features and capabilities:
- extract embedded digital text from any type of uploaded file so that it can be indexed for full-text search;
- extract and index printed text from bitmap image files and from images embedded in document files, e.g., image-only PDFs (requires Tesseract OCR);
- extract metadata from any type of uploaded file for display on File: pages;
- index metadata properties along with text, to enable simple searching for properties within full-text search.
Installation
This extension can be installed using Composer.
The complete installation and configuration instructions can be found in README.md.
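As a rough illustration of the Composer route, the commands below follow the pattern commonly used for Composer-managed MediaWiki extensions. The package name comes from the infobox above. The installation path, the use of `composer.local.json`, and the exact flags are assumptions rather than the extension's documented procedure, so defer to README.md where they differ.

```shell
# Run from the MediaWiki installation root (this path is a placeholder).
cd /path/to/mediawiki

# Record the requirement in composer.local.json, the file MediaWiki
# conventionally reserves for site-specific packages (assumption: your
# wiki uses the default composer-merge-plugin setup).
COMPOSER=composer.local.json composer require --no-update centertap/tika-all-the-files

# Download and install the package and its dependencies.
composer update --no-dev centertap/tika-all-the-files
```

Extensions installed this way are typically enabled afterwards by adding `wfLoadExtension( 'TikaAllTheFiles' );` to `LocalSettings.php`. Connecting the extension to a running Apache Tika service is a separate step covered in README.md.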
Configuration parameters
The complete description of configuration parameters can be found in README.md.