Extension:Lucene-search: Difference between revisions

From mediawiki.org
No edit summary
update version info
Line 5:
  |author = [[User:Rainman|Robert Stojnić]]
  |image =
- |version = 2.1.2 (devel)<br/>2.0.2 (stable)
+ |version = 2.1.3 (devel)<br/>2.0.2 (stable)
  |mediawiki = 1.5+
  |download =
  [http://sourceforge.net/project/showfiles.php?group_id=215674 Binary for Java6 (2.1)]
- [http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ SVN (2.1)]<br/>
+ [http://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-2/ SVN (2.1)]<br/>
- [http://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-2/ SVN (2.0)]<br/>
  [http://sourceforge.net/project/showfiles.php?group_id=215674 Binary (2.0) ]<br/>
  |readme = [http://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-2/README.txt README.txt (stable)]

Revision as of 17:53, 19 March 2010

MediaWiki extensions manual
lucene-search
Release status: beta
Implementation: Search
Description: Search engine for MediaWiki
Author(s): Robert Stojnić
Latest version: 2.1.3 (devel), 2.0.2 (stable)
MediaWiki: 1.5+
License: GPL
Download: Binary for Java6 (2.1), SVN (2.1), Binary (2.0), README.txt (stable)

Lucene-search is a search engine designed to index and search MediaWiki content on large websites. It is based on the Lucene search API and extends it to provide ranking based on the number of backlinks, distributed searching and indexing, wikitext parsing, incremental updates, and more. This is the search engine currently used on Wikimedia wikis.

MediaWiki can use Extension:LuceneSearch (pre 1.13) or Extension:MWSearch (1.13+) to fetch results from this search engine.

Note: This extension is designed for large wikis - smaller sites may want to consider Extension:SphinxSearch or Extension:EzMwLucene.

Versions

2.1 (devel) - used on all WMF wikis
Features: "Did you mean..", highlighting, ranking based on proximity, relatedness and anchor text
2.0.2 (stable)
Features: distributed search, scalability, basic ranking, accentless search

The following documentation is for the latest development version (2.1). The old documentation is at Extension:lucene-search/2.0 docs.

Installation

Experienced-User-Comment: SVN installations appear to provide the least hassle.

Requires: Linux, Java 6+ (OpenJDK or Sun), Apache Ant 1.6, Rsync (for distributed architecture)

Note for Windows users: the LSearch daemon from version 2.0 onward does not support the Windows platform, since it uses hard and soft file links. (It should be possible to get this to work in Vista with enough fiddling.) You can still use the old daemon written in C#; the installation instructions are at m:Installing lucene search.

This page describes the installation of the lucene-search extension, version 2.1. For installation of version 2.0, see Extension:Lucene-search/2.0 docs.

The rest of the documentation assumes Linux.

  • Before installation, make sure that AdminSettings.php is set up. Rename AdminSettings.sample to AdminSettings.php and modify it so that it contains:
 $wgDBadminuser = "database_admin_username";
 $wgDBadminpassword  = "database_admin_password";
  • Download the binary release and unpack it, or get the latest version from SVN and run "ant" to build the jar.
  • Generate configuration files by running:
 ./configure <path to mediawiki root directory>
This script examines your MediaWiki installation and generates matching configuration files. Before running configure, you can customize some options in template/simple/lsearch-global.conf, for example the language option. See Extension:Lucene-search/2.0 docs#Global configuration for more details about these options.
  • If configure finished without exceptions, build the indexes:
 ./build
This builds the search, highlight and spellcheck indexes from an XML database dump. For small wikis, just put this script into a daily cron job and the installation is done; move on to Running.
For larger wikis, install the Extension:OAIRepository MediaWiki extension and, after building the initial index, use the incremental updater:
 ./update
This fetches the latest updates from your wiki and updates the various indexes with search, page link and spellcheck data. Put this into a daily cron job to keep the indexes up to date.
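The daily cron job mentioned above could be wrapped in a small script. The following is a minimal sketch, not part of lucene-search itself: the function name `run_index_job`, the installation path in the comments and the log file location are all hypothetical and should be adapted to your setup.

```shell
#!/bin/sh
# Hypothetical cron wrapper around the ./build and ./update scripts.
# Usage: run_index_job <lucene-search directory> <build|update> [log file]
run_index_job() {
    install_dir="$1"                                # e.g. /usr/local/search/ls2 (assumed path)
    job="$2"                                        # "build" or "update"
    log_file="${3:-$install_dir/index-$job.log}"    # assumed default log location

    # Only the two indexing scripts described on this page are accepted.
    case "$job" in
        build|update) ;;
        *) echo "unknown job: $job" >&2; return 2 ;;
    esac

    if [ ! -x "$install_dir/$job" ]; then
        echo "missing or non-executable: $install_dir/$job" >&2
        return 1
    fi

    # The scripts are invoked as ./build and ./update, so run them
    # from inside the installation directory and append output to the log.
    ( cd "$install_dir" && "./$job" ) >>"$log_file" 2>&1
}

# A script called from cron would end with a call such as:
#   run_index_job /usr/local/search/ls2 update
# and an (assumed) crontab entry running it daily at 03:00 would look like:
#   0 3 * * * /usr/local/bin/lsearch-cron.sh
```

Small wikis would pass `build`, larger wikis with Extension:OAIRepository installed would pass `update`. Appending to a log keeps cron from mailing the full indexer output every day while still leaving a record to inspect.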

Running

Once the indexes have been built and MWSearch installed, run the daemon:

 ./lsearchd

The daemon listens on port 8123 for incoming search requests from MediaWiki, and on port 8321 for incoming incremental updates for the index. The MWSearch extension reroutes all search requests to this daemon.

You can test the search results by browsing to a URL of the form:

http://<hostname>:8123/search/<database_name>/<your_test_query>

For example, http://localhost:8123/search/wikidb/hello.
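The same check can be done from the command line. This is a minimal sketch assuming curl is installed; the function names `search_url` and `check_search` are hypothetical helpers, and the query is assumed to be a single URL-safe word, since no URL encoding is performed.

```shell
#!/bin/sh
# Build the test URL described above: http://<hostname>:8123/search/<database_name>/<query>
# Usage: search_url <hostname> <database_name> <query> [port]
search_url() {
    host="$1"
    db="$2"
    query="$3"
    port="${4:-8123}"    # default search port of lsearchd, as stated on this page
    printf 'http://%s:%s/search/%s/%s\n' "$host" "$port" "$db" "$query"
}

# Fetch the raw response from the daemon; returns non-zero if the
# daemon is unreachable or answers with an HTTP error.
check_search() {
    curl --silent --fail --max-time 10 "$(search_url "$@")"
}

# Example (requires a running lsearchd):
#   check_search localhost wikidb hello
```

A non-zero exit status from `check_search` usually means the daemon is not running or the port is blocked. The format of the response body is not described here, so the sketch only prints it as received.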

Further instructions

This extension supports many advanced options, such as a distributed search architecture, index updates with custom rotation exceptions, and multiple wikis. However, the documentation for these options is currently scattered across Javadoc comments in the source. The old documentation and this page's talk page might provide further information.

A brief discussion of the algorithms used is available at User:Rainman/search internals.

How to...

... put the indexer on a different host

  • In lsearch-global.conf, find the [Index] section and replace the search host name there with the new indexer host name.
  • Copy your lucene-search installation (with config files and indexes) to the new indexer host
  • On the indexer, edit /etc/rsyncd.conf and add these lines:
 [search]
 path = <put your local path to indexes here>
 comment = Lucene Search 2 index data
 read only = yes

The local path to indexes is just the indexes/ subdirectory of your lucene-search installation on the indexer.

  • Run rsyncd via rsync --daemon
  • Transfer the appropriate cron jobs to the indexer (e.g. build or update)
  • (Re)start lsearchd on the indexer and the searcher

After a new index is built on the indexer (e.g. via a daily cron job), the searcher will pick it up, transfer it and use it.
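Before restarting the daemons, it can help to confirm from the searcher host that the rsync module on the indexer is reachable. The following is a minimal sketch under the assumption that the module is named `search`, as in the rsyncd.conf fragment above; the function names and the example host name are hypothetical.

```shell
#!/bin/sh
# Build the rsync URL of the index module exported by the indexer.
# Usage: index_module_url <indexer host> [module name]
index_module_url() {
    indexer_host="$1"
    module="${2:-search}"    # module name taken from the [search] section above
    printf 'rsync://%s/%s/\n' "$indexer_host" "$module"
}

# List the top level of the exported indexes directory without copying anything.
# Fails if rsyncd is not running on the indexer or the module is missing.
list_index_module() {
    rsync --list-only "$(index_module_url "$@")"
}

# Example, run on the searcher (host name is a placeholder):
#   list_index_module indexer.example.org
```

If the listing shows the contents of the indexes/ directory, the rsync side of the setup is working. This does not verify that lsearchd itself is configured to fetch from that host.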

Note, however, that this produces two different sets of configuration files, and you need to update both of them on any subsequent change. A better idea is to share the lsearch-global.conf file via NFS, or to put it on a URL (to do this, edit the lsearch-global.conf location in lsearch.conf).

See also