User:OrenBochman/Search/Spec
Introduction
Before proposing the NG Search, I decided to review the current search engine, Lucene-search, developed by Rainman.
Lucene Search 'Spec'
This section attempts to outline the existing search engine as an informal spec, with criticism in the body as comments or questions. Please provide additional information or corrections as you are able.
- MWSearch is the gateway between MediaWiki and Lucene-search
- Listens on port 8123 for search requests.
- Listens on port 8321 for index updates.
User Guide
- Search Engine Features
Lucene Search Bugs
Features
- Distributed index - due to its size, the index is distributed across multiple machines.[1]
- Offline Indexing - starts by indexing an XML_dump [2] and produces:
- a Search index
- Q. with what fields, boosting?
- a Highlight index
- Q. is this necessary with document term vectors now available?
- Q. with what fields, boosting?
- Spellcheck indexes - support for did you mean
- Fields
- 2-grams of the Wikipedia full text, within minimum and maximum frequency thresholds
- All titles
- Boosting for
- Titles
- Section Headers
- First Paragraph
- Redirects
- Q. In which source file is the query cooking formula?
MediaWiki Cluster Configuration
Based on:
Summary
- Search Machines
- Did You Mean
- Highlighting Configuration
[Database] Section Configuration Format
- # starts a single-line comment.
- wikidb : (single) (language,en) declares that wikidb is a single (nondistributed) index and that it should use English as its default language, and thus use English stemming.
- {file:///home/wikipedia/common/pmtpa.dblist} imports a database list from a file.
- The optional (warmup,100) instructs Lucene to apply 100 queries to the index when an updated version is fetched, to warm it up. This enables a smooth transition in performance and ensures indexes are always well cached and buffered.
- Make sure there are no spaces in the arguments (e.g. (warmup,10)). Spaces can lead to a failure to create the search, snapshot or index folders when building the index.
- wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[]) declares that wikilucene should be split (distributed) into 3 indexes according to namespaces, where shard 1, called nspart1, has namespace 0; shard 2, called nspart2, has namespaces 4, 5, 12 and 13; and shard 3, called nspart3, has the other namespaces.
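The line format described above can be illustrated with a small parser. This is only a sketch in Python, not the Lucene-search implementation (which is Java): the function names `parse_db_line`, `parse_ns_list` and `shard_for_namespace` are hypothetical, and treating an empty namespace list as "all other namespaces" follows the description of nspart3 above.

```python
import re

# Splits "names : (options...)" at the first colon that is followed by "(",
# so that the colon inside {file:///...} is not mistaken for the separator.
LINE_RE = re.compile(r"^(.*?)\s*:\s*(\(.*)$")
OPTION_RE = re.compile(r"\(([^()]*)\)")
NS_LIST_RE = re.compile(r"\[([^\]]*)\]")


def parse_db_line(line):
    """Return (database names, {option: raw argument string})."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a database line: %r" % line)
    names = [n.strip() for n in match.group(1).split(",") if n.strip()]
    options = {}
    for body in OPTION_RE.findall(match.group(2)):
        key, _, rest = body.partition(",")
        options[key.strip()] = rest.strip()
    return names, options


def parse_ns_list(value):
    """Extract the namespace numbers from an argument such as '[4,5,12,13]'."""
    match = NS_LIST_RE.search(value)
    if match is None:
        return []
    return [int(x) for x in match.group(1).split(",") if x.strip()]


def shard_for_namespace(options, ns):
    """Pick the nspartN shard for a namespace (assumed: empty list = the rest)."""
    catch_all = None
    for key, value in options.items():
        if not key.startswith("nspart"):
            continue
        namespaces = parse_ns_list(value)
        if ns in namespaces:
            return key
        if not namespaces:
            catch_all = key
    return catch_all


names, options = parse_db_line(
    "wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])"
)
print(names, options["nssplit"], shard_for_namespace(options, 12))
```

The sketch ignores the stray comma between options and does not interpret the extra positional arguments that appear in the raw data (e.g. true,20,500).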
[Database] Raw Data
[Database]
#wikilucene : (single) (language,en) (warmup,0)
#wikidev : (single) (language,sr)
#wikilucene : (nssplit,3) (nspart1,[0]) (nspart2,[4,5,12,13]), (nspart3,[])
#wikilucene : (language,en) (warmup,10)
#format:
#database_name (, database_name)+ :([single|mainsplit|nssplit],[SHRAD-COUNT],[TRUE|FALSE],[IDX_BUFFER_DOCS],[IDX_MERGE_FACTOR]) (language,en) (warmup,10)
{file:///home/wikipedia/common/pmtpa.dblist} : (single,true,20,1000) (prefix) (spell,10,3)
enwiki : (nssplit,2)
enwiki : (nspart1,[0],true,20,500,2)
enwiki : (nspart2,[],true,20,500)
enwiki : (spell,40,10) (warmup,500)
mediawikiwiki, metawiki, commonswiki, strategywiki : (language,en)
commonswiki : (nssplit,2) (nspart1,[6]) (nspart2,[])
dewiki, frwiki : (spell,20,5)
dewiki, frwiki, itwiki, ptwiki, jawiki, plwiki, nlwiki, ruwiki, svwiki, zhwiki : (nssplit,2) (nspart1,[0,2,4,12,14]) (nspart2,[])
[Database-Group] Configuration Format
- TODO: research and document
[Database-Group] Raw Data
<all> : (titles_by_suffix,2) (tspart1,[ wiki|w ]) (tspart2,[ wiktionary|wikt, wikibooks|b, wikinews|n, wikiquote|q, wikisource|s, wikiversity|v])
sv-titles: (titles_by_suffix,2) (tspart1,[ svwiki|w ]) (tspart2,[ svwiktionary|wikt, svwikibooks|b, svwikinews|n, svwikiquote|q, svwikisource|src])
mw-titles: (titles_by_suffix,1) (tspart1, [ mediawikiwiki|mw, metawiki|meta ])
[Search-Group] Configuration Format
- TODO: research and document
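Pending that research, one reading of the raw data in the next section is that each line maps a search host to the index parts it serves, and that a host may be listed on several lines. The Python sketch below works under that assumption only; `parse_search_group` and `hosts_for` are hypothetical names, and the wildcard forms seen in the raw data (such as `*?`, `en-titles*` or the `(?!...)` patterns) are kept as literal strings rather than expanded.

```python
from collections import defaultdict


def parse_search_group(text):
    """Return {host: [index parts]}, merging repeated host lines."""
    hosts = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("["):  # skip blanks and section headers
            continue
        host, _, parts = line.partition(":")
        hosts[host.strip()].extend(parts.split())
    return dict(hosts)


def hosts_for(hosts, index_part):
    """List the hosts that name an index part literally."""
    return sorted(h for h, parts in hosts.items() if index_part in parts)


SAMPLE = """
[Search-Group]
# search 2 (de,fr,jawiki)
search6: dewiki.nspart1 dewiki.nspart2
search6: itwiki.nspart1.hl
search15: dewiki.nspart1.hl
search16: dewiki.nspart1.hl
"""

layout = parse_search_group(SAMPLE)
print(layout["search6"])
print(hosts_for(layout, "dewiki.nspart1.hl"))
```

Inverting the layout this way makes it easy to see how many hosts replicate a given index part, which is one of the things a spec for this section would need to record.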
[Search-Group] Raw Data
# Search hosts layout
[Search-Group]
# search 1 (enwiki)
search1: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search2: enwiki.nspart1.sub1.hl enwiki.spell #enwiki.nspart1.sub2.hl
search3: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search4: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search5: enwiki.nspart1.sub2.hl enwiki.spell #enwiki.nspart1.sub1.hl
search8: enwiki.prefix #enwiki.spell
search9: enwiki.nspart1.sub1 enwiki.nspart1.sub2
search12: enwiki.spell
search13: enwiki.nspart2*
search13x: en-titles*
search14: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search19: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
search20: enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# search 2 (de,fr,jawiki)
search6: dewiki.nspart1 dewiki.nspart2 frwiki.nspart1 frwiki.nspart2 jawiki.nspart1 jawiki.nspart2
search6: itwiki.nspart1.hl
search15: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search16: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
search17: dewiki.nspart1.hl dewiki.nspart2.hl frwiki.nspart1.hl frwiki.nspart2.hl
# search 3 (it,nl,ru,sv,pl,pt,es,zhwiki)
search7: itwiki.nspart1 itwiki.nspart2 nlwiki.nspart1 nlwiki.nspart2 ruwiki.nspart1 ruwiki.nspart2 svwiki.nspart1
search7: svwiki.nspart2 plwiki.nspart1 plwiki.nspart2 eswiki ptwiki.nspart1 ptwiki.nspart2 zhwiki.nspart1 zhwiki.nspart2
search15: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search15: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search15: ptwiki.nspart1.hl ptwiki.nspart2.hl
search16: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search16: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search16: ptwiki.nspart1.hl ptwiki.nspart2.hl
search17: itwiki.nspart1.hl itwiki.nspart2.hl nlwiki.nspart1.hl nlwiki.nspart2.hl ruwiki.nspart1.hl ruwiki.nspart2.hl
search17: svwiki.nspart1.hl svwiki.nspart2.hl plwiki.nspart1.hl plwiki.nspart2.hl eswiki.hl
search17: ptwiki.nspart1.hl ptwiki.nspart2.hl
# search 2-3 interwiki/spellchecks
search10x: de-titles* ja-titles* it-titles* nl-titles* ru-titles* fr-titles*
search10x: sv-titles* pl-titles* pt-titles* es-titles* zh-titles*
search10: dewiki.spell frwiki.spell itwiki.spell nlwiki.spell ruwiki.spell
search10: svwiki.spell plwiki.spell ptwiki.spell eswiki.spell
# search 4
search11x: commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2
search11: commonswiki.nspart1 commonswiki.nspart1.hl commonswiki.nspart2.hl
search11: commonswiki.nspart2
search11: *?
search11x: *tspart1 *tspart2
search19: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.))*.spell
search12: (?!(enwiki.|dewiki.|frwiki.|itwiki.|nlwiki.|ruwiki.|svwiki.|plwiki.|eswiki.|ptwiki.|jawiki.|zhwiki.))*.hl
# prefix stuffs
search18: *.prefix
# stuffs to deploy in future
searchNone: *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl
searchNone: enwiki.spell enwiki.nspart1.sub1.hl enwiki.nspart1.sub2.hl
# Indexers
[Index]
searchidx2: *

# Rsync path where indexes are on hosts, after default value put
# hosts where the location differs
# Syntax: host : <path>
[Index-Path]
<default> : /search
[OAI]
simplewiki : http://simple.wikipedia.org/w/index.php
rswikimedia : http://rs.wikimedia.org/w/index.php
ilwikimedia : http://il.wikimedia.org/w/index.php
nzwikimedia : http://nz.wikimedia.org/w/index.php
sewikimedia : http://se.wikimedia.org/w/index.php
alswiki : http://als.wikipedia.org/w/index.php
alswikibooks : http://als.wikibooks.org/w/index.php
alswikiquote : http://als.wikibooks.org/w/index.php
alswiktionary : http://als.wiktionary.org/w/index.php
chwikimedia : http://www.wikimedia.ch/w/index.php
crhwiki : http://chr.wikipedia.org/w/index.php
roa_rupwiki : http://roa-rup.wikipedia.org/w/index.php
roa_rupwiktionary : http://roa-rup.wiktionary.org/w/index.php
be_x_oldwiki : http://be-x-old.wikipedia.org/w/index.php
ukwikimedia : http://uk.wikimedia.org/w/index.php
brwikimedia : http://br.wikimedia.org/w/index.php
dkwikimedia : http://dk.wikimedia.org/w/index.php
trwikimedia : http://tr.wikimedia.org/w/index.php
arwikimedia : http://ar.wikimedia.org/w/index.php
mxwikimedia : http://mx.wikimedia.org/w/index.php
[Namespace-Boost]
commonswiki : (0, 1) (6, 4)
<default> : (0, 1) (1, 0.0005) (2, 0.005) (3, 0.001) (4, 0.01), (6, 0.02), (8, 0.005), (10, 0.0005), (12, 0.01), (14, 0.02)
# Global properies
[Properties]
# suffixes to database name, the rest is assumed to be language code
Database.suffix=wiki wiktionary wikiquote wikibooks wikisource wikinews wikiversity wikimedia
# Allow only up to 500 results per page
Search.maxlimit=501
# Age scaling based on last edit, default is no scaling
# Below are suffixes (or whole names) with various scaling strength
AgeScaling.strong=wikinews
AgeScaling.medium=mediawikiwiki metawiki
#AgeScaling.weak=wiki
# Use additional per-article ranking data, more suitable for non-encyclopedias
AdditionalRank.suffix=mediawikiwiki metawiki
# suffix for databases that should also have exact-case index built
# note: this will also turn off stemming!
ExactCase.suffix=wiktionary jbowiki
# wmf-style init file, attempt to read OAI and lang info from it
# for sample see http://noc.wikimedia.org/conf/InitialiseSettings.php.html
#WMF.InitialiseSettings=file:///home/wikipedia/common/php-1.5/InitialiseSettings.php
#WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-deployment/wmf-config/InitialiseSettings.php
WMF.InitialiseSettings=file:///home/wikipedia/common/wmf-config/InitialiseSettings.php
# Where common images are
Commons.wiki=commonswiki.nspart1
# Syntax: <prefix_name> : <coma separated list of namespaces>
# <all> is a special keyword meaning all namespaces
# E.g. all_talk : 1,3,5,7,9,11,13,15
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
[100] : 100
[101] : 101
[104] : 104
[105] : 105
[106] : 106
[0,6,12,14,100,106]: 0,6,12,14,100,106
[0,100,104] : 0,100,104
[0,2,4,12,14] : 0,2,4,12,14
[0,14] : 0,14
[4,12] : 4,12
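Two pieces of the configuration above lend themselves to a short illustration: the Database.suffix rule ("the rest is assumed to be language code") and the (namespace, boost) pairs of [Namespace-Boost]. The Python sketch below is an interpretation of the comments in the config, not the Lucene-search code; `split_dbname` and `parse_namespace_boost` are hypothetical names, and trying the longest suffix first is an assumption made here for safety.

```python
import re

# Value of Database.suffix from the [Properties] section.
SUFFIXES = (
    "wiki wiktionary wikiquote wikibooks wikisource "
    "wikinews wikiversity wikimedia"
).split()

PAIR_RE = re.compile(r"\(\s*(\d+)\s*,\s*([\d.]+)\s*\)")


def split_dbname(dbname, suffixes=SUFFIXES):
    """Split a database name into (language code, suffix)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if dbname.endswith(suffix) and len(dbname) > len(suffix):
            return dbname[: -len(suffix)], suffix
    return None, None


def parse_namespace_boost(value):
    """Turn '(0, 1) (6, 4)' into {0: 1.0, 6: 4.0}."""
    return {int(ns): float(boost) for ns, boost in PAIR_RE.findall(value)}


print(split_dbname("svwiktionary"))
print(split_dbname("mediawikiwiki"))
print(parse_namespace_boost("(0, 1) (6, 4)"))
```

Under this rule mediawikiwiki yields the "language" mediawiki, which is presumably why the [Database] section sets (language,en) explicitly for such wikis.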
The Algorithms
The Ranking Algorithm
- Ranking system[1]:
- PageRank-like algorithm in the sense of reference-to-article counting.
- It may not work so well if one indexes only a single Wikipedia, since
- the link graph is too sparse for specialist pages.
- a few pages are link hogs (e.g. the year 1945)
- an effective PageRank also needs a good MapReduce implementation to run fast.
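"Reference-to-article counting" can be pictured as counting, for each article, how many other articles link to it. The following Python sketch shows that idea only; it is not the formula used by Lucene-search, and counting each linking article once and ignoring self-links are assumptions of the example. The function name `inbound_link_counts` is hypothetical.

```python
from collections import Counter


def inbound_link_counts(links):
    """links maps a source title to the titles it links to."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each linking article once
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


links = {
    "A": ["1945", "B"],
    "B": ["1945"],
    "C": ["1945", "1945"],
    "1945": [],
}
counts = inbound_link_counts(links)
print(counts.most_common())
```

Even in this toy graph the year page collects most of the references while a specialist page such as "A" collects none, which is the sparsity and link-hog concern raised above.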
Did You Mean? Algorithm
- Did you mean - query correction (phrases and words)
- Q. What information is important or representative of an article? (often more informative than PageRank)
- beginning of articles,
- redirects,
- words used to refer to the article,
- section captions
- Q. Is what disambiguates the article from related terms its context?
- extracted frequently co-occurring article titles from all of Wikipedia to derive article associations.
- There was no open source "Did you mean..." engine at that time (there are now).
- There are programs like aspell, but all of them spell-check only single words.
- the algorithm is 2-grams of all words in the language, with frequency thresholds (min and max).
- would be improved by a language model (morphology + semantics)
- can fix some simple errors, but is not powerful enough.
- added scoring via heuristics.
- added a special score to boost 2-grams that are in titles,
- added whole titles,
- "fuzzy" 2-grams of words that might provide context for a title word, by taking all words from redirects and links in the first paragraph of the article. (PLEASE CLARIFY)
- checks the search results to see if the rare spelling a user entered is significant
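The notes above do not say whether the 2-grams are pairs of adjacent words or character 2-grams within a word. The Python sketch below assumes pairs of adjacent words, since the feature corrects phrases as well as words, and shows only the frequency-threshold step; the tokenizer, the threshold values and the names `word_bigrams` and `build_bigram_index` are all hypothetical.

```python
import re
from collections import Counter


def word_bigrams(text):
    """Lowercase the text and return its adjacent word pairs."""
    words = re.findall(r"\w+", text.lower())
    return list(zip(words, words[1:]))


def build_bigram_index(documents, min_freq, max_freq):
    """Keep only the 2-grams whose frequency lies within [min_freq, max_freq]."""
    counts = Counter()
    for doc in documents:
        counts.update(word_bigrams(doc))
    return {bg: n for bg, n in counts.items() if min_freq <= n <= max_freq}


docs = [
    "new york city",
    "new york times",
    "of the new york",
    "of the",
    "of the",
    "of the",
]
index = build_bigram_index(docs, min_freq=2, max_freq=3)
print(index)
```

The minimum threshold drops one-off pairs (likely typos or noise), while the maximum drops very common pairs such as "of the" that carry little information for spelling suggestions.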
Solr may make it possible to drop the code for:
- configuration
- maintain consistent copies of split indexes
- smooth updates from indexer to searchers
- Contact rainman (Robert Stojnić, rainman-sr), who developed Extension:Lucene-search and maintained the search servers.
- Rainman/search_internals
- (Consult his thesis)
- Consult the unit tests
- Consult the API
- Consult search-related bugs
- Write a spec