User:OrenBochman/Search/NGSpec
- The ultimate goal is to make searching simple and satisfactory.
Secondary goals are:
- Improve precision and recall.
- Evaluate components by the knowledge & intelligence they can expose.
- Use infrastructure effectively.
- Low edit-to-index time.
Features
Standard Features
Ranking
Wiki Specific Features
Media Support
Performance & Scalability
UI & Admin UI
Search Analytics
Crowdsourceable Components
Indexing
Cache Analytics
Reputations
A document's reputation is derived from intrinsic factors (I) and extrinsic factors (E).
Intrinsic reputation
Extrinsic reputation
Preprocessing
Lexical
Semantic
Cross-Language
Brainstorm Some Search Problems
LSDAEMON vs Apache Solr
As search evolves it might be prudent to migrate to Apache Solr[3] as a stand-alone search server instead of the LSDAEMON.
Pros
- Can support many more features from the matrix above.
Cons
- May require a new MWSearch front end.
Query Expansion
Indexing Source as Opposed to HTML
Problem: Lucene search processes Wikimedia source text and not the outputted HTML.
Solution:
- Index the output HTML (placed into the cache).
- Strip unwanted tags, while boosting things like (see the sketch below):
  - Headers
  - Interwikis
  - External links
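A minimal sketch of the index-time boosting idea, assuming Lucene 3.x and hypothetical field names (body, header, interwiki, external_link) already extracted from the cached HTML; the boost values are placeholders to be tuned, not recommendations.

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;

 public class HtmlDocumentBuilder {
     // Builds a Lucene document from text already extracted from the cached HTML,
     // giving structural elements a higher weight than the plain body text.
     public static Document build(String body, String headers, String interwikis, String externalLinks) {
         Document doc = new Document();
         doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));

         Field header = new Field("header", headers, Field.Store.NO, Field.Index.ANALYZED);
         header.setBoost(3.0f);            // headers matter most (placeholder weight)
         doc.add(header);

         Field interwiki = new Field("interwiki", interwikis, Field.Store.NO, Field.Index.ANALYZED);
         interwiki.setBoost(1.5f);         // placeholder weight
         doc.add(interwiki);

         Field external = new Field("external_link", externalLinks, Field.Store.NO, Field.Index.ANALYZED);
         external.setBoost(1.2f);          // placeholder weight
         doc.add(external);
         return doc;
     }
 }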
Problem: HTML also contains CSS, scripts, and comments
Solution: Either index these too, or run a filter to remove them. Some strategies are:
- Discard all markup.
  - A markup_filter/tokenizer could be used to discard markup.
  - The Tika project can do this (see the sketch below).
  - Other ready-made solutions.
- Keep all markup.
  - Write a markup analyzer that compresses the page to reduce storage requirements (interesting if one also wants to compress output for integration into the DB or cache).
- Selective processing.
  - A table_template_map extension could be used in a strategy to identify structured information for deeper indexing.
  - This is the most promising: it can detect/filter out unapproved markup (JavaScript, CSS, broken XHTML).
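As an illustration of the "discard all markup" strategy, a minimal sketch using the Tika facade to reduce cached HTML to plain text; whether Tika's defaults drop everything we want dropped (inline CSS, scripts, comments) would need verification.

 import java.io.ByteArrayInputStream;
 import java.io.InputStream;
 import org.apache.tika.Tika;

 public class MarkupStripper {
     private final Tika tika = new Tika();

     // Returns the plain text of a cached HTML page; Tika's HTML parser
     // drops tags, scripts and styles and keeps the visible text.
     public String strip(String html) throws Exception {
         InputStream in = new ByteArrayInputStream(html.getBytes("UTF-8"));
         return tika.parseToString(in);
     }
 }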
Problem: Indexing offline and online
- Solr can access the DB directly...?
- Real-time "only" - slowly build the index in the background.
- Offline "only" - use a dedicated machine/cloud to dump and index offline.
- Dual - each time the linguistic component becomes significantly better (or there is a bug fix) it would be desirable to upgrade search. How this would be done depends very much on the architecture of the analyzer. One possible approach would be:
  - production of linguistic/entity data or a new software milestone;
  - offline analysis from a dump (XML or HTML);
  - online processing of updates, newest to oldest, with a Poisson wait-time prediction model (see the sketch below).
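A sketch of the Poisson wait-time idea: estimate each page's edit rate from recent history and prioritise re-indexing by the probability that the page has changed since it was last indexed. The class and method names are illustrative only.

 public class ReindexPriority {
     // Under a Poisson arrival model, the probability that a page with edit
     // rate lambda (edits/day) was edited at least once in the t days since
     // it was last indexed is 1 - exp(-lambda * t).
     public static double staleProbability(double editsPerDay, double daysSinceIndexed) {
         return 1.0 - Math.exp(-editsPerDay * daysSinceIndexed);
     }

     public static void main(String[] args) {
         // Example: a page edited about 0.2 times/day, last indexed 3 days ago.
         System.out.println(staleProbability(0.2, 3.0)); // ~0.45
     }
 }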
NG Search Features
Problem: Analysis and Language
Problem: Search is not aware of morphological language variation
Lexical Chain
Solution 2 - Specialized Language Support
Integrate new language-analysis resources as they become available.
- contrib locations for:
  - Lucene
    - https://svn.apache.org/repos/asf/lucene/dev/tags/
    - https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_5_0/lucene/contrib/
    - https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/; and for branch_3x (to be released next as v3.6), see
    - https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/
  - Solr
    - https://svn.apache.org/repos/asf/lucene/solr/dev/tags/
    - https://svn.apache.org/repos/asf/lucene/solr/dev/tags/lucene_solr_3_5_0/lucene/contrib/
    - https://svn.apache.org/repos/asf/lucene/solr/dev/trunk/lucene/contrib/; and for branch_3x (to be released next as v3.6), see
    - https://svn.apache.org/repos/asf/lucene/solr/dev/branches/branch_3x/lucene/contrib/
- external resources:
language | resource | status | comments
---|---|---|---
Arabic | stemmer (algorithmic): TestArabicNormalizationFilter.java at https://issues.apache.org/jira/secure/attachment/12391029/LUCENE-1406.patch | |
Arabic | stemmer (data based): http://savannah.nongnu.org/projects/aramorph | |
Chinese | SmartChineseSentenceTokenizerFactory.java and SmartChineseWordTokenFilterFactory.java | |
Hungarian | morphology | identified |
Finnish | morphology: http://gna.org/projects/omorfi | |
Hebrew | morphology | identified |
Japanese | morphology | identified |
Polish | StempelPolishStemFilterFactory.java | |
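A sketch of how the contrib analyzers might be wired in (see the sketch below for the per-language dispatch), assuming Lucene 3.5 and only a few of the languages from the table above; the language-code mapping and the fallback choice are assumptions.

 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.ar.ArabicAnalyzer;
 import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
 import org.apache.lucene.analysis.pl.PolishAnalyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.util.Version;

 public class AnalyzerRegistry {
     // Picks a per-language analyzer from the contrib modules, falling back
     // to StandardAnalyzer for languages without a dedicated resource yet.
     public static Analyzer forLanguage(String langCode) {
         if ("ar".equals(langCode)) return new ArabicAnalyzer(Version.LUCENE_35);
         if ("zh".equals(langCode)) return new SmartChineseAnalyzer(Version.LUCENE_35);
         if ("pl".equals(langCode)) return new PolishAnalyzer(Version.LUCENE_35);
         return new StandardAnalyzer(Version.LUCENE_35);
     }
 }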
- Benchmarking
  - Test suite (check resource against N-gram)
  - Acceptance test
  - Ranking suite based on "did you know..." glosses and their articles
How can search be made more interactive via facets?
- Solr instead of Lucene could provide faceted search involving categories (see the sketch below).
- The single most impressive change to search could be via facets.
- Facets can be generated via categories (though they work best in multiple shallow hierarchies).
- Facets can be generated via template analysis.
- Facets can be generated via semantic extensions. (explore)
- A focus on culture (local, wiki), sentiment, importance, and popularity (edit, view, revert) may be refreshing.
- Facets can also be generated using named-entity and relational analysis.
- Facets may have a substantial processing cost if done wrong.
- A cluster map interface might be popular.
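A sketch of what a category-faceted query could look like through SolrJ (3.x), assuming a Solr schema with a multi-valued "category" field populated from wiki categories; the server URL and field name are placeholders.

 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
 import org.apache.solr.client.solrj.response.FacetField;
 import org.apache.solr.client.solrj.response.QueryResponse;

 public class FacetedSearchExample {
     public static void main(String[] args) throws Exception {
         SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

         SolrQuery query = new SolrQuery("sun tzu");
         query.setFacet(true);
         query.addFacetField("category");   // assumed field holding wiki categories
         query.setFacetMinCount(1);

         QueryResponse response = solr.query(query);
         FacetField categories = response.getFacetField("category");
         for (FacetField.Count count : categories.getValues()) {
             System.out.println(count.getName() + " (" + count.getCount() + ")");
         }
     }
 }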
How Can Search Resolve Unexpected Title Ambiguity
- The Art of War prescribes the following advice: "know the enemy and know yourself and you shall emerge victorious in 1000 searches". (Italics are mine.)
- Google called it "I'm feeling lucky".
Ambiguity can come from:
- The lexical form of the query (bank - river, money).
- The result domain - the top search result is an exact match of a disambiguation page.
In either case the search engine should be able to make a good (measured) guess as to what the user meant and give them the desired result.
The following data is available:
- Squid cache access is sampled at 1 in 1000.
- All edits are logged too.
Instrumenting Links
- If we wanted to collect intelligence we could instrument all links to jump to a redirect page which logs <source, target, user/ip-cookie, timestamp> and then fetches the required page (see the sketch below).
- It would be interesting to have these stats for all pages.
- It would be really interesting to have these stats for disambiguation/redirect pages.
- Some of this may be available from the site logs (are there any?).
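A minimal sketch of the instrumentation as a redirect endpoint: it logs the tuple and then forwards the reader to the target page. The parameter names, the log sink, and the visitor identifier are all assumptions.

 import java.io.IOException;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;

 public class LinkRedirectServlet extends HttpServlet {
     // Logs <source, target, user/ip-cookie, timestamp> and then redirects
     // the browser to the requested target page.
     @Override
     protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
         String source = req.getParameter("source");
         String target = req.getParameter("target");
         String visitor = req.getRemoteAddr();   // or a cookie-based identifier
         long timestamp = System.currentTimeMillis();

         // In production this would go to a sampled log channel, not stdout.
         System.out.println(source + "\t" + target + "\t" + visitor + "\t" + timestamp);

         resp.sendRedirect(target);
     }
 }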
Use case 1. General browsing-history stats available for disambiguation pages
Here is a resolution heuristic:
- Use the intelligence vector <target, frequency> to jump to the most popular target (an 80% solution) - call it the "I hate disambiguation" preference.
- Use the intelligence vector <source, target, frequency> to produce document term-vector projections of source vs. target and match the most related source and target pages (this requires indexing the source; see the sketch below).
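A naive sketch of the second heuristic, scoring each candidate target against the source page by the cosine of their term-frequency vectors; how the vectors are obtained (e.g. from the index's stored term vectors) is left open.

 import java.util.Map;

 public class DisambiguationScorer {
     // Cosine similarity of two term-frequency vectors; the candidate target
     // with the highest score against the source page wins.
     public static double cosine(Map<String, Integer> source, Map<String, Integer> target) {
         double dot = 0, normS = 0, normT = 0;
         for (Map.Entry<String, Integer> e : source.entrySet()) {
             Integer tf = target.get(e.getKey());
             if (tf != null) dot += e.getValue() * tf;
             normS += e.getValue() * (double) e.getValue();
         }
         for (int v : target.values()) normT += v * (double) v;
         if (normS == 0 || normT == 0) return 0;
         return dot / (Math.sqrt(normS) * Math.sqrt(normT));
     }
 }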
Use case 2. Crowdsource local interest
Search patterns are often affected by television etc. This calls for analyzing search data and producing the intelligence vector <top memes, geo location>, refreshed every N <= 15 minutes.
- Use the intelligence vector <source, target, target freshness, frequency> together with <top memes, geo location>, if significant for the search term, to steer towards the current interest.
Use case 3. User-specific browsing history also available
- Use <source, target, frequency> as above, but with a memory <my top memes + edit history> weighted by time, to fetch personalised search results.
How can search be made more relevant via Intelligence?
- Use current page (AKA referer)
- Use browsing history
- Use search history
- Use Profile
- API for serving ads/fundraising
How Can Search Be Made More Relevant via Metadata Extraction?
While a semantic wiki is one approach to metadata collection, Apache UIMA offers the possibility of extracting metadata from free text (as well as from templates).
- entity detection.
How To Test the Quality of Search Results?
Ideally one would like a list of queries + top result, highlight, etc. for different wikis, and to test the various algorithms against it. Since data can change, one would like to use something that is stable over time (see the sketch below).
- Generated Q&A corpus.
- Snapshot corpus.
- Real-world Q&A (less robust, since real-world wiki test results will change over time).
- Some queries are easy targets (a unique article) while others are harder to find (many results).
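A sketch of how a stable snapshot corpus could be scored, here with mean reciprocal rank over <query, expected title> pairs; the corpus format and the way ranked results are obtained are placeholders.

 import java.util.List;
 import java.util.Map;

 public class RankingEvaluator {
     // Mean reciprocal rank: 'corpus' maps each query to its expected top title,
     // 'results' maps each query to the ranked titles the engine returned.
     public static double meanReciprocalRank(Map<String, String> corpus,
                                             Map<String, List<String>> results) {
         if (corpus.isEmpty()) return 0;
         double sum = 0;
         for (Map.Entry<String, String> entry : corpus.entrySet()) {
             List<String> ranked = results.get(entry.getKey());
             int rank = (ranked == null) ? -1 : ranked.indexOf(entry.getValue());
             if (rank >= 0) sum += 1.0 / (rank + 1);
         }
         return sum / corpus.size();
     }
 }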
Personalised Results via ResponseTrackingFilter
- Users' post-search actions should be tracked anonymously to test and evaluate the ranking against their needs.
- Users should be able to opt in to personalised tracking based on their view/edit history.
- This information should be integrated into the tracking algorithm as a component that can filter search.
External Links Checker
External links should be scanned once they are added (see the sketch below). This will facilitate:
- testing if a link is alive.
- testing if the content has changed.
The links should also be tracked for a frequency count.
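A minimal sketch of a checker that tests whether a link is alive and fingerprints the content so a later run can detect changes; timeouts, scheduling, and robots handling are out of scope, and all names are illustrative.

 import java.io.InputStream;
 import java.net.HttpURLConnection;
 import java.net.URL;
 import java.security.MessageDigest;

 public class ExternalLinkChecker {
     // Returns the SHA-1 of the page body, or null if the link is dead.
     // Comparing the hash against the previously stored value detects changes.
     public static String fetchDigest(String link) {
         try {
             HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
             conn.setConnectTimeout(5000);
             conn.setReadTimeout(5000);
             if (conn.getResponseCode() >= 400) return null;   // dead or forbidden

             MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
             InputStream in = conn.getInputStream();
             byte[] buf = new byte[8192];
             int n;
             while ((n = in.read(buf)) != -1) sha1.update(buf, 0, n);
             in.close();

             StringBuilder hex = new StringBuilder();
             for (byte b : sha1.digest()) hex.append(String.format("%02x", b));
             return hex.toString();
         } catch (Exception e) {
             return null;   // treat network errors as a dead link for this check
         }
     }
 }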
PLSI Field for Cross-Language Search
- Index a cross-language field with N=200 words from each language version of Wikipedia in it.
- Then run the PLSI algorithm on it.
- This will produce a matrix that associates phrases with cross-language meaning.
- It should then be possible to use the output of this index to do cross-language search.
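The model behind this is the standard PLSI/PLSA decomposition: each word-document co-occurrence is explained by latent topics z which, in this setup, would be shared across the language versions. In the usual notation:

 P(w, d) = \sum_{z} P(z)\, P(w \mid z)\, P(d \mid z)

The parameters are fit with EM by maximizing \sum_{d,w} n(d,w) \log P(w, d), where n(d,w) is the count of word w in document d; the resulting P(w | z) table is the matrix that associates phrases with cross-language meaning referred to above.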
Payloads
- Payloads allow storing and retrieving arbitrary per-token data alongside each token.
- Payloads can be used to boost at the term level (using function queries; see the sketch below).
What might go into payloads?
- HTML (logical) markup info that is stripped, e.g.:
  - isHeader
  - isEmphasized
  - isCode
- Wiki markup:
  - isLinkText
  - isImageDesc
  - TemplateNestingLevel
- Linguistic data:
  - LangId
  - LemmaId - id of the base form
  - MorphState - the lemma's morphological state
  - ProbPosNN - probability it is a noun
  - ProbPosVB - probability it is a verb
  - ProbPosADJ - probability it is an adjective
  - ProbPosADV - probability it is an adverb
  - ProbPosPROP - probability it is a proper noun
  - ProbPosUNKNOWN - probability it is other/unknown
- Semantic data:
  - ContextBasedSeme (if disambiguated)
  - LanguageIndependentSemeId
  - isWikiTitle
- Reputation:
  - Owner(ID, Rank)
  - TokenReputation
- Some can be used for ranking.
- Some can be used for cross-language search.
- Some can be used to improve precision.
- Some can be used to increase recall.
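A sketch of the index- and query-time halves in Lucene 3.x: a TokenFilter attaches a one-byte payload (here a hypothetical isHeader flag), and a custom Similarity, used together with PayloadTermQuery, turns it into a per-term boost; the flag value and the boost factor are assumptions.

 import java.io.IOException;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
 import org.apache.lucene.index.Payload;
 import org.apache.lucene.search.DefaultSimilarity;

 // Index time: mark every token of a field with a one-byte payload, e.g. 1 = header text.
 public class HeaderPayloadFilter extends TokenFilter {
     private final PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
     private final byte flag;

     public HeaderPayloadFilter(TokenStream input, byte flag) {
         super(input);
         this.flag = flag;
     }

     @Override
     public boolean incrementToken() throws IOException {
         if (!input.incrementToken()) return false;
         payloadAttr.setPayload(new Payload(new byte[] { flag }));
         return true;
     }
 }

 // Query time: PayloadTermQuery consults scorePayload for each match, so tokens
 // flagged as header text score higher than ordinary body tokens.
 class HeaderSimilarity extends DefaultSimilarity {
     @Override
     public float scorePayload(int docId, String fieldName, int start, int end,
                               byte[] payload, int offset, int length) {
         return (payload != null && length > 0 && payload[offset] == 1) ? 3.0f : 1.0f;
     }
 }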
References
1. via document term vector cosines
2. Tika
3. http://lucene.apache.org/solr/
4. Lucene in Action, 2nd Edition, p. 275
5. Lucene in Action, 2nd Edition, p. 277
6. Lucene in Action, 2nd Edition, p. 283
7. http://project.carrot2.org/release-3.5.0-notes.html
8. http://project.carrot2.org/release-3.5.0-notes.html
9. http://www.lirmm.fr/~croitoru/kcap07-onto-2.pdf
10. http://delicias.dia.fi.upm.es/wiki/images/a/a5/GeneralOntologies.pdf