User:OrenBochman/Search/BrainStorm

Brainstorm Some Search Problems

Query Expansion

Problem: Page HTML also contains CSS, scripts, and comments

  1. Solution:
    Either index these too, or run a filter to remove them. Some strategies are:
    1. Discard all markup.
      1. A markup filter/tokenizer could be used to discard markup.
      2. The Apache Tika project (usable alongside Lucene) can do this; see the sketch below.
      3. Other ready-made solutions.
    2. Keep all markup.
      1. Write a markup analyzer that would be used to compress the page to reduce storage requirements
        (interesting if one also wants to compress output for integration into the DB or cache).
    3. Selective processing.
      1. A table/template map extension could be used in a strategy to identify structured information for deeper indexing.
      2. This is the most promising: it can detect and filter out unapproved markup (JavaScript, CSS, broken XHTML).
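
A minimal sketch of the Tika-based strategy, assuming the Apache Tika facade is on the classpath; the file name is only an example:

  import org.apache.tika.Tika;
  import java.io.File;

  public class MarkupStripper {
      public static void main(String[] args) throws Exception {
          // Tika detects the content type and returns only the visible text,
          // discarding tags, inline CSS, scripts, and HTML comments.
          Tika tika = new Tika();
          String plainText = tika.parseToString(new File("Article_page.html"));
          System.out.println(plainText);
      }
  }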

NG Search Features

Problem: Document Analysis is Language specific

Wikipedia documents come in over 200 languages. Language-specific analyzer implementations exist for only a limited number of languages. Most languages (synthetic and agglutinative) would be handled satisfactorily by N-gram based analysis, which can be done in a language-independent way (see the sketch below).
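
A minimal sketch of such a language-independent analyzer using Lucene's character n-gram tokenizer; import paths vary between Lucene versions, so treat them as assumptions:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.ngram.NGramTokenizer;

  // Index character bigrams and trigrams instead of words, so no
  // language-specific stemming or tokenization rules are required.
  public class NGramAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
          Tokenizer source = new NGramTokenizer(2, 3);        // 2- and 3-grams
          TokenStream normalized = new LowerCaseFilter(source);
          return new TokenStreamComponents(source, normalized);
      }
  }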

By developing a cross-language analyzer with language detection capabilities it would be possible to have the best of all worlds. It would detect the document language and each token's language. It would then defer treatment of the tokens to a language-specific implementation. It would also satisfy Lucene's limitation of requiring one analyzer per document.

Some projects, the Wiktionaries for example, can have 10+ languages represented in a single document.

Multilingual Analyzer

Language can be:

  1. Single - 98% confidence
  2. Confidence Map - several high-confidence candidates (e.g. a two-language Wiktionary document)
  3. Mixture Model (bit vector) - for Wiktionary (a 20-language Wiktionary document, a multilanguage message page)
  4. Undetected (binary, CDATA, database, SVG, etc.)

Requirements:

  • Asymmetrical API for Index / Search modes (takes a base language field)
  • Token and Document variants
  • Store language info (language mixture and confidence score) in the field payload or in the token type
  1. Extract features from the query and check against a model prepared offline (see the sketch below).
  2. The model would contain lexical features such as:
    1. alphabet
    2. bigram/trigram distribution
    3. stop lists; a collection of common word/POS/language sets (or lemma/language)
    4. normalized frequency statistics based on sampling full text from different languages
    5. a light model would be glyph based
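
A minimal sketch of the offline model idea, scoring character-bigram distributions per language; the profiles map is a placeholder for statistics mined from sampled full text:

  import java.util.HashMap;
  import java.util.Map;

  public class BigramLanguageDetector {
      // language -> (bigram -> relative frequency), prepared offline
      private final Map<String, Map<String, Double>> profiles;

      public BigramLanguageDetector(Map<String, Map<String, Double>> profiles) {
          this.profiles = profiles;
      }

      // Score each language profile against the bigrams of the input and return the best match.
      public String detect(String text) {
          Map<String, Integer> bigrams = new HashMap<>();
          String normalized = text.toLowerCase();
          for (int i = 0; i + 2 <= normalized.length(); i++) {
              bigrams.merge(normalized.substring(i, i + 2), 1, Integer::sum);
          }
          String best = "und";                       // "undetected"
          double bestScore = Double.NEGATIVE_INFINITY;
          for (Map.Entry<String, Map<String, Double>> profile : profiles.entrySet()) {
              double score = 0.0;
              for (Map.Entry<String, Integer> bg : bigrams.entrySet()) {
                  // smooth unseen bigrams so one gap does not rule a language out
                  double p = profile.getValue().getOrDefault(bg.getKey(), 1e-6);
                  score += bg.getValue() * Math.log(p);
              }
              if (score > bestScore) {
                  bestScore = score;
                  best = profile.getKey();
              }
          }
          return best;
      }
  }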

Problem: Morphological Variation in Language

Non-Technical Explanation

to be: am; is; was; are; were; will be; would be; should be; used to be; 

are different forms of the same word. A search engine is expected to treat all word forms as equivalent under normal circumstances.

Languages with rich morphology (synthetic, agglutinative) have many lexical/morphological word forms, reducing both precision and recall. By treating words as lemmas (equivalence classes of word forms) it is possible to overcome this problem. However, automatically mapping words to lemmas is a non-trivial task.[1]

Luckily the Wiktionary projects provide morphological details (inflection). It should be possible to create a template which could be used to annotate lemmas in the Wiktionaries (a prototype is available on my Wiktionary page). These could be extracted by a lemma extractor and generalized with high confidence to undocumented parts of the language.

Lemma Extractor

  • Mine lemmas based on analysis of Wiktionary (a partial sed script exists).
  • Enumerate/annotate lemma members with their morphological state, for human debugging and for further NLP processing (optional).
  • Bootstrap an automatic lemma analyzer using this method.
  • Use cross-language tricks to check confidence (optional).
  • Provide feedback for Wiktionary use.
  • Post-algorithmic overrides of errors.

Simple deliverables (sketched after this list):

  1. Agglutinative Morphology:
    1. S Stem list
    2. A Affix list
    3. M Mappings <S,A>
  2. Synthetic Morphology:
    1. S Stem list
    2. T Template list
    3. M Mapping <S,T>
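
A minimal sketch of how these deliverables could be held in memory; the class and field names are placeholders rather than a fixed format:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  // Container for the extractor's output: the stem list (S), the affix or
  // template list (A / T), and the mapping (M) saying which affixes or
  // templates each stem combines with.
  public class MorphologyTables {
      public Set<String> stems;
      public Set<String> affixesOrTemplates;
      public Map<String, List<String>> mappings;   // stem -> allowed affixes/templates

      // Expand one stem into the surface forms licensed by the mapping.
      public List<String> surfaceForms(String stem) {
          List<String> forms = new ArrayList<>();
          for (String affix : mappings.getOrDefault(stem, Collections.emptyList())) {
              forms.add(stem + affix);             // naive concatenation; real morphology may alter the stem
          }
          return forms;
      }
  }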

Lemma Analyzer

Four main approaches to lemmas:

  1. Generative grammar rule-based list
    • Pros
      • Good as a gold standard
    • Cons
      • Rules have exceptions
      • Requires new work per language
      • Requires dual expertise - language-specific linguistics and lexicography.
    • Tool
      • Hspell
  2. Hand-crafted database - for example Wiktionary/DBpedia
    • Pros
      • Good as a gold standard and as a working set
    • Cons
      • Rules have exceptions
      • Requires new work per language
      • Requires dual expertise - language-specific linguistics and lexicography.
    • Tool
  3. Machine induction of morphology - for example Wiktionary/DBpedia
    • Pros
      • Once configured it can be fully automated.
      • Objective criterion 1: MDL.
      • Objective criterion 2: gold standard.
      • Objective criterion 3: other triage techniques.
      • Objective criterion 4: semantic triage.
      • Heuristics based - needs parameter adjustment per language.
        • Induction of suffixes, prefixes, simple signatures and complex signatures.
        • Induction of allophony via a proximity heuristic (needs added triage/inspection).
        • Induction of elision via a ^-1 operator (needs added triage/inspection).
        • Induction of reduplication via a ^2 operator (needs triage/inspection).
    • Can be used to make spelling checkers.
    • Can be used to make an FSM version - see below.
    • Cons
      • Heuristics based - needs review (can be outsourced to non-experts or to more triage).
      • Requires adjusting parameters to make new languages work.
      • Requires integration in the form of a Lucene version, including support for different morphology versions based on growing corpus size (see the sketch after this list).
      • Does not analyze/tag morphological states.
      • Should process named entities and personal names differently.
      • The morphology is compression-oriented rather than search-oriented, and requires restructuring based on roots and POS or morphological states. (This looks like an oversight in the original design.)
    • Tools
      • Linguistica
    • Comments
      • Can benefit from better triage
        • semantic
      • Can benefit from more complex heuristics, e.g.
        • considering an n-gram/skip-gram eigenvalue matrix
        • phonological and phonotactic considerations (vowel harmony, consonant/vowel at word start)
      • Can benefit from named entity resolution
  4. FSM-based morphology
    • Pros
      • Great performance
      • Great for packaging
    • Cons
      • Very complex file formats
      • Requires lexicons
      • Requires new work per language
      • Requires triple expertise - language-specific linguistics, lexicography, and the specific FSM technology.
    • Tool
      • Apertium and their FSM tools project.
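
A minimal sketch of what the Lucene integration could look like: a TokenFilter that rewrites each token to its lemma using a lookup table produced by the lemma extractor. The lemmaTable map is a placeholder, and import paths follow recent Lucene versions:

  import java.io.IOException;
  import java.util.Map;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public final class LemmaFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final Map<String, String> lemmaTable;   // surface form -> lemma (mined from Wiktionary)

      public LemmaFilter(TokenStream input, Map<String, String> lemmaTable) {
          super(input);
          this.lemmaTable = lemmaTable;
      }

      @Override
      public boolean incrementToken() throws IOException {
          if (!input.incrementToken()) {
              return false;
          }
          // Replace the surface form with its lemma; unknown forms pass through unchanged.
          String lemma = lemmaTable.get(termAtt.toString());
          if (lemma != null) {
              termAtt.setEmpty().append(lemma);
          }
          return true;
      }
  }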

TODO - organize

  1. In languages with rich morphology this will reduce the effectiveness of search (e.g. Hebrew, Arabic, Hungarian, Swahili).
  2. Text mine en.Wiktionary && xx.Wiktionary for the data of a "lemma analyzer". (Store it in a table based on the Apertium morphological dictionary format.)
  3. Index xx.Wikipedia for frequency data and use a row/column algorithm to fill in the gaps of the morphological dictionary table (the three lemma variants are sketched after this list):
    1. dumb lemma (bag with a representative)
    2. smart lemma (list ordered by frequency)
    3. quantum lemma (organized by morphological state and frequency)
  4. Lemma-based indexing.
  5. Run a semantic disambiguation algorithm (tagging) to disambiguate.
  • Other benefits:
  1. Lemma-based compression (arithmetic coding based on the smart lemma).
    1. indexing all lemmas
  2. Smart resolution of disambiguation pages.
  3. An algorithm to translate English to Simple English.
  4. Excellent language detection for search.
  • Metrics:
  1. Extract the amount of information contributed by a user
    1. since inception.
    2. in the final version.
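
A minimal sketch of the three lemma representations from item 3 above; the class and field names are only illustrative:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.Set;

  // Dumb lemma: an unordered bag of surface forms with one representative.
  class DumbLemma {
      String representative;
      Set<String> forms;
  }

  // Smart lemma: surface forms ordered by corpus frequency (most frequent first).
  class SmartLemma {
      String representative;
      LinkedHashMap<String, Long> formsByFrequency;   // insertion order = frequency order
  }

  // Quantum lemma: forms keyed by morphological state, each with its frequency.
  class QuantumLemma {
      String representative;
      Map<String, Map<String, Long>> formsByState;    // state -> (form -> frequency)
  }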

How can search be made more interactive via Facets?

  1. SOLR instead of Lucene could provide faceted search involving categories (see the sketch after this list).
  2. The single most impressive change to search could be via facets.
  3. Facets can be generated via categories (Though they work best in multiple shallow hierarchies).
  4. Facets can be generated via template analysis.
  5. Facets can be generated via semantic extensions. (explore)
  6. Focus on culture (local, wiki), sentiment, importance, and popularity (edit, view, revert) may be refreshing.
  7. Facets can also be generated using named entity and relational analysis.
  8. Facets may have substantial processing cost if done wrong.
  9. A Cluster map interface might be popular.
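
A minimal sketch of category facets with SolrJ, assuming a Solr core named "wiki" with a "category" field; the client class and core name are placeholders and vary by SolrJ version:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FacetedSearchDemo {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/wiki").build();
          SolrQuery query = new SolrQuery("sun tzu");
          query.setFacet(true);                    // turn faceting on
          query.addFacetField("category");         // facet on the page's categories
          QueryResponse response = solr.query(query);
          for (FacetField.Count count : response.getFacetField("category").getValues()) {
              System.out.println(count.getName() + " (" + count.getCount() + ")");
          }
          solr.close();
      }
  }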

How Can Search Resolve Unexpected Title Ambiguity

  • The Art of War prescribes the following advice: "know the enemy and know yourself and you shall emerge victorious in 1000 searches". (Italics are mine.)
  • Google called it "I'm feeling lucky".

Ambiguity can come from:

  • The lexical form of the query (bank - river, money)
  • From the result domain - the top search result is an exact match of a disambiguation page.

In either case the search engine should be able to make a good (measured) guess as to what the user meant and give them the desired result.

The following data is available:

  • Squid cache access is sampled at 1 in 1000
  • All edits are logged too.
  • If we wanted to collect intelligence we could instrument all links to jump to a redirect page which logs

<source,target,user/ip-cookie,timestamp> and then fetches the required page (a sketch of the logger follows below).

  • It would be interesting to have these stats for all pages.
  • It would be really interesting to have these stats for disambiguation/redirect pages.
  • Some of this may be available from the site logs (are there any?).
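
A minimal sketch of the instrumented redirect logger; the log sink is a placeholder and the field names follow the tuple above:

  import java.time.Instant;

  public class ClickThroughLogger {
      // One click-through record in the <source, target, user/ip-cookie, timestamp> format.
      public record ClickThrough(String source, String target, String userOrCookie, Instant timestamp) {}

      // Called by the redirect endpoint just before it fetches the requested page.
      public ClickThrough log(String source, String target, String userOrCookie) {
          ClickThrough event = new ClickThrough(source, target, userOrCookie, Instant.now());
          System.out.println(event);   // placeholder: a real deployment would append to a log store
          return event;
      }
  }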

Use case 1. General browsing history stats available for disambiguation pages

Here is a resolution heuristic:

  1. Use the intelligence vector of <target,frequency> to jump to the most popular target (the 80% solution) - call it the "I hate disambiguation" preference (see the sketch below).
  2. Use the intelligence vector <source,target,frequency> to produce document term vector projections of source vs. target, to match the most related source and target pages (the source should be indexed).
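
A minimal sketch of heuristic 1: pick the most frequently chosen target for a disambiguation page. The frequency map is a placeholder for the <target,frequency> intelligence vector:

  import java.util.Map;

  public class DisambiguationResolver {
      // targetFrequencies: target page -> observed click-throughs from this disambiguation page
      public String mostPopularTarget(Map<String, Long> targetFrequencies, String fallback) {
          String best = fallback;                  // e.g. show the disambiguation page itself
          long bestCount = -1;
          for (Map.Entry<String, Long> entry : targetFrequencies.entrySet()) {
              if (entry.getValue() > bestCount) {
                  bestCount = entry.getValue();
                  best = entry.getKey();
              }
          }
          return best;
      }
  }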

Use case 2. Crowd-source local interest

Search patterns are often affected by television etc. This calls for analyzing search data and producing the following intelligence vector: <top memes, geo location>. It would be produced every N <= 15 minutes.

  1. Use the intelligence vector <source,target,target freshness,frequency> together with <top memes, geo location>, if significant for the search term, to steer towards the current interest.

Use case 3. User-specific browsing history also available

  1. Use <source,target,frequency> as above, but with a memory of <my top memes + edit history> weighted by time, to fetch personalised search results.

How can search be made more relevant via Intelligence?

  1. Use the current page (AKA referrer)
  2. Use browsing history
  3. Use search history
  4. Use Profile
  5. API for serving ads/fundraising

How Can Search Be Made More Relevant via Metadata Extraction?

While a semantic wiki is one approach to metadata collection, Apache UIMA offers the possibility of extracting metadata from free text as well (as do templates); see the sketch below.

  • Entity detection.
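
A minimal sketch of driving a UIMA analysis engine over page text; NerDescriptor.xml is a hypothetical entity-detection descriptor, not something UIMA ships with:

  import org.apache.uima.UIMAFramework;
  import org.apache.uima.analysis_engine.AnalysisEngine;
  import org.apache.uima.jcas.JCas;
  import org.apache.uima.resource.ResourceSpecifier;
  import org.apache.uima.util.XMLInputSource;

  public class EntityExtractionDemo {
      public static void main(String[] args) throws Exception {
          // Load a (hypothetical) entity-detection analysis engine descriptor.
          XMLInputSource descriptor = new XMLInputSource("NerDescriptor.xml");
          ResourceSpecifier spec = UIMAFramework.getXMLParser().parseResourceSpecifier(descriptor);
          AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(spec);

          // Feed the plain text of a wiki page and run the annotators over it.
          JCas jcas = engine.newJCas();
          jcas.setDocumentText("Sun Tzu wrote The Art of War in ancient China.");
          engine.process(jcas);

          // The detected annotations can now be read from the CAS indexes and stored as metadata.
          System.out.println(jcas.getAnnotationIndex().size() + " annotations");
          engine.destroy();
      }
  }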

Bugs

https://bugzilla.wikimedia.org/buglist.cgi?list_id=23545&resolution=---&query_format=advanced&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=Lucene%20Search&product=MediaWiki%20extensions


References

  1. ↑ John A. Goldsmith, "An Algorithm for the Unsupervised Learning of Morphology", http://hum.uchicago.edu/~jagoldsm/Papers/algorithm.pdf (accessed January 20, 2010).