Wikimedia Discovery/So Many Search Options

From mediawiki.org

Background

We’d always prefer providing some sort of search results to users rather than giving no results. And if we can’t give results, helpful suggestions and links are a better alternative than no results at all.

Right now we have, or have in the works, a large enough number of search modifications, extensions, and alternatives that we need to think clearly and carefully about how to order them and how to let them interact.

Current or Near Term Options

A brief summary of the features includes:

  • Question mark stripping: ? characters are removed unless they are escaped with a backslash (\?), because most people use them when asking questions, not as one-character wildcards. This happens before searching.
  • ASCII/ICU folding, stemming, case folding, etc.: this happens right before search, and is done by Elasticsearch as part of the language analysis step, but it’s worth mentioning explicitly, since we could theoretically use components of this type outside Elasticsearch at some point. Currently, characters are mapped to other characters (lower case, some apostrophe-like marks are converted to apostrophes), words are reduced to their approximate roots (run, running, ran, runs all become run), etc.
  • Interwiki / Cross-project searching: On Wikipedias, provide one result from each sister project in the same language, if projects and results are available.
  • Did You Mean (DYM) Spelling suggestions: If search terms don’t look very likely, and another similar term does, provide a clickable link with those changes made. If the original query gave zero results, go ahead and try the suggested search.
  • Quote stripping: If a query has quotes and does poorly, try the query again without the quotes. (Note: "poorly" isn't defined for future projects yet, but common criteria are < 3 results, or no results.)
  • Language detection / identification (TextCat / cross-language searching): If a query has fewer than 3 results, do language detection on it. If the language detected is not the “host” language (the language of the current wiki), try to get results from the corresponding project in that language, if it exists, and show any results.
  • Wrong keyboard detection: If the query does poorly, use the same technique as language detection to detect when a user has typed a query in one language (e.g., Russian) while using the keyboard of another language (e.g., English). This can be run concurrently with language detection, or separately. If a non-host language is detected, convert the query to the correct keyboard and run it again.
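
As a rough illustration of the two simplest modifications in the list above, here is a minimal Python sketch of question mark stripping and quote stripping. It is not the CirrusSearch implementation: the function names are hypothetical, only straight double quotes are handled, and replacing stripped characters with spaces (rather than deleting them) is an assumption.

```python
import re


def strip_question_marks(query: str) -> str:
    """Remove every '?' that is not escaped as '\\?' (hypothetical helper)."""
    # The negative lookbehind skips question marks preceded by a backslash.
    stripped = re.sub(r"(?<!\\)\?", " ", query)
    # Collapse any doubled-up whitespace left behind.
    return " ".join(stripped.split())


def strip_quotes(query: str):
    """Return the query without double quotes, or None if it has none."""
    if '"' not in query:
        return None  # nothing to strip, so nothing to suggest
    return " ".join(query.replace('"', " ").split())
```

A caller would treat a None from strip_quotes as "no suggestion", which matches the idea that quote stripping only applies when quotes are present.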

An interesting distinction that may make a difference for how results are displayed or ordered (much of which is not yet decided, see below) is that between query corrections and query expansions. The line isn't 100% clear, but corrections include DYM and wrong keyboard detection. Expansions include interwiki search and quote stripping. Language ID is on the line. It may make more sense to automatically provide corrected results, and pointers to expanded results.

Under the heading of expansions, it is possible to imagine future modules that detect, say, that the first result is a very good title match for a film, and therefore suggest additional queries for the main actors, director, film series, etc. (There's no plan to work on this right now, but it's good to think about these more expansive expansions now so we build something reasonably future proof.)

Stopping Criteria

Ideally, it would be interesting to run everything and see what gives the best result and show that, but realistically, that’s probably too expensive. So it makes sense to order them carefully and thoughtfully, and consider stopping criteria. Potential stopping criteria include:

  • a certain amount of time has gone by or CPU has been used
  • a certain number of options have been tried (they don’t all have the same initial criteria, so aren’t all eligible to run on every query, and different options could be weighted based on the cost of running them, too)
  • an option achieves “success” (e.g., returns a certain number of results).
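
The three stopping criteria could be tracked together in one small object. The following Python sketch is only one way to do that; the class name, the default budgets, and the idea of charging each option an integer cost are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SearchBudget:
    """Tracks the three stopping criteria; every default value is an assumption."""
    max_seconds: float = 0.5      # wall-clock budget for second-try work
    max_cost: int = 5             # budget of weighted option costs
    success_threshold: int = 3    # an option "succeeds" with this many results
    cost_spent: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, cost: int) -> None:
        """Record that an option with the given weighted cost has been tried."""
        self.cost_spent += cost

    def stop_reason(self, last_result_count: int):
        """Return why processing should stop, or None to keep going."""
        if time.monotonic() - self.started >= self.max_seconds:
            return "time budget exhausted"
        if self.cost_spent >= self.max_cost:
            return "option budget exhausted"
        if last_result_count >= self.success_threshold:
            return "option succeeded"
        return None
```

Weighting options by cost, rather than simply counting them, lets an expensive option use up more of the budget than a cheap one.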

Continuing Discussions

(January 2017)

We (David, Stas, Erik, Jan, & Trey) have been working to come up with a framework that's a step in the right direction in fleshing out a generic approach.

  • One component is modularity. In the general case, we don't want to have to write code in Cirrus of the "if TextCat then X, else if QuoteStripping then Y" variety. There will be some exceptions, particularly interwiki search, but most query-modifying options should meet this criterion. The modular interface is relatively simple:
    • Input is a query (usually the original query), the source wiki (usually the wiki it was originally submitted to), and the results count (so that different modules can use different results count thresholds if they need to—"Did you mean" (DYM) might use 0, most others might use 3).
    • Output is a list, each element consisting of a modified query, the wiki it should be submitted to, and a human readable information string (such as "Did You Mean X" or "Results from Russian Wikipedia"). The returned list can be empty if the module has nothing to suggest (e.g., there are no quotes to strip, or the language detected is the language of the current wiki, etc.).
  • Given general modularity, the modules need to be ordered, both for the order of considering modified queries to run and for the order of considering results to show. We can use likelihood of applicability and accuracy of results to initially order the modules, but this is also something that could be A/B tested.
  • Another component is simultaneous search. It takes too long to issue a modified query, check the results, and repeat, say, five times. However, as the recent interwiki search work has shown, issuing multiple queries at once doesn't necessarily bog down the Elasticsearch cluster. So we propose defining some maximum number of simultaneous queries (say, five), and working through the search-modifying modules until we run out of modules or we fill all the available simultaneous search slots. Once the slots are filled (which should be fast compared to searching), we issue the five queries at once.
  • The final component, which we did not fully work through, is displaying the results. We had originally written that it does not seem helpful, for the users who most need help with search, to display seven kinds of results all smashed together. However we're going to chat with UX pros about that and make sure. (Seven is definitely too many. We currently allow two for language ID. Is three too many?)
    • As noted above, it may make sense to distinguish query corrections (DYM) from query expansions (stripping quotes) and treat them differently. One obvious distinction is providing actual results vs providing pointers to results (i.e., a link to perform a different search, which is also much cheaper, CPU-wise, than doing the search automatically).
      • If all of the second-try results are "pointers", it might make sense to give the actual results for the "best" query inline. (But we haven't figured out "best" yet, other than first in order.)
      • Do we collect suggestions/expansions/pointers at the top of the page (where DYM results happen now), or do we put them at the bottom, for people who scroll down and haven't found anything worth clicking on?
    • Earlier notes: In general, it seems that later modules are more likely to be more "desperate" (i.e., more likely to favor recall over precision so as to give some results), so raw number of results is not a good selection criterion. It's also possible for some results to overlap with previous results (e.g., a query with quotes may get one result, while the same query without quotes may get that same result, plus many others). Until we think of something better, the current draft proposal is to stick with the "success criteria": original query results (if any) would be shown, along with the results of the first "successful" modified query (that probably means having 3+ results). If there is no successful modified query, we could show the results of the earliest modified query.
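
To make the modular interface and the slot-filling step more concrete, here is a minimal Python sketch. Every name in it (Suggestion, quote_stripper, stub_language_detector, fill_slots) is hypothetical, the language detector is a stub that stands in for TextCat, and the description strings are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Suggestion:
    query: str        # the modified query
    wiki: str         # the wiki it should be submitted to
    description: str  # human-readable information string


# A module maps (query, source wiki, result count) to a list of suggestions.
Module = Callable[[str, str, int], List[Suggestion]]


def stub_language_detector(query: str, wiki: str, result_count: int) -> List[Suggestion]:
    # Stand-in for TextCat: pretends every poorly performing query is Spanish.
    if result_count >= 3 or wiki == "eswiki":
        return []
    return [Suggestion(query, "eswiki", "Results from Spanish Wikipedia")]


def quote_stripper(query: str, wiki: str, result_count: int) -> List[Suggestion]:
    if result_count >= 3 or '"' not in query:
        return []  # nothing to suggest
    stripped = " ".join(query.replace('"', " ").split())
    return [Suggestion(stripped, wiki, f"Results for {stripped}")]


def fill_slots(modules: List[Module], query: str, wiki: str,
               result_count: int, max_slots: int = 5) -> List[Suggestion]:
    """Work through the modules in order until the slots are full."""
    queued: List[Suggestion] = []
    for module in modules:
        if len(queued) >= max_slots:
            break
        free = max_slots - len(queued)
        queued.extend(module(query, wiki, result_count)[:free])
    return queued
```

The queued suggestions would then be issued as one batch of simultaneous searches. The framework never inspects which module produced a suggestion, which is the point of the modular design.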

Exceptions and Notes

We decided that interwiki results (i.e., showing results from same-language Wiktionary, Wikiquote, Wikisource, Wikivoyage, etc.) are independent from the other query-modifying modules we're considering. Those results should be shown if available, but whether or not they are available doesn't affect anything else.

Cirrus-internal modifications (i.e., ?-stripping and language analysis) don't affect and aren't affected by the modular framework. It's just good to keep in mind that they exist.

We've decided that Did You Mean (DYM) is not an exception, and should be refactored to behave like other potential modules. We've also generalized the DYM suggestion text and Language ID/TextCat "Showing Results from..." above.

For off-wiki searchers using the API, we've mocked up potential combined JSON results below.

Advanced Options

One idea we touched on but didn't fully address was combining quote stripping and language detection: we might want to strip quotes and send the query to a wiki in another language if both of those options fail independently. One way to approach this would be to have a special module that just calls the other two modules in turn and combines their results. Another option would be to have a mechanism for explicitly composing two or more modules, taking the output of one and using it as input to the other. That's not something to consider for the initial version, but an idea to keep in mind.

One idea that just came up, and which is incorporated above but is also worth explicitly mentioning, is that a module could have multiple suggestions. A language-aware quote stripper could suggest the stripped query on the current wiki, but also on another wiki if the query seems to be in another language. A wrong-keyboard detector could suggest the modified query on the current wiki, and on the wiki corresponding to the presumed keyboard (e.g., from enwiki, a query detected as "Latin Cyrillic" could be converted to proper Cyrillic and run on both enwiki and ruwiki). This seems easy to include in the first pass, even if all modules return only one suggestion to start.
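
As an illustration of a module with multiple suggestions, the Python sketch below remaps a query typed on a US QWERTY keyboard to the standard Russian ЙЦУКЕН layout and suggests it on two wikis. It assumes that detection has already decided the query is "Latin Cyrillic"; the function name is hypothetical, only the 26 letter keys are mapped, and the input is lowercased for simplicity.

```python
# Letter keys of US QWERTY and the Russian letters on the same physical keys.
LATIN = "qwertyuiopasdfghjklzxcvbnm"
CYRILLIC = "йцукенгшщзфывапролдячсмить"
QWERTY_TO_JCUKEN = str.maketrans(LATIN, CYRILLIC)


def wrong_keyboard_suggestions(query: str, wiki: str):
    """Return (query, wiki, description) tuples for a mistyped query."""
    converted = query.lower().translate(QWERTY_TO_JCUKEN)
    if converted == query.lower():
        return []  # nothing was remapped, so there is nothing to suggest
    suggestions = [(converted, wiki, f"Did you mean {converted}?")]
    if wiki != "ruwiki":
        suggestions.append(
            (converted, "ruwiki",
             f"Showing results for {converted} from Russian Wikipedia"))
    return suggestions
```

With this mapping, los lobos locos becomes дщы дщищы дщсщы, suggested once for the current wiki and once for ruwiki.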

Depending on how complex and expensive the module internals are, we might want to cache results on a blackboard. Such a mechanism allows the modules to know about each other, without the framework having to know about them. For example, quote stripping is probably so fast and easy it's okay to do it more than once in different modules. Language identification with TextCat is lightweight, but much more intensive than quote stripping. The language ID module could detect the language and write the results to the blackboard. The quote stripper could check the blackboard and see that a language ID has occurred, and make two suggestions, one for the current wiki, and one for the wiki of the detected language. Similarly, with a composed language-aware quote stripper that calls TextCat a second time on the same query, TextCat could check the blackboard and not re-run the exact same query again. This is probably overkill in general, and certainly too much for an initial implementation, so it's just an idea to keep around in case we need it.
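
A blackboard of this kind could be as small as a dictionary shared between modules. The Python sketch below is hypothetical throughout: the class and function names are made up, the detector is a stub that always answers Spanish, and building a wiki name by appending "wiki" to the language code is an assumption.

```python
class Blackboard:
    """Shared scratch space keyed by (kind of note, query)."""

    def __init__(self):
        self._notes = {}

    def peek(self, kind, query):
        """Return a previously written note, or None if there is none."""
        return self._notes.get((kind, query))

    def get_or_compute(self, kind, query, compute):
        """Return the cached note, running compute(query) only on a miss."""
        key = (kind, query)
        if key not in self._notes:
            self._notes[key] = compute(query)
        return self._notes[key]


detector_calls = []


def stub_detect_language(query):
    # Stand-in for TextCat; always answers Spanish and records each call.
    detector_calls.append(query)
    return "es"


def language_id_module(query, wiki, board):
    lang = board.get_or_compute("language", query, stub_detect_language)
    target = f"{lang}wiki"
    return [] if target == wiki else [(query, target)]


def quote_stripper_module(query, wiki, board):
    if '"' not in query:
        return []
    stripped = " ".join(query.replace('"', " ").split())
    suggestions = [(stripped, wiki)]
    lang = board.peek("language", query)  # only reads; never runs detection
    if lang is not None and f"{lang}wiki" != wiki:
        suggestions.append((stripped, f"{lang}wiki"))
    return suggestions
```

The quote stripper only reads the blackboard, so it makes a second suggestion when language ID has already run and quietly falls back to one suggestion when it has not.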

Extensions and Hooks

(Feb 2017)

Another wrinkle we hadn't previously considered is how this interacts with extensions. A good example is the ArticlePlaceholder extension, which uses Wikidata to generate content that can provide information on a subject when no human-generated article exists. It appends its results to the end of the search results page. As an example, search for Prikkflodskilpadde on Norwegian Nynorsk Wikipedia. The section "Finn data om emnet" (~"Find data on the subject") at the bottom has the additional results. Another example (with >40 results) on nnwiki: Theodore Roosevelt.

ArticlePlaceholder uses the SpecialSearchResultsAppend hook. There are also SpecialSearchResultsPrepend and SpecialSearchNogomatch hooks, and there was a SpecialSearchNoResults hook, but it has been removed.

We should think about how any solution we come up with interacts with possible extensions that use these hooks. We could also entertain the notion of providing these features as their own extensions, and allowing wikis to install/enable the ones they'd like to use. It's not immediately clear why the No Results hook was removed; it seems like it could have been useful, as might a Too Few Results hook.

Open Questions

  • Results selection: As mentioned above, we haven't really carefully worked out a good method for selecting results when our five-or-so simultaneous queries all return results. We've consulted with Jan about UX and how best to display results, and our current relationship status with this question is, "it's complicated".
    • Confidence: In our discussion the idea of confidence came up, such as having each module give some confidence score to its suggested query, which would allow us to order them based on that confidence, rather than on a fixed order. TextCat has eluded confidence measures so far, and it's unfortunately hard to imagine assigning a well-founded confidence after stripping quotes from a query. We could have simple categories ("high, medium, low" or maybe "bold, in vain, desperate") that could sort results, especially when a module has multiple suggestions. For example, quote stripping might be "medium" while quote stripping + language ID is "low", even though both come from the same module.
  • Extensions: Should we be concerned about how any changes we make interact with other extensions? Should we implement some of these features as extensions?
  • API: We've mocked up some possible API results below, but it's a very early draft and needs more thought and potential updates.

A Worked Example

Suppose we get a query, "los lobos locos", which gets 1 result on enwiki. We run our first module, language detection with TextCat, and it determines the query is Spanish and queues up "los lobos locos" to search on eswiki. The second module, the quote stripper, suggests los lobos locos on both enwiki and eswiki. Wrong keyboard detection (having lost its mind) suggests "дщы дщищы дщсщы" on enwiki and ruwiki. That's five suggestions, so we stop processing modules.

Scenario 1

Suppose:

  • "los lobos locos" on eswiki returns 0 results.
  • los lobos locos on enwiki returns 1 result.
  • los lobos locos on eswiki returns 2 results.
  • "дщы дщищы дщсщы" on enwiki returns 0 results.
  • "дщы дщищы дщсщы" on ruwiki returns 2 results.

Since none are "successful", we just show the original 1 result from enwiki, along with the earliest "unsuccessful" result—the 1 result from los lobos locos on enwiki.

Scenario 2

Suppose (the only difference from Scenario 1 is the last line):

  • "los lobos locos" on eswiki returns 0 results.
  • los lobos locos on enwiki returns 1 result.
  • los lobos locos on eswiki returns 2 results.
  • "дщы дщищы дщсщы" on enwiki returns 0 results.
  • "дщы дщищы дщсщы" on ruwiki returns 5 results.

Since the final query (in defiance of all likelihood in the real world) is "successful", we show the original 1 result from enwiki, along with the earliest "successful" result—the 5 results from "дщы дщищы дщсщы" on ruwiki.
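
The selection rule used in both scenarios can be written down in a few lines. This Python sketch is one reading of the draft proposal, not settled behavior: it takes the first "successful" attempt (3+ results) and otherwise falls back to the earliest attempt that returned anything at all. The function name and the tuple layout are assumptions.

```python
def pick_result_set(attempts, success_threshold=3):
    """attempts: (query, wiki, result_count) tuples in module order."""
    # First choice: the earliest attempt that meets the success criterion.
    for attempt in attempts:
        if attempt[2] >= success_threshold:
            return attempt
    # Fallback: the earliest attempt that returned any results.
    for attempt in attempts:
        if attempt[2] > 0:
            return attempt
    return None


scenario_1 = [
    ('"los lobos locos"', "eswiki", 0),
    ("los lobos locos", "enwiki", 1),
    ("los lobos locos", "eswiki", 2),
    ('"дщы дщищы дщсщы"', "enwiki", 0),
    ('"дщы дщищы дщсщы"', "ruwiki", 2),
]
# Scenario 2 differs only in the last attempt, which now gets 5 results.
scenario_2 = scenario_1[:4] + [('"дщы дщищы дщсщы"', "ruwiki", 5)]

print(pick_result_set(scenario_1))  # the 1 result from enwiki without quotes
print(pick_result_set(scenario_2))  # the 5 results from ruwiki
```

In both scenarios the original 1 result from enwiki would still be shown alongside the selected set.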

Draft Proposal

Search Engine Results Page

Below is the updated draft proposal, for further discussion of both generalities and specifics. Important elements include:

  • Order with respect to default search and to each other. Options below are roughly sorted into groups that happen at the same time. Exact sorting is a point of discussion.
  • Initial eligibility criteria: “automatic” always happens; “no previous successful results” is always assumed (see below) except for “automatic” actions; the number of main search results or results from previous options is probably the most common criterion.
  • Marginal cost estimate: start with very rough low/medium/high estimates of the marginal cost of the various options, if activated. The marginal cost of determining initial criteria is presumed to be low.
  • “Success” criteria: here defined as giving good enough results so as to stop processing and trying other alternatives. So while question mark stripping is probably always going to be successful in terms of removing question mark characters, its success criterion is “none” because it will never stop processing. Success criteria could include the number of results, the “quality” of results, and maybe the length of the query (short, one-word queries seem like they could be a different class than very long and/or multi-word queries).
  • Results shown: One way to cut down on UI complexity is to only show the “best” set of results from extra search options, so if stripping quotes gives 1 result, wrong keyboard gives 2 results, and language identification gives 200 results, only the final 200 results would be added to the original main search results (which are likely fewer than 3).

| option | category | initial criteria | success criteria | marginal cost estimate [1] | effect / process | results shown |
|---|---|---|---|---|---|---|
| ?-stripping | Cirrus-internal | automatic | none | low | unescaped ?s are stripped from queries | n/a (effect is in main search) |
| lg analysis [2] | Cirrus-internal [3] | automatic | none | medium | character/case folding, stemming, and general mangling of the query | n/a (effect is in main search) |
| main search | Cirrus-internal | automatic | 3+ results [4] | high | | all results are shown |
| interwiki search [5] | Cirrus-internal [6] | automatic on Wikipedia; never, elsewhere | none [6] | high | top result (if any) from each sister project (if any), displayed in a side bar | all results (i.e., first from each project) are shown |
| DYM suggestions | ? | run with the main search using the phrase suggester with several options available to tune (prefix length, confidence). Currently tuned toward recall. Does not work on spaceless languages. | ? [7] | high | if original query gets results, a link to a suggested search; if original gets no results and suggestion does, suggested results are shown | ? [7] |
| quote stripping [8] | modular | < 3 results from any previous step [4] + presence of quotes | 3+ results [4] | high | remove quotes (or replace with spaces) and re-run the query | if successful, show results; if not, show only if nothing better comes along. [9] |
| wrong keyboard [8] [10] | modular | < 3 results from any previous step [4] + detection of “wrong keyboard” | 3+ results [4] | high | remap the query to the right keyboard and re-run | if successful, show results; if not, show only if nothing better comes along. [9] |
| language detection [8] [10] | modular | < 3 results from any previous step [4] + detection of non-host language | 3+ results [4] | high | re-run the query on the wiki with the detected language | if successful, show results; if not, show only if nothing better comes along. [9] |

  1. Estimating is hard, so better estimates are very welcome.
  2. Technically, the language analysis happens inside Elasticsearch in the main search, but it happens before the query is looked up in the indices, so it’s logically “before”.
  3. This actually happens in Elasticsearch, but we're using "Cirrus-internal" as a label for everything that "automatically happens as part of search" and isn't "second-try" processing.
  4. “Fewer than 3 results” from the main search has been used to define “poorly performing queries” for language identification, but was arbitrarily chosen. It’s a reasonable initial proposal for initial criterion (< 3) and success (3+), but is readily changed.
  5. We're assuming that interwiki search runs at approximately the same time as the main search, since it is automatic.
  6. This isn't really Cirrus-internal, but it's an exception to the modular framework we're currently considering, and it happens regardless of whatever else goes on. See "Exceptions and Notes", above.
  7. Do we want to mix DYM suggestions and/or results with language identification results? That depends on the success criteria, but we need a plan for showing (or not showing) suggestions if another option also gives “successful” results.
  8. It seems like DYM suggestions should come before the others for historical reasons, but it's up for discussion.
  9. The idea here is that if a stage has successful results, then we stop and show them. If the stage has some results, but not enough to be “successful”, we hold on to them. If a later stage is successful, show those and discard these. If no later stage is successful, show these. So given the order here, if stripping quotes gives 1 result, wrong keyboard gives 2 results, and language identification gives 2 results, the single quote-stripping result would be shown. An alternative option would be to show the largest set, with ties broken by order.
  10. Wrong keyboard can be run at the same time as language detection (i.e., they can be rolled into one process) if they are happening sequentially.
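
The alternative mentioned in note 9, showing the largest set with ties broken by order, could look like the Python sketch below. The function name and the (stage, count) tuple layout are assumptions; the numbers are the ones from the note.

```python
def largest_result_set(stages):
    """stages: (stage name, result count) tuples in processing order."""
    best = None
    for name, count in stages:
        # A strict '>' keeps the earlier stage when two counts are tied.
        if count > 0 and (best is None or count > best[1]):
            best = (name, count)
    return best[0] if best else None


stages = [
    ("quote stripping", 1),
    ("wrong keyboard", 2),
    ("language detection", 2),
]
print(largest_result_set(stages))  # wrong keyboard
```

On the same numbers, the hold-the-first-results rule described in note 9 would show the single quote-stripping result instead, so the two rules can disagree even on small examples.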

API Proposal

Throughout this section, the JSON examples have had search result details abbreviated to just the title to save space, Unicode characters decoded for readability, and <em>emphasis</em> tags used to italicize rather than being displayed.

"Did You Mean"

"Did you mean" creates challenges for the notion of generalizing second-try modules because there is a lot of existing infrastructure specific to it.

DYM Suggestions

Currently, if you search for alpert einstein on enwiki via the API (URL) you get a small number of results, with a suggestion to correct the query to albert einstein in query|searchinfo|suggestion.

 {
     "batchcomplete": "",
     "continue": { "sroffset": 10, "continue": "-||" },
     "query": {
         "searchinfo": {
             "totalhits": 42,
             "suggestion": "albert einstein",
             "suggestionsnippet": "albert einstein"
         },
         "search": [
             { ..., "title": "Political views of Albert Einstein", ... },
             { ..., "title": "Ram Dass", ... },
             { ..., "title": "Warren Alpert Foundation Prize", ... },
             { ..., "title": "Final Theory (novel)", ... },
             { ..., "title": "History of the Technion — Israel Institute of Technology", ... },
             { ..., "title": "Brandeis University", ... },
             { ..., "title": "Susan Band Horwitz", ... },
             { ..., "title": "Akira Endo (biochemist)", ... },
             { ..., "title": "Hobo nickel", ... },
             { ..., "title": "Associated Recording Studios", ... }
         ]
     }
 }
DYM Query Rewrites

Currently, searching for alpert einstien on enwiki via the API (URL) gets no results, so the query is automatically re-written, and results for albert einstein are provided. The rewritten query is provided in query|searchinfo|rewrittenquery.

 {
     "batchcomplete": "",
     "continue": { "sroffset": 10, "continue": "-||" },
     "query": {
         "searchinfo": {
             "totalhits": 4595,
             "rewrittenquery": "albert einstein",
             "rewrittenquerysnippet": "albert einstein"
         },
         "search": [
             { ..., "title": "Albert Einstein", ... },
             { ..., "title": "Einstein family", ... },
             { ..., "title": "Albert Einstein Medal", ... },
             { ..., "title": "Albert Einstein Hospital", ... },
             { ..., "title": "Albert Einstein: The Practical Bohemian", ... },
             { ..., "title": "Albert Einstein in popular culture", ... },
             { ..., "title": "Albert Einstein College of Medicine", ... },
             { ..., "title": "Political views of Albert Einstein", ... },
             { ..., "title": "Albert Einstein Square", ... },
             { ..., "title": "Albert Einstein Peace Prize", ... }
         ]
     }
 }
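
An API client has to tell these two DYM cases apart. The Python sketch below shows one way to do that, using only the searchinfo fields shown in the two responses above; the function name and the wording of the returned messages are assumptions.

```python
def describe_dym(searchinfo: dict):
    """Summarize what DYM did, based on the fields present in searchinfo."""
    if "rewrittenquery" in searchinfo:
        # The original query had no results, so the results shown are
        # for the rewritten query.
        return f"Showing results for {searchinfo['rewrittenquery']}"
    if "suggestion" in searchinfo:
        # The original query had results; the suggestion is only a pointer.
        return f"Did you mean {searchinfo['suggestion']}?"
    return None  # DYM had nothing to offer


suggested = {"totalhits": 42, "suggestion": "albert einstein",
             "suggestionsnippet": "albert einstein"}
rewritten = {"totalhits": 4595, "rewrittenquery": "albert einstein",
             "rewrittenquerysnippet": "albert einstein"}

print(describe_dym(suggested))  # Did you mean albert einstein?
print(describe_dym(rewritten))  # Showing results for albert einstein
```
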
Language Identification

Currently, if you search for Maria Asuncion Aramburuzabala Larregui on enwiki via the API (URL), you get JSON results like this:

 {
     "batchcomplete": "",
     "query": {
         "searchinfo": {
             "totalhits": 1
         },
         "search": [
             { ..., "title": "María Asunción Aramburuzabala", ... }
         ],
         "additionalsearch": {
             "es": [
                 { ..., "title": "María Asunción Aramburuzabala Larregui", ... },
                 { ..., "title": "Escoriaza", ... },
                 { ..., "title": "KIO Networks", ... },
                 { ..., "title": "Museo del Prado", ... }
             ]
         },
         "additionalsearchinfo": {
             "totalhits": 4
         }
     }
 }

The newer field query|additionalsearch contains the results of language ID and subsequent search (the query was identified as Spanish and Spanish Wikipedia (eswiki) was searched). The key query|additionalsearch|es identifies the language of the other Wikipedia that was searched. query|additionalsearchinfo|totalhits contains the total number of hits returned.

Proposed API Changes

Our goal is to extend this format with as few breaking changes as possible, but we need to include the possibility of results from DYM, language ID, quote stripping, wrong keyboard detection, and others.

Since it is possible to have multiple result sets from the same language wiki, we propose adding an arbitrary sequence number to the query|additionalsearch|... "language" key to ensure uniqueness, and so that additional information under query|additionalsearchinfo can be provided.

We've also proposed refactoring DYM to be just another second-try suggester like the others. Whether or not this includes deprecating query|searchinfo|suggestion, query|searchinfo|rewrittenquery, and their corresponding snippets is an open question. It might also entail no longer giving automatic results when the original query gave zero results (i.e., no more rewrittenquery; everything would be a suggestion). Or we could give DYM special legacy status for backward compatibility.

In order to maintain some backward compatibility, query|additionalsearchinfo|totalhits would contain the sum of results from all second-try modules, and a new field, query|additionalsearchinfo|keydescription would contain additional information about the modified query, its source, and the results. At the moment, both the keys and values are mostly intended as placeholders, with the exact format and information to be determined.

Let's go back to our earlier hypothetical example, but without quotes: los lobos locos. Assume it gets one result on enwiki. DYM suggests los locos locos on enwiki. Language detection determines the query is Spanish and queues up los lobos locos to search on eswiki. Wrong keyboard detection (still having lost its mind) suggests дщы дщищы дщсщы on enwiki and ruwiki.

This time, let's suppose that:

  • los locos locos on enwiki returns 2 results (DYM).
  • los lobos locos on eswiki returns 2 results (languageID).
  • дщы дщищы дщсщы on enwiki returns 1 result. (wrong keyboard, same wiki)
  • дщы дщищы дщсщы on ruwiki returns 5 results. (wrong keyboard, different wiki)

In this example, the legacy format DYM query|searchinfo|suggestion and query|searchinfo|suggestionsnippet are not included.

 {
     "batchcomplete": "",
     "query": {
         "searchinfo": {
             "totalhits": 1
         },
         "search": [
             { ..., "title": "Los Lobos Locos", ... }
         ],
         "additionalsearch": {
             "en-0": [
                 { ..., "title": "Los Locos Locos", ... },
                 { ..., "title": "Los Lobos Locos", ... }
             ],
             "es-1": [
                 { ..., "title": "Los Lobos Locos", ... },
                 { ..., "title": "Todos los locos", ... }
             ],
             "en-2": [
                 { ..., "title": "Lolcats around the world", ... }
             ],
             "ru-3": [
                 { ..., "title": "Los Lobos", ... },
                 { ..., "title": "Лолкот", ... },
                 { ..., "title": "Волк", ... },
                 { ..., "title": "Сумасшествие", ... },
                 { ..., "title": "Жаргон падонков", ... }
             ]
         },
         "additionalsearchinfo": {
             "totalhits": 10,
             "keydescription": {
               "en-0": {
                 "source": "DYM",
                 "wiki": "enwiki",
                 "rewrittenquery": "los locos locos",
                 "rewrittenquerysnippet": "los locos locos",
                 "description": "Did you mean los locos locos ?",
                 "totalhits": 2
               },
               "es-1": {
                 "source": "languageID",
                 "wiki": "eswiki",
                 "description": "Showing results from Spanish Wikipedia",
                 "totalhits": 2
               },
               "en-2": {
                 "source": "wrongKeyboard",
                 "wiki": "enwiki",
                 "rewrittenquery": "дщы дщищы дщсщы",
                 "rewrittenquerysnippet": "дщы дщищы дщсщы",
                 "description": "Did you mean дщы дщищы дщсщы ?",
                 "totalhits": 1
               },
               "ru-3": {
                 "source": "wrongKeyboard + languageID",
                 "wiki": "ruwiki",
                 "rewrittenquery": "дщы дщищы дщсщы",
                 "rewrittenquerysnippet": "дщы дщищы дщсщы",
                 "description": "Showing results for дщы дщищы дщсщы from Russian Wikipedia",
                 "totalhits": 5
               }
             }
         }
     }
 }
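
To give a feel for how a client might consume the proposed format, here is a minimal Python sketch that pairs each query|additionalsearch key with its keydescription entry. Since the format is still a placeholder, this is only an illustration: the function names are hypothetical, and the sample response is trimmed to two of the keys from the mock-up, with totalhits adjusted to match.

```python
# Trimmed-down sample in the proposed format (two result sets only).
response = {
    "query": {
        "additionalsearch": {
            "en-0": [{"title": "Los Locos Locos"}, {"title": "Los Lobos Locos"}],
            "ru-3": [{"title": "Los Lobos"}, {"title": "Лолкот"}, {"title": "Волк"},
                     {"title": "Сумасшествие"}, {"title": "Жаргон падонков"}],
        },
        "additionalsearchinfo": {
            "totalhits": 7,
            "keydescription": {
                "en-0": {"source": "DYM", "wiki": "enwiki", "totalhits": 2},
                "ru-3": {"source": "wrongKeyboard + languageID",
                         "wiki": "ruwiki", "totalhits": 5},
            },
        },
    }
}


def summarize_second_try(response):
    """Return (key, source, wiki, titles) for each additional result set."""
    query = response.get("query", {})
    descriptions = query.get("additionalsearchinfo", {}).get("keydescription", {})
    summary = []
    for key, hits in query.get("additionalsearch", {}).items():
        meta = descriptions.get(key, {})
        titles = [hit["title"] for hit in hits]
        summary.append((key, meta.get("source"), meta.get("wiki"), titles))
    return summary


def totals_are_consistent(response):
    """Check that the overall totalhits is the sum of the per-key totals."""
    info = response["query"]["additionalsearchinfo"]
    per_key = sum(meta["totalhits"] for meta in info["keydescription"].values())
    return per_key == info["totalhits"]
```

The consistency check reflects the proposal that query|additionalsearchinfo|totalhits holds the sum of results from all second-try modules.
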
API Questions
  • Do we deprecate the DYM query|searchinfo|suggestion, query|searchinfo|rewrittenquery, and their corresponding snippets?
  • What's the exact format for query|additionalsearchinfo|keydescription?

More Options

Other ideas that have come up that are or could be second-try-ish:

  • Regional results: This could be its own thing, but some may be hesitant to give users different results based on their location; instead, give "also popular in your region" results that highlight results not in the top results that are nonetheless popular in the user's (apparent) geographical region. Obvious potential example searches include things involving titles (president) or homonyms (football).
  • Query refinement suggestions: rather than spelling suggestions, offer query suggestions—additional search terms (to limit facets), related searches, etc.
  • Faceted search

Seems like some sort of ranking is the only way to go if we ever get to many of these.

SEO

Adding some terms to make it easier to find this page later: second try search, second chance search