User:TJones (WMF)/Notes/Potential Applications of Natural Language Processing to On-Wiki Search
May 2018. See TJones_(WMF)/Notes for other projects. See also T193070.
"[W]e need to design the right tasks. Too easy and NLP is only a burden; too hard and the necessary inferences are beyond current NLP techniques."
– R. Baeza-Yates et al., "Towards Semantic Search" (doi:10.1007/978-3-540-69858-6_2)
Introduction and Overview
What is natural language processing? It's hard to pin down exactly, but almost any automated processing of text as language might qualify; in a corporate environment, it's whatever text processing you can do that your competitors can't. The English Wikipedia article on NLP does a good job of laying out a lot of the common, high-level tasks NLP addresses, most of which we'll at least touch on below.
The goal of this report is to look at any aspects of computational linguistics/NLP that might be useful for search: list at least 50-100 ideas and variants (many without much detail at first), find 10-20 through discussion that seem really promising, and then identify 1-2 that we can pursue, ourselves and/or with the help of an outside consultant, over the next 2-4 quarters.
Some topics are driven more by use case (phonetic search), and some more by technique (Word2vec must be good for something), so the level of obvious applicability varies, and some concepts are defined under one item and referenced in another. Items are roughly grouped by similarity, but only loosely.
Focusing more deeply on any of these topics could lead to a recursive amount of similar detail. For most topics we still need to investigate the cost (in terms of development, integration, complexity, computational resources, etc.) vs. the value to be had (improvements to search, benefit to readers, editors, etc.). I don't have any strong opinions on "build vs. buy" for most of these, especially given the fact that we would only pursue open-source options, which affords much more control. For some, whether build or buy, it may make sense to wait for other resources to mature, like Wikidata's lexeme search (Stas's Notes, T189739), or more structured data in Wiktionary.
I think it is often worth considering some not-so-cutting-edge techniques that, by today's standards, are not only simple, but relatively lightweight. These have the advantage of being well-documented, easier to implement and understand, and lower risk should they fail to pan out. Many also follow an 80/20 reward/cost ratio compared to cutting-edge techniques, and implementing five ideas that each get 80% of the value of the best version of each might be better than implementing one idea in the best way possible, especially when the payoff for each implementation is unclear. Experience can then show which is the best to pursue to the max for the most benefit. Of course, cutting-edge techniques are good, too, if they are practical to implement!
As the dividing line between Natural Language Processing, Machine Learning, and Information Retrieval is sometimes blurry, I haven't been too careful to discard ideas that are clearly more ML or IR than NLP.
Every item probably deserves an explicit "needs literature review for more ideas and options" bullet point, but that would get to be very repetitive. Many items that refer to words or terms might need to be modified to refer to character n-grams for languages without spaces (esp. those without good segmentation algorithms).
N.B.: Items are loosely grouped, but completely unsorted. Top-level items aren't more important than sub-items, and we might work on a sub-sub-item and never take on a higher-level item. The hierarchy and grouping are just a way to organize all the information and reduce repetition in the discussion.
Current Recommendations
David, Erik, and Trey reviewed a selection of the most promising-seeming and/or most interesting projects and gave them a very rough cost estimate based on how big a relative impact they would have (weighted double), how technologically hard they would be, and how difficult the UI aspect would be (weighted half). See the Scoring Matrix below. The scores are not definitive, but they helped guide the discussion.
For the possibility of working with an outside consultant, we also considered how easily separated each project would be from our overall system (making it easier for someone new to get up to speed), how projects feed into each other, how easily we could work on projects ourselves (like, we know pretty much what to do, we just have to do it), etc.
Our current recommendation for an outside consultant would be to start with (1) spelling correction/did you mean improvements, with an option to extend the project to include either (2) "more like" suggestion improvements, or (3) query reformulation mining, specifically for typo corrections. These are bolded in the scoring matrix below.
For spelling correction (#1), we are envisioning an approach that integrates generic intra-word and inter-word statistical models, optional language-specific features, and explicit weighted corrections. We believe we could mine redirects flagged as typo correction for explicit corrections, and the query reformulation mining (#3) would also provide frequency-weighted explicit corrections. Our hope is that a system built initially for English would be readily applicable to other alphabetic languages, most probably other Indo-European languages, based on statistics available from Elastic; and that some elements of the system could be applied to other non-alphabetic languages and languages that are typologically dissimilar to Indo-European languages.[1]
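As a very rough illustration of the intended shape of the spelling system, here is a minimal sketch (in Python, with made-up data) that tries explicit weighted corrections first and falls back to a simple statistical model over words known from the index. Every name and number here is a placeholder, not a proposed implementation; an inter-word (bigram) component, as sketched under query rewriting below, would layer on top of this.

```python
# Minimal sketch of the proposed spelling-correction flow (illustrative only).
# Assumes two hypothetical inputs: explicit weighted corrections mined from
# typo-flagged redirects or query reformulations, and a word-frequency list
# from the index for a statistical fallback.

from collections import Counter

explicit_corrections = {        # hypothetical mined data: typo -> (fix, weight)
    "nississippi": ("mississippi", 42),
}
word_freq = Counter({"mississippi": 12000, "missive": 300, "miss": 9000})

def edits1(word):
    """All strings one edit away from `word` (deletes, swaps, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def suggest(word):
    word = word.lower()
    if word in explicit_corrections:          # explicit, weighted corrections first
        return explicit_corrections[word][0]
    if word in word_freq:                     # already a known word
        return word
    candidates = edits1(word) & word_freq.keys()   # statistical fallback
    return max(candidates, key=word_freq.get, default=word)

print(suggest("Nississippi"))   # -> mississippi
```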
Looking at the rest of the list, (a) wrong keyboard detection seems like something we should work on internally, since we already have a few good ideas on how to approach it. (b) Acronym support is a pet peeve for several members of the team, and seems to be straightforward to improve. (c) Automatic stemmer building and (d) automatic stop word generation aren't so much projects we should work on as things we should research to see if there are already tools or lists out there we could use to make the projects much easier.
Scoring Matrix
Project | Tech | UI | Impact | Cost | Notes |
Spelling / DYM Improvements | hard | N/A | large | 2 | good scope, distinct from other parts of the system, good for a consultant |
improve "more like" suggestions | hard | N/A | large | 2 | good scope, distinct from other parts of the system, eval is hard, good for a consultant |
wrong keyboard detection | easy/medium | easy | medium | 2.5 | easy for us to work on; small impact for most, but large impact for some |
ignore completion prefixes | easy/medium | easy | small/medium | 3.5 | determine low information title/redirect prefixes and also index titles without them. E.g., "List of" |
query expansion | medium/hard | N/A | medium | 3.5 | needs better scope |
Proper acronym support | easy/medium | N/A | small/medium | 3.5 | easy for us to work on; at the same time fix the word_break_helper and properly support N.A.S.A. == NASA (get rid of poor hacks such as this) |
query reformulation mining | hard | easy | medium | 4 | for eventual use for spelling correction or synonyms; needs separate work to put to good use |
automatic stemmer building | hard | N/A | medium | 4 | (needs research; may be able to use existing tools) |
entity recognition | medium | N/A | small/medium | 4 | needs better scope |
diversity reranking | medium | N/A | small/medium | 4 | |
community built thesaurus | hard | hard | medium/large | 4 | (lots of non-technical issues and needs buy-in from the communities) |
related results | medium | easy | small/medium | 4 | |
noun phrase indexing | medium/hard | N/A | small/medium | 4.5 | |
link analysis | medium/hard | N/A | small/medium | 4.5 | |
automatic stop words | medium | N/A | small | 5 | (needs research; there may be decent lists out there, should definitely use those first) |
phonetic search | medium | easy | small | 5 | |
The Possibilities are Endless!
- Phonetic search: either as a second-try search (i.e., when there are few or zero results) or as a keyword. Probably limited to titles only. Language-dependent. T182708
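To make the phonetic-search idea concrete, here is a simplified Soundex function; a phonetic key like this could be indexed alongside each title and matched as a second try. This is only a sketch: a real deployment would more likely use a language-appropriate algorithm (Double Metaphone, Beider-Morse, etc.) in the analysis chain.

```python
# A simplified Soundex implementation, just to illustrate indexing a phonetic
# key alongside each title and matching on it as a second try. (The full
# standard has a couple of extra edge-case rules not reproduced here.)

CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    # code every letter: digit for coded consonants, "0" for vowels/y, "" for h/w
    codes = [CODES.get(c, "" if c in "hw" else "0") for c in letters]
    out, prev = [], None
    for d in codes:
        if d and d != prev:
            out.append(d)
        if d:                      # h/w don't reset the previous code
            prev = d
    if letters[0] in CODES:        # the first letter is kept as a letter, not a digit
        out = out[1:]
    digits = [d for d in out if d != "0"]          # drop the vowel markers
    return (letters[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))   # both -> S530
```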
- Parsing / part-of-speech tagging (POS tagging usually involves less structural information). Language-dependent.
- can help with word-sense disambiguation, detection of noun phrases for noun-phrase indexing and entity recognition.
- Automatic stemmer building: explore a general framework for automatic stemmer building, esp. using Wiktionary data
- Semi-automated morphological tools: I was thinking that some tools to do conjugations and declensions would be handy for Wiktionary. Then I remembered that templates are arguably Turing complete (or close enough), so of course such things already exist for English and lots of other languages on English Wiktionary.
- Noun-phrase indexing: index more complex noun phrases in addition to or instead of the parts of the noun phrase. Can disambiguate some nouns by making them more specific, can provide better matches to people, places, or things. Detecting noun-phrases in queries could be much harder, so looking for n-grams that are indexed as phrases instead would be one approach.
- Various techniques could be used to limit what actually gets indexed, like TF and IDF ranges, or scoring all candidates and only indexing the top n.
- Could be generalized to phrasal indexing based on something other than purely syntactic considerations.
- Could use page titles and redirects as phrase candidates (perhaps w/ minimum IDF score).
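A small sketch of the candidate-filtering idea above: given candidate phrases (say, titles and redirects) and a corpus, keep only those whose document frequency falls in a configurable range. The thresholds and the crude substring-based DF count are placeholders.

```python
# Sketch of filtering phrase candidates (e.g., titles and redirects) by how
# many documents contain them, per the TF/DF-range idea above. The thresholds
# and candidate list are hypothetical placeholders.

def select_phrases(candidates, documents, min_df=2, max_df_ratio=0.9):
    """Keep candidate phrases that are neither too rare nor too common."""
    docs_lower = [d.lower() for d in documents]
    n_docs = len(docs_lower)
    selected = {}
    for phrase in candidates:
        df = sum(phrase.lower() in d for d in docs_lower)   # crude substring DF
        if min_df <= df <= max_df_ratio * n_docs:
            selected[phrase] = df
    return selected

docs = ["The John F. Kennedy School of Government is in Cambridge.",
        "John F. Kennedy was the 35th president of the United States.",
        "The word list appears in many articles."]
candidates = ["John F. Kennedy", "John F. Kennedy School of Government", "list"]
print(select_phrases(candidates, docs))   # -> {'John F. Kennedy': 2}
```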
- Entity recognition, classification, and resolution: Recognize entities (possibly restricted to named entities), classify them (people, places, companies, other organizations, etc.), and determine whether different identified entities are the same. (language-dependent)
- Improve recall by recognizing that "Jack Kennedy" and "John F. Kennedy" are the same person.
- Improve precision by recognizing that the fifteen instances of "Kennedy" in an article are all about Bobby Kennedy, and so do not represent good term density for a search on "Jack Kennedy".
- Distinguish between "John F. Kennedy", "John F. Kennedy International Airport", "John F. Kennedy School of Government", and "John F. Kennedy Center for the Performing Arts" as different kinds of entities.
- A useful input to some recognizers and resolvers is a gazetteer. Extracting such a list of known entities from Wikidata or from Wikipedia by category could be useful (a sketch follows this list).
- Coreference resolution / Anaphora resolution: This is similar to entity resolution, but applies to pronouns. Probably not terribly useful directly, but could be useful as part of topic segmentation, and possibly as a way to increase the relative weight/term density of entities mentioned in a text.
- Relationship extraction: once your entity recognition has found some entities, you can also derive relationships between them. Might be good for document summarization or identifying candidates for related results.
- LTR Features
- Binary: noting that a given phrase/entity has been identified in both the query and a given article.
- Count: how many times does the phrase/entity from the query occur in the article.
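Here is a minimal sketch of the gazetteer idea above: longest-match lookup of known entity phrases in text. The gazetteer entries and types are made up; a real one would be extracted from Wikidata or from category membership.

```python
# Sketch of simple gazetteer-based entity recognition (longest match wins),
# using a tiny hand-made gazetteer in place of one extracted from Wikidata.
# Tokens are normalized by stripping trailing punctuation.

GAZETTEER = {   # hypothetical entries: normalized phrase -> entity type
    "john f kennedy": "PERSON",
    "john f kennedy international airport": "FACILITY",
    "john f kennedy school of government": "ORGANIZATION",
}
MAX_LEN = max(len(p.split()) for p in GAZETTEER)

def tag_entities(text):
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    entities, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):   # longest match first
            phrase = " ".join(tokens[i:i + n])
            if phrase in GAZETTEER:
                entities.append((phrase, GAZETTEER[phrase]))
                i += n
                break
        else:
            i += 1
    return entities

print(tag_entities("She flew into John F. Kennedy International Airport."))
# -> [('john f kennedy international airport', 'FACILITY')]
```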
- Word-sense disambiguation
- assign specific sense to a given instance of a word. For example, bank could be a financial institution or the edge of a river, but in the phrase "money in the bank" it's clearly the financial institution meaning.
- can map to senses defined in a lexicon, or to ad hoc senses derived via un- or semi-supervised algorithms like Word2vec or similar. Ad hoc algorithmically defined senses are not transparent, but can be more specific than lexical senses. For example, the "bank" in "West Bank" refers to a river bank, but its use in a proper noun is less about geomorphology and more about politics.
- can improve precision by searching for the relevant sense of a word, rather than the exact string.
- related to semantic search and semantic similarity
- Use a Thesaurus
- Automatically building a thesaurus:
- mine query logs for synonyms: different queries leading to the same click are evidence of synonymy
- mine redirects for synonyms: different words in the titles are evidence of synonymy (see the sketch after this list)
- probably language independent; though may need to consider n-grams or other elements for languages without analyzers
- could expand beyond pure synonymy to "expanded search" with Word2vec or similar
- Community-built thesaurus: Allow the communities to define synonym groups on a wiki page somewhere, and regularly harvest and enable them. (Needs a lot of community discussion and some infrastructure and process to deal with testing, finding consensus, edit warring, etc. But I could imagine "relforge lite" would allow people to see approximate results by ORing together synonyms.) Language-dependent, possibly wiki-dependent, though same-language projects could borrow from each other.
- Various techniques for automated thesaurus building could be used to suggest candidates for community review, including smarter hyphen processing options.
- Some additional considerations:
- Should a thesaurus always be on, or should it have to be invoked? Do we have multiple levels of thesaurus, some on by default, some only used when "expanded search" is invoked, and some way ("term" or verbatim:term) of disabling all thesaurus terms?
- Thesaurus for Unicode characters: Automatically add Unicode character names to a thesaurus, so ☥ == ankh, € == euro, etc., based on Unicode character name descriptions. Translate to other languages via Wikidata, or other sources. Are there official Unicode names in other major world languages? Candidates may need some sort of review, and we'd need to decide what level of phrasal matching is required (e.g., "goofy face" for 🤪: is it a single token or two words? Do textual matches have to be an exact phrase match, or does goofy smiley face count? Etc.) (From a discussion on T211824.)
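A sketch of the redirect-mining idea above: when a redirect title and its target differ in exactly one word position, count that word pair as a synonym candidate for review. The redirect data here is invented for illustration.

```python
# Sketch of mining redirects for synonym candidates: if a redirect title and
# its target title differ in exactly one word position, treat that word pair
# as a candidate synonym. The redirect pairs are made up.

from collections import Counter

redirects = [                      # (redirect title, target title), hypothetical
    ("Automobile racing", "Auto racing"),
    ("Car racing", "Auto racing"),
    ("Crimson king maple", "Crimson King"),   # different lengths -> ignored here
]

candidates = Counter()
for source, target in redirects:
    s, t = source.lower().split(), target.lower().split()
    if len(s) == len(t):
        diffs = [(a, b) for a, b in zip(s, t) if a != b]
        if len(diffs) == 1:                    # exactly one differing word
            candidates[frozenset(diffs[0])] += 1

for pair, count in candidates.most_common():
    print(sorted(pair), count)
# -> ['auto', 'automobile'] 1  /  ['auto', 'car'] 1
```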
- (Semi-)Automatically finding stop words: get a ranked list of words by max DF or by IDF and pick a cut off (hard) or get speaker review (less hard). Language-dependent.
- Or finding existing lists, with a usable license
- Or decide that stop words are overrated.
- T56875
- See also these stop word lists with a BSD-style license.
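As a sketch of the DF-based approach, the function below ranks words by the fraction of documents they appear in and returns those above a cutoff as stop word candidates for speaker review. The corpus and cutoff are placeholders.

```python
# Sketch of generating stop word *candidates* by document frequency, intended
# for speaker review rather than automatic use. Corpus and cutoff are placeholders.

from collections import Counter
import re

def stopword_candidates(documents, df_cutoff=0.9):
    """Return words that appear in more than `df_cutoff` of all documents."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(re.findall(r"\w+", doc.lower())))   # count each word once per doc
    return [(w, c / n_docs) for w, c in df.most_common() if c / n_docs > df_cutoff]

docs = ["the cat sat on the mat", "the dog chased a cat", "a bird in the hand"]
print(stopword_candidates(docs))   # -> [('the', 1.0)]
```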
- Query rewriting (a superset of query expansion): the automatic or suggested modification of a user's query. (This is also sometimes called "query reformulation", but here I'm going to use that term to refer to users modifying their own queries; see below.) This includes:
- Spelling correction: not only fixing obvious typos, but also correcting typos that are still valid words. For example, while fro is a perfectly fine word, in "toys fro tots" it is probably supposed to be for. Statistical methods based on query or document word bigrams can try to detect and correct such typos (a sketch follows this list). Many techniques are language-independent.
- Parsing might be able to detect unexpected parts of speech, or evaluate the syntactic quality of a suggested repair.
- A reversed index would also allow us to make repairs in the first couple of letters of a word (which we currently can't do, so neither Did You Mean (DYM) nor the completion suggester can correct Nississippi to Mississippi).
- Using word sense disambiguation (in longer queries) to search only for a particular sense of a word. For example, in river bank, we can exclude or reduce the score of results for banks as financial institutions.
- Suggesting additional search terms (in shorter queries) to disambiguate ambiguous terms or refine a query. Could be based on query log mining (term X often co-occurs with term Y), document neighborhoods (add the most common co-occurring term from each of n neighborhoods in which term X most frequently occurs), or other techniques. Language-independent.
- This also includes some kinds of information that we handle at indexing time, rather than at query time, like stemming and using synonyms from a thesaurus.
- See also Completion suggester improvements for related ideas for improving queries while the user is typing.
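Here is a sketch of the word-bigram idea from the spelling-correction item above: a valid-but-wrong word is flagged when a close alternative fits its neighbors much better. The bigram counts and confusion sets stand in for statistics mined from queries or article text.

```python
# Sketch of real-word typo detection via word bigrams: flag a valid word when
# a close alternative fits the surrounding words much better. The counts and
# confusion sets below stand in for mined statistics.

bigrams = {                     # hypothetical bigram counts
    ("toys", "for"): 500, ("for", "tots"): 800,
    ("toys", "fro"): 0,   ("fro", "tots"): 0,
}
confusions = {"fro": ["for", "from"], "form": ["from", "for"]}   # hypothetical confusion sets

def detect_real_word_typos(query, min_ratio=10):
    words = query.lower().split()
    fixes = []
    for i, w in enumerate(words):
        left = words[i - 1] if i > 0 else None
        right = words[i + 1] if i + 1 < len(words) else None
        def score(cand):
            return bigrams.get((left, cand), 0) + bigrams.get((cand, right), 0) + 1
        best = max(confusions.get(w, []), key=score, default=None)
        if best and score(best) > min_ratio * score(w):
            fixes.append((w, best))
    return fixes

print(detect_real_word_typos("toys fro tots"))   # -> [('fro', 'for')]
```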
- Query reformulation mining: detecting when users modify their own queries to try to get better results and mining that information.
- Mine logs for sequential queries that are very similar (at the character level or at the word level). At the character level, might imply spelling correction. At the word level, might imply synonyms. Probably language-independent.
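A sketch of the log-mining step, assuming a hypothetical (session, timestamp, query) log format; difflib's ratio stands in for a proper edit-distance measure.

```python
# Sketch of query reformulation mining: within a session, count pairs of
# consecutive queries that are nearly identical at the character level.
# The log format is hypothetical; difflib stands in for a real edit distance.

from collections import Counter, defaultdict
from difflib import SequenceMatcher

log = [   # (session_id, timestamp, query) -- made-up sample data
    ("s1", 1, "nississippi river"), ("s1", 2, "mississippi river"),
    ("s2", 5, "jfk airport"),       ("s2", 9, "john f kennedy airport"),
]

sessions = defaultdict(list)
for session, ts, query in sorted(log):
    sessions[session].append(query)

pairs = Counter()
for queries in sessions.values():
    for q1, q2 in zip(queries, queries[1:]):
        if 0.8 <= SequenceMatcher(None, q1, q2).ratio() < 1.0:   # similar but not identical
            pairs[(q1, q2)] += 1

print(pairs.most_common())
# e.g. -> [(('nississippi river', 'mississippi river'), 1)]
```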
- Breaking down compounds to index. Less applicable in English, more applicable in German and other languages. Language-dependent.
- Smarter hyphen processing, e.g., equating merry-go-round and merrygoround. Could be done through a token filter that converts hyphenated forms to non-hyphenated forms (language-independent). Could be done via thesaurus for specific words, curated by mining candidates (language-dependent curation, but language-independent creation of the candidate list) or automatically determined based on some threshold, e.g., both forms occur in the corpus n times, where n ≥ 1 (language-independent).
- Document summarization: build an automatic summary of a document. The simplest approach chooses sentences or phrases from the existing document based on TF/IDF-like weighting and structural information; much cleverer approaches try to synthesize more compact summaries by parsing the text and trying to "understand" it.
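A sketch of the threshold idea just above: a hyphenated token becomes a normalization candidate if its dehyphenated form is also attested at least n times. The token counts are made up.

```python
# Sketch of mining hyphen-normalization candidates: a hyphenated token is a
# candidate if its dehyphenated form also occurs at least `n` times in the
# corpus. Token counts here are made up.

from collections import Counter

token_counts = Counter({           # hypothetical corpus counts
    "merry-go-round": 120, "merrygoround": 15,
    "e-mail": 300, "email": 5000,
    "x-ray": 400, "xray": 0,
})

def hyphen_candidates(counts, n=1):
    mapping = {}
    for token, count in counts.items():
        if "-" in token and count >= n:
            joined = token.replace("-", "")
            if counts.get(joined, 0) >= n:
                mapping[token] = joined      # candidate for a mapping/token filter
    return mapping

print(hyphen_candidates(token_counts, n=10))
# -> {'merry-go-round': 'merrygoround', 'e-mail': 'email'}
```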
- Could weight summary based on query or other keywords.
- Simple approach could use additional information from noun-phrase indexing, entity recognition, word-sense disambiguation, topic segmentation, or other NLP-derived info to improve weighting of sentences/phrases.
- Semi-clever approach could try to parse info boxes and other commonly used templates to construct summary info.
- Could be an API that allows users to request a summary version of a document at a specified percentage (e.g., 25%) or a specified length (e.g., 1000 characters).
- A UI supporting a slider that grows/shrinks the summary is possible.
- Multi-document summaries are difficult, but could in theory provide an overview of what is known on a topic from across multiple documents.
- for example, summarize a topic based on the top n search results
- multi-doc summaries, entity recognition, and topic segmentation could allow pulling together all the information Wikipedia has on a topic about person/place X, even though it is scattered across multiple articles and there is no article on X.
- Simple approach is roughly language-independent; adding weighting by query/keywords is possibly quasi-language-independent, in that it may only involve one parameter: use tokens (e.g., English) or use n-grams (e.g., Chinese). Clever approaches are probably language-dependent (the more clever, the more likely to be language-dependent). Template parsing is wiki-dependent.
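For concreteness, here is a sketch of the simple extractive approach: score sentences by the rarity of their words (an IDF-like weight), optionally boost query terms, and return the top-scoring sentences in document order. The sentence splitter and weights are deliberately naive.

```python
# Sketch of the simple extractive approach: score each sentence by the rarity
# of its words (an IDF-like weight), optionally boosting query terms, then
# return the top-scoring sentences in their original order.

import math
import re
from collections import Counter

def summarize(text, query="", ratio=0.3):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    freq = Counter(words)
    query_terms = set(re.findall(r"\w+", query.lower()))

    def weight(w):
        idf_like = math.log(len(words) / freq[w])          # rarer words weigh more
        return idf_like * (3 if w in query_terms else 1)   # boost query terms

    scored = []
    for idx, sent in enumerate(sentences):
        sent_words = re.findall(r"\w+", sent.lower())
        if sent_words:
            scored.append((sum(weight(w) for w in sent_words) / len(sent_words), idx))

    keep = max(1, int(len(sentences) * ratio))
    chosen = sorted(idx for _, idx in sorted(scored, reverse=True)[:keep])
    return " ".join(sentences[i] for i in chosen)

# usage (article_text is whatever page text is handy):
# summarize(article_text, query="river bank", ratio=0.25)
```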
- Semantic search / Semantic similarity: these are broad topics, and many of the sub-components are touched on throughout.
- Document neighborhoods / ad-hoc facets: several approaches could be used to define document neighborhoods, and the neighborhoods could be used for several things. The basic idea is to find either n clusters of documents or clusters of m documents that are similar in some way. I'm calling these "neighborhoods" because "clusters" gets used for many, many things.
- defining neighborhoods: any similarity metric and clustering algorithm can be used to cluster documents. Some similarity metrics: TF/IDF keyword vectors, Word2vec/Doc2vec, latent semantic analysis, cluster pruning, etc.
- sqrt(N) seems like a good heuristic for number of clusters if you have no other basis for choosing
- could assign docs to the single nearest neighborhood, or to all neighborhoods within some distance.
- could define multiple levels of neighborhood
- implementing neighborhoods: most of the candidate metrics are vector-based, and storing vectors in Elasticsearch is probably impractical; creating a new field called, say, "nbhd" and storing an arbitrary token in it is at least plausible (though, based on neighborhood size, it could still cause problems with skewed indexes). (See the sketch after this list.)
- using neighborhoods: some use cases
- increasing recall: assign a query to one or more neighborhoods and return all documents in the neighborhood(s) as potential matches. Probably requires new ways of scoring. Might want to limit neighborhood size (rather than number of neighborhoods) in this use case.
- could be used for diversity reranking
- LTR features: learning-to-rank could take neighborhood info into account for ranking. Possible features include:
- Binary value for "neighborhood match" between query and doc; each could have one neighborhood, or a short list of neighborhoods (multiple matches or hierarchical neighborhoods)
- Weighted value for "neighborhood match" between query and doc, given multiple neighborhoods each: could be # overlap in top-5 neighborhoods, hierarchical neighborhood, rank of best match, etc.
- Nominal values of query and document: e.g., LTR could learn that documents in nbhd321 are slightly better results for queries in nbhd017. Sparsity of data and stability of neighborhood labels are issues.
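A minimal sketch of the "nbhd" token idea, assuming scikit-learn is available: TF/IDF vectors, k-means with k ≈ sqrt(N), and one neighborhood token per document that could go into a keyword-type field.

```python
# Minimal sketch of the "nbhd" token idea, assuming scikit-learn is available:
# TF-IDF vectors, k-means with k ~ sqrt(N), and one neighborhood token per
# document that could be indexed in a keyword-type field.

from math import isqrt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def neighborhood_tokens(documents):
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    k = max(2, isqrt(len(documents)))            # sqrt(N) heuristic from above
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    return [f"nbhd{label:03d}" for label in labels]

docs = ["river bank erosion", "bank interest rates", "savings bank loans",
        "flood on the river bank"]
print(neighborhood_tokens(docs))   # e.g. -> ['nbhd001', 'nbhd000', 'nbhd000', 'nbhd001']
```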
- Search by similarity
- use the built-in morelike or our own formulation for similarity (including document neighborhoods)
- add a "more like this" / "fewer like this" query-refinement option on the search results page
- have an interface that allows you to input a text snippet and find similar documents
- match documents and editors
- find collaborators to work on a given page by finding people who have made edits to similar pages
- less creepily, find pages to edit based on edits you've already made (plus, say, the quality score of the article)
- possibly weighted by number of edits or number of characters contributed to edited pages
- could be useful for category tools
- Category tools
- use the similarity measures from search by similarity
- find docs like other docs in this category for category suggestion
- probably exclude docs already in sub-categories
- maybe add emphasis on docs in super-categories
- find categories with similar content to suggest mergers
- cluster large category contents into groups to suggest splits/sub-categories
- Diversity reranking: promoting results that increase the diversity of topics in the top n results. For example, the top ten results for the search bank might all be about financial institutions; promoting one or more results about the edges of rivers would improve result diversity. Needs some document similarity measure; see search by similarity and document neighborhoods and the eBay blog post linked to above.
- Could apply to full-text search or the completion suggester.
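A sketch of diversity reranking in the maximal-marginal-relevance style, using a crude word-overlap similarity as a stand-in for one of the document similarity measures discussed above.

```python
# Sketch of diversity reranking in the maximal-marginal-relevance style:
# repeatedly pick the result that balances its original relevance against its
# similarity to results already chosen. Jaccard over words is a stand-in for a
# real document similarity measure.

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def diversify(results, top_n=3, lambda_=0.7):
    """`results` is a list of (text, relevance) pairs, already sorted by relevance."""
    remaining = list(results)
    chosen = []
    while remaining and len(chosen) < top_n:
        def mmr(item):
            text, relevance = item
            max_sim = max((jaccard(text, c) for c, _ in chosen), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * max_sim
        best = max(remaining, key=mmr)
        chosen.append(best)
        remaining.remove(best)
    return chosen

results = [("Bank of America", 0.95), ("Bank of England", 0.90),
           ("Bank of China", 0.88), ("River bank erosion", 0.80)]
for title, score in diversify(results):
    print(title)
# -> Bank of America / River bank erosion / Bank of England
```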
- Zone indexes: index additional fields or zones (which I'll refer to generically as zones). Possible pre-defined zones include title, opening text, see also, section titles, subsection titles, section opening text, and frequently used sections like citations, further reading, references, external links, or quotations (in Wiktionary), as well as captions, general aux text, other sections, and category names. Depending on which zones, some are language-dependent (references) and some are not (section titles).
- could use topic segmentation to automatically find additional topic zones.
- zone indexes could be exposed as keywords (like intitle, e.g., search in section titles)
- LTR features:
- zone relevance score (see below) for particular zones
- what zone hits are from (e.g., section title > references)
- term proximity with respect to zones; e.g., all hits are within one topic zone or one subsection is better than if hits for different query terms are all in different zones.
- as a keyword or LTR feature, it's possible to calculate a zone-specific relevance score, such as TF/IDF/BM25-type scores, etc. For example, "wars involving" is not useful category text on the WWII article.
- Completion suggester improvements
- determine low-information title/redirect prefixes (like "List of") and index pages with and without such prefixes
- find other entities or noun phrases (see noun-phrase indexing and entity recognition) in titles/redirects and also index the "best" of those
- n-gram-based word-level completion: predict/suggest the next few words in a query based on the last few words the user has typed, when the whole query isn't matching anything useful (a sketch follows this list)
- make spelling correction suggestions per-word (which might be different from matching a title with one or two errors)
- (We'll have to think carefully about the UI if we want to show title matches, spelling corrections, and next-word suggestions.)
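A sketch of the n-gram completion idea above: bigram counts over titles (or logged queries) drive next-word suggestions. The title list is a stand-in for real data, and the UI questions above still apply.

```python
# Sketch of n-gram-based next-word completion: bigram counts over titles (or
# logged queries) drive "next word" suggestions when the typed prefix itself
# isn't matching a title. The title list is a stand-in for real data.

from collections import Counter, defaultdict

titles = ["list of sovereign states", "list of countries by population",
          "list of countries by area", "history of the united states"]

next_words = defaultdict(Counter)
for title in titles:
    words = title.lower().split()
    for w1, w2 in zip(words, words[1:]):
        next_words[w1][w2] += 1

def suggest_next(typed, k=3):
    last = typed.lower().split()[-1]
    return [w for w, _ in next_words[last].most_common(k)]

print(suggest_next("list of countries by"))   # -> ['population', 'area']
```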
- Question answering (language-dependent)
- Shallow version: dropping question words or question phrasing (see T174621) and hoping the improved results do the job (see also Trey's Notes)
- Deep question answering involves converting the question to a query and trying to return specific answers by parsing page results OR converting to SPARQL and getting a Wikidata answer.
- 20-questions-style UI for Wikidata: helping people find something they know something about, but can't quite put their finger on
- Basic/static version: split the universe into the n (≤10) most "distinctive" categories, have the user select one to continue, and iterate on the specified subset. You should be able to get to anything in ~10 steps. At any point when there are fewer than m (50 ≤ m ≤ 100) results, show them all. Splits could be re-computed monthly. (Computation is language-independent, presentation is language-dependent.)
- Advanced/dynamic version: allow one or more selections at any level; for certain categories (e.g., "person") allow specification of likely-known information (birthdate range, date of death range, country of origin, gender, etc.). These could be manually defined for the most obvious categories, or they could be inferred based on the number of items in a category that have those values, and what kinds of values they are. Dynamically determine "distinctive" subcategories on the fly based on current constraints (could be very compute-intensive, depending on algorithms available). (Topic-dependent)
- Feedback option: allow people to mark categories as unhelpful, and then either don't show them for that session, for that user, or for that main subcategory, or generally mark them as dispreferred for all users (much testing needed!). (Computation is language-independent, presentation is language-dependent.)
- Link analysis: using incoming and outgoing links to improve search results
- Could be on-wiki link text, or using something like Common Crawl
- Incoming link text could provide additional terms to index the document by
- Targets of outgoing on-wiki links could also provide additional terms to index
- Outgoing links could highlight important terms for the document that should be weighted more heavily
- Outgoing links could also help identify entities (see entity recognition)
- Topic segmentation: Identifying topic shifts within a document can help break it up into "sub-documents" that are about different things. This is useful for document summarization, including giving better snippets. Other uses might include indexing sub-document "topics" rather than whole long documents, and scoring particular sections of a document for relevance rather than entire documents; see zone indexes.
- The same information used to detect topic shifts can be used across documents for similarity detection, which can be used for clustering or diversity reranking
- Sentiment analysis: detect whether a bit of text indicates a positive, neutral, or negative opinion. Could be useful for topic segmentation, and also for reversing the emotional polarity of queries. An example I saw for non-encyclopedic searches was "are reptiles good pets?", which should have the same results as "are reptiles bad pets?" (modulo the intent of the searcher to find one preferred answer over the other). Something like Word2vec could possibly turn positive sentiment terms into negative sentiment terms to search for dissenting opinions.
- Doesn't seem great for any obvious encyclopedia queries, but maybe.
- Might be good for finding dictionary quotation examples with a given polarity.
- Might be good for finding relevant sections on Wikivoyage.
- Learning-to-rank (LTR) improvements
- Additional features for LTR: see LTR features in (or following) document neighborhoods, zone indexes, noun-phrase indexing, entity recognition.
- Better query grouping for LTR training: several techniques could be used to improve our ability to group queries for LTR training beyond the current language analysis. Query reformulation mining (or a related thesaurus) could help identify additional queries that are "essentially the same".
- Speech recognition and Optical character recognition: The most obvious use for these is as input techniques. Users could speak a search, or upload a photo (or live camera feed) of text in order to search by similarity or try to find a particular document (say, in Wikisource). (Language-dependent)
- Text to speech: Reading results or result titles aloud. (Language-dependent)
- Speech recognition and text-to-speech can improve accessibility, but users who need those technologies may already have them available on their devices/browsers.
- Language generation: This most likely applies to document summarization from a non-textual source, like an info box. (Language-dependent)
- Language Identification: We already use language identification for poorly performing queries on some Wikipedias, and will show results, if any, from the wiki of the identified language.
- extend language identification for poorly performing queries to more languages and/or more projects, possibly in a more general, less fine-tuned way (Trey's notes)
- wrong keyboard detection: notice when someone has switched their language but not their keyboard and typed what looks like gibberish (T138958, T155104, Trey's notes); a sketch follows this list
- using existing language ID on poorly performing queries, similar to current language ID
- using some other statistical methods for adding additional tokens or giving Did You Mean (DYM) suggestions
- in-document language identification: detect sections of texts that are in a language other than the main language of the wiki, and treat that text differently
- block unneeded/nonsensical language analysis
- annotate it as being in the detected language and allow searching by annotations
- potentially much easier and more accurate than for queries because of larger sample sizes
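Here is a sketch of wrong keyboard detection for the common Latin/Cyrillic case mentioned above: map each Latin letter to the Cyrillic letter on the same key of the standard Russian (ЙЦУКЕН) layout and offer the remapped string as a suggestion when the original query performs poorly. Only letter keys are mapped, and the detection heuristic is deliberately crude.

```python
# Sketch of wrong-keyboard detection for the common QWERTY <-> ЙЦУКЕН case:
# remap each Latin letter to the Cyrillic letter on the same physical key and
# offer the result as a suggestion when the original query got few/zero results.
# Only letter keys are mapped here; punctuation keys are omitted for brevity.

QWERTY = "qwertyuiopasdfghjklzxcvbnm"
JCUKEN = "йцукенгшщзфывапролдячсмить"
TO_CYRILLIC = str.maketrans(QWERTY + QWERTY.upper(), JCUKEN + JCUKEN.upper())

def wrong_keyboard_suggestion(query):
    """Suggest a Cyrillic remapping if every letter in the query is mappable."""
    letters = [c for c in query if c.isalpha()]
    if letters and all(c.lower() in QWERTY for c in letters):
        return query.translate(TO_CYRILLIC)
    return None

print(wrong_keyboard_suggestion("ghbdtn vbh"))   # -> "привет мир" ("hello world")
```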
- Statistical alignment for transliteration: deduce multilingual transliteration schemes from Wikidata names data:
- use data for moderately well-known people and places (to avoid the Germany/Deutschland/Allemagne/Niemcy problem), see examples: Q198369, Q940330, Q1136189
- could be useful for maps when no manual transliteration/translation fallback is available
- could be used to create index-only redirects for names in a known language
- could provide suggestions (for later verification) for missing Wikidata labels
- Index-only redirects: Rather than doing something like creating a bot to automatically add useful redirects, we could create a mechanism for generating index-only redirects that are indexed like redirects, but which don't actually exist outside the index. They would be generated at index time, and could be expanded or removed to accommodate weaknesses in full-text search or the completion suggester.
- Find commonish typos that neither Did You Mean (DYM) nor the completion suggester can correct, and automatically add index-only redirects for those variants. Improvements to the completion suggester, say, might make some index-only redirect generation unnecessary. Existing redirects could be mined for candidates.
- Provide translated/transliterated redirects for named people and places, either from Wikidata or automatically (see statistical alignment for transliteration). Automatic transliteration would not happen when Wikidata labels were available. Over time, specific index-only redirects might disappear because an incorrect automatic redirect was replaced with a manual Wikidata one.
- Better redirect display: Currently on English Wikipedia, "Sophie Simmons" redirects to a section of the article about her father, "Gene Simmons", which seems to indicate that this is probably the best article about her. However, because there is some overlap in the redirect and title ("Simmons"), the redirect text isn't shown in the full-text results, which makes it look like the "Gene Simmons" article is only a mediocre match, instead of an exact title match to a redirect. On the other hand, "Goerge Clooney" redirects to "George Clooney", and maybe it isn't necessary to show that redirect. Perhaps some other metric for similarity could serve as a better gate on whether or not to show the redirect text along with "redirect from". Probably script-dependent (similarity in spaceless languages like Japanese and Chinese might behave differently) and possibly language-dependent (highly inflected languages might skew similarity).
- Cross-language information retrieval: How to search in one language using another? Especially taking into account that some wikis are much more developed than others?
- Machine translation of the search and/or the results would allow people to get info in a language they don't understand when no info is available in the wikis they can read.
- Cross-language query lookup: As mentioned in statistical alignment for transliteration, Wikidata is a good source of cross-language mappings for named entities, either for translated index-only redirects, or for cross-language query lookup (a sketch follows this list). For example, if someone on English Wikipedia searches for Moscou (Moscow in French), either look up Moscow on English Wikipedia, or redirect them to Moscou on French Wikipedia.
- This is actually an example of why this could be very hard: Moscou actually matches more than 10 languages! In the other direction, Джон Смит (the Russian transliteration of John Smith) matches a lot of people; in this case they are all named "John Smith" in English, but there must be more ambiguous examples out there.
- Other second-chance search schemes:
- For poorly performing queries, search the user's second-choice (and third, etc.?) language wiki, specified in user prefs, via accept-language headers, etc.
- Search across all of the user's specified languages (up to some reasonable limit) at once.
- Highlight links to the same article in other languages, i.e., resurrect part of "explore similar" (see T149809, etc.; see also related results)
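A sketch of the cross-language query lookup idea above, using Wikidata's wbsearchentities and wbgetentities API modules; the parameters are from memory and would need checking, and a real version would need batching, error handling, and some way to rank the (often many) matches.

```python
# Sketch of cross-language query lookup against Wikidata: find entities whose
# label in the searcher's other language matches the query, then pull the
# label in the wiki's language. Parameters are from memory and only lightly
# checked; production code would need error handling and ranking.

import requests

API = "https://www.wikidata.org/w/api.php"

def cross_language_lookup(query, query_lang="fr", target_lang="en", limit=3):
    found = requests.get(API, params={
        "action": "wbsearchentities", "search": query,
        "language": query_lang, "format": "json", "limit": limit,
    }).json().get("search", [])
    if not found:
        return []
    ids = "|".join(item["id"] for item in found)
    entities = requests.get(API, params={
        "action": "wbgetentities", "ids": ids, "props": "labels",
        "languages": target_lang, "format": "json",
    }).json().get("entities", {})
    return [(qid, ent.get("labels", {}).get(target_lang, {}).get("value"))
            for qid, ent in entities.items()]

print(cross_language_lookup("Moscou"))   # e.g. -> [('Q649', 'Moscow'), ...]
```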
- Related results: We can automatically generate result links on related topics.
- Pull information from within results pages (especially from info boxes) to provide links to interesting related topics. Options include:
- The "best" links from the result page: could be generically popular, most clicked on by people coming from that page, or links from the part of the page with the highest query-term density (see also zone indexes). Page-specific info would likely only be tracked for popular pages.
- Pull links from info boxes. Could be template-specific; could be based on general popularity, page-specific popularity, or even template-specific popularity (e.g., most-clicked element of this template). Could be manually created by template, or automatically learned (e.g., template-specific popularity above), or a mix. For example: manually curated extraction from the top n templates from the top m wikis plus automated template-specific popularity for everything else; or manually assigned weights for the top n templates from the top m wikis to bootstrap the process, but then new weights are learned.
- For disambiguation pages, use general popularity or page-specific popularity to suggest links to top n items on the page.
- Provide links to the same article in other languages
- Provide links to related pages (using âmore likeâ or other methods to search by similarity)
- Provide links to related categories, or "best" pages from "top" categories (using various metrics)
- (Can you tell I want to resurrect part or all of "explore similar"? See T149809)
- Consider longest-prefix-of-query vs. article-title matching (or maybe longest substring within a query), possibly requiring at least two words. If this gives a good article title match, pop it to the top!
- Consider searching for really, really good matches in some namespacesâespecially the Category and File namespacesâeven if not requested by the user.
- Related queries: We can automatically generate additional possible queries and offer them to searchers, making them more specific, more general, or on related topics.
- More specific searches: add additional keywords to the query; these can be based on various sources: similar queries from other users, redirect mining, query reformulation mining, or keywords extracted from the top n results (as a form of "more like this one").
- More general searches: using on-wiki categories, a lexical database like WordNet or WikibaseLexeme data, or another ontology, offer queries with higher-level terms. For example, a query with cat, dog, and hamster might get reformulated with animal, mammal, or pet in addition to or instead of the original terms.
- Related searches: using an ontology as above, offer query reformulations with different, related terms. So cat, dog, or hamster might generate options with gerbil, turtle, or parrot.
- Analysis chain improvements, general and specific:
- Sentence breaking: Not sure if we need to improve this, but could improve working with zones, and general proximity search.
- Word segmentation:
- Spaceless languages could use word segmentation algorithms, but even languages with spaces could benefit from better treatment of acronyms and initialisms (NASA vs N.A.S.A. vs N A S A; exploded acronyms are rare in English, but do occur)
- Normalization of compounds and hyphenated words (sportsball vs sports-ball vs sports ball); see also breaking down compounds, and smarter hyphen processing
- Abbreviation handling (abbreviation vs abbrev); could be via thesaurus or other mechanism, word-sense disambiguation could also help; abbreviation detection could improve sentence breaking
- Pay attention to capitalization. If a user goes to the trouble to capitalize some letters and not others, maybe it means something. IT isn't necessarily it, and MgH isn't MGH.
- Lots of language-specific improvements.
- Support for non-standard input methods: some people don't have ready access to a keyboard that supports their language fully. It can be a problem for specific characters, like using one (1), lowercase L (l), or uppercase i (I) for the palochka (Ӏ) in Kabardian (see T186401); a strong preference for non-Unicode encoding in the community (e.g., T191535); or just a lack of any keyboard, meaning all input is in transliteration.
- General Machine Learning: We should keep in mind that ORES is available to provide "machine learning as a service." It might be useful in general for us to become familiar with using ORES to do the right kind of number crunching for us.
A Sigh and a Caveat
Whew! Suggestions, questions, comments, etc. are welcome on the Talk page!
I know I have a bias for English and for Wikipedia. I'm better at thinking about other languages, but not so great at thinking about use cases beyond Wikipedia and Wiktionary (my favorite wiki). Lots of these ideas are general, but some are more useful for an encyclopedia than a dictionary, and might not be useful at all for other projects. If you have any ideas for NLP applications on other wikis, please share!
Some general references available online
- https://nlp.stanford.edu/IR-book/
- https://nlp.stanford.edu/fsnlp/
- https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Search
- https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Search
- https://meta.wikimedia.org/wiki/2017_Community_Wishlist_Survey/Search
- https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Why_People_Use_Search_Engines
- ↑ For the burgeoning linguistics nerds, most Indo-European languages don't have extremely complicated morphology that gloms together a lot of elements into one word. See polysynthetic languages for the opposite extreme. Russian, Greek, Hindi, and Persian are all Indo-European languages, and the Greek and Cyrillic alphabets might do fine, while Hindi's Devanagari abugida and Persian's Arabic-script abjad may make them less amenable to the same statistical methods. Spaceless languages, like Chinese (which also has logographic writing), Japanese (mixed logographic and syllabaries), and Thai (abugida), may also have a much harder time using intraword statistical methods.