Yeah, the general case is different from the German daß/dass problem in that "non-word" symbols, like punctuation, are not going to be indexed even if we deal with ß/ss correctly.
> This would do no analysis, no stemming, no normalization.
I can see not doing stemming or normalization, but "analysis" includes tokenization, which is more or less breaking text up into words in English (and much more complex in Chinese and Japanese, for example). Would you want to skip tokenization, too?
Without tokenization, a search for `bot` would return matches for `bot`, `robot`, `botulism`, and `phlebotomy`? Would you want to be able to search on `ing te` and match `breaking text`, but not `breaking  text` (with two spaces between the words)? Would you want searches for `text`, `text,`, `text.`, and `text"` to all give different results? It sounds like the answer is yes, so I'll assume that's the case.
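To make the distinction concrete, here's a rough Python sketch of the difference between raw substring matching and matching against tokenized words. It's a toy simulation with made-up sample documents, not how CirrusSearch or Elasticsearch actually work:

```python
import re

docs = {
    1: "the robot arm",
    2: "a case of botulism",
    3: "breaking text into tokens",
    4: "breaking  text",          # two spaces between the words
    5: "my chat bot is broken",
}

def substring_search(query):
    """Untokenized search: any document containing the raw characters matches."""
    return [doc_id for doc_id, text in docs.items() if query in text]

def token_search(query):
    """Tokenized search: split on non-word characters and require whole-token matches."""
    query_tokens = re.findall(r"\w+", query.lower())
    hits = []
    for doc_id, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        # Look for the query tokens as a contiguous run of whole tokens.
        for i in range(len(tokens) - len(query_tokens) + 1):
            if tokens[i:i + len(query_tokens)] == query_tokens:
                hits.append(doc_id)
                break
    return hits

print(substring_search("bot"))      # [1, 2, 5] -- robot, botulism, and bot all match
print(token_search("bot"))          # [5] -- only the whole word "bot"
print(substring_search("ing text")) # [3] -- matches "breaking text" but not the two-space version
print(token_search("ing text"))     # [] -- "ing" is not a token on its own
```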
The problem is that this kind of search is extremely expensive. For the current insource regex search, we index the text as trigrams (3-character sequences), so `some text` is indexed as `som`, `ome`, `me ` (with a final space), `e t` (with a space in the middle), ` te` (with an initial space), `tex`, and `ext`. We try to find trigrams in a regex being searched to limit the number of documents we have to scan with the exact regex. That's why insource regex queries with only one character, or with really complex patterns with no plain text, almost always time out on English Wikipedia: they have to scan the entire document collection looking for the one character or the complex pattern. But insource queries for `/ing text/` or `/text\"/` have a chance, though apparently matching the trigram `ing` gives too many results in English and the query still times out!
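For illustration, here's a toy Python version of the trigram trick; the real implementation inside the search stack is considerably more sophisticated, so treat this purely as a sketch of the idea:

```python
import re
from collections import defaultdict

def trigrams(text):
    """All overlapping 3-character sequences, spaces included."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

print(sorted(trigrams("some text")))
# [' te', 'e t', 'ext', 'me ', 'ome', 'som', 'tex']

docs = {
    1: "some text here",
    2: "nothing to see",
    3: "plain text files",
}

# Trigram index: trigram -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for gram in trigrams(text):
        index[gram].add(doc_id)

def insource_regex(pattern, literal):
    """Intersect the posting lists of the literal part's trigrams to get a
    candidate set, then run the actual regex only over those candidates."""
    candidates = set(docs)
    for gram in trigrams(literal):
        candidates &= index.get(gram, set())
    print(f"scanning {len(candidates)} of {len(docs)} documents with /{pattern}/")
    return sorted(d for d in candidates if re.search(pattern, docs[d]))

print(insource_regex(r"text", "text"))  # trigrams narrow it to 2 candidates -> [1, 3]
print(insource_regex(r"x", ""))         # no trigrams, so all 3 documents get scanned -> [1, 3]
```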
Indexing every letter (or even every bigram) would lead to incredibly large indexes, with many index entries appearing in millions of documents (most individual letters, all the common short words like `in`, `on`, `an`, `to`, and `of`, and common grammatical inflections like `ed`). Right now you can search for `the` on English Wikipedia and get almost 5.7M hits. It works and doesn't time out because no post-processing of those documents is necessary to verify the hits, unlike a regex search, which still has to grep through the trigram results to make sure the pattern matches.
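As a toy illustration of why short index entries blow up, here's a quick count over a few made-up documents; the numbers are tiny, but the shape of the problem (shorter grams occur in a larger share of documents, so their posting lists are much longer) is the same at Wikipedia scale:

```python
from collections import defaultdict

docs = [
    "the cat sat on the mat",
    "to be or not to be",
    "on the origin of species",
    "an index entry for every letter",
]

def ngrams(text, n):
    """All overlapping n-character sequences, spaces included."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

for n in (1, 2, 3):
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in ngrams(text, n):
            postings[gram].add(doc_id)
    # Shorter grams occur in nearly every document, so each index entry's
    # posting list covers a larger fraction of the collection.
    avg = sum(len(ids) for ids in postings.values()) / len(postings)
    print(f"{n}-grams: {len(postings)} distinct entries, "
          f"{avg:.2f} documents per entry on average")
```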
An alternative might be to do tokenization such that no characters are lost, but the text is still divided into "words" and other tokens. In such a scenario, `text."` would probably be indexed as `text`, `.`, and `"`, and a search for `text."` would not match, say, `context."`. There are still complications with whitespace, and a more efficient implementation that works on tokens (which is what the underlying search engine, Elasticsearch, is built to do) might still match `text . "` and `text."`, because both have the three tokens `text`, `.`, and `"` in a row. A more exact implementation would find all documents with `text`, `.`, and `"` in them, and then scan for the exact string `text."` like the regex matching does, but that would have the same limitations and time outs that the regex matching does.
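Here's a sketch of that scenario, using a hypothetical tokenizer that keeps punctuation as separate tokens but (like real tokenizers) discards whitespace; the function names are made up for illustration, and a token-level phrase match stands in for what Elasticsearch would do:

```python
import re

def tokenize_keep_punct(text):
    """Split into word tokens and individual punctuation tokens;
    whitespace separates tokens but is not itself kept."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_keep_punct('some text."'))   # ['some', 'text', '.', '"']
print(tokenize_keep_punct('some text . "')) # ['some', 'text', '.', '"']  -- identical!

def phrase_match(doc, query):
    """Token-level phrase match: the query tokens must appear as a contiguous run."""
    d, q = tokenize_keep_punct(doc), tokenize_keep_punct(query)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# The punctuation is now indexed, so context." no longer matches text." ...
print(phrase_match('read the context."', 'text."'))  # False
print(phrase_match('read the text."', 'text."'))     # True
# ...but whitespace differences are invisible at the token level, so an exact
# match would still need a final scan of the original text, like the regex path.
print(phrase_match('read the text . "', 'text."'))   # True
```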
Unfortunately, your use cases are just not well supported by a full-text search engine, and that's what we have to work with. I don't think there's any way to justify the expense of supporting such an index, and even if we did build the required indexes, getting rid of time outs and incomplete results would require significantly more servers dedicated to search.
Even Google doesn't handle the 〃 case (Google: `〃 site:en.wikipedia.org`). It drops the 〃 and gives roughly the same results as `site:en.wikipedia.org` (it actually gives a slightly lower results count, 61.3M vs 61.5M, but the top 10 are identical and the top result doesn't contain 〃).
Also, note that Google doesn't find every instance of 〆. The first result I get with an insource search on-wiki is Takeminakata, which has 〆 in the references. The Google results seem to be primarily instances of 〆 all by itself, though there are some others. (I'm not sure what the appropriate tokenization of 〆捕 is, for example, so it may get split up into 〆 and 捕; I just don't know.)
I'm having some technical difficulties with my dev environment at the moment, so I can't check, but indexing 〆 by itself might be possible. It depends on whether it is eliminated by the tokenizer or by the normalization step. I think we could possibly prevent the normalization from normalizing tokens to nothing—which would probably apply to some other characters such as diacritics like ¨—but preventing the tokenizer from ignoring punctuation characters would be a different level of complexity. There are also questions of what such a hack would do to indexing speed and index sizes, so even if it is technically feasible, it might not be practically feasible. I'll try to look at it when my dev environment is back online.
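For what it's worth, the check I'd run is roughly the one below. It assumes a local, stock Elasticsearch instance with the analysis-icu plugin installed and uses a generic `standard` tokenizer + `lowercase`/`icu_folding` chain rather than the actual CirrusSearch analysis chain, so the output is only indicative of where a character gets dropped:

```python
import requests

ES = "http://localhost:9200"

def explain(text):
    """Run _analyze with an explicit tokenizer + filter list and explain=true,
    so Elasticsearch reports the token stream after each step."""
    resp = requests.post(
        f"{ES}/_analyze",
        json={
            "tokenizer": "standard",
            "filter": ["lowercase", "icu_folding"],
            "text": text,
            "explain": True,
        },
    )
    detail = resp.json()["detail"]
    after_tokenizer = [t["token"] for t in detail["tokenizer"]["tokens"]]
    after_filters = [t["token"] for t in detail["tokenfilters"][-1]["tokens"]]
    return after_tokenizer, after_filters

for text in ["〆", "〆捕", "daß"]:
    tok, filt = explain(text)
    # If the character is already missing after the tokenizer, it's a
    # tokenization problem; if it survives the tokenizer but not the filters,
    # the normalization step is mapping the token to nothing.
    print(f"{text!r}: tokenizer -> {tok or '(nothing)'}, filters -> {filt or '(nothing)'}")
```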