Talk:Requests for comment/CirrusSearch
Questions, comments
- If it's just a different, improved enterprise wrapper around Lucene, then why would the search results be any different? (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- Primarily because Solr brings a different set of plugins and a different scoring strategy. We know we'll be different; we'd like to be better. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- About the URLs served by Solr for each collection of documents: are these URLs internal, or will they be visible to wiki users? (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- Some of those URLs are administrative and shouldn't be exposed to the public. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- The actual search is probably safe to expose but I think we should do more research before doing so. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- On further consideration, I don't see us exposing this. We provide APIs (opensearch and normal MW search) for people to make requests against. If we'd like to expand those APIs, that's a cool feature request. Requiring people to go through MediaWiki to get search ensures all the normal things like rate limits, resource pools (PoolCounter) and so forth are respected. ^demon[omg plz] 16:18, 25 June 2013 (UTC)
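For reference, the two existing entry points mentioned above look roughly like this (illustrative URLs against en.wikipedia.org; the parameters shown are the standard MediaWiki API ones, not anything CirrusSearch-specific):

```
# Typeahead suggestions via OpenSearch:
https://en.wikipedia.org/w/api.php?action=opensearch&search=general%20rel&limit=10&format=json

# Full-text search via the normal MediaWiki search API:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=general%20relativity&format=json
```

Both of these go through MediaWiki, so the rate limits and PoolCounter protections described above apply to them.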
- Most stuff we'll test in labs. Will the labs instance(s) show search results for a live Wikipedia? I hope so, otherwise people won't be motivated to test. If so, you could even change the messages on the live wiki's Special:Search page to add "Repeat this search against our new search engine in testing" (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- I like the "Repeat this search against our new search engine in testing" idea quite a bit and think we should go for it if it isn't a ton of work. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- We've been testing against restores (via mwdumper) of enwikiquote, a nearly empty wiki full of test data, and jawiki in labs. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- When you roll out to mediawiki.org and other wikis, will there be a similar message: "Problems with these search results? Try using the old search engine and _report_ any cases where it's worse." (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- This sounds like a great way to get more feedback. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- I agree. It could be a link to a translatable help page here on mediawiki.org, also containing instructions on how to report issues/give feedback. Such links usually work quite well. --Nemo 16:12, 18 June 2013 (UTC)
- I'm going to look at adding something like a query parameter so you can easily switch search backends. This will allow people to easily compare results side-by-side. ^demon[omg plz] 16:18, 25 June 2013 (UTC)
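As a sketch of what such a switch might look like (the parameter name below is hypothetical, not a committed interface):

```
# Default backend:
https://en.wikipedia.org/wiki/Special:Search?search=general+relativity

# Hypothetical opt-in parameter selecting the new backend for a single request:
https://en.wikipedia.org/wiki/Special:Search?search=general+relativity&srbackend=CirrusSearch
```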
- Will Solr have any more or different features than the current search? (According to Help:Searching there are currently no features except the Special:Search page's checkboxes for category selection and including redirects.) (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- The plan for CirrusSearch is to replace what we have in the first release and then build on it. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- Is CirrusSearch a PHP plugin or a Solr plugin? Can you link to its code? Similarly, there is no link to or explanation of MWSearch and lsearchd. (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- PHP. Added links. (NEverett (WMF) (talk) 20:52, 17 June 2013 (UTC))
- It sounds fine, thanks for working on this. (S Page (WMF) (talk) 19:19, 17 June 2013 (UTC))
- Will it be possible to have an interwiki search again (bugzilla:44420) at some point? Is it easier with Solr than with Lucene? --Nemo 07:05, 18 June 2013 (UTC)
- This is not something we're aiming for in the first release. I like the idea a lot and am really tempted to dream up great ways to implement this, but I'm trying not to get distracted. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
- How is the scoring going to work? Will it replicate the current Lucene scoring (which considers the number of incoming links and the like) or not, and will it be easier to adjust gradually as needed in the future (as opposed to the current monolithic scoring nobody understands, impossible to tailor)? --Nemo 07:05, 18 June 2013 (UTC)
- We aren't going to replicate the current scoring. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
- Thanks. Is scoring another thing you'd rather work on upstream, as with the tokenizers mentioned below? --Nemo 16:31, 24 June 2013 (UTC)
- That depends on the scoring problem and what we determine is the most appropriate way to solve it. Some scoring issues will be resolved by patching upstream (Solr) and submitting those patches (problems with tokenizers and analyzers), some will be resolved by modifying CirrusSearch (weighting issues, sending more data to the index?), and yet others might require some tweaks to MediaWiki itself (template expansion?). I'm sorry I can't be too specific but I think we'll discover more as the rubber meets the road. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
- We haven't yet decided whether to take into account the number of incoming links. The results seem pretty good without it. I think this is something worth deploying without it to a subset of wikis, and adding it if we feel that the search results are worse and that this is the cause. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
- Ok. Does this affect autocompletion too? --Nemo 16:31, 24 June 2013 (UTC)
- Since we're not building the link weighting at all at this point, it certainly does affect autocomplete, which I might slip up and call prefix search sometimes because that is how I think of it. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
- "Will transcluded pages be able to be indexed in situ especially where the pages are transcluded cross-namespace, or would this be part of a future build?" --Nemo 07:05, 18 June 2013 (UTC)
- I'm breaking this out from the question about included templates and doing more research before I answer it. I promise I'm paying attention. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
- Will it index the pages with all templates expanded? This is particularly important for Wiktionary, but also Wikisource and Wikipedia. --Nemo 07:05, 18 June 2013 (UTC)
- The plan is to expand all templates. One question that has come up is, should we not expand some of the templates? NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
- Thanks. In general, I'd say no, but it depends on how smart the scoring is: you wouldn't want thousands of articles containing a word or name in an infobox to come before articles with actual mentions of that word, when searching for it. If the scoring is not smart enough, it's possible wikis would like to exclude some templates (say, navigational templates) with some tag similar to noinclude or Category:Exclude in print. --Nemo 16:31, 24 June 2013 (UTC)
- What'll happen is that phrases in super common infoboxes will become less important with regard to scoring. Searching for <citation publication> will sort things about publications but not citations higher than things about citations but not publications. It'll still spit out things about citation publications above all of those, which should be fine. Really, the problem with expanding all templates while indexing is that it is slow during batch indexing, which is what we're actively working on right now. NEverett (WMF) (talk) 14:45, 25 June 2013 (UTC)
- If we do decide that we want to not expand (or remove entirely) some of the templates then we can always make that change later and reindex everything. It'll take time but we're making sure that reindexing is something we can do if we need it. NEverett (WMF) (talk) 14:45, 25 June 2013 (UTC)
- I think what we're going to do is fully parse the page and then have Solr strip the HTML. I think if we do something like "noprint" or so forth we can make it exclude specific parts of the document as well. I fixed the performance problem. ^demon[omg plz] 16:18, 25 June 2013 (UTC)
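To illustrate the "noprint"-style idea: purely hypothetical wikitext markup (neither the class name nor the template name is a committed interface), where anything inside the wrapper could be skipped by the HTML-stripping step:

```
Article prose that should be indexed and scored normally.

<div class="noprint">
{{SomeNavbox}} <!-- hypothetical navbox; content inside this wrapper would not be indexed -->
</div>
```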
- If I remember correctly, MWSearch uses customized tokenizers for indexing. Will these be ported to the new Solr search? --Mglaser (talk) 10:02, 18 June 2013 (UTC)
- And how easy will it be to implement new ones[1]? --Nemo 16:10, 18 June 2013 (UTC)
- If we need new tokenizers and such we'll contribute them upstream. It does look like Solr has some support for Finnish built in by using Snowball but I doubt it is as good as Omorfi from the looks of things. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
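For context, Solr's built-in Finnish support via Snowball is configured as an analysis chain in the schema, roughly like this (a minimal sketch; the field type name is made up):

```xml
<fieldType name="text_fi" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Snowball stemmer for Finnish; much simpler than a full morphological analyzer like Omorfi -->
    <filter class="solr.SnowballPorterFilterFactory" language="Finnish"/>
  </analyzer>
</fieldType>
```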
- I absolutely love the faceted search feature of Solr. Are there any plans to use this one? I think this might have to be reflected in building the schema. --Mglaser (talk) 10:02, 18 June 2013 (UTC)
- One of our requirements is the ability to change the schema without too much pain. So no, we don't plan on using faceting just yet, but yes, we'll gladly do something with it when we know what would be useful. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
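For anyone unfamiliar with the feature, a faceted Solr query looks roughly like this (the host, core, and field name here are hypothetical, not part of the current CirrusSearch schema):

```
# Count matches per category alongside the normal result list:
http://solr.example.org:8983/solr/wiki/select?q=insulin&facet=true&facet.field=category&facet.limit=10
```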
- A search for categories would be so great! In BlueSpice, when indexing the articles, we also store their categories. On the scale of our wikis, this is very performant. Would you think this might also be an idea at this large scale? --Mglaser (talk) 10:02, 18 June 2013 (UTC)
- incategory:<name> can currently be used to filter by category. That works in CirrusSearch too. Faceting on categories might be nice. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
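Example of the existing filter (the category name is just an illustration; exact quoting rules for multi-word category names may vary):

```
# Full-text terms combined with a category filter:
incategory:Physics entanglement
```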
- In order to measure the quality of full-text search, would it be helpful to compare (in an automated way) with the search results given by major web search engines when restricted to a Wikimedia site, such as "site:lang.wikipedia.org" on Google? Of course we should not aim to replicate them, but the comparison could give some hints when things go wrong (such as weird tokenization). --Whym (talk) 14:43, 22 June 2013 (UTC)
- I think something like this would probably take a while to implement and give too many false positives so I don't plan on it. I'd prefer to deal with individual bugs and build a regression suite around that. NEverett (WMF) (talk) 14:58, 25 June 2013 (UTC)
- I don't mean to sound like a broken record, but I'm not sold on the comparison of Solr vs. ElasticSearch. The current Solr installation is irrelevant as it's small, built for a different purpose, badly designed, does not use Solr 4.x/SolrCloud, and it really just needs to go (unless you really want to compare Solr 3 with ElasticSearch :). I can't argue with the "we have more experience with Solr" argument, but I'd really prefer a comparison on their technical merits, if anything to learn more about their differences. http://solr-vs-elasticsearch.com/ seems like a good resource and it seems to suggest ElasticSearch for "large installations" (note that the comparison welcomes input: they have their HTML on GitHub and accept contributions). Faidon Liambotis (WMF) (talk) 23:56, 22 June 2013 (UTC)
- It is probably worth spending a few days playing with ElasticSearch now that we're comfortable with Solr. NEverett (WMF) (talk) 15:52, 24 June 2013 (UTC)
- I agree with Nik. We'll start iterating with it as well. ^demon[omg plz] 16:18, 25 June 2013 (UTC)
- Will it be possible to use CirrusSearch on a non-Wikimedia installation of MediaWiki? -- Tim Starling (talk) 06:08, 22 July 2013 (UTC)
- That's the plan. We've written the extension to be very generic, so as long as you're able to set up Solr or Elastic you'd be able to use it. Of course this doesn't help people on shared hosting, but they've never been able to use anything except database-backed searching. ^demon[omg plz] 06:17, 22 July 2013 (UTC)
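A minimal sketch of what a third-party setup might look like in LocalSettings.php, assuming the 2013-era extension loading style; the CirrusSearch-specific variable name below is an assumption and may not match what the extension actually ships:

```php
// In LocalSettings.php:
// Load the extension (pre-extension-registration style).
require_once "$IP/extensions/CirrusSearch/CirrusSearch.php";

// Hypothetical setting: point the extension at your own search cluster.
$wgCirrusSearchServers = array( 'search1.example.org', 'search2.example.org' );

// Core setting: use CirrusSearch as the SearchEngine implementation
// instead of the default database-backed search.
$wgSearchType = 'CirrusSearch';
```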
- How will updates work exactly? How much latency, if any, will be added to page saves? Will the parser cache be used? Will the API be used? -- Tim Starling (talk) 06:13, 22 July 2013 (UTC)
- We're using the SearchUpdate deferred update. There shouldn't be much latency noticeable to users as it's deferred until after page output, but I haven't done any profiling yet. And yes, the pcache is being used. ^demon[omg plz] 06:17, 22 July 2013 (UTC)
- With some configurations deferred updates do delay the serving of the page. I don't know if it is inherent when using nginx or if it is a configuration issue. I haven't seen that explained anywhere. --Nikerabbit (talk) 17:43, 22 July 2013 (UTC)
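A rough sketch of the mechanism being discussed: the index refresh is queued as a deferred update during the save, so it normally runs after the response has been sent rather than while the editor waits. The class names are the real core ones; the exact constructor arguments are approximate for MediaWiki of that era:

```php
// Somewhere in the page-save path (simplified): queue the search index
// update instead of running it inline.
$update = new SearchUpdate(
	$wikiPage->getId(),                // page id being reindexed
	$wikiPage->getTitle(),             // title of the saved page
	$content->getTextForSearchIndex()  // text handed to the search backend
);
DeferredUpdates::addUpdate( $update ); // executed after page output
```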
- I do not see any changes in the section about languages after changing the preference from Solr to Elasticsearch. How well does Elasticsearch support languages other than English, compared to Solr? siebrand (talk) 15:48, 22 July 2013 (UTC)
Features lost after migrating to CirrusSearch
One thing is lost for wikis migrating to this new search engine. You know that there are lots and lots of navboxes breeding across pages; people try to fight them, but the navboxes are often stronger (and there are other kinds of templates that also add standard text to many pages). So, many articles are connected to a term only because it is mentioned in a template shared with hundreds of other pages, and one can't use the WhatLinksHere (WLH) mechanism to filter them out. But since the old search engine ignores template text, one can get results where the term is mentioned in the article text itself, not in a template. The new engine doesn't behave that way. You can't just say "don't show me pages with template XXX", since many of them are truly relevant.
I would like a parallel search to be introduced that searches the raw wikitext. It would also help in cases like finding deprecated parameter names in protected templates; currently, such a task can be done effectively only by a bot operating on a dump, which is slower and inaccessible to many users. Ignatus (talk) 21:31, 13 January 2014 (UTC)
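To make the request concrete, the kind of query being asked for might look something like this (entirely hypothetical operator and parameter name; no such syntax exists in either the old or the new engine at the time of writing):

```
# Hypothetical: find pages whose raw wikitext still passes a deprecated template parameter
wikitext:"| author_name ="
```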