CirrusSearch/Presentation

This is a presentation for use in one hour sessions talking about CirrusSearch.


Intro

The goal of this presentation is education. Stop me if I'm not making sense; I'll take questions at any time.


Background

The sum total of the requirements

Make the search engine stop paging people. Users have to like it.

Oh yeah, and it'd be nice if it were kept up to date in real time.

It has to support about 870 million full-text searches a month, as well as about 3.1 billion find-as-you-type searches a month.


It took 18 months

Why 18 months? Many, many months were spent on "users have to like it." "Users have to like it" really means:

  • Can't take any features away (mostly)
    • To do this, you first have to figure out what features the old system actually had
      • Old search didn't have tests or specs or anything fancy like that
  • Have to add some shiny features
    • You have to figure out what users actually want

Solution: roll out slowly. Phase it in for different communities, and offer it as an opt-in feature before it becomes the default.

  • Hit communities that are underserved by search first (zh-yue Wikipedia, Wikisource, Wiktionary, Wikidata)
  • Hit change-averse communities last (enwiki, dewiki)
  • Give power users lots and lots of time to try the feature before dumping it on them

And build a running set of regression tests.

Only once we had all the features sorted out could we start to predict hardware requirements, so we had to wait until the very end to order hardware.


The solution

Replace, from the ground up, the search system that had heroically powered search for years with one based on Elasticsearch.


Why?

The old search engine, lsearchd, was a Java application based on Lucene 2.3. Lucene 5.0 is coming out soon; the world of open source search moved on and we never kept up. We simply don't have the manpower to maintain our own search system. Solr and Elasticsearch have significant communities behind them keeping them up to date, and both have people working on them full time.

We chose Elasticsearch over Solr for lots of reasons:

  • The contributor pipeline was good and the maintainers were nice
  • At the time, Elasticsearch had better support for configuring the schema over HTTP
  • Elasticsearch's REST API just *felt* better
  • Deb and RPM packages are available and work well

What is Elasticsearch?

Think of Elasticsearch as a document database with fancy searching. Documents are generally JSON and you can PUT/POST/GET them. It supports redundancy through replication and horizontal scaling through sharding.
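
A hedged sketch of that model (the index name "wiki", type "page", and fields here are made-up illustrations, not CirrusSearch's real schema):

  # Store a JSON document at a known id.
  curl -XPUT 'http://localhost:9200/wiki/page/12345' -d '{
    "title": "Search engine",
    "text": "A search engine is a system for finding things...",
    "namespace": 0
  }'

  # Fetch it back by id.
  curl -XGET 'http://localhost:9200/wiki/page/12345'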

Searching is fancy. Elasticsearch supports these things:

  • Query
  • Filter
  • Rescore
  • Source filtering
  • Returning calculated fields
  • Highlight
  • Aggregate
  • Suggest
  • Stats recording

An example document is sketched above; an example query combining several of these features follows below.
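
A hedged sketch of such a query against the made-up index from above (the "filtered" query is Elasticsearch 1.x syntax):

  # Full-text query plus a filter, highlighting, and source filtering.
  curl -XGET 'http://localhost:9200/wiki/page/_search' -d '{
    "query": {
      "filtered": {
        "query":  { "match": { "text": "search engine" } },
        "filter": { "term":  { "namespace": 0 } }
      }
    },
    "highlight": { "fields": { "text": {} } },
    "_source": [ "title" ]
  }'

Results come back as JSON as well: a list of matching documents with scores, the requested source fields, and highlighted snippets.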

What we learned

Good

  • The application maintains its data. Everything about it: documents, analysis configuration, everything.
  • Code has to work on the last release's data while it is migrating to the new data.
  • Changing the analysis configuration is much, much faster than rebuilding the documents from scratch (hours vs. days).
  • Both kinds of reindex need to be automated. In our case, PHP scripts running in screen were fine; we used bash scripts to repeat the process for all the wikis (see the sketch after this list).
  • Keep domain-specific knowledge out of the search server. If any is needed, push it with scripts.
  • Don't be afraid of writing a plugin if you feel you need to get really close to the problem. Open source it. Contribute code to Elasticsearch and/or Lucene; they are good at code review.
  • Queue updates for a smoother write frequency. You can even turn them off if you need to!
  • Don't push noop updates to Elasticsearch; they are expensive. You can detect that an update is a noop by setting a flag or with a script (see the sketch after this list).
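
Two hedged sketches of those last points. The reindex loop uses made-up script and file names, and the noop examples use the Elasticsearch 1.x update API (the detect_noop flag arrived partway through the 1.x series):

  # Repeat a reindex for every wiki (script and dblist names are hypothetical).
  for wiki in $(cat all.dblist); do
      mwscript extensions/CirrusSearch/maintenance/someReindexScript.php --wiki="$wiki"
  done

  # Let Elasticsearch skip updates that wouldn't change the document.
  curl -XPOST 'http://localhost:9200/wiki/page/12345/_update' -d '{
    "doc": { "title": "Search engine" },
    "detect_noop": true
  }'

  # Or detect the noop yourself in an update script by cancelling the operation.
  curl -XPOST 'http://localhost:9200/wiki/page/12345/_update' -d '{
    "script": "if (ctx._source.title == title) { ctx.op = \"none\" } else { ctx._source.title = title }",
    "params": { "title": "Search engine" }
  }'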


Bad

  • Requests can include scripts sent over HTTP. The original scripting language (MVEL) wasn't sandboxed at all, and Groovy seems to have some sandbox issues as well. Not much defense in depth.
    • It's possible to disable dynamic scripts entirely and only allow scripts that have been placed on the file system. This limits your flexibility but is likely worth it for the peace of mind (see the sketch after this list).
  • Rolling restarts are super slow.
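
A minimal sketch of that lockdown in elasticsearch.yml, using the Elasticsearch 1.x setting name (later releases replaced it with finer-grained script settings):

  # Reject scripts sent in requests; scripts on the file system still work.
  script.disable_dynamic: true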

Links
