Wikimedia Technology/Annual Plans/FY2019/TEC8: Search Platform
Search Platform provides the infrastructure and back-end tooling for content discovery across the Wikimedia/Wikipedia landscape. This includes not only surfacing relevant content when readers are looking for it, but also guiding people to content when they may not have expressed themselves clearly or might not know exactly what they're looking for. We do this across languages, for both MediaWiki and Wikidata. Our main focus is on using machine learning and NLP to improve the ranking and relevance of search results, and on providing front-end teams with interfaces to those results that can be used to improve the search experience for readers and editors.
During the 2018-19 fiscal year, Search Platform will also contribute significantly to the Structured Data on Commons cross-department program.
Program outline
Teams contributing to the program
Search Platform, WMDE, Audiences (75% of one analyst)
Annual Plan priorities
Primary Goal: 3. Knowledge as a Service - evolve our systems and structures
How does your program affect annual plan priority?
To deliver Knowledge as a Service effectively, it is critical that we provide excellent search and discovery tooling. By using machine learning, the Search Platform team has already laid the foundations of a highly tunable search result ranking engine. This gives us the opportunity to expand the array of features that influence the ranking of search results, but there are still many improvements we can make to surface more relevant results, both within and across languages: incorporating natural language processing (NLP) and phonetic matching, and adding more language-specific analyzer plugins to Elasticsearch that can handle the nuances between closely related but distinct languages. With these improvements, we will be better positioned to lead readers to the content they are seeking and to expose them to more accurate related content that keeps them exploring deeper into the knowledge space.
In addition to increasing the relevance of search results, the Search Platform team will be working closely with the Structured Data on Commons team to implement search requirements for this next-generation implementation of Commons.
Program Goal
Through incremental Search Platform component improvements, teams and developers can deliver more and better ways for readers and editors to discover and search for content.
- Outcome 1
- The advanced machine learning techniques we implement will improve search result relevance across Wikipedia language editions.
- Output 1
- Continue to identify new features for machine learning and incorporate them into the Machine-Learning-to-Rank (MLR) pipeline
- Output 2
- Experiment with Natural Language Processing (NLP) to improve the machine learning results
- Output 3
- Maintain CirrusSearch and the Search API
- Outcome 2
- Users across languages experience better search results.
- Output 4
- New language analyzers deployed to improve support for multiple languages (where they make sense for individual language wikis).
- Outcome 3
- Wikidata Query Service expanded with deeper, cross-wiki search features
- Output 5
- Deep category and full-text search for Wikidata via WDQS (in preparation for Structured Data on Commons).
- Outcome 4
- Search Platform gains a much deeper understanding of search performance and improvement impact metrics
- Output 6
- Dashboard of new and relevant metrics that encompass the performance and impact of machine learning in content discovery
Resources
 | FY2017–18 | FY2018–19 |
---|---|---|
People (OpEx) | Current team, contributing to all outcomes; short-term contract resources | Short-term contract resources |
Stuff (CapEx) | | |
Travel & Other | | |
Targets
Outcome 1
- Search Platform has a set of clear running metrics in a dashboard indicating the performance (and improvement trajectory) of the machine learning mechanisms, encompassing coding efficiency (i.e., how fast new features can be implemented and trained), processing performance (i.e., how long it takes to train models), and result relevance.
- Target
- Increase processing performance and efficiency, along with result relevance, based on baseline metrics to be identified and implemented in a dashboard during FY2017-18.
- Measurement method
- Still being determined
Dependencies
Audiences has been providing 75% of one analyst to help with data analysis for Search Platform A/B tests, and that resource is still needed.