Core Platform Team/Initiatives/Image Suggestion API

From mediawiki.org


Epics, User Stories, and Requirements

Image Recommendation API (Proof of Concept)

List Unillustrated Articles and their image suggestions

  • As a developer, when I make a request to the Image Recommendation API,
    • I expect to see a list of unillustrated articles and their image suggestions
      • The list should contain at most 10 images per page requested
      • Of those 10 images, at most 3 should come from ImageMatchingAlgorithm and the remaining 7-10 should come from MediaSearch
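The limits above could be enforced by a small merge step. The following is a minimal sketch, not the project's implementation: the type names, the function name, and the choice to de-duplicate by filename are all assumptions made for illustration. Only the numbers 10 and 3 come from the requirement.

```typescript
// Hypothetical types and names; the source only fixes the numeric limits.
type Source = "ImageMatchingAlgorithm" | "MediaSearch";

interface Suggestion {
  filename: string;
  source: Source;
}

const MAX_TOTAL = 10;          // at most 10 images per page
const MAX_FROM_ALGORITHM = 3;  // at most 3 from ImageMatchingAlgorithm

// Merge two ranked candidate lists into one list that respects both caps.
// MediaSearch fills whatever room is left, so it contributes 7-10 images
// whenever it has enough candidates.
function mergeSuggestions(
  algorithmResults: string[],
  mediaSearchResults: string[]
): Suggestion[] {
  const merged: Suggestion[] = [];
  const seen = new Set<string>();

  const take = (files: string[], source: Source, cap: number): void => {
    let taken = 0;
    for (const filename of files) {
      if (merged.length >= MAX_TOTAL || taken >= cap) break;
      if (seen.has(filename)) continue; // assumption: skip duplicates across sources
      seen.add(filename);
      merged.push({ filename, source });
      taken += 1;
    }
  };

  take(algorithmResults, "ImageMatchingAlgorithm", MAX_FROM_ALGORITHM);
  take(mediaSearchResults, "MediaSearch", MAX_TOTAL);
  return merged;
}
```

Taking the algorithm's candidates first keeps them at the top of the list. Whether that ordering is desirable is not stated in the requirement.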


List Image Recommendations for all Wikipedia languages

  • As a developer, when I make a request to the Image Recommendation API with a page title,
    • I expect to be able to make requests for all Wikipedia projects in any language
      • e.g. Arabic, Cebuano, English and Vietnamese Wikipedia


Provide the Image Source and Confidence Rating of an Image

  • As a developer, when I receive a list of images,
    • I expect to know the source of each recommendation
      • e.g. I see the image recommendation for the Frog page is from "Commons"
    • I expect to know the confidence rating for each image recommended per page requested
      • e.g. I see that the image for "Amazonian Tree Frog.jpg" has a confidence rating of "high"
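One possible response shape for this story is sketched below. Every field name is hypothetical, and so are the "low" and "medium" rating values: the source only requires that each image carries its source and a confidence rating, and only mentions "Commons" and "high" as example values.

```typescript
// All field names are assumptions made for illustration.
type ConfidenceRating = "low" | "medium" | "high"; // only "high" appears in the source

interface ImageRecommendation {
  image: string;
  source: string; // e.g. "Commons"
  confidenceRating: ConfidenceRating;
}

interface PageRecommendations {
  page: string;
  recommendations: ImageRecommendation[];
}

// The example from the user story, expressed in this hypothetical shape.
const example: PageRecommendations = {
  page: "Frog",
  recommendations: [
    { image: "Amazonian Tree Frog.jpg", source: "Commons", confidenceRating: "high" },
  ],
};

// Render one line per image so a client can show where a suggestion came from.
function describeRecommendations(p: PageRecommendations): string[] {
  return p.recommendations.map(
    (r) => `${p.page}: ${r.image} (source: ${r.source}, confidence: ${r.confidenceRating})`
  );
}
```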


Limit the Number of Image Recommendations per Article Request

  • As a developer, when I provide a parameter to limit the number of image recommendations per page
    • I expect to get somewhere between 1 and 10 images recommended per page requested
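A minimal sketch of how such a limit parameter might be validated follows. The function name and the fallback to 10 for missing or invalid input are assumptions. The source only states the 1 to 10 range.

```typescript
const MIN_LIMIT = 1;
const MAX_LIMIT = 10;

// Parse a raw query-string value into a limit between 1 and 10.
// Falling back to the maximum for missing or invalid input is an assumption.
function parseLimit(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return MAX_LIMIT;
  const n = Number(raw);
  if (!Number.isInteger(n)) return MAX_LIMIT;
  // Clamp out-of-range integers into the allowed range.
  return Math.min(Math.max(n, MIN_LIMIT), MAX_LIMIT);
}
```

Clamping out-of-range values is one design choice. Rejecting them with a client error would be an equally reasonable alternative.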

Non-Functional Requirements

  • Authorization/Authentication
  • Performance Metrics
  • API Product Metrics
    • API Usage
    • Unique API Customers
  • Data metrics
    • As a member of the Platform Team, I want the Image Recommendation data pipeline to respect system and data quality SLOs.
    • System
      • Spark sinks (in / out records, CPU usage, memory usage, executor counts)
    • Datasets
      • Summary of population statistics (purpose: identify regressions, population/model drift, anomaly detection)
      • Size and counts of intermediate and final datasets (purpose: identify regressions)
  • ML Metrics
    • Accuracy by
      • Method (ImageMatchingAlgorithm, MediaSearch)
      • Sources (Wikidata, Commons, etc.)
    • Recommendations resulting in
      • Rejections
      • Applied Edits
      • Skips
  • Documentation



Time and Resource Estimates

Estimated Start Date

None given

Actual Start Date

None given

Estimated Completion Date

March 3, 2021

Actual Completion Date

None given

Resource Estimates

None given

Collaborators

None given

Open Questions

Project Organisation

  • What are our definitions of done and success criteria? How are these broken down per components and aligned across teams?
  • Are we one project team, or two (backend/API) that work together?
    • 1 team with 2 concerns: Image Suggestion API & Data Pipeline
  • How would we like to communicate with other teams?
  • Do we have points of contact for the support teams?
  • Do we need a RACI for the project?
  • Are we missing any resources?
    • No

Timelines and Scope

  • Are there critical intermediate deadlines for other teams that we should be aware of?
  • What is the timeline for the various parts of the project?
    • Android: MVP Release by March 3
  • Are there any teams we can decouple dependency from?
  • What can we, the platform team, stop caring about? (out of scope)
  • Are the expectations clear and realistic?
  • Can we deliver within the timeline?
  • How do we bound this project if it is also going to be iterative?
  • What are the risks?
  • What constitutes scope creep?
  • What internal deadlines can we set for ourselves?
    • Proof of Concept target delivery date is March 3

Requirements

  • Are there any eventual requirements whose deferral jeopardizes the architecture?
  • What prerequisites must we satisfy before we can start a POC Task API implementation?
  • Who approves the API spec?
    • The Client Team(s)
  • How often do we expect to re-train the model? Currently, the best we can do is once a month.
  • What system / team will be responsible for tracking recommendations state?
  • Can we alter the image rec. algo to perform better?
  • Is it proven that the image rec. algo provides "better" results than MediaSearch?
  • Does the ranking system need to be part of the first iteration? (Where does it fall if the SD is no longer going to use the Task API?)
    • Confidence Rating will be included as part of the Image Recommendation API proof of concept

API Service

  • What language or framework should we build the API in?
    • The proof of concept will be built with Node.js.
  • Is the API going to be an extension or a service?
    • The API will be a service.
  • Is the task API storing the data from the image rec algo + MediaSearch somewhere, or querying both in real time and then smashing the results together?
    • The API will "smash" together the results of the image rec algo and MediaSearch if the results from the image rec algo are not "sufficient", which may mean there are not enough results to satisfy the number requested. The API will likely query MediaSearch in real time and use intermediate storage between the image rec algo Hadoop cluster and the task API.
  • How do we update tasks to reflect a user's actions (accepting or rejecting a task)?
  • What’s meant by the Image Recommendation bot as an end user? I was under the impression that the API would be used through human interaction only.
    • The API will serve both end users (e.g. Android app users) and MediaWiki bots that will automatically select images (with a high enough confidence score) to be added to articles.
  • What happens if a user rejects images for not being relevant? Do we update the options for the next user or remove the recommendation for improvement? Also how are we capturing this information for the algorithm so that it doesn’t offer the same image recommendations the following month (assuming an image hasn’t been added to the page in the last month)?
  • Does the POC include the requirement of "List Image Recommendations for a Given Article"?
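The bot use case mentioned above (automatically selecting images with a high enough confidence score) could be sketched as follows. This is an illustration under stated assumptions: the rating values "low" and "medium", their ordering, the default threshold of "high", and all names are hypothetical. The source does not define what "high enough" means.

```typescript
// Hypothetical rating scale; only "high" appears in the source.
type ConfidenceRating = "low" | "medium" | "high";

interface ScoredImage {
  image: string;
  confidenceRating: ConfidenceRating;
}

// Assumed ordering of ratings, lowest to highest.
const RANK: Record<ConfidenceRating, number> = { low: 0, medium: 1, high: 2 };

// Return the first candidate that meets the minimum rating, or undefined
// when none qualifies. Candidates are assumed to be ranked already.
function selectForBot(
  candidates: ScoredImage[],
  minimum: ConfidenceRating = "high"
): ScoredImage | undefined {
  return candidates.find((c) => RANK[c.confidenceRating] >= RANK[minimum]);
}
```

Returning `undefined` when nothing qualifies lets a bot skip the article rather than add a low-confidence image.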

Storage

  • Will the task API use Elasticsearch as a backend, or other storage (MySQL, Cassandra, etc.)?
  • What storage are we using for the ETL pipeline?
  • What are the performance requirements?

Phabricator

https://phabricator.wikimedia.org/T260832

Plans/RFCs

None given

Other Documents

https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_an_image
