Knowledge store
Free knowledge data model based on schema.org
Last updated: 2022-12-16 by APaskulin (WMF)
Status: v1 published May 2021 to inform the creation of the schema for Wikimedia Enterprise. For the current Wikimedia Enterprise schema, see the data dictionary on enterprise.wikimedia.com.
The purpose of this document is to define a predictable structure for distributing Wikimedia content. To do this, we’ve chosen to use standard types and properties from schema.org. This model is not meant to replace existing data structures within MediaWiki; instead, these structures can act as part of a distribution layer that consumes, structures, and serves knowledge beyond Wikimedia.
Using this model
We encourage Wikimedia projects to make use of this model, either as a whole or as a base to build on. Services currently using this model include Phoenix (structured content proof of value) and Wikimedia Enterprise.
Adding a property
As defined here, the model is restricted to properties that are meaningful outside the context of MediaWiki. To suggest a new property, leave a comment on the talk page. New properties should conform to the applicable schema.org type whenever possible.
Feedback and questions
To share feedback and questions, leave a comment on the talk page. Each type often has several open unknowns; these are tracked in that type's notes and questions subsection.
Patterns
Capabilities
Serve and distribute
Distribute predictably structured knowledge to products and platforms
Language
a human language
Based on schema.org Language
{
"name": "English",
"identifier": "en",
"direction": "ltr"
}
Property | Type | Description |
---|---|---|
name | Text | Language name in that language |
identifier | Text | Language code as used by Wikimedia (ISO 639 with exceptions[1]) |
direction (not on schema.org) | Text | right-to-left (rtl) or left-to-right (ltr) |
variant (not on schema.org) | Text | Language variant[2] (if applicable) |
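
As a minimal sketch, a consumer could load and check a Language object in Python roughly as follows. The `Language` dataclass and the `parse_language` helper are hypothetical names introduced for illustration. Treating every property except `identifier` as optional is an assumption, based on the way other objects in this model reference a language by `identifier` alone.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    """Hypothetical in-memory form of the Language object."""
    identifier: str                   # Wikimedia language code, e.g. "en"
    name: Optional[str] = None        # language name in that language
    direction: Optional[str] = None   # "ltr" or "rtl"
    variant: Optional[str] = None     # only if applicable


def parse_language(raw: dict) -> Language:
    """Build a Language from a decoded JSON object, checking direction."""
    direction = raw.get("direction")
    if direction not in (None, "ltr", "rtl"):
        raise ValueError(f"unexpected direction: {direction!r}")
    return Language(
        identifier=raw["identifier"],
        name=raw.get("name"),
        direction=direction,
        variant=raw.get("variant"),
    )


english = parse_language(json.loads(
    '{"name": "English", "identifier": "en", "direction": "ltr"}'
))
print(english.identifier, english.direction)
```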
Notes and questions
Project
a wiki in a single language
Based on schema.org CreativeWork (not on schema.org Project)
{
"name": "Wikipedia",
"identifier": "en.wikipedia.org",
"in_language": {
"identifier": "en"
},
"url": "https://en.wikipedia.org",
"size": {
"value": 70934,
"unit_text": "MB"
}
}
Property | Type | Description |
---|---|---|
name | Text | Unabbreviated project name in the language specified by in_language (Example: Wikipedia, Wikisłownik, etc.) |
identifier | Text | Project domain (Example: en.wikipedia.org) |
in_language | Language | Human language the project is written in |
url | Text | URL for the project entry point (not directly to the main page) |
size | QuantitativeValue | Project size when downloaded as a whole (compressed) |
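
The `size` property pairs a number with a unit, so a consumer typically needs to normalize it before comparing projects. The following is a sketch only: the `size_in_bytes` helper and the `UNIT_BYTES` table are hypothetical, and the model does not state whether "MB" is decimal or binary, so decimal units are assumed here.

```python
project = {
    "name": "Wikipedia",
    "identifier": "en.wikipedia.org",
    "in_language": {"identifier": "en"},
    "url": "https://en.wikipedia.org",
    "size": {"value": 70934, "unit_text": "MB"},
}

# Assumption: decimal units (1 MB = 10**6 bytes). The model does not say
# whether unit_text values are decimal or binary.
UNIT_BYTES = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def size_in_bytes(size: dict) -> int:
    """Convert a QuantitativeValue-style size object to bytes."""
    unit = size["unit_text"]
    if unit not in UNIT_BYTES:
        raise ValueError(f"unknown unit_text: {unit!r}")
    return int(size["value"] * UNIT_BYTES[unit])


print(size_in_bytes(project["size"]))  # 70934000000
```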
Notes and questions
- How should we handle in_language for multilingual projects? (Commons, Wikispecies, Wikidata, etc.)
Page
a wiki page
Based on schema.org Article
{
"name": "Pinnation",
"identifier": 339742,
"url": "https://en.wikipedia.org/wiki/Pinnation",
"in_language": {
"identifier": "en"
},
"is_part_of": [
{
"identifier": "en.wikipedia.org"
}
],
"version": 975098740,
"date_modified": "2020-08-26T18:48:58Z",
"license": [
{
"identifier": "CC-BY-SA-3.0",
"name": "Creative Commons Attribution Share Alike 3.0 Unported",
"url": "https://creativecommons.org/licenses/by-sa/3.0/"
}
],
"main_entity": {
"identifier": "Q3756157"
},
"keywords": "Plant morphology, Leaves",
"has_part": [
{
"identifier": "/node/ff569ed4759dbfc"
}
]
}
Property | Type | Description |
---|---|---|
name | Text | Page title in reading-friendly format (spaces instead of underscores) |
identifier | Integer | Page ID (MediaWiki page ID) |
url | Text | Complete URL for the page |
in_language | Language | Human language the page is written in |
is_part_of | array of Project | Wiki the page belongs to |
version | Integer | Revision ID (MediaWiki revision ID) |
date_modified | Text | Timestamp of latest revision in ISO 8601 format (DateTime) |
license | array of License | Content license |
main_entity | Entity | Primary subject of the page (Wikidata ID) |
keywords | Text | Comma-separated list of categories the page belongs to |
has_part | array of Section | Page sections |
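
Two Page properties are encoded as text and usually need decoding on the consumer side: `keywords` (a comma-separated list) and `date_modified` (an ISO 8601 timestamp). The sketch below shows one way to do that in Python. The helper names `parse_keywords` and `parse_date_modified` are hypothetical, and the timestamp parser assumes the UTC "Z" form with second precision shown in the example above. Splitting on commas would also break a category name that itself contains a comma.

```python
from datetime import datetime, timezone
from typing import List

# Subset of the Page example above.
page = {
    "name": "Pinnation",
    "identifier": 339742,
    "version": 975098740,
    "date_modified": "2020-08-26T18:48:58Z",
    "keywords": "Plant morphology, Leaves",
}


def parse_keywords(keywords: str) -> List[str]:
    """Split the comma-separated category list and trim whitespace."""
    return [part.strip() for part in keywords.split(",") if part.strip()]


def parse_date_modified(value: str) -> datetime:
    """Parse a UTC timestamp such as 2020-08-26T18:48:58Z (assumed form)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


categories = parse_keywords(page["keywords"])
modified = parse_date_modified(page["date_modified"])
print(categories)            # ['Plant morphology', 'Leaves']
print(modified.isoformat())  # 2020-08-26T18:48:58+00:00
```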
Notes and questions
- Consider using display title for name instead of reading-friendly title
- How should we handle media files associated with a page? Schema.org has audio, video, thumbnailUrl, and primaryImageOfPage (MediaObject). Note that primaryImageOfPage comes from the WebPage type, not Article.
- How to handle licenses for images embedded in a page? (Check with legal)
- Should we include other URLs (mobile, edit, talk, etc.)? Schema.org has discussionUrl but no others.
- We’ve intentionally not included content at the page level in favor of providing content at the section level.
- Is it a problem that is_part_of would be inconsistent between objects? (An array of Project on a Page, but an array of Page on a Section.)
- Properties to consider:
  - about - Rosette or other set of page subjects (Wikidata items)
  - interactionStatistic seems like the most logical place for pageviews, number of edits, etc. What types of stats should we include? (array of InteractionCounter)
  - mentions - array of Thing, links included within the page
  - abstract: Is there a way we could get the first two sentences of the article?
  - citation (References used on the page)
  - schemaVersion (https://schema.org/docs/releases.html#v12.0) seems like a good idea, but I’m struggling to see the value. These releases seem to come out every few months.
  - page quality score (aggregateRating?)
  - copyrightHolder - “The text of Wikipedia is copyrighted (automatically, under the Berne Convention) by Wikipedia editors and contributors and is formally licensed to the public under one or several liberal licenses.”[1] (Covered by license?)
  - dateCreated (page’s initial publication date)
  - creativeWorkStatus
  - creditText (attribution text)
Section
content grouped under a heading or as an introduction before the first heading on a page
Based on schema.org CreativeWork
{
"name": "Orbit and turning",
"identifier": "/node/ff569ed4759dbfc",
"version": 975098740,
"is_part_of": [
{
"identifier": 339742
}
],
"text": "...html...",
"encoding_format": "text/html",
"license": [
{
"identifier": "CC-BY-SA-3.0",
"name": "Creative Commons Attribution Share Alike 3.0 Unported",
"url": "https://creativecommons.org/licenses/by-sa/3.0/"
}
]
}
Property | Type | Description |
---|---|---|
name | Text | Section heading |
identifier | Text | Knowledge store ID |
version | Integer | MediaWiki revision ID |
is_part_of | array of Page | Page the section belongs to |
text | Text | Section content in HTML |
encoding_format | MIME type | "text/html" |
license | array of License | Content license |
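
Because content lives on Section rather than Page, a consumer that wants the full page body has to follow the `has_part` references. A minimal sketch of that lookup is shown below. The `page_html` function and the in-memory `sections_by_id` index are hypothetical, and reading `has_part` as being in page order is an assumption that the model does not state. The `"...html..."` value is the placeholder from the example above, not real content.

```python
page = {
    "identifier": 339742,
    "has_part": [{"identifier": "/node/ff569ed4759dbfc"}],
}

sections = [
    {
        "name": "Orbit and turning",
        "identifier": "/node/ff569ed4759dbfc",
        "version": 975098740,
        "is_part_of": [{"identifier": 339742}],
        "text": "...html...",
        "encoding_format": "text/html",
    }
]

# Index sections by their knowledge store ID for lookup.
sections_by_id = {section["identifier"]: section for section in sections}


def page_html(page: dict, sections_by_id: dict) -> str:
    """Join the HTML of every section referenced by page["has_part"]."""
    parts = []
    for ref in page["has_part"]:
        section = sections_by_id[ref["identifier"]]
        owners = {parent["identifier"] for parent in section["is_part_of"]}
        if page["identifier"] not in owners:
            raise ValueError(f"{ref['identifier']} is not part of this page")
        if section["encoding_format"] != "text/html":
            raise ValueError(f"unexpected format in {ref['identifier']}")
        parts.append(section["text"])
    # Assumption: has_part lists sections in page order.
    return "\n".join(parts)


print(page_html(page, sections_by_id))
```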
Notes and questions
- Properties to consider:
  - dateModified
  - about - Rosette or other set of page subjects (Wikidata items)
License
content license
Based on schema.org CreativeWork
{
"identifier": "CC-BY-SA-3.0",
"name": "Creative Commons Attribution Share Alike 3.0 Unported",
"url": "https://creativecommons.org/licenses/by-sa/3.0/"
}
Property | Type | Description |
---|---|---|
name | Text | License name |
identifier | Text | License ID from spdx.org |
url | Text | URL for the license text |
Notes and questions
Entity
a subject of a page
Based on schema.org Thing
{
"identifier": "Q3756157"
}
Property | Type | Description |
---|---|---|
identifier | Text | Wikidata ID |
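
Since an Entity carries only a Wikidata ID, a consumer has to resolve it against Wikidata to learn anything else about the subject. The sketch below builds the item page URL from the identifier. The `wikidata_url` helper is hypothetical, and restricting identifiers to item IDs (a "Q" followed by digits) is an assumption; the model itself only says "Wikidata ID".

```python
import re

# Assumption: page subjects are Wikidata items, whose IDs look like "Q42".
ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def wikidata_url(entity: dict) -> str:
    """Return the Wikidata item page URL for an Entity object."""
    identifier = entity["identifier"]
    if not ITEM_ID.fullmatch(identifier):
        raise ValueError(f"not a Wikidata item ID: {identifier!r}")
    return f"https://www.wikidata.org/wiki/{identifier}"


print(wikidata_url({"identifier": "Q3756157"}))
# https://www.wikidata.org/wiki/Q3756157
```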
Notes and questions
- Connection with Wikidata