
Wikidata annotation tool

From mediawiki.org

Annotation tool that extracts information from books and feeds it to Wikidata


Public Url:

(https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Annotation_tool_that_extracts_statements_from_books_and_feed_them_on_Wikidata)

Announcement of Proposal:

Announcement 1
Announcement 2

Discussions

Project Completion Report

(https://www.mediawiki.org/wiki/Wikidata_annotation_tool/project_completion_report)

Important: Updates

(https://www.mediawiki.org/wiki/Wikidata_annotation_tool/updates)

Name and Contact Information


Name:

Amanpreet Singh

Email:

amanpreet.iitr2013@gmail.com

IRC Nick:

apsdehal

Web Page / Blog / Microblog:

Spookout

Location:

Roorkee, Uttarakhand, India

Typical Working Hours:

10:00-13:00, 15:30-19:00, 22:00-03:00 (IST); 04:30-07:30, 10:00-13:30, 16:30-21:30 (UTC)

Synopsis

This project aims to improve user interaction with Wikidata and to open up a new way of sharing and saving data. It creates a tool that, when a user highlights a statement, provides a GUI to fix its structure and then feeds it to Wikidata. Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It centralizes access to structured data and manages it so that every piece of data is easily available and accessible. By means of the plugin, people can save their important notes and quotes directly to Wikidata, making them accessible to a wider audience.

The Need

Statements, or annotations, link web data together and bind it into one whole. Items, properties and values, which are worthless without their interconnections, are brought to life through these statements. The tool aims to help people create annotations, gluing the web of data together and enriching it with a tremendous amount of knowledge. The project is needed because this network of data has to be linked continuously, and it becomes more valuable as more people annotate.

Possible Mentors

  1. Cristian Consonni
  2. Andrea Zanni
  3. The Pundit team

Use cases

  1. You are at home, reading a book on Wikisource. If you want to take notes on important passages, you can annotate them and feed important quotes and data, together with their source, directly into the Wikidata knowledge base. Furthermore, readers who come to the book after you will be able to see your notes, which saves them time. All of this is done just by activating the plugin.
  2. Imagine an office scenario. You are attending a presentation or seminar, and an important fact or data point is shared, e.g. your national statistical institute has just released the latest population data on its website. You can annotate it, click, and it is on Wikidata.
  3. You are reading the news in the browser on your tablet, and a new prime minister is being nominated. You can select the relevant text and insert this information into Wikidata.
  4. Given a statement from Wikidata (or another source), this tool can be used to mark up a reference and import that reference into Wikidata. This could help provide references for the millions of claims that currently don't have one. The more people annotate through this tool, the more references are added to Wikidata, and the more claims are converted into properly referenced statements.

Information about project


Glossary of Wikidata terms used:

Item
A page in the Wikidata main namespace representing a real-life topic, concept, or subject. Items are identified by a prefixed ID, by a sitelink to an external page, or by a unique combination of multilingual label and description.
Property
A descriptor of a value for a particular item. In other words, it is an attribute of an item.
Statement
A piece of data about an item, recorded on the item's page. A statement consists of a claim (a property-value pair such as "Season: Winter" about an item, together with optional qualifiers), supported by optional references (giving the source for the claim).
Claim
A statement without references.
Value
A piece of information about an item that says something about one of its properties.
Qualifier
A part of the claim that says something about the specific claim, often in a descriptive way.

The side picture explains the above glossary terms, by using an item named London.
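The relationship between the glossary terms can also be sketched in code. The following is a minimal illustration only: the function names and object shapes are hypothetical and deliberately simplified, and they are not the JSON format used by the Wikibase API. The source URL is a placeholder.

```javascript
// Simplified, hypothetical shapes for illustration only.
// This is NOT the JSON format used by the Wikibase API.

// A claim is a property-value pair plus optional qualifiers.
function makeClaim(property, value, qualifiers = []) {
  return { property, value, qualifiers };
}

// A statement is a claim supported by optional references.
function makeStatement(claim, references = []) {
  return { claim, references };
}

// True when at least one reference backs the claim.
function isReferenced(statement) {
  return statement.references.length > 0;
}

// The glossary example: "Season: Winter".
const seasonClaim = makeClaim("Season", "Winter");
const bareClaim = makeStatement(seasonClaim);
const sourcedStatement = makeStatement(seasonClaim, [
  { sourceUrl: "https://example.org/source" }, // placeholder source
]);

console.log(isReferenced(bareClaim));        // false
console.log(isReferenced(sourcedStatement)); // true
```

In this sketch, the only difference between a bare claim and a referenced statement is whether the list of references is empty.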

Now let's turn to Pundit. To create an annotation in Pundit, a person selects a sentence, which opens a triple composer with the sentence as the subject of the annotation. The predicate can be selected from a preloaded list, while the objects are fetched from various servers such as DBpedia. This procedure creates a triple; further triples can be composed to add more information about the statement, for example one for references. Finally, all these triples are saved as an annotation, based on the RDF model, on the Pundit annotation server.

How it will work?

I am going to create a browser plugin that offers a GUI when a sentence is highlighted. This GUI will be built upon the Pundit software (using Pundit's triple composer) and will help the user create a triple (subject, predicate, object). It will provide an interface for choosing the entities of the triple from items, properties and values fetched from Wikidata, based on the highlighted sentence. Further options will allow additional triples to be created for adding references. Finally, the statement will be fed to Wikidata in the proper form (on the item's page, with properties, values and references) through JavaScript, by linking the annotation to its item. The tool will offer suggestions for composing triples based on the existing properties and items on Wikidata. For the whole process, we are going to use Wikidata's regularly improving API. Through this, the data a user saves or searches for will be shared globally, since future visitors of the page where the sentence was highlighted will be able to see the annotations.

Pundit's Triple Composer

The following steps show in detail how the extension will work:

  • First, we are going to check through the API whether the user is logged in and, if not, redirect him/her to the login page. A user can still annotate text anonymously, just as an anonymous user can edit pages on MediaWiki.
  • I will package Pundit, integrated with the Wikidata vocabulary (fetched from Wikidata as needed) and selectors, and with a whole new GUI (different from the existing Pundit GUI), as a browser plugin and as a bookmarklet.
  • I will provide a GUI so that the user can annotate text. Note: Pundit already provides a GUI; we will alter it to suit our needs.
  • Next, the interface should propose to:
    • choose a subject (i.e. an item); by default it will be the sentence the user highlighted.
    • choose a predicate (i.e. a property)
    • choose an object (i.e. a data value, or statement)
The proposed predicate should already exist on Wikidata. If it does not, we will present the user with an interface titled:
'Can't find what you are looking for? Propose a property', and then redirect the user to the property proposal page (a page where new properties for Wikidata can be proposed). At the end of this step, the annotation has become a claim.
  • In the next step we will gather the sources of the annotation, such as the website URL, the book's name (Wikisource) and more. If we can't find sources, we will provide an interface for the user to enter them, so that references convert the claim into a statement.
  • Pundit will analyze the annotation as subject, predicate and object, pack it as a statement and then save it on the Pundit server.
  • JavaScript scripts (or a bot) will be run to update the item's page on Wikidata with the necessary information about the statement created. This will sometimes also be done through the Wikibase API.
  • The flow will be unidirectional: the user creates annotations and saves them on the Pundit server, and they are then synchronized with the item's page on Wikidata.
  • Bidirectional synchronization is a possible further extension of this project.
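The step in which a composed triple becomes a claim on an item's page can be sketched as follows. This is a minimal sketch under stated assumptions, not the plugin's actual code: the function name `tripleToCreateClaimParams` and the triple shape are hypothetical, only item-valued objects are handled, and the token is a placeholder. The parameter names follow the Wikibase `wbcreateclaim` API action; the example IDs are Q84 (London), P17 (country) and Q145 (United Kingdom).

```javascript
// Sketch: map a composed triple onto parameters for the Wikibase
// `wbcreateclaim` action. Only item-valued objects are handled here.
// The function name and the triple shape are hypothetical.
function tripleToCreateClaimParams(triple, token) {
  if (!/^Q\d+$/.test(triple.subject)) {
    throw new Error("subject must be an item ID such as Q84");
  }
  if (!/^P\d+$/.test(triple.predicate)) {
    throw new Error("predicate must be a property ID such as P17");
  }
  if (!/^Q\d+$/.test(triple.object)) {
    throw new Error("object must be an item ID such as Q145");
  }

  // The API expects the value as a JSON-encoded string.
  const numericId = Number(triple.object.slice(1));
  return {
    action: "wbcreateclaim",
    format: "json",
    entity: triple.subject,      // item that receives the claim
    property: triple.predicate,  // property of the claim
    snaktype: "value",
    value: JSON.stringify({ "entity-type": "item", "numeric-id": numericId }),
    token: token,                // edit token obtained from the API beforehand
  };
}

// London (Q84) -- country (P17) --> United Kingdom (Q145)
const exampleTriple = { subject: "Q84", predicate: "P17", object: "Q145" };
const claimParams = tripleToCreateClaimParams(exampleTriple, "EDIT-TOKEN-PLACEHOLDER");
console.log(claimParams.value); // {"entity-type":"item","numeric-id":145}
```

The resulting object would be sent as a POST request to the Wikidata API endpoint. Validating the IDs before sending keeps free-text subjects (such as a raw highlighted sentence) from reaching the API unresolved.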

Tools to be used

  1. Pundit: Pundit is free, open source software for augmenting web pages with semantically structured annotations. I am going to use it to extract entities from the sentence and then use its triple composer to compose triples (subject, predicate and object). The composer will suggest subjects, predicates and objects to the user based on the properties, items and values retrieved from Wikidata by JavaScript scripts. I chose Pundit because it is open source, well established and regularly maintained. In addition, the creators of this software are ready to help if I need it. An example of how Pundit works explains in detail the process of creating annotations with it.
  2. Wikibase API: I am going to use the Wikidata API for all interaction with Wikidata; it is currently in a stable state and regularly maintained. I will interact with Wikidata item pages through this API. Its second job is to retrieve items, values and properties from Wikidata to present to the user so that he/she can create statements. The login status of the user will also be checked through this API.
  3. Wikibase PHP API: I might occasionally use this tool, provided by addwiki, in case I am unable to make a certain request through JavaScript.
  4. Dojo: As Pundit is built on the Dojo framework for JavaScript, I will be writing most of my code in it.
  5. QUnit: I will be using the QUnit test framework provided by the jQuery Foundation to test my code against many testing scenarios. It is openly available software licensed under the MIT license.
  6. In some cases the code may be extended from existing external tools hosted at Wikimedia Labs.
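The two jobs of the Wikibase API named in the list above, retrieving suggestions and checking the login status, might look roughly like this. It is a sketch under assumptions: the helper names (`buildSearchUrl`, `extractSuggestions`, `isLoggedIn`) are hypothetical, and the sample responses are hand-written to resemble the shape of `wbsearchentities` and `meta=userinfo` replies rather than captured from the live API. No network request is made here.

```javascript
const WIKIDATA_API = "https://www.wikidata.org/w/api.php";

// Build a GET URL for the `wbsearchentities` action.
// `type` is "item" or "property"; `origin=*` allows anonymous
// cross-origin requests from a browser.
function buildSearchUrl(text, type = "item", language = "en", limit = 5) {
  const params = new URLSearchParams({
    action: "wbsearchentities",
    search: text,
    type: type,
    language: language,
    limit: String(limit),
    format: "json",
    origin: "*",
  });
  return `${WIKIDATA_API}?${params.toString()}`;
}

// Reduce a search response to what a suggestion list needs.
function extractSuggestions(response) {
  return (response.search || []).map((entry) => ({
    id: entry.id,
    label: entry.label,
    description: entry.description || "",
  }));
}

// A `meta=userinfo` reply reports user id 0 for anonymous users.
function isLoggedIn(userinfoResponse) {
  return userinfoResponse.query.userinfo.id !== 0;
}

// Hand-written sample data, not real API output.
const sampleResponse = {
  search: [{ id: "Q84", label: "London", description: "example description" }],
};
const sampleAnonymous = { query: { userinfo: { id: 0, name: "127.0.0.1" } } };

console.log(buildSearchUrl("London"));
console.log(extractSuggestions(sampleResponse));
console.log(isLoggedIn(sampleAnonymous)); // false
```

In the plugin, the URL would be requested via AJAX and the reduced suggestion list handed to the triple composer.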

Deliverables


Required Deliverables

  • Create a plugin that provides the ability to annotate text and then feed the annotated text to Wikidata.
  • The plugin must use existing properties available from Wikidata; if none is available, it asks the user to propose a new one.
  • The plugin must allow users to create items on the fly.
  • Provide references by taking the source URL and quotations into consideration.
  • Show users who have the plugin activated the annotations made by previous users on the same annotation source.

Optional Deliverables

  • Create a MediaWiki extension for the same purpose, thus increasing the reach.

Timeline

Task No. | Timeline | Task
1 | April 22 - May 3 | Familiarise myself with the mentors, the codebase and my project. As I have been in contact with my mentors for a long time and have already gone through the codebase, this task won't take long.
2 | May 4 - May 16 | Create a prototype plugin for the annotator and write its initial code. Also write the login functionality.
3 | May 17 - May 23 | Write unit tests and test the initial code.
4 | May 23 - June 5 | Use Pundit's entity extractor to extract entities from the sentence and start altering the GUI provided by the existing Pundit base. Consult the Wikimedia Design team for design improvements.
5 | June 5 - June 15 | Write more unit tests and vigorously test the GUI and the extension in various scenarios. Fix any bugs found during testing.
6 | June 15 - June 25 | Use the Wikidata API to retrieve existing properties for the selected text and show suggestions.
- | June 23 | Midterm evaluation
7 | June 25 - July 10 | Push the statements (subject-predicate-object) to Wikidata and save them on item pages using the Wikidata API. Also save them on Pundit's server to show later users what was annotated before them.
8 | July 10 - July 19 | Create a bot that fetches annotations from the Pundit server and feeds them to Wikidata regularly.
9 | July 19 - July 22 | Brush up documentation, add comments and package the plugin.
10 | July 22 - August 1 | Write unit tests and test repeatedly. Fix any bugs found during tests. Package the plugin for public use.
11 | August 2 - August 15 | Find and fix all remaining bugs and clean up the extension.
- | August 18 | Firm pencils-down date.
12 | August 21 | Submit the extension to Google and launch the plugin on an initial scale with the help of MediaWiki and my mentors.

Details on Timeline

  • Task 1:
I have been contributing to MediaWiki since last year. Through the micro task I have already set up Pundit on my machine and familiarized myself with the Wikibase API and with making requests to it through JavaScript. I am in regular contact with my mentors through a Google group, where we discuss the topic and where I post questions when I have doubts. We also hold voice calls on Google Hangouts, which makes the communication more effective.
  • Task 2:
Since I am creating a browser plugin which can also easily be saved as a bookmarklet, and Pundit already has the functionality to package itself as a bookmarklet, this part won't take much time to implement; most of the time will go into setting up the plugin. The initial code will therefore focus on setting up Pundit to work in synchronization with MediaWiki. The login functionality through MediaWiki will also be implemented in this phase. It will be based on the data provided by the API as explained here.
  • Task 3:
Writing unit tests and testing the code is an essential and integral part of this project, so it will be done at many stages. I will be using jQuery's QUnit to test my code, and the test suite can be extended regularly as the code grows.
  • Task 4:
I will use Pundit's entity extractor to extract entities from the highlighted sentence and then present an editing screen, which is basically Pundit's triple composer. I then have to modify the GUI that Pundit currently provides for the triple composer and blend it into the traditional look of MediaWiki, so throughout this period I will be writing CSS and JavaScript to restyle the current GUI. I will also contact Wikimedia and the Wikimedia Design team to get their feedback on the design.
  • Task 5:
Again, unit tests written with QUnit will exercise the whole code to find out whether anything is broken, improving the overall stability of the code. If any errors are found, they will be fixed during this period.
  • Task 6:
This task is about bringing the Wikidata vocabulary from its server to the user's frontend and suggesting properties and values to the user based on the data received. Since the Wikidata API is publicly available for making requests and fetching JSON data, this job will be done through JavaScript AJAX requests. I will make GET requests for JSON to the server and handle the data received with my JavaScript functions on the client, making the required Wikidata vocabulary available to Pundit. The triple composer will thus have entities to suggest to the user. More triples can also be created as needed, for adding references etc.
  • Task 7:
In this task we will store annotations on the Pundit server so that users who come after the annotator can see what was annotated before them. The annotations will be saved on the Pundit annotation server in a proper RDF model. At this point we will also create JavaScript scripts that feed the page of the specific Wikidata item (the subject of the annotation triple) with the required triple in the form of a property-value pair, taking references into account as well.
  • Task 8:
During this time I will create a bot based on PyWikipediaBot that regularly fetches annotations from the Pundit server and feeds them to Wikidata in a proper manner based on the RDF model. Here the subject of the annotation on the Pundit server becomes the item, the predicate the property and the object the value, and these are fed to the item's page on Wikidata through the Wikibase API.
  • Task 9:
This will involve writing documentation for the JavaScript objects (representing classes) created during the project, adding comments to the source code, and packaging the plugin for public use.
  • Task 10:
Again, write QUnit tests (QUnit is provided by the jQuery Foundation) to test the whole code and fix any bugs found that could break the plugin.
  • Task 11:
Launch the plugin for the public and fix any bugs users find by actively working on it.
  • Task 12:
Submit the source code to Google for final evaluation and launch the plugin officially for public use, thereby finishing the project.
  • Optional Task:
If time permits, I will create a MediaWiki extension for the plugin and host a test setup at Wikimedia Labs, thus increasing the reach of the project.
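The reference handling mentioned in Task 7 can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the statement GUID and token are placeholders, and only string-valued reference properties are covered. The parameter names follow the Wikibase `wbsetreference` API action, and P854 is Wikidata's "reference URL" property; the ID of a property for quotations would have to be looked up and is not assumed here.

```javascript
// Sketch: build the `snaks` structure for one reference from a map of
// property ID -> string value. Helper names are hypothetical.
function buildReferenceSnaks(stringValuesByProperty) {
  const snaks = {};
  for (const [property, value] of Object.entries(stringValuesByProperty)) {
    snaks[property] = [
      {
        snaktype: "value",
        property: property,
        datavalue: { type: "string", value: value },
      },
    ];
  }
  return snaks;
}

// Parameters for the Wikibase `wbsetreference` action, which attaches
// a reference to an existing claim identified by its GUID.
function buildSetReferenceParams(statementGuid, stringValuesByProperty, token) {
  return {
    action: "wbsetreference",
    format: "json",
    statement: statementGuid,
    snaks: JSON.stringify(buildReferenceSnaks(stringValuesByProperty)),
    token: token,
  };
}

// Placeholder GUID, URL and token, for illustration only.
const referenceParams = buildSetReferenceParams(
  "Q84$EXAMPLE-GUID",
  { P854: "https://example.org/source" }, // P854 = reference URL
  "EDIT-TOKEN-PLACEHOLDER"
);
console.log(referenceParams.snaks);
```

Once such a reference is attached, the claim created earlier becomes a referenced statement in the sense of the glossary.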

Participation

For me, it was and always will be 'Sharing is Caring'. This attitude has always helped me get along well with the Wikimedia community. I will publish all my completion reports on my blog weekly. All source code I write will be published to my GitHub repository and also pushed to a branch in the Pundit repository to enable collaboration. I always try to stay online on IRC and reply to emails regularly, which helps me blend into the community. Testing and documentation will be added to the Wikitech Mail page. I am mostly available on #mediawiki, #wikimedia-dev and #wikidata during my working hours. I usually hang out with my mentors to discuss ideas, and I always post our discussions and questions on our Google group, which anyone is free to join.

About Me

I am a 19-year-old second-year student enrolled in Electrical Engineering (a four-year course) at IIT Roorkee. I developed a passion for programming and web development in my freshman year. I have been contributing regularly to MediaWiki since November 2013. I am an active member of SDSLabs [1] at IIT Roorkee. I am proficient in JavaScript, PHP, Python and Node.js. I have been using Linux for the past two years, and it was my initial source of inspiration for open source. I open-source all the projects that I do individually so that others can gain something from them. I have been developing apps regularly at SDSLabs; we code late into the night at our lab and we all enjoy it. You can find SDSLabs' GitHub profile here. I usually work from 4:00 p.m. to 11:00 p.m. on weekdays and from 11:00 a.m. to 11:00 p.m. on weekends; the rest of my time goes to my studies.

I have summer vacation from the end of April to mid-July. After that I will have to spend 28 hours per week on my studies, which doesn't affect my work timings, so I expect to complete my project on time, and I will continue developing it further once GSoC is completed. Coming from a remote village in the valleys of Himachal Pradesh, I love the idea of open source. I believe that 'Sharing is Caring' and hope that this idea will spread further through communities like MediaWiki and programs like GSoC.

I am eagerly looking forward to my project. I selected it because it involves the idea of sharing, i.e. collaborating on data, and because it involves my favorite language, JavaScript, along with some PHP on the server end. The project is interesting because it aims to connect data around the world with its sources and to help people save their important data, so I am excited about it.

Motivation

I am highly passionate about open-source software and security. You can see my other open-source projects on my GitHub profile mentioned below. I assure dedication of at least 40 hours per week to the work and that I do not have any other obligations during the period of the program, with the obvious exception of regular academics. A major part of this duration is summer holidays where I’ll be working from my home. Also, if any part of the proposal is not clear, I'll be very happy to clarify.

Past Projects & Contributions
  1. GitHub profile.
  2. Built a web app for a local startup at IIT Roorkee, Roorkee Delivers.
  3. Created a code-sharing website, OpenCode.
  4. A web app that makes matches on the basis of common interests between two people.
  5. jQuery plugins for a shopping cart (jCart) and cookies (jCookie).
  6. Created an application for alumni to share their experiences at IIT Roorkee.
  7. Contributions to MediaWiki (Gerrit repo).
  8. I have mostly worked on improving the Multimedia Viewer extension.
  9. I have also contributed to the open source project Moodle.
  10. Worked on our own lab music player, written in Node.js and based on Play by GitHub.

Any Other Info


UI Model

  • A simple UI model under construction can be seen at this link.

Micro Task

This involves two simple tasks:

  • Showcase a simple web page with Pundit set up.
  • Make a simple web app that uses the Wikidata API.

Source code can be found at this link.

See also
