Zotero translator for Wikimedia blog

This page shows how to build a Zotero web translator for the Wikimedia blog. We will test it with translation-server. The fields that should be filled in for blog entries are provided here.

Create translator file and add metadata

  1. Set up the environment for translator development with translation-server, following the steps explained here.
  2. Create a new file in your text editor named Wikimedia Blog.js and save it in translation-server/modules/zotero/translators.
  3. Add the metadata at the top of the file in the format shown below. Generate a hash by running md5sum "/exact-path/Wikimedia Blog.js" in the terminal and enter it as the translatorID.
  4. Generate the current system time using the command date "+%Y-%m-%d %H:%M:%S" in the terminal and use it as lastUpdated. Add your name as the creator and leave the other fields as shown.
{
	"translatorID": "1c78acb8-faaa-4465-a0e7-f6dd8c4560e6",
	"label": "Wikimedia Blog",
	"creator": "Sonali Gupta",
	"target": "^https?://blog\\.wikimedia\\.org/",
	"minVersion": "3.0",
	"maxVersion": "",
	"priority": 100,
	"inRepository": true,
	"translatorType": 4,
	"browserSupport": "gcsibv",
	"lastUpdated": "2017-08-25 20:42:49"
}

Add Licence block

Add this licence block and edit the year and the creator's name in the first line of the block.

/*
	***** BEGIN LICENSE BLOCK *****

	Copyright © 2017 Sonali Gupta
	
	This file is part of Zotero.

	Zotero is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	Zotero is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with Zotero. If not, see <http://www.gnu.org/licenses/>.

	***** END LICENSE BLOCK *****
*/

Add polyfill functions and detectWeb

  1. Next, add the polyfill functions for attr and text.
    function attr(docOrElem, selector, attr, index) {
    	var elem = index
    		? docOrElem.querySelectorAll(selector).item(index)
    		: docOrElem.querySelector(selector);
    	return elem ? elem.getAttribute(attr) : null;
    }
    
    function text(docOrElem, selector, index) {
    	var elem = index
    		? docOrElem.querySelectorAll(selector).item(index)
    		: docOrElem.querySelector(selector);
    	return elem ? elem.textContent : null;
    }
    
  2. Open this Wikimedia blog entry in a browser tab, and let us find a way to tell whether a given URL is a single blog entry. Press Ctrl+Shift+I and check the class attribute of the body tag of the page. It can be used to identify the page type: the substring "single-post" tells us that a page is a single blog entry.
  3. Open this search page of the blog in a browser tab and inspect the body tag to find a way to identify search results. Here the class attribute of the body tag contains substrings like "search", "search-results", etc. So for any URL we receive, we can check the class attribute of the body tag to classify the document. There are other multiple-entry pages on the blog, categorized as technology, community, foundation, etc., which have been archived; we can handle them by checking whether the class name contains the substring "archive". Following is the detectWeb code using this logic.
    function detectWeb(doc, url) {
    	if (doc.body.className.indexOf("search-results")>-1 || doc.body.className.indexOf("archive")>-1 
    	        && getSearchResults(doc, true)) {
    		return 'multiple';
    	}
    	else if (doc.body.className.indexOf("single-post")>-1) {
    		return 'blogPost';
    	}
    }
    

Add getSearchResults and doWeb functions

  1. For detectWeb to work, we need to write getSearchResults. This method should pick up all the results from a multiple-items page and save them as key-value pairs of URL and title. We will use CSS selectors to reach the nodes holding information about the articles. Open this search page in a new tab and press Ctrl+Shift+I. Using the node picker, inspect the title of the first article; the HTML tag corresponding to it will be highlighted in the Inspector window. Right-click on the tag and choose Copy -> CSS Path. The following is the CSS path.
    (Image: Inspect title node of search results to get CSS path)
    html.svg body.search.search-results.logged-in.admin-bar.no-customize-support.mp6.customizer-styles-applied.highlander-enabled.highlander-light.infinite-scroll div.wrapper div.main.shell.cf div.content ol#articles.articles li.article header.article-head h4.article-title a
    
  2. We can shorten this path as long as it still uniquely identifies the nodes we are looking for. The last few selectors serve the same purpose as the whole path, so we will use li.article header.article-head h4.article-title a to get the nodes holding the title as text and the URL in the href attribute. Below you can see how getSearchResults uses this path; after it, we can simply insert the template of the doWeb function without any changes.
    function getSearchResults(doc, checkOnly) {
    	var items = {};
    	var found = false;
    	// CSS selector for search result title links
    	var rows = doc.querySelectorAll('li.article header.article-head h4.article-title a');
    	for (var i=0; i<rows.length; i++) {
    		var href = rows[i].href;
    		var title = ZU.trimInternal(rows[i].textContent);
    		if (!href || !title) continue;
    		if (checkOnly) return true;
    		found = true;
    		items[href] = title;
    	}
    	return found ? items : false;
    }
    function doWeb(doc, url) {
    	if (detectWeb(doc, url) == "multiple") {
    		Zotero.selectItems(getSearchResults(doc, false), function (items) {
    			if (!items) {
    				return true;
    			}
    			var articles = [];
    			for (var i in items) {
    				articles.push(i);
    			}
    			ZU.processDocuments(articles, scrape);
    		});
    	} else {
    		scrape(doc, url);
    	}
    }
    

Test multiple entry pages

Once detectWeb, getSearchResults and doWeb are coded, we can go ahead and test the multiple entry pages to see if everything works. Open a terminal and, from the translation-server directory, rebuild the image to pick up the changes and run the Docker container.

./build.sh && docker run -p 1969:1969 -ti --rm -v `pwd`/build/app/:/opt/translation-server/app/ translation-server
(Image: Server output on successful translation of search page of blog)

Open another terminal window and pass the query for the search page we have been using.

curl -d '{"url":"https://blog.wikimedia.org/?s=wikimania","sessionid":"abc123"}' \
       --header "Content-Type: application/json" \
       127.0.0.1:1969/web

We should also make sure that our methods handle the archived articles well. Let's pass another URL to test.

 curl -d '{"url":"https://blog.wikimedia.org/c/community/","sessionid":"abc123"}' \
       --header "Content-Type: application/json" \
       127.0.0.1:1969/web

Add scrape method

We will use Embedded Metadata.js to scrape information. It reads data from the page's meta tags and automatically fills the relevant fields. On top of that, we will extract any other information the translator misses, where necessary. Let us first check the output provided by Embedded Metadata. Include the following code, which loads one translator into another.

function scrape(doc, url) {
	var translator = Zotero.loadTranslator('web');
	// Embedded Metadata
	translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48');
	translator.setDocument(doc);
	
	translator.setHandler('itemDone', function (obj, item) {
		item.complete();
	});

	translator.getTranslatorObject(function(trans) {
		trans.itemType = "blogPost";
		trans.doWeb(doc, url);
	});
}
(Image: Translation of blog post by importing Embedded Metadata)

Rebuild the Docker image and run the server. Pass the following query, which now contains a single entry page's URL, for testing. You can replace item.complete() with Zotero.debug(item) to see the output in the terminal in a readable form.

 curl -d '{"url":"https://blog.wikimedia.org/2017/08/23/vitor-mazuco/","sessionid":"abc123"}' \
       --header "Content-Type: application/json" \
       127.0.0.1:1969/web
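
To inspect the item in a readable form, the itemDone handler in scrape can be changed temporarily as suggested above; a minimal sketch:

translator.setHandler('itemDone', function (obj, item) {
	// Temporary: print the item to the terminal instead of saving it.
	Zotero.debug(item);
	// item.complete();
});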

The Embedded Metadata translator will pull fields like url and title. For a blog entry, we can additionally provide the author's name, the time the article was published, the tags it has, etc. Let us see how to extract some of these fields.

Extract creator information

Open this article in a new tab. Under the title, we have the author information. Inspect the first author name "Ruby Mizrahi" with the help of the node picker. Right-click and copy the CSS path for this node. The node is <a href="https://blog.wikimedia.org/author/ruby-mizrahi/" title="Posts by Ruby Mizrahi" class="author url fn" rel="author">Ruby Mizrahi</a>. As we can see, its class attribute contains author. The CSS path returned is

html.svg body.post-template-default.single.single-post.postid-52988.single-format-standard.logged-in.admin-bar.no-customize-support.mp6.customizer-styles-applied.highlander-enabled.highlander-light div.wrapper div.main.shell.cf div.content article.article.article-single header.article-head div.article-meta p.meta a.author.url.fn
(Image: Inspect node to get CSS path for authors' information)

We can just use the last few CSS selectors, a.author.url.fn, to identify all the nodes holding author information. We will edit the translator's handler to use this path as follows: we take all the nodes matching this CSS path, get the text they hold, and pass each name to the Zotero utility ZU.cleanAuthor().

translator.setHandler('itemDone', function (obj, item) {
	var authors = doc.querySelectorAll('a.author.url.fn');
	for (var i = 0; i < authors.length; i++) {
		item.creators.push(ZU.cleanAuthor(authors[i].text, "author"));
	}
	item.complete();
});
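
ZU.cleanAuthor() splits a plain name string into its parts; for example:

var creator = ZU.cleanAuthor("Ruby Mizrahi", "author");
// creator is { firstName: "Ruby", lastName: "Mizrahi", creatorType: "author" }

The publication time mentioned earlier can be pulled in the same handler. The following is only a minimal sketch: the article:published_time meta property is an assumption and should be verified in the page source with the inspector before relying on it.

// Hypothetical meta property -- verify in the inspector that the blog's pages carry it.
var published = attr(doc, 'meta[property="article:published_time"]', 'content');
if (published) {
	// ZU.strToISO() normalizes a raw date string to ISO format.
	item.date = ZU.strToISO(published);
}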

Add rights

We can add "Wikimedia blog" as the value of the libraryCatalog field. We can use the license provided in the footer of the blog as the rights for each blog entry. Inspect the Creative Commons license with the node picker and you will get the following CSS path.

html.svg body.post-template-default.single.single-post.postid-52999.single-format-standard.logged-in.admin-bar.no-customize-support.mp6.customizer-styles-applied.highlander-enabled.highlander-light div.wrapper div.footer div.shell div.footer-inner.cf div.copyright p a
(Image: Inspect node to get CSS path for copyright information for blog)

We can identify this node by div.copyright p a and add this information to the item object as shown below.

    item.libraryCatalog = "Wikimedia blog";
    item.rights = text(doc, "div.copyright p a");
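
Putting the pieces together, the itemDone handler in scrape now reads as follows; this simply assembles the lines shown above in one place.

translator.setHandler('itemDone', function (obj, item) {
	var authors = doc.querySelectorAll('a.author.url.fn');
	for (var i = 0; i < authors.length; i++) {
		item.creators.push(ZU.cleanAuthor(authors[i].text, "author"));
	}
	item.libraryCatalog = "Wikimedia blog";
	item.rights = text(doc, "div.copyright p a");
	item.complete();
});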

Test single entry pages

Rebuild the container to pick up the updates to the translator and run the server. Test a single post entry using curl to make sure there is no syntax or logic bug in the code.

 curl -d '{"url":"https://blog.wikimedia.org/2017/08/23/vitor-mazuco/","sessionid":"abc123"}' \
       --header "Content-Type: application/json" \
       127.0.0.1:1969/web

Add test cases

We should add test cases for each type of item the translator can identify. Here we have identified pages as either single blog posts or multiple blog posts. Having successfully tested the URLs in the sections above, we can write test cases manually using the results generated.

/** BEGIN TEST CASES **/
var testCases = [
	{
		"type": "web",
		"url": "https://blog.wikimedia.org/?s=wikimania",
		"items": "multiple"
	},
	{
		"type": "web",
		"url": "https://blog.wikimedia.org/c/community/",
		"items": "multiple"
	}
]
/** END TEST CASES **/
(Image: Output of Zotero.debug(item) for single Wikimedia blog entry)

The best way to write a test case for a single entry item is to temporarily replace item.complete() with Zotero.debug(item) in the scrape method, then take the output and save it in the following format.

	{
		"type": "web",
		"url": "https://blog.wikimedia.org/2017/08/23/vitor-mazuco/",
		"items": [
			{
				"itemType": "blogPost",
				"title": "Vitor Mazuco and the fandom that drives his Wikipedia editing – Wikimedia Blog",
				"creators": [
					{
						"firstName": "Ruby",
						"lastName": "Mizrahi",
						"creatorType": "author"
					},
					{
						"firstName": "Michelle",
						"lastName": "Fitzhugh-Craig",
						"creatorType": "author"
					}
				],
				"rights": "Creative Commons Attribution 3.0 unported license",
				"url": "https://blog.wikimedia.org/2017/08/23/vitor-mazuco/",
				"libraryCatalog": "Wikimedia blog",
				"attachments": [
					{
						"title": "Snapshot"
					}
				],
				"tags": [],
				"notes": [],
				"seeAlso": []
			}
		]
	}

Submit Pull Request

Once the translator file is complete, we should submit it upstream to the Zotero translators repository on GitHub.

  1. Fork this repository and clone it to your local system.
  2. Create a branch and copy this file into the repository.
  3. Create a pull request to contribute.
  4. File a task against Citoid in Phabricator to request that it be pulled into Citoid (example).