Citoid is usually pretty good about UTF-8 conversion, but http://w.genealogy.euweb.cz/hung/batth3.html has the á come back as �. ~~~~
Talk:Citoid
When I request http://w.genealogy.euweb.cz/hung/batth3.html in my browser, the content comes back encoded as non-UTF-8 (perhaps Windows-1252?), e.g. <TITLE>Batthy�ny 3</TITLE>. Your browser is probably doing some magic to turn it into readable text, but I think the issue is in the source.
Playing around with the encoding in Safari, I found that the "Default" option looks great, but UTF-8 is wrong. Going through the options, ISO Latin is the one that matches. Chrome and Safari on macOS are both smart enough (perhaps via the same OS library) to figure out that it is not UTF-8. Character-set guessing is not an exact science, sadly (harder than NP-complete, blah blah blah), but heuristic guesses are usually pretty good.
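The mojibake is easy to reproduce: the bytes the server sends are Windows-1252, and decoding them as UTF-8 substitutes U+FFFD (�) for the invalid byte. A quick illustration in Python (not Citoid's code, just the general mechanism):

```python
# The page title from the report above, as the server actually encodes it.
raw = "Batthyány 3".encode("windows-1252")  # b'Batthy\xe1ny 3'

# Decoding the same bytes as UTF-8 fails: 0xE1 starts a 3-byte UTF-8
# sequence, but the following 'n' (0x6E) is not a valid continuation byte,
# so a lenient decode substitutes U+FFFD (�).
wrong = raw.decode("utf-8", errors="replace")
right = raw.decode("windows-1252")

print(wrong)  # Batthy�ny 3
print(right)  # Batthyány 3
```

This is why a browser's charset sniffer gets it right while a naive UTF-8 decode does not.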
Hi, first, thank you for all the hard work you guys do on this tool. :)
Second, I've often wanted to cite papers on SSRN on Wikipedia, but citoid doesn't seem to work with SSRN. Is there any possibility of handling it? Thanks!
This thread on en.wp is the story of a 70k+ editcount editor hitting "convert" over and over in VisualEditor and causing the loss of information from existing citations. Plenty of diffs in the thread. I don't know whether this problem comes from Citoid's output or from VisualEditor not checking existing template parameters to ensure it's not deleting anything, so crossposting to both talkpages. Folly Mox (talk) 11:25, 14 May 2024 (UTC)
Update since the thread linked above has been archived. I haven't duplicated the problem yet, but I'll try to put together some test cases and see what is happening exactly so I can file a phab ticket.
This flaw is heartbreaking. An article I created was impacted by this: a user used the "Convert" button in visual editor, and they thought they were "improving" the article ... and the edit convinced them it was better (citation in footnote looked cleaner). The user did not realize that the Citoid script was deleting some information from the citation!!!! (usually author name, publication date, publisher name, ... but also other data).
My article lost data from about five citations (about 1/3 of those that Convert script was applied to). But the user that used Convert script has done over 50,000 edits with that tool (I think that # is correct, I have not confirmed it). Does that mean that perhaps 10,000 citations have had data deleted?
This needs to be addressed ASAP !! [Signed user Noleander, cannot get signature to work here]
Hi, I'm not really familiar with how this works, sorry if this is the wrong place to post.
Last year I had a translator pushed to master on the zotero translator github repo, but Wikipedia is yet to incorporate it, which has been kinda annoying for me since I use the source a lot.
It's this one. Do you know who I should ask or how to make sure it gets activated?
Hmm, I'm sorry about that.
We initially pulled in that translator by commit ea95b99d in September 2023, but our Zotero translator snapshot was from August 2023; we only synchronised the translators into the main service in February, and when trying to deploy them to production earlier this month, found that there was a big spike in errors, so we rolled back (T361728).
I'll ask there if there's further knowledge about the issue and if it's expected to roll forwards again.
The QID citations are provided by Zotero too? I wonder why this works in the MediaWiki integration but doesn't work in Zotero desktop. Is the script which handles this publicly available somewhere?
Citoid converts QIDs to wikidata urls and hands the url to Zotero. It will work in Zotero too if you give it the full url, not just the QID.
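That conversion is simple enough to sketch. Assuming the behaviour described above (this is illustrative, not Citoid's actual code; the function name is made up):

```python
import re

def qid_to_url(identifier: str) -> str:
    """Expand a bare Wikidata QID (e.g. 'Q42') into a full URL that
    Zotero's translators can handle; pass anything else through unchanged."""
    identifier = identifier.strip()
    if re.fullmatch(r"Q[1-9]\d*", identifier):
        return f"https://www.wikidata.org/wiki/{identifier}"
    return identifier

print(qid_to_url("Q42"))  # https://www.wikidata.org/wiki/Q42
```

A plugin that does this expansion before handing input to Zotero would presumably reproduce what Citoid does.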
Well, the idea is where to place that URL in Zotero desktop (My Library), because if I add it via Add Item by Identifier it ends with an error ("Zotero could not find any identifiers in your input. Please verify your input and try again."), and I don't see any other area where it could be added. My intention is to create a Zotero plugin which would let users upload a citation to Zotero desktop via QID, and potentially also improve the Wikidata item in the future by sharing their Zotero improvements. So that's why I was interested in why it works for Citoid but doesn't work for Zotero desktop.
Well, I have tried navigating the browser to a certain item on Wikidata and grabbing the page via the Zotero connector, and it looks like I created something, but only just "something", because it's far from the expected behaviour. Namely, it treats the resource as a web page and doesn't grab the author mentioned in the item. So I assume it is scraping some metadata from the website rather than mapping the item itself.
A few of us at en-wiki are about 55% through a massive cleanup project that resulted from the careless use of a user script that used Citoid to overwrite existing references with automatically generated ones. The current phase involves manually checking about 2400 diffs from the period January through April 2023. We haven't yet identified the full scope of the cleanup.
Through the course of this cleanup, we've determined that Citoid's references require no improvement by human editors for only a small percentage of sources. As one of the involved editors with a more technical background (although from decades ago), I'm trying to understand how the whole thing works, so we can improve the quality of references on en-wiki and avoid repeats of this sort of cleanup. I've personally invested probably over 100 hours in this over the past few weeks.
From what I understand of the documentation available here, Citoid uses a fork of the Zotero translators. Is that correct? If so, how recently was it forked, and/or how often is it re-synchronised? Citoid/Determining if a URL has a translator in Zotero states In production for wikimedia, we've enabled three translators.
Sorry, that's outdated. We recently reactivated the fork, but not to make changes to whether it's supported in translation-server or not. I've now fixed that page. (In the past we enabled 3 translators that Zotero didn't have enabled, not 3 total)
The repository at GitHub shows a great number of javascript files, such that I can't figure out how to get to the end of "A" in the alphabetical listing. How does this align with having "enabled three translators"?
Talk:Citoid/Determining if a URL has a translator in Zotero has a comment from 2015 stating Citoid uses its own HTML meta-data scraper as a fall-back when Zotero doesn't return any result. Is there any way to record / indicate this? Like a hidden comment before the closing ref tag along the lines of "citation created by generic translator", or a warning message to the editor along the lines of "automated citations to this site may contain errors, please double check"?
Citoid is a very powerful library, and during the course of my cleanup efforts I've dropped into the visual editor a couple times to make use of it (in cases where the reference had been generated from a URL where Citoid's behaviour is suboptimal, but which contained a DOI that could be used to create a complete citation). However, at en-wiki at least, there's a culture of trusting code to function perfectly in all cases where it doesn't generate any warnings or errors. Effecting cultural change is difficult, and creating references manually is time-consuming, so I'm exploring all avenues. I don't think my technical skills are high enough to start writing Zotero translators, and I'm not sure how to get Citoid to incorporate those translators in its dependencies.
Also, citations created from Google Books never include editors, misattributing their contribution as authorship, and I'm not sure if that's something that can only be addressed by improving the translator or if it's something that is going on within Citoid. Thanks in advance for your answers. Kindly,
There's a "powered by Zotero" message in the citation picker if it's from Zotero, but Zotero also has generic translators now, so that is probably not super useful to you (historically it did not; it's very rare to have a purely Citoid response now).
The GitHub repo you've found is a good place to look to see what's available (I note the AWS tests haven't run in a while!). Some of the poor citation quality is going to come from JavaScript-loaded pages: things that work well with the Zotero browser extension, which deals with the fully loaded page, won't necessarily give the same results after being scraped using translation-server.
(I tried to post the above in one go, but kept tripping the abuse filter for "linkspam". I couldn't even get the third paragraph to post as a single comment. Maybe the filter settings are too strict?)
I think the problem of mistaking a book's editors for its authors predates citoid and the visual editor. Part of the problem is that reality is complicated. If you look at page xix in https://books.google.com/books?id=bIIeBQAAQBAJ, you'll see that there is a main author, an editor, and a long list of people who wrote specific entries. The correct author's name depends on which bit you're actually citing, and you're supposed to notice the presence or absence of the author's initials at the end. In https://www.google.com/books/edition/The_Routledge_Encyclopedia_of_Mark_Twain/8BhUuxcKNPkC, however, Google correctly names the editors as being editors, and it would be nice if citoid/Zotero could figure that out.
I definitely never expect automated referencing to identify chapter contributors. Sometimes the table of contents is not even available for preview. I've had some luck going directly to publishers for the info, but in a couple of cases I've had to leave the author attribution empty. Identifying editors as editors seems like pretty low-hanging fruit, by which I mean it's clearly stated at the bottom of the page, right by the publisher and ISBN information already being correctly scraped.
Now that I think of it, something that's entirely within Citoid's remit would be, when creating a book citation, to use the "authorn-first" and "authorn-last" aliases instead of "firstn" and "lastn", since they'd be considerably easier to change into "editorn-" form, without needing to erase and retype the full parameter names as is currently necessary.
The specific names of the parameters to use are controlled on-wiki via the "maps" parameter in the TemplateData of the relevant citation template, e.g. https://en.wikipedia.org/wiki/Template:Cite_book/TemplateData. I'm not convinced this is a good idea, but it is easily implementable.
In the interim, the CS1 templates have been updated to support "editor-lastn", "author-firstn" etc forms, so this particular suggestion is no longer relevant. Citoid still appears to suffer from the same issues, although I understand the maintainer has been tied up working on compatibility with other Wikimedia projects. A few lists of regexes for dealing with common failure states would go a long way.
Citoid knows if it's using a Zotero translator or not. Does it know which one? If it does, and citation templates were updated to hold an appropriate hidden parameter, could the translator in use be surfaced and passed to the template? That could facilitate identifying which translators are consistently inaccurate, which seems like a good first step in trying to improve them or track them for manual correction.
Zotero reports translator use in its logs, unfortunately the logging is not compatible with our infrastructure so we have those turned off. But if you run a version locally and try the url, it will tell you in the console.
I should also mention I was informed a few days ago at en:Module talk:Wd#References mapping that |website= (in citation templates) "should only get the domain name when the source is best known by that name". Citoid always chooses to fill this parameter, even when it can't discern a human-readable website name and falls back on the first part of the URL (which is often the case). Apparently this behaviour is not desirable in general.
I've written a projectspace page about this at en:Wikipedia:WikiProject Citation cleanup/Repairing algorithmically generated citations. Corrections and additions are welcome. I really don't want to misinform anyone, and I have an incomplete understanding of the architecture and stack.
Does Citoid do any error checking on its values? It's not immediately clear where to find the source code, but it's pretty clear the user scripts downstream of Citoid don't double-check it, so we get silly things like a perfectly formatted citation to a 404 page, or numeric data in an author name field. I understand that the parsing issues themselves stem from Zotero, but if basic error checking could be performed in-house, it could cut down on the amount of bogus citations added by good-faith editors not cautious enough to double-check script output.
There is various error checking; for instance, we check whether a website sends a 404 Not Found status code. Unfortunately, websites don't always comply with the HTTP spec and do silly things like return a 200 OK status code and then write "404" in the page text.
Well, I guess it's fair that websites should probably follow standards and return 404 codes for their 404 pages, but since many of them don't, do you think it would be possible to check for "404", "page not found", "page does not exist", "we're sorry" etc. in the |title= parameter? An ounce of prevention saving a pound of cure, and all that.
I'm minded to return to this subtopic specifically because I thought of another title nearly universally indicating a failed reference: "is for sale", which is what typically shows up when a site has been usurped by a domain squatter.
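A heuristic along these lines could be sketched as follows; the phrase list is just the suggestions from this thread, and the function name is made up:

```python
import re

# Title phrases that almost certainly indicate an error page or a
# squatted domain rather than real content. Illustrative list only.
SOFT_404_PATTERNS = [
    r"\b404\b",
    r"page not found",
    r"page does not exist",
    r"is for sale",          # typical domain-squatter landing page
]

def looks_like_soft_404(title: str) -> bool:
    """True if a scraped |title= matches a known failure phrase."""
    return any(re.search(p, title, re.IGNORECASE) for p in SOFT_404_PATTERNS)

print(looks_like_soft_404("404 - Page Not Found"))       # True
print(looks_like_soft_404("example.com is for sale!"))   # True
print(looks_like_soft_404("Batthyány 3"))                # False
```

Such a check would only warn the editor; false positives (e.g. an article genuinely titled "404") make it unsafe to reject citations outright.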
1. Would you be at all willing to maintain a brief list of known paywalled sources, so that Citoid can apply a "url-access" parameter to citations to such domains? I'm thinking of places like nytimes.com, ft.com, forbes.com, stltoday.com, latimes.com, etc. At present url-access always needs to be added manually, usually after a failed attempt to verify a claim.
2. I've noticed that citations to The Guardian consistently render the website / work parameter as "the Guardian". Would you be willing to uppercase the first letter of the website / work parameter for all sources that don't equal the first part of the domain name? There may be sources that prefer a different case styling, but it looks weird in the rendered template. Alternatively, could you uppercase the first letter of the website / work parameter when the first word is "the"?
3. An astute unregistered editor noticed at en:Help talk:Citation Style 1#Unix epoch that many sources using the date "1970-01-01" (the unix epoch) are doing so in error. Would you be willing to discard this date as bogus for sources that are not books, journals, or periodicals?
4. Is this a good place to discuss improvements to Citoid, or would Phabricator work better? I've recently registered an account there.
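Taken together, suggestions 1-3 could be sketched as a post-processing pass. Everything here (the domain list, the capitalisation rule, the function names) is illustrative, not Citoid's actual behaviour:

```python
from urllib.parse import urlparse

# Suggestion 1: a hand-maintained list of known paywalled domains.
PAYWALLED = {"nytimes.com", "ft.com", "forbes.com", "stltoday.com", "latimes.com"}

def url_access(url: str):
    """Return 'subscription' for known paywalled domains, else None."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return "subscription" if host in PAYWALLED else None

def fix_work(name: str) -> str:
    """Suggestion 2: uppercase a leading lowercase 'the' ('the Guardian')."""
    return "The " + name[4:] if name.startswith("the ") else name

def drop_bogus_date(date: str):
    """Suggestion 3: discard the Unix epoch as an almost-certainly bogus date."""
    return None if date == "1970-01-01" else date

print(url_access("https://www.nytimes.com/2024/05/14/example.html"))  # subscription
print(fix_work("the Guardian"))        # The Guardian
print(drop_bogus_date("1970-01-01"))   # None
```

The domain list would of course need community maintenance, which is part of what's being asked above.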
3. So I realised I'm dumb, and web sources should not report a date prior to c. 1995 in any case. So the unix epoch should probably just be discarded regardless of source type.
3. The CS1 templates will all reject an |access-date= before Wikipedia's inception regardless of type (see this discussion), so we must be talking about publication date. The bigger problem is, unlike a physical-media citation type, if a website shows a timestamp of 1970-01-01 on some page (which I cannot prove, but believe with near-certainty, happens somewhere in the wild), then that's the only date we have for that source. IOW, it's arguably "correct" to use it in the citation, despite its obvious impossibility.
Yeah, I am talking about |date=, not |access-date=.
The Zotero translators seem to lean pretty heavily on HTML metadata, so it's possible the hypertext document could have a date listed as the unix epoch, with an actual publication date somewhere in the byline or footer, but the more common scenario is probably like this one I fixed yesterday at en:Yuan Dynasty: https://www.academia.edu/2439642
Here, the service hosting the source (academia) reports a bogus unix epoch date, which any parser will pick up, but inspecting the actual source document reveals a publication date in 2010.
I'd say that if a genuine web based source has the only available publication date set prior to the deployment of the world wide web in the early 1990s, it's safest to ignore the date rather than use a known incorrect value.
The nice thing about book and journal sources is that they'll have more than one service documenting their existence, so if one site is erroneously reporting a unix epoch date for the source, it can be cross-checked and corrected.
I don't think you would be happy with that. Google Books returns publication dates, such as |date=1982 for https://www.google.com/books/edition/Chocolate_the_Consuming_Passion/egLRDF36ayoC, rather than webpage dates. PubMed and DOI entries also get their proper publication dates.
Rereading my comment now a month later, I definitely wasn't clear about what constitutes "a genuine web based source". I miscommunicated similarly in a completely different discussion about overlinking, also by employing the term "genuine" as if I hadn't put a lot of assumptions behind it. Probably time to choose my words more carefully.
In any case, as regards the topic I was initially trying to discuss, the unix epoch date "1970-01-01", it makes more sense to have citation templates add it to a tracking category rather than never return a date from Citoid, purely for visibility reasons. It's easy (although time-consuming) to run through a maintenance category full of likely bad data and fix it; it's much more difficult to find every citation without a publication date and ensure there actually is none provided. The second set is probably three or four orders of magnitude larger than the first, so my initial idea was probably uh ill-considered 🙃
That sounds like a good idea to suggest at w:en:Help talk:Citation Style 1.
On English Wikipedia the {{Cite journal}} template has a jstor parameter. Can Citoid be changed to extract the relevant stable link for the JSTOR URL instead of copying the provided URL into the URL field?
If given a JSTOR link, it gives the stable JSTOR url.
For most other links, it doesn't typically know the JSTOR identifier, so it can't use that to then get the JSTOR link. Most links to journal articles, if they include extra identifiers will include the DOI, but not typically JSTOR.
If I wasn't clear, I'm discussing how it treats JSTOR input only: e.g. input like https://www.jstor.org/stable/45019299.
Ah, I misinterpreted you - you want the jstor url to go into the jstor field instead of into the url field?
That's a little tricky. We could definitely return a JSTOR parameter in the API; the problem is that TemplateData and the Citoid extension only do really basic mapping, so the JSTOR link would end up in the URL field as well, and it'd be linked in both. In the API we return a URL no matter what because it's a required parameter (the API guarantees a URL in the url field), and other-language wikis that don't have separate parameters need it. We've had this issue as well with people not liking that we return both the DOI and the resolved DOI link in the URL field, though personally it doesn't bother me.
That kind of per-wiki customisation might have to be per-wiki user script / common.js kind of solution rather than something that goes in the back-end or the extension, which is designed to be fairly agnostic about the citation templates being used.
Yea, it would pretty much be just changing where the JSTOR parameter ends up. On English Wikipedia there are basically three interrelated issues.

First, the citation bot adds JSTOR parameters given the stable JSTOR URL, but this causes unnecessary duplication... which some people want retained just in case someone meant the URL to be there (even though nobody means much of anything when using these citation generators).

Second, the Internet Archive bot can then be run to "archive" the live JSTOR URLs (but not the parameters) because the URL is there... even though, because JSTOR is paywalled, the "archive" is just a landing page. Naturally some people don't want these useless archive links removed either.

Third, because JSTOR is paywalled, it isn't always the best free full-text source, and putting a URL there would on first glance seem misleading.
Anyway, I understand the technical issues involved, though I think the real solution in this instance is the root cause, which is the unthinking addition of Jstor URLs to templates that end up triggering all of the downstream clutter. A user script would have insufficient adoption to go much of anywhere in nipping the issue.
Just noting here that it's been my practice to remove url parameters when they point to jstor, and put the stable jstor identifier in the jstor parameter instead, to avoid the unnecessary archive and access-date cruft that follow-on scripts produce. I understand if it's not possible not to return a url parameter though.
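For what it's worth, the fix-up described above is mechanical. A hedged sketch (the function name and regex are mine, not part of any existing tool):

```python
import re

def jstor_id(url: str):
    """Extract the stable identifier from a JSTOR URL, e.g. '45019299'
    from https://www.jstor.org/stable/45019299; None for non-JSTOR URLs.
    The extracted id can go in |jstor= with |url= removed."""
    m = re.search(r"jstor\.org/stable/([^/?#]+)", url)
    return m.group(1) if m else None

print(jstor_id("https://www.jstor.org/stable/45019299"))  # 45019299
print(jstor_id("https://example.com/article"))            # None
```

Something like this in a downstream cleanup script would avoid the archive-link cruft without needing a change in Citoid's API.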
en-wp's own in-house tools could be a vector for correction here, although the maintainers have been too busy to maintain them for a long time. Honestly given how popular automated referencing has become, we could use about four times as much staffing at every point in the stack.
Yea, when I was reading that Village Pump discussion about people claiming that an editor might have placed the Jstor URL there on purpose, my first thought was "lmao nobody formats citations manually anymore; there's no purpose involved".
I missed that discussion, but there's no reason to duplicate a link (to jstor content) in the cruft-inducing url field when it can be safely tucked into the parameter specifically included to hold it.
These days if I'm citing a journal article, I'll usually swap into Visual Editor to generate the citation with Citoid, but I swap back into source editing to touch it up afterwards.
I do find it worrisome how widespread automated referencing has become when weighed against the accuracy of its output. I spend probably eighty per cent of my time on-wiki cleaning up after thoughtless automatic references, but even with a team of fifteen or twenty, references would be flowing in at a rate we couldn't handle, given the huge backlog currently present.
I agree, which is why I was thinking to get to (at least) one of the sources of those automatic reference generators. Is it possible, Mvolz, to add some kind of post-processing to trigger with Jstor? Or is that actually technically infeasible?
Currently the visual editor's Citoid formats reports using cite_journal. It'd be highly useful to wrap them instead in cite_report (especially when a Wikidata QID is provided, since such Wikidata items will state that instance of (P31) = report). It'd also be ideal in those cases to include location data, since the location of the publisher / commissioning organisation / authoring organisation is often highly relevant (indeed usually more relevant than a book publisher's city!). Either drawing from the country (P17) of the publisher (P123), or maybe the location (P276) of the cited item itself?
See here for an example where cite_report would be helpful in formatting.
Hello, this is configurable.
See Citoid/Enabling_Citoid_on_your_wiki#Step_2:_Configure_Citoid. To change this you need to add TemplateData to Cite report, and then change report -> Cite report in the config (wiki:en:MediaWiki:Citoid-template-type-map.json).
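For reference, that on-wiki config is a flat JSON map from Citoid item types to local template names. A minimal sketch of what the relevant entry might look like (the surrounding entries are illustrative; check the actual MediaWiki:Citoid-template-type-map.json before editing):

```json
{
    "webpage": "Cite web",
    "journalArticle": "Cite journal",
    "report": "Cite report"
}
```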
@Mvolz (WMF) Thanks! I think it already had the necessary TemplateData, so I've put in an edit request at MediaWiki_talk:Citoid-template-type-map.json.
I would like to suggest that Citoid automatically archives pages when they are used in a reference. Yeah, I know, you have a very long to-do list, but put this on your eventually list.
Do you mean "automatically finds the archive URL from the Internet Archive and makes it available" or something else?
Yes, or, automatically creates the archive would be even better. We could potentially eliminate the scourge of dead links.
"Creates the archive" meaning "asks the Internet Archive to archive the URL"? Or are you asking for WMF to archive the Web (in which case I believe the answer is a very clear "no" from Legal, as repeatedly discussed over the last few years).
The Internet Archive is already "crawling all new external links, citations and embeds made on Wikipedia pages within a few hours of their creation / update." Only on enWP, it seems (and with the caveat of robots.txt).
The French WP even automatically adds an archive link to Wikiwix on all reference links.
The rest is rotting away.
PS: Both cite_web and cite_news have parameters "archiveurl=" and "archivedate=" for this.
Open up a task on phabricator! :) https://phabricator.wikimedia.org/project/board/62/