Jump to content

Talk:Requests for comment/Server-side Javascript error logging

Add topic
From mediawiki.org
Latest comment: 9 years ago by 2601:8:AC00:FF3:4959:BE51:6556:E829 in topic EventLogging

EventLogging

[edit]

Rather than reimplement something EventLogging-ish, it probably makes sense to either require EventLogging or make it pluggable. There can be a core implementation (possibly using a MW API as the RFC suggests, with corresponding client-side JS module) and an EventLogging listener that sends it to the standard EventLogging server setup.

Event logging is a tier-2 service not designed for bursts of traffic at all. Actually bursts of traffic are likely to affect and break all users of the service. Also being a tier-2 system (see: EventLogging/OperationalSupport) it is understood that EL can be down for a couple days without issues so it is not a good choice for client side logging which should be tier-1.NRuiz (WMF) (talk) 08:53, 6 August 2014 (UTC)Reply

It may be enough to use mw.track; WikimediaEvents could subscribe and transform it into an EventLogging event. mw.track doesn't have any core subscribers, but perhaps a simple one (that could be turned off) that only listened to onerror could be added as part of this). Superm401 - Talk 00:16, 25 July 2014 (UTC)Reply

My main concern is that EventLogging has been known to have trouble under unexpectedly high load. If we deploy a software update that causes a JS error in a large percentage of users, that could REALLY spike the load on the system with millions of very-similar entries. If EL can handle this, then great! If not, what would it take to either improve the EL infrastructure or build something similar that can handle that? --brion (talk) 00:26, 25 July 2014 (UTC)Reply

What I have heard is that the way EventLogging transmits events could scale to significantly higher loads but the way it stores them (in MySQL) not so much. And we probably want something more searchable anyway. Is it possible to reuse the EL infrastructure but do something custom at the last step instead of writing into a DB? --Tgr (WMF) (talk) 00:29, 25 July 2014 (UTC)Reply

Eventlogging is designed for discrete events and low load. As I pointed above EL is a tier-2 service (see: EventLogging/OperationalSupport), in any case tier-1 like logging should be. While there are similarities between what EL does and client side logging would do there are more differences, the main one is bursts of traffic. Also EL data is neither public nor searchable. Seems to me that you need a specific client side mechanism with adjustable sampling rates, client side throttling and searchable backend over which alarms can be set. You have no need for schema versioning (one of EL strong points), very long term storage or an event capsule NRuiz (WMF) (talk) 08:53, 6 August 2014 (UTC)Reply
Nuria, what I meant by reusing is that the error logger would have its own JS code and its own Vagrant endpoint, would use the same mechanism as EventLogging (Kafka?) to send log events from Vagrant, and would have a different consumer process at the and than EL. Do you think this could work, or would bursts still be problematic? --Tgr (WMF) (talk) 16:40, 6 August 2014 (UTC)Reply
Tgr (WMF) I get your idea, I think. Bursts of traffic still will go through varnish and it seems that they could go directly to kafka. Also, EL is build around schema validation which in this case does not apply. I understand there are similarities but I would use a similar but much, much lighter system whose client besides being subjected to a sampling rate has throttling. Also I would log to flat files or logstash or sentry rather than to a db, or kafka or hadoop. Overall is a much simpler system when it comes to design but much higher volume that EL is. Do reach me in IRC if you want to talk about this further. 87.221.167.158 17:39, 6 August 2014 (UTC)Reply


The other way this could cause a spike is when errors are generated by some event which fires extremely often (say, a broken scroll or mousemove handler). This should probably be throttled on the client side though. --Tgr (WMF) (talk) 00:31, 25 July 2014 (UTC)Reply

Not only that, this needs "grouping" on the backend so you do not see literally millions of errors being triggered when someone 'scrolls' over a wikipedia page and some bug gets triggered 2601:8:AC00:FF3:4959:BE51:6556:E829 00:08, 11 April 2015 (UTC)Reply


Reusability and data processing

[edit]

I understand that your solution needs to scale to WMF needs, but please make it so that it is also usable by other sites.

We do have a solution at translatewiki.net which implements error catching and delivery to the server, but we lack good processing and displaying. Compared to PHP error logs, JS are of much lower quality (lack of info, l10n as mentioned in the RFC) but also lots of user specific issues due to user scripts or browser extensions – something you cannot reproduce easily and there isn't enough information to understand what is causing it. Hence it has not seen much use in practice.

I suggest you do more research for candidate solutions for processing and display part – preferably so that we are not tried to one component which is not easy to install by third parties.

There is no mention of tracking the version of MediaWiki core, extensions and what gadgets are enabled. I think these are important information to know when new errors are appearing or old ones going away.

--Nikerabbit (talk) 10:06, 25 July 2014 (UTC)Reply

For small sites, errors would be logged to a dedicated channel via the normal MediaWiki logging architecture; they are free to configure whatever handling they want for that channel. Any code that we write for processing should be reusable in such a scenario, that's a good point (the easiest way would be to expose them as Monolog filters). For display, small sites can probably use software built specifically with this aim (see Non-MediaWiki examples).

Good point about the version information, added that to the proposal. --Tgr (WMF) (talk) 19:07, 25 July 2014 (UTC)Reply

Target audience

[edit]

Were it implemented, would gadgets, global scripts, user scripts be able to use the server-side logging? --Gryllida 07:02, 27 July 2014 (UTC)Reply

Any Javascript error would be logged, regardless of the source. --Tgr (WMF) (talk) 23:20, 27 July 2014 (UTC)Reply

Opting out

[edit]

Would global scripts, user scripts, gadgets, or any code in principle be able to opt-out of server-side logging? Would you consider such ability to opt out necessary, and if so, in what circumstances? Gryllida 07:02, 27 July 2014 (UTC)Reply

Probably not. What would be the point of opting out, anyway? I can see users wanting to opt out for privacy reasons, and supporting that should be straightforward, but why would a gadget need to opt out? --Tgr (WMF) (talk) 23:23, 27 July 2014 (UTC)Reply

I had debugging use-case in mind, during which a user breaks things on purpose. Then users opt-out may work. :-)

I'll be happy to see more discussion on this topic, if needed, such as for other use-cases I may have missed. --Gryllida 08:55, 29 July 2014 (UTC)Reply

Errors coming from userscripts / dynamic code can just be filtered out on the server side (and have to be anyway, browser extensions tend to generate all sort of crap), no need for users to bother with that. The errors come with a stack trace, so it's easy to see whether they happen in MediaWiki core, an extension, a gadget or something else. --Tgr (WMF) (talk) 22:41, 29 July 2014 (UTC)Reply

Other languages people

[edit]

The proposal page is only in English. How would you like non-English projects people (who don't visit mw.org and who don't read mailing lists) to know about this request for comment? --Gryllida 08:57, 29 July 2014 (UTC)Reply

RfC meeting

[edit]

This RFC has been scheduled to be discussed in the Architecture RfC meeting today, 2014-09-03. Sorry for the late notice.--Qgil-WMF (talk) 15:49, 3 September 2014 (UTC)Reply

Meeting logs:

21:48:58 <TimStarling> tgr: I like your proposal, do you need anything to make it happen?
21:49:28 <tgr> first of all feedback that it's worth investing into
21:49:41 <TimStarling> it sounds pretty awesome to me
21:49:49 <tgr> it turned out significantly more complex than I initially thought
21:50:13 <TimStarling> #topic Server-side Javascript error logging | RFC meeting September 3 | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE).| https://meta.wikimedia.org/wiki/IRC_office_hours | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:50:21 <TimStarling> #link https://www.mediawiki.org/wiki/Requests_for_comment/Server-side_Javascript_error_logging
21:50:55 <tgr> I have three points specifically where I would like to know if the tradeoff is acceptable
21:51:07 <ori> what are they?
21:51:37 <tgr> 1. cors, 2. source maps, 3. data transport between client and server
21:52:32 <tgr> about cors: modern browsers sanitize errors if the script is on a different domain, need to add crossorigin="anonymous" to the script tag to avoid that
21:52:59 <tgr> if you do that and there is no Accept-Origin header, the browser will handle it as a 404
21:53:21 <csteipp> Eek. I'm not a fan of Access-Control-Allow-Origin: *
21:53:24 <ori> so, imo you shouldn't worry too much about those. having the user-agent and the name of the module where the error occurred (easy to do if you wrap RL's execute() with a try/catch) is already very useful and heaps better than flying blind, which is what we do now
21:53:25 <tgr> which means ResourceLoader would have to have some sort of fallback behavior
21:53:52 <ori> the error details aren't terribly important; you can just load the module in the same browser environment and find out that way
21:54:30 <ori> the bigger issue is how to avoid DDOSing ourselves if we happen to deploy code that breaks on a popular browser
21:54:33 <tgr> if the script is not same-domain, there is no error message, no file, no line number, no exception
21:54:47 <tgr> an error report without those is pretty pointless
21:55:01 <robla> "sanitize" == "swallow" then
21:55:09 <tgr> pretty much
21:55:31 <ori> i don't think that's true for RL modules, because they just register themselves, and it's RL itself that actually invokes them, IIRC
21:55:58 <csteipp> But they all come from bits right?
21:56:25 <csteipp> But... bits already allows cors, iirc.
21:56:30 <tgr> I think window.onerror gets the domain of the HTML file, regardless of what file sets it
21:57:02 <tgr> the problem is that there are weird things on the internet which remove CORS headers
21:57:13 <tgr> office firewalls and such
21:57:13 <ori> iirc every RL module is executed by <https://github.com/wikimedia/mediawiki-core/blob/master/resources/src/mediawiki/mediawiki.js#L1092>
21:57:21 <ori> which knows the name of the module it is about to invoke
21:57:33 <TimStarling> #chair ori csteipp
21:57:33 <wm-labs-meetbot`> Current chairs: TimStarling csteipp ori
21:58:02 <tgr> if you use the crossorigin attribute, but there is no header, the script just does not load, period
21:58:21 <TimStarling> can one of you do #endmeeting when you are done talking? I have to go
21:58:41 <csteipp> will do
21:59:27 <tgr> so unless there is some workaround that I don't see, ResourceLoader needs to catch CORS failures and load scripts without a crossorigin attribute in that case, which is a significant increase in complexity
22:00:16 <tgr> ori: here is a simple test case: https://www.mediawiki.org/wiki/User:Tgr_(WMF)/common.js
22:00:29 <tgr> the error source and the onerror handler are in the same file, even
22:00:39 <tgr> you still don't get error details
22:00:43 <csteipp> tgr: That does make it hard. Wouldn't knowing that there is an error be a good, and easy first step. And then adding a test for cors would be even easier, since we would be logging if the test started failing?
22:01:40 <tgr> csteipp: agreed
22:02:05 <ori> take a look at https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.echo.base&skin=vector&version=20140903T215031Z&*
22:02:10 <tgr> also, older browsers don't care about domains, and getting error details for old IE would already be useful
22:02:19 <ori> you can see that the script source is wrapped with:
22:02:28 <ori> if(window.mw){mw.loader.implement("ext.echo.base",function($,jQuery){
22:02:51 <ori> this registers the function with resourceloader, which then executes it
22:03:18 <tgr> ori: I don't see why that's relevant
22:04:46 <ori> it's relevant because the code of every RL module is enclosed in a closure rather than immediately executed in top scope
22:05:05 <ori> and the closure gets invoked by a script that is always on the same domain, which we control, and which knows the name of the module it is about to call
22:05:56 <ori> this is done in https://github.com/wikimedia/mediawiki-core/blob/master/includes/resourceloader/ResourceLoader.php#L1053 in PHP and in execute() in JS
22:06:24 <tgr> if you are thinking about adding a try-catch to the RL wrapper, that would still not catch exceptions in event handlers, timeout callbacks and such
22:06:55 <ori> yes, but
22:07:13 <ori> take a look at the wrapper code again
22:07:14 <ori> if(window.mw){mw.loader.implement("ext.echo.base",function($,jQuery){
22:07:19 <ori> we pass each module $ and jQuery
22:07:36 <ori> i added that explicitly so we can pass a patched $ / jQuery to each module
22:08:00 <ori> such that $('.foo').on('something', callback ) can be associated with the module that set it
22:08:02 <tgr> ah, I see
22:08:16 <ori> that doesn't cover window.setTimeout, but it's pretty comprehensive
22:08:17 <tgr> that would work most of the time
22:08:33 <ori> (we could even patch setTimeout, but that's getting ahead of ourselves)
22:08:47 <ori> IIRC Qunit does that
22:08:57 <tgr> sinon.js does, yeah
22:09:15 <ori> Krinkle is absurdly knowledgeable about these things, btw, and is a good resource for that reason
22:09:44 <Krinkle> Patch setTimeout for what?
22:09:57 <Krinkle> scope doesn't change
22:10:09 <ori> so that errors can always be associated with a module
22:10:17 <ori> even if it's asynchronously-executed module code
22:10:26 <ori> that a try/catch in resourceloader's execute() wouldn't catch
22:10:46 <Krinkle> ah, errors, not $/jQuery perf
22:11:02 * ori nods
22:12:36 <tgr> OK, that does sound like a valid workaround
22:13:20 <tgr> the second problem is with source maps, error reports in non-debug mode are pretty useless without that
22:13:54 <ori> you have the module name, that's pretty good already
22:14:02 <ori> module name + user agent is good enough in most cases
22:15:38 <tgr> the main use case of error logging would be 1) error stats, 2) good reports about errors which are not easy to reproduce
22:15:57 <tgr> non-deterministic or race condition or such
22:16:28 <Krinkle> QUnit doesn't, SinonJS does. And it does it synchronously, and in a terribly unstable just-about-works way.
22:16:30 <tgr> if it's easy to reproduce, the developer can usually figure how to do that out from the error report anyway
22:17:04 <Krinkle> what's the topic btw? Is there an office hour?
22:17:18 <Krinkle> debug mode is going away in it's current form as soon as source maps are implemetned.
22:17:19 <tgr> Krinkle: server-side Javascript error logging
22:17:34 <tgr> https://www.mediawiki.org/wiki/Requests_for_comment/Server-side_Javascript_error_logging
22:17:40 <ori> we should probably close the meeting (we can continue chatting informally)
22:17:57 <tgr> you are the chair :)
22:18:07 <ori> #endmeeting