Wikimedia Developer Summit/2017/ReviewStream
ReviewStream: improving edit-review tools through a better data feed
Introduction
- Edit Review Improvements (ERI) is a project of the Collaboration Team, which is building ways to improve edit review in general and, in particular, to reduce the negative effects current edit-review processes can have on new editors to the wikis.
- Most edit-review and patrolling tools were designed to safeguard content quality and fend off bad actors -- both vitally important missions.
- But a body of research suggests that these processes can have the unintended consequence of discouraging and even driving away good-faith new editors, particularly when the processes involve semi-automated tools (Huggle, RTRC, etc.).
- As a first step to providing a better review process for good-faith newcomers who are making mistakes, ERI is focusing on helping reviewers find users who are a) acting in good faith, b) newcomers, and c) making mistakes.
- Most notably, by productizing ORES, which includes a good-faith test as well as a damaging test.
- We've also added a "Newcomer" test.
- Two efforts that will launch this quarter:
- RC Page Improvements -- building a whole new filtering interface for the Recent Changes page, which will likely be rolled out to other review pages, like Watchlist.
- ReviewStream -- our subject today. An effort to find vandalism fighters where they live (which is not the RC page).
ReviewStream
To the information currently in RCStream, ReviewStream adds additional data designed to improve the edit-review process. (We'll look at that in a minute.)
- By directly incorporating data that currently has to be looked up in separate processes, ReviewStream is designed to make life easier for creators of downstream edit-review tools and to make their tools faster.
- At the same time, by making these signals easier to include and understand, we hope it will encourage inclusion of features that will help new users.
- Specifically, tools to determine: intent (good faith), quality (damage), and newcomer status. (A consumer sketch follows this list.)
- Example: very early stage Pau's Huggle Designs
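To make the data gap concrete, here is a minimal sketch (not the ReviewStream implementation) of what a consumer has to do today: read the public recent-changes stream, then make a separate ORES request per edit for the good-faith and damaging scores. ReviewStream would fold that second step into the feed itself. The endpoint paths and the use of the sseclient package are assumptions for illustration.

```python
# Sketch only: approximates the two-step lookup that ReviewStream would
# collapse into a single feed. Endpoints and model names are assumptions.
import json

import requests
from sseclient import SSEClient  # pip install sseclient requests

STREAM_URL = 'https://stream.wikimedia.org/v2/stream/recentchange'
ORES_URL = 'https://ores.wikimedia.org/v3/scores/enwiki/'

for event in SSEClient(STREAM_URL):
    if not event.data:
        continue  # skip keepalive messages
    change = json.loads(event.data)
    if change.get('type') != 'edit' or change.get('wiki') != 'enwiki':
        continue
    rev_id = change['revision']['new']
    # The separate lookup ReviewStream aims to eliminate: an extra HTTP
    # round trip per edit for intent (goodfaith) and quality (damaging).
    scores = requests.get(ORES_URL, params={
        'models': 'goodfaith|damaging', 'revids': rev_id}).json()
    print(change['title'], scores)
```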
Meeting Goals and Roles
- In terms of the Dev Summit Session Guidelines, this meeting is for "Problem-solving: surveying many possible solutions."
- That means we've come here to get feedback and ideas.
- Assign Roles: Note-taker(s), Remote Moderator, Advocate (optional)
Attendees
- Jmatazzoni
- Mattflaschen
- WikiPathways - Third-party wiki for the biomedical community. Not a lot of vandalism (not many 14-year-old boys working in biomed). Do have issues with early contributors and handling them well.
- Andrew Otto - The stream part specifically. We're working on a service for exposing streams of data. I've done a lot of work on internal stream data.
- Dan Andreescu - Same.
- Francisco
- Leon - As a volunteer, counter-vandalism is one of my primary focuses.
- Danny Horn - Product Manager at WMF, wanted to know what your team is doing.
- James Hare - Mostly interested in seeing what kind of data is produced out of this, how is it best disseminated.
- Pau Giner - I'm a designer working with the Collaboration team. I'm interested in how to help reviewers meet their needs, and in listening to you all.
- Matthew Flaschen - Software Engineer on the Collaboration team and a long-time Wikipedia editor.
- Kaldari - On the CommTech team, interested in all kinds of reviewing tools and various software that can be used to support them.
- Roan Kattouw - Software engineer; heard we need to do something on the other side when on VE.
Meeting Notes
Possible Questions
- Review the handout, "What's in the ReviewStream Feed?" -- does this list of data inspire ideas for features you'd like to have in, say, an anti-vandalism tool?
- JM: Need a sense of what edits will probably be helpful.
- JM: Standardizing categories. Four levels of ORES categories, in ranges (e.g. probably good). These will become something used by tools like Huggle, e.g. 3 levels of good faith (probably good to likely bad). Newcomer test: brand-new, learner, after-learner. [See the threshold sketch at the end of this discussion.]
- JM: The distinguishing characteristic is to recognize new editors who are in good faith, but struggling. Or bad faith, but struggling. Some of these other things are in RCStream, but most are not.
- JM: Of the data that's reasonably readily available, what could we get in? Are there things you would love to have?
- WikiPathways - Idea that an edit may resolve a tag, e.g. needs sources. Those could be put together.
- Roan - It would be useful if some software detected, e.g., an edit that adds categories to a page that still has the no-categories template, and brought this to the attention of a human.
- Kaldari - One thing that might be useful is whether the edit had triggered any abuse filters.
- Roan - Since the AbuseFilter logic runs before the edit is saved, we could in theory track this info.
- Roan - If we think of Huggle as a consumer, what bits of data should be in the stream vs. doing an API query.
- Andrew Otto - For a revert shortly after: e.g., if you have a stream of revision creates, and that same tool also consumes a revert stream, it could combine those two in the interface and change the color.
- Roan - How does that connect to the data that we stream out?
- Andrew Otto - That's a way to do it, in the UI later.
- Roan - It's on the order of milliseconds. In the code flow of MediaWiki, by the time the tags are known, the event has already been sent out. Refactoring this to be the other order is not that simple.
- Matt - Can't it also be tagged an arbitrary amount of time later?
- Roan - Yes, in theory, but not that common.
- Roan - Yes, we do delay for ORES.
- MusicAnimal - Back to AbuseFilter: one thing is not just which filters are being triggered, but also which have been triggered before. If they've triggered the bad-words filter, knowing the user's history, you want to take a closer look.
- Roan - Proposing to keep statistics of which users have triggered AbuseFilter and how many times.
- MusicAnimal - Categorization of AbuseFilters. For instance, the bad-faith category. When these are triggered, it's almost always accurate. When an edit finally gets through, you see it in the stream. You should know that they've triggered those filters in the past.
- Roan - You're distinguishing between two different things: 1. the user's history of triggering filters; 2. how many times they got prevented.
- MusicAnimal - Also, this could be their 5th attempt. None of the software exposes this information, but it's in the log.
- Roan - The filter log does track this. [See the abuse-log query sketch at the end of this discussion.]
- MusicAnimal - Might also take into account time.
- JM - A little bit of edit history for that user.
- MusicAnimal - In particular, filter-log history for bad-faith users. It's an indication they're trying to get around the filter.
- Kaldari - Wonder if it makes sense as part of this tool, or a higher-level tool. People might want to customize this. Might depend on COI vs. sockpuppets, etc.
- MusicAnimal - If these filters were categorized, you could say that regardless of categories.
- Pau - Follow-up question: it would be great to know if someone was hit by filters related to vandalism. Would you prefer this to be done by the system, which just tells you this is vandalism, or do you want the specific information (AbuseFilter triggers)?
- MusicAnimal - Both, I guess, but if they've been hitting these bad-faith filters, they've already surrendered their good faith.
- JM - ORES only sees edits one by one.
- PG - So we have two options: 1. integrate it, to make a better prediction, or 2. expose the low-level data directly.
- MusicAnimal - We still need categorization, bad-faith etc.
- Roan - Separately from that, if we put this info into machine learning input, would you still be interested in the low-level data?
- MusicAnimal - Yes
- Matt - I think it could, particularly this session's edits. The history of this edit is relevant to whether this edit is good-faith.
- JM - Youâd want to see if they tripped it recently.
- MusicAnimal - All good.
- Alex - Talk page activity
- James Hare - Be careful of attaching data about the user to the stream. Seeing if the username is blue or red: if it's red, it implies a bad edit. Anon implies it's bad. People do like this profiling thing. What I'm concerned about is people or machines reaching decisions about the edit based on past history, not the content of the edit.
- Alex - I'm thinking of things I do manually. I'm not sure how I automate that.
- MusicAnimal - There's a whole system of warnings. It starts from level 1 and goes to 4. That would be good to know. Huggle parses the talk page to see what level they're currently at.
- Andrew - They parse it, presumably.
- MusicAnimal - They also have their own database.
- Andrew - This might be too much, but this all sounds like a reputation score, which might be useful but also problematic.
- James - Yes, there are ethics concerns. History can help, but without precautions, it leads to profiling for no good reason. I'm sympathetic to giving humans access, but not to robots making calls without knowledge.
- Danny - Robots are just presenting information, humans make decisions.
- JM - ORES doesn't look at history.
- Roan - To clarify, even if ORES did receive these inputs (history, etc.), the way it obtains its ground truth from humans is by showing thousands of edits devoid of context. There is a layer separating human bias. I think the initial model is whether the user is anonymous or not.
- Dan - Going to back up James's point. I think blue/red implies something psychologically.
- Andrew - It could be more precise if you have a reputation score with a confidence attached to it.
- Dan - My point is that if blue/red is sufficient to bias people, some might not take a score properly.
- Danny - If people make judgement calls based on what they have available, and blue/red does indicate something, this might help to surface useful info (rather than the dumbest information, the most useful information).
- James - It's useful to have more granular data that is a predictor, rather than whether they have a user page.
- MusicAnimal - I personally don't think we need to worry; just like it's on you to source an edit and prove it's factually accurate, if you're just being trigger-happy, we're going to revoke whatever rights we gave you to do that. A bot hopefully wouldn't get that far.
- Andrew - If this hypothetical reputation score was smart enough, it could take history into account. Take into account that it's not some random new person.
- Dan - There are video games with secret reputation scores. They don't bias players against each other. They use the reputation score to match people against each other.
- James - It would be very interesting to have this secret data on Wikipedia.
- Dan - It's game performance in these cases.
- Pau - With Huggle, trusted users appear in blue. Not sure how; it would be interesting to see how that works. Another perspective is to try to present this as evaluating not the users, but the contributions they make, being aware that users can evolve.
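A hedged sketch of the level standardization JM mentions above: collapsing continuous ORES probabilities and raw account data into the coarse labels tools like Huggle could consume directly. The threshold values and level names below are illustrative assumptions, not the team's agreed cut-offs.

```python
# Illustrative only: these cut-offs are assumptions, not the
# Collaboration Team's actual ranges.
def goodfaith_level(p_goodfaith):
    """Collapse an ORES good-faith probability into one of three levels."""
    if p_goodfaith >= 0.85:
        return 'probably good'
    if p_goodfaith >= 0.50:
        return 'uncertain'
    return 'likely bad'

def newcomer_level(edit_count, account_age_days):
    """Rough buckets matching the brand-new/learner idea above."""
    if edit_count < 10 and account_age_days < 4:
        return 'brand-new'
    if edit_count < 500:
        return 'learner'
    return 'experienced'
```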
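For the filter-log history MusicAnimal and Roan discuss, the AbuseFilter extension already exposes a list=abuselog API module, so a tool (or ReviewStream itself) could look up a user's recent hits. A minimal sketch; the ten-entry window is an arbitrary choice here, and hidden log entries require extra rights.

```python
import requests

def recent_filter_hits(user, api='https://en.wikipedia.org/w/api.php'):
    """Fetch a user's recent (non-hidden) AbuseFilter log entries."""
    resp = requests.get(api, params={
        'action': 'query',
        'list': 'abuselog',
        'afluser': user,
        'afllimit': 10,  # arbitrary window; tune per tool
        'aflprop': 'ids|filter|action|result|timestamp',
        'format': 'json',
    })
    return resp.json().get('query', {}).get('abuselog', [])
```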
- Which edit review and anti-vandalism tools should we prioritize as candidates for switching to ReviewStream and incorporating the new tools?
- Joe - Let me try a different question. I've been trying to get data on which tools are widely used/high-impact. It's not that simple.
- ORES only supports a few languages; we should focus on wikis with AI support.
- Huggle
- MusicAnimal - Huggle is the most important and has a good model. It's based on English, which constantly has vandalism. Spoke to Amir: they can see the last few minutes on Hebrew, less than one minute on English. Not sure if this was an adaptation of the current RC feed. You're going to need a live stream to keep up.
- Roan: More or less a live stream, behind a second or two.
- MusicAnimal: Huggle is behind too.
- Andrew: That was a question I was going to ask. The more latency you can tolerate, the cooler things you can add. A second, a minute.
- Roan: First stage, no reason to make it slower than ORES (it's parallel). We could also pull again after a few seconds.
- Dan: If we have acceptable latency, there can still be issues like reverts.
- MusicAnimal - Oddly enough, Huggle 2 used to take into account conflicts, but no longer does. It does have page history. When you look at page history and see a problem there, you need to protect the page. What often happens is students will mass-vandalize the page on their school. You would rather protect the page. If you have that right there in the UI, you can act.
- Roan - Protection status at the time of the edit is one thing we're considering including. A revert stream could be a thing we could do; this would also be useful in the revision stream. Don't know the state of the art on this. Don't know how expensive revert detection is. Not clear on undo-then-modify; for rollback it's a clear-cut case. If you filter revision-create you would have a revert stream. We do notify for undo currently. For multiple-edit undos we don't do anything. [See the revert-stream sketch at the end of this discussion.]
- Matt: You really want protection status not at the time of the edit, but at the time of review.
- Roan: There is a use case for protection status at time of edit. Â You can check current status with API queries as well.
- Joe: What is most important after Huggle?
- A: MusicAnimal - STiki
- RTRC
- STiki
- MusicAnimal - STiki. This goes off ClueBot's scoring, so it can look at borderline cases. ClueBot won't revert the same user twice on the same page, so those edits will show up in STiki. Edits that happened hours ago will show up in STiki. Let's say you're at the keyboard in Huggle: you didn't revert the one that happened an hour ago. It's quite good.
- Roan - Does STiki get these benefits from ClueBot?
- MusicAnimal - Yes
- Roan - But close to reverting.
- MusicAnimal - When ClueBot goes down, STiki does too. I'm guessing it doesn't drink from the firehose.
- MusicAnimal - STiki has its own database (?). Andrew West would be a wonderful person to talk to about this. If you go to the STiki homepage, his info is there. He's studied the data. People are marking some edits as good; you can do analysis and theoretically build something like ORES from that data. That's why he's retaining that data.
- LiveRC (popular on non-English wikis)
- What else?
- Alex - We're focusing almost exclusively on bad actors. I'm interested in good users too. What about showing things to people before they click save?
- JM: A lot of people bring that up. To clarify, we're talking about vandalism here because we're focused on the feed. Yes, we've thought about that. Were you in the earlier session?
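A sketch of the revert-stream idea Andrew and Roan raise above. Since change tags are not in the stream at emit time (Roan's ordering point), this version follows up each edit with an API query for its tags and keeps only revert-tagged edits. The mw-rollback/mw-undo tag names are the later built-in MediaWiki tags and are an assumption here; they postdate this discussion.

```python
# Sketch: derive a 'revert stream' by filtering the edit stream on
# change tags fetched via a follow-up API call. Tag names are assumptions.
import json

import requests
from sseclient import SSEClient

API = 'https://en.wikipedia.org/w/api.php'
REVERT_TAGS = {'mw-rollback', 'mw-undo'}

def revision_tags(rev_id):
    """Look up the change tags on a revision (tags are not in the stream)."""
    resp = requests.get(API, params={
        'action': 'query', 'prop': 'revisions', 'revids': rev_id,
        'rvprop': 'tags', 'format': 'json',
    }).json()
    for page in resp.get('query', {}).get('pages', {}).values():
        for rev in page.get('revisions', []):
            return set(rev.get('tags', []))
    return set()

for event in SSEClient('https://stream.wikimedia.org/v2/stream/recentchange'):
    if not event.data:
        continue
    change = json.loads(event.data)
    if change.get('type') != 'edit' or change.get('wiki') != 'enwiki':
        continue
    if REVERT_TAGS & revision_tags(change['revision']['new']):
        print('revert:', change['title'], 'by', change['user'])
```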
- What is the best way to approach and then support/work with the communities that support these tools? (E.g., let them handle it all, provide design assistance, do the work for them...)
- How might edit reviewers -- and anti-vandalism fighters in particular -- use the ability to detect newcomers who are in good faith but struggling? Will they care at all?
- JM: One last question: these are not, by and large, WMF tools; they are written by the community. I'm not sure what this engagement should look like. 1. Will the community care about this (anti-vandalism particularly)? I think they will care; they are aware of the research, and they don't think "not my problem." And given that they also get other benefits, in addition to protecting users: do people have ideas about the best way to work with communities? Is it okay for WMF to just step in?
- Danny Horn - This happens to us several times a year. Â How do you approach these people if you want to work on a tool?
- JM: This is a project based on research, but not necessarily community demand.
- Matt: There is community demand to protect new users, but not necessarily from Huggle folks.
- MusicAnimal - This would be really unique in combining the power of ORES and the functionality of Huggle. Especially if we also surface things more suggestive of good edits, or the likelihood of a good edit. People will be enthusiastic. At least initially it's going to be really difficult to pull people away from Huggle.
- JM: We're not going to pull them away, we're going to add something.
- MA: I do think it's something new and valuable.
- Dan Andreescu (DA): In terms of stepping in, you end up owning stuff you touched. One of our prime directives is to do what the community can't do. Tools like this, and focusing on them, increase what the community can do. If that's the focus, people going off to get real jobs becomes less of a problem.
- JM: That absolutely was the idea of ReviewStream.
- DA: Talking to these people to see their ideal scenario.
- Ryan Kaldari (RK): A lot of times, with tools that aren't actively maintained, the maintainers are glad to have people work on them.
- DA: When we launched the Pageview API, we had a buggy client on purpose; we're not going to fix the bugs. Then someone fixed the bugs. That seemed like a good strategy.
- RK: If you can put out a prototype for how to use this stream, and people can steal the code, I think that's the most useful thing we can do.
- MusicAnimal - That's how my work started. I saw Marcel earlier and thanked him again.
- AO: I've never used Huggle. "There's not a lot of community demand to protect new users"? Isn't that what Huggle's about?
- [That's Snuggle.]
- JM: Snuggle is more of an editor-reviewing tool. It asks you to classify editors. It sounds like a friendlier Huggle, but it's not.
- AO: Maybe show "this is an edit, maybe they need a hug".
- JM: Something we may use is the summary of an editor's history.
- AO: I've never heard or thought of a reputation system for users in Wikipedia, but is this something people have talked about?
- RK: From my perspective, the only times I've heard this come up is when people were asking us not to implement their worst fears of what it could be. People were concerned about the harassment angle. I've only heard people stigmatize any attempt at doing things related to reputation; haven't heard people speak up in favor of it. I'm sure there are ways to reduce that impact.
- DA: Aaron designed WikiCredit; it's based on content survival.
- MF: That could also be subtle vandalism.
- DA: It's not public by default.
- JM: Aaron said ORES could rank users, but we've chosen not to do that. I think that's probably the right decision; people are very wary of robots making these decisions.
- JM: Really appreciate you coming by, please stay in touch.