Moderator Tools/Automoderator/Measurement plan
This is a summary of the current draft of the Automoderator measurement plan, detailing how we will evaluate whether the project succeeds in meeting its goals, and understand its impact on Wikimedia projects.
The page is divided into the three hypotheses we have about Automoderator. Each hypothesis has two top-level data points (the most important numbers we care about), followed by a table detailing our current research questions and the evaluation methods or metrics we will use to test them. The research questions are inspired by our internal discussions about the project and by conversations we have had with editors (for example, here on MediaWiki).
This document is not fixed or final and will change as we learn more. Unfortunately we can't guarantee that this page will stay up to date following the initial community discussions we have about it. We may find that some questions are not feasible to answer with the available data, or we might identify new questions further down the line. We aim to share any major changes in project updates.
We really want to know what you think about this plan on the project talk page - does this capture the main data points you think we should track? Is anything missing or do you have ideas we could incorporate? What data would help you decide whether this project was successful?
QN = Quantitative measure (data)
QL = Qualitative measure (e.g. surveys, unstructured feedback)
Hypothesis #1
Hypothesis: Automoderator will expand patrollers' capacity by reducing the overall workload of reviewing and reverting recent changes, effectively enabling them to spend more time on other activities.
Top level data:
- Automoderator has a baseline accuracy of 90%.
- Moderator editing activity increases by 10% in non-patrolling workflows (e.g. content contributions or other moderation processes).
Research questions | Evaluation method/metric(s) | Notes
---|---|---
Is Automoderator effective in countering vandalism on wikis? | [QN] While the thresholds for success can vary based on the community, the team would consider the following as successes: | We don't yet know what a reasonable level of coverage is for Automoderator, so we will define X as we progress with the project. Each community will be able to customise the accuracy and coverage level for their community, so 90% would be a baseline figure applying to the most permissive option available.
| [QN] How long does vandalism stay in articles before being reverted, and how many readers see that vandalism? | Pageview data is not currently available on a per-revision basis, but this is something we can start collecting (T346350).
Does Automoderator reduce the workload of human patrollers in countering vandalism? | [QN] Proportion of edits reverted by Automoderator, human patrollers, and tool-assisted human patrollers across the time periods of 1 hr, 8 hrs, 24 hrs, and 48 hrs after an edit takes place. | 'Tool-assisted human patrollers' means patrollers using tools like Huggle and SWViewer.
| [QN/QL] Does the volume of various content moderation backlogs reduce? | Here we are hypothesising that patrollers might spend their additional time in other venues. We may need to start with some qualitative research to understand which backlogs we can/should monitor.
Does Automoderator help patrollers spend their time on other activities of their interest? | [QN] Distribution of contributions/actions (pre- and post-deployment) by patrollers across a tentative list of contribution types. The patrollers of the pilot wikis will also be surveyed. | There is a wide range of possible ways to look at this, so we may need to speak to patrollers to understand which activities to consider.
| [QL] Perception of patrollers of how they are contributing to the wiki post-deployment. | Qualitative changes in workflows compared to pre-Automoderator deployment - are they actually doing non-patroller work, or simply more specialised patroller work that Automoderator can't handle?
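The revert-share metric in the table above can be made concrete with a small sketch. The record structure, field names, and data below are illustrative assumptions, not the actual schema of any Wikimedia dataset; the windows match the 1/8/24/48-hour periods named in the table.

```python
from datetime import datetime, timedelta

# Hypothetical revert log: when the damaging edit was made, when it was
# reverted, and who performed the revert. Field names are illustrative.
reverts = [
    {"edit_ts": datetime(2024, 1, 1, 12, 0), "revert_ts": datetime(2024, 1, 1, 12, 2), "performer": "automoderator"},
    {"edit_ts": datetime(2024, 1, 1, 12, 0), "revert_ts": datetime(2024, 1, 1, 18, 30), "performer": "human"},
    {"edit_ts": datetime(2024, 1, 1, 12, 0), "revert_ts": datetime(2024, 1, 2, 20, 0), "performer": "tool_assisted_human"},
]

WINDOWS = [timedelta(hours=h) for h in (1, 8, 24, 48)]

def revert_share_by_window(reverts, windows=WINDOWS):
    """For each window, the share of reverts completed within that window,
    broken down by performer type."""
    shares = {}
    for window in windows:
        in_window = [r for r in reverts if r["revert_ts"] - r["edit_ts"] <= window]
        counts = {}
        for r in in_window:
            counts[r["performer"]] = counts.get(r["performer"], 0) + 1
        total = len(in_window)
        shares[window] = {p: c / total for p, c in counts.items()} if total else {}
    return shares
```

With the sample data above, only Automoderator's revert lands inside the 1-hour window, while all three performers appear by 48 hours, which is exactly the kind of shift in proportions the research question looks for.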
Hypothesis #2
Hypothesis: Communities are enthusiastic about using and engaging with Automoderator because they trust its effectiveness in countering vandalism.
Top level data:
- Automoderator is enabled on two Wikimedia projects by the end of FY23/24 (June 2024).
- 5% of patrollers engage with Automoderator tools and processes on projects where it is enabled.
Research questions | Evaluation method/metric(s) | Notes
---|---|---
Are communities enthusiastic to use Automoderator? | [QL] Sentiment towards Automoderator specifically and/or automated moderation tools broadly, both among administrators and non-administrator editors. [QL] Presence of custom documentation for Automoderator (e.g. guidance or guidelines on use). [QL] Uptake of Automoderator by specialised counter-vandalism groups (especially crosswiki ones) - stewards, global sysops, SWMT. [QN] String (TranslateWiki) and documentation (MediaWiki) translation activity. |
| [QN] Do communities enable Automoderator, and keep it enabled? If so, for how long? |
Are communities actively engaging with Automoderator because they believe it is an important part of their workflows? | [QN] What proportion of false positive report logs have been reviewed, and what proportion are yet to be reviewed? | May change based on the final design/form Automoderator takes.
| [QN] What is the usage of model exploration/visualisation tools? | May change based on the final design/form Automoderator takes.
| [QN] How often is Automoderator's configuration adjusted? | May be expanded based on the final design/form Automoderator takes. This may only be relevant when Automoderator is initially enabled and configured; after this we may not expect high activity levels.
Are communities able to understand the impact of Automoderator on the health of their community? | [QL] UX testing of the Automoderator configuration page and dashboards (if relevant) | On our first pilot wikis we may need to simply have a JSON or similar page, before Community Configuration is ready to provide a better front-end experience.
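The last row above notes that early pilot wikis may be configured through a plain JSON page before Community Configuration provides a proper front-end. As a purely hypothetical illustration of what such an on-wiki page might contain (none of these keys are Automoderator's actual configuration schema):

```json
{
  "enabled": true,
  "revert-risk-threshold": 0.98,
  "skip-user-groups": ["sysop", "bot"],
  "revert-summary": "Edit reverted by Automoderator - report false positives on the project page"
}
```

Measuring "how often is Automoderator's configuration adjusted?" would then reduce to counting edits to this page after the initial setup period.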
Hypothesis #3
Hypothesis: When good faith edits are reverted by Automoderator, the affected editors are able to report false positives, and the reverts do not harm those editors' journeys, because it is clear that Automoderator is an automated tool which is not passing judgement on them individually.
Note: As editors' experiences and journeys vary widely based on device, the following metrics should, where relevant, be split by platform and device.
Top level data:
- 90% of false positive reports receive a response or action from another editor.
Research questions | Evaluation method/metric(s) | Notes
---|---|---
Are good faith editors aware of the reverts made by Automoderator, and able to report if they believe one is a false positive? | [QL/QN] What is the perception of good faith newcomers when their edit has been reverted by Automoderator? | This may be a survey, interviews, or using QuickSurveys.
Are users who intend to submit a false positive report able to successfully submit one? | [QN] What proportion of users who start the report filing process complete it? [QL] UX testing of the false positive reporting stream. |
What is the effect of Automoderator on new editors' contribution journeys? | [QN] A/B experiment: Automoderator will randomly choose between taking and not taking a revert action on a newcomer's edit (details to be defined). The treatment group will be newcomers whose edits Automoderator reverts; the control group will be newcomers whose edits Automoderator would have reverted (based on the revert risk score) but, as part of the experiment, does not - those edits are instead later actioned by human moderators. [QL] QuickSurveys or a similar short survey tool may be feasible. | Retention and surveying of new editors is hard, but we have a lot of experience with this at the Wikimedia Foundation in the Growth team. We will be meeting with them to learn more about the options we have for evaluating this research question.
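The A/B assignment described above can be sketched as follows. The threshold value, field names, and 50/50 split are assumptions for illustration only - the actual experiment design is still to be defined by the team.

```python
import random

# Assumed score above which Automoderator would normally revert;
# purely illustrative, not the project's real threshold.
REVERT_RISK_THRESHOLD = 0.98

def assign_and_act(edit, rng=random):
    """Return (group, reverted) for a newcomer edit.

    Edits below the threshold are not eligible: Automoderator would not
    act on them anyway. Eligible edits are split 50/50: 'treatment' edits
    are reverted by Automoderator; 'control' edits are held back and left
    for human moderators to action later.
    """
    if edit["revert_risk"] < REVERT_RISK_THRESHOLD:
        return (None, False)
    if rng.random() < 0.5:
        return ("treatment", True)
    return ("control", False)
```

Comparing the subsequent editing activity of the treatment and control groups would then isolate the effect of the automated revert itself, since both groups made edits the model scored as equally revert-worthy.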
Guardrails
In addition to this goal-focused measurement plan, we also plan to define 'guardrails' - metrics that we will monitor to ensure we avoid negative impacts from Automoderator. For example, do fewer new editors stick around because Automoderator reverts are frustrating, or do patrollers become complacent because they place too much trust in Automoderator? These guardrails have not yet been documented, but we'll share them here once they are.
If you have thoughts about what could go wrong with this project, and data points we could be monitoring to verify these scenarios, please let us know.
Pilot phase metrics
While the measurement plan can be helpful for understanding and evaluating the impact of the project in the long term, we have identified some metrics to focus on during the pilot phase. The goal of these is to give the team and the community an overview of Automoderator's activity, and to monitor that nothing abnormal is happening. If you have suggestions for any other metrics that we should be tracking during the pilot phase, please leave a message on the talk page.
Indicator for | Metric(s) | Dimensions |
---|---|---|
Volume | Number of edits being reverted by Automoderator (absolute & percentage of all reverts) | Anonymous users, newcomers[1], non-newcomers[2] |
Accuracy (False positives) | Percentage of Automoderator's reverts reverted back | - |
Accuracy (False negatives) | Proportion of reverts not performed by Automoderator while it is turned on | - |
Efficiency | Average time taken for Automoderator to revert an edit | - |
- | Average time taken for Automoderator's reverts to be reverted back | - |
Guardrail | Post deployment, proportion of edits reverted by performer | Automoderator, humans, and tool-assisted humans (if applicable) |
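The false-positive accuracy proxy in the table above - the share of Automoderator's reverts that are themselves reverted - can be sketched in a few lines. The input record shape is an assumption for illustration, not an actual data schema.

```python
def false_positive_rate(automod_reverts):
    """Share of Automoderator's reverts that were reverted back.

    automod_reverts: list of dicts, each with a boolean 'reverted_back'
    flag indicating whether a human later undid Automoderator's revert.
    """
    if not automod_reverts:
        return 0.0
    flagged = sum(1 for r in automod_reverts if r["reverted_back"])
    return flagged / len(automod_reverts)
```

Note this is only a proxy: not every revert of Automoderator is a genuine false positive (and some false positives are never reverted back), which is why the plan pairs it with the false positive report workflow and qualitative review.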