Extension talk:BayesianFilter/GSoC 2013
Proposal Comments
Some quick initial comments:
- Updating the UI to collect the corpus is going to be hard, much more work than one week. Getting a button added to the UI would need design review and approval from the administrators. Alternatively, you may be able to collect reverts from ClueBot NG or STiki, or possibly look at reverted edits by users who have been blocked for spam. You could also add a button to the page using JavaScript that tags the revision just before the revert; convincing a few administrators to use your JavaScript will be much easier than convincing them all that another button is needed in the interface.
- Thanks for the suggestion. I guess I will use STiki; it labels edits as vandalism or innocent, so it will be easier to gather classified data.
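A rough sketch (not part of the proposal itself) of how labelled revert data might be pulled from the API with Python, assuming ClueBot NG's reverts are taken as spam/vandalism labels and using the requests library:
<syntaxhighlight lang="python">
# Hedged sketch: gather a labelled corpus from an anti-vandalism bot's reverts
# via the MediaWiki API. The bot name and target wiki are assumptions.
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumed target wiki

def fetch_revert_edits(bot="ClueBot NG", limit=50):
    """List recent edits by an anti-vandalism bot; each revert points at a
    revision that could be treated as a spam/vandalism training example."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": bot,
        "uclimit": limit,
        "ucprop": "ids|title|timestamp|comment",
        "format": "json",
    }
    return requests.get(API, params=params).json()["query"]["usercontribs"]

for contrib in fetch_revert_edits():
    # contrib["revid"] is the revert itself; the reverted (bad) revision would
    # still need to be resolved, e.g. from the parent revision's diff.
    print(contrib["title"], contrib["comment"])
</syntaxhighlight>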
- For the offline processing, you may want to focus on implementing the filter as a bot, which reads all of the incoming edits, and does the processing outside of the WMF cluster. The data handling will need to be pretty mature before we can run it on the production servers. Running this on a wmflabs instance shouldn't be a problem.
- That is exactly what I am doing. The filter will be a Python daemon. It will be called from a PHP script in the SpamFilter extension, which will provide it with all the incoming edits. The filter will evaluate each edit as spam or ham, update the DB, and return the result.
- So the difference is where the Python script actually runs. To get it on the Wikimedia cluster, it will need to be pretty mature, and go through a rigorous review for performance and security before it can be deployed. This can take several weeks. If, instead, it's actually running on a wmflabs instance, and just consuming a feed (using IRC or the API) of recent changes, then there are almost no security or performance requirements. So I'd recommend starting with that, with the goal of having it run on the cluster (either from a hook, or as a job runner) during the second half of the program.
- After talking with anubhav, the goal will not be to run this on the WMF cluster this summer, but just to develop the extension, and the WMF can evaluate its usefulness on WMF sites when it's done. So the above comments about the cluster are irrelevant. CSteipp (talk) 17:49, 25 April 2013 (UTC)
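For reference, the "bot consuming a feed of recent changes" approach suggested above could look roughly like the following Python sketch. The polling interval, the placeholder classify() function and the use of the API rather than the IRC feed are all assumptions:
<syntaxhighlight lang="python">
# Hedged sketch of the "bot outside the cluster" approach: poll the recent
# changes API from a wmflabs instance and hand each edit to the classifier.
import time
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumed target wiki

def classify(text):
    """Placeholder for the Bayesian classifier: return 'spam' or 'ham'."""
    return "ham"

def poll_recent_changes(last_rcid=0):
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit",
        "rcprop": "ids|title|user|comment",
        "rclimit": 50,
        "format": "json",
    }
    changes = requests.get(API, params=params).json()["query"]["recentchanges"]
    for rc in changes:
        if rc["rcid"] > last_rcid:
            last_rcid = rc["rcid"]
            verdict = classify(rc.get("comment", ""))
            print(rc["rcid"], rc["title"], verdict)
    return last_rcid

last = 0
while True:
    last = poll_recent_changes(last)
    time.sleep(30)  # polling instead of the IRC feed keeps the sketch simple
</syntaxhighlight>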
- Why create the classifier in Python? As you're developing it from scratch, it may be better to write it in PHP, unless Python has some advantage for the task.
- Well, I am thinking of developing this as a bot (I will discuss that with Chris Steipp), so it will not really be a part of the MediaWiki codebase. As such, PHP provides no advantage. I saw a question on Stack Overflow about the same topic where people suggested Python. Moreover, PHP does not have built-in support for multi-threading.
- You may find some revision metadata interesting, too. The variables recorded by AbuseFilter are: user_editcount, user_name, user_groups, article_article_id, article_namespace, article_text, article_prefixedtext, article_recent_contributors, action, summary, minor_edit, old_wikitext, new_wikitext, edit_diff, new_size, old_size, edit_delta, added_lines, removed_lines, added_links, all_links, old_links, tor_exit_node, timestamp. Some values which could be interesting: aggregated user_editcount, article_namespace, summary, minor_edit, tokenized added_lines, time since last edit...
- Can you provide me an analysis of AbuseFilter showing how some of these variables affect spamming? I will take a look at AbuseFilter in a while.
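Purely as an illustration, a few of the variables listed above could be turned into classifier features along these lines; the variable subset and the thresholds are arbitrary assumptions, not AbuseFilter code:
<syntaxhighlight lang="python">
# Illustrative only: turning a few AbuseFilter-style variables into features.
def extract_features(old_wikitext, new_wikitext, user_editcount, minor_edit, summary):
    delta = len(new_wikitext) - len(old_wikitext)                     # edit_delta
    added_links = new_wikitext.count("http") - old_wikitext.count("http")
    return {
        "edit_delta": delta,
        "added_external_links": max(added_links, 0),
        "new_user": user_editcount < 10,                              # arbitrary cut-off
        "minor_edit": bool(minor_edit),
        "empty_summary": not summary.strip(),
    }

print(extract_features("old text", "old text http://example.com", 2, False, ""))
</syntaxhighlight>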
- The alpha/special/whitespace character classes should be configurable and depend on the language. Perhaps Unicode properties could be used.
- That's a good suggestion. I will keep that in mind.
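A minimal sketch of language-neutral character classes built on Unicode categories from the standard-library unicodedata module; the particular grouping into alpha/digit/whitespace/special is an assumption:
<syntaxhighlight lang="python">
# Sketch: classify characters by Unicode category instead of ASCII ranges,
# so the feature works the same way for non-Latin scripts.
import unicodedata

def char_class(ch):
    cat = unicodedata.category(ch)      # e.g. 'Lu', 'Ll', 'Nd', 'Zs', 'Po'
    if cat.startswith("L"):
        return "alpha"
    if cat.startswith("N"):
        return "digit"
    if cat.startswith("Z") or ch in "\t\n\r":
        return "whitespace"
    return "special"

print([char_class(c) for c in "Füße 3,-!"])
</syntaxhighlight>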
- What's the reasoning behind the short words % attribute? It seems more likely that a problematic edit contains a 25-character "word".
- Looking up words in a dictionary (% of the words found in the language dictionary) may be an interesting alternative method.
- Will do that
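A possible sketch of the dictionary-lookup feature suggested above; the word-list path is hypothetical and any per-language word list could be substituted:
<syntaxhighlight lang="python">
# Sketch of the "% of words found in a dictionary" feature.
import re

def load_dictionary(path="/usr/share/dict/words"):   # assumed word list
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {line.strip().lower() for line in f if line.strip()}

def dictionary_ratio(text, dictionary):
    words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    if not words:
        return 0.0
    known = sum(1 for w in words if w in dictionary)
    return known / len(words)

# Example: a gibberish-heavy edit should score low.
words = load_dictionary()
print(dictionary_ratio("buy cheapp viaggra nowww", words))
</syntaxhighlight>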
Platonides (talk) 21:44, 25 April 2013 (UTC)
Going over your proposal again:
- I noticed Aug 3-17, for the online integration. I'm not sure if you were planning on it or not, but it might be best to build that on top of the existing AbuseFilter extension. The extension has hooks that allow other extensions to supply variables, so a variable indicating whether the filter thinks an edit is spam could simply be added. That will save you from re-implementing things like logging (which would need to include deletion and suppression of entries, and other nasty issues like that), and the tagging/warning/blocking logic.
- I have changed the proposal accordingly. I will study the AbuseFilter code before the GSoC timeline starts to understand how I can use it better.
- Also, just to clarify, the week of Aug 24th, you're planning to use the existing JobQueue infrastructure in MediaWiki, correct? I think it should be able to do all of what you need.
- No. Actually, as you can see in the model I am proposing, the filter would be a separate bot, so it won't be a part of the MediaWiki code base. So I guess I won't be able to use the MediaWiki job scheduler. As the classifier is a separate bot, I was thinking of developing it in Python for the daemon script. I will be using Advanced Python Scheduler (APScheduler) for the job queue.
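A rough sketch of how APScheduler could drive periodic jobs in the daemon (this uses the current 3.x-style API; the interface available in 2013 differed, and retrain() is a placeholder):
<syntaxhighlight lang="python">
# Sketch: a periodic retraining job scheduled inside the daemon with APScheduler.
import time
from apscheduler.schedulers.background import BackgroundScheduler

def retrain():
    print("retraining the Bayesian model from newly labelled edits...")

scheduler = BackgroundScheduler()
scheduler.add_job(retrain, "interval", minutes=30)   # periodic retraining job
scheduler.start()

try:
    while True:
        time.sleep(1)    # the daemon's main loop would live here
except (KeyboardInterrupt, SystemExit):
    scheduler.shutdown()
</syntaxhighlight>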
- Lastly, do you need to schedule time to come up to speed on MediaWiki in general? And to get your dev environment set up? Just make sure to give yourself time. A lot of people will be traveling to Wikimania the week of Aug 5th, so we want to make sure you're all set up before that to hit your Aug 3rd-10th milestone. CSteipp (talk) 22:03, 26 April 2013 (UTC)
Spam-Data
Maybe the data gathered by our Abuse-Filter will help:
www.wiki-aventurica.de/wiki/Spezial:Missbrauchsfilter
The logs of filters 2, 4, 5 and 6 are probably more than 99% spam. The logs of filters 1 and 3 probably also have a high spam percentage. Filters 7, 8 and 9 are not designed to capture spam.--78.35.52.207 12:39, 26 May 2013 (UTC)
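A hedged sketch of pulling one of those filters' log entries through the API as spam-labelled data; the /w/api.php endpoint for that wiki is assumed, and how much of each log entry is visible depends on the wiki's rights configuration:
<syntaxhighlight lang="python">
# Sketch: fetch abuse log hits for a specific filter as spam training candidates.
import requests

API = "https://www.wiki-aventurica.de/w/api.php"   # assumed API endpoint

params = {
    "action": "query",
    "list": "abuselog",
    "aflfilter": 2,              # filters 2, 4, 5 and 6 are the spam-heavy ones
    "afllimit": 50,
    "aflprop": "ids|title|user|timestamp|revid",
    "format": "json",
}
hits = requests.get(API, params=params).json()["query"]["abuselog"]
for hit in hits:
    print(hit["title"], hit.get("revid"))
</syntaxhighlight>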
More initial training data
Hello, thanks for your updates! I read «Got the MediaWiki sysop permissions. I will now fetch mediawiki deleted pages and use them to train the DB.» Would you use some more training data from sysops of other (big, "international") wikis? I assume you used some script and I'd gladly provide you with translatewiki.net's. We have at least a few thousand deleted spam pages; not a huge set, but who knows. --Nemo 09:28, 7 September 2013 (UTC)
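A tentative sketch of how such a script might enumerate recently deleted pages from the deletion log as candidate spam examples; the endpoint is assumed, and fetching the deleted text itself would still require sysop-level rights and an authenticated session, which this sketch does not include:
<syntaxhighlight lang="python">
# Sketch: list deletion-log entries whose summaries hint at spam cleanup.
import requests

API = "https://translatewiki.net/w/api.php"   # assumed API endpoint

params = {
    "action": "query",
    "list": "logevents",
    "letype": "delete",
    "leprop": "title|comment|timestamp",
    "lelimit": 50,
    "format": "json",
}
events = requests.get(API, params=params).json()["query"]["logevents"]
for ev in events:
    # Deletion summaries like "spam" or "spambot" suggest useful training pages;
    # the deleted content still has to be fetched with appropriate rights.
    print(ev["timestamp"], ev["title"], ev.get("comment", ""))
</syntaxhighlight>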