Manual:$wgSpamRegex

From mediawiki.org
Access: $wgSpamRegex
Regular expression patterns which, if matched by the page content, stop the page from being saved.
Introduced in version: 1.2.6
Removed in version: Still in use
Allowed values: (array of regex strings)
Default value: []

Any text added to a wiki page that matches this regular expression (or "regex") is treated as wiki spam, and the edit is blocked. $wgSpamRegex affects all user groups: even members of the sysop and bureaucrat groups cannot save text that matches it. To set up rules that can also filter by group, use Extension:AbuseFilter. $wgSpamRegex is one of MediaWiki's most effective built-in anti-spam features. It will not block all spam, but it can reduce spam dramatically, with almost no negative impact on legitimate users. Its setting controls how MediaWiki examines the text of contributions and decides whether they are spam.

Warning: If your spam filter regular expression quietly fails, it may need a higher PCRE backtrack limit. See pcre.backtrack_limit below.

A large example

The following example is a good starting point for a small or medium-sized wiki suffering from spam attacks. Paste the following into your LocalSettings.php file:

 $wgSpamRegex = ["/".                        # The "/" is the opening wrapper
                "s-e-x|zoofilia|sexyongpin|grusskarte|geburtstagskarten|".
                "(animal|cam|chat|dog|hardcore|lesbian|live|online|voyeur)sex|sex(cam|chat)|adult(chat|live)|".
                "adult(porn|video|web.)|(hardcore|teen|xxx)porn|".
                "live(girl|nude|video)|camgirl|".
                "spycam|casino-online|online-casino|kontaktlinsen|cheapest-phone|".
                "laser-eye|eye-laser|fuelcellmarket|lasikclinic|cragrats|parishilton|".
                "paris-(hilton|tape)|2large|fuel(ing)?-dispenser|huojia|".
                "jinxinghj|telemati[ck]sone|a-mortgage|diamondabrasives|".
                "reuterbrook|sex-(with|plugin|zone)|lazy-stars|eblja|liuhecai|".
                "buy-viagra|-cialis|-levitra|boy-and-girl-kissing|". # These match spammy words
                "dirare\.com|".           # This matches dirare.com a spammer's domain name
                "overflow\s*:\s*auto|".   # This matches against overflow:auto (regardless of whitespace on either side of the colon)
                "height\s*:\s*[0-4]px|".  # This matches against height:0px (most CSS hidden spam) (regardless of whitespace on either side of the colon)
                "==<center>\[|".          # This matches some recent spam related to starsearchtool.com and friends
                "\<\s*a\s*href|".         # This blocks all href links entirely, forcing wiki syntax
                "display\s*:\s*none".     # This matches against display:none (regardless of whitespace on either side of the colon)
                "/i"];                     # The "/" ends the regular expression and the "i" switch which follows makes the test case-insensitive
                                          # The "\s" matches whitespace
                                          # The "*" is a repeater (zero or more times)
                                          # The "\s*" means to look for 0 or more amount of whitespace

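Note that the last pattern line ("display\s*:\s*none") does not have a "|" at the end of the string. This is because the next line ends the regular expression with the closing wrapper "/" followed by the "i" switch. A trailing "|" at that point would add an empty alternative, and an empty alternative matches every edit.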
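The effect of a stray "|" before the closing wrapper can be checked directly with PHP's preg_match(). The following is a small sketch; the two variable names and the sample text are made up for illustration.

```php
<?php
// Same alternation twice; the first one ends with a stray "|".
$withTrailingBar    = '/overflow\s*:\s*auto|display\s*:\s*none|/i';
$withoutTrailingBar = '/overflow\s*:\s*auto|display\s*:\s*none/i';

$text = 'A perfectly ordinary sentence.';

// The empty alternative after the last "|" matches the empty string,
// so every text is reported as spam.
var_dump( preg_match( $withTrailingBar, $text ) );    // int(1)
var_dump( preg_match( $withoutTrailingBar, $text ) ); // int(0)
```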

This example incorporates common spamming keywords (some taken from Meta-Wiki's Spam Blacklist) and also techniques for blocking CSS hidden spam.
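Because $wgSpamRegex accepts an array of regex strings, the same kind of rules can also be split into several shorter patterns rather than one long alternation. The fragment below is only a sketch, assuming each array entry is tested on its own; the grouping and the short keyword selection are illustrative, not a recommended list.

```php
# LocalSettings.php fragment (illustrative grouping only)
$wgSpamRegex = [
    # Spammy keywords and a spammer's domain name
    '/buy-viagra|online-casino|casino-online|dirare\.com/i',
    # CSS tricks used to hide spam from view
    '/overflow\s*:\s*auto|height\s*:\s*[0-4]px|display\s*:\s*none/i',
    # Raw HTML links, forcing wiki syntax
    '/<\s*a\s*href/i',
];
```

Single-quoted PHP strings are used here so that backslashes reach the regex engine unchanged. Shorter entries are also easier to comment out one at a time when hunting for a false positive.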

Using regular expressions to block spam

Here is a tutorial on regular expressions. Experiment with the $wgSpamRegex setting, and test some edits on your sandbox page to see what gets blocked. But beware: take care to avoid false positives, i.e. incorrectly matching legitimate edits (see Avoid false positives below).

The setting which you assign to $wgSpamRegex is a regular expression (see Wikipedia's article and PHP's manual on regular expressions). The above example shows a regular expression being built up over several lines, using PHP's dot syntax to concatenate strings. This keeps the long regular expression readable and lets each part carry its own comment, at the cost of slightly more complicated syntax.

If you create your own regular expressions you may want to test them out in a PCRE Regex Evaluator (click the PCRE tab on this page).

Simple example

Here's a simpler example:

$wgSpamRegex = ["/buy-viagra/"];

Remember, the idea is to decide: is this spam, yes or no? With this example, any contribution text containing 'buy-viagra' will match as spam. The '/' symbols at the beginning and end are part of the regular expression syntax.
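A quick way to try such a pattern outside the wiki is a small standalone PHP script. The sketch below assumes that each array entry is tested in turn with preg_match(); findSpamMatch() is a hypothetical helper written for this example, not a MediaWiki function.

```php
<?php
// Hypothetical helper: returns the first matched text, or null if no
// pattern in the list matches.
function findSpamMatch( array $regexes, string $text ): ?string {
    foreach ( $regexes as $regex ) {
        if ( preg_match( $regex, $text, $matches ) === 1 ) {
            return $matches[0];
        }
    }
    return null;
}

$wgSpamRegex = [ "/buy-viagra/" ];

// Returns the string "buy-viagra"
var_dump( findSpamMatch( $wgSpamRegex, 'Click here to buy-viagra now' ) );
// Returns null
var_dump( findSpamMatch( $wgSpamRegex, 'A perfectly normal edit' ) );
```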

Block several different words/domains

Let's extend our example to try to match more kinds of spam:

$wgSpamRegex = ["/buy-viagra|adultporn|online-casino|dirare\.com|sexcluborgy\.net/"];

Using a '|' symbol between words, the above example will block several different spammy words, and also some domain names which are promoted by spammers.

The $wgSpamRegex is applied to all contributed text, including the spam link URLs. As such, blocking domain names can be a very effective way of getting rid of a particular spammer.
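When the list of blocked words and domains grows, the pattern can be generated from plain strings, so that characters such as "." do not have to be escaped by hand. This is a sketch only; buildSpamPattern() is a hypothetical helper, and the terms are the ones from the example above.

```php
<?php
// Hypothetical helper: build one case-insensitive alternation from plain strings.
function buildSpamPattern( array $terms ): string {
    if ( $terms === [] ) {
        // An empty alternation ("//i") would match every edit.
        throw new InvalidArgumentException( 'At least one term is required' );
    }
    $escaped = array_map(
        static function ( string $term ): string {
            // preg_quote() escapes regex metacharacters such as ".";
            // the second argument also escapes the "/" wrapper.
            return preg_quote( $term, '/' );
        },
        $terms
    );
    return '/' . implode( '|', $escaped ) . '/i';
}

$wgSpamRegex = [
    buildSpamPattern( [ 'buy-viagra', 'dirare.com', 'sexcluborgy.net' ] ),
];
```

Because the dots are escaped, 'dirare.com' matches only the literal domain and not, for example, 'dirareXcom'.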

Avoid false positives

Avoiding false positives is the real challenge here, and it's best illustrated with a bad example:

# Don't do this!
$wgSpamRegex = ["/cialis/"];

Lots of spammers like to talk about 'cialis' (some kind of drug; who cares? Not us!), so you might be tempted to match the word as spam. But this will also prevent users from mentioning the word 'specialist'. It is very easy to make this kind of mistake, so be careful with your regular expression setting. You want to stop spammers without inconveniencing your users. This problem can be overcome in many cases by including the "\b" word boundary pattern before and after any words that might be contained in a larger word, e.g.

# This will match "cialis", but not "specialist"
$wgSpamRegex = ["/\bcialis\b/"];

# You can also include this option around a group of patterns, e.g.
$wgSpamRegex = ["/\b(cialis|viagra|porn|sex|anal)\b/"];
# This will avoid banning words like "analysis" or "Essex".
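The difference that "\b" makes can be verified with a few sample phrases. This is a small sketch; the sample texts are made up, and the "i" switch is added here as an assumption so that "Essex" is tested case-insensitively.

```php
<?php
$unanchored = '/cialis/i';
$anchored   = '/\b(cialis|viagra|porn|sex|anal)\b/i';

$samples = [
    'cheap cialis here',  // real spam
    'ask a specialist',   // contains "cialis" inside a longer word
    'data analysis',      // contains "anal" inside a longer word
    'born in Essex',      // contains "sex" inside a longer word
];

foreach ( $samples as $text ) {
    // preg_match() returns 1 for a match and 0 for no match.
    printf(
        "%-20s unanchored=%d anchored=%d\n",
        $text,
        preg_match( $unanchored, $text ),
        preg_match( $anchored, $text )
    );
}
```

Only the first sample is matched by the anchored pattern, while the unanchored pattern also blocks 'ask a specialist'.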

Other regexp tips

Regular expressions are very powerful. $wgSpamRegex matching is applied to all text of the page or section being edited, not just URLs. This gives you the power to block anything you don't like, if you can work out a good regular expression to match it (be as specific as possible to avoid false positives). In the following section on CSS Hidden Spam we make use of this tool.

Spam match message

Normally when the $wgSpamRegex setting matches some spam, the following message is displayed:

The page you wanted to save was blocked by the spam filter. This is probably caused by a link to a blacklisted external site.
The following text is what triggered our spam filter:

[word/domain name which was blocked]

This text can be changed; it is located on two editable wiki pages in the MediaWiki namespace. Click 'Special Pages' -> 'Wiki data and tools: System Messages', type 'spampro' into the 'Filter by prefix:' field and click 'Go'. If you get 'View Source' instead of 'Edit' on the top tab, then you don't have permission to edit. You need to log in as a sysop user (or the WikiSysop user which you configured during installation).

'$1' in MediaWiki:Spamprotectionmatch displays the failed edit's regex match that tripped the spam filter. Delete '$1' if you want it hidden.

Displaying/Hiding the matched text

If you've made a regex which is too restrictive, or you have made some other mistake in the setting, then you may get false positives. Indeed the full example above might match legitimate text in some rare circumstances (maybe your users really do want to talk about buying Viagra).

By displaying the text which matched, the MediaWiki:Spamprotectionmatch message helps to reduce problems caused by false positives.

It allows your users to accurately report problems to you, about your $wgSpamRegex setting.

It also allows them to figure out a workaround, so they can continue with their wiki editing.

Unfortunately, it's also a very useful bit of information for spammers visiting your site. Some spammers are automated bots, so they won't be seeing this information anyway, but many spammers (believe it or not) are humans. These humans could go to the trouble of looking at the matching information and trying to devise a workaround (e.g. just leaving out the domain name that you have blocked, but linking to various other domains). It's difficult to know how prevalent this kind of behavior is, but if you want to make life more difficult for them, you could hide the spam matching information by simply setting your MediaWiki:Spamprotectionmatch message to empty. You should only do this if you are very aware of the above points about false positives, and have carefully designed your regexp to avoid them.

CSS Hidden Spam

MediaWiki is quite permissive when it comes to HTML tags and CSS style definitions (see Help:HTML in wikitext).

This has given spammers the opportunity to invent a sneaky trick to hide their spam from view. It doesn't show up on your pages, but it does show up in your edit boxes, and the changes show up in your 'recent changes' display. As such it causes confusion to your legitimate users, and that's before you consider the effects of helping a spammer by hosting their links. Generally 'CSS Hidden Spam' is all bad. Just because you can't see it (easily), doesn't mean you can ignore it.

The problem was identified by the folks at chongqed.org in 2005, but got a lot worse in 2006, to the point where it seemed most MediaWiki spammers were using this trick.

We can use a regular expression to prevent the CSS tricks which they are using. Two of these are incorporated in the full example above (combined using the '|' symbol):

To prevent CSS hidden spam of the form <div style="overflow:auto; height:0px;">:

$wgSpamRegex = ["/".
  "overflow\s*:\s*auto|".
  "height\s*:\s*[0-4]px".   # No "|" here: a trailing "|" would add an empty alternative that matches every edit
  "/i"];

To prevent CSS hidden spam of the form style="display:none;":

$wgSpamRegex = ["/style\s*=\s*\"\s*display\s*:\s*none\s*;?\s*\"/i"];
     # Which parses as follows:
     # "       = PHP string wrapper
     # /       = RegEx opening wrapper

     # style   = search for the string 'style'
     # \s*=\s* = search for an equals sign with any amount of whitespace (including no whitespace) on either side
     # \"\s*   = search for a double quote (escaped with a backslash so that it does not end the PHP string), then optional whitespace
     # display = search for the string 'display'
     # \s*:\s* = search for a colon with any amount of whitespace (including no whitespace) on either side
     # none\s* = search for the string 'none' followed by any amount of whitespace (including no whitespace)
     # ;?\s*\" = search for an optional semicolon, optional whitespace, and the closing double quote

     # /       = RegEx closing wrapper
     # i       = RegEx switch makes tests case-insensitive
     # "       = PHP string wrapper
     # ;       = PHP line end

For a slightly stricter setting, you might prefer to disallow various CSS properties inside the style attribute altogether:

$wgSpamRegex = ["/\<.*style.*(display|position|overflow|visibility|height)\s*:.*>/i"];

...but you may find this starts to restrict your users more than you would like.
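How far the stricter pattern reaches can be seen by running it against a few sample fragments. The sketch below uses made-up sample markup; it suggests that ordinary styling such as "line-height" is caught as well, because "height" appears inside the property name.

```php
<?php
$strict = "/\<.*style.*(display|position|overflow|visibility|height)\s*:.*>/i";

$samples = [
    '<div style="display:none">spam</div>'              => 'hidden spam',
    '<span style="color:red; line-height:1.5">x</span>' => 'legitimate styling',
    'The shelf height: 10 cm'                           => 'plain prose',
];

foreach ( $samples as $markup => $kind ) {
    // preg_match() returns 1 for a match and 0 for no match.
    $verdict = preg_match( $strict, $markup ) === 1 ? 'blocked' : 'allowed';
    echo "$kind: $verdict\n";
}
```

The first two samples are blocked and the third is allowed, since the pattern requires a "<" before "style".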

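You can block all external links, both in page text and in edit summaries, by using this regex:

# Block ALL external links
$wgSpamRegex = ["/https?:\/\//"];
# $wgSummarySpamRegex applies the same kind of check to edit summaries
$wgSummarySpamRegex = ["/https?:\/\//"];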

This is extremely restrictive for the wiki's legitimate users, as they can no longer link to any external site. It is a poor solution to the spam problem, although it is marginally better than a complete lockdown.

If you are going to use this, make sure your 'MediaWiki:Spamprotectiontext' page has an explanation of what you have done.

You can limit the total number of external links allowed per page, to say 100, with this:

# Limit total number of external links allowed per page (the ? in *? makes * ungreedy and is important for efficiency)
# "https?:" counts both http: and https: links
$wgSpamRegex = ["/(https?:(.|\n)*?){101}/"];

If you do this, make sure your 'MediaWiki:Spamprotectiontext' page has an explanation of what you've done.
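The link-limit pattern can be tried out with a much smaller limit before it is deployed. The sketch below assumes that both http and https links should count; linkLimitPattern() is a hypothetical helper and the example.* URLs are placeholders.

```php
<?php
// Hypothetical helper: pattern that matches once a text holds more
// than $maxLinks links.
function linkLimitPattern( int $maxLinks ): string {
    // In single quotes PHP leaves "\n" alone; PCRE then reads it as a newline.
    return '/(https?:(.|\n)*?){' . ( $maxLinks + 1 ) . '}/';
}

$pattern = linkLimitPattern( 2 );

$twoLinks   = "http://a.example\nhttps://b.example";
$threeLinks = $twoLinks . "\nhttp://c.example";

var_dump( preg_match( $pattern, $twoLinks ) );   // int(0): within the limit
var_dump( preg_match( $pattern, $threeLinks ) ); // int(1): one link too many
```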

pcre.backtrack_limit

Warning: If your spam filter regex quietly fails, it may have hit PCRE's backtrack limit and need a higher setting. Alternatively, you may need to write your regex more efficiently: making * ungreedy by adding a ? to it, like so *?, can greatly help. Test your home-brewed regexes in a PCRE Regex Evaluator (click the PCRE tab there).

Since version 5.3.7, PHP's pcre.backtrack_limit defaults to 1000000 (1M). However, this may still be too low. Try adding the following line to your "LocalSettings.php" file:

// Perl Compatible Regular Expressions backtrack limit (a step count, not a byte size)
ini_set( 'pcre.backtrack_limit', '2000000' );

If this is still not enough, you may gradually increase the limit until it fits your wiki's actual requirements.
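Whether a pattern failed because of the backtrack limit, rather than simply not matching, can be checked in a standalone script: preg_match() returns false on failure, and preg_last_error() reports the reason. The sketch below uses a hypothetical helper, checkPattern(), written for this example.

```php
<?php
// Hypothetical helper: run one pattern and report whether PCRE gave up.
function checkPattern( string $regex, string $text ): string {
    $result = preg_match( $regex, $text );
    if ( $result === false ) {
        // false means the match could not be completed, e.g. a limit was hit.
        return preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
            ? 'backtrack limit exhausted'
            : 'regex error (code ' . preg_last_error() . ')';
    }
    return $result === 1 ? 'match' : 'no match';
}

echo checkPattern( '/buy-viagra/', 'please buy-viagra today' ), "\n";
echo checkPattern( '/buy-viagra/', 'an ordinary edit' ), "\n";
```

Running your own patterns through such a check against a few of your largest pages gives an idea of whether the limit needs raising or the regex needs simplifying.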

See also