semanticweb.org:Fighting spam


Nobody loves spam. This page aims to aggregate the ways to combat spam on the semanticweb.org wiki and to coordinate the community's efforts to do so.

  • Proposal and ideas (this page)
  • Common spam patterns
  • How to clean up spam manually

Kinds of spam on the wiki

By user

  • Anonymous spam
  • Spam from registered user

By page action

  • Spamming on a user page
  • Spamming by creating a new page
  • Spamming on existing pages

By the sort of spam itself

  • Posting links to websites
  • Posting text with links that do not look like spam, for example links to URL-shortener services
  • Posting text without any links

Test pages

There are several dozen pages that are not spam but contain no useful content. Typically these are pages for testing Semantic MediaWiki features: they have SMW properties, and it is possible to export RDF from them. These pages should be removed, but their data dumps should first be uploaded to http://sandbox.semantic.mediawiki.org.
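Before a test page is deleted, its RDF can be captured through SMW's ExportRDF special page (the page name below is a hypothetical example):

http://semanticweb.org/wiki/Special:ExportRDF/SomeTestPage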

Current vulnerabilities of the wiki

  • Weak CAPTCHA. Currently QuestyCaptcha is used with only about 5 different questions, so it can easily be broken.
  • Registered users can post immediately.
  • Anonymous users can post links.

Tasks

Cleaning up the existing spam

Block spam users and delete spam pages

There are plenty of spam users that are not blocked yet and several pages that are entirely spam. Blocking the users and removing the pages they created can be done in one pass, using SecretaryBot, AutoWikiBrowser and Nuke (see #Tools).

Proposed actions


Block existing registered spammers! There are several ways to figure out whether a user is a spammer and block them:

  • Catch the spammer red-handed: block the user during rollback.
  • First glance: block all users with inappropriate usernames.
  • Second glance: view the user page; if it looks bad, delete it and block the spammer.
  • Detailed view: view the user's contributions; if they are bad, delete the user page and block the spammer.

Note: it would be great to collect all undo actions so that we can (semi-)automatically generate regular expressions for the spam filters (see below). To make that possible, every undo must be marked as spam in its edit summary.
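As a minimal sketch of that generation step (assuming the text of a bad revision has been saved to a local file; the file name is hypothetical), the hosts linked from spam edits can be turned into candidate blacklist regexes like this:

<?php
// Extract every linked host from a saved spam revision and print
// escaped regex fragments suitable for MediaWiki:Spam-blacklist.
$spamText = file_get_contents('spam-revision.txt'); // hypothetical dump of the bad edit
preg_match_all('!https?://([\w.-]+)!', $spamText, $matches);
foreach (array_unique($matches[1]) as $host) {
    echo preg_quote($host, '!') . "\n"; // escapes the dots, e.g. goo\.gl
}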

Remove spam users' contributions to the wiki

This is trickier. A spammer could write something on an existing page that someone else then edited again. Several strategies are applicable here.

  • The simplest case is when the wiki experienced a mass spam attack. In this case we can do a mass rollback.
  • Otherwise, we can first try to remove ONLY the spammer's contribution to the page and merge the result into the most recent version of the page. This is done with the simple Undo function. Is there a bot available that tries to undo all of a user's edits?
  • However, sometimes it is not possible to merge two versions without conflicts. In that case all we can do is form a regular expression that will undo all similar changes.

Proposed actions


  • Find or write a bot that tries to undo all contributions of blocked users; a sketch of the approach follows.
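A minimal sketch of such a bot, assuming the web API of MediaWiki versions of that time (pre-1.24 token handling), an already-authenticated session, and a hypothetical spammer account name; login and error handling are omitted:

<?php
// List the pages a blocked spammer touched, then try to roll back the top
// revision of each one. Rollback fails harmlessly when someone else has
// edited since; those pages need a manual merge instead.
$api  = 'http://semanticweb.org/w/api.php'; // assumed API endpoint
$user = 'SomeBlockedSpammer';               // hypothetical account name

$contribs = json_decode(file_get_contents(
    "$api?action=query&list=usercontribs&ucuser=" . urlencode($user) .
    "&ucprop=title&uclimit=500&format=json"), true);

foreach ($contribs['query']['usercontribs'] as $c) {
    $title = $c['title'];
    // fetch a rollback token for the page's latest revision
    $q = json_decode(file_get_contents(
        "$api?action=query&prop=revisions&rvtoken=rollback&titles=" .
        urlencode($title) . "&format=json"), true);
    $page = array_shift($q['query']['pages']);
    if (!isset($page['revisions'][0]['rollbacktoken'])) continue;
    // POST the rollback itself
    $post = http_build_query(array(
        'action' => 'rollback', 'title' => $title, 'user' => $user,
        'token'  => $page['revisions'][0]['rollbacktoken'], 'format' => 'json'));
    file_get_contents($api, false, stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => $post))));
}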

Prevent spam in the future

Add new expressions to the spam filter

There is a regex-based spam filter extension installed on this wiki: SpamBlacklist. It uses two blacklists, MediaWiki:Spam-blacklist and meta:MediaWiki:Spam-blacklist, to check whether an edit is good. Every time a user tries to save a page, the extension scans the text of the edit and denies saving if the text matches one of the regular expressions.

Proposed actions


  • Deny the creation of shortened links: much of the spam here consists of links from Google's URL-shortening service. A few illustrative blacklist entries are sketched after this list.
  • Analyze the HTML tags used for spam links and deny them where possible. Typically these are tags for invisible text or for black-colored links.
  • (probably) Write a little MediaWiki extension that allows quickly adding an expression to the blacklist when performing rollback and/or undo actions marked as spam. The extension has to extract the URL from the edit and ask an administrator whether it should be added to the blacklist. There is such an extension called SpamDiffTool, but I am not sure whether it works with modern MediaWiki.
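For illustration, entries in MediaWiki:Spam-blacklist are regex fragments matched against URLs, one per line; hypothetical entries blocking common shortener domains could look like this:

# block common URL-shortener domains
\bgoo\.gl\b
\bbit\.ly\b
\btinyurl\.com\b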

Improve captcha

Currently the only CAPTCHA used on the wiki is QuestyCaptcha, which asks a question from a predefined set.

Proposed actions


QuestyCaptcha has shown itself to be a great fit. However, we need to update the set of questions from time to time. Anonymous users should have to solve the CAPTCHA for the following actions (a configuration sketch follows the list):

  • adding a URL (external link)
  • registering a new account
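A possible LocalSettings.php sketch for this setup (the question shown is a hypothetical example; the old-style extension includes are assumed):

require_once("$IP/extensions/ConfirmEdit/ConfirmEdit.php");
require_once("$IP/extensions/ConfirmEdit/QuestyCaptcha.php");
$wgCaptchaClass = 'QuestyCaptcha';
// rotate this set of questions from time to time
$wgCaptchaQuestions[] = array(
    'question' => 'What does RDF stand for?',        // hypothetical question
    'answer'   => 'Resource Description Framework',
);
// ask for the CAPTCHA when anonymous users add links or register accounts,
// and let established users skip it
$wgCaptchaTriggers['addurl']        = true;
$wgCaptchaTriggers['createaccount'] = true;
$wgGroupPermissions['autoconfirmed']['skipcaptcha'] = true;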

Another efficient measure is a honeypot CAPTCHA, which targets spambots that fill in every field of the registration form. A honeypot CAPTCHA adds a hidden field that a human user will never change.

Proposed actions


Extension:SimpleAntiSpam is exactly such a thing.
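Assuming the usual extension layout of that time, installing it is a single line in LocalSettings.php:

require_once("$IP/extensions/SimpleAntiSpam/SimpleAntiSpam.php"); // adds a hidden honeypot field to the forms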

Change the policy of anonymous and newly registered users

Now the policy is the following:

  • anonymous users can add links after solving the CAPTCHA
  • they can add any other text without restriction
  • registered users can add external links without restriction

A better CAPTCHA may increase the security of the system, but it is always better to protect yourself twice: many spammers start posting right after registration.

Proposed actions


  • Add a timeout interval after signing up:
$wgAutoConfirmAge = 3600*24; // users become autoconfirmed 24 hours after registration
  • (probably) Allow external link creation only after e-mail confirmation:
$wgEnableEmail = true;          // enable the basic e-mail features
$wgEmailAuthentication = true;  // require e-mail authentication for any e-mail function (except the password reminder, which works independently of this setting)
$wgEmailConfirmToEdit = true;   // require a confirmed address to edit pages
  • Deny page creation to anonymous and newly registered users; only after the timeout can a user create pages:
$wgGroupPermissions['*']['createpage'] = false;             // anonymous users
$wgGroupPermissions['user']['createpage'] = false;          // freshly registered users
$wgGroupPermissions['autoconfirmed']['createpage'] = true;  // users past the timeout

Analyze spam-like behavior

There are several different ways to analyze spam-like behavior. The first is to analyze the information in the headers that the client sends to the server. The second is to analyze the client's actions on the wiki.

Proposed actions


  • Block users with a blank User-Agent field: add the following to .htaccess. All real browsers fill in the User-Agent field, so this protects the wiki from the dumbest spambots:
# block blank user agents
SetEnvIf User-Agent ^$ spammer=yes

Order allow,deny
allow from all
deny from env=spammer
  • Install AbuseFilter. AbuseFilter allows blocking users that behave like spambots, based on a set of rules defined by administrators. After installation we need to look at the rules on existing wikis, such as wikipedia:Special:AbuseFilter and Appropedia, and try to apply appropriate rules to our wiki; a sketch of one possible rule follows this list.
  • Install the Bad Behavior extension. It is an extension based on the Bad Behavior PHP project, which analyzes every request against a set of heuristics.
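As an illustration, an AbuseFilter rule targeting new accounts that add shortened links might look roughly like this (the listed domains are examples, not a vetted set):

/* non-autoconfirmed users adding URL-shortener links */
!("autoconfirmed" in user_groups) &
(added_links rlike "goo\.gl|bit\.ly|tinyurl\.com")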

Create a team of volunteers who can periodically clean up spam

Some spam will nevertheless occur even in the most protected wiki. We need several people with administrator rights who will read RecentChanges every week, blocking spammers and undoing the spam revisions.

Tools

Tools for batch editing

  • AutoWikiBrowser allows quickly and interactively forming page lists, editing pages by regex, and deleting pages. Batch blocking of users and removal of their contributions are not supported.
  • SecretaryBot includes a script that shows the username and user page and allows you to instantly delete the page and block the user.
  • The spam blacklist cleanup script allows quickly cleaning up all spam URLs matching the wiki's MediaWiki:Spam-blacklist.
  • Nuke allows deleting all pages created by a given user.
  • DeleteBatch allows creating a list of pages and deleting them in one click.

Volunteers

Links to read
