DeadlinkChecker

== Links ==
* [https://github.com/wikimedia/DeadlinkChecker Code available here]
* [https://packagist.org/packages/wikimedia/deadlinkchecker Available on Packagist]
* To report bugs, create a ticket in [https://phabricator.wikimedia.org/ Phabricator] and tag with [https://phabricator.wikimedia.org/tag/community-tech/ Community-tech tag]
* [https://github.com/wikimedia/DeadlinkChecker/blob/master/README.md How to use?]

DeadlinkChecker is a PHP library for checking whether a given URL is dead or alive.
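
For orientation, here is a minimal usage sketch. The namespace, class, and method names (Wikimedia\DeadlinkChecker\CheckIfDead, isLinkDead(), areLinksDead()) are assumptions recalled from the README linked in the Links section above, so treat that README as the authoritative reference.

<syntaxhighlight lang="php">
<?php
// Minimal usage sketch. The class and method names are assumptions based on
// the project README; check the README for the authoritative API.
// Install (package name as on Packagist): composer require wikimedia/deadlinkchecker

use Wikimedia\DeadlinkChecker\CheckIfDead;

require_once __DIR__ . '/vendor/autoload.php';

$checker = new CheckIfDead();

// Check a single URL; true means the link appears dead, false means alive.
$isDead = $checker->isLinkDead( 'https://en.wikipedia.org' );
var_dump( $isDead );

// Check several URLs at once; the result is keyed by URL.
$results = $checker->areLinksDead( [
	'https://en.wikipedia.org',
	'https://en.wikipedia.org/nothing-to-see-here',
] );
var_dump( $results );
</syntaxhighlight>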

About the project

While working on Community Wishlist Survey Wish #1 - Migrating dead links to archives, we were faced with the problem of detecting dead links. Seems like a problem someone somewhere would already have solved, right? So we thought. It turned out that checking for dead links is quite a non-trivial problem. There are hundreds of ways a website can be dead, be temporarily dead, not be dead yet say it is dead, say it's not dead yet be dead... you get the idea. So we started to write our own Deadlink Checker library for PHP. It started out really basic, with just a check of the HTTP response code, but over time we added more complicated checks.

Here's how the code works (roughly):

  • For each incoming URL, we curl it to get the header information.
    • The curl options are set based on whether it is an HTTP or FTP request and whether we want just the headers or the complete HTML.
  • Based on the data we get back from curl, we perform the following checks (sketched in the code below):
    • Did the URL redirect to the domain root? We derive a set of probable domain roots before making this check.
    • Was the response code bad or uncertain? In that case we do a full request instead of a header-only one.
    • Does the end URL itself give away that it is an error/404 page?
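
To make the steps above concrete, here is a rough, self-contained PHP sketch of this style of check built directly on curl. It is not the library's actual code: the helper names (fetchUrl(), redirectedToRoot(), looksLikeErrorUrl()) and the thresholds are illustrative, and the real implementation handles many more cases.

<syntaxhighlight lang="php">
<?php
// Illustrative sketch only, not the DeadlinkChecker source. It mirrors the
// rough flow above: curl for the headers, a full request when the response
// code is bad or uncertain, then a look at where we finally ended up.

// Fetch a URL with curl; $headersOnly switches between a header-only request
// and a full GET.
function fetchUrl( string $url, bool $headersOnly = true ): array {
	$ch = curl_init( $url );
	curl_setopt_array( $ch, [
		CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
		CURLOPT_FOLLOWLOCATION => true,  // follow redirects so we can inspect the end URL
		CURLOPT_MAXREDIRS      => 10,
		CURLOPT_NOBODY         => $headersOnly,
		CURLOPT_TIMEOUT        => 30,
		CURLOPT_USERAGENT      => 'DeadlinkCheckerSketch/0.1',
	] );
	curl_exec( $ch );
	$result = [
		'code'   => curl_getinfo( $ch, CURLINFO_HTTP_CODE ),
		'endUrl' => curl_getinfo( $ch, CURLINFO_EFFECTIVE_URL ),
		'error'  => curl_errno( $ch ),
	];
	curl_close( $ch );
	return $result;
}

// Did we silently end up at the bare domain root even though we asked for a
// deeper page? That is often how a site hides a removed page.
function redirectedToRoot( string $originalUrl, string $endUrl ): bool {
	$originalPath = parse_url( $originalUrl, PHP_URL_PATH ) ?: '/';
	$endPath      = parse_url( $endUrl, PHP_URL_PATH ) ?: '/';
	return $originalPath !== '/' && $endPath === '/';
}

// Does the URL we finally landed on advertise that it is an error page?
function looksLikeErrorUrl( string $endUrl ): bool {
	return (bool)preg_match( '/404|error|not[-_]?found/i', $endUrl );
}

function isProbablyDead( string $url ): bool {
	$info = fetchUrl( $url, true );

	// Bad or uncertain response code: some servers mishandle header-only
	// requests, so retry with a full GET before judging.
	if ( $info['code'] === 0 || $info['code'] >= 400 ) {
		$info = fetchUrl( $url, false );
	}

	if ( $info['error'] !== 0 || $info['code'] === 0 || $info['code'] >= 400 ) {
		return true; // curl failure or an error status code
	}
	if ( redirectedToRoot( $url, $info['endUrl'] ) ) {
		return true; // bounced to the homepage
	}
	if ( looksLikeErrorUrl( $info['endUrl'] ) ) {
		return true; // the end URL itself says "error"
	}
	return false;
}
</syntaxhighlight>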

The big piece missing from the code is soft-404 detection, but apart from that it works well. The library was used in conjunction with software that kept a database log of how many times each URL had been checked. If a URL is reported dead at least three times over the course of a few days, we conclude that it is dead; this rules out links that were only temporarily down.
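
That three-strikes rule lived in the surrounding software rather than in this library. A hypothetical sketch of the bookkeeping (the link_checks table, its columns, and the helper names are invented for illustration, and MySQL-flavoured SQL is assumed) might look like this:

<syntaxhighlight lang="php">
<?php
// Hypothetical bookkeeping around the checker, not part of this library.
// The `link_checks` table and its columns are invented for illustration,
// and the SQL assumes MySQL (NOW(), DATEDIFF()).

function recordCheck( PDO $db, string $url, bool $reportedDead ): void {
	$stmt = $db->prepare(
		'INSERT INTO link_checks (url, checked_at, reported_dead) VALUES (?, NOW(), ?)'
	);
	$stmt->execute( [ $url, (int)$reportedDead ] );
}

// Only conclude a link is dead once it has been reported dead at least three
// times, spread over a few days, to rule out links that are only briefly down.
function isConclusivelyDead( PDO $db, string $url ): bool {
	$stmt = $db->prepare(
		'SELECT COUNT(*) AS failures,
		        DATEDIFF(MAX(checked_at), MIN(checked_at)) AS span_days
		 FROM link_checks
		 WHERE url = ? AND reported_dead = 1'
	);
	$stmt->execute( [ $url ] );
	$row = $stmt->fetch( PDO::FETCH_ASSOC );
	return (int)$row['failures'] >= 3 && (int)$row['span_days'] >= 2;
}
</syntaxhighlight>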

Authors