SchlurcherBot

Function overview: Convert links from http:// to https://

Programming language: C#

Source code available: Main C# script: commons:User:SchlurcherBot/LinkChecker

Function details: The link checking algorithm is as follows:

  1. The bot extracts all http-links from the parsed HTML code of a page
    • It searches for all href attributes and extracts the links
    • It does not search the wikitext and therefore does not rely on any regular expressions
    • This also avoids problems with templates that modify links (such as archiving templates)
    • Links that are substrings of other links are filtered out to minimize search-and-replace errors
  2. The bot checks if the identified http-links also occur in the wikitext, otherwise they are skipped
  3. The bot checks whether both the http-link and the corresponding https-link are accessible
    • This step also uses a blacklist of domains that were previously identified as not accessible
  4. If both links redirect to the same page, the http-link will be replaced by the https-link (the original link path is kept; the link is not changed to the redirect target)
  5. If both links are accessible and return a success code (2xx), the bot checks whether the content is identical
    1. If the content is identical and the link points directly to the host, the http-link will be replaced by the https-link
    2. If the content is identical but the link does not point directly to the host, the content is additionally compared to the host's page; the http-link will be replaced only if the two differ
      • This step is needed because some hosts return the same content for all their pages (like most ___domain sellers, some news sites, or pages under ongoing maintenance)
    3. If the content is not identical, the bot checks whether it is at least 99.9% identical (calculated via the en:Levenshtein distance)
      • This step is needed because most homepages use dynamic IDs for certain elements, for example ad containers that circumvent ad blockers
    4. If the content is at least 99.9% identical, the same host check as before is performed
    5. If any of the checked links fails (e.g. with a 404 status code), nothing happens
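The decision logic of steps 4–5 can be sketched in C# as a pure function. This is an illustrative reconstruction, not the bot's actual source: the names `Levenshtein`, `AlmostIdentical`, and `ShouldReplace` are assumptions, and the real bot fetches the pages itself rather than receiving status codes and contents as parameters.

```csharp
using System;

// Hedged sketch (names are assumptions, not the bot's actual code) of the
// decision in steps 4-5: may an http-link be replaced by its https-twin,
// given the fetched status codes and page contents?
static class LinkDecision
{
    // Classic dynamic-programming Levenshtein distance (two-row variant).
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }

    // Contents count as "identical enough" when they match at least 99.9 %,
    // tolerating dynamic IDs such as ad-container elements.
    public static bool AlmostIdentical(string a, string b)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        if (maxLen == 0) return true;
        double similarity = 1.0 - (double)Levenshtein(a, b) / maxLen;
        return similarity >= 0.999;
    }

    // Replace only if both links succeed (2xx), the contents match, and --
    // for non-host links -- the content differs from the host's own page
    // (guards against domains that serve the same page for every path).
    public static bool ShouldReplace(
        int httpStatus, int httpsStatus,
        string httpContent, string httpsContent,
        bool isHostLink, string hostContent)
    {
        bool bothOk = httpStatus / 100 == 2 && httpsStatus / 100 == 2;
        if (!bothOk) return false;                        // step 5.5: any failure -> do nothing
        if (!AlmostIdentical(httpContent, httpsContent))
            return false;                                 // neither identical nor >= 99.9 %
        if (isHostLink) return true;                      // step 5.1
        return !AlmostIdentical(httpsContent, hostContent); // steps 5.2 / 5.4
    }
}
```

For example, a ___domain seller returning the same page for the link and for the bare host would yield `ShouldReplace(200, 200, page, page, false, page) == false`, so the link is left untouched.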

Source for pages: The bot works through the list of pages identified via the external-links SQL dump. The list was scrambled to ensure that subsequent edits are not clustered in a specific area.
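Scrambling a page list so that consecutive edits spread across the whole wiki can be done with a standard Fisher–Yates shuffle. A minimal sketch, assuming the pages are held in an in-memory list (the class and method names are hypothetical, not taken from the bot's source):

```csharp
using System;
using System.Collections.Generic;

// Hedged sketch (assumed, not the bot's actual code): scramble the page list
// from the external-links SQL dump with a Fisher-Yates shuffle, so that
// consecutive edits do not cluster in one part of the wiki.
static class PageListScrambler
{
    public static void Shuffle<T>(IList<T> pages, Random rng)
    {
        // Walk from the end; swap each element with a uniformly chosen
        // earlier (or same) position. Every permutation is equally likely.
        for (int i = pages.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);   // uniform index in [0, i]
            (pages[i], pages[j]) = (pages[j], pages[i]);
        }
    }
}
```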

Status

Area       Language  Request             Pages       Status
Commons    Commons   Approved            31'145'089  Running…
Wikipedia  De        Approved             1'888'381  Running…
Wikipedia  En        Approved             8'570'327  Running…
Wikipedia  Es        Pending              2'191'542  Waiting
Wikipedia  Fr        Approved             2'970'187  Running…
Wikipedia  It        Approved             2'359'233  Running…
Wikipedia  Ja        Allows global bots     994'375  Working
Wikipedia  Pl        Approved             1'527'763  Running…
Wikipedia  Pt        Pending              1'214'889  Waiting
Wikipedia  Ru        Allows global bots   1'797'992  Working
Wikipedia  Zh        Allows global bots   1'105'051  Working