Wikipedia:Bots/Requests for approval/ScannerBot: Difference between revisions

{{cob}}
 
*:Note: The functionality and the scope of the bot were made more specific. See page history for more details. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 06:28, 14 May 2022 (UTC)
*::Regex? [[User:Primefac|Primefac]] ([[User talk:Primefac|talk]]) 15:13, 14 May 2022 (UTC)
*:::{{re|Primefac}} You can look at the [https://gist.github.com/fee1-dead/8428cd954b55d83043f94a1753e91a18 gist] I linked. <code><nowiki>https://twitter\.com/\w+/status/\d+\?[^\s}<|]+</nowiki></code> is used to match the URL; urllib is then used to parse it and remove the query parameters. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 15:19, 14 May 2022 (UTC)
*: You'll want to detect primary URLs, or skip archive URLs, because changing those will break them. There are 20+ types of archive URL, so it's probably easiest to check whether the Twitter URL is preceded by "/" (example in [[Brandon Clarke]]). -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 16:15, 14 May 2022 (UTC)
*::Yeah, I should probably match {{code|[^/]}} or <code><nowiki>[\s=>]</nowiki></code> for it to be primary. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 02:07, 15 May 2022 (UTC)
*:::Great, thanks. Also WebCite, like <code><nowiki>https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com</nowiki></code>. A couple of others use <code><nowiki>?url=</nowiki></code> instead of "/" as the break point. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 03:12, 15 May 2022 (UTC)
*::::{{re|GreenC}} Hmm, then it would be hard to distinguish a template parameter from a URL parameter in a URL...
*::::<code>{{green|<nowiki>{{Foo|1=https://twitter.com}}</nowiki>}}</code>
*::::<code>{{red|<nowiki>https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com</nowiki>}}</code> [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 04:03, 15 May 2022 (UTC)
*:::::Right, I can't say what the regex would be. One method is to match every string "/https?://twitter" and convert it to "__hidestring__" (same with "?url="), then convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert back. Or, best of all, save the literal string in a table and use the table identifier as the hidden string so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter", which will capture all hostname(s) such as "/http://beta.twitter" -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 17:33, 15 May 2022 (UTC)
*::::::Okay I used a negative lookbehind and you can look at the tests here: https://regexr.com/6lmgl [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 23:18, 15 May 2022 (UTC)
*::::::<code><nowiki>(?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+</nowiki></code> [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 04:25, 16 May 2022 (UTC)
*:::::::Nice. Very rarely there are also protocol-relative URLs ([[WP:PRURL]]), e.g. <code><nowiki>{{cite web |url=//twitter.com}}</nowiki></code>. They are so uncommon, and can be so tricky, that it would probably be OK to skip or log them if they don't fit the regex. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 05:21, 16 May 2022 (UTC)
*::::::::[https://en.wikipedia.org/w/index.php?search=insource%3A%2F%5B%3D%5Cs%7C%5C%5B%3E%7B%5D%5C%2F%5C%2Ftwitter%2F&title=Special:Search&profile=advanced&fulltext=1&ns0=1 A quick search] seems to show that skipping them is fine. I've fixed all three that appeared from that search. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 06:52, 16 May 2022 (UTC)
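
For readers following the thread, the approach discussed above (match primary tweet URLs with a negative lookbehind, then strip the query string with urllib) might look roughly like the sketch below. It is not the bot's actual code; the linked gist is the authoritative version. The names <code>PRIMARY_TWEET_URL</code>, <code>strip_query</code> and <code>clean_wikitext</code> are hypothetical. One adaptation is assumed: Python's <code>re</code> module only accepts fixed-width lookbehinds, so the single lookbehind from the thread is split into three.

```python
import re
from urllib.parse import urlsplit

# Pattern from the discussion. The original `(?<!\?url=|/|cache:)` mixes
# alternatives of different widths, which Python's `re` rejects, so each
# alternative gets its own fixed-width negative lookbehind here.
PRIMARY_TWEET_URL = re.compile(
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)


def strip_query(match):
    """Return the matched URL with its query string removed."""
    parts = urlsplit(match.group(0))
    # Keep scheme, host, path and any fragment; drop only the query.
    return parts._replace(query="").geturl()


def clean_wikitext(text):
    """Strip query parameters from primary tweet URLs in wikitext.

    URLs embedded in archive links (preceded by "/", "?url=" or "cache:")
    are left untouched, as requested in the thread.
    """
    return PRIMARY_TWEET_URL.sub(strip_query, text)


if __name__ == "__main__":
    sample = (
        "{{cite web |url=https://twitter.com/user/status/123?s=20&t=abc |title=x}}\n"
        "https://web.archive.org/web/2020/https://twitter.com/user/status/123?s=20\n"
        "https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com/user/status/123?s=20\n"
    )
    print(clean_wikitext(sample))
```

Only the first line of the sample is rewritten; the archive.org and WebCite lines are skipped because the tweet URL is preceded by "/" and "?url=" respectively. Protocol-relative URLs such as <code>//twitter.com</code> never match, since the pattern requires a literal <code>https:</code>.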