Wikipedia:Bots/Requests for approval/ScannerBot: Difference between revisions
*Note: The functionality and scope of the bot were made more specific. See the page history for details. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 06:28, 14 May 2022 (UTC)
*:Regex? [[User:Primefac|Primefac]] ([[User talk:Primefac|talk]]) 15:13, 14 May 2022 (UTC)
*::{{re|Primefac}} You can look at the [https://gist.github.com/fee1-dead/8428cd954b55d83043f94a1753e91a18 gist] I linked. <code><nowiki>https://twitter\.com/\w+/status/\d+\?[^\s}<|]+</nowiki></code> is used to match the URL; urllib is then used to parse it and remove the parameters. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 15:19, 14 May 2022 (UTC)
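The match-then-parse step described in the reply above might look roughly like the following. This is only a sketch, not the code from the linked gist: the function names <code>strip_query</code> and <code>clean_wikitext</code> are hypothetical, and it assumes that "remove the parameters" means dropping the whole query string while leaving any fragment untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Pattern quoted in the reply above: a tweet URL followed by a query string.
TWEET_URL = re.compile(r"https://twitter\.com/\w+/status/\d+\?[^\s}<|]+")


def strip_query(url: str) -> str:
    """Return the URL without its query string (hypothetical helper).

    Assumption: every parameter is dropped; a fragment, if any, is kept.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def clean_wikitext(text: str) -> str:
    """Replace each matched tweet URL with its parameter-free form."""
    return TWEET_URL.sub(lambda m: strip_query(m.group(0)), text)


if __name__ == "__main__":
    sample = "{{cite web |url=https://twitter.com/user/status/123?s=20&t=abc |title=x}}"
    print(clean_wikitext(sample))
```

Note that this pattern alone does not distinguish primary URLs from tweet URLs embedded in archive links.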
* You'll want to detect primary URLs, or skip archive URLs, because changing those will break them. There are 20+ types of archive URL, so it's probably easiest to detect whether the twitter URL is preceded by "/" (example in [[Brandon Clarke]]). -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 16:15, 14 May 2022 (UTC)
*:Yeah, I should probably match {{code|[^/]}} or <code><nowiki>[\s=>]</nowiki></code> for it to be primary. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 02:07, 15 May 2022 (UTC)
*::::Right, I can't say what the regex would be. One method is to match every string "/https?://twitter" and convert it to "__hidestring__" (same with "?url="), then convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert it. Or, best of all, save the literal string in a table and use the table identifier as the hidden string so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter", which will capture all hostnames such as "/http://beta.twitter". -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 17:33, 15 May 2022 (UTC)
*:::::Okay, I used a negative lookbehind; you can look at the tests here: https://regexr.com/6lmgl [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 23:18, 15 May 2022 (UTC)
*:::::<code><nowiki>(?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+</nowiki></code> [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 04:25, 16 May 2022 (UTC)
*::::::Nice. Very rarely there are also protocol-relative URLs ([[WP:PRURL]]), e.g. <code><nowiki>{{cite web |url=//twitter.com}}</nowiki></code>. They are so uncommon, and can be tricky enough, that it would probably be OK to skip or log them if they don't fit the regex. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 05:21, 16 May 2022 (UTC)
*:::::::[https://en.wikipedia.org/w/index.php?search=insource%3A%2F%5B%3D%5Cs%7C%5C%5B%3E%7B%5D%5C%2F%5C%2Ftwitter%2F&title=Special:Search&profile=advanced&fulltext=1&ns0=1 A quick search] seems to show that this is fine. I've fixed all three that appeared in that search. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 06:52, 16 May 2022 (UTC)
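For readers following the thread, the lookbehind pattern and the skip-or-log suggestion for protocol-relative links could be combined along these lines. This is a sketch under assumptions, not the bot's actual code. Python's <code>re</code> module rejects a lookbehind whose alternatives differ in length, so the single group <code>(?<!\?url=|/|cache:)</code> is written here as three consecutive lookbehinds, which should behave the same way. The names <code>strip_params</code> and <code>find_protocol_relative</code> are hypothetical, and the second pattern simply mirrors the character class used in the search linked above.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Same exclusions as the regex in the thread, split into three fixed-width
# lookbehinds because Python's re does not accept (?<!\?url=|/|cache:).
PRIMARY_TWEET = re.compile(
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)

# Protocol-relative links (//twitter.com/...) preceded by one of = whitespace | [ > {
# Assumption: these are only reported, never rewritten.
PROTOCOL_RELATIVE = re.compile(r"(?<=[=\s|\[>{])//twitter\.com/[^\s}<|\]]+")


def strip_params(text: str) -> str:
    """Drop the query string from primary tweet URLs; leave embedded ones alone."""
    def _clean(match: re.Match) -> str:
        p = urlsplit(match.group(0))
        return urlunsplit((p.scheme, p.netloc, p.path, "", p.fragment))
    return PRIMARY_TWEET.sub(_clean, text)


def find_protocol_relative(text: str) -> list[str]:
    """Return protocol-relative tweet links so they can be logged for review."""
    return PROTOCOL_RELATIVE.findall(text)


if __name__ == "__main__":
    primary = "|url=https://twitter.com/a/status/1?s=20 |title=x"
    archived = "https://web.archive.org/web/2020/https://twitter.com/a/status/1?s=20"
    print(strip_params(primary))    # query string removed
    print(strip_params(archived))   # unchanged: preceded by "/"
    print(find_protocol_relative("{{cite web |url=//twitter.com/a/status/1}}"))
```

Logging the protocol-relative matches rather than rewriting them keeps the rare cases visible for manual review.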