Wikipedia:Bots/Requests for approval/ScannerBot: Difference between revisions

m script name change (will not change again, if the script is renamed again FTT will remain as a meaningless abbreviation)
 
(28 intermediate revisions by 5 users not shown)
Line 1:
<noinclude>[[Category:Approved Wikipedia bot requests for approval|ScannerBot]]</noinclude><div class="boilerplate metadata" style="background-color:
#EAFFEA; margin:2em 0 0 0; padding:0 10px 0 10px; border:1px solid #AAAAAA;">
:''The following discussion is an archived debate. <span style="color:red">'''Please do not modify it.'''</span> To request review of this BRFA, please start a new section at [[Wikipedia:Bots/Noticeboard]].'' The result of the discussion was {{BotApproved}}<!-- from Template:Bot Top-->
==[[User:ScannerBot|ScannerBot]]==
{{BRFA help}}
Line 32 ⟶ 34:
 
<!-- Should be a reasonable guess as to how many distinct pages you'll be editing. For open-ended tasks, estimate pages per some reasonable time period.-->
'''Estimated number of pages affected:''' <3000 per [https://en.wikipedia.org/w/index.php?search=insource%3A%2Ftwitter.com%5C%2F%5Ba-zA-Z_%5D%2B%5C%2Fstatus%5C%2F%5B0-9%5D%2B%5C%2F%3F%5C%3F%28cxt%7Cs%7Ct%29%3D%2F&title=Special%3ASearch&go=Go&ns0=1 this query]
'''Estimated number of pages affected:''' Probably 10000+
 
<!-- Which namespace(s) will the bot edit? Mainspace/Articles, Categories, Files, ...-->
Line 44 ⟶ 46:
 
<!-- List full and complete function details here. Please be precise and explicit, describing all changes the bot would make, including those "bundled" with bot frameworks like AWB or pywikipedia genfixes. Bots cannot be approved for open-ended tasks, so ensure the details cover all cases. Consider making several BRFAs for tasks with large independent changes. Vague or incomplete details can delay the BRFA process. Straight-forward details and examples will speed up the approval. If you need to modify these details after a discussion starts, it is recommended that you use del tags – <del></del> – to remove text and ins tags – <ins></ins> – for additions (this preserves coherence of the discussion), or show/hide boxes as you see fit. -->
'''Function details:''' Finds twitter.com URLs and removes the query parameters named {{code|s}}, {{code|t}}, or {{code|cxt}}.
 
===Discussion===
Line 50 ⟶ 52:
<!-- This is not a vote. It is a discussion -->
 
{{cot|title=Comments before task change}}
{{comment}} if a bot account is needed, I will probably use {{u|ScannerBot}}. [[User:0xDeadbeef|<span style="font-family: Fira Code, Fira Mono, JetBrains Mono, Noto Mono, Courier New, monospace">0xDEADBEEF</span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 01:51, 5 May 2022 (UTC)
* {{TakeNote}} This bot appears to have edited since this BRFA was filed. Bots may not edit outside their own or their operator's userspace unless approved or approved for trial. [[User:AnomieBOT|AnomieBOT]][[User talk:AnomieBOT|<span style="color:#880">⚡</span>]] 10:53, 5 May 2022 (UTC) <small>— [[User:AnomieBOT|AnomieBOT]] ([[User talk:AnomieBOT|talk]]&#32;• [[Special:Contributions/AnomieBOT|contribs]]) has made [[Wikipedia:Single-purpose account|few or no other edits]] outside this topic. </small>
Line 62 ⟶ 64:
{{cob}}
 
*:Note: The functionality and the scope of the bot was made more specific. See page history for more details. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 06:28, 14 May 2022 (UTC)
*::Regex? [[User:Primefac|Primefac]] ([[User talk:Primefac|talk]]) 15:13, 14 May 2022 (UTC)
*:::{{re|Primefac}} You can look at the [https://gist.github.com/fee1-dead/8428cd954b55d83043f94a1753e91a18 gist] I linked. <code><nowiki>https://twitter\.com/\w+/status/\d+\?[^\s}<|]+</nowiki></code> is used to match the URL, and then urllib is used to parse, and then remove the parameters. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 15:19, 14 May 2022 (UTC)
::::You'll likely want <code>https:\/\/twitter\.com\/\w+\/status\/\d+\?[^\s}<|]+</code> for regex, to escape the <code>/</code> characters. (Same for below). &#32;<span style="font-variant:small-caps; whitespace:nowrap;">[[User:Headbomb|Headbomb]] {[[User talk:Headbomb|t]] · [[Special:Contributions/Headbomb|c]] · [[WP:PHYS|p]] · [[WP:WBOOKS|b]]}</span> 01:13, 17 May 2022 (UTC)
* You'll want to detect primary URLs, or skip archive URLs, changing those will break them. Archive URLs can be 20+ types, it's probably easiest to detect if the twitter URL starts with "/" (example in [[Brandon Clarke]]). -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 16:15, 14 May 2022 (UTC)
:::::I embedded the regex as a Python raw string, which does not need to escape forward slashes. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 01:17, 17 May 2022 (UTC)
::::::But dots still need escaping? &#32;<span style="font-variant:small-caps; whitespace:nowrap;">[[User:Headbomb|Headbomb]] {[[User talk:Headbomb|t]] · [[Special:Contributions/Headbomb|c]] · [[WP:PHYS|p]] · [[WP:WBOOKS|b]]}</span> 01:56, 17 May 2022 (UTC)
*::Great, thanks. Also WebCite like <code><nowiki>https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com</nowiki></code> .. couple others use <code><nowiki>?url=</nowiki></code> vs. "/" as the break point. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 03:12, 15 May 2022 (UTC)
:::::::Yes because {{code|.}} and {{code|\.}} have different meanings in regex. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 02:30, 17 May 2022 (UTC)
::::::::I know. Just surprised one needs escaping and the other doesn't. Not important, if the code works, it works. &#32;<span style="font-variant:small-caps; whitespace:nowrap;">[[User:Headbomb|Headbomb]] {[[User talk:Headbomb|t]] · [[Special:Contributions/Headbomb|c]] · [[WP:PHYS|p]] · [[WP:WBOOKS|b]]}</span> 10:24, 17 May 2022 (UTC)
:::::::::@[[User:Headbomb|Headbomb]], for what it's worth, I believe it's because some non-python RegEx is enclosed in / . . . /, so <code><nowiki>/</nowiki></code> needs to be escaped, but in python RegEx is just given as a string ' . . . '&nbsp;&#8213;&nbsp;<span id="Qwerfjkl:1653834173516:WikipediaFTTCLNBots/Requests_for_approval/ScannerBot" class="FTTCmt">[[User:Qwerfjkl|<span style="background:#1d9ffc; color:white; padding:5px; box-shadow:darkgray 2px 2px 2px;">Qwerfjkl</span>]][[User talk:Qwerfjkl|<span style="background:#79c0f2;color:white; padding:2px; box-shadow:darkgray 2px 2px 2px;">talk</span>]] 14:22, 29 May 2022 (UTC)</span>
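The match-then-parse approach described in this thread can be sketched in Python as follows. This is a minimal illustration, not the bot's actual code (that lives in the linked gist): the names <code>TRACKING_PARAMS</code>, <code>strip_tracking</code> and <code>clean_wikitext</code> are hypothetical, the pattern is the one quoted above written as a raw string (so <code>/</code> is left unescaped while <code>.</code> is escaped), and the parameter list is taken from the function details.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the parameter names come from the BRFA's function details.
TRACKING_PARAMS = {"s", "t", "cxt"}

# Raw string: forward slashes need no escaping in Python, but "." does,
# because an unescaped "." would match any character.
TWEET_URL = re.compile(r"https://twitter\.com/\w+/status/\d+\?[^\s}<|]+")


def strip_tracking(url: str) -> str:
    """Return url with the tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # Note: urlencode may normalise the percent-encoding of the kept values.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_wikitext(text: str) -> str:
    """Rewrite every matching tweet URL found in a block of wikitext."""
    return TWEET_URL.sub(lambda match: strip_tracking(match.group(0)), text)
```

For example, <code>clean_wikitext</code> applied to a citation whose URL ends in <code>?s=20</code> leaves the rest of the template untouched, because the character class stops the match at whitespace, <code>}</code>, <code>&lt;</code> and <code>|</code>.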
* You'll want to detect primary URLs, or skip archive URLs, changing those will break them. Archive URLs can be 20+ types, it's probably easiest to detect if the twitter URL starts with "/" (example in [[Brandon Clarke]]). -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 16:15, 14 May 2022 (UTC)
*:Yeah, I should probably match {{code|[^/]}} or <code><nowiki>[\s=>]</nowiki></code> for it to be primary. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 02:07, 15 May 2022 (UTC)
*::Great, thanks. Also WebCite like <code><nowiki>https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com</nowiki></code> .. couple others use <code><nowiki>?url=</nowiki></code> vs. "/" as the break point. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 03:12, 15 May 2022 (UTC)
*::::{{re|GreenC}} Hmm, then it would be hard to distinguish a template parameter from a URL parameter in an URL...
*::::<code>{{green|<nowiki>{{Foo|1=https://twitter.com}}</nowiki>}}</code>
::::<code>{{red|<nowiki>https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com</nowiki>}}</code> [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 04:03, 15 May 2022 (UTC)
:::::Right, I can't say what the regex would be. One method is match every string "/https?://twitter" and convert to "__hidestring__" (same with "?url=") - and when done convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert back. Or really best, save the literal string in a table and the hidden string is the table identifier so it be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter" which will capture all hostname(s) such as "/http://beta.twitter" -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 17:33, 15 May 2022 (UTC)
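The hide-and-restore method described in the comment above, in its "save the literal string in a table" variant, could look roughly like this in Python. This is a sketch under assumptions: the function names <code>hide</code> and <code>restore</code>, the placeholder format, and the simplified hostname pattern are all illustrative and are not taken from any bot's code.

```python
import re

# Assumption: an embedded (archived) Twitter URL is one preceded by "/" or
# "?url=", as suggested in the discussion. The hostname part is a simplified
# stand-in for the longer pattern quoted above.
EMBEDDED = re.compile(
    r"(?:/|\?url=)https?://(?:[a-zA-Z0-9-]+\.)*twitter\.com[^\s}<|]*"
)


def hide(text):
    """Replace each embedded URL with a placeholder and return (text, table)."""
    table = {}

    def stash(match):
        key = f"__hidestring-{len(table)}__"
        table[key] = match.group(0)  # keep the literal string for restoring
        return key

    return EMBEDDED.sub(stash, text), table


def restore(text, table):
    """Put the hidden literal strings back before saving the page."""
    for key, original in table.items():
        text = text.replace(key, original)
    return text
```

Any rewriting of primary URLs would run between <code>hide</code> and <code>restore</code>, so archive URLs are never seen by the rewriting step.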
::::::Okay I used a negative lookbehind and you can look at the tests here: https://regexr.com/6lmgl [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 23:18, 15 May 2022 (UTC)
::::::<code><nowiki>(?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+</nowiki></code> [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 04:25, 16 May 2022 (UTC)
*:::::::Nice. There is also sometimes very rarely protocol relative ([[WP:PRURL]]) eg. <code><nowiki>{{cite web |url=//twitter.com}}</nowiki></code>. They are so uncommon and can be tricky it would probably be OK to skip or log them if it doesn't fit with the regex. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 05:21, 16 May 2022 (UTC)
::::::::[https://en.wikipedia.org/w/index.php?search=insource%3A%2F%5B%3D%5Cs%7C%5C%5B%3E%7B%5D%5C%2F%5C%2Ftwitter%2F&title=Special:Search&profile=advanced&fulltext=1&ns0=1 a quick search] seems to show that it is fine. I've fixed all three that appeared from that search. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 06:52, 16 May 2022 (UTC)
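The negative-lookbehind idea from this thread can be tried out in Python with the sketch below. One caveat, stated as an assumption about tooling rather than about the bot: Python's standard <code>re</code> module only accepts fixed-width lookbehinds, so the alternation <code><nowiki>(?<!\?url=|/|cache:)</nowiki></code> is split here into three separate lookbehinds, which reject the same prefixes. The names <code>PRIMARY_TWEET</code> and <code>find_primary</code> are hypothetical.

```python
import re

# Three fixed-width lookbehinds in place of one variable-width alternation.
# A URL is treated as "primary" only if it is not directly preceded by
# "?url=", "/" or "cache:".
PRIMARY_TWEET = re.compile(
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)


def find_primary(text):
    """Return the primary tweet URLs that carry a query string."""
    return [match.group(0) for match in PRIMARY_TWEET.finditer(text)]
```

Protocol-relative URLs such as <code><nowiki>//twitter.com/...</nowiki></code> never match, since the pattern requires the literal <code>https:</code> scheme, which is consistent with the suggestion above to skip or log them.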
:{{TakeNote}} number of pages affected has been lowered following a quick search with {{code|insource:}}. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 04:23, 21 May 2022 (UTC)
:{{tl|BAG assistance needed}} Requesting BAG assistance due to stale BRFA. [[User:0xDeadbeef|<span style="font-family:Fira Mono,Courier New,monospace">0x<span style="text-transform:uppercase">Deadbeef</span></span>]] <span style="font-family: serif">([[User talk:0xDeadbeef|T]] [[Special:Contributions/0xDeadbeef|C]])</span> 05:08, 27 May 2022 (UTC)
::To be clear: This BRFA has been inactive for some time. Primefac told me that they wanted input from other BAG members first. I would like to know if this is declined or approved for trial. Thanks. <span style="font-family:Iosevka,monospace">0x[[User talk:0xDeadbeef|<span style="text-transform:uppercase;color:black">'''Deadbeef'''</span>]]</span> 07:43, 28 May 2022 (UTC)
:::Looks fine to me for trial. All issues raised above appear addressed anyway. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 19:05, 29 May 2022 (UTC)
::{{BotTrial|edits=50}} Let's give it a try. —&#8239;[[User:The Earwig|<span style="opacity:0.8;">The</span>&nbsp;Earwig]]&nbsp;([[User talk:The Earwig|talk]]) 21:18, 30 May 2022 (UTC)
:::{{BotTrialComplete}} [https://en.wikipedia.org/wiki/Special:Contributions/ScannerBot] <span style="font-family:Iosevka, monospace">0x[[User talk:0xDeadbeef|<span style="text-transform:uppercase;color:black">'''Deadbeef'''</span>]]</span> 04:57, 31 May 2022 (UTC)
::::Deadbeef, checked [https://en.wikipedia.org/w/index.php?title=2022_Philippine_presidential_election&diff=prev&oldid=1090750033 one edit] and noticed the Wayback link actually works with the tracker removed. Who knew. After all that above :) Wayback magic. But can't say this holds true for every link, it's the kind of thing would have to verify with a header check on the Wayback link with tracking removed. It would be like an added feature to the bot, only if you wanted to try. - [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 06:18, 31 May 2022 (UTC)
:::::So I tried querying the wayback machine api to fix archive.org URLs: [https://gist.github.com/fee1-dead/8428cd954b55d83043f94a1753e91a18] Looking at the preview of the bot's edits, it looks fine. Perhaps it needs an extended trial? <span style="font-family:Iosevka,monospace">0x[[User talk:0xDeadbeef|<span style="text-transform:uppercase;color:black">'''Deadbeef'''</span>]]</span> 08:01, 31 May 2022 (UTC)
::::::<small>(@[[User:The Earwig|The Earwig]]) <span style="font-family:Iosevka,monospace">0x[[User talk:0xDeadbeef|<span style="text-transform:uppercase;color:black">'''Deadbeef'''</span>]]</span> 11:52, 4 June 2022 (UTC)</small>
::::::That's great, as it checks there is a copy in the API, it should be good to go. -- [[User:GreenC|<span style="color: #006A4E;">'''Green'''</span>]][[User talk:GreenC|<span style="color: #093;">'''C'''</span>]] 15:35, 4 June 2022 (UTC)
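A check of the kind discussed here, asking the Wayback Machine whether a copy of the cleaned URL exists before rewriting an archive link, might be sketched as follows. The endpoint and response shape are assumed to be those of the Internet Archive's public availability API; the function names are hypothetical, and the bot's real implementation is in the linked gist and may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the request URL for a given target URL and optional timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss, closest match wanted
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the snapshot URL from a decoded response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url, timestamp=None):
    """Perform the network request (not exercised in any offline check)."""
    with urlopen(availability_query(url, timestamp), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

If <code>lookup</code> returns <code>None</code> for the URL with tracking parameters removed, the safe choice is to leave the existing archive link unchanged.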
::::::{{tl|BAG assistance needed}} <span style="font-family:Iosevka,monospace">0x[[User talk:0xDeadbeef|<span style="text-transform:uppercase;color:black">'''Deadbeef'''</span>]]</span> 05:28, 12 June 2022 (UTC)
:::::::{{BotApproved}} {{re|0xDeadbeef}} Thanks for your patience. Edits look good. I am fine with the expanded functionality for Wayback links and don't see a need for an extra trial provided you monitor these changes. —&#8239;[[User:The Earwig|<span style="opacity:0.8;">The</span>&nbsp;Earwig]]&nbsp;([[User talk:The Earwig|talk]]) 02:35, 13 June 2022 (UTC)
:''The above discussion is preserved as an archive of the debate. <span style="color:red">'''Please do not modify it.'''</span> To request review of this BRFA, please start a new section at [[Wikipedia:Bots/Noticeboard]].''<!-- from Template:Bot Bottom --></div>