Content deleted Content added
→Example contains an error.: Reply |
|||
Line 424:
:In the paper, “the index ''I'' of the original string ''S'' in the sorted list of rotations” is an additional output of the compression scheme that is needed as an input for decompression. The WP article instead uses the fact that $ being in the last column will identify M[I]. But I don't see why the start-of-text character ^ is used. [[User:Stw|stw]] ([[User talk:Stw|talk]]) 19:57, 23 January 2024 (UTC)
::I’m particularly confused as to why the start of string character is used in the bijective example [[User:NixCode|NixCode]] ([[User talk:NixCode|talk]]) 02:57, 31 January 2024 (UTC)
== "the transformed text will differ from the result of BWT by only one character per Lyndon word" - uh, no? ==
The article currently claims that the output of David Scott's bijective variant of BWT "will differ from the result of BWT by only one character per Lyndon word". There are a couple of problems here:
:1. It's not clear what this means. By what [[string metric]] are we measuring by how many characters the two outputs differ? Levenshtein distance? Hamming distance? Whatever you call the metric used by Myers diff that has insertions and deletions but not substitutions? Something else that involves transpositions? All of these metrics will yield a difference measured in number of characters - but not the same number!
:2. More importantly: the example immediately after the claim seems to clearly show that the claim is false, for ''any'' of the options above.
:Our example is the string <code>SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES</code> which, as shown in the article, consists of ''five'' Lyndon words. Applying the conventional BWT to it yields the output <code>TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT</code> (until recently noted earlier in the article - https://en.wikipedia.org/w/index.php?title=Burrows%E2%80%93Wheeler_transform&diff=1287024229&oldid=1286999401). Applying Scott's bijective variant, on the other hand, yields <code>STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT</code> These two outputs definitely "differ" by ''at least'' 6 characters - not 5 - even if we are generous and allow a cut and paste of one character to a different ___location in the output to count as a single character difference. If we instead take the edit distance measured in insertions and deletions, it's a difference of 12 characters. There's no sane way at all that I can see to conceive of the difference as being equal to five characters (the number of Lyndon words in the original string).
If I could, I'd dig into the source of the claim to try to figure out if there's a true underlying claim that has just been badly worded here on Wikipedia, and to then reflect that claim in the article... but the claim is uncited, always has been uncited, and was added by an anonymous user who appears to no longer be active on Wikipedia (https://en.wikipedia.org/w/index.php?title=Burrows%E2%80%93Wheeler_transform&diff=prev&oldid=863645216). So that idea is a complete dead end.
I'm just going to remove the claim, but wanted to write up my thoughts here first so that others can review my reasoning and tell me if I'm missing something. [[User:ExplodingCabbage|ExplodingCabbage]] ([[User talk:ExplodingCabbage|talk]]) 15:41, 30 April 2025 (UTC)
|