Lossless compression: Difference between revisions

Undid revision 1278387017 by 189.196.54.156 (talk); Unexplained removal of a template
 
(11 intermediate revisions by 8 users not shown)
Line 27:
For images, this step can be repeated by taking the difference to the pixel above, and in videos, the difference to the corresponding pixel in the next frame can additionally be taken.
 
A hierarchical version of this technique takes neighboring pairs of data points, stores their difference and sum, and on a higher level with lower resolution continues with the sums. This is called [[discrete wavelet transform]]. [[JPEG2000]] additionally uses data points from other pairs and multiplication factors to mix them into the difference. These factors must be integers, so that the result is an integer under all circumstances. So the values are increased, increasing file size, but the distribution of values could be more peaked.{{Citation needed|date=December 2007}} <!-- can someone please explain JPEG2000 for dummies!? Wavelets are fun if taken as continuous real valued function. But all this math for simple integer linear algebra? -->
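The following is a minimal Python sketch of one level of this pairwise sum-and-difference decomposition, using a Haar-style integer lifting step that is exactly invertible. It illustrates the general idea only, not the actual JPEG 2000 5/3 filter; the function names are invented for the example.

<syntaxhighlight lang="python">
def haar_step(data):
    """One level of pairwise integer sum/difference (Haar-style lifting).

    Returns (approximations, details): the coarse "sums" continue to the
    next, lower-resolution level; the small differences compress well.
    """
    approx, detail = [], []
    for i in range(0, len(data) - 1, 2):
        a, b = data[i], data[i + 1]
        d = b - a               # difference: usually small, peaked near 0
        s = a + (d >> 1)        # integer average; exactly invertible
        approx.append(s)
        detail.append(d)
    return approx, detail

def haar_inverse(approx, detail):
    """Exactly reconstruct the original even-length sequence (lossless)."""
    out = []
    for s, d in zip(approx, detail):
        a = s - (d >> 1)
        out.extend([a, a + d])
    return out

samples = [100, 102, 101, 105, 107, 106, 110, 111]
approx, detail = haar_step(samples)
assert haar_inverse(approx, detail) == samples
</syntaxhighlight>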
 
The adaptive encoding uses the probabilities from the previous sample in sound encoding, from the left and upper pixel in image encoding, and additionally from the previous frame in video encoding. In the wavelet transformation, the probabilities are also passed through the hierarchy.<ref name="Unser" />
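A minimal sketch of the adaptive idea, reduced to its simplest (order-0) form with no left/upper-pixel or previous-frame context and the entropy coder abstracted away as ideal code lengths; all names here are illustrative.

<syntaxhighlight lang="python">
import math
from collections import Counter

def adaptive_code_lengths(samples):
    """Order-0 adaptive model: each sample's probability is estimated
    from the samples already seen (add-one smoothing over 256 byte
    values), then the model is updated. A decoder repeats the same
    updates, so no probability table needs to be transmitted."""
    counts, seen, lengths = Counter(), 0, []
    for s in samples:
        p = (counts[s] + 1) / (seen + 256)   # estimate before seeing s
        lengths.append(-math.log2(p))        # ideal code length in bits
        counts[s] += 1                       # update the model afterwards
        seen += 1
    return lengths

# repeated values quickly become cheap to encode
print(sum(adaptive_code_lengths([7] * 100)))   # far fewer than 800 bits
</syntaxhighlight>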
Line 36:
Many of the lossless compression techniques used for text also work reasonably well for [[indexed image]]s, but there are other techniques that do not work for typical text yet are useful for some images (particularly simple bitmaps), and still others that take advantage of the specific characteristics of images (such as the common phenomenon of contiguous 2-D areas of similar tones, and the fact that color images usually contain a preponderance of a limited range of colors out of those representable in the color space).
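For instance, [[run-length encoding]] exploits exactly those contiguous areas of a single tone; the following is a toy Python sketch for one row of an indexed image (illustrative code, not any particular format's scheme).

<syntaxhighlight lang="python">
def rle_encode(row):
    """Run-length encode one row of an indexed image as [count, value]
    pairs; effective because of long runs of a single palette index."""
    runs = []
    for value in row:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1
        else:
            runs.append([1, value])
    return runs

def rle_decode(runs):
    return [value for count, value in runs for _ in range(count)]

row = [7] * 12 + [3] * 5 + [7] * 8        # 25 pixels in 3 runs
assert rle_decode(rle_encode(row)) == row
print(rle_encode(row))                    # [[12, 7], [5, 3], [8, 7]]
</syntaxhighlight>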
 
As mentioned previously, lossless sound compression is a somewhat specialized area. Lossless sound compression algorithms can take advantage of the repeating patterns shown by the wave-like nature of the {{nowrap|data{{px2}}{{mdash}}{{px2}}}}essentially using [[autoregressive]] models to predict the "next" value and encoding the (possibly small) difference between the expected value and the actual data. If the difference between the predicted and the actual data (called the ''error'') tends to be small, then certain difference values (like 0, +1, −1 etc. on sample values) become very frequent, which can be exploited by encoding them in few output bits.
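A minimal sketch of this predict-and-encode-the-error idea, using the simplest possible predictor (each sample is predicted to equal the previous one); real codecs fit higher-order autoregressive models, and the names below are illustrative.

<syntaxhighlight lang="python">
def encode_residuals(samples):
    """Order-1 predictor: guess that the next sample equals the
    previous one and keep only the (typically small) error."""
    prev, residuals = 0, []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def decode_residuals(residuals):
    prev, samples = 0, []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

wave = [0, 3, 6, 8, 9, 9, 8, 6, 3, 0]      # smooth, wave-like data
res = encode_residuals(wave)               # [0, 3, 3, 2, 1, 0, -1, -2, -3, -3]
assert decode_residuals(res) == wave
</syntaxhighlight>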
 
It is sometimes beneficial to compress only the differences between two versions of a file (or, in [[video compression]], of successive images within a sequence). This is called [[delta encoding]] (from the Greek letter [[delta (letter)|Δ]], which, in mathematics, denotes a difference), but the term is typically only used if both versions are meaningful outside compression and decompression. For example, while the process of compressing the error in the above-mentioned lossless audio compression scheme could be described as delta encoding from the approximated sound wave to the original sound wave, the approximated version of the sound wave is not meaningful in any other context.
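A toy sketch of delta encoding, under the simplifying assumption that both versions have equal length; real delta formats such as VCDIFF also handle insertions and deletions.

<syntaxhighlight lang="python">
def delta(old, new):
    """Byte-wise delta between two equal-length versions of a file;
    unchanged regions become runs of zeros, which compress well."""
    return bytes((n - o) % 256 for o, n in zip(old, new))

def apply_delta(old, d):
    """Reconstruct the new version from the old version plus the delta."""
    return bytes((o + x) % 256 for o, x in zip(old, d))

v1 = b"the quick brown fox"
v2 = b"the quick green fox"
d = delta(v1, v2)            # mostly zero bytes
assert apply_delta(v1, d) == v2
</syntaxhighlight>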
Line 85:
** [[Portable Network Graphics|PNG]] – Portable Network Graphics
** [[GIF]] – Graphics Interchange Format
* Lossy and Lossless encoding options
** [[AV1#AV1 Image File Format (AVIF)|AVIF]] – AV1 Image File Format
** [[FLIF]] – Free Lossless Image Format
Line 91:
** [[ILBM]] – (RLE compression of [[Amiga]] [[Interchange File Format|IFF]] images)
** [[JBIG2]] – compression of B&W images
** [[JPEG 2000]] – (via Le Gall–Tabatabai 5/3<ref>{{cite web |last1=Sullivan |first1=Gary |title=General characteristics and design considerations for temporal subband video coding |publisher=[[Video Coding Experts Group]] |website=[[ITU-T]] |date=8–12 December 2003 |url=https://www.itu.int/wftp3/av-arch/video-site/0312_Wai/VCEG-U06.doc |access-date=13 September 2019}}</ref><ref name="Unser">{{cite journal |last1=Unser |first1=M. |last2=Blu |first2=T. |title=Mathematical properties of the JPEG2000 wavelet filters |journal=IEEE Transactions on Image Processing |date=2003 |volume=12 |issue=9 |pages=1080–1090 |doi=10.1109/TIP.2003.812329 |pmid=18237979 |bibcode=2003ITIP...12.1080U |s2cid=2765169 |url=https://infoscience.epfl.ch/record/63104/files/unser0305.pdf}}</ref><ref>{{cite book |last1=Bovik |first1=Alan C. |title=The Essential Guide to Video Processing |date=2009 |publisher=[[Academic Press]] |isbn=9780080922508 |page=355 |url=https://books.google.com/books?id=wXmSPPB_c_0C&pg=PA355}}</ref> reversible integer [[wavelet transform]])
** [[Lossless JPEG#JPEG-LS|JPEG-LS]]
** [[JPEG XL]]
** [[JPEG XR]] – formerly ''WMPhoto'' and ''HD Photo''
** [[Discrete cosine transform|LDCT]] – Lossless [[Discrete Cosine Transform]]<ref>{{cite journal |last1=Ahmed |first1=Nasir |author1-link=N. Ahmed |last2=Mandyam |first2=Giridhar D. |last3=Magotra |first3=Neeraj |editor-first1=Arturo A. |editor-first2=Robert J. |editor-first3=Edward J. |editor-last1=Rodriguez |editor-last2=Safranek |editor-last3=Delp |title=DCT-based scheme for lossless image compression |journal=Digital Video Compression: Algorithms and Technologies 1995 |date=17 April 1995 |volume=2419 |pages=474–478 |doi=10.1117/12.206386 |publisher=International Society for Optics and Photonics |bibcode=1995SPIE.2419..474M |s2cid=13894279 }}</ref><ref>{{cite book |doi=10.1109/ICASSP.1998.681802 |chapter=Reversible discrete cosine transform |title=Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181) |date=1998 |last1=Komatsu |first1=K. |last2=Sezaki |first2=Kaoru |volume=3 |pages=1769–1772 vol.3 |isbn=0-7803-4428-6 |s2cid=17045923 |chapter-url=https://www.researchgate.net/publication/3747502}}</ref>
** [[PCX]] – PiCture eXchange
** [[QOI (image format)|QOI]] – Quite OK Image Format
Line 119:
===Executables===
{{main article|Executable compression}}
Self-extracting executables contain a compressed application and a decompressor. When executed, the decompressor transparently decompresses and runs the original application. This technique is used especially often in [[Demo (computer programming)|demo]] coding, where competitions are held for demos with strict size limits, as small as 1 [[kilobyte]].
This type of compression is not strictly limited to binary executables, but can also be applied to scripts, such as [[JavaScript]].
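A toy illustration of the self-extracting idea as a Python script rather than a packed binary executable; here the compressed payload is built inline so the example is self-contained, whereas a real packer would embed the pre-compressed bytes directly in the stub.

<syntaxhighlight lang="python">
import base64, zlib

# the "application": its source code, stored compressed inside the stub
# (a real packer would produce this ahead of time and embed the literal)
PAYLOAD = base64.b64encode(zlib.compress(b'print("hello from the payload")'))

# the decompressor stub: at run time it transparently restores and
# executes the original program
exec(zlib.decompress(base64.b64decode(PAYLOAD)))
</syntaxhighlight>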
 
Line 137:
 
== Limitations ==
Lossless data compression algorithms cannot guarantee compression for all input data sets. In other words, for any lossless data compression algorithm, there will be an input data set that does not get smaller when processed by the algorithm, and for any lossless data compression algorithm that makes at least one file smaller, there will be at least one file that it makes larger. This is easily proven with elementary mathematics using a counting argument called the [[pigeonhole principle]], as follows:{{Sfn|Sayood|2002|p=41}}<ref name=ISSEP-2015>{{cite book |doi=10.1007/978-3-319-25396-1_1 |chapter=Surprising Computer Science |title=Informatics in Schools. Curricula, Competences, and Competitions |series=Lecture Notes in Computer Science |publisher=[[Springer Nature|Springer]] |date=2015 |last1=Bell |first1=Tim |author1-link=Tim Bell (computer scientist) |volume=9378 |pages=1–11 |isbn=978-3-319-25395-1 |s2cid=263132834 }} See in particular [https://books.google.com/books?id=exicCgAAQBAJ&pg=PA9 pp. 8–9].</ref>
 
* Assume that each file is represented as a string of bits of some arbitrary length.
Line 148:
Most practical compression algorithms provide an "escape" facility that can turn off the normal coding for files that would become longer by being encoded. In theory, only a single additional bit is required to tell the decoder that the normal coding has been turned off for the entire input; however, most encoding algorithms use at least one full byte (and typically more than one) for this purpose. For example, [[deflate]] compressed files never need to grow by more than 5 bytes per 65,535 bytes of input.
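This worst-case bound can be observed directly with zlib's deflate implementation on incompressible input; the exact growth printed will vary by a few bytes.

<syntaxhighlight lang="python">
import os, zlib

data = os.urandom(5 * 65535)           # effectively incompressible input
out = zlib.compress(data, level=9)
print(len(out) - len(data))            # grows by only a handful of bytes

# deflate can fall back to "stored" (uncompressed) blocks of up to
# 65,535 bytes, each costing 5 bytes of header, so the worst-case
# growth stays near 5 bytes per 65,535 bytes of input, plus the fixed
# zlib header and checksum.
</syntaxhighlight>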
 
In fact, if we consider files of length N, if all files were equally probable, then for any lossless compression that reduces the size of some file, the expected length of a compressed file (averaged over all possible files of length N) must necessarily be ''greater'' than N.{{fact|date=January 2025}} So if we know nothing about the properties of the data we are compressing, we might as well not compress it at all. A lossless compression algorithm is useful only when we are more likely to compress certain types of files than others; then the algorithm could be designed to compress those types of data better.
 
Thus, the main lesson from the argument is not that one risks big losses, but merely that one cannot always win. To choose an algorithm always means implicitly to select a ''subset'' of all files that will become usefully shorter. This is the theoretical reason why we need to have different compression algorithms for different kinds of files: there cannot be any algorithm that is good for all kinds of data.
Line 161:
 
=== Mathematical background ===
Abstractly, a [[compression algorithm]] can be viewed as a [[Function (mathematics)|function]] on sequences (normally of octets). Compression is successful if the resulting sequence is shorter than the original sequence (and the instructions for the decompression map). For a compression algorithm to be [[lossless]], the compression map must form an [[Injective function|injection]] from "plain" to "compressed" bit sequences. There are 2<sup>''N''</sup> sequences of length ''N'' but only 2<sup>''N''</sup>&nbsp;−&nbsp;1 sequences of length strictly less than ''N'', so the pigeonhole principle prohibits any injection from the former collection into the latter. Therefore, it is not possible to produce a lossless algorithm that reduces the size of every possible input sequence.<ref>{{cite book |doi=10.1007/978-3-319-16250-8_3 |chapter=The Pigeonhole Principle |title=Proof Patterns |publisher=[[Springer Nature|Springer]] |date=2015 |last1=Joshi |first1=Mark |author1-link=Mark S. Joshi |pages=19–23 |isbn=978-3-319-16249-8 |chapter-url=https://books.google.com/books?id=Bn6dBwAAQBAJ&pg=PA21}}</ref>
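The counting that drives the argument can be checked by brute force for a small ''N''; this illustrative snippet merely counts sequences, it does not construct any particular compressor.

<syntaxhighlight lang="python">
from itertools import product

n = 8
length_n = sum(1 for _ in product("01", repeat=n))        # 2**n = 256
shorter  = sum(1 for k in range(n)
                 for _ in product("01", repeat=k))        # 2**n - 1 = 255

# 256 inputs but only 255 strictly shorter outputs: by the pigeonhole
# principle, any map sending every length-8 sequence to a shorter one
# must collide, so it cannot be losslessly inverted.
print(length_n, shorter)
</syntaxhighlight>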
 
=== Points of application in real compression theory ===
Line 171:
| first = Mark
| title = The Million Random Digit Challenge Revisited
| work = Mark Nelson
| date = 2006-06-20
| url = https://marknelson.us/posts/2006/06/20/million-digit-challenge.html }}</ref>
A similar challenge, with $5,000 as reward, was issued by Mike Goldman.<ref>{{cite web