Huffman coding

{{distinguish|Hamming code}}
{{Short description|Technique to compress data}}{{Use dmy dates|date=May 2019|cs1-dates=y}}
[[Image:Huffman tree 2.svg|thumb|Huffman tree generated from the exact frequencies of the text "this is an example of a huffman tree". The frequencies and codes of each character are below. Encoding the sentence with this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used. (This assumes that the code tree structure is known to the decoder and thus does not need to be counted as part of the transmitted information.)]]
{| class="wikitable sortable" style="float:right; clear:right; margin:0.5em 0 1.3em 1.4em"
!Char!!Freq!!Code
|-
|x ||1||10010
|}
 
In [[computer science]] and [[information theory]], a '''Huffman code''' is a particular type of optimal [[prefix code]] that is commonly used for [[Lossless compression|lossless data compression]]. The process of finding or using such a code is '''Huffman coding''', an algorithm developed by [[David A. Huffman]] while he was a [[Doctor of Science|Sc.D.]] student at [[Massachusetts Institute of Technology|MIT]], and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".<ref name=":0">{{Cite journal | last1 = Huffman | first1 = D. |author-link1=David A. Huffman| title = A Method for the Construction of Minimum-Redundancy Codes | doi = 10.1109/JRPROC.1952.273898 | journal = [[Proceedings of the IRE]]| volume = 40 | issue = 9 | pages = 1098–1101 | year = 1952 | url = http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf}}</ref>

The output from Huffman's algorithm can be viewed as a [[variable-length code]] table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (''weight'') for each possible value of the source symbol. As in other [[entropy encoding]] methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in time [[linear time|linear]] to the number of input weights if these weights are sorted.<ref>{{cite journal | first = Jan | last = Van Leeuwen | author-link = Jan van Leeuwen | url = http://www.staff.science.uu.nl/~leeuw112/huffman.pdf | title = On the construction of Huffman trees | journal = ICALP | year =1976 | pages = 382–410 | access-date = 20 February 2014}}</ref> However, although optimal among methods encoding symbols separately, Huffman coding [[#Optimality|is not always optimal]] among all compression methods - it is replaced with [[arithmetic coding]]<ref name="LiDrew2014">{{cite book|author1=Ze-Nian Li|author2=Mark S. Drew|author3=Jiangchuan Liu|title=Fundamentals of Multimedia|url=https://books.google.com/books?id=R6vBBAAAQBAJ|date=9 April 2014|publisher=Springer Science & Business Media|isbn=978-3-319-05290-8}}</ref> or [[asymmetric numeral systems]]<ref name=PCS2015>[https://ieeexplore.ieee.org/document/7170048 J. Duda, K. Tahboub, N. J. Gadil, E. J. Delp, ''The use of asymmetric numeral systems as an accurate replacement for Huffman coding''], Picture Coding Symposium, 2015.</ref> if a better compression ratio is required.
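The greedy construction just described (repeatedly merging the two lowest-weight subtrees, then reading codewords off the resulting tree) can be sketched in a few lines of Python. The function name and the nested-tuple tree representation are choices of this illustration, not part of Huffman's paper:

```python
import heapq
from itertools import count

def huffman_code(weights):
    """Build a Huffman code table from a {symbol: weight} mapping.

    Repeatedly merges the two lowest-weight subtrees and returns
    a {symbol: bitstring} dictionary.
    """
    tiebreak = count()  # unique counter keeps the heap from comparing trees
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single symbol gets code "0"
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # merged node's weight is the sum of its children's weights
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    _, _, root = heap[0]
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: label edges 0 and 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: record the symbol's codeword
            code[node] = prefix
    walk(root, "")
    return code
```

Because the heap always surfaces the two least-weighted subtrees, the most frequent symbol ends up nearest the root and so receives the shortest codeword, and no codeword is a prefix of another.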
 
== History ==
== Problem definition ==
{{More citations needed|date=December 2021}}
[[File:HuffmanCodeAlg.png|thumb|Constructing a Huffman tree]]
 
=== Informal description ===
;Given: A set of symbols <math>S</math> and for each symbol <math>x \in S</math>, the frequency <math>f_x</math> representing the fraction of symbols in the text that are equal to <math>x</math>.<ref name="kleinberg&Tardos">{{cite book |last1=Kleinberg |first1=Jon |last2=Tardos |first2=Eva |title=Algorithm Design |date=March 16, 2005 |publisher=[[Pearson Education]] |isbn=9780321295354 |page=165 |edition=1 |url=https://www.pearson.com/en-us/subject-catalog/p/algorithm-design/P200000003259/9780137546350 |access-date=26 January 2025}}</ref>
;Find: A [[Prefix code|prefix-free binary code]] (a set of codewords) with minimum [[Expected value|expected]] codeword length (equivalently, a tree with minimum [[weighted path length from the root]]).
 
<br />
'''Goal'''.<br />
Let <math display="inline">L\left(C\left(W\right)\right) = \sum_{i=1}^{n} w_i \operatorname{length}\left(c_i\right)</math> be the weighted path length of code <math>C</math>. Condition: <math>L\left(C\left(W\right)\right) \leq L\left(T\left(W\right)\right)</math> for any code <math>T\left(W\right)</math>.
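The weighted path length of any candidate code can be checked directly against this definition. A minimal sketch, with weights and codewords invented for illustration:

```python
def weighted_path_length(weights, codewords):
    """L(C(W)): sum of w_i * length(c_i) over all symbols."""
    return sum(w * len(codewords[sym]) for sym, w in weights.items())

weights = {'a': 0.5, 'b': 0.25, 'c': 0.25}
# the prefix code {0, 10, 11} beats a fixed-length 2-bit code here
assert weighted_path_length(weights, {'a': '0', 'b': '10', 'c': '11'}) == 1.5
assert weighted_path_length(weights, {'a': '00', 'b': '01', 'c': '10'}) == 2.0
```

An optimal code minimizes this quantity over all prefix-free codes for the given weights.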
 
=== Example ===
|rowspan="2"|&nbsp;
|-
!style="background:#efefef;font-weight:normal"| Codeword length (in bits)<br />({{math|''l''<sub>''i''</sub>}})
|align="center"| 3
|align="center"| 3
|align="center"| 2
|-
!style="background:#efefef;font-weight:normal"| Contribution to weighted path length<br />({{math|''l''<sub>''i''</sub> ''w''<sub>''i''</sub>}})
|align="center"| 0.30
|align="center"| 0.45
|-
!rowspan="3" style="background:#efefef"| Optimality
!style="background:#efefef;font-weight:normal"| Probability budget<br />({{math|2<sup>−''l''<sub>''i''</sub></sup>}})
| align="center" | 1/8
| align="center" | 1/8
|align="center"| &nbsp;
|-
! style="background: #efefef; font-weight: normal;" | Contribution to entropy<br />({{math|−''w''<sub>''i''</sub> log<sub>2</sub> ''w''<sub>''i''</sub>}})
|align="center"| 0.332
|align="center"| 0.411
The [[information entropy|entropy]] ''H'' (in bits) is the weighted sum, across all symbols {{math|''a''<sub>''i''</sub>}} with non-zero probability {{math|''w''<sub>''i''</sub>}}, of the information content of each symbol:
 
<math display="block"> H(A) = \sum_{w_i > 0} w_i h(a_i) = \sum_{w_i > 0} w_i \log_2 {1 \over w_i} = - \sum_{w_i > 0} w_i \log_2 w_i. </math>
 
(Note: A symbol with zero probability has zero contribution to the entropy, since <math>\lim_{w \to 0^+} w \log_2 w = 0</math>. So for simplicity, symbols with zero probability can be left out of the formula above.)
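The entropy formula, including the convention of skipping zero-probability symbols, translates directly into code. A small illustrative sketch:

```python
from math import log2

def entropy(weights):
    """H(A) = -sum of w_i * log2(w_i) over symbols with non-zero probability."""
    # zero-probability symbols contribute nothing, so they are simply skipped
    return -sum(w * log2(w) for w in weights if w > 0)

# probabilities 1/2, 1/4, 1/4 give exactly 1.5 bits per symbol;
# the trailing 0.0 entry is ignored
assert entropy([0.5, 0.25, 0.25, 0.0]) == 1.5
```

For dyadic probabilities like these, a Huffman code achieves the entropy exactly; in general it comes within one bit of it per symbol.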
 
===Compression===
[[File:Huffman_coding_visualisation.svg|thumb|upright=1.5|Visualisation of the use of Huffman coding to encode the message "A_DEAD_DAD_{{zwsp}}CEDED_A_BAD_{{zwsp}}BABE_A_BEADED_{{zwsp}}ABACA_BED". In steps 2 to 6, the letters are sorted by increasing frequency, and the least frequent two at each step are combined and reinserted into the list, and a partial tree is constructed. The final tree in step 6 is traversed to generate the dictionary in step 7. Step 8 uses it to encode the message.]]
 
[[Image:Huffman coding example.svg|thumb|A source generates 4 different symbols <math>\{a_1 , a_2 , a_3 , a_4 \}</math> with probability <math>\{0.4 ; 0.35 ; 0.2 ; 0.05 \}</math>. A binary tree is generated from left to right taking the two least probable symbols and putting them together to form another equivalent symbol having a probability that equals the sum of the two symbols. The process is repeated until there is just one symbol. The tree can then be read backwards, from right to left, assigning different bits to different branches. The final Huffman code is:
 
#Start with current node set to the root.
#If node is not a leaf node, label the edge to the left child as ''0'' and the edge to the right child as ''1''. Repeat the process at both the left child and the right child.
 
The final encoding of any symbol is then read by a concatenation of the labels on the edges along the path from the root node to the symbol.
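Decoding is the mirror of this edge labelling: follow the 0/1 edges from the root until a leaf is reached, emit that symbol, and restart at the root. A minimal decoder sketch, using a nested-pair tree representation assumed for this example:

```python
def decode(bits, root):
    """Walk the tree from the root: '0' goes left, '1' goes right;
    on reaching a leaf, emit its symbol and restart at the root."""
    out, node = [], root
    for b in bits:
        node = node[0] if b == '0' else node[1]
        if not isinstance(node, tuple):   # leaf reached
            out.append(node)
            node = root
    return ''.join(out)

# tree for the prefix code a=0, b=10, c=11
tree = ('a', ('b', 'c'))
assert decode('010110', tree) == 'abca'
```

Because the code is prefix-free, each leaf is reached unambiguously and no separators between codewords are needed.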
 
== Variations ==
Many variations of Huffman coding exist,<ref>{{cite book |first=J. |last=Abrahams |others=Division of Mathematics, Computer & Information Sciences, [[Office of Naval Research]] (ONR) |title=Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171) |chapter=Code and parse trees for lossless source encoding |___location=Arlington, VA, USA |pages=145–171 |date=1997-06-11 |isbn=0-8186-8132-2 |publication-place=Salerno |doi=10.1109/SEQUEN.1997.666911 |publisher=[[IEEE]] |citeseerx=10.1.1.589.4726 |s2cid=124587565 }}</ref> some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be [[polynomial time]].
 
=== ''n''-ary Huffman coding ===
The '''''n''-ary Huffman''' algorithm uses an alphabet of size ''n'', typically {0, 1, ..., ''n'' − 1}, to encode messages and build an ''n''-ary tree. This approach was considered by Huffman in his original paper. The same algorithm applies as for binary (<math alt="''n'' equals 2">n = 2</math>) codes, but instead of combining the two least likely symbols, the ''n'' least likely symbols are grouped together.
 
Note that for ''n'' > 2, not all sets of source words can properly form a complete ''n''-ary tree for Huffman coding. In these cases, additional placeholder symbols with 0 probability may need to be added, because the structure of the tree must repeatedly join ''n'' branches into one (an "''n'' to 1" combination). For binary coding, this is a "2 to 1" combination, which works with any number of symbols. For ''n''-ary coding, a complete tree is only possible when the total number of symbols (real and placeholder) leaves a remainder of 1 when divided by ''n'' − 1.<ref name=":0" />
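The padding rule can be computed directly: with ''q'' source symbols and arity ''n'', placeholders are added until the count is congruent to 1 modulo ''n'' − 1. A small sketch (the function name is illustrative):

```python
def num_placeholders(q, n):
    """Number of 0-probability placeholders needed so that the
    symbol count q becomes congruent to 1 modulo n - 1."""
    if n == 2:
        return 0          # binary: any number of symbols works
    r = (q - 1) % (n - 1)
    return 0 if r == 0 else (n - 1) - r

assert num_placeholders(6, 3) == 1   # 6 + 1 = 7 ≡ 1 (mod 2)
assert num_placeholders(5, 3) == 0   # 5 ≡ 1 (mod 2) already
assert num_placeholders(6, 4) == 1   # 6 + 1 = 7 ≡ 1 (mod 3)
```

Each merge step replaces ''n'' nodes with one, removing ''n'' − 1 nodes, which is why the count must reduce to exactly 1 modulo ''n'' − 1.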
 
=== Adaptive Huffman coding ===
A variation called '''[[adaptive Huffman coding]]''' involves calculating the probabilities dynamically based on recent actual frequencies in the sequence of source symbols, and changing the coding tree structure to match the updated probability estimates. It is used rarely in practice, since the cost of updating the tree makes it slower than optimized [[Arithmetic coding#Adaptive arithmetic coding|adaptive arithmetic coding]], which is more flexible and has better compression.{{citation needed|date=June 2025}}
 
=== Huffman template algorithm ===
In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is ''N'' digits will always have a cost of ''N'', no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.
 
''Huffman coding with unequal letter costs'' is the generalization without this assumption: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of [[Morse code]], where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding, though it has been solved by [[Richard M. Karp]],<ref>{{Cite journal |last=Karp |first=Richard M. |author-link=Richard M. Karp |title=Minimum-redundancy coding for the discrete noiseless channel |journal=IRE Transactions on Information Theory |publisher=IEEE |publication-date=31 January 1961 |volume=7 |issue=1 |pages=27–38 |doi=10.1109/TIT.1961.1057615 }}</ref> whose solution has been refined for the case of integer costs by Mordecai J. Golin.<ref>{{Cite journal |last=Golin |first=Mordekai J. |date=January 1998 |publication-date=1 September 1998 |title=A Dynamic Programming Algorithm for Constructing Optimal Prefix-Free Codes with Unequal Letter Costs |url=https://page.mi.fu-berlin.de/rote/Papers/pdf/A+dynamic+programming+algorithm+for+constructing+optimal+prefix-free+codes+for+unequal+letter+costs.pdf |access-date=10 September 2024 |s2cid=2265146 |doi=10.1109/18.705558 |journal=IEEE Transactions on Information Theory |volume=44 |issue=5 |pages=1770–1781}}</ref>
 
=== {{anchor|Hu-Tucker coding|Alphabetic Huffman coding|Alphabetic Huffman tree}}Optimal alphabetic binary trees (Hu–Tucker coding) ===
In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, <math>A = \left\{a,b,c\right\}</math> could not be assigned code <math>H\left(A,C\right) = \left\{00,1,01\right\}</math>, but instead should be assigned either <math>H\left(A,C\right) = \left\{00,01,1\right\}</math> or <math>H\left(A,C\right) = \left\{0,10,11\right\}</math>. This is also known as the '''Hu–Tucker''' problem, after [[T. C. Hu]] and [[Alan Tucker]], the authors of the paper presenting the first [[Time complexity#Quasilinear time|<math>O(n\log n)</math>-time]] solution to this optimal binary alphabetic problem,<ref>{{Cite journal | doi = 10.1137/0121057| title = Optimal Computer Search Trees and Variable-Length Alphabetical Codes| journal = SIAM Journal on Applied Mathematics| volume = 21| issue = 4| pages = 514| year = 1971| last1 = Hu | first1 = T. C.|author1-link=T. C. Hu |last2 = Tucker | first2 = A. C. | author2-link = Alan Tucker| jstor = 2099603}}</ref> which has some similarities to the Huffman algorithm but is not a variation of it. A later method, the [[Garsia–Wachs algorithm]] of [[Adriano Garsia]] and [[Michelle L. Wachs]] (1977), uses simpler logic to perform the same comparisons in the same total time bound. These optimal alphabetic binary trees are often used as [[binary search tree]]s.<ref>{{citation
| last = Knuth | first = Donald E. | author-link = Donald Knuth
| contribution = Algorithm G (Garsia–Wachs algorithm for optimum binary trees)
* [http://rosettacode.org/wiki/Huffman_coding Huffman coding in various languages on Rosetta Code]
* [https://gist.github.com/jasonrdsouza/1c9c895f43497d15eb2e Huffman codes (python implementation)]
* [https://github.com/gdavidbutler/canonicalHuffman Canonical Huffman codes (C implementation)]
* [https://www.tinyray.com/huffman A visualization of Huffman coding]
 
{{Compression Methods}}
[[Category:Lossless compression algorithms]]
[[Category:Binary trees]]
[[Category:Data compression]]