[[de:Huffman-Code]] [[ja:ハフマン符号]] [[nl:Huffmancodering]] [[pl:Kodowanie Huffmana]] [[fi:Huffmanin_koodaus]]
In [[computer science]], '''Huffman coding''' is an [[entropy encoding]] [[algorithm]] used for [[data compression]] that finds an optimal way of assigning bit strings to symbols based on the relative frequency of each symbol. It was developed by [[David A. Huffman]] as a student at [[MIT]] in [[1952]], and published in the paper ''A Method for the Construction of Minimum-Redundancy Codes''.
Huffman coding uses a specific method for choosing the representations for each symbol, resulting in a prefix-free code (that is, no bit string of any symbol is a prefix of the bit string of any other symbol) that expresses the most common characters in the shortest way possible. It has been proven that Huffman coding is the most effective compression method of this type: no other mapping of source symbols to strings of bits will produce a smaller output when the actual symbol frequencies agree with those used to create the code.
When the number of symbols is a [[power of two]] and all symbols are equally probable, Huffman coding is equivalent to simple binary block encoding.
== History ==
In 1951 David Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, [[Robert M. Fano]], assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree, and quickly proved this method the most efficient.
A similar idea had been used before, in [[Shannon-Fano coding]] (created by [[Claude Shannon]], the inventor of [[information theory]], and by Fano, Huffman's teacher), but Huffman fixed its major flaw by building the tree from the bottom up instead of from the top down.
== Basic technique ==
The technique works by creating a [[binary tree]] of symbols:
#Start with as many trees as there are symbols.
#While there is more than one tree:
##Remove the two trees of least weight.
##Combine them into a new tree whose weight is the sum of the weights of the two, and add it back to the set.
#The single remaining tree is the Huffman tree; each symbol's codeword is read off the path from the root to the symbol's leaf, taking one bit per branch (a short code sketch of this construction follows the list).
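The construction can be illustrated with a short sketch in Python; the function name <code>huffman_codes</code> and the sample input are illustrative choices rather than part of any standard implementation.

<pre>
import heapq
from collections import Counter

def huffman_codes(frequencies):
    """Build a Huffman code from a {symbol: weight} mapping."""
    # Each heap entry is (weight, tie-breaker, tree); a tree is either a
    # symbol (a leaf) or a pair of subtrees (an internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                      # merge until one tree remains
        w1, _, t1 = heapq.heappop(heap)       # the two trees of least weight
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):                   # label branches 0 and 1
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"       # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

text = "this is an example of a huffman tree"
print(huffman_codes(Counter(text)))
</pre>

Running the sketch prints one prefix-free codeword per character of the sample string, with the most frequent characters receiving the shortest codewords.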
== Variations ==
The frequencies used can be generic ones for the application ___domain that are based on average experience, or they can be the actual frequencies found in the text being compressed.
(This variation requires that a [[frequency table]] or some other hint about the encoding be stored with the compressed text; implementations employ various tricks to store these tables efficiently.)
A variation called "adaptive Huffman coding" calculates the frequencies dynamically based on recent actual frequencies in the source string. This is somewhat related to the [[LZ77|LZ]] family of algorithms.
Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only a way to order weights and to add them. The '''Huffman template algorithm''' takes advantage of this to allow non-numerical weights (costs, frequencies); for an example, see [http://alexvn.freeservers.com/s1/huffman_template_algorithm.html this implementation].
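As a small illustration of this generality (the <code>Weight</code> class below is a hypothetical non-numeric weight invented for the example), the merge loop can be written against any weight type that supports only comparison and addition:

<pre>
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Weight:
    """A made-up weight type: ordered by cost, ties broken by symbol count."""
    cost: float
    count: int
    def __lt__(self, other):
        return (self.cost, self.count) < (other.cost, other.count)
    def __add__(self, other):
        return Weight(self.cost + other.cost, self.count + other.count)

def merge_order(weighted_symbols):
    """Return the order in which trees are merged; only < and + are used on weights."""
    heap = [(w, i, s) for i, (s, w) in enumerate(weighted_symbols)]
    heapq.heapify(heap)
    merges, next_id = [], len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merges.append((a, b))
        heapq.heappush(heap, (w1 + w2, next_id, (a, b)))
        next_id += 1
    return merges

print(merge_order([("a", Weight(0.5, 1)), ("b", Weight(0.3, 1)), ("c", Weight(0.2, 1))]))
</pre>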
The '''n-ary Huffman''' algorithm uses the {0, 1, ..., n-1} alphabet to encode messages; the tree it builds is an n-ary tree.
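A sketch of the n-ary variant in the same style as above: padding the symbol set with zero-weight dummy leaves, so that every merge can combine exactly n trees and the final tree is full, is the usual device here; the function name and sample input are again illustrative.

<pre>
import heapq
from itertools import count

def nary_huffman_codes(frequencies, n=3):
    """Sketch of n-ary Huffman coding over the code alphabet {0, 1, ..., n-1}."""
    ids = count()
    heap = [(w, next(ids), sym) for sym, w in frequencies.items()]
    # Pad with zero-weight dummy leaves so the leaf count is congruent to 1
    # modulo n-1; then every merge can take exactly n trees.
    while (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(ids), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(n)]   # n trees of least weight
        total = sum(w for w, _, _ in group)
        heapq.heappush(heap, (total, next(ids), tuple(t for _, _, t in group)))
    codes = {}
    def walk(tree, prefix):                               # label branches 0 .. n-1
        if isinstance(tree, tuple):
            for digit, child in enumerate(tree):
                walk(child, prefix + str(digit))
        elif tree is not None:                            # ignore dummy leaves
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(nary_huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1}, n=3))
</pre>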
Extreme cases of Huffman codes are connected with [[Fibonacci number]]s; for example, see [http://mathforum.org/discuss/sci.math/t/207334 this discussion].
Huffman coding reaches the entropy limit of the source exactly when the probability of each input symbol is a negative power of two. Prefix-free codes tend to have slight inefficiency on small alphabets, where probabilities often fall between powers of two. Expanding the alphabet size by coalescing multiple symbols into "words" before Huffman coding can reduce this inefficiency.
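For example, a four-symbol source with probabilities 1/2, 1/4, 1/8 and 1/8 is assigned codewords of lengths 1, 2, 3 and 3, so the average code length coincides exactly with the entropy of the source:

:<math>\sum_i p_i \ell_i = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75 = -\sum_i p_i \log_2 p_i .</math>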
[[Arithmetic coding]] produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties ([[As of 2001|as of November 2001]], IBM owns patents on the core concepts of arithmetic coding in several jurisdictions).
== Applications ==
Huffman coding today is often used as a "back-end" to some other compression method.
[[DEFLATE]] ([[PKZIP]]'s algorithm) and multimedia [[codec]]s such as [[JPEG]] and [[MP3]] have a front-end model and [[quantization (signal processing)|quantization]] followed by Huffman coding.
== External links ==
*Huffman's original article: D.A. Huffman, "[http://compression.graphicon.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf A Method for the Construction of Minimum-Redundancy Codes]", ''Proceedings of the I.R.E.'', September 1952, pp. 1098–1102
*Background story: [http://www.huffmancoding.com/david/scientific.html Profile: David A. Huffman], [[Scientific American]], Sept. 1991, pp. 54-58