Huffman coding: Difference between revisions

Content deleted Content added
Seb (talk | contribs)
m turned refs to specialized concepts into links
 
No edit summary
Line 1:
<b>Huffman coding</b> is an [[algorithm]] used for [[data compression]].
 
 
 
The basic idea is borrowed from an older and slightly less efficient method called
 
[[Data compression/Shannon-Fano coding|Shannon-Fano coding]].
 
 
 
The text to be compressed is considered as a [[string]] of symbols. Symbols that are likely to be frequent are represented by a short sequence of [[bit]]s, and symbols that are likely to be rare are represented by a longer sequences of bits.
 
 
 
Huffman coding uses a specific method for choosing the representations for each symbol.
 
It has been proven that Huffman coding is the most effective compression method of this type, that is, no other mapping of source symbols to strings of bits will produce a smaller output when the actual symbol frequencies agree with those used to create the code.
 
Huffman coding is optimal when the frequencies of input characters are powers of two.
[[Data compression/Arithmetic coding|Arithmetic coding]] produces slight gains over Huffman coding,
 
[[Arithmetic coding]] produces slight gains over Huffman coding,
 
but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties (as of November 2001, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions).
 
 
 
Huffman works by creating a [[binary tree]] of symbols:
 
#Start with as many trees as there are symbols.
 
#While there is more than one tree:
 
##Find the two trees with the smallest total probability.
 
##Combine the trees into one, setting one as the left child and the other as the right.
 
#Now the tree contains all the symbols. A '0' represents following the left child; a '1' represents following the right child.
 
 
 
There are variations.
 
The frequencies used can be generic ones for the application ___domain that are based on average experience, or they can be the actual frequencies found in the text being compressed.
 
(This variation requires that a [[frequency table]] or other hint as to the encoding must be stored with the compressed text; implementations employ various tricks to store these tables efficiently.)
 
A variation called "adaptive Huffman coding" calculates the frequencies dynamically based on recent actual frequencies in the source string.
 
 
 
Huffman coding today is often used as a "back-end" to some other compression method.
 
[[Data compression/deflation|deflation]] (PKZIP's algorithm) and multimedia codecs such as [[JPEG]] and [[MP3]] have a front-end model and quantization followed by Huffman coding.