Prefix code: Difference between revisions

m Citation maintenance. [Pu131] Removed redundant parameters.
m cleanup: whitespace, dashes
Line 1:
A '''prefix code''' is a [[code]] system, typically a [[variable-length code]], with the "prefix property": no valid [[code word]] in the system is a [[prefix (computer science)|prefix]] (start) of any other valid code word in the set. A code with code words {9, 59, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of both "59" and "55". With a prefix code, a receiver can identify the end of each code word without a special marker.
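The prefix property can be checked mechanically. The following is a minimal sketch in Python, assuming code words are given as strings; the function name <code>is_prefix_free</code> is illustrative and not taken from any library.

```python
def is_prefix_free(code_words):
    """Return True if no code word is a prefix of a different code word."""
    words = sorted(set(code_words))
    # In lexicographic order, a word that is a prefix of some other word
    # is immediately followed by a word that starts with it, so comparing
    # neighbours is sufficient.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


print(is_prefix_free({"9", "59", "55"}))       # the first example set
print(is_prefix_free({"9", "5", "59", "55"}))  # "5" is a prefix of "59"
```

Applied to the two example sets above, the check accepts {9, 59, 55} and rejects {9, 5, 59, 55}.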
 
Prefix codes are also known as '''prefix-free codes''', '''prefix condition codes''' and '''instantaneous codes'''. Although [[Huffman coding]] is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm.
 
The term '''comma-free code''' refers to a more restricted class of codes. Consider the strings of symbols formed by concatenating two codewords. If the substring starting at the second symbol and ending at the second-last symbol does not contain any codewords, then the code is comma-free.<ref>{{citation|last1=Berstel|first1=Jean|last2=Perrin|first2=Dominique|title=Theory of Codes|publisher=Academic Press|year=1985}}</ref> The term is sometimes incorrectly applied as a synonym for prefix-free codes.<ref>US [[Federal Standard 1037C]]</ref>
 
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any [[out-of-band]] markers to [[framing (telecommunication)|frame]] the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing prefixes that form valid code words. This is not possible with codes that lack the prefix property, such as {0,&nbsp;1,&nbsp;10,&nbsp;11}: a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11".
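The decoding procedure described above can be sketched as follows. This is an illustrative Python sketch, not a reference implementation: the function name <code>decode</code> and the mapping from code words to the symbols "A", "B" and "C" are assumptions made for the example.

```python
def decode(message, code):
    """Split `message` into code words of the prefix code `code`.

    `code` maps each code word (a string) to the symbol it stands for.
    """
    symbols, word = [], ""
    for ch in message:
        word += ch
        # In a prefix code, the first match is the only possible match,
        # so the word can be emitted as soon as it is recognised.
        if word in code:
            symbols.append(code[word])
            word = ""
    if word:
        raise ValueError(f"trailing symbols {word!r} do not form a code word")
    return symbols


# Hypothetical symbol assignment for the code {9, 59, 55}.
code = {"9": "A", "59": "B", "55": "C"}
print(decode("959559", code))
```

Because no code word is a prefix of another, the decoder never needs to look ahead or backtrack; this is why prefix codes are also called instantaneous codes.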
 
The variable-length [[Huffman coding|Huffman codes]], [[country calling codes]], the country and publisher parts of [[ISBN]]s, and the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard are prefix codes.
Line 14 ⟶ 13:
[[Kraft's inequality]] characterizes the sets of code word lengths that are possible in a prefix code.
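Concretely, code word lengths <math>\ell_1, \ldots, \ell_n</math> are possible for a prefix code over an alphabet of ''r'' symbols exactly when <math>\sum_{i=1}^{n} r^{-\ell_i} \le 1</math>. A minimal Python sketch of this test is given below; the function names and the default <code>radix=2</code> are choices made for the example.

```python
from fractions import Fraction


def kraft_sum(lengths, radix=2):
    """Sum of radix**(-length) over all code word lengths, computed exactly."""
    return sum(Fraction(1, radix ** length) for length in lengths)


def satisfies_kraft(lengths, radix=2):
    """True if some prefix code over `radix` symbols has these word lengths."""
    return kraft_sum(lengths, radix) <= 1


print(satisfies_kraft([1, 2, 3, 3]))  # 1/2 + 1/4 + 1/8 + 1/8 = 1
print(satisfies_kraft([1, 1, 2]))     # 1/2 + 1/2 + 1/4 > 1
```

Exact fractions are used instead of floating point so that sums equal to 1 are not misclassified by rounding.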
 
== Techniques ==
Techniques for constructing a prefix code can be simple, or quite complicated.
 
If every word in the code has the same length, the code is called a '''fixed-length code''', or a '''block code''' (though the term [[block code]] is also used for fixed-size [[error-correcting code]]s in [[channel coding]]). For example, [[ISO 8859-15]] letters are always 8 bits long. [[UTF-32/UCS-4]] letters are always 32 bits long. [[Asynchronous Transfer Mode|ATM packets]] are always 424 bits long. A block code of fixed length ''k'' bits can encode up to <math>2^{k}</math> source symbols.
 
A fixed-length code is necessarily a prefix code, because no code word can be a proper prefix of another word of the same length. Shorter words can be brought into such a code only by padding them to the length of the longest word (the padding may be chosen to introduce redundancy that allows automatic error correction and/or synchronisation). However, fixed-length encodings are inefficient in situations where some words are much more likely to be transmitted than others; in that case a variable-length code can eliminate some or all of the redundancy for data compression.
 
[[Truncated binary encoding]] is a straightforward generalization of block codes to deal with cases where the number of symbols ''n'' is not a power of two. Source symbols are assigned codewords of length ''k'' and ''k''+1, where <math>2^{k} < n < 2^{k+1}</math>.
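One common way to assign the codewords is sketched below in Python, assuming the source symbols are numbered 0 to ''n''−1: the first <math>2^{k+1} - n</math> symbols receive ''k''-bit codewords and the remaining symbols receive (''k''+1)-bit codewords. The function name <code>truncated_binary</code> is illustrative.

```python
def truncated_binary(x, n):
    """Code word for symbol x (0 <= x < n) in an alphabet of n symbols."""
    if not 0 <= x < n:
        raise ValueError("x must satisfy 0 <= x < n")
    k = n.bit_length() - 1   # largest k with 2**k <= n
    u = (1 << (k + 1)) - n   # number of symbols that get the short length k
    if x < u:
        # Short code word: x written in k bits (empty when k == 0).
        return format(x, f"0{k}b") if k else ""
    # Long code word: x + u written in k + 1 bits.
    return format(x + u, f"0{k + 1}b")


print([truncated_binary(x, 5) for x in range(5)])
```

For ''n'' = 5 this gives ''k'' = 2, so three symbols receive 2-bit codewords and two receive 3-bit codewords; the offset ''u'' added to the long codewords keeps the set prefix-free.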
 
[[Huffman coding]] is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. This is a form of [[lossless data compression]] based on [[entropy encoding]].
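The construction can be sketched in Python by repeatedly merging the two least frequent subtrees. This is a compact illustration rather than an optimized implementation; the function name <code>huffman_code</code>, the tie-breaking counter and the sample frequencies are assumptions made for the example.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Map each symbol in `freqs` (symbol -> weight) to a binary code word."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single symbol still needs a non-empty code word.
        return {sym: "0" for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)   # two lightest subtrees
        w1, _, right = heapq.heappop(heap)
        # Merging prepends one bit to every code word in each subtree.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


code = huffman_code({"a": 5, "b": 2, "c": 1, "d": 1})
print({sym: len(word) for sym, word in code.items()})
```

With these sample frequencies the most frequent symbol receives a 1-bit code word and the two rarest receive 3-bit code words. The exact bit patterns depend on how ties are broken, but the weighted average length does not.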
 
Some codes mark the end of a code word with a special "comma" symbol, different from normal data.<ref>[http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf "Development of Trigger and Control Systems for CMS"] by J. A. Jones: "Synchronisation" p. 70</ref> This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is prefix-free. However, modern communication systems send everything as sequences of "1" and "0" &ndash; adding a third symbol would be expensive, and using it only at the ends of words would be inefficient. [[Morse code]] is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly, [[Fibonacci coding]] uses a "11" to mark the end of every code word.
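The role of "11" as an end marker in Fibonacci coding can be illustrated with a short Python sketch. It assumes the usual convention that a positive integer is written as a sum of non-consecutive Fibonacci numbers 1, 2, 3, 5, ..., smallest first, followed by one extra "1"; the function names are illustrative.

```python
def fib_encode(n):
    """Fibonacci code word for a positive integer n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number that exceeds n
    bits = []
    for f in reversed(fibs):  # greedy choice, largest number first
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    # Emit smallest-first, then append the terminating "1".
    return "".join(reversed(bits)) + "1"


def fib_decode_stream(bits):
    """Decode a concatenation of Fibonacci code words."""
    values, fibs = [], [1, 2]
    total, i, prev = 0, 0, "0"
    for b in bits:
        if b == "1" and prev == "1":  # "11" marks the end of a code word
            values.append(total)
            total, i, prev = 0, 0, "0"
            continue
        if b == "1":
            while i >= len(fibs):
                fibs.append(fibs[-1] + fibs[-2])
            total += fibs[i]
        i += 1
        prev = b
    return values


stream = "".join(fib_encode(n) for n in (1, 2, 4))
print(stream, fib_decode_stream(stream))
```

Because the representation never uses two consecutive Fibonacci numbers, "11" cannot occur inside a code word, so its first occurrence always marks the end of the word.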
 
[[Self-synchronizing code]]s are prefix codes that allow [[frame synchronization]].
Line 53 ⟶ 50:
 
==References==
* P. Elias, Universal codeword sets and representations of integers, IEEE Trans. Inform. Theory 21 (2) (1975) 194–203.
* D.A. Huffman, "[http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf A method for the construction of minimum-redundancy codes]" (PDF), Proceedings of the I.R.E., Sept. 1952, pp.&nbsp;1098–1102 (Huffman's original article)
* [http://www.huffmancoding.com/david/scientific.html Profile: David A. Huffman], [[Scientific American]], Sept. 1991, pp. 54–58 (Background story)
* [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]], and [[Clifford Stein]]. ''[[Introduction to Algorithms]]'', Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 16.3, pp. 385–392.
* {{FS1037C}}
 
==External links==
* [http://plus.maths.org/issue10/features/infotheory/index.html Codes, trees and the prefix property] by Kona Macphee
 
[[Category:Coding theory]]