Prefix code

{{Short description|Type of code system}}
A '''prefix code''' is a type of [[code]] system distinguished by its possession of the '''prefix property''', which requires that there is no whole [[Code word (communication)|code word]] in the system that is a [[prefix (computer science)|prefix]] (initial segment) of any other code word in the system. It is trivially true for fixed-length codes, so only a point of consideration for [[variable-length code|variable-length codes]].
 
For example, a code with code words {9, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of "59" and also of "55". A prefix code is a [[uniquely decodable code]]: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
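
The prefix property can be checked mechanically by comparing every pair of code words. The following Python sketch is illustrative only (the function name and example sets are chosen for this example) and verifies the two codes above:
<syntaxhighlight lang="python">
def has_prefix_property(code_words):
    """Return True if no code word is a prefix of any other code word."""
    words = list(code_words)
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

print(has_prefix_property(["9", "55"]))             # True: a prefix code
print(has_prefix_property(["9", "5", "59", "55"]))  # False: "5" is a prefix of "59" and "55"
</syntaxhighlight>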
 
Prefix codes are also known as '''prefix-free codes''', '''prefix condition codes''' and '''instantaneous codes'''. Although [[Huffman coding]] is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term '''comma-free code''' is sometimes also applied as a synonym for prefix-free codes<ref>US [[Federal Standard 1037C]]</ref><ref>{{citation|title=ATIS Telecom Glossary 2007|url=http://www.atis.org/glossary/definition.aspx?id=6416|access-date=December 4, 2010|archive-date=July 8, 2010|archive-url=https://web.archive.org/web/20100708083829/http://www.atis.org/glossary/definition.aspx?id=6416|url-status=dead}}</ref> but in most mathematical books and articles (e.g.<ref>{{citation|last1=Berstel|first1=Jean|last2=Perrin|first2=Dominique|title=Theory of Codes|publisher=Academic Press|year=1985}}</ref><ref>{{citation|doi=10.4153/CJM-1958-023-9|last1=Golomb|first1=S. W.|author1-link=Solomon W. Golomb|last2=Gordon|first2=Basil|author2-link=Basil Gordon|last3=Welch|first3=L. R.|title=Comma-Free Codes|journal=Canadian Journal of Mathematics|volume=10|issue=2|pages=202–209|year=1958|s2cid=124092269 |url=https://books.google.com/books?id=oRgtS14oa-sC&pg=PA202|doi-access=free}}</ref>) a comma-free code is used to mean a [[self-synchronizing code]], a subclass of prefix codes.
 
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any [[Out-of-band data|out-of-band]] markers or (alternatively) special markers between words to [[framing (telecommunication)|frame]] the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example {0,&nbsp;1,&nbsp;10,&nbsp;11}: a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11"; so the string "10" could be interpreted either as a single codeword or as the concatenation of the words "1" then "0".
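
As an illustration of this decoding procedure, the sketch below (assuming the code is supplied as a mapping from code words to symbols; all names are chosen for this example) repeatedly strips the unique matching code word from the front of the input:
<syntaxhighlight lang="python">
def decode_prefix_code(message, code):
    """Decode `message` by repeatedly removing a valid code word from the front.

    `code` maps each code word to its symbol; the prefix property guarantees
    that at most one code word can match at any position.
    """
    symbols = []
    pos = 0
    while pos < len(message):
        for word, symbol in code.items():
            if message.startswith(word, pos):
                symbols.append(symbol)
                pos += len(word)
                break
        else:
            raise ValueError("message is not a concatenation of code words")
    return symbols

# {"0": "a", "10": "b", "11": "c"} is a prefix code
print(decode_prefix_code("100110", {"0": "a", "10": "b", "11": "c"}))  # ['b', 'a', 'c', 'a']
</syntaxhighlight>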
 
The variable-length [[Huffman coding|Huffman codes]], [[country calling codes]], the country and publisher parts of [[ISBN]]s, the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard, and the [[instruction set]]s (machine language) of most computer microarchitectures are prefix codes.
 
Prefix codes are not [[error-correcting codes]]. In actual practice, a message might first be compressed with a prefix code, and then encoded again with [[channel coding]] (including error correction) before transmission.
<small>''This article is partly derived from [[Federal Standard 1037C]], which uses the term '''comma-free code'''.''</small>
 
For any [[Variable-length_code#Uniquely_decodable_codes|uniquely decodable]] code there is a prefix code that has the same code word lengths.<ref name=LTU2015>Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015.</ref> [[Kraft's inequality]] characterizes the sets of code word lengths that are possible in a [[Variable-length_code#Uniquely_decodable_codes|uniquely decodable]] code.<ref name=BRS75>Berstel et al (2010) p.75</ref>
 
== Techniques ==
Techniques for constructing a prefix code can be simple, or quite complicated.
If every word in the code has the same length, the code is called a '''fixed-length code''', or a '''block code''' (though the term [[block code]] is also used for fixed-size [[error-correcting code]]s in [[channel coding]]). For example, [[ISO 8859-15]] letters are always 8 bits long. [[UTF-32/UCS-4]] letters are always 32 bits long. [[Asynchronous Transfer Mode|ATM cells]] are always 424 bits (53 bytes) long. A fixed-length code of fixed length ''k'' bits can encode up to <math>2^{k}</math> source symbols.
 
A fixed-length code is necessarily a prefix code. It is possible to turn any code into a [[fixed-length code]] by padding fixed symbols to the shorter prefixes in order to meet the length of the longest prefixes. Alternately, such padding codes may be employed to introduce redundancy that allows autocorrection and/or synchronisation. However, fixed-length encodings are inefficient in situations where some words are much more likely to be transmitted than others.
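
A minimal sketch of this padding idea, assuming the original code is a prefix code (which keeps the padded words distinct); the function name is chosen for this example:
<syntaxhighlight lang="python">
def pad_to_fixed_length(code_words, pad_symbol="0"):
    """Pad every code word with a fixed symbol to the length of the longest word."""
    width = max(len(word) for word in code_words)
    return [word.ljust(width, pad_symbol) for word in code_words]

print(pad_to_fixed_length(["0", "10", "110", "111"]))  # ['000', '100', '110', '111']
</syntaxhighlight>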
 
[[Truncated binary encoding]] is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols ''n'' is not a power of two. Source symbols are assigned codewords of length ''k'' and ''k''+1, where ''k'' is chosen so that <math>2^k < n \le 2^{k+1}</math>.
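
As an illustrative sketch (the function name is chosen for this example), the assignment can be computed as follows for ''n'' not a power of two, taking ''k'' = ⌊log<sub>2</sub> ''n''⌋:
<syntaxhighlight lang="python">
def truncated_binary_code(n):
    """Prefix code for n symbols using code words of length k and k+1."""
    k = n.bit_length() - 1      # k = floor(log2(n)) for n >= 1
    u = 2 ** (k + 1) - n        # number of symbols that receive the shorter length k
    code = {}
    for x in range(n):
        if x < u:
            code[x] = format(x, f"0{k}b")           # k-bit code word
        else:
            code[x] = format(x + u, f"0{k + 1}b")   # (k+1)-bit code word
    return code

print(truncated_binary_code(5))  # {0: '00', 1: '01', 2: '10', 3: '110', 4: '111'}
</syntaxhighlight>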
 
[[Huffman coding]] is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (This is closely related to minimizing the entropy.) This is a form of [[lossless data compression]] based on [[entropy encoding]].
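
A compact, illustrative sketch of the Huffman construction (the symbol weights stand in for the input frequencies; all names are chosen for this example):
<syntaxhighlight lang="python">
import heapq

def huffman_code(weights):
    """Build a binary prefix code minimizing the weighted average code word length.

    `weights` maps each symbol to its frequency (or probability).
    Returns a dict mapping each symbol to its bit string.
    """
    # Heap entries: (total weight, tie-breaker, {symbol: partial code word}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol alphabet
        return {sym: "0" for sym in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)  # merge the two least-weight subtrees
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.25}))  # {'a': '0', 'b': '10', 'c': '11'}
</syntaxhighlight>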
 
Some codes mark the end of a code word with a special "comma" symbol (also called a [[Sentinel value]]), different from normal data.<ref>{{cite web |url=http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf |title=Development of Trigger and Control Systems for CMS |first1=J. |last1=A. Jones |page=70 |publisher=High Energy Physics, Blackett Laboratory, Imperial College, London |url-status=dead |archive-url= https://web.archive.org/web/20110613183447/http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf |archive-date= Jun 13, 2011 }}</ref> This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, reserving an entire symbol only for use as a comma can be inefficient, especially for languages with a small number of symbols. [[Morse code]] is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly, [[Fibonacci coding]] uses a "11" to mark the end of every code word.
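
A sketch of Fibonacci coding's terminating "11" (illustrative only): a positive integer is written in its Zeckendorf representation, least significant position first, and a final 1 is appended.
<syntaxhighlight lang="python">
def fibonacci_encode(n):
    """Fibonacci code of a positive integer n; every code word ends in '11'."""
    fibs = [1, 2]                      # F(2), F(3), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    remainder = n
    for f in reversed(fibs[:-1]):      # greedy Zeckendorf decomposition
        if f <= remainder:
            bits.append("1")
            remainder -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"   # the appended 1 forms the '11' terminator

print([fibonacci_encode(n) for n in range(1, 6)])  # ['11', '011', '0011', '1011', '00011']
</syntaxhighlight>
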
[[Kraft's inequality]] characterizes the sets of code word lengths that are possible in a prefix code.
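
For a binary alphabet, the inequality requires that the code word lengths <math>\ell_i</math> of any prefix code satisfy <math>\textstyle\sum_i 2^{-\ell_i} \le 1</math>; a one-line check (illustrative, with a hypothetical function name):
<syntaxhighlight lang="python">
def satisfies_kraft(lengths, r=2):
    """Check Kraft's inequality for code word lengths over an r-symbol alphabet."""
    return sum(r ** -length for length in lengths) <= 1

print(satisfies_kraft([1, 2, 2]))     # True:  realized by the prefix code {0, 10, 11}
print(satisfies_kraft([1, 1, 2, 2]))  # False: no prefix code has these lengths
</syntaxhighlight>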
 
[[Self-synchronizing code]]s are prefix codes that allow [[frame synchronization]].
 
==Related concepts==
A '''suffix code''' is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A '''bifix code''' is a set of words which is both a prefix and a suffix code.<ref name=BPR58>Berstel et al (2010) p.58</ref>
An '''optimal prefix code''' is a prefix code with minimal average length. That is, assume an alphabet of {{mvar|n}} symbols with probabilities <math>p(A_i)</math> and a prefix code {{mvar|C}} whose code words have lengths <math>\lambda_i</math>. If {{mvar|C'}} is another prefix code and <math>\lambda'_i</math> are the lengths of the code words of {{mvar|C'}}, then <math>\sum_{i=1}^n { \lambda_i p(A_i) } \leq \sum_{i=1}^n { \lambda'_i p(A_i) } \!</math>.<ref>[http://www.cim.mcgill.ca/~langer/423/lecture2.pdf McGill COMP 423 Lecture notes]</ref>
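
The average length in this definition can be computed directly; the sketch below (symbols, codes and probabilities chosen for illustration) compares an optimal prefix code with a fixed-length code for the same three-symbol source:
<syntaxhighlight lang="python">
def average_length(code, probabilities):
    """Weighted average code word length: sum of p(symbol) * len(code word)."""
    return sum(probabilities[s] * len(w) for s, w in code.items())

p = {"a": 0.5, "b": 0.25, "c": 0.25}
optimal = {"a": "0", "b": "10", "c": "11"}   # optimal prefix code for p
fixed   = {"a": "00", "b": "01", "c": "10"}  # a fixed-length prefix code
print(average_length(optimal, p))  # 1.5
print(average_length(fixed, p))    # 2.0
</syntaxhighlight>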
 
==Prefix codes in use today==
Examples of prefix codes include:
* variable-length [[Huffman coding|Huffman codes]]
* [[country calling codes]]
* [[Chen–Ho encoding]]
* the country and publisher parts of [[ISBN]]s
* the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard
* [[VCR Plus|VCR Plus+ codes]]
* [[Unicode Transformation Format]], in particular the [[UTF-8]] system for encoding [[Unicode]] characters, which is both a prefix-free code and a [[self-synchronizing code]]<ref>{{cite web
| url = http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
| title = UTF-8 history
| first = Rob
| last = Pike
| date = 2003-04-03
}}</ref>
* [[variable-length quantity]]
 
===Techniques===
Commonly used techniques for constructing prefix codes include [[Huffman coding|Huffman codes]] and the earlier [[Shannon–Fano coding|Shannon–Fano codes]], and [[universal code (data compression)|universal code]]s such as:
* [[Elias delta coding]]
* [[Elias gamma coding]]
* [[Elias omega coding]]
* [[Fibonacci coding]]
* [[Levenshtein coding]]
* [[Unary coding]]
* [[Golomb Rice code]]
* [[Straddling checkerboard]] (simple cryptography technique which produces prefix codes)
* binary coding<ref>{{citation|doi=10.25209/2079-3316-2018-9-4-239-252|last1=Shevchuk|first1=Y. V.|author1-link=Yury V. Shevchuk|title=Vbinary: variable length integer coding revisited|journal=Program Systems: Theory and Applications|volume=9|issue=4|pages=239–252|year=2018|url=http://psta.psiras.ru//read/psta2018_4_239-252.pdf|doi-access=free}}</ref>
 
==Notes==
{{Reflist}}
 
==References==
* {{cite book | last1=Berstel | first1=Jean | last2=Perrin | first2=Dominique | last3=Reutenauer | first3=Christophe | title=Codes and automata | series=Encyclopedia of Mathematics and its Applications | volume=129 | ___location=Cambridge | publisher=[[Cambridge University Press]] | year=2010 | url=http://www-igm.univ-mlv.fr/~berstel/LivreCodes/Codes.html | isbn=978-0-521-88831-8 | zbl=1187.94001 }}
 
* {{cite journal | last=Elias | first=Peter | author-link=Peter Elias | title=Universal codeword sets and representations of the integers | journal=IEEE Trans. Inf. Theory | volume=21 | number=2 | year=1975 | pages=194–203 | issn=0018-9448 | zbl=0298.94011 | doi=10.1109/tit.1975.1055349}}
* D.A. Huffman, "[http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf A method for the construction of minimum-redundancy codes]" (PDF), Proceedings of the I.R.E., Sept. 1952, pp.&nbsp;1098–1102 (Huffman's original article)
* [https://web.archive.org/web/20070220234037/http://www.huffmancoding.com/david/scientific.html Profile: David A. Huffman], [[Scientific American]], Sept. 1991, pp.&nbsp;54–58 (Background story)
* [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]], and [[Clifford Stein]]. ''[[Introduction to Algorithms]]'', Second Edition. MIT Press and McGraw-Hill, 2001. {{ISBN|0-262-03293-7}}. Section 16.3, pp.&nbsp;385–392.
* {{FS1037C}}
 
==External links==
* [http://plus.maths.org/issue10/features/infotheory/index.html Codes, trees and the prefix property] by Kona Macphee
{{Compression methods}}
 
[[Category:Coding theory]]
[[Category:Prefixes|code]]
[[Category:Data compression]]
[[Category:Lossless compression algorithms]]
 
[[cs:Prefixový kód]]
[[de:Fano-Bedingung]]
[[pl:Kod prefiksowy]]
[[ru:Префиксный код]]