{{Short description|Type of code system}}
A '''prefix code''' is a type of [[code]] system distinguished by its possession of the '''prefix property''', which requires that there is no whole [[Code word (communication)|code word]] in the system that is a [[prefix (computer science)|prefix]] (initial segment) of any other code word in the system. The prefix property is trivially satisfied by fixed-length codes, so it is only a point of consideration for [[variable-length code|variable-length codes]].
For example, a code with code words {9, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of "59" and also of "55". A prefix code is a [[uniquely decodable code]]: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
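The prefix property can be tested mechanically. A minimal Python sketch (the function name is illustrative, not from any standard library): sorting the code words lexicographically places every prefix immediately before some word it begins, so checking adjacent pairs suffices.

```python
def has_prefix_property(code):
    """Return True if no code word in the set is a prefix of another."""
    words = sorted(code)  # any prefix sorts immediately before one of its extensions
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

print(has_prefix_property({"9", "55"}))             # True
print(has_prefix_property({"9", "5", "59", "55"}))  # False: "5" prefixes "59" and "55"
```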
Prefix codes are also known as '''prefix-free codes''', '''prefix condition codes''' and '''instantaneous codes'''. Although [[Huffman coding]] is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term '''comma-free code''' is sometimes also applied as a synonym for prefix-free codes<ref>US [[Federal Standard 1037C]]</ref><ref>{{citation|title=ATIS Telecom Glossary 2007|url=http://www.atis.org/glossary/definition.aspx?id=6416|access-date=December 4, 2010|archive-date=July 8, 2010|archive-url=https://web.archive.org/web/20100708083829/http://www.atis.org/glossary/definition.aspx?id=6416|url-status=dead}}</ref> but in most mathematical books and articles (e.g.<ref>{{citation|last1=Berstel|first1=Jean|last2=Perrin|first2=Dominique|title=Theory of Codes|publisher=Academic Press|year=1985}}</ref><ref>{{citation|doi=10.4153/CJM-1958-023-9|last1=Golomb|first1=S. W.|author1-link=Solomon W. Golomb|last2=Gordon|first2=Basil|author2-link=Basil Gordon|last3=Welch|first3=L. R.|title=Comma-Free Codes|journal=Canadian Journal of Mathematics|volume=10|issue=2|pages=202–209|year=1958|s2cid=124092269 |url=https://books.google.com/books?id=oRgtS14oa-sC&pg=PA202|doi-access=free}}</ref>) a comma-free code is used to mean a [[self-synchronizing code]], a subclass of prefix codes.
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any [[Out-of-band data|out-of-band]] markers or (alternatively) special markers between words to [[framing (telecommunication)|frame]] the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words.
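This "find and remove" decoding can be sketched as follows (an illustrative example with made-up code words; practical decoders typically walk a code tree rather than a dictionary):

```python
def decode(encoded, codebook):
    """Decode a bit string by repeatedly stripping the code word that starts it.

    The prefix property guarantees that at most one code word can ever match."""
    inverse = {word: symbol for symbol, word in codebook.items()}
    out, buf = [], ""
    for bit in encoded:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("input ends mid-code-word")
    return "".join(out)

codebook = {"a": "0", "b": "10", "c": "11"}  # a binary prefix code
print(decode("010110", codebook))            # "abca"
```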
The variable-length [[Huffman coding|Huffman codes]], [[country calling codes]], the country and publisher parts of [[ISBN]]s, the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard, and the [[instruction set]]s (machine language) of most computer microarchitectures are prefix codes.
Prefix codes are not [[error-correcting codes]]. In practice, a message might first be compressed with a prefix code, and then encoded again with [[channel coding]] (including error correction) before transmission.
For any [[Variable-length_code#Uniquely_decodable_codes|uniquely decodable]] code there is a prefix code that has the same code word lengths.<ref name=LTU2015>Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015.</ref> [[Kraft's inequality]] characterizes the sets of code word lengths that are possible in a [[Variable-length_code#Uniquely_decodable_codes|uniquely decodable]] code.<ref name=BRS75>Berstel et al (2010) p.75</ref>
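As an illustration of Kraft's inequality, the sketch below (function name ours) computes <math>\sum_i r^{-\ell_i}</math> for a list of proposed code word lengths; the lengths are achievable by some prefix code over an alphabet of size ''r'' exactly when the sum is at most 1:

```python
from fractions import Fraction

def kraft_sum(lengths, radix=2):
    """Sum of radix**(-l) over proposed code word lengths (exact arithmetic)."""
    return sum(Fraction(1, radix ** l) for l in lengths)

print(kraft_sum([1, 2, 2]))  # 1: achievable, e.g. by {0, 10, 11}
print(kraft_sum([1, 1, 2]))  # 5/4 > 1: no uniquely decodable code has these lengths
```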
== Techniques ==
If every word in the code has the same length, the code is called a '''fixed-length code''', or a '''block code''' (though the term [[block code]] is also used for fixed-size [[error-correcting code]]s in [[channel coding]]).
[[Truncated binary encoding]] is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols ''n'' is not a power of two. Source symbols are assigned codewords of length ''k'' and ''k''+1, where ''k'' is chosen so that ''2<sup>k</sup> < n ≤ 2<sup>k+1</sup>''.
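A sketch of this assignment (illustrative; symbols are assumed to be numbered 0 through ''n''−1, with the first 2<sup>''k''+1</sup>−''n'' symbols receiving the short code words):

```python
def truncated_binary(x, n):
    """Code word for symbol x in 0..n-1 under truncated binary encoding."""
    k = n.bit_length() - 1      # 2**k <= n < 2**(k+1)
    u = 2 ** (k + 1) - n        # number of short (k-bit) code words
    if x < u:
        return format(x, f"0{k}b") if k else ""
    return format(x + u, f"0{k + 1}b")  # long (k+1-bit) code words

print([truncated_binary(x, 5) for x in range(5)])
# ['00', '01', '10', '110', '111']
```

Note that the resulting set is prefix-free: the short words occupy the low end of the ''k''-bit range, and each long word begins with a ''k''-bit value outside that range.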
Some codes mark the end of a code word with a special "comma" symbol (also called a [[Sentinel value]]), different from normal data.<ref>{{cite web |url=http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf |title=Development of Trigger and Control Systems for CMS |first1=J. |last1=A. Jones |page=70 |publisher=High Energy Physics, Blackett Laboratory, Imperial College, London |url-status=dead |archive-url= https://web.archive.org/web/20110613183447/http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf |archive-date= Jun 13, 2011 }}</ref> This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, modern communication systems send everything as sequences of "1" and "0" – adding a third symbol would be expensive, and using it only at the ends of words would be inefficient. [[Morse code]] is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly, [[Fibonacci coding]] uses a "11" to mark the end of every code word.
[[Huffman coding]] is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (This is closely related to minimizing the entropy.) This is a form of [[lossless data compression]] based on [[entropy encoding]].
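A compact sketch of Huffman's algorithm (illustrative; the sketch uses tuples internally for tree nodes, so symbols are assumed not to be tuples themselves):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a binary prefix code minimizing expected code word length."""
    tick = count()  # tie-breaker so heapq never compares symbols/trees directly
    heap = [(f, next(tick), symbol) for symbol, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # repeatedly merge the two least frequent subtrees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    code = {}
    def walk(node, word):
        if isinstance(node, tuple):
            walk(node[0], word + "0")
            walk(node[1], word + "1")
        else:
            code[node] = word or "0"  # single-symbol alphabet still needs one bit
    walk(heap[0][2], "")
    return code

print(huffman({"a": 0.5, "b": 0.25, "c": 0.25}))
```

For these frequencies the most probable symbol receives a one-bit code word and the other two receive two-bit code words (the exact bit patterns depend on tie-breaking).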
[[Self-synchronizing code]]s are prefix codes that allow [[frame synchronization]].
==Related concepts==
A '''suffix code''' is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A '''bifix code''' is a set of words which is both a prefix and a suffix code.<ref name=BPR58>Berstel et al (2010) p.58</ref>
An '''optimal prefix code''' is a prefix code with minimal average length. That is, assume an alphabet of {{mvar|n}} symbols with probabilities <math>p(A_i)</math> for a prefix code {{mvar|C}} whose code words have lengths <math>\lambda_i</math>. If {{mvar|C'}} is another prefix code and <math>\lambda'_i</math> are the lengths of the codewords of {{mvar|C'}}, then <math>\sum_{i=1}^n { \lambda_i p(A_i) } \leq \sum_{i=1}^n { \lambda'_i p(A_i) } \!</math>.<ref>[http://www.cim.mcgill.ca/~langer/423/lecture2.pdf McGill COMP 423 Lecture notes]</ref>
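As a worked example (probabilities chosen for illustration), the average length of an optimal code for a skewed distribution can be compared with that of a fixed-length prefix code:

```python
def average_length(code, probs):
    """Expected code word length: sum of len(word_i) * p_i over all symbols."""
    return sum(len(code[s]) * p for s, p in probs.items())

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
optimal = {"a": "0", "b": "10", "c": "11"}   # a Huffman code for these probabilities
fixed = {"a": "00", "b": "01", "c": "10"}    # a fixed-length prefix code

print(average_length(optimal, probs))  # 1.5 bits per symbol
print(average_length(fixed, probs))    # 2.0 bits per symbol
```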
==Prefix codes in use today==
Examples of prefix codes include:
* variable-length [[Huffman coding|Huffman codes]]
* [[country calling codes]]
* [[Chen–Ho encoding]]
* the country and publisher parts of [[ISBN]]s
* the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard
* [[VCR Plus|VCR Plus+ codes]]
* [[Unicode Transformation Format]], in particular the [[UTF-8]] system for encoding [[Unicode]] characters, which is both a prefix-free code and a [[self-synchronizing code]]<ref>{{cite web
| url = http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
| title = UTF-8 history
| first = Rob
| last = Pike
| date = 2003-04-03
}}</ref>
* [[variable-length quantity]]
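In UTF-8, the leading byte of each sequence by itself determines how many bytes follow, which is the prefix property at the byte-sequence level. A sketch (helper name ours) that segments an encoded byte stream one code word at a time:

```python
def utf8_length(first_byte):
    """Sequence length implied by a UTF-8 leading byte."""
    if first_byte < 0x80:          # 0xxxxxxx: ASCII, one byte
        return 1
    for length, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0)):
        # compare the (length+1) leading bits against the 110/1110/11110 patterns
        if first_byte >> (7 - length) == lead >> (7 - length):
            return length
    raise ValueError("continuation or invalid leading byte")

encoded = "héllo".encode("utf-8")
i, chars = 0, []
while i < len(encoded):            # walk the stream one code word at a time
    n = utf8_length(encoded[i])
    chars.append(encoded[i:i + n].decode("utf-8"))
    i += n
print(chars)  # ['h', 'é', 'l', 'l', 'o']
```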
===Techniques===
Commonly used techniques for constructing prefix codes include [[Huffman coding|Huffman codes]] and the earlier [[Shannon–Fano coding|Shannon–Fano codes]], and [[universal code (data compression)|universal codes]] such as:
* [[Elias delta coding]]
* [[Elias gamma coding]]
* [[Unary coding]]
* [[Golomb Rice code]]
* [[Straddling checkerboard]] (simple cryptography technique which produces prefix codes)
* binary coding<ref>{{citation|doi=10.25209/2079-3316-2018-9-4-239-252|last1=Shevchuk|first1=Y. V.|author1-link=Yury V. Shevchuk|title=Vbinary: variable length integer coding revisited|journal=Program Systems: Theory and Applications|volume=9|issue=4|pages=239–252|year=2018|url=http://psta.psiras.ru//read/psta2018_4_239-252.pdf|doi-access=free}}</ref>
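As one concrete universal code from the list above, [[Elias gamma coding]] prefixes the binary form of a positive integer ''n'' with ⌊log<sub>2</sub> ''n''⌋ zeros; a decoder counts the zeros to learn how many bits follow, which makes the code prefix-free. A sketch:

```python
def elias_gamma(n):
    """Elias gamma code: floor(log2 n) zeros, then n written in binary."""
    if n < 1:
        raise ValueError("gamma coding is defined for positive integers only")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

print([elias_gamma(n) for n in (1, 2, 3, 4)])
# ['1', '010', '011', '00100']
```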
==Notes==
{{reflist}}
==References==
* {{cite book | last1=Berstel | first1=Jean | last2=Perrin | first2=Dominique | last3=Reutenauer | first3=Christophe | title=Codes and automata | series=Encyclopedia of Mathematics and its Applications | volume=129 | ___location=Cambridge | publisher=[[Cambridge University Press]] | year=2010 | url=http://www-igm.univ-mlv.fr/~berstel/LivreCodes/Codes.html | isbn=978-0-521-88831-8 | zbl=1187.94001 }}
* {{cite journal | last=Elias | first=Peter | author-link=Peter Elias | title=Universal codeword sets and representations of the integers | journal=IEEE Trans. Inf. Theory | volume=21 | number=2 | year=1975 | pages=194–203 | issn=0018-9448 | zbl=0298.94011 | doi=10.1109/tit.1975.1055349}}
* D.A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", ''Proceedings of the I.R.E.'', Sept. 1952, pp. 1098–1101.
* [https://web.archive.org/web/20070220234037/http://www.huffmancoding.com/david/scientific.html Profile: David A. Huffman], [[Scientific American]], Sept. 1991, pp. 54–58.
* [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]], and [[Clifford Stein]]. ''[[Introduction to Algorithms]]'', Second Edition. MIT Press and McGraw-Hill, 2001. {{ISBN|0-262-03293-7}}. Section 16.3, pp. 385–392.
* {{FS1037C}}
==External links==
* [http://plus.maths.org/issue10/features/infotheory/index.html Codes, trees and the prefix property] by Kona Macphee
{{Compression methods}}
[[Category:Coding theory]]
[[Category:Prefixes|code]]
[[Category:Data compression]]
[[Category:Lossless compression algorithms]]