{{Short description|Type of code system}}
A '''prefix code''' is a type of [[code]] system distinguished by its possession of the '''prefix property''', which requires that no whole [[Code word (communication)|code word]] in the system is a [[prefix (computer science)|prefix]] (initial segment) of any other code word in the system. The property holds trivially for fixed-length codes, so it is a consideration only for [[variable-length code|variable-length codes]].
 
For example, a code with code words {9, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of "59" and also of "55". A prefix code is a [[uniquely decodable code]]: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
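
The prefix property is simple to verify mechanically; the following minimal Python sketch (illustrative only, the function name is not standard) checks every pair of code words:
<syntaxhighlight lang="python">
def is_prefix_code(words):
    """Return True if no code word is a prefix of any other code word."""
    return not any(v != w and v.startswith(w)
                   for w in words for v in words)

print(is_prefix_code(["9", "55"]))             # True
print(is_prefix_code(["9", "5", "59", "55"]))  # False: "5" prefixes "59" and "55"
</syntaxhighlight>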
 
Prefix codes are also known as '''prefix-free codes''', '''prefix condition codes''' and '''instantaneous codes'''. Although [[Huffman coding]] is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term '''comma-free code''' is sometimes also applied as a synonym for prefix-free codes<ref>US [[Federal Standard 1037C]]</ref><ref>{{citation|title=ATIS Telecom Glossary 2007|url=http://www.atis.org/glossary/definition.aspx?id=6416|access-date=December 4, 2010|archive-date=July 8, 2010|archive-url=https://web.archive.org/web/20100708083829/http://www.atis.org/glossary/definition.aspx?id=6416|url-status=dead}}</ref> but in most mathematical books and articles (e.g.<ref>{{citation|last1=Berstel|first1=Jean|last2=Perrin|first2=Dominique|title=Theory of Codes|publisher=Academic Press|year=1985}}</ref><ref>{{citation|doi=10.4153/CJM-1958-023-9|last1=Golomb|first1=S. W.|author1-link=Solomon W. Golomb|last2=Gordon|first2=Basil|author2-link=Basil Gordon|last3=Welch|first3=L. R.|title=Comma-Free Codes|journal=Canadian Journal of Mathematics|volume=10|issue=2|pages=202–209|year=1958|s2cid=124092269 |url=https://books.google.com/books?id=oRgtS14oa-sC&pg=PA202|doi-access=free}}</ref>) a comma-free code is used to mean a [[self-synchronizing code]], a subclass of prefix codes.
This property permits the proper [[framing (telecommunication)|framing]] of transmitted code words when (a) external [[synchronization]] is provided to identify the start of the first code word in a [[sequence]] of code words and (b) no uncorrected errors occur in the symbol stream.
 
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any [[Out-of-band data|out-of-band]] markers or (alternatively) special markers between words to [[framing (telecommunication)|frame]] the words in the message. The recipient can decode the message unambiguously, by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example {0,&nbsp;1,&nbsp;10,&nbsp;11}: a receiver reading a "1" at the start of a code word would not know whether that was the complete code word "1", or merely the prefix of the code word "10" or "11"; so the string "10" could be interpreted either as a single codeword or as the concatenation of the words "1" then "0".
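
The following Python sketch illustrates such a decoder on a made-up binary code table: thanks to the prefix property, a symbol can be emitted the moment its code word is recognized, which is why prefix codes are also called instantaneous codes.
<syntaxhighlight lang="python">
def decode(bits, codebook):
    """Decode a concatenation of prefix-code words, emitting each symbol
    as soon as its code word is recognized (instantaneous decoding)."""
    inverse = {word: symbol for symbol, word in codebook.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # prefix property: no longer word can also match
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ended in the middle of a code word")
    return symbols

# Illustrative code table: {'a': '0', 'b': '10', 'c': '11'} is a prefix code.
print(decode("100110", {"a": "0", "b": "10", "c": "11"}))  # ['b', 'a', 'c', 'a']
</syntaxhighlight>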
 
The variable-length [[Huffman coding|Huffman codes]], [[country calling codes]], the country and publisher parts of [[ISBN]]s, the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard, and the [[instruction set]]s (machine language) of most computer microarchitectures are prefix codes.
 
Prefix codes are not [[error-correcting codes]]. In practice, a message might first be compressed with a prefix code, and then encoded again with [[channel coding]] (including error correction) before transmission.
 
For any [[Variable-length_code#Uniquely_decodable_codes|uniquely decodable]] code there is a prefix code that has the same code word lengths.<ref name=LTU2015>Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015.</ref> [[Kraft's inequality]] characterizes the sets of code word lengths that are possible in a [[Variable-length_code#Uniquely_decodable_codes|uniquely decodable]] code.<ref name=BRS75>Berstel et al (2010) p.75</ref>
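
As an illustration, the following Python sketch (function names are illustrative) computes the Kraft sum for a list of binary code word lengths and, when the inequality holds, builds one prefix code with exactly those lengths using the standard canonical construction:
<syntaxhighlight lang="python">
from fractions import Fraction

def kraft_sum(lengths):
    """Kraft sum for binary code word lengths; a prefix code with these
    lengths exists if and only if the sum is at most 1."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build one binary prefix code with the given lengths (canonical code)."""
    assert kraft_sum(lengths) <= 1, "Kraft's inequality is violated"
    words, value, previous = [], 0, 0
    for l in sorted(lengths):
        value <<= l - previous   # extend the current code word to length l
        words.append(format(value, f"0{l}b"))
        value += 1               # next word: increment, then extend again
        previous = l
    return words

print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
</syntaxhighlight>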
 
==Techniques==
If every word in the code has the same length, the code is called a '''fixed-length code''', or a '''block code''' (though the term [[block code]] is also used for fixed-size [[error-correcting code]]s in [[channel coding]]). For example, [[ISO 8859-15]] letters are always 8 bits long. [[UTF-32/UCS-4]] letters are always 32 bits long. [[Asynchronous Transfer Mode|ATM cells]] are always 424 bits (53 bytes) long. A fixed-length code of length ''k'' bits can encode up to <math>2^{k}</math> source symbols.
 
A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding the shorter code words with fixed symbols until they all match the length of the longest code word. Alternatively, such padding may be employed to introduce redundancy that allows error correction and/or synchronisation. However, fixed-length encodings are inefficient in situations where some words are much more likely to be transmitted than others.
 
[[Truncated binary encoding]] is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols ''n'' is not a power of two. Source symbols are assigned codewords of length ''k'' and ''k''+1, where ''k'' is chosen so that ''2<sup>k</sup> < n ≤ 2<sup>k+1</sup>''.
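
A minimal Python sketch of the scheme described above (the function name is illustrative); the first 2<sup>''k''+1</sup> − ''n'' symbols receive the shorter code words:
<syntaxhighlight lang="python">
from math import ceil, log2

def truncated_binary(symbol, n):
    """Truncated binary code word for symbol in range(n), with n >= 2."""
    k = ceil(log2(n)) - 1   # chosen so that 2**k < n <= 2**(k + 1)
    u = 2 ** (k + 1) - n    # this many symbols get the shorter length k
    if symbol < u:
        return format(symbol, f"0{k}b")
    return format(symbol + u, f"0{k + 1}b")

print([truncated_binary(s, 5) for s in range(5)])
# ['00', '01', '10', '110', '111'] is a prefix code for 5 symbols
</syntaxhighlight>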
=== the "comma" ===
 
[[Huffman coding]] is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (The resulting expected length is close to, and bounded below by, the [[Entropy (information theory)|entropy]] of the source.) This is a form of [[lossless data compression]] based on [[entropy encoding]].
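
The following Python sketch is a standard textbook rendering of the algorithm (identifiers are illustrative): it repeatedly merges the two least frequent subtrees, then reads the code words off the finished tree.
<syntaxhighlight lang="python">
import heapq
from itertools import count

def huffman(freqs):
    """Binary prefix code minimizing expected code word length.
    freqs maps each symbol to its frequency (or probability)."""
    tie = count()  # tie-breaker so the heap never compares symbols/trees
    heap = [(f, next(tie), symbol) for symbol, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    code = {}
    def walk(node, word):
        if isinstance(node, tuple):         # internal node: recurse
            walk(node[0], word + "0")
            walk(node[1], word + "1")
        else:
            code[node] = word or "0"        # lone-symbol edge case
    walk(heap[0][2], "")
    return code

print(huffman({"a": 0.45, "b": 0.3, "c": 0.15, "d": 0.1}))
# {'a': '0', 'd': '100', 'c': '101', 'b': '11'} (tie-breaking may vary)
</syntaxhighlight>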
 
Some codes mark the end of a code word with a special "comma" symbol (also called a [[Sentinel value]]), different from normal data.<ref>{{cite web |url=http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf |title=Development of Trigger and Control Systems for CMS |first1=J. |last1=A. Jones |page=70 |publisher=High Energy Physics, Blackett Laboratory, Imperial College, London |url-status=dead |archive-url= https://web.archive.org/web/20110613183447/http://www.imperial.ac.uk/research/hep/group/theses/JJones.pdf |archive-date= Jun 13, 2011 }}</ref> This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, reserving an entire symbol only for use as a comma can be inefficient, especially for languages with a small number of symbols. [[Morse code]] is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly, [[Fibonacci coding]] uses a "11" to mark the end of every code word.
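
As an illustration, Fibonacci code words can be generated from the [[Zeckendorf's theorem|Zeckendorf representation]] of each positive integer; a minimal Python sketch (the function name is illustrative):
<syntaxhighlight lang="python">
def fibonacci_code(n):
    """Fibonacci code word for a positive integer: Zeckendorf digits written
    least-significant-first, plus a final '1' so every word ends in '11'."""
    fibs = [1, 2]                         # F(2), F(3), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                            # drop the first Fibonacci number > n
    bits = ["0"] * len(fibs)
    for i in reversed(range(len(fibs))):  # greedy Zeckendorf decomposition
        if fibs[i] <= n:
            bits[i] = "1"
            n -= fibs[i]
    return "".join(bits) + "1"            # the trailing "11" acts as the comma

print([fibonacci_code(n) for n in range(1, 7)])
# ['11', '011', '0011', '1011', '00011', '10011']
</syntaxhighlight>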
 
[[Self-synchronizing code]]s are prefix codes that allow [[frame synchronization]].
 
==Related concepts==
Fortunately, the "third symbol" turns out to be unnecessary -- it's possible for a machine to receive the '''comma-free''' sequence
A '''suffix code''' is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A '''bifix code''' is a set of words which is both a prefix and a suffix code.<ref name=BPR58>Berstel et al (2010) p.58</ref>

An '''optimal prefix code''' is a prefix code with minimal average length. That is, assume an alphabet of {{mvar|n}} symbols with probabilities <math>p(A_i)</math> and a prefix code {{mvar|C}} whose code word lengths are <math>\lambda_i</math>. Then {{mvar|C}} is optimal if, for every prefix code {{mvar|C'}} with code word lengths <math>\lambda'_i</math>, <math>\sum_{i=1}^n { \lambda_i p(A_i) } \leq \sum_{i=1}^n { \lambda'_i p(A_i) }</math>.<ref>[http://www.cim.mcgill.ca/~langer/423/lecture2.pdf McGill COMP 423 Lecture notes]</ref>
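
For example, for three symbols with probabilities <math>(1/2, 1/4, 1/4)</math>, the prefix code {0, 10, 11} has expected length <math>\tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5</math> bits per symbol and is optimal: it matches the source's [[Entropy (information theory)|entropy]] of 1.5 bits, whereas any fixed-length binary code for three symbols must spend 2 bits on every symbol.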
 
==Prefix codes in use today==
Examples of prefix codes include:
* variable-length [[Huffman coding|Huffman codes]]
* [[country calling codes]]
* [[Chen–Ho encoding]]
* the country and publisher parts of [[ISBN]]s
* the Secondary Synchronization Codes used in the [[UMTS]] [[W-CDMA]] 3G Wireless Standard
* [[VCR Plus|VCR Plus+ codes]]
* [[Unicode Transformation Format]], in particular the [[UTF-8]] system for encoding [[Unicode]] characters, which is both a prefix-free code and a [[self-synchronizing code]]<ref>{{cite web |url=http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt |title=UTF-8 history |first=Rob |last=Pike |date=2003-04-03}}</ref>
* [[Variable-length quantity]]
 
===Techniques===
Commonly used techniques for constructing prefix codes include [[Huffman coding|Huffman codes]] and the earlier [[Shannon–Fano coding|Shannon–Fano codes]], and [[universal code (data compression)|universal code]]s such as:
* [[Elias delta coding]]
* [[Elias gamma coding]]
* [[Elias omega coding]]
* [[Fibonacci coding]]
* [[Levenshtein coding]]
* [[Unary coding]]
* [[Golomb Rice code]]
* [[Straddling checkerboard]] (simple cryptography technique which produces prefix codes)
* binary coding<ref>{{citation|doi=10.25209/2079-3316-2018-9-4-239-252|last1=Shevchuk|first1=Y. V.|author1-link=Yury V. Shevchuk|title=Vbinary: variable length integer coding revisited|journal=Program Systems: Theory and Applications|volume=9|issue=4|pages=239–252|year=2018|url=http://psta.psiras.ru//read/psta2018_4_239-252.pdf|doi-access=free}}</ref>
 
==Notes==
{{Reflist}}
 
==References==
* {{cite book | last1=Berstel | first1=Jean | last2=Perrin | first2=Dominique | last3=Reutenauer | first3=Christophe | title=Codes and automata | series=Encyclopedia of Mathematics and its Applications | volume=129 | ___location=Cambridge | publisher=[[Cambridge University Press]] | year=2010 | url=http://www-igm.univ-mlv.fr/~berstel/LivreCodes/Codes.html | isbn=978-0-521-88831-8 | zbl=1187.94001 }}
* {{cite journal | last=Elias | first=Peter | author-link=Peter Elias | title=Universal codeword sets and representations of the integers | journal=IEEE Trans. Inf. Theory | volume=21 | number=2 | year=1975 | pages=194–203 | issn=0018-9448 | zbl=0298.94011 | doi=10.1109/tit.1975.1055349}}
* D.A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., Sept. 1952, pp.&nbsp;1098–1102 (Huffman's original article)
* [https://web.archive.org/web/20070220234037/http://www.huffmancoding.com/david/scientific.html Profile: David A. Huffman], [[Scientific American]], Sept. 1991, pp.&nbsp;54–58 (Background story)
* [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]], and [[Clifford Stein]]. ''[[Introduction to Algorithms]]'', Second Edition. MIT Press and McGraw-Hill, 2001. {{ISBN|0-262-03293-7}}. Section 16.3, pp.&nbsp;385–392.
* {{FS1037C}}
 
==External links==
* [http://plus.maths.org/issue10/features/infotheory/index.html Codes, trees and the prefix property] by Kona Macphee
{{Compression methods}}
[[Category:Coding theory]]
[[Category:Prefixes|code]]
[[Category:Data compression]]
[[Category:Lossless compression algorithms]]
[[cs:Prefixový kód]]
[[pl:Kod prefiksowy]]