Prefix code



Introduction

A prefix code is a code that has the "prefix property": no code word is a prefix of any other code word in the set. The code with code words {0, 10, 11} has the prefix property; the code {0, 1, 10, 11} does not, because "1" is a prefix of both "10" and "11".
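
To make the definition concrete, here is a minimal sketch in Python (the function name is_prefix_free is ours, for illustration) that tests whether a set of code words has the prefix property:

   def is_prefix_free(code_words):
       # Return True if no code word is a prefix of any other.
       for a in code_words:
           for b in code_words:
               if a != b and b.startswith(a):
                   return False
       return True

   print(is_prefix_free(["0", "10", "11"]))        # True: has the prefix property
   print(is_prefix_free(["0", "1", "10", "11"]))   # False: "1" prefixes "10" and "11"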

Prefix codes are also known as prefix-free codes, comma-free codes, or instantaneous codes. Even though Huffman coding is only one algorithm for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes" (even, confusingly, when the codes were not produced by a Huffman algorithm).

The prefix property permits code words to be transmitted and properly framed without the need for out-of-band markers (assuming that the receiver can correctly identify the start of the transmission and that there are no uncorrected errors in the symbol stream). This is not possible with codes that lack the prefix property, such as our example of {0, 1, 10, 11}: a receiver that reads a "1" at the start of a code word cannot know whether that is the complete code word "1" or merely the prefix of the code word "10" or "11".
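
The prefix property is what makes framing automatic: a receiver can emit a letter the moment the bits it has read form a complete code word. A minimal decoder sketch, assuming the prefix code {0, 10, 11} stands for the letters a, b, and c (an invented assignment):

   def decode(bits, code):
       # Read left to right; because no code word is a prefix of another,
       # the first complete match is always the right one ("instantaneous").
       letters, word = [], ""
       for bit in bits:
           word += bit
           if word in code:
               letters.append(code[word])
               word = ""
       if word:
           raise ValueError("truncated transmission")
       return "".join(letters)

   print(decode("01011", {"0": "a", "10": "b", "11": "c"}))   # prints "abc"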

Examples of prefix codes are the variable-length Huffman codes, country calling codes, ISBNs, and the Secondary Synchronization Codes used in the UMTS W-CDMA 3G wireless standard. Prefix codes are also a form of entropy encoding used in lossless data compression.

This article is partly derived from Federal Standard 1037C, which uses the term comma-free code.

Primer on encoding

A key problem with encoding is knowing where one symbol starts and another ends. In this section, we look at progressively more sophisticated ways of solving this problem, culminating in prefix coding.

Using a "comma"

The first and most obvious way to denote where one symbol starts and another ends is to use a special "comma" symbol. This is somewhat analogous to the period at the end of a sentence: it marks where one sentence ends and another begins. So the word "dada" could be transmitted as

   00101110,10000110,00101110,10000110,

where the "," represents a third symbol, distinct from "1" and "0".

However, modern communication systems send everything as sequences of "1"s and "0"s, and adding a third symbol would be expensive. In general, such a separator symbol is called a "comma".

Fixed-length comma-free codes

Fortunately, the "third symbol" turns out to be unnecessary. It's possible for a machine to receive the comma-free sequence

   00101110100001100010111010000110

and correctly decode the word "dada".

The simplest method is to make every letter's code the same length; this is called a "fixed-length code". For example, ISO 8859-15 letters are always 8 bits long, UTF-32/UCS-4 letters are always 32 bits long, and ATM packets are always 424 bits long.
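
Decoding a fixed-length code is just a matter of cutting the stream into equal-sized chunks. A sketch, reusing the 8-bit code words for "d" and "a" from the transmission above:

   def decode_fixed(bits, table, width=8):
       # Split the stream into fixed-width chunks and look each one up.
       chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
       return "".join(table[chunk] for chunk in chunks)

   table = {"00101110": "d", "10000110": "a"}
   print(decode_fixed("00101110100001100010111010000110", table))   # "dada"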

Variable-length codes with a comma

The problem with using an encoding of the same size for every character is that it is not very efficient: letters that occur frequently, like "e", take as many bits as letters that are almost never seen, like "z".

It is much more efficient to use code words of different lengths for different letters. That way, the letter "e" can be assigned a short code word like "01", and "z" a long one like "01011010101".

One way to do this is to assign a code word to each letter according to its probability of occurrence in the message: the more frequent the letter, the shorter its code word.

As an example, if we used a custom code:

 0 a
 1 d
 01 space


then the phrase "add a dad" could be compressed to

 0,1,1,01,0,01,1,0,1

Normally this sentence would take 72 bits to store, because each of its nine characters takes eight bits in ASCII. With our code it takes the equivalent of about 31 bits: the encoded message consists of 19 symbols (11 binary digits plus 8 commas), each of which can take one of three values, and 19 × log2(3) ≈ 30.1 bits. We have achieved some shortening of the message by our choice of encoding!
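
The arithmetic behind that figure can be checked directly; a quick sketch (the log-base-2 conversion is the standard information cost of a three-valued symbol):

   import math

   message = "0,1,1,01,0,01,1,0,1"            # "add a dad" in the comma code
   digits = len(message.replace(",", ""))     # 11 binary digits
   commas = message.count(",")                # 8 commas
   symbols = digits + commas                  # 19 three-valued symbols
   print(symbols * math.log2(3))              # about 30.1, i.e. 31 bits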

Morse code is an everyday example of a variable-length code with a comma. The long spaces between letters, and even longer spaces between words, help people recognize where one letter/word ends, and the next begins.

The problem with codes that use commas is that if we try to achieve further compression by removing the commas, the message becomes ambiguous. Consider:

 01101001101

Does a "0" followed by a "1" represent a space character, or two different letters?

The ambiguity arises because one complete code word (in this case "0", for "a") is just the first part -- the prefix -- of another code word (in this case "01", for space), as the sketch below demonstrates.
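
A short Python sketch (the function name parses is ours, for illustration) that enumerates every way of splitting the stream into code words finds more than one reading:

   def parses(bits, code, prefix=()):
       # Yield every way of splitting bits into code words from code.
       if not bits:
           yield prefix
       for word, letter in code.items():
           if bits.startswith(word):
               yield from parses(bits[len(word):], code, prefix + (letter,))

   code = {"0": "a", "1": "d", "01": " "}
   readings = ["".join(r) for r in parses("01101001101", code)]
   print(len(readings))    # more than one reading: the stream is ambiguous
   print(readings[:3])

This brings us to the next section.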

Variable-length comma-free codes

It is possible to design a variable-length code such that there is never any ambiguity. Such a specially designed code is called a "prefix code" or a "prefix-free code".

The question quickly arises whether it is possible to design an optimal variable-length code, i.e. a code that compresses a transmission into the fewest possible bits.

If one knows ahead of time all the letters that could possibly be used, and has a good estimate of the letter frequencies, then it is indeed possible to construct such a code. Codes built this way are called Huffman codes. It has been shown that every other code that assigns one code word per letter, whether variable-length or fixed-length, uses at least as many bits as a Huffman code.
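
A compact sketch of the Huffman construction, using Python's heapq module (the frequency table is invented for illustration):

   import heapq

   def huffman(freqs):
       # Repeatedly merge the two least frequent subtrees; a letter's code
       # word is the sequence of 0/1 branch choices down to its leaf.
       heap = [(f, i, {letter: ""}) for i, (letter, f) in enumerate(freqs.items())]
       heapq.heapify(heap)
       count = len(heap)              # tie-breaker so dicts are never compared
       while len(heap) > 1:
           f1, _, c1 = heapq.heappop(heap)
           f2, _, c2 = heapq.heappop(heap)
           merged = {k: "0" + v for k, v in c1.items()}
           merged.update({k: "1" + v for k, v in c2.items()})
           heapq.heappush(heap, (f1 + f2, count, merged))
           count += 1
       return heap[0][2]

   print(huffman({"a": 3, "d": 4, " ": 2}))   # {'d': '0', ' ': '10', 'a': '11'}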

Usually the Huffman process generates a variable-length code, but this isn't always the case. For example, when all the letters have the same frequency (as in previously compressed or encrypted data) and the number of code words is a power of 2, the Huffman process will generate a fixed-length code. Even in this case, the code is still optimal.

Non-codes

Some data compression algorithms can compress files to fewer bits than Huffman coding can. Generally this is because they don't use a code at all:

  • They may represent "a" by one pattern of bits in one place in the compressed file, then use that same pattern of bits to represent a completely different letter later on, as in adaptive Huffman compression.
  • They may use a pattern of bits to represent several letters, as in LZW compression -- changing any one of those bits may completely change that block of letters.
  • Or they may avoid mapping particular bits to particular letters (the definition of a code) in other creative ways, as in range encoding.

Existence of prefix codes / Kraft's inequality

If the code alphabet contains a fixed number of symbols r, then for any given list of code word lengths l1, l2, …, ln, a prefix code with those lengths exists if and only if

   r^(-l1) + r^(-l2) + … + r^(-ln) ≤ 1.

This is known as Kraft's inequality.
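
A sketch that checks the inequality exactly (using rationals to avoid floating-point error) and, for the binary case r = 2, builds a prefix code with the requested lengths via the classic greedy construction:

   from fractions import Fraction

   def kraft_holds(lengths, r=2):
       # Kraft's inequality: the sum of r**(-l) over all lengths is at most 1.
       return sum(Fraction(1, r ** l) for l in lengths) <= 1

   def build_prefix_code(lengths):
       # Greedy construction (binary case): take the next free value at
       # each length, shifting left when moving to a longer length.
       words, value, prev = [], 0, 0
       for l in sorted(lengths):
           value <<= l - prev
           words.append(format(value, "0{}b".format(l)))
           value += 1
           prev = l
       return words

   print(kraft_holds([1, 2, 2]))            # True:  1/2 + 1/4 + 1/4 = 1
   print(kraft_holds([1, 1, 2]))            # False: 1/2 + 1/2 + 1/4 > 1
   print(build_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']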

Uses in error handling

Many communication systems are not completely error-free, and occasional single-bit errors occur (a toggled bit, a lost bit, or a gained bit).

With fixed-length codes, an error that toggles a bit causes just that one code word to be received in error; all other code words are received correctly. However, with variable-length codes, losing or gaining a bit (a framing error) turns the rest of the message into gibberish. (This is why most communication protocols periodically re-synchronize; ASCII over RS-232, for example, uses 20% of its bandwidth re-synchronizing after each character.) See Synchronization and Link protocol.

With Fibonacci codes and unary codes, a single-bit error causes only one or two erroneous code words, and all other code words are received correctly (these codes are "self-synchronizing").

With most other variable-length codes, any kind of single-bit error turns the rest of the message into gibberish.
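
The effect of a framing error is easy to simulate. In this sketch (the four-letter code table is invented), dropping the first bit of a prefix-coded stream mis-frames the letters that follow:

   code = {"0": "e", "10": "t", "110": "a", "111": "z"}   # an invented prefix code

   def decode(bits):
       out, word = [], ""
       for bit in bits:
           word += bit
           if word in code:
               out.append(code[word])
               word = ""
       return "".join(out) + ("?" if word else "")   # "?" marks a dangling tail

   stream = "10" + "0" + "110" + "0" + "10"   # encodes "teaet"
   print(decode(stream))        # "teaet" -- received intact
   print(decode(stream[1:]))    # "eeaet" -- the lost bit reframes what follows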
