Kolmogorov complexity

'''Algorithmic information theory''' is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or '''Kolmogorov complexity''') of a [[string]] as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to [[Goedels Incompleteness Theorem|Gödel's incompleteness theorem]] and [[halting problem|Turing's halting problem]].
 
 
 
The field was developed by [[Kolmogorov|Andrey Kolmogorov]] and [[Gregory Chaitin]] starting in the late 1960s.
 
 
 
To formalize the above definition of complexity, one has to specify exactly what types of programs are allowed.
Fortunately, it doesn't really matter: one could take a particular notation for [[Turing machine]]s, or [[lambda calculus]], or [[recursive function]]s, or [[LISP]] programs, or [[Pascal programming language|Pascal]] programs, or [[Java virtual machine]] bytecode.
 
If we agree to measure the lengths of all objects consistently in [[bit]]s, then the resulting notions of complexity will only differ by a constant: if ''I''<sub>1</sub>(''s'') and ''I''<sub>2</sub>(''s'') are the complexities of the string ''s'' according to two different programming languages ''L''<sub>1</sub> and ''L''<sub>2</sub>, then there are constants ''C'' and ''D'' (which only depend on the languages chosen, but not on ''s'') such that for all strings ''s'':
:''I''<sub>1</sub>(''s'') &le; ''I''<sub>2</sub>(''s'') + ''C'' + ''D''.
 
Here, ''D'' is the length of an interpreter for ''L''<sub>2</sub> written in ''L''<sub>1</sub>, and ''C'' describes the overhead of storing programs written in ''L''<sub>2</sub> as part of programs in ''L''<sub>1</sub>. Exchanging the roles of the two languages gives a bound in the other direction, so the two notions of complexity indeed differ by at most a constant.
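As a toy illustration of the constants ''C'' and ''D'' (a sketch in Python, which plays the role of ''L''<sub>1</sub>; the run-length mini-language standing in for ''L''<sub>2</sub> and all names below are invented for this example):

```python
import io
from contextlib import redirect_stdout

# Toy illustration: Python plays the role of L1; a made-up run-length
# mini-language ("3a2b" -> "aaabb") plays the role of L2.
interpreter = (                      # an interpreter for L2, written in L1
    "import re\n"
    "def run(p):\n"
    "    return ''.join(int(n) * c for n, c in re.findall(r'(\\d+)(\\D)', p))\n"
)

p2 = "3a2b"                          # an L2-program whose output is "aaabb"

# An equivalent L1-program: the interpreter plus the stored L2-program.
p1 = interpreter + "print(run(" + repr(p2) + "), end='')\n"

buf = io.StringIO()
with redirect_stdout(buf):
    exec(p1, {})
assert buf.getvalue() == "aaabb"     # both programs produce the same string

# I1(s) <= I2(s) + C + D:  D is the interpreter's length, and C bounds the
# overhead of storing p2 inside p1 (the print wrapper plus quoting).
D = len(interpreter)
C = len("print(run(), end='')\n") + 2
assert len(p1) <= len(p2) + C + D
```

Neither constant depends on the particular ''L''<sub>2</sub>-program, only on the two languages.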
 
 
 
 
 
In the following, we will fix one definition and simply write ''I''(''s'') for the complexity of the string ''s''.
 
The first surprising result is that ''I''(''s'') cannot be computed: there is no general [[algorithm]] which takes a string ''s'' as input and produces the number ''I''(''s'') as output. The proof is a formalization of the amusing [[Berry paradox]]: "Let ''n'' be the smallest number that cannot be defined in less than twenty English words. Well, I just defined it in less than twenty English words."
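The core of the argument can be sketched in code (Python is used only for illustration, and the `kolmogorov` parameter below is a hypothetical oracle): if ''I'' were computable, a short program could enumerate all strings and return the first one whose complexity exceeds a given bound ''N'' — yet that program itself needs only about log<sub>2</sub>(''N'') bits plus a constant, giving a too-short description of the returned string.

```python
from itertools import count, product

# Sketch of the Berry-paradox argument.  IF a procedure computing I(s)
# existed (`kolmogorov` is hypothetical), this short program would return
# the first string whose complexity exceeds N.  But the program itself has
# length about log2(N) + constant < N for large N: a contradiction.
def first_complex_string(N, kolmogorov):
    for length in count(1):                    # enumerate strings by length
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if kolmogorov(s) > N:              # would need the impossible oracle
                return s

# Demonstration with a stand-in "complexity" (the plain length |s|): the
# first string of length greater than 3 in the enumeration is "0000".
assert first_complex_string(3, len) == "0000"
```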
 
It is however straightforward to compute upper bounds for ''I''(''s''): simply [[compression|compress]] the string ''s'' with some method, implement the corresponding decompressor in the chosen language, concatenate the decompressor to the compressed string, and measure the resulting string's length.
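A rough bound of this kind can be computed with an off-the-shelf compressor (a Python sketch; the decompressor text below is only a stand-in for a real self-extracting program, with `DATA` a placeholder for the embedded compressed string):

```python
import zlib

# Upper bound on I(s): the compressed length plus the (constant) length of
# a decompressor written in the chosen language.  The decompressor source
# here is illustrative only; DATA stands for the embedded compressed string.
def complexity_upper_bound(s: bytes) -> int:
    decompressor = b"import sys, zlib; sys.stdout.buffer.write(zlib.decompress(DATA))"
    return len(zlib.compress(s, 9)) + len(decompressor)

s = b"ab" * 1000                     # a highly regular 2000-byte string
assert complexity_upper_bound(s) < len(s)
```

For regular strings the bound is far below |''s''|; for typical random strings it will exceed |''s''|, as the next result explains.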
 
The next important result is about randomness of strings. Most strings are complex in the sense that they cannot be significantly compressed: ''I''(''s'') is not much smaller than |''s''|, the length of ''s'' in bits. The precise statement is as follows: there is a constant ''K'' (which depends only on the particular specification of "program" used in the definition of complexity) such that for every ''n'', the [[probability]] that a random string ''s'' has complexity less than |''s''| - ''n'' is smaller than ''K'' 2<sup>-''n''</sup>. The proof is a counting argument: you count the programs and the strings, and compare. This theorem is the justification for Mike Goldman's challenge in the [http://www.faqs.org/faqs/compression-faq/ comp.compression FAQ]:
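The counting argument itself is elementary and can be checked numerically (a small Python sketch; the function name is ours):

```python
# The counting argument behind the theorem: there are 2**m strings of
# length m, but fewer than 2**(m - n) programs of length below m - n
# (each program outputs at most one string).  Hence at most a 2**(-n)
# fraction of the strings of length m can have complexity below m - n.
def fraction_compressible(m: int, n: int) -> float:
    programs = sum(2**k for k in range(m - n))   # all programs shorter than m - n bits
    return programs / 2**m

assert fraction_compressible(20, 5) < 2**-5
assert all(fraction_compressible(30, n) < 2**-n for n in range(1, 30))
```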
 
 
 
 
 
:''I will attach a prize of $5,000 to anyone who successfully meets this challenge. First, the contestant will tell me HOW LONG of a data file to generate. Second, I will generate the data file, and send it to the contestant. Last, the contestant will send me a decompressor and a compressed file, which will together total in size less than the original data file, and which will be able to restore the compressed file to the original state.''
 
 
 
:''With this offer, you can tune your algorithm to my data. You tell me the parameters of size in advance. All I get to do is arrange the bits within my file according to the dictates of my whim. As a processing fee, I will require an advance deposit of $100 from any contestant. This deposit is 100% refundable if you meet the challenge.''
 
 
 
Now for Chaitin's incompleteness result: though we know that most strings are complex in the above sense, the fact that a specific string is complex can never be proven (if the string's length is above a certain threshold). The precise formalization is as follows. Suppose we fix a particular consistent [[axiomatic system]] for the [[natural number]]s, say [[Peanos axioms|Peano's axioms]]. Then there exists a constant ''L'' (which only depends on the particular axiomatic system and the choice of definition of complexity) such that there is no string ''s'' for which the statement
 
:''I''(''s'') &ge; ''L''
 
can be proven within the axiomatic system (even though, as we know, the vast majority of those statements must be true). Again, the proof of this result is a formalization of Berry's paradox.
 
 
 
Similar ideas are used to prove the properties of [[Chaitins constant|Chaitin's constant]].
 
 
 
----
 
'''See also:'''
 
* [http://www.cs.umaine.edu/~chaitin/ Chaitin's online publications]