Criticism of the C programming language

This is an old revision of this page, as edited by Marc W. Abel (talk | contribs) at 17:50, 31 August 2006 (Standardization). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The C programming language is a very widely used programming language, minimalistic and low-level by design. Despite its popularity, C's characteristics have led to much criticism of the language.

Many beginning programmers have difficulty learning C's syntax and peculiarities, and even many expert programmers find C programs difficult to maintain and debug. A popular saying, repeated by such notable language designers as Bjarne Stroustrup, is that "C makes it easy to shoot yourself in the foot." [1] In other words, C permits many operations that are generally not desirable, and thus many simple programming errors are not detected by the compiler and may not even be readily apparent at runtime. This potentially leads to programs with unpredictable behavior and security holes, if sufficient care and discipline are not used in programming and maintenance.

The designers wanted to avoid compile- and run-time checks that were too expensive when C was first implemented. With time, external tools were developed to perform some of these checks. Nothing prevents an implementation from providing such checks, but nothing requires it to, either. The safe C dialect Cyclone addresses some of these concerns.

Kernighan and Ritchie made reference to the basic design philosophy of C in their response to criticism of C not being a strongly-typed language[1]: "Nevertheless, C retains the basic philosophy that programmers know what they are doing; it only requires that they state their intentions explicitly."[2]

Memory allocation

One issue to be aware of when using C is that automatically and dynamically allocated objects are not necessarily initialized (depending on what facility is used to allocate memory); they initially have an indeterminate value (typically whatever values are present in the memory space they occupy, which might not even be a legal bit pattern for that type). This value is highly unpredictable and can vary between two machines, two program runs, or even two calls to the same function. If the program attempts to use such an uninitialized value, the results are undefined. Many modern compilers try to detect and warn about this problem, but both false positives and false negatives occur.
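As a minimal sketch of how the choice of allocation facility decides whether an object starts with a defined value, the following C fragment contrasts calloc(), which zero-fills, with malloc() followed by explicit initialization. The helper names make_zeroed and make_filled are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns n ints that all read as 0, or NULL on failure.
   calloc() sets every byte to zero; malloc() would leave them indeterminate. */
int *make_zeroed(size_t n)
{
    return calloc(n, sizeof(int));
}

/* Hypothetical helper: allocates with malloc() and then gives every element
   a defined value before any read can take place. */
int *make_filled(size_t n, int value)
{
    if (n > SIZE_MAX / sizeof(int))   /* guard against size overflow */
        return NULL;
    int *p = malloc(n * sizeof *p);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        p[i] = value;
    return p;
}
```

The caller owns the returned memory in both cases and is expected to pass it to free(). The same principle applies to automatic variables: writing int x = 0; rather than int x; gives the object a defined value before its first use.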

Another common problem is that heap memory must be released manually, in step with its actual usage, if it is to be reused as much as possible. For example, if an automatic pointer variable goes out of scope or has its value overwritten while it still references an allocation that has not been released via a call to free(), then that memory cannot be recovered for later reuse and is essentially lost to the program, a phenomenon known as a memory leak. Conversely, it is possible to release memory too soon and, in some cases, continue to be able to use it. Since the allocation system can re-allocate the memory at any time for unrelated reasons, this results in unpredictable behavior, typically manifested in portions of the program far removed from the erroneously written segment. Such issues are ameliorated in languages with automatic garbage collection or RAII.
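One common discipline for avoiding both problems is to pair every allocation with exactly one release and to clear the pointer afterwards. The sketch below illustrates this under the assumption that a single pointer owns the allocation; the names copy_string and release are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src that the caller must
   eventually release, or NULL if the allocation fails. */
char *copy_string(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src) + 1;   /* include the terminating null */
    char *dst = malloc(len);
    if (dst != NULL)
        memcpy(dst, src, len);
    return dst;
}

/* Hypothetical helper: frees *pp and clears it, so that a later accidental
   use sees a null pointer instead of a dangling one. */
void release(char **pp)
{
    assert(pp != NULL);
    free(*pp);      /* free(NULL) is a no-op, so a second call is harmless */
    *pp = NULL;
}
```

Clearing the pointer protects only that one variable. Any other copy of the pointer still dangles after the release, which is why this convention reduces the problem rather than eliminating it.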

Pointers

Pointers are a primary source of potential danger. Because they are typically unchecked, a pointer can be made to point to any arbitrary location (even within code), causing unpredictable effects. Although properly-used pointers point to safe places, they can be moved to unsafe places using pointer arithmetic; the memory they point to may be deallocated and reused (dangling pointers); they may be uninitialized (wild pointers); or they may be directly assigned a value using a cast, union, or through another corrupt pointer. In general, C is permissive in allowing manipulation of and conversion between pointer types, although compilers typically provide options for various levels of checking. Other languages attempt to address these problems by using more restrictive reference types.

Arrays

Although C supports static arrays, it is not required that array indexes be validated (bounds checking). For example, one can write to the sixth element of an array with five elements, yielding generally undesirable results. This type of bug, called a buffer overflow, has been notorious as the source of a number of security problems. On the other hand, since bounds checking elimination technology was largely nonexistent when C was defined, bounds checking came with a severe performance penalty, particularly in numerical computation. By comparison, a few years earlier, Fortran compilers had a switch to toggle bounds checking on or off.
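Because the language does not validate indexes, any bounds check has to be written by the programmer. The following is a minimal sketch of such a check; checked_store is a hypothetical helper, not a standard library function, and the caller is assumed to pass the true length of the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checked store: writes value to a[i] only when i is in range.
   Returns true on success and false if the index is out of bounds. */
bool checked_store(int *a, size_t len, size_t i, int value)
{
    assert(a != NULL);
    if (i >= len)
        return false;   /* refuse the write instead of overflowing the array */
    a[i] = value;
    return true;
}
```

For an array of five elements, checked_store(a, 5, 4, 42) stores to the last element, while checked_store(a, 5, 5, 99) (the "sixth element") is rejected. The extra comparison on every access is the run-time cost referred to above.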

Multidimensional arrays are commonly used in numerical algorithms (mainly from applied linear algebra) to store matrices. The structure of the C array is well suited to this task, provided one remembers to count indices starting from 0 instead of 1. This issue is discussed in the book Numerical Recipes in C, chapter 1.2, page 20ff. That book also offers a solution based on negative indexing, which introduces other dangers. Starting indices at 0 has been assimilated into the computing culture, and is no longer as alien a notion as it seemed when C was first introduced.

Variadic functions

Another source of bugs is variadic functions, which take a variable number of arguments. Unlike other prototyped C functions, checking the types of arguments to variadic functions at compile-time is, in general, impossible without additional information. If the wrong type of data is passed, the effect is unpredictable, and often fatal. Variadic functions also handle null pointer constants in a way which is often surprising to those unfamiliar with the language semantics. For example, NULL must be cast to the desired pointer type when passed to a variadic function. The printf family of functions supplied by the standard library, used to generate formatted text output, has been noted for its error-prone variadic interface, which relies on a format string to specify the number and type of trailing arguments.

However, type-checking of variadic functions from the standard library is a quality-of-implementation issue; many modern compilers do type-check printf calls, producing warnings if the argument list is inconsistent with the format string. Even so, not all printf calls can be checked statically since the format string can be built at runtime, and other variadic functions typically remain unchecked.
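The need to cast NULL can be illustrated with a small sentinel-terminated variadic function. This is a sketch only: total_length is a hypothetical function, and it assumes every trailing argument is a char * with the list ended by a null pointer of that type.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variadic function: sums the lengths of its string arguments.
   The argument list must end with a null pointer of type char *. */
size_t total_length(const char *first, ...)
{
    size_t total = 0;
    va_list ap;

    va_start(ap, first);
    /* The compiler cannot verify that each trailing argument really is a
       char *; va_arg simply trusts the type named here. */
    for (const char *s = first; s != NULL; s = va_arg(ap, char *))
        total += strlen(s);
    va_end(ap);
    return total;
}
```

A call such as total_length("ab", "cde", (char *)NULL) returns 5. Writing a bare NULL as the sentinel is not portable, because NULL may be defined as a plain integer constant 0, which is passed as an int and need not have the size or representation of a pointer.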

Syntax

Although mimicked by many languages because of its widespread familiarity, C's syntax has often been targeted as one of its weakest points. For example, Kernighan and Ritchie say in the second edition of The C Programming Language, "C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better." Bjarne Stroustrup said of C++ (which is superficially similar to C): "Within C++, there is a much smaller and cleaner language struggling to get out. […] the C++ semantics is much cleaner than its syntax." [2] Some specific problems worth noting are:

  • A function prototype with an empty parameter list allows any set of parameters, a syntax problem introduced for backward compatibility with K&R C, which lacked prototypes.
  • Some questionable choices of operator precedence, as mentioned by Kernighan and Ritchie above, such as == binding more tightly than & and | in expressions like x & 1 == 0.
  • The use of the = operator, used in mathematics for equality, to indicate assignment. Ritchie made this syntax design decision consciously, based primarily on the argument that assignment occurs more often than comparison. However, as explained by computer scientist Damian Conway in his "Seven Deadly Sins of Introductory Programming Language Design": "Many students, when confronted with this operator, become confused as to the nature of assignment and its relationship to equality. […] [A different syntax] seems to evoke less confusion, [because it] reinforces the notion of procedural transfer of value, rather than transitive equality of value."[3]
  • The visual resemblance of the assignment and equality operators (= and ==) makes it easy to substitute one for the other, and C's weak type system permits each to be used in the context of the other without a compilation error (although some compilers produce warnings).[3] [4]
  • A lack of infix operators for complex objects, particularly for string operations, making programs which rely heavily on these operations difficult to read. The Lisp language, with no infix operators whatsoever, exhibits this problem to an even greater extent.
  • Heavy reliance on punctuation-based symbols even where this is arguably less clear, such as "&&" and "||" instead of "and" and "or," respectively. Some are also confused about the difference between bit-wise operators ("&" and "|") and logical operators ("&&" and "||").
  • Unintuitive declaration syntax, particularly for function pointers. In the words of language researcher Damian Conway speaking about the very similar C++ declaration syntax:

"Specifying a type in C++ is made difficult by the fact that some of the components of a declaration (such as the pointer specifier) are prefix operators while others (such as the array specifier) are postfix. These declaration operators are also of varying precedence, necessitating careful bracketing to achieve the desired declaration. Furthermore, if the type ID is to apply to an identifier, this identifier ends up at somewhere between these operators, and is therefore obscured in even moderately complicated examples (see Appendix A for instance). The result is that the clarity of such declarations is greatly diminished." (Ben Werther & Damian Conway. A Modest Proposal: C++ Resyntaxed. Section 3.1.1. 1996.)

Economy of expression[4]

One occasional criticism of C is that it can be concise to the point of being cryptic. A classic example that appears in K&R[5] is the following function to copy the contents of string t to string s:

void strcpy(char *s, char *t)
{
    while (*s++ = *t++);
}

In this example, t is a pointer to a null-terminated array of characters, s is a pointer to an array of characters. Every loop of the single while statement does the following:

  • Copies the character pointed to by t (initially set to point to the first character of the string to be copied) to the corresponding character position pointed to by s (initially set to point to the first character of the character array to be copied to)
  • Advances the pointers s and t to point to the next character. Note that the values of s and t can safely be changed, because they are local copies of the pointers to the corresponding arrays
  • Tests whether the character copied (the result of the assignment statement) is a null character signifying the end of the string. Note that the test could have been written "((*s++ = *t++) != '\0')" (where '\0' is the null character); however, in C, a Boolean test is actually a test for any non-zero value; consequently the test is true as long as the character is any character other than a string-terminating null
  • As long as the character is not a null, the condition is true, causing the while loop to repeat. (In particular, because the character copy occurs before the condition is evaluated, the final terminating null is guaranteed to be copied as well)
  • The repeatedly executed body of the while loop is an empty statement, signified by the semicolon (which despite appearances is not part of the while syntax). (It is not uncommon for the body of while or for loops to be empty.)

The above code is functionally equivalent to:

void strcpy(char *s, char *t)
{
    char aux;
    do {
        *s = *t;
        aux = *s;
        s++;
        t++;
    } while (aux != '\0');
}

In a modern optimising compiler, these two pieces of code produce identical assembly code, so the smaller code does not produce smaller output. In more verbose languages such as Pascal, a similar iteration would require several statements. For C programmers, the economy of style is idiomatic and leads to shorter expressions; for critics, being able to do too much with a single line of C code can lead to problems in comprehension.
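The equivalence of the two forms can be checked directly. In the sketch below both versions are renamed (copy_terse and copy_verbose, hypothetical names) so that they do not clash with the standard library's strcpy(), and the source parameter is const-qualified, which the K&R original is not.

```c
#include <assert.h>
#include <string.h>

/* The terse form, with the comparison against the null character made
   explicit and the empty loop body placed on its own line. */
void copy_terse(char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    while ((*s++ = *t++) != '\0')
        ;
}

/* The expanded form: one operation per statement. */
void copy_verbose(char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    char aux;
    do {
        *s = *t;
        aux = *s;
        s++;
        t++;
    } while (aux != '\0');
}
```

Copying "hello" into two eight-character arrays with each function leaves both arrays holding the same null-terminated string. Neither version checks the size of the destination, so the caller must ensure it is large enough.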

Internal consistency

Some features of C, its preprocessor, and its implementations are inconsistent with one another. For example, C has three distinct classes of non-wide string literals: one for program text, one for include filenames enclosed in quotation marks, and one for include filenames enclosed in angle brackets. The allowed symbol set, and its interpretation, is not consistent among the three.

Standardization

The C programming language was standardized by ANSI in 1989. Since then, ANSI and ISO have heavily revised the language, adding many requirements and features of questionable user demand, such as first trigraphs and now digraphs (to accommodate character sets without braces), hexadecimal floating point, and complex arithmetic. Microsoft, Borland, and even GNU have not written a conforming C99 compiler in the years following the standard's publication, yet the international standards committees continue to amend the language. Even some of C's devoted fans feel that C has gone the path of HTML.

The national and international standards for C make demands which are very difficult to defend technically; for example, conforming source files must end with a newline. Although an implementation can trivially append a newline to a translation unit if its parser relies on it, it is an error for the user to omit the newline, which compliant compilers will discard anyhow after a point the standards call "translation phase four". Similarly, it is an error for the final newline of a file to be preceded with a backslash, notwithstanding the fact both symbols will be deleted upon detection during "translation phase one".

Maintenance

There are other problems in C that don't directly result in bugs or errors, but make it harder for programmers to build a robust, maintainable, large-scale system. Examples of these include:

  • A fragile system for importing definitions (#include) that relies on literal text inclusion and redundantly keeping prototypes and function definitions in sync, and drastically increases build times.
  • A cumbersome compilation model that forces manual dependency tracking and inhibits compiler optimizations between modules (except by link-time optimization).
  • A weak type system that lets many clearly erroneous programs compile without errors.

Tools for mitigating issues with C

Tools have been created to help C programmers avoid these problems in many cases.

Automated source code checking and auditing are beneficial in any language, and for C many such tools exist, such as Lint. A common practice is to use Lint to detect questionable code when a program is first written. Once a program passes Lint, it is then compiled using the C compiler.

There are also compilers, libraries and operating system level mechanisms for performing array bounds checking, buffer overflow detection, and automatic garbage collection, that are not a standard part of C.

Many compilers, most notably Visual C++, deal with the long compilation times inflicted by header file inclusion using precompiled headers, a system where declarations are stored in an intermediate format that is quick to parse. Building the precompiled header files in the first place is expensive, but this is generally done only for system header files, which are larger and more numerous than most application header files and also change much less often.

Cproto is a program that reads a C source file and outputs prototypes of all the functions within it. This program can be used in conjunction with the "make" command to create new files containing prototypes each time the source file has been changed. These prototype files can be included by the original source file (e.g., as "filename.p"), which reduces the problem of keeping prototypes and function definitions in agreement.

It should be recognized that these tools are not a panacea. Because of C's flexibility, some types of errors involving misuse of variadic functions, out-of-bounds array indexing, and incorrect memory management cannot be detected on some architectures without incurring a significant performance penalty. However, some common cases can be recognized and accounted for.

Footnotes

  1. ^ Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., Prentice Hall, 1988, p. 3.
  2. ^ Dennis Ritchie. "The Development of the C Language". Retrieved 2006-07-26.
  3. ^ For example, the conditional expression if (a=b) is true only if b is nonzero.
  4. ^ The heading of this section is borrowed from the first sentence of the preface to the first edition of Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, reprinted in 2nd ed., p. xi.
  5. ^ Brian W. Kernighan and Dennis M. Ritchie: The C Programming Language, 2nd ed., p. 106. Note that this example writes past the end of the destination if the string in t is longer than the array s can hold; the library function strncpy() limits the number of characters copied.