Schlemiel the Painter's algorithm: Difference between revisions

Content deleted Content added
Wookie2u (talk | contribs)
m +{{Redirect category shell}} for multiple-{{R}} #Rs using AWB
 
(81 intermediate revisions by 41 users not shown)
Line 1:
#REDIRECT [[Joel Spolsky#Schlemiel the Painter's algorithm]]
{{essay-like}}{{unencyclopedic}}
{{underconstruction}}
 
{{Redirect category shell|1=
== Introduction ==
{{R from Merge}}
 
{{R to Section}}
In computer science "Schlemiel The Painter's Algorithm" is a name for any algorithm which repeatedly reprocesses content. This '''re'''processing is extremely inefficient. The problem most commonly appears in string and buffer processing code.
}}
 
== So who was Schlemiel? ==
 
So who was Schlemiel The Painter? Well, Schlemiel got a job painting the white line down the middle of the road. On his first day Schlemiel painted miles of white line, and his boss rejoiced. On his second day, Schlemiel was "a bit too slow". By the end of the third day Schlemiel was going nowhere. On the morning of Schlemiel's fourth day the boss called Schlemiel into the office for some "counselling".
 
The boss said "Schlemiel, I think you are good worker. You are bright and you work hard. So why is that the longer you work the less you get done?". "Well" Schlemiel replied "yesterday afternoon it took me two hours to run from my paint tin to the end of the line, and two hours to run back again to replenish my brush, so I knocked off and went home." The boss thought for a few seconds, and managed to say (quit calmly) "Schlemiel '''you need to take your paint with you'''."
 
== The Problem ==
 
Repeated calls to [[ANSI_C|ANSI-C]]'s [[strcat]] function is a classic example of Schlemiel's algorithm.
 
An implementation of ''strcat'' might look something like this:
<source lang="c">
char *strcat(char *dest, const char *src) {
// remember the start of the destination string
char *startOfDest = dest;
// make dest point at the null terminator at the end of the destination string
while(*dest) { // <-- EXPENSIVE
dest++;
}
// append the source string to the destination string
while(*src) {
dest++ = src++;
}
// null terminate the destination string
dest = '\0';
// return the start of the destination string
return startOfDest;
}
</source>
 
The ''strcat'' function looks innocious enough, and it is, as long as you don't call it '''repeatedly''' to build a large string.
 
So lets examine an example of how '''not''' to use strcat:
<source lang="c">
// builds a 1Meg string of "word, word, word, ....";
void testPerformanceOfStrcat() {
final int BUFFER_SIZE = 1024*1024;
char word[] = "word";
char buffer[BUFFER_SIZE] = "";
while( strlen(buffer) < BUFFER_SIZE-strlen(word) ) { // <-- LUDICROUSLY EXPENSIVE
if(*buffer) strcat(buffer, ", "); // <-- LUDICROUSLY EXPENSIVE
strcat(buffer, word); // <-- LUDICROUSLY EXPENSIVE
}
}
</source>
 
The estimated cost of testPerformanceOfStrcat is:
* NOTE: I am ignoring "ancillary expenses" like the cost of making a function call.
* The number of times through the loop is int(1024*1024/6) = 174,762 // strlen("word, ")==6
* The cost of ''strlen(buffer)'' in the ''while'' condition is [[Fibonacci_number|fib]](174,762) = 15,270,965,703
* The cost of finding the end of the buffer (again) in ''strcat(buffer, ", ")'' is 15,270,965,703 (again).
* The cost of finding the end of the buffer (again) in ''strcat(buffer, word)'' is 15,270,965,703 (again).
* So the TOTAL COST is 15,270,965,710 * 3 = 45,812,897,130 or about 45 billion clock ticks.
* 45 billion cycles at 2 billion cycles per second (my CPU is 2.01GHz) is about 22 seconds.
* 22 seconds to build a buffer! Ouch !!!!! !!!!! !!!!! !!!!! !!
 
== The Solution ==
 
Yep, Schlemiel would be well advised to "take your paint with you"... So how can we apply that principle to ''strcat''? Easy, we just have to remember the end of the string, to avoid that expensive seek every time we append to it. First lets write another version of ''strcat'' called ''szcat'' which returns the-new-end-of-the-destination-string (note: the mnemonic ''sz'' denotes a null terminated string).
 
<source lang="c">
char *szcat(char *dest, const char *src) {
// make dest point at the null terminator at the end of the destination string
while(*dest) { // <-- EXPENSIVE, SO AVOID UYSING IT REPEATEDLY.
dest++;
}
// append the source string to the destination string
while(*src) {
dest++ = src++;
}
// null terminate the destination string
dest = '\0';
// return a pointer to the null terminator at the end of the destination string
return dest;
}
</source>
 
Our new ''szcat'' function is basically the same, it just returns the-new-end-of-the-destination-string instead of the-start-of-the-destination-string... So what? Well so the '''next time''' we call ''szcat'' we can pass it the-end-of-the-destination-string, so ''szcat'' doesn't have to waste time finding it. Pretty clever huh!
 
Now lets write a test harness to examine ''szcat'''s (hopefully improved) performance.
 
<source lang="c">
// builds a 1Meg string of "word, word, word, ...."
char *testPerformanceOfSzcat() {
final int BUFFER_SIZE = 1024*1024;
char word[] = "word";
char buffer[BUFFER_SIZE] = "";
char *last = buffer+BUFFER_SIZE-strlen(word)-strlen(", "); // <-- Do this once!
char *endOfBuffer = szcat(buffer, word); // <-- Do this once!
while( endOfBuffer < last ) { // <-- 15 billion times cheaper
endOfBuffer = strcat(endOfBuffer, word); // <-- 30 billion times cheaper
}
}
</source>
 
The estimated cost of testPerformanceOfStrcat is:
* The number of times through the loop is still int(1024*1024/6) = 174,762 //strlen("word, ")=6
* We've eliminated the overhead of finding the-end-of-the-destination-string in the ''while'' condition, so we're down to one operation per loop = 174,762 clock ticks.
* We've eliminated the overhead of finding the-end-of-the-destination-string in ''strcat(endOfBuffer, word)'', which leaves is with just the cost of copying the string, which is 6 cycles per loop = 174,762 * 6 = 1,048,572
* about 1.2 million (with an m) cycles at 2 billion cycles per second (my CPU speed is 2.01GHz) is about a tenth of a second.
* That's more like it!
 
== Conclusion ==
 
"Schlemiel The Painter" algorithms are a common cause of slow unscalable software implementations. "Take your paint with you" is up there in the panthion of things every budding computer scientist needs to know, along with [[KISS_principle|the KISS principle]], and [http://www.joelonsoftware.com/articles/Unicode.html|character-encoding].