When Optimization Pays Off
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil” – Don Knuth
If you are doing image processing, you are firmly in that remaining 3%. I had some code that I had factored into two major cases: 1 bit per pixel and n bits per pixel (where n is a multiple of 8). Each chunk of code was more or less parallel in that it did something along the lines of this:
bool PredicateInRow(BYTE *imagePtr, int x, int bytesPerPixel, fPredicate predicate, void *predicateData)
{
BYTE *pixelPtr = imagePtr + x * bytesPerPixel;
return predicate(pixelPtr, bytesPerPixel, predicateData);
}
void ProcessRange(BYTE *imagePtr, int height, int width, int bytesPerPixel, int bytesPerRow, fPredicate predicate, void *predicateData)
{
for (int y=0; y < height; y++) {
for (int x=0; x < width; x++) {
if (PredicateInRow(imagePtr, x, bytesPerPixel, predicate, predicateData)) {
// perform processing
}
}
imagePtr += bytesPerRow;
}
}
This does some image processing across the whole image based on a predicate using the current pixel’s data. The predicate has to be vectored out to other code instead of being done inline (technically, there are other ways I can do this, but way of doing it avoids a lot of duplication of effort). I liked the factoring in that it made the code fairly easy to read, but I knew when I wrote it that I would regret this decision.
After noting that client code of this routine was slower than I preferred it to be, I fired up a profiler and found that PredicateInRow was chewing up a fair amount of the total time and it was being called 50 million times over the comparatively short life of my test application.
I refactored it to this instead:
void ProcessRange(BYTE *imagePtr, int height, int width, int bytesPerPixel, int bytesPerRow, fPredicate predicate, void *predicateData)
{
for (int y=0; y < height; y++) {
BYTE *pixelPtr = imagePtr;
for (int x=0; x < width; x++) {
if (predicate(pixelPtr, bytesPerPixel, predicateData)) {
// perform processing
}
pixelPtr += bytesPerPixel;
}
imagePtr += bytesPerRow;
}
}
I’ve now lost the PredicateInRow routine, took a hit in readability but the code runs twice as fast now that I eliminated 50 million function calls and 50 million multiplies.
I should note that I tried inlining PredicateInRow which resulted in a negligible increase in performance. In my actual code, the structure of ProcessRange is much more complex – there are three small state machines that run inside of the y-based for loop and each state machine ended up looking at rows above and below in addition to the current row. The refactoring was more complicated than the example above, which resulted in two bugs, both of which I found by reading the code closely.
The code of a multiply is tiny, but when you multiply it by 50 million, it becomes significant. When I wrote this, I knew that the per-pixel accessing was going to slow it down. Nonetheless, I took the high road of “make it work, then make it fast(er)”.
If need be, I will change the code the work with specific byte per pixel values, removing the predicate call entirely, but that will take a 2x duplication of code (1-bit and n-bit) and turn it into a pentiplication (1 bit, 8 bit, 16 bit, 24 bit, 32 bit)