On Wed, 2007-08-29 at 22:15 -0300, bulia byak wrote:
> Hey, I'm a little jealous :) For the last few versions I've been fighting hard to gain each single % of speed, and now you come and make it so much faster without even suspecting it :)
Cleaning up old code pays. :)
> I noticed you replaced address assignments with memcpy in the compose function, and I think that's where the speedup comes from, even if the memcpy only copies 4 bytes in each run. Perhaps it's getting replaced with some optimized single-tick operation.
Unfortunately I realized that memcpy isn't correct for the premultiplied case, which is probably the most common. I've just committed a fix for that issue which does the required premultiplication when the target is supposed to be premultiplied.
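To illustrate the difference, here is a minimal sketch, not the committed code. The names premul() and copy_pixel() are made up for this example, and it assumes 8-bit RGBA pixels with alpha in the last byte:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scale an 8-bit channel by an 8-bit alpha, rounding
 * to the nearest integer. */
static inline uint8_t premul(uint8_t c, uint8_t a)
{
    return (uint8_t)((c * a + 127) / 255);
}

/* Hypothetical pixel copy from a non-premultiplied RGBA source.
 * A raw 4-byte memcpy is only right when the destination is also
 * non-premultiplied; otherwise each color channel has to be scaled
 * by alpha on the way in. */
static inline void copy_pixel(uint8_t *dst, const uint8_t *src,
                              int dst_premultiplied)
{
    if (dst_premultiplied) {
        dst[0] = premul(src[0], src[3]);
        dst[1] = premul(src[1], src[3]);
        dst[2] = premul(src[2], src[3]);
        dst[3] = src[3];            /* alpha itself is copied unchanged */
    } else {
        memcpy(dst, src, 4);
    }
}
```

With a half-transparent source pixel the two paths give visibly different color bytes, which is why the plain memcpy was wrong for premultiplied targets.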
I've not had a chance to create a good test file, but I suspect we'll still come out a little bit ahead even with the fix: the big wins are likely due to inlining and the reduction of branches in the inner loop, rather than optimization of memcpy. I guess we'll find out, though.
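By "reduction of branches in the inner loop" I mean something along these lines. This is only a rough sketch with invented names (scale_by_alpha, copy_row_branchy, copy_row_hoisted), assuming 8-bit RGBA rows; the real compose code is organized differently:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: channel * alpha / 255 with rounding. */
static inline uint8_t scale_by_alpha(uint8_t c, uint8_t a)
{
    return (uint8_t)((c * a + 127) / 255);
}

/* The format test is re-evaluated for every pixel. */
void copy_row_branchy(uint8_t *dst, const uint8_t *src, size_t n,
                      int premultiply)
{
    for (size_t i = 0; i < n; i++, dst += 4, src += 4) {
        if (premultiply) {
            dst[0] = scale_by_alpha(src[0], src[3]);
            dst[1] = scale_by_alpha(src[1], src[3]);
            dst[2] = scale_by_alpha(src[2], src[3]);
            dst[3] = src[3];
        } else {
            memcpy(dst, src, 4);
        }
    }
}

/* Same result, but the test is made once per row, so the inner loop
 * has no data-independent branch left and the helper can be inlined. */
void copy_row_hoisted(uint8_t *dst, const uint8_t *src, size_t n,
                      int premultiply)
{
    if (!premultiply) {
        memcpy(dst, src, n * 4);    /* whole row in one call */
        return;
    }
    for (size_t i = 0; i < n; i++, dst += 4, src += 4) {
        dst[0] = scale_by_alpha(src[0], src[3]);
        dst[1] = scale_by_alpha(src[1], src[3]);
        dst[2] = scale_by_alpha(src[2], src[3]);
        dst[3] = src[3];
    }
}
```

Whether a compiler does this hoisting on its own depends on the compiler and flags, so measuring with a real test file is still the only way to know where the time goes.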
-mental