All,

I have ported the IIR filter implementation to use SSE intrinsics and have achieved a 2.5x speedup so far. But I'm having accuracy problems for large blurs due to using single precision instead of double.

I've attached examples outputs that blurs a box with several sizes. For sigma=120, the annoying mach  bands start getting noticeable. At sigma=200, it looks very bad, and at 350, you might as well use it as a sharpen filter :)

This is the first graphics application I've seen where single precision isn't enough. Is the reason due to the recursive behavior of IIR (more accumulation of error) ?

Basically, for large blurs (stdev > 100), 24 bits of precision isn't sufficient and you end up getting very annoying mach bands.  For sigma = 100, the filter coefficients are:

b[0] = 0.00000387430191040
b[1] = 2.96391582489013672
b[2] = -2.92843437194824219
b[3] = 0.96451467275619507

If I add b[0] to the other terms last (b[1] + b[2] + b[3]) + b[0],    (earlier, when I added the b[0] term to b[1], the accuracy was even worse as expected), the alignment of the operands in binary is approximately:

1.00000000000000000000000
0.00000000000000000100000

meaning the addition loses a lot of the b[0] term's precision.

Can anyone suggest how to proceed next and how useful you think this speedup is? (for me it's a moderate improvement because in my comic, I use marble and brick textures for the walls and floors - I also vectorized the turbulence generator).

I would like to avoid using double precision since SSE supports only 2 doubles per operation instead of 4 floats.

The things I thought about of trying next are:
1. Use int32 arithmetic  (that would add 9 extra bits of precision)
2. Use 4 taps instead of 3. Would this really help? I don't know much about IIR filter design. Can someone give me the coefficient formula for 4 taps?

Thanks for helping,

Yale