All,
I have ported the IIR filter implementation to use SSE intrinsics and have achieved a 2.5x speedup so far. But I'm having accuracy problems for large blurs due to using single precision instead of double.
I've attached examples outputs that blurs a box with several sizes. For sigma=120, the annoying mach bands start getting noticeable. At sigma=200, it looks very bad, and at 350, you might as well use it as a sharpen filter :)
This is the first graphics application I've seen where single precision isn't enough. Is the reason due to the recursive behavior of IIR (more accumulation of error) ?
Basically, for large blurs (stdev > 100), 24 bits of precision isn't sufficient and you end up getting very annoying mach bands. For sigma = 100, the filter coefficients are:
b[0] = 0.00000387430191040
b[1] = 2.96391582489013672
b[2] = -2.92843437194824219
b[3] = 0.96451467275619507
If I add b[0] to the other terms last (b[1] + b[2] + b[3]) + b[0], (earlier, when I added the b[0] term to b[1], the accuracy was even worse as expected), the alignment of the operands in binary is approximately:
1.00000000000000000000000
0.00000000000000000100000
meaning the addition loses a lot of the b[0] term's precision.
Can anyone suggest how to proceed next and how useful you think this speedup is? (for me it's a moderate improvement because in my comic, I use marble and brick textures for the walls and floors - I also vectorized the turbulence generator).
I would like to avoid using double precision since SSE supports only 2 doubles per operation instead of 4 floats.
The things I thought about of trying next are:
1. Use int32 arithmetic (that would add 9 extra bits of precision)
2. Use 4 taps instead of 3. Would this really help? I don't know much about IIR filter design. Can someone give me the coefficient formula for 4 taps?
Thanks for helping,
Yale