Faster gaussian blur

2 Dec 2006

I have modified Inkscape's blur filter to operate in linear time using a 
method that gives a very good approximation of a gaussian blur using a 
small IIR filter (falling back on a heavily modified version of the old 
code for very small kernel sizes, as the IIR filter doesn't handle those 
well). This means that the time needed to compute a blur doesn't depend 
on the size of the kernel (except for the fact that larger borders are 
needed for larger kernels).
I'm posting it here for feedback, if people think this is a good idea 
I'll clean up some last things and submit it to the patch tracker. Also, 
I wouldn't mind some testing, I only tested with a simple image with a 
blurred rectangle and gallardo.svgz (the latter has quite a nice 
collection of blurs).
When rendering gallardo.svgz using the new code cuts the total time 
spent in FilterGaussian::render roughly in half (on my computer it is 
30-40%, so 50% should be quite attainable), compared to the SVN 
implementation using average(!) quality, while being virtually 
indistinguishable from best(!) quality rendering. I have seen some 
differences on certain test drawings in my test application, but for a 
rendering of gallardo.svgz the maximum difference was 1 and didn't occur 
often (and could be due to slightly different rounding, numerical 
errors, etc.).
In my patch I completely eliminated the old resampling stuff, but I 
could also imagine keeping it to let people speed up gaussian blur even 
more.
Some things I know of that could be done to make it faster/smaller/etc.:
  - Currently it uses either the IIR filter or the FIR filter for both 
the X and Y directions, which could in theory result in bad performance 
on blurs where the deviation is high in only one direction. There is no 
technical reason preventing this decision to be made separately for the 
X and Y directions though (I just haven't come around to doing that).
  - Code for the X and Y directions could be merged, as they are so 
similar that only the bounds and some strides differ (might have a 
negative impact on performance though, would have to be timed).
  - A fixed-point format could be used for the intermediate values in 
the IIR filter (I already experimented with this in a test application, 
it works quite well and is slightly faster, probably more so on older 
hardware).
  - Currently temporary storage is used to store the intermediate values 
for an entire row/column for the IIR filter. This isn't strictly 
necessary (but probably provides slightly better quality and without it 
the values would need to be clipped when storing them, at least in my 
tests).
BTW, I also made a small change to how the normal gaussian kernel is 
calculated which results in slightly higher quality blurs but can make 
direct comparison with the old code harder (the step size wasn't very 
accurate for small kernel sizes, resulting in a blur that could be quite 
visibly different from the correct one). So when evaluating the 
performance of the IIR filter it should be compared to the output of the 
FIR filter in the new code.
Index: src/display/nr-filter-gaussian.cpp
===================================================================

--- src/display/nr-filter-gaussian.cpp	(revision 13436)
+++ src/display/nr-filter-gaussian.cpp	(working copy)
@@ -6,13 +6,16 @@
  * Authors:
  *   Niko Kiirala <niko@...1267...>
  *   bulia byak
+ *   Jasper van de Gronde <th.v.d.gronde@...528...>
  *
  * Copyright (C) 2006 authors
  *
  * Released under GNU GPL, read the file 'COPYING' for more information
  */
+#include <algorithm>
 #include <cmath>
+#include <complex>
 #include <glib.h>
using std::isnormal;
@@ -24,6 +27,37 @@
 #include "libnr/nr-matrix.h"
 #include "prefs-utils.h"
+// IIR filtering method based on:
+// L.J. van Vliet, I.T. Young, and P.W. Verbeek, Recursive Gaussian Derivative Filters,
+// in: A.K. Jain, S. Venkatesh, B.C. Lovell (eds.),
+// ICPR'98, Proc. 14th Int. Conference on Pattern Recognition (Brisbane, Aug. 16-20),
+// IEEE Computer Society Press, Los Alamitos, 1998, 509-514.
+//
+// Using the backwards-pass initialization procedure from:
+// Boundary Conditions for Young - van Vliet Recursive Filtering
+// Bill Triggs, Michael Sdika
+// IEEE Transactions on Signal Processing, Volume 54, Number 5 - may 2006 
+
+// Number of IIR filter coefficients used. Currently only 3 is supported.
+// "Recursive Gaussian Derivative Filters" says this is enough though (and
+// some testing indeed shows that the quality doesn't improve much if larger
+// filters are used).
+const size_t N = 3;
+
+template<typename InIt, typename OutIt, typename Size>
+void copy_n(InIt beg_in, Size N, OutIt beg_out) {
+    std::copy(beg_in, beg_in+N, beg_out);
+}
+
+typedef double Value; // Type used for filter coefficients (can be 10.21 fixed point)
+template<typename T> inline T sqr(T const v) { return v*v; }
+
+template<typename Tt, typename Ts>
+Tt round_cast(Ts const& v) {
+	static Ts const rndoffset(.5);
+	return static_cast<Tt>(v+rndoffset);
+}
+
 namespace NR {
FilterGaussian::FilterGaussian()
@@ -45,211 +79,39 @@
 {
     int length_x = _effect_area_scr_x(trans);
     int length_y = _effect_area_scr_y(trans);
-    return _max(length_x, length_y) * 2 + 1;
+    return std::max(length_x, length_y) + 1;
 }
void FilterGaussian::_make_kernel(double *kernel, double deviation, double expansion)
 {
-    double length = deviation * 3.0;
-    int scr_len = (int)std::floor(length * expansion);
-    if(scr_len < 1) scr_len = 1;
-    double d_sq = deviation * deviation * 2;
-    double step = length / scr_len;
+    int const scr_len = (int)std::ceil(deviation * 3.0 * expansion);
+    double const d_sq = sqr(deviation * expansion) * 2;
+    // Compute kernel and sum of coefficients
+    // Note that actually only half the kernel is computed, as it is symmetric
     double sum = 0;
-    for ( int i = 0; i < scr_len * 2 + 1 ; i++ ) {
-        double i_sq = (step * i - length) * (step * i - length);
-        sum += (std::exp(-i_sq / d_sq) / std::sqrt(M_PI * d_sq));
+    for ( int i = 0; i <= scr_len ; i++ ) {
+        kernel[i] = std::exp(-sqr(i) / d_sq);
+        sum += kernel[i];
     }
+    sum = 2*sum - kernel[0]; // the sum of the complete kernel is twice as large (minus the center element to avoid counting it twice)
-    for ( int i = 0; i < scr_len * 2 + 1 ; i++ ) {
-        double i_sq = (step * i - length) * (step * i - length);
-        kernel[i] = (std::exp(-i_sq / d_sq) / std::sqrt(M_PI * d_sq)) / sum;
+    // Normalize kernel
+    for ( int i = 0; i <= scr_len ; i++ ) {
+        kernel[i] /= sum;
     }
 }
int FilterGaussian::_effect_area_scr_x(Matrix const &trans)
 {
-    int ret = (int)std::floor(_deviation_x * 3.0 * trans.expansionX());
-    if(ret < 1) ret = 1;
-    return ret;
+    return (int)std::ceil(_deviation_x * 3.0 * trans.expansionX());
 }
int FilterGaussian::_effect_area_scr_y(Matrix const &trans)
 {
-    int ret = (int)std::floor(_deviation_y * 3.0 * trans.expansionY());
-    if(ret < 1) ret = 1;
-    return ret;
+    return (int)std::ceil(_deviation_y * 3.0 * trans.expansionY());
 }
-int FilterGaussian::_effect_subsample_step(int scr_len_x, int quality)
-{
-    switch (quality) {
-        case BLUR_QUALITY_WORST:
-            if (scr_len_x < 8) {
-                return 1;
-            } else if (scr_len_x < 32) {
-                return 4;
-            } else if (scr_len_x < 64) {
-                return 8;
-            } else if (scr_len_x < 128) {
-                return 32;
-            } else if (scr_len_x < 256) {
-                return 128;
-            } else if (scr_len_x < 512) {
-                return 512;
-            } else if (scr_len_x < 1024) {
-                return 4096; 
-            } else {
-                return 65536;
-            }
-            break;
-        case BLUR_QUALITY_WORSE:
-            if (scr_len_x < 16) {
-                return 1;
-            } else if (scr_len_x < 64) {
-                return 4;
-            } else if (scr_len_x < 120) {
-                return 8;
-            } else if (scr_len_x < 200) {
-                return 32;
-            } else if (scr_len_x < 400) {
-                return 64;
-            } else if (scr_len_x < 800) {
-                return 256;
-            } else if (scr_len_x < 1200) {
-                return 1024; 
-            } else {
-                return 65536;
-            }
-            break;
-        case BLUR_QUALITY_BETTER:
-            if (scr_len_x < 32) {
-                return 1;
-            } else if (scr_len_x < 160) {
-                return 4;
-            } else if (scr_len_x < 320) {
-                return 8;
-            } else if (scr_len_x < 640) {
-                return 32;
-            } else if (scr_len_x < 1280) {
-                return 64;
-            } else if (scr_len_x < 2560) {
-                return 256;
-            } else {
-                return 1024;
-            }
-            break;
-        case BLUR_QUALITY_BEST:
-            return 1; // no subsampling at all
-            break;
-        case BLUR_QUALITY_NORMAL:
-        default: 
-            if (scr_len_x < 16) {
-                return 1;
-            } else if (scr_len_x < 80) {
-                return 4;
-            } else if (scr_len_x < 160) {
-                return 8;
-            } else if (scr_len_x < 320) {
-                return 32;
-            } else if (scr_len_x < 640) {
-                return 64;
-            } else if (scr_len_x < 1280) {
-                return 256;
-            } else if (scr_len_x < 2560) {
-                return 1024; 
-            } else {
-                return 65536;
-            }
-            break;
-    }
-}
-
-int FilterGaussian::_effect_subsample_step_log2(int scr_len_x, int quality)
-{
-    switch (quality) {
-        case BLUR_QUALITY_WORST:
-            if (scr_len_x < 8) {
-                return 0;
-            } else if (scr_len_x < 32) {
-                return 2;
-            } else if (scr_len_x < 64) {
-                return 3;
-            } else if (scr_len_x < 128) {
-                return 5;
-            } else if (scr_len_x < 256) {
-                return 7;
-            } else if (scr_len_x < 512) {
-                return 9;
-            } else if (scr_len_x < 1024) {
-                return 12; 
-            } else {
-                return 16;
-            }
-            break;
-        case BLUR_QUALITY_WORSE:
-            if (scr_len_x < 16) {
-                return 0;
-            } else if (scr_len_x < 64) {
-                return 2;
-            } else if (scr_len_x < 120) {
-                return 3;
-            } else if (scr_len_x < 200) {
-                return 5;
-            } else if (scr_len_x < 400) {
-                return 6;
-            } else if (scr_len_x < 800) {
-                return 8;
-            } else if (scr_len_x < 1200) {
-                return 10; 
-            } else {
-                return 16;
-            }
-            break;
-        case BLUR_QUALITY_BETTER:
-            if (scr_len_x < 32) {
-                return 0;
-            } else if (scr_len_x < 160) {
-                return 2;
-            } else if (scr_len_x < 320) {
-                return 3;
-            } else if (scr_len_x < 640) {
-                return 5;
-            } else if (scr_len_x < 1280) {
-                return 6;
-            } else if (scr_len_x < 2560) {
-                return 8;
-            } else {
-                return 10;
-            }
-            break;
-        case BLUR_QUALITY_BEST:
-            return 0; // no subsampling at all
-            break;
-        case BLUR_QUALITY_NORMAL:
-        default:
-            if (scr_len_x < 16) {
-                return 0;
-            } else if (scr_len_x < 80) {
-                return 2;
-            } else if (scr_len_x < 160) {
-                return 3;
-            } else if (scr_len_x < 320) {
-                return 5;
-            } else if (scr_len_x < 640) {
-                return 6;
-            } else if (scr_len_x < 1280) {
-                return 8;
-            } else if (scr_len_x < 2560) {
-                return 10; 
-            } else {
-                return 16;
-            }
-            break;
-    }
-}
-
 /**
  * Sanity check function for indexing pixblocks.
  * Catches reading and writing outside the pixblock area.
@@ -264,6 +126,155 @@
     }
 }
+static void calcFilter(double const sigma, double b[N]) {
+    assert(N==3);
+    std::complex<double> const d1_org(1.40098,  1.00236);
+    double const d3_org = 1.85132;
+    double qbeg = 1; // Don't go lower than sigma==2 (we'd probably want a normal convolution in that case anyway)
+    double qend = 2*sigma;
+    double const sigmasqr = sqr(sigma);
+    double s;
+    do { // Binary search for right q (a linear interpolation scheme is suggested, but this should work fine as well)
+        double const q = (qbeg+qend)/2;
+        // Compute scaled filter coefficients
+        std::complex<double> const d1 = pow(d1_org, 1.0/q);
+        double const d3 = pow(d3_org, 1.0/q);
+        double const absd1sqr = std::norm(d1);
+        double const re2d1 = 2*d1.real();
+        double const bscale = 1.0/(absd1sqr*d3);
+        b[2] = -bscale;
+        b[1] =  bscale*(d3+re2d1);
+        b[0] = -bscale*(absd1sqr+d3*re2d1);
+        // Compute actual sigma^2
+        double const ssqr = 2*(2*(d1/sqr(d1-1.)).real()+d3/sqr(d3-1.));
+        if ( ssqr < sigmasqr ) {
+            qbeg = q;
+        } else {
+            qend = q;
+        }
+        s = sqrt(ssqr);
+    } while(qend-qbeg>(sigma/(1<<30)));
+    printf("real sigma: %g, q: %g\n", s, qend);
+}
+
+static void calcTriggsSdikaM(double const b[N], double M[N*N]) {
+    assert(N==3);
+    double a1=b[0], a2=b[1], a3=b[2];
+    double const Mscale = 1.0/((1+a1-a2+a3)*(1-a1-a2-a3)*(1+a2+(a1-a3)*a3));
+    M[0] = 1-a2-a1*a3-sqr(a3);
+    M[1] = (a1+a3)*(a2+a1*a3);
+    M[2] = a3*(a1+a2*a3);
+    M[3] = a1+a2*a3;
+    M[4] = (1-a2)*(a2+a1*a3);
+    M[5] = a3*(1-a2-a1*a3-sqr(a3));
+    M[6] = a1*(a1+a3)+a2*(1-a2);
+    M[7] = a1*(a2-sqr(a3))+a3*(1+a2*(a2-1)-sqr(a3));
+    M[8] = a3*(a1+a2*a3);
+    for(size_t i=0; i<9; i++) M[i] *= Mscale;
+}
+
+template<unsigned int SIZE>
+static void calcTriggsSdikaInitialization(double const M[N*N], Value const uold[N][SIZE], Value const uplus[SIZE], Value const vplus[SIZE], Value const alpha, Value vold[N][SIZE]) {
+    for(size_t c=0; c<SIZE; c++) {
+        double uminp[N];
+        for(size_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c];
+        for(size_t i=0; i<N; i++) {
+            double voldf = 0;
+            for(size_t j=0; j<N; j++) {
+                voldf += uminp[j]*M[i*N+j];
+            }
+            // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
+            // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
+            // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
+            voldf *= alpha; 
+            vold[i][c] = voldf;
+            vold[i][c] += vplus[c];
+        }
+    }
+}
+
+// TODO: Make filterX and filterY the same function by doing something with strides
+template<typename PT, unsigned int PC>
+void filterX(NRPixBlock *const in, NRPixBlock *const out, Value const b[N], double const M[N*N], Value *const tmpdata) {
+    size_t const w=in->area.x1-in->area.x0, h=in->area.y1-in->area.y0;
+    // Blur image (x)
+    for ( size_t y = 0 ; y < h ; y++ ) {
+        // corresponding line in the source and output buffer
+        PT const *const srcimg = reinterpret_cast<PT const *>(NR_PIXBLOCK_PX(in)  + y*in->rs);
+        PT       *const dstimg = reinterpret_cast<PT       *>(NR_PIXBLOCK_PX(out) + y*out->rs);
+        // Border constants
+        Value imin[PC];  copy_n(srcimg + (0)*PC, PC, imin);
+        Value iplus[PC]; copy_n(srcimg + (w-1)*PC, PC, iplus);
+        // Forward pass
+        Value u[N+1][PC];
+        for(size_t i=0; i<N; i++) copy_n(imin, PC, u[i]);
+        for ( size_t x = 0 ; x < w ; x++ ) {
+            for(size_t i=N; i>0; i--) copy_n(u[i-1], PC, u[i]);
+            copy_n(srcimg+x*PC, PC, u[0]);
+            for(size_t c=0; c<PC; c++) {
+                u[0][c] *= b[0];
+                for(size_t i=1; i<N+1; i++) u[0][c] += u[i][c]*b[i];
+            }
+            copy_n(u[0], PC, tmpdata+x*PC);
+        }
+        // Backward pass
+        Value v[N+1][PC];
+        calcTriggsSdikaInitialization(M, u, iplus, iplus, b[0], v);
+        for(size_t c=0; c<PC; c++) dstimg[(w-1)*PC+c] = round_cast<PT>(v[0][c]);
+        size_t x=w-1;
+        while(x-->0) {
+            for(size_t i=N; i>0; i--) copy_n(v[i-1], PC, v[i]);
+            copy_n(tmpdata+x*PC, PC, v[0]);
+            for(size_t c=0; c<PC; c++) {
+                v[0][c] *= b[0];
+                for(size_t i=1; i<N+1; i++) v[0][c] += v[i][c]*b[i];
+            }
+            for(size_t c=0; c<PC; c++) dstimg[x*PC+c] = round_cast<PT>(v[0][c]);
+        }
+    }
+}
+
+template<typename PT, unsigned int PC>
+void filterY(NRPixBlock *const in, NRPixBlock *const out, Value const b[N], double const M[N*N], Value *const tmpdata) {
+    size_t const w=in->area.x1-in->area.x0, h=in->area.y1-in->area.y0;
+    size_t const in_rs=in->rs/sizeof(PT), out_rs=out->rs/sizeof(PT);
+    // Blur image (y)
+    for ( size_t x = 0 ; x < w ; x++ ) {
+        // corresponding line in the source and output buffer
+        PT const *const srcimg = reinterpret_cast<PT const *>(NR_PIXBLOCK_PX(in)  + x*PC);
+        PT       *const dstimg = reinterpret_cast<PT       *>(NR_PIXBLOCK_PX(out) + x*PC);
+        // Border constants
+        Value imin[PC];  copy_n(srcimg + (0)*in_rs, PC, imin);
+        Value iplus[PC]; copy_n(srcimg + (h-1)*in_rs, PC, iplus);
+        // Forward pass
+        Value u[N+1][PC];
+        for(size_t i=0; i<N; i++) copy_n(imin, PC, u[i]);
+        for ( size_t y = 0 ; y < h ; y++ ) {
+            for(size_t i=N; i>0; i--) copy_n(u[i-1], PC, u[i]);
+            copy_n(srcimg+y*in_rs, PC, u[0]);
+            for(size_t c=0; c<PC; c++) {
+                u[0][c] *= b[0];
+                for(size_t i=1; i<N+1; i++) u[0][c] += u[i][c]*b[i];
+            }
+            copy_n(u[0], PC, tmpdata+y*PC);
+        }
+        // Backward pass
+        Value v[N+1][PC];
+        calcTriggsSdikaInitialization(M, u, iplus, iplus, b[0], v);
+        for(size_t c=0; c<PC; c++) dstimg[(h-1)*out_rs+c] = round_cast<PT>(v[0][c]);
+        size_t y=h-1;
+        while(y-->0) {
+            for(size_t i=N; i>0; i--) copy_n(v[i-1], PC, v[i]);
+            copy_n(tmpdata+y*PC, PC, v[0]);
+            for(size_t c=0; c<PC; c++) {
+                v[0][c] *= b[0];
+                for(size_t i=1; i<N+1; i++) v[0][c] += v[i][c]*b[i];
+            }
+            for(size_t c=0; c<PC; c++) dstimg[y*out_rs+c] = round_cast<PT>(v[0][c]);
+        }
+    }
+}
+
 int FilterGaussian::render(FilterSlot &slot, Matrix const &trans)
 {
     /* in holds the input pixblock */
@@ -288,262 +299,313 @@
         return 0;
     }
-    /* Blur radius in screen units (pixels) */
-    int scr_len_x = _effect_area_scr_x(trans);
-    int scr_len_y = _effect_area_scr_y(trans);
+    // TODO: Find the best threshold (should not be lower than 1)
+    // TODO: Split into separate x and y parts.
+    //if (_deviation_x * trans.expansionX() >= 2 && _deviation_y * trans.expansionY() >= 2 && std::abs(_deviation_x * trans.expansionX()-2.37749)>0.00001) {
+    // This condition was chosen based on the old code that was used to decide on how much subsampling to use.
+    // (Using the condition below the old code would never have used subsampling.)
+    if (3*std::min(_deviation_x * trans.expansionX(), _deviation_y * trans.expansionY()) >= 8) {
+        // new buffer for the final output, same resolution as the in buffer
+        NRPixBlock *out = new NRPixBlock;
+        nr_pixblock_setup_fast(out, in->mode, in->area.x0, in->area.y0,
+                                              in->area.x1, in->area.y1, true);
+        if (out->data.px == NULL) {
+            // alas, we've accomplished a lot, but ran out of memory - so abort
+            return 0;
+        }
+        size_t numChannels;
+        switch(in->mode) {
+        case NR_PIXBLOCK_MODE_A8:        ///< Grayscale
+            numChannels = 1;
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8:    ///< 8 bit RGB
+            numChannels = 3;
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA
+            numChannels = 4;
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8A8P:  ///< Premultiplied 8 bit RGBA
+            numChannels = 4;
+            break;
+        default:
+            assert(false);
+        }
+        Value *const tmpdata = new Value[std::max(in->area.x1-in->area.x0,in->area.y1-in->area.y0)*numChannels];
+        if (tmpdata == NULL) {
+            nr_pixblock_release(out);
+            delete out;
+            return 0;
+        }
-    // subsampling step; it depends on the radius, but somewhat nonlinearly, to make high zooms
-    // workable; is adjustable by quality in -2..2; 0 is the default; 2 is the best quality with no
-    // subsampling
-    int quality = prefs_get_int_attribute ("options.blurquality", "value", 0);
-    int stepx = _effect_subsample_step(scr_len_x, quality);
-    int stepx_l2 = _effect_subsample_step_log2(scr_len_x, quality);
-    int stepy = _effect_subsample_step(scr_len_y, quality);
-    int stepy_l2 = _effect_subsample_step_log2(scr_len_y, quality);
-    int stepx2 = stepx >> 1;
-    int stepy2 = stepy >> 1;
+        // Filter variables
+        Value b[N+1];  // scaling coefficient + filter coefficients (can be 10.21 fixed point)
+        double bf[N];  // computed filter coefficients
+        double M[N*N]; // matrix used for initialization procedure (has to be double)
-    /* buffer for x-axis blur */
-    NRPixBlock *bufx = new NRPixBlock;
-    /* buffer for y-axis blur */
-    NRPixBlock *bufy = new NRPixBlock;
+        // Compute filter (x)
+        calcFilter(_deviation_x * trans.expansionX(), bf);
+        for(size_t i=0; i<N; i++) bf[i] = -bf[i];
+        b[0] = 1; // b[0] == alpha (scaling coefficient)
+        for(size_t i=0; i<N; i++) {
+            b[i+1] = bf[i];
+            b[0] -= b[i+1];
+        }
-    // boundaries of the subsampled (smaller, unless step==1) buffers
-    int xd0 = (in->area.x0 >> stepx_l2);
-    int xd1 = (in->area.x1 >> stepx_l2) + 1;
-    int yd0 = (in->area.y0 >> stepy_l2);
-    int yd1 = (in->area.y1 >> stepy_l2) + 1;
+        // Compute initialization matrix (x)
+        calcTriggsSdikaM(bf, M);
-    // set up subsampled buffers
-    nr_pixblock_setup_fast(bufx, in->mode, xd0, yd0, xd1, yd1, true);
-    nr_pixblock_setup_fast(bufy, in->mode, xd0, yd0, xd1, yd1, true);
-    if (bufx->data.px == NULL || bufy->data.px == NULL) { // no memory
-        return 0;
-    }
+        // Filter (x)
+        switch(in->mode) {
+        case NR_PIXBLOCK_MODE_A8:        ///< Grayscale
+            filterX<unsigned char,1>(in, out, b, M, tmpdata);
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8:    ///< 8 bit RGB
+            filterX<unsigned char,3>(in, out, b, M, tmpdata);
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA
+            filterX<unsigned char,4>(in, out, b, M, tmpdata);
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8A8P:  ///< Premultiplied 8 bit RGBA
+            filterX<unsigned char,4>(in, out, b, M, tmpdata);
+            break;
+        default:
+            assert(false);
+        };
-    //mid->visible_area = in->visible_area;
-    //out->visible_area = in->visible_area;
+        if ( std::abs(_deviation_x*trans.expansionX() - _deviation_y*trans.expansionY()) > 0.000001 ) {
+            // Compute filter (y)
+            calcFilter(_deviation_y * trans.expansionY(), bf);
+            for(size_t i=0; i<N; i++) bf[i] = -bf[i];
+            b[0] = 1; // b[0] == alpha (scaling coefficient)
+            for(size_t i=0; i<N; i++) {
+                b[i+1] = bf[i];
+                b[0] -= b[i+1];
+            }
-    /* Array for filter kernel, big enough to fit kernels for both X and Y
-     * direction kernel, one at time */
-    double kernel[_kernel_size(trans)];
-    
-    /* 1. Blur in direction of X-axis, from in to bufx (they have different resolution)*/
-    _make_kernel(kernel, _deviation_x, trans.expansionX());
+            // Compute initialization matrix (y)
+            calcTriggsSdikaM(bf, M);
+        }
-    for ( int y = bufx->area.y0 ; y < bufx->area.y1; y++ ) {
+        // Filter (y)
+        switch(in->mode) {
+        case NR_PIXBLOCK_MODE_A8:        ///< Grayscale
+            filterY<unsigned char,1>(out, out, b, M, tmpdata);
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8:    ///< 8 bit RGB
+            filterY<unsigned char,3>(out, out, b, M, tmpdata);
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA
+            filterY<unsigned char,4>(out, out, b, M, tmpdata);
+            break;
+        case NR_PIXBLOCK_MODE_R8G8B8A8P:  ///< Premultiplied 8 bit RGBA
+            filterY<unsigned char,4>(out, out, b, M, tmpdata);
+            break;
+        default:
+            assert(false);
+        };
-        // corresponding line in the source buffer
-        int in_line;
-        if ((y << stepy_l2) >= in->area.y1) {
-            in_line = (in->area.y1 - in->area.y0 - 1) * in->rs;
-        } else {
-            in_line = ((y << stepy_l2) - (in->area.y0))  * in->rs;
-            if (in_line < 0)
-                in_line = 0;
+        delete[] tmpdata;
+
+        out->empty = FALSE;
+        slot.set(_output, out);
+    } else {
+		/* Blur radius in screen units (pixels) */
+        int scr_len_x = _effect_area_scr_x(trans);
+        int scr_len_y = _effect_area_scr_y(trans);
+
+        int const PC = NR_PIXBLOCK_BPP(in);
+        typedef unsigned char PT;
+
+        /* buffer for blur */
+        NRPixBlock *out = new NRPixBlock;
+        nr_pixblock_setup_fast(out, in->mode, in->area.x0, in->area.y0,
+                                              in->area.x1, in->area.y1, true);
+        if (out->data.px == NULL) { // no memory
+            return 0;
         }
-        // current line in bufx
-        int bufx_line = (y - yd0) * bufx->rs;
+        /* Array for filter kernel, big enough to fit kernels for both X and Y
+        * direction kernel, one at time */
+        double kernel[_kernel_size(trans)];
-        int skipbuf[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};
+        // Past pixels seen (to enable in-place operation)
+        PT history[_kernel_size(trans)][PC];
+        
+        int const width=in->area.x1-in->area.x0, height=in->area.y1-in->area.y0;
-        for ( int x = bufx->area.x0 ; x < bufx->area.x1 ; x++ ) {
+        /* 1. Blur in direction of X-axis, from in to bufx (they have different resolution)*/
+        _make_kernel(kernel, _deviation_x, trans.expansionX());
-            // for all bytes of the pixel
-            for ( int byte = 0 ; byte < NR_PIXBLOCK_BPP(in) ; byte++) {
+        for ( int y = 0 ; y < height; y++ ) {
-                if(skipbuf[byte] > x) continue;
+            // corresponding line in the source buffer
+            int const in_line = y * in->rs;
-                double sum = 0;
-                int last_in = -1;
-                int different_count = 0;
+            // current line in bufx
+            int const bufx_line = y * out->rs;
-                // go over our point's neighborhood on x axis in the in buffer, with stepx increment
-                for ( int i = -scr_len_x ; i <= scr_len_x ; i += stepx ) {
+            int skipbuf[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};
-                    // the pixel we're looking at
-                    int x_in = (((x << stepx_l2) + i + stepx2) >> stepx_l2) << stepx_l2;
+            // history initialization
+            PT imin[PC]; copy_n(NR_PIXBLOCK_PX(in) + in_line, PC, imin);
+            for(int i=0; i<scr_len_x; i++) copy_n(imin, PC, history[i]);
-                    // distance from it to the current x,y
-                    int dist = x_in - (x << stepx_l2);
-                    if (dist < -scr_len_x) 
-                        dist = -scr_len_x;
-                    if (dist > scr_len_x) 
-                        dist = scr_len_x;
+            for ( int x = 0 ; x < width ; x++ ) {
-                    if (x_in >= in->area.x1) {
-                        x_in = (in->area.x1 - in->area.x0 - 1);
-                    } else {
-                        x_in = (x_in - in->area.x0);
-                        if (x_in < 0)
-                            x_in = 0;
-                    }
+                int const disp = PC * x;
-                    // value at the pixel
-                    _check_index(in, in_line + NR_PIXBLOCK_BPP(in) * x_in + byte, __LINE__);
-                    unsigned char in_byte = NR_PIXBLOCK_PX(in)[in_line + NR_PIXBLOCK_BPP(in) * x_in + byte];
+                // update history
+                for(int i=scr_len_x; i>0; i--) copy_n(history[i-1], PC, history[i]);
+                copy_n(NR_PIXBLOCK_PX(in) + in_line + disp, PC, history[0]);
-                    // is it the same as last one we saw?
-                    if(in_byte != last_in) different_count++;
-                    last_in = in_byte;
+                // for all bytes of the pixel
+                for ( int byte = 0 ; byte < PC ; byte++) {
-                    // sum pixels weighted by the kernel; multiply by stepx because we're skipping stepx pixels
-                    sum += stepx * in_byte * kernel[scr_len_x + dist];
-                }
+                    if(skipbuf[byte] > x) continue;
-                // store the result in bufx
-                _check_index(bufx, bufx_line + NR_PIXBLOCK_BPP(bufx) * (x - xd0) + byte, __LINE__);
-                NR_PIXBLOCK_PX(bufx)[bufx_line + NR_PIXBLOCK_BPP(bufx) * (x - xd0) + byte] = (unsigned char)sum;
+                    double sum = 0;
+                    int last_in = -1;
+                    int different_count = 0;
-                // optimization: if there was no variation within this point's neighborhood, 
-                // skip ahead while we keep seeing the same last_in byte: 
-                // blurring flat color would not change it anyway
-                if (different_count <= 1) {
-                    int pos = x + 1;
-                    while(((pos << stepx_l2) + scr_len_x) < in->area.x1 &&
-                          NR_PIXBLOCK_PX(in)[in_line + NR_PIXBLOCK_BPP(in) * ((pos << stepx_l2) + scr_len_x - in->area.x0) + byte] == last_in)
-                    {
-                        _check_index(in, in_line + NR_PIXBLOCK_BPP(in) * ((pos << stepx_l2) + scr_len_x - in->area.x0) + byte, __LINE__);
-                        _check_index(bufx, bufx_line + NR_PIXBLOCK_BPP(bufx) * (pos - xd0) + byte, __LINE__);
-                        NR_PIXBLOCK_PX(bufx)[bufx_line + NR_PIXBLOCK_BPP(bufx) * (pos - xd0) + byte] = last_in;
-                        pos++;
+                    // go over our point's neighbours in the history
+                    for ( int i = 0 ; i <= scr_len_x ; i++ ) {
+                        // value at the pixel
+                        unsigned char in_byte = history[i][byte];
+
+                        // is it the same as last one we saw?
+                        if(in_byte != last_in) different_count++;
+                        last_in = in_byte;
+
+                        // sum pixels weighted by the kernel
+                        sum += in_byte * kernel[i];
                     }
-                    skipbuf[byte] = pos;
-                }
-            }
-        }
-    }
+                    // go over our point's neighborhood on x axis in the in buffer
+                    for ( int i = 1 ; i <= scr_len_x ; i++ ) {
-    /* 2. Blur in direction of Y-axis, from bufx to bufy (they have the same resolution) */
-    _make_kernel(kernel, _deviation_y, trans.expansionY());
+                        // the pixel we're looking at
+                        int x_in = x + i;
+                        if (x_in >= width) {
+                            x_in = width - 1;
+                        } else if (x_in < 0) {
+                            x_in = 0;
+                        }
-    for ( int x = bufy->area.x0 ; x < bufy->area.x1; x++ ) {
+                        // value at the pixel
+                        _check_index(in, in_line + PC * x_in + byte, __LINE__);
+                        unsigned char in_byte = NR_PIXBLOCK_PX(in)[in_line + PC * x_in + byte];
-        int bufy_disp = NR_PIXBLOCK_BPP(bufy) * (x - xd0);
-        int bufx_disp = NR_PIXBLOCK_BPP(bufx) * (x - xd0);
+                        // is it the same as last one we saw?
+                        if(in_byte != last_in) different_count++;
+                        last_in = in_byte;
-        int skipbuf[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};
+                        // sum pixels weighted by the kernel
+                        sum += in_byte * kernel[i];
+                    }
-        for ( int y = bufy->area.y0; y < bufy->area.y1; y++ ) {
+                    // store the result in bufx
+                    _check_index(out, bufx_line + disp + byte, __LINE__);
+                    NR_PIXBLOCK_PX(out)[bufx_line + disp + byte] = (unsigned char)sum;
-            int bufy_line = (y - yd0) * bufy->rs;
+                    // optimization: if there was no variation within this point's neighborhood, 
+                    // skip ahead while we keep seeing the same last_in byte: 
+                    // blurring flat color would not change it anyway
+                    if (different_count <= 1) {
+                        int pos = x + 1;
+                        while(pos + scr_len_x < width &&
+                              NR_PIXBLOCK_PX(in)[in_line + PC * (pos + scr_len_x) + byte] == last_in)
+                        {
+                            _check_index(in, in_line + PC * (pos + scr_len_x) + byte, __LINE__);
+                            _check_index(out, bufx_line + PC * pos + byte, __LINE__);
+                            NR_PIXBLOCK_PX(out)[bufx_line + PC * pos + byte] = last_in;
+                            pos++;
+                        }
+                        skipbuf[byte] = pos;
+                    }
+                }
+            }
+        }
-            for ( int byte = 0 ; byte < NR_PIXBLOCK_BPP(bufx) ; byte++) {
-                if (skipbuf[byte] > y) continue;
+        /* 2. Blur in direction of Y-axis, from bufx to bufy (they have the same resolution) */
+        _make_kernel(kernel, _deviation_y, trans.expansionY());
-                double sum = 0;
-                int last_in = -1;
-                int different_count = 0;
+        for ( int x = 0 ; x < width; x++ ) {
-                for ( int i = -scr_len_y ; i <= scr_len_y ; i += stepy ) {
+            int const disp = PC * x;
-                    int y_in = ((((y << stepy_l2) + i + stepy2) >> stepy_l2) - yd0);
+            int skipbuf[4] = {INT_MIN, INT_MIN, INT_MIN, INT_MIN};
-                    int dist = ((y_in + yd0) << stepy_l2) - (y << stepy_l2);
-                    if (dist < -scr_len_y) 
-                        dist = -scr_len_y;
-                    if (dist > scr_len_y) 
-                        dist = scr_len_y;
+            PT imin[PC]; copy_n(NR_PIXBLOCK_PX(out) + disp, PC, imin);
+            for(int i=0; i<scr_len_y; i++) copy_n(imin, PC, history[i]);
-                    if (y_in >= (yd1 - yd0)) y_in = (yd1 - yd0) - 1;
-                    if (y_in < 0) y_in = 0;
+            for ( int y = 0; y < height; y++ ) {
-                    _check_index(bufx, y_in * bufx->rs + NR_PIXBLOCK_BPP(bufx) * (x - xd0) + byte, __LINE__);
-                    unsigned char in_byte = NR_PIXBLOCK_PX(bufx)[y_in * bufx->rs + NR_PIXBLOCK_BPP(bufx) * (x - xd0) + byte];
-                    if(in_byte != last_in) different_count++;
-                    last_in = in_byte;
-                    sum += stepy * in_byte * kernel[scr_len_y + dist];
-                }
+                int bufy_line = y * out->rs;
-                _check_index(bufy, bufy_line + bufy_disp + byte, __LINE__);
-                NR_PIXBLOCK_PX(bufy)[bufy_line + bufy_disp + byte] = (unsigned char)sum;
+                for(int i=scr_len_y; i>0; i--) copy_n(history[i-1], PC, history[i]);
+                copy_n(NR_PIXBLOCK_PX(out) + bufy_line + disp, PC, history[0]);
-                if (different_count <= 1) {
-                    int pos = y + 1;
-                    while((pos + (scr_len_y >> stepy_l2) + 1) < yd1 &&
-                          NR_PIXBLOCK_PX(bufx)[(pos + (scr_len_y >> stepy_l2) + 1 - yd0) * bufx->rs + bufx_disp + byte] == last_in)
-                    {
-                        _check_index(bufx, (pos + (scr_len_y >> stepy_l2) + 1 - yd0) * bufx->rs + bufx_disp + byte, __LINE__);
-                        _check_index(bufy, (pos - yd0) * bufy->rs + bufy_disp + byte, __LINE__);
-                        NR_PIXBLOCK_PX(bufy)[(pos - yd0) * bufy->rs + bufy_disp + byte] = last_in;
-                        pos++;
-                    }
-                    skipbuf[byte] = pos;
-                }
+                for ( int byte = 0 ; byte < PC ; byte++) {
-            }
-        }
-    }
+                    if (skipbuf[byte] > y) continue;
-    // we don't need bufx anymore
-    nr_pixblock_release(bufx);
-    delete bufx;
+                    double sum = 0;
+                    int last_in = -1;
+                    int different_count = 0;
-    // interpolation will need to divide by stepx * stepy
-    int divisor = stepx_l2 + stepy_l2;
+                    for ( int i = 0 ; i <= scr_len_y ; i++ ) {
+                        // value at the pixel
+                        unsigned char in_byte = history[i][byte];
-    // new buffer for the final output, same resolution as the in buffer
-    NRPixBlock *out = new NRPixBlock;
-    nr_pixblock_setup_fast(out, in->mode, in->area.x0, in->area.y0,
-                           in->area.x1, in->area.y1, true);
-    if (out->data.px == NULL) {
-        // alas, we've accomplished a lot, but ran out of memory - so abort
-        return 0;
-    }
+                        // is it the same as last one we saw?
+                        if(in_byte != last_in) different_count++;
+                        last_in = in_byte;
-    for ( int y = yd0 ; y < yd1 - 1; y++ ) {
-        for ( int x = xd0 ; x < xd1 - 1; x++ ) {
-            for ( int byte = 0 ; byte < NR_PIXBLOCK_BPP(bufy) ; byte++) {
+                        // sum pixels weighted by the kernel
+                        sum += in_byte * kernel[i];
+                    }
-                // get 4 values at the corners of the pixel from bufy
-                _check_index(bufy, ((y - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) + (x - xd0) + byte, __LINE__);
-                unsigned char a00 = NR_PIXBLOCK_PX(bufy)[((y - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x - xd0) + byte];
-                if (stepx == 1 && stepy == 1) { // if there was no subsampling, just use a00
-                    _check_index(out, ((y - yd0) * out->rs) + NR_PIXBLOCK_BPP(out) * (x - xd0) + byte, __LINE__);
-                    NR_PIXBLOCK_PX(out)[((y - yd0) * out->rs) + NR_PIXBLOCK_BPP(out) * (x - xd0) + byte] = a00;
-                    continue;
-                }
-                _check_index(bufy, ((y - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x + 1 - xd0) + byte, __LINE__);
-                unsigned char a10 = NR_PIXBLOCK_PX(bufy)[((y - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x + 1 - xd0) + byte];
-                _check_index(bufy, ((y + 1 - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x - xd0) + byte, __LINE__);
-                unsigned char a01 = NR_PIXBLOCK_PX(bufy)[((y + 1 - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x - xd0) + byte];
-                _check_index(bufy, ((y + 1 - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x + 1 - xd0) + byte, __LINE__);
-                unsigned char a11 = NR_PIXBLOCK_PX(bufy)[((y + 1 - yd0) * bufy->rs) + NR_PIXBLOCK_BPP(bufy) * (x + 1 - xd0) + byte];
+                    for ( int i = 1 ; i <= scr_len_y ; i++ ) {
-                // iterate over the rectangle to be interpolated
-                for ( int yi = 0 ; yi < stepy; yi++ ) {
-                    int iy = stepy - yi;
-                    int y_out = (y << stepy_l2) + yi;
-                    if ((y_out < out->area.y0) || (y_out >= out->area.y1))
-                        continue;
-                    int out_line  = (y_out - out->area.y0) * out->rs;
+                        int y_in = y + i;
+                        if (y_in >= height) {
+                            y_in = height - 1;
+                        } else if (y_in < 0) {
+                            y_in = 0;
+                        }
-                    for ( int xi = 0 ; xi < stepx; xi++ ) {
-                        int ix = stepx - xi;
-                        int x_out = (x << stepx_l2) + xi;
-                        if ((x_out < out->area.x0) || (x_out >= out->area.x1))
-                            continue;
+                        _check_index(out, y_in * out->rs + disp + byte, __LINE__);
+                        unsigned char in_byte = NR_PIXBLOCK_PX(out)[y_in * out->rs + disp + byte];
+                        if(in_byte != last_in) different_count++;
+                        last_in = in_byte;
+                        sum += in_byte * kernel[i];
+                    }
-                        // simple linear interpolation
-                        int a = (a00*ix*iy + a10*xi*iy + a01*ix*yi + a11*xi*yi) >> divisor;
+                    _check_index(out, bufy_line + disp + byte, __LINE__);
+                    NR_PIXBLOCK_PX(out)[bufy_line + disp + byte] = (unsigned char)sum;
-                        _check_index(out, out_line + NR_PIXBLOCK_BPP(out) * (x_out - out->area.x0) + byte, __LINE__);
-                        NR_PIXBLOCK_PX(out)[out_line + NR_PIXBLOCK_BPP(out) * (x_out - out->area.x0) + byte] = (unsigned char) a;
+                    if (different_count == 0) {
+                        int pos = y + 1;
+                        while((pos + scr_len_y) < height &&
+                            NR_PIXBLOCK_PX(out)[(pos + scr_len_y) * out->rs + disp + byte] == last_in)
+                        {
+                            _check_index(out, (pos + scr_len_y) * out->rs + disp + byte, __LINE__);
+                            _check_index(out, pos * out->rs + disp + byte, __LINE__);
+                            NR_PIXBLOCK_PX(out)[pos * out->rs + disp + byte] = last_in;
+                            pos++;
+                        }
+                        skipbuf[byte] = pos;
                     }
+
                 }
             }
         }
+
+        out->empty = FALSE;
+        slot.set(_output, out);
     }
-    nr_pixblock_release(bufy);
-    delete bufy;
-
-    out->empty = FALSE;
-    slot.set(_output, out);
-
     return 0;
 }
@@ -551,7 +613,7 @@
 {
     int area_x = _effect_area_scr_x(trans);
     int area_y = _effect_area_scr_y(trans);
-    return _max(area_x, area_y);
+    return std::max(area_x, area_y);
 }
void FilterGaussian::set_deviation(double deviation)
Index: src/display/nr-filter-gaussian.h
===================================================================
--- src/display/nr-filter-gaussian.h	(revision 13436)
+++ src/display/nr-filter-gaussian.h	(working copy)
@@ -7,6 +7,7 @@
  * Authors:
  *   Niko Kiirala <niko@...1267...>
  *   bulia byak
+ *   Jasper van de Gronde <th.v.d.gronde@...528...>
  *
  * Copyright (C) 2006 authors
  *
@@ -62,17 +63,6 @@
     void _make_kernel(double *kernel, double deviation, double expansion);
     int _effect_area_scr_x(Matrix const &trans);
     int _effect_area_scr_y(Matrix const &trans);
-    int _effect_subsample_step(int scr_len_x, int quality);
-    int _effect_subsample_step_log2(int scr_len_x, int quality);
-
-    inline int _min(int const a, int const b)
-    {
-        return ((a < b) ? a : b);
-    }
-    inline int _max(int const a, int const b)
-    {
-        return ((a > b) ? a : b);
-    }
 };

    

Jasper van de Gronde

MenTaLguY

Jasper van de Gronde

bulia byak

bulia byak

yann.papouin

tags (0)

participants (4)