performance problems and possible remedies?


Yale Zhang

3 Mar 2013

10:26 a.m.

Hi. I'm using Inkscape to author a comic and the slow speed for certain things is very annoying. I already have OpenMP turned on and am using 4 cores.

1. *large blurs are slow* - I'm an expert with writing SIMD code, so I was thinking about vectorizing the Gaussian IIR filter with SIMD intrinsics, even though it's harder than for a FIR. But I noticed there isn't any SIMD code in Inkscape, so does that mean it's something to avoid? I'm pretty sure no current compiler is smart enough to vectorize it, and besides, Inkscape is compiled with -O2, meaning -ftree-vectorize isn't on by default.

2. *when there's a large image (raster based) background - scrolling in a zoomed region is very slow* I compiled the latest 0.49 code with GCC profiling and it shows this:

33.98  22.47  22.47                        exp2l
21.29  36.55  14.08                        log2l
17.57  48.17  11.62                        pow
 7.12  52.88   4.71     658  0.01  0.01    ink_cairo_surface_srgb_to_linear(_cairo_surface*)
 6.72  57.32   4.44     563  0.01  0.01    ink_cairo_surface_linear_to_srgb(_cairo_surface*)
 5.51  60.96   3.64    1216  0.00  0.00    Inkscape::Filters::FilterGaussian::~FilterGaussian()
 5.23  64.42   3.46                        internal_modf
 0.59  64.81   0.39                        _mcount_private
 0.41  65.08   0.27                        __fentry__
 0.12  65.16   0.08                        GC_mark_from
 0.09  65.22   0.06    5579  0.00  0.00    Geom::parse_svg_path(char const*, Geom::SVGPathSink&)
 0.06  65.26   0.04   35320  0.00  0.00    bounds_exact_transformed(std::vector<Geom::Path, std::allocator<Geom::Path> > const&, Geom::Affine const&)
 0.06  65.30   0.04       8  0.01  0.01    convert_pixbuf_normal_to_argb32(_GdkPixbuf*)
 0.05  65.33   0.03  885444  0.00  0.00    std::vector<Geom::Linear, std::allocator<Geom::Linear> >::_M_fill_insert(__gnu_cxx::__normal_iterator<Geom::Linear*, std::vector<Geom::Linear, std::allocator<Geom::Linear> > >, unsigned long long, Geom::Linear const&)

The cost is absolutely dominated by ink_cairo_surface_srgb_to_linear() and ink_cairo_surface_linear_to_srgb(). My first instinct was to optimize those 2 functions, but then I thought why are those even being called every time I scroll through the image? Why not convert the images up front to linear and stay that way in memory?

If that can't be done, then my optimization approach is:

1. replace ink_cairo_surface_srgb_to_linear() with a simple 3rd degree polynomial approximation (0.902590573087882 - 0.010238759806148x + 0.002825455367280x^2 + 0.000004414767235x^3) and vectorize with SSE intrinsics. The approximation was calculated by minimizing the square error (maxError = 0.313) over the range [10, 255]. For x < 10, it uses simple scaling.

2. replace ink_cairo_surface_linear_to_srgb() with a vectorized implementation of pow(). Unlike srgb_to_linear(), a low degree polynomial can't be used because the curve has larger high-order derivatives. An alternative would be piecewise, low-order polynomials.
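The cubic in step 1 can be checked against the exact sRGB transfer function with a few lines of C++. This is only a sketch for experimenting with the coefficients quoted above; the function names are hypothetical and it is not code from Inkscape. It assumes the polynomial maps an 8-bit sRGB value to a linear value on the same 0..255 scale.

```cpp
#include <algorithm>
#include <cmath>

// Exact sRGB -> linear transfer function, input and output on a 0..255 scale.
double srgb_to_linear_exact(int x)
{
    double c = x / 255.0;
    double lin = (c <= 0.04045) ? c / 12.92
                                : std::pow((c + 0.055) / 1.055, 2.4);
    return lin * 255.0;
}

// The cubic quoted in the message, in Horner form; intended for x in [10, 255].
double srgb_to_linear_cubic(int x)
{
    const double c0 = 0.902590573087882;
    const double c1 = -0.010238759806148;
    const double c2 = 0.002825455367280;
    const double c3 = 0.000004414767235;
    return c0 + x * (c1 + x * (c2 + x * c3));
}

// Largest absolute deviation of the cubic from the exact curve over [lo, hi].
double max_abs_error(int lo, int hi)
{
    double worst = 0.0;
    for (int x = lo; x <= hi; ++x) {
        double err = std::fabs(srgb_to_linear_cubic(x) - srgb_to_linear_exact(x));
        worst = std::max(worst, err);
    }
    return worst;
}
```

Calling max_abs_error(10, 255) gives a figure to compare with the maxError reported in the message; the same harness could be reused for a piecewise fit of the inverse curve.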

The main question I have is what degree of accuracy is desired? Certainly, it doesn't need double precision pow() since the input is only 8 bits! Is +- 0.5 from the true value (before quantization) OK or do people depend on getting pixel perfect results?
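Because the input is only 8 bits per channel, one alternative to both pow() and a polynomial is a 256-entry lookup table, which is exact after rounding. The sketch below shows that different technique, not the approach proposed in the message; the names are hypothetical and it ignores alpha handling, which the real Cairo surface functions would have to deal with.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Build a 256-entry table mapping 8-bit sRGB values to 8-bit linear values.
std::array<std::uint8_t, 256> make_srgb_to_linear_lut()
{
    std::array<std::uint8_t, 256> lut{};
    for (int x = 0; x < 256; ++x) {
        double c = x / 255.0;
        double lin = (c <= 0.04045) ? c / 12.92
                                    : std::pow((c + 0.055) / 1.055, 2.4);
        lut[x] = static_cast<std::uint8_t>(std::lround(lin * 255.0));
    }
    return lut;
}

// Convert n channel bytes in place; pow() is only evaluated when the table is built.
void srgb_to_linear_inplace(std::uint8_t *data, std::size_t n,
                            const std::array<std::uint8_t, 256> &lut)
{
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = lut[data[i]];
    }
}
```

The table is built once and reused, so the per-pixel cost is a single indexed load. Whether that beats a vectorized polynomial would need measuring.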


Tavmjong Bah

3 Mar 2013

11:26 a.m.

On Sun, 2013-03-03 at 02:26 -0800, Yale Zhang wrote:

> Hi. I'm using Inkscape to author a comic and the slow speed for certain things is very annoying. I already have OpenMP turned on and am using 4 cores.
>
> 1. *large blurs are slow* - I'm an expert with writing SIMD code, so I was thinking about vectorizing the Gaussian IIR filter with SIMD intrinsics, even though it's harder than for a FIR. But I noticed there isn't any SIMD code in Inkscape, so does that mean it's something to avoid? I'm pretty sure no current compiler is smart enough to vectorize it, and besides, Inkscape is compiled with -O2, meaning -ftree-vectorize isn't on by default.

I can't comment on vectorizing the code.

> 2. *when there's a large image (raster based) background - scrolling in a zoomed region is very slow* I compiled the latest 0.49 code with GCC profiling and it shows this:
>
> [...]
>
> The cost is absolutely dominated by ink_cairo_surface_srgb_to_linear() and ink_cairo_surface_linear_to_srgb(). My first instinct was to optimize those 2 functions, but then I thought why are those even being called every time I scroll through the image? Why not convert the images up front to linear and stay that way in memory?

You should be able to avoid the conversion to and from linearRGB by setting the attribute color-interpolation-filters to sRGB. There is no UI for setting this other than the XML editor. I thought that, by default, filters created by Inkscape already set this to sRGB.

> If that can't be done, then my optimization approach is:
>
> 1. replace ink_cairo_surface_srgb_to_linear() with a simple 3rd degree polynomial approximation (0.902590573087882 - 0.010238759806148x + 0.002825455367280x^2 + 0.000004414767235x^3) and vectorize with SSE intrinsics. The approximation was calculated by minimizing the square error (maxError = 0.313) over the range [10, 255]. For x < 10, it uses simple scaling.
>
> 2. replace ink_cairo_surface_linear_to_srgb() with a vectorized implementation of pow(). Unlike srgb_to_linear(), a low degree polynomial can't be used because the curve has larger high-order derivatives. An alternative would be piecewise, low-order polynomials.
>
> The main question I have is what degree of accuracy is desired? Certainly, it doesn't need double precision pow() since the input is only 8 bits! Is +- 0.5 from the true value (before quantization) OK or do people depend on getting pixel perfect results?

You could experiment, and see if any differences are visible. One has to be a bit careful as different filter primitives might have different sensitivities to the accuracy.


Tav

~suv

11:52 a.m.

On 2013-03-03 12:26 +0100, Tavmjong Bah wrote:

> On Sun, 2013-03-03 at 02:26 -0800, Yale Zhang wrote:
>> The cost is absolutely dominated by ink_cairo_surface_srgb_to_linear() and ink_cairo_surface_linear_to_srgb(). My first instinct was to optimize those 2 functions, but then I thought why are those even being called every time I scroll through the image? Why not convert the images up front to linear and stay that way in memory?
>
> You should be able to avoid the conversion to and from linearRGB by setting the attribute color-interpolation-filters to sRGB. There is no UI for setting this other than using the XML editor. I thought, that by default, Inkscape created filters already set this to sRGB.

Apparently this does not happen in recent trunk when adding the blur via 'Fill&Stroke', see also (otherwise not related) regression reported in

- Bug #1127103 “Colour change on blurred elements with a transform” https://bugs.launchpad.net/inkscape/+bug/1127103

(a workaround for that bug - not a fix though AFAIU - is to manually edit the file in an external editor and add style="color-interpolation-filters:sRGB" to the blur filter definition which had been created via Fill&Stroke)

Tavmjong Bah

2:39 p.m.

I just checked in a patch that should use OpenMP when converting between sRGB and linearRGB. On my laptop with 8 threads it results in about an eightfold increase in speed with a Gaussian blur of radius 100.

Of course, further speed increases are welcome.

Tav
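For readers unfamiliar with the idea, parallelizing the conversion with OpenMP can be sketched as below. This is not the patch mentioned above; the function names are hypothetical, the 4-bytes-per-pixel layout with alpha as the fourth byte is an assumption, and premultiplied alpha is ignored.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: exact sRGB -> linear for one 8-bit channel value.
static std::uint8_t srgb_to_linear_u8(std::uint8_t v)
{
    double c = v / 255.0;
    double lin = (c <= 0.04045) ? c / 12.92
                                : std::pow((c + 0.055) / 1.055, 2.4);
    return static_cast<std::uint8_t>(std::lround(lin * 255.0));
}

// Convert a width x height image with `stride` bytes per row, 4 bytes per pixel.
// Rows are independent of each other, so the outer loop can be split across threads.
void convert_rows_parallel(std::uint8_t *pixels, int width, int height, int stride)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        std::uint8_t *row = pixels + static_cast<std::ptrdiff_t>(y) * stride;
        for (int x = 0; x < width; ++x) {
            // Convert three colour bytes; leave the fourth byte (assumed alpha) untouched.
            for (int ch = 0; ch < 3; ++ch) {
                row[4 * x + ch] = srgb_to_linear_u8(row[4 * x + ch]);
            }
        }
    }
}
```

Without OpenMP enabled at compile time (for GCC, -fopenmp) the pragma is ignored and the loop simply runs serially, so the result is the same either way.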


_______________________________________________
Inkscape-devel mailing list
Inkscape-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/inkscape-devel

Yale Zhang

4 Mar 2013

5:44 a.m.

Tavmjong, Good, that was an effortless speedup, but I'll still go ahead with the vectorized implementation.

Before I set the filters back to sRGB, I'd like to know which is more intuitive to the user (e.g. gives a linear response). In your explanation at

http://tavmjong.free.fr/blog/?p=765

you say that sRGB is clearly better for color interpolation, but I couldn't tell which is better for filters.


Johan Engelen

6 Mar 2013

7:23 p.m.

On 4-3-2013 6:44, Yale Zhang wrote:

> ... but I'll still go ahead with the vectorized implementation.

Yes please do! I'm very interested in how to include SIMD code and how to go about vectorizing code; great if we have an example in our codebase.

Cheers, Johan
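One way to see why a recursive (IIR) filter is harder to vectorize than a FIR: each output sample depends on the previous output, so neighbouring pixels cannot be computed side by side. The independent work is across colour channels instead. The sketch below uses a first-order filter as a stand-in for the real Gaussian IIR; the function name is hypothetical and this is not the implementation discussed in the thread.

```cpp
#include <cstddef>

// First-order recursive smoothing of one row of interleaved RGBA floats:
//   y[n] = a * x[n] + (1 - a) * y[n - 1]
// The dependency on y[n - 1] serializes the loop over pixels. The four
// channels are independent, so the inner loop maps onto one 4-wide SIMD
// register (or can be auto-vectorized by the compiler).
void iir_row_rgba(const float *in, float *out, std::size_t pixels, float a)
{
    if (pixels == 0) {
        return;
    }
    float prev[4];
    for (int c = 0; c < 4; ++c) {
        prev[c] = in[c];   // initialise the filter state with the first pixel
        out[c] = prev[c];
    }
    for (std::size_t n = 1; n < pixels; ++n) {
        for (int c = 0; c < 4; ++c) {   // independent lanes
            prev[c] = a * in[4 * n + c] + (1.0f - a) * prev[c];
            out[4 * n + c] = prev[c];
        }
    }
}
```

A further source of parallelism, not shown here, is to filter several rows at once, since rows are also independent of each other.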

Yale Zhang

19 Sep

12:25 p.m.

I'm proud to announce that I have completed vectorizing both the IIR and FIR filters. Speedups are about 5x for IIR and ~20x for FIR (see spreadsheet). The code is a monstrosity, but given all the different cases ({FIR, IIR} x {int16, float, double} x {RGBA, grayscale}), as Jasper pointed out, that's expected. Earlier, I had only worked on the IIR, RGBA case, so it wasn't complete. It's unfortunate that I let it languish for 3 years, but now, hopefully, everyone can get a smoother experience.

Please test it out and send some feedback and a path for checking in.

A unit test & benchmark function is included. The accuracy is always within ±1 intensity level, which should be quite safe.
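A tolerance check of this kind can be sketched as follows: compare the optimized output against a reference output byte by byte and report the largest difference. The function names are hypothetical and this is not the unit test included with the patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Largest per-byte difference between a reference image and a test image of n bytes.
int max_level_difference(const std::uint8_t *ref, const std::uint8_t *test, std::size_t n)
{
    int worst = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int d = std::abs(static_cast<int>(ref[i]) - static_cast<int>(test[i]));
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}

// True when the test image is within `tolerance` intensity levels of the reference everywhere.
bool within_tolerance(const std::uint8_t *ref, const std::uint8_t *test,
                      std::size_t n, int tolerance = 1)
{
    return max_level_difference(ref, test, n) <= tolerance;
}
```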

To build it, please add this to your cmake command:

-DCMAKE_CXX_FLAGS="-DWIN32 -mavx2 -mfma -fpermissive -flax-vector-conversions"

You'll also need to turn on -std=c++14 if compiling the unit test & benchmark.

Todo list:

* support processors with SSE2 only? The current version only works on an AVX2 processor.
* dynamic dispatch for different CPU types - I guess there should be 3 versions of the functions (SSE2, AVX1, AVX2)
* better ways to deal with SIMD remainders? The current code never writes past the end of a row, but liberally uses vector loads that go past the end. This will render memory error checking tools like AddressSanitizer useless with false positives. Ideally, I'd like all the Cairo images to be padded to either 16 (SSE2) or 32 bytes (AVX), but that might not be practical.
* document all the arcane optimization methods (cache blocking, do-while loops, Duff's device) for posterity, preferably with SVG animations

-Yale
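On the SIMD remainder item in the todo list, the conventional alternative to over-reading is a full-width main loop followed by a scalar tail. The sketch below shows only that loop structure in portable C++; the names are hypothetical, and a real version would replace the inner block with vector loads and stores.

```cpp
#include <cstddef>

// Number of floats handled per block, e.g. 8 floats in one 256-bit AVX register.
constexpr std::size_t kLanes = 8;

// Scale n floats by k. The main loop handles full blocks of kLanes elements
// (the part a SIMD implementation would do with vector instructions); the
// tail loop finishes the remaining n % kLanes elements one at a time, so
// nothing is read or written past the end of either buffer.
void scale_row(const float *in, float *out, std::size_t n, float k)
{
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            out[i + lane] = in[i + lane] * k;
        }
    }
    for (; i < n; ++i) {
        out[i] = in[i] * k;
    }
}
```

The scalar tail costs a little speed on short rows, but it keeps tools such as AddressSanitizer usable without requiring padded images.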


Yale Zhang

1 p.m.

My spreadsheet (benchmark.xlsx) didn't get accepted. Here it is as base64

[base64-encoded attachment benchmark.xlsx omitted]
BgAAEwAAAAAAAAAAAAAAAAAAAAAAW0NvbnRlbnRfVHlwZXNdLnhtbFBLAQItABQABgAIAAAAIQC1 VTAj9QAAAEwCAAALAAAAAAAAAAAAAAAAAMoDAABfcmVscy8ucmVsc1BLAQItABQABgAIAAAAIQD+ aepXCgEAAMwDAAAaAAAAAAAAAAAAAAAAAPAGAAB4bC9fcmVscy93b3JrYm9vay54bWwucmVsc1BL AQItABQABgAIAAAAIQCULMTaXwEAAFECAAAPAAAAAAAAAAAAAAAAADoJAAB4bC93b3JrYm9vay54 bWxQSwECLQAUAAYACAAAACEAPSrV2CwDAADgCQAADQAAAAAAAAAAAAAAAADGCgAAeGwvc3R5bGVz LnhtbFBLAQItABQABgAIAAAAIQA5MbWR2wAAANABAAAjAAAAAAAAAAAAAAAAAB0OAAB4bC93b3Jr c2hlZXRzL19yZWxzL3NoZWV0MS54bWwucmVsc1BLAQItABQABgAIAAAAIQCvU3GXKAcAACceAAAY AAAAAAAAAAAAAAAAADkPAAB4bC93b3Jrc2hlZXRzL3NoZWV0Mi54bWxQSwECLQAUAAYACAAAACEA +2KlbZQGAACnGwAAEwAAAAAAAAAAAAAAAACXFgAAeGwvdGhlbWUvdGhlbWUxLnhtbFBLAQItABQA BgAIAAAAIQBO0ueJNAcAADQdAAAYAAAAAAAAAAAAAAAAAFwdAAB4bC93b3Jrc2hlZXRzL3NoZWV0 MS54bWxQSwECLQAUAAYACAAAACEA3BTlUA0BAAAcAgAAFAAAAAAAAAAAAAAAAADGJAAAeGwvc2hh cmVkU3RyaW5ncy54bWxQSwECLQAUAAYACAAAACEAAfRZwUEDAAD5CAAAGAAAAAAAAAAAAAAAAAAF JgAAeGwvZHJhd2luZ3MvZHJhd2luZzEueG1sUEsBAi0AFAAGAAgAAAAhALtjdb1CAQAAYwIAABEA AAAAAAAAAAAAAAAAfCkAAGRvY1Byb3BzL2NvcmUueG1sUEsBAi0AFAAGAAgAAAAhACPyWfQMAQAA WAMAABAAAAAAAAAAAAAAAAAA9SsAAHhsL2NhbGNDaGFpbi54bWxQSwECLQAUAAYACAAAACEAhtzd u5UBAABFAwAAEAAAAAAAAAAAAAAAAAAvLQAAZG9jUHJvcHMvYXBwLnhtbFBLAQItABQABgAIAAAA IQDP4Ee0wQEAACwVAAAnAAAAAAAAAAAAAAAAAPovAAB4bC9wcmludGVyU2V0dGluZ3MvcHJpbnRl clNldHRpbmdzMS5iaW5QSwUGAAAAAA8ADwDwAwAAADIAAAAA

On Mon, Sep 19, 2016 at 5:25 AM, Yale Zhang <yzhang1985@...400...> wrote:

...

I'm proud to announce that I have completed vectorizing both the IIR and FIR filters. Speedups are about 5x for IIR and ~20x for FIR (see spreadsheet). The code is a monstrosity, but given all the different cases ({FIR, IIR} x {int16, float, double} x {RGBA, grayscale}), as Jasper pointed out, that's to be expected. Earlier, I had only worked on the IIR, RGBA case, so it wasn't complete. It's unfortunate I let it languish for 3 years, but now, hopefully, everyone can get a smoother experience.

Please test it out and send some feedback and a path for checking in.

A unit test & benchmark function is included. The accuracy is always +- 1 intensity level, which should be quite safe.
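For readers who want to check the "+- 1 intensity level" claim on their own data, a comparison along these lines would do. This is only a sketch: the names max_abs_diff and within_one_level are made up here and are not taken from the patch, and it assumes 8-bit samples stored in two buffers of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Largest absolute per-sample difference between two equally sized
// 8-bit buffers. Name and signature are illustrative, not from the patch.
int max_abs_diff(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    int worst = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        if (d > worst) worst = d;
    }
    return worst;
}

// True when the optimized output stays within +-1 intensity level
// of the reference output.
bool within_one_level(const std::uint8_t* ref, const std::uint8_t* opt, std::size_t n) {
    return max_abs_diff(ref, opt, n) <= 1;
}
```

Running the reference and the vectorized blur on the same input and passing both outputs to within_one_level gives a yes/no answer per image.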

To build it, please add this to your cmake command: -DCMAKE_CXX_FLAGS="-DWIN32 -mavx2 -mfma -fpermissive -flax-vector-conversions"

You'll also need to turn on -std=c++14 if compiling the unit test & benchmark

Todo list:

* support processors with SSE2 only? The current version only works on an AVX2 processor.
* dynamic dispatch for different CPU types - I guess there should be 3 versions of the functions (SSE2, AVX1, AVX2)
* better ways to deal with SIMD remainders? The current code never writes past the end of a row, but liberally uses vector loads that go past the end. This will render memory error checking tools like AddressSanitizer useless because of false positives. Ideally, I'd like all the Cairo images to be padded to either 16 (SSE2) or 32 bytes (AVX), but that might not be practical.
* document all the arcane optimization methods (cache blocking, do-while loops, Duff's device) for posterity, preferably with SVG animations
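The padding idea from the todo list (rows padded to 16 bytes for SSE2 or 32 bytes for AVX) comes down to rounding the row length up to a multiple of the vector width. A minimal sketch, assuming the caller controls the row stride; padded_stride is a hypothetical helper, not a function from the patch or from Cairo:

```cpp
#include <cassert>
#include <cstddef>

// Round a row length in bytes up to the next multiple of `alignment`
// (16 for SSE2, 32 for AVX/AVX2). `alignment` must be a power of two.
// Hypothetical helper for illustration only.
std::size_t padded_stride(std::size_t row_bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (row_bytes + alignment - 1) & ~(alignment - 1);
}
```

With rows padded this way, a full-width vector load at the end of a row stays inside the allocation, so a tool like AddressSanitizer has nothing to flag.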

-Yale

On Wed, Mar 6, 2013 at 11:23 AM, Johan Engelen <jbc.engelen@...2592...> wrote:

...
On 4-3-2013 6:44, Yale Zhang wrote:

...
... but I'll still go ahead with the vectorized implementation.

Yes please do! I'm very interested in how to include SIMD code and how to go about vectorizing code; great if we have an example in our codebase.

Cheers, Johan

Inkscape-devel mailing list Inkscape-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/inkscape-devel

Alexander Brock

4:52 p.m.

On 09/19/2016 02:25 PM, Yale Zhang wrote:

...

I'm proud to announce that I have completed vectorizing both the IIR and FIR filters. Speedups are about 5x for IIR and ~20x for FIR (see spreadsheet). The code is a monstrosity, but given all the different cases ({FIR, IIR} x {int16, float, double} x {RGBA, grayscale}), as Jasper pointed out, that's to be expected. Earlier, I had only worked on the IIR, RGBA case, so it wasn't complete. It's unfortunate I let it languish for 3 years, but now, hopefully, everyone can get a smoother experience.

Cool :-)

...

Please test it out and send some feedback and a path for checking in.

Here's my experience with your patch on Debian stretch (native, latest upgrades installed), gcc version 6.1.1 20160802 (Debian 6.1.1-11):

1. I needed to remove the "-DWIN32" flag, since that triggered usage of the "Sleep" function, which led to a compiler error.

2. I needed to place the keyword "inline" before the four PartialVectorMask* functions in lines 983, 988, 994, 1000 of file nr-filter-gaussian.cpp. This problem was the reason:

http://stackoverflow.com/questions/13472341/inlining-failed-function-body-ca...

After that it compiled.
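The thread does not show how the PartialVectorMask* functions are declared, so the following is only a guess at the shape of the fix in point 2: if a helper is marked for forced inlining, GCC generally also wants it declared inline (or static). FORCE_INLINE and lane_mask are hypothetical names used for illustration.

```cpp
#include <cassert>

// FORCE_INLINE is a hypothetical macro, not one from the patch.
#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline
#endif

// Stand-in for a PartialVectorMask*-style helper: returns a bit mask
// with the low `count` lanes set (at most 8 lanes in this sketch).
FORCE_INLINE unsigned lane_mask(int count) {
    assert(count >= 0 && count <= 8);
    return (1u << count) - 1u;
}
```

Putting the inline keyword into the macro keeps the GCC and Visual Studio builds on the same source.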

...

A unit test & benchmark function is included. The accuracy is always +- 1 intensity level, which should be quite safe.

To build it, please add this to your cmake command: -DCMAKE_CXX_FLAGS="-DWIN32 -mavx2 -mfma -fpermissive -flax-vector-conversions"

You'll also need to turn on -std=c++14 if compiling the unit test & benchmark

I ran cmake with -DCMAKE_CXX_FLAGS="-mavx2 -mfma -fpermissive -flax-vector-conversions -std=c++14".

How can I run the test and the benchmark? In the "bin" directory these files showed up: attributes-test, color-profile-test, dir-util-test, inkscape, inkview, object-set-test, sp-object-test

Best Regards, Alexander

Yale Zhang

5:55 p.m.

Fellow Debian user, thanks for giving it a test and pointing out the problems. I did most of the development in Visual Studio, so there might be some incompatibilities.

Unfortunately, the unit test isn't very well integrated. I only just noticed that there's a Google unit test, but I've never used that. For now, I guess you can either:

1. rename the test function main() in nr-filter-gaussian.cpp and call it from the real main(). Also #define UNIT_TEST.

2. compile nr-filter-gaussian.cpp as a standalone program:
   a. #define UNIT_TEST
   b. compile just that file to an EXE. This could be a little difficult because you'll need to strip out some unused functions, or else there will be a ton of link errors. You'll also need to specify the libraries (at least -lcairo -lpng16). You can see the compile command with make VERBOSE=1 and use that as a starting point.

Alexander Brock

11:03 p.m.

On 09/19/2016 07:55 PM, Yale Zhang wrote:

...

compile nr-filter-gaussian.cpp as a standalone program:

a. #define UNIT_TEST b. compile just that file to an EXE. This could be a little difficult because you'll need to strip out some unused functions or else there will be a ton of link errors. You'll also need to specify the libraries (at least -lcairo -lpng16). You can see the compile command with make VERBOSE=1 and use that as a starting point.

I did this. I had to move the main function outside the Inkscape::Filters namespace, otherwise the linker complains that there is no main function. Also I needed a ton of linker flags.

I can run the executable but it does nothing except printing "error loading". I think I need the file "../drmixx/rasterized/99_showdown_carcass.png", can you provide that?

Best Regards, Alexander

Yale Zhang

20 Sep

1:33 a.m.

Alex, sorry about the inconvenience. Here's an easy way to compile the unit test with minimal changes to the source code:

1. add extern "C" before int main()

2. add "-ffunction-sections -fdata-sections -fvisibility=hidden -Xlinker --gc-sections" to the compile & link command. This will discard any unused functions, so that you won't see all those unresolved symbols.

3. add "-lcairo -lpng16 -lglib-2.0" to the compile/link command

"error loading". I think I need the file

Right, that's a panel from my comic. I'd rather not share it, but you can just use any 8-bit color PNG. The resolution I used for the benchmark is 1467x1373.

May I ask how you're using blurs in your work? I use the texture filters that come with Inkscape like marble and brick a lot in background layers. Earlier, this was painfully slow (even at average quality). For a while, I just used a hotkey to toggle filters off when I'm drawing and on when I want to see the final render. This might still be a good idea since a lot of other filters are still pretty slow. I also vectorized the Perlin noise generator.

Alexander Brock

11 a.m.

On 09/20/2016 03:33 AM, Yale Zhang wrote:

...

"error loading". I think I need the file

Right, that's a panel from my comic. I'd rather not share it, but you can just use any 8-bit color PNG. The resolution I used for the benchmark is 1467x1373.

I made a file with this resolution and ran the test. The program output is attached in the file log1

real 0m7.513s
user 0m57.092s
sys 0m0.284s

...

May I ask how you're using blurs in your work?

I use only Gaussian blur, and I use it very rarely.

I also ran the benchmark code, but it only prints the speed of the vectorized version. Can I make it print the reference speed as well? I attached my results and my memory configuration.

Best Regards, Alexander

Yale Zhang

7:54 p.m.

Thanks for the data. I realized there have been further problems with the benchmark on both our ends.

1. the speedups (Skylake i7-6700HQ vs Haswell i7-4770 in column I) aren't a valid comparison

Those numbers are almost certainly the multithreaded throughput for 4 cores. I forgot to say that if you want to benchmark single-thread throughput, you need to uncomment the line // omp_set_num_threads(1)

If I compare your numbers with mine from the 2nd sheet, the speed ups range from 0.4x to 1.2x, average = 0.78x. This is believable since you're using a power sipping 2.6 GHz CPU compared to a 3.4 GHz for me. If I scale up your numbers by 3.4/2.6 (optimistic, since in a lot of cases, especially IIR, memory bandwidth is a bigger bottleneck than CPU), then the speedup becomes 1.02x.

That's about right. Haswell and Skylake should have almost the same performance per clock cycle.

2. the width & height for the benchmark (but not for the accuracy-checking test) are swapped, because IterateCombinations() is called with the width & height swapped. I made the same mistake, though, so there's no inconsistency between our numbers.

3. my desktop memory is only dual channel. Must've gotten it mixed up with the desktop and servers I use at work, which are all quad channel.

4. multithreaded throughput is unstable: more than 10% run-to-run difference

Windows' anti-malware service?
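The clock-frequency normalization in point 1 is simple enough to check by hand, and the sketch below just spells it out. The function name clock_normalized is made up for illustration; the inputs are the figures quoted above (average ratio 0.78x, 3.4 GHz vs 2.6 GHz), and the scaling assumes the workload is purely CPU bound, which the message itself calls optimistic.

```cpp
#include <cassert>

// Scale a measured speed ratio by the ratio of the two clock frequencies,
// i.e. estimate what the slower-clocked CPU would achieve at the faster
// clock if nothing but the clock mattered. Illustrative helper only.
double clock_normalized(double measured_ratio, double fast_ghz, double slow_ghz) {
    assert(slow_ghz > 0.0);
    return measured_ratio * fast_ghz / slow_ghz;
}
```

clock_normalized(0.78, 3.4, 2.6) evaluates to about 1.02, matching the figure in the message.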

Also, I found the performance of the optimized loops where I use "goto middle" to handle SIMD remainders while keeping code size to a minimum (so it fits in cache or even better, the uOP cache) is fragile. I found the Microsoft compiler reorders the basic blocks, resulting in a loop that does 2 branches/iteration instead of 1, dropping the performance by almost 2x. Luckily, GCC 6 doesn't do that, which is why I used it for the benchmark. If anyone knows how to discourage the compiler from reordering code like that, I'd like to know.
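For readers unfamiliar with the "goto middle" pattern, here is one plausible shape of it in portable C++, with a 4-element array standing in for a SIMD register. This is an assumption about the general idea, not a copy of the loops in nr-filter-gaussian.cpp: a partial first chunk is loaded separately and then jumps into the middle of the main loop, so the compute part exists only once and each iteration ends in a single branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sum of squares, processed 4 "lanes" at a time. The partial chunk is
// zero-padded and enters the loop at `middle`, sharing the compute code
// with the full chunks. Illustrative only.
float sum_squares(const float* data, std::size_t n) {
    assert(n == 0 || data != nullptr);
    float lanes[4];
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    const std::size_t rem = n % 4;
    if (n == 0) return 0.0f;
    if (rem != 0) {
        // Partial chunk: copy only the valid elements, pad the rest with 0.
        std::memset(lanes, 0, sizeof lanes);
        std::memcpy(lanes, data, rem * sizeof(float));
        i = rem;
        goto middle;
    }
    do {
        // Full chunk: the stand-in for an unmasked vector load.
        std::memcpy(lanes, data + i, sizeof lanes);
        i += 4;
    middle:
        for (int k = 0; k < 4; ++k) acc[k] += lanes[k] * lanes[k];
    } while (i < n);
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Whether the compiler keeps this layout is exactly the fragility described above: if it moves the blocks around, the loop can end up with two branches per iteration.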

I just tried gcc 6.1.0, but saw no noticeable difference.

"can I make it print also the reference speed?" yes, set useRefCode in BenchmarkFunction() to true

-Yale

Yale Zhang

26 Sep

12:47 p.m.

Is no one interested in faster blurs any more? There have to be at least some users who want faster filters, like this guy:

https://www.youtube.com/watch?v=NUqzYC_Ehtc @20:30

I've attached one of my comic panels that's quite slow to render with filters on, as an example.

Also, I've prepared a new version that supports dynamic dispatch (SSE2, AVX, AVX2) and doesn't require any compile flag changes. This was quite difficult to do - lots of trouble getting multiple versions of a function to co-exist. But now, this can be a drop in replacement.
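As a rough picture of what dynamic dispatch between SSE2, AVX and AVX2 builds can look like, here is a minimal sketch. It is not the code from the branch: BlurRowFn, detect_cpu, select_blur_row and the blur_row_* functions are hypothetical, the per-ISA kernels are scalar stand-ins, and CPU detection uses the GCC/Clang builtin __builtin_cpu_supports, falling back to the scalar path on other compilers or architectures.

```cpp
#include <cassert>
#include <cstddef>

using BlurRowFn = void (*)(const float* in, float* out, std::size_t n);

// Scalar stand-in for the real work (here: halve every sample).
static void halve_row(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 0.5f;
}

// In a real build each variant would be compiled for its own instruction
// set (for example in separate files with different -m flags).
void blur_row_scalar(const float* in, float* out, std::size_t n) { halve_row(in, out, n); }
void blur_row_sse2(const float* in, float* out, std::size_t n)   { halve_row(in, out, n); }
void blur_row_avx(const float* in, float* out, std::size_t n)    { halve_row(in, out, n); }
void blur_row_avx2(const float* in, float* out, std::size_t n)   { halve_row(in, out, n); }

enum class CpuLevel { Scalar, SSE2, AVX, AVX2 };

// Query the CPU once at run time.
CpuLevel detect_cpu() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return CpuLevel::AVX2;
    if (__builtin_cpu_supports("avx"))  return CpuLevel::AVX;
    if (__builtin_cpu_supports("sse2")) return CpuLevel::SSE2;
#endif
    return CpuLevel::Scalar;
}

// Map the detected level to the best available kernel.
BlurRowFn select_blur_row(CpuLevel level) {
    switch (level) {
        case CpuLevel::AVX2: return blur_row_avx2;
        case CpuLevel::AVX:  return blur_row_avx;
        case CpuLevel::SSE2: return blur_row_sse2;
        default:             return blur_row_scalar;
    }
}
```

The caller would pick the function pointer once, for example at filter setup, and reuse it for every row, so the detection cost is not paid per call.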

Are there any other functions that people want to see become faster? For me, the 2nd biggest one is the turbulence filter.

Maybe Inkscape isn't the best place for optimized code. Lots of projects use FIR filters, so maybe OpenCV? There's also the Oil library, but I don't think it should go there, because filtering isn't that generic and its Intel IPP-like approach of providing lots of low-level functions to compose from isn't likely to give as good a speedup as a fully custom one.

-Yale

Alexander Brock

12:53 p.m.

On 09/26/2016 02:47 PM, Yale Zhang wrote:

...

Is no one interested in faster blurs any more?

I ran some tests using your code to see how the speedup varies but I have no time to analyze the results before Saturday.

I'm very interested in having a faster Inkscape :-)

Best Regards, Alexander Brock

Yale Zhang

29 Sep

1:25 p.m.

OK, I've created a branch with the optimized gaussian blur here:

https://code.launchpad.net/~simdgenius/inkscape/inkscape

Can anyone approve merging this into trunk? -Yale

Martin Owens

1:40 p.m.

On Thu, 2016-09-29 at 06:25 -0700, Yale Zhang wrote:

...

OK, I've created a branch with the optimized gaussian blur here:

https://code.launchpad.net/~simdgenius/inkscape/inkscape

Can anyone approve merging this into trunk?

This is great. You need to hit the "Propose for merging" link on the code page and then code review / merging can happen.

Martin,

Jasper van de Gronde

3:12 p.m.

What's SimpleImage.h for? At first glance it seems a little strange to have "yet another" image representation just for computing blurs. Great work though!

On 09/29/2016 03:25 PM, Yale Zhang wrote:

...

OK, I've created a branch with the optimized gaussian blur here:

https://code.launchpad.net/~simdgenius/inkscape/inkscape

Can anyone approve merging this into trunk? -Yale

On Mon, Sep 26, 2016 at 5:47 AM, Yale Zhang <yzhang1985@...400...> wrote:

...
Is no one interested in faster blurs any more? There at least has to be some users who want faster filters like this guy:

https://www.youtube.com/watch?v=NUqzYC_Ehtc @20:30

I've attached one of my comic panels that's quite slow to render with filters on, as an example.

Also, I've prepared a new version that supports dynamic dispatch (SSE2, AVX, AVX2) and doesn't require any compile flag changes. This was quite difficult to do - lots of trouble getting multiple versions of a function to co-exist. But now, this can be a drop in replacement.

Are there any other functions that people want to see become faster? For me, the 2nd biggest one is the turbulence filter.

Maybe Inkscape isn't the best place for optimized code. Lots of projects use FIR filters, so maybe OpenCV? There's also the Oil library, but I don't think think it should go there because filtering isn't that generic and its Intel IPP like approach of providing lots of low level functions to compose from isn't likely to have as good speedup as a fully custom one.

-Yale

On Tue, Sep 20, 2016 at 12:54 PM, Yale Zhang <yzhang1985@...400...> wrote:

...
Thanks for the data. I realized there's been further problems with the benchmark on both our ends.

the speedups (Skylake i6700HQ vs Haswell 4770 in column I) isn't a

valid comparison

Those numbers are almost certainly the multithreaded throughput for 4 cores. I forgot to say if you want to benchmark single thread throughput, you need to uncomment the line, // omp_set_num_threads(1)
If I compare your numbers with mine from the 2nd sheet, the speed
ups range from 0.4x to 1.2x, average = 0.78x. This is believable since you're using a power sipping 2.6 GHz CPU compared to a 3.4 GHz for me. If I scale up your numbers by 3.4/2.6 (optimistic, since in a lot of cases, especially IIR, memory bandwidth is a bigger bottleneck than CPU), then the speedup becomes 1.02x.

That's about right. Haswell and Skylake should have almost the same performance per clock cycle.

the width & height for the benchmark (but not the accuracy checking

test) is swapped due to calling IterateCombinations() with the width & height swapped but I made the same mistake, so no inconsistency.

my desktop memory is only dual channel. Must've gotten it mixed up

with the desktop and servers I use at work, which are all quad channel.

multithreaded throughput unstable - > 10% run to run difference

Windows' anti-malware service?

Also, I found the performance of the optimized loops where I use "goto middle" to handle SIMD remainders while keeping code size to a minimum (so it fits in cache or even better, the uOP cache) is fragile. I found the Microsoft compiler reorders the basic blocks, resulting in a loop that does 2 branches/iteration instead of 1, dropping the performance by almost 2x. Luckily, GCC 6 doesn't do that, which is why I used it for the benchmark. If anyone knows how to discourage the compiler from reordering code like that, I'd like to know.

I just tried gcc 6.1.0, but saw no noticeable difference.

"can I make it print also the reference speed?" yes, set useRefCode in BenchmarkFunction() to true

-Yale

On Tue, Sep 20, 2016 at 4:00 AM, Alexander Brock <a.brock@...2965...> wrote:

...
On 09/20/2016 03:33 AM, Yale Zhang wrote:

...
"'error loading'. I think I need the file" Right, that's a panel from my comic. I'd rather not share it, but you can just use any 8-bit color PNG. The resolution I used for the benchmark is 1467x1373.

I made a file with this resolution and ran the test. The program output is attached in the file log1

real 0m7.513s
user 0m57.092s
sys 0m0.284s

...
May I ask how you're using blurs in your work?

I use only Gaussian blur and I use it very rarely.

I also ran the benchmark code but it only prints speed of the vectorized version, can I make it print also the reference speed? I attached my results and my memory configuration.

Best Regards, Alexander

Inkscape-devel mailing list Inkscape-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/inkscape-devel

Yale Zhang

11:16 p.m.

Glad to finally hear from you all. I've submitted a merge proposal.

"What's SimpleImage.h for?"

It's to allow 2D indexing with image[y][x] instead of image[y * width + x].

The intention is to use a minimal representation (pointer and stride) that's completely reusable. Functions that process images will only use the minimal representation while functions that create images will use a more specific class that handles memory allocation.

Since templates make things generic anyways, it could be changed to directly use a specific image class if Inkscape already has one. My concern is that the [] operator should use a pointer sized integer instead of int or else the compiler will generate redundant sign or zero extend instructions:

http://stackoverflow.com/questions/36706721/is-a-sign-or-zero-extension-requ...
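To illustrate the idea, here is a minimal sketch of a non-owning 2D view with image[y][x] indexing. It is not the contents of SimpleImage.h: the type name, the member names, and the choice of a stride measured in elements are all assumptions made for this example.

```cpp
#include <cstddef>

// Hypothetical non-owning view: just a pointer and a stride (in elements).
template <typename T>
struct ImageView {
    T* data;                // first element of row 0
    std::ptrdiff_t stride;  // distance between consecutive rows, in elements

    // The index is pointer-sized, so the compiler has no narrower integer
    // to sign- or zero-extend before the address computation.
    T* operator[](std::ptrdiff_t y) const { return data + y * stride; }
};

// Example consumer: it depends only on the view, not on how the pixels
// were allocated.
template <typename T>
void fill(ImageView<T> img, std::ptrdiff_t width, std::ptrdiff_t height, T value) {
    for (std::ptrdiff_t y = 0; y < height; ++y)
        for (std::ptrdiff_t x = 0; x < width; ++x)
            img[y][x] = value;
}
```

A class that owns its memory could hand out such a view to processing functions, which keeps the allocation policy out of their signatures.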

On Thu, Sep 29, 2016 at 8:12 AM, Jasper van de Gronde <th.v.d.gronde@...528...> wrote:

...

What's SimpleImage.h for? At first glance it seems a little strange to have "yet another" image representation just for computing blurs. Great work though!

On 09/29/2016 03:25 PM, Yale Zhang wrote:

...
OK, I've created a branch with the optimized gaussian blur here:

https://code.launchpad.net/~simdgenius/inkscape/inkscape

Can anyone approve merging this into trunk? -Yale

On Mon, Sep 26, 2016 at 5:47 AM, Yale Zhang <yzhang1985@...400...> wrote:

...
Is no one interested in faster blurs any more? There have to be at least some users who want faster filters, like this guy:

https://www.youtube.com/watch?v=NUqzYC_Ehtc @20:30

I've attached one of my comic panels that's quite slow to render with filters on, as an example.

Also, I've prepared a new version that supports dynamic dispatch (SSE2, AVX, AVX2) and doesn't require any compile flag changes. This was quite difficult to do - lots of trouble getting multiple versions of a function to co-exist. But now, this can be a drop in replacement.
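One common way to get this kind of dynamic dispatch is sketched below. It is not the approach taken in the branch, which isn't shown in this thread. All function names are hypothetical, the three "SIMD" variants are scalar placeholders so the example stays portable, and the CPU check relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports, which are only available on x86 targets.

```cpp
#include <cstddef>

using SumFn = float (*)(const float*, std::size_t);

// Scalar reference implementation.
static float sum_scalar(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Placeholders: in real code each variant would be built for its own
// instruction set. Here they simply forward to the scalar version.
static float sum_sse2(const float* p, std::size_t n) { return sum_scalar(p, n); }
static float sum_avx(const float* p, std::size_t n)  { return sum_scalar(p, n); }
static float sum_avx2(const float* p, std::size_t n) { return sum_scalar(p, n); }

// Picks the widest variant the running CPU reports support for.
static SumFn select_sum() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
    if (__builtin_cpu_supports("avx"))  return sum_avx;
    if (__builtin_cpu_supports("sse2")) return sum_sse2;
#endif
    return sum_scalar;  // fallback for other compilers and architectures
}

// Public entry point: the variant is resolved once, on first use.
float sum(const float* p, std::size_t n) {
    static const SumFn fn = select_sum();
    return fn(p, n);
}
```

Callers only ever see sum(), so no compile flags need to change at the call sites. The hard part mentioned above, getting several builds of the same function to coexist in one binary, is not covered by this sketch.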

Are there any other functions that people want to see become faster? For me, the 2nd biggest one is the turbulence filter.

Maybe Inkscape isn't the best place for optimized code. Lots of projects use FIR filters, so maybe OpenCV? There's also the Oil library, but I don't think it should go there, because filtering isn't that generic, and its Intel IPP-like approach of providing lots of low-level functions to compose from isn't likely to give as good a speedup as a fully custom one.

-Yale

On Tue, Sep 20, 2016 at 12:54 PM, Yale Zhang <yzhang1985@...400...> wrote:

...

Jasper van de Gronde

4 Mar

8:43 a.m.

On 03-03-13 11:26, Yale Zhang wrote:

...

Hi. I'm using Inkscape to author a comic and the slow speed for certain things is very annoying. I already have OpenMP turned on and am using 4 cores.

*large blurs are slow* - I'm an expert with writing SIMD code, so I was thinking about vectorizing the Gaussian IIR filter with SIMD intrinsics, even though it's harder than for a FIR. But I noticed there isn't any SIMD code in Inkscape, so does that mean it's something to avoid? I'm pretty sure no current compiler is smart enough to vectorize it, and besides, Inkscape is compiled with -O2, meaning -ftree-vectorize isn't on by default.

Vectorizing the IIR code would be great (although you may want to make sure that in the end it all still works on a wide variety of architectures). But the code is pretty tricky. It's perhaps one of the most efficient ways to implement a blur using an IIR filter, but it's also extremely prone to numerical issues and the like. One way this is reflected in the code is the absolute necessity of the clamping operations. Also, in the past we've had trouble with using 32bit floats... (Only in specific instances.)

In other words: test the resulting code quite well.
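To make the clamping point concrete, here is a minimal sketch of converting a filtered sample back to 8 bits. It is not Inkscape's code. The function name is hypothetical, and the assumption is simply that a recursive filter can overshoot the valid range slightly (for example -0.3 or 255.4), which would wrap around or be undefined in a plain narrowing cast.

```cpp
#include <cstdint>

// Hypothetical helper: clamps to [0, 255] before rounding and narrowing.
std::uint8_t to_u8_clamped(float v) {
    if (!(v > 0.0f)) return 0;      // negatives, zero, and NaN map to 0
    if (v >= 255.0f) return 255;    // overshoot above the range
    return static_cast<std::uint8_t>(v + 0.5f);  // round to nearest
}
```

A vectorized version would need the same saturation behavior, which is one of the things worth covering in tests.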

(I've been experimenting with different ways of approximating a Gaussian blur that are more stable, but haven't really found a completely convincing alternative yet.)


participants (7)

Alexander Brock
Jasper van de Gronde
Johan Engelen
Martin Owens
Tavmjong Bah
Yale Zhang
~suv