multithreaded rendering

Yale Zhang

31 Dec 2016

11:21 p.m.

In my quest for a more fluid experience with fewer distractions, I've attempted to multithread the rendering. I see this has been a long-standing discussion:

https://bugs.launchpad.net/inkscape/+bug/200415
https://bugs.launchpad.net/inkscape/+bug/330271

I'm using a Lenovo P40 tablet and the total frame rendering time for a simple piece is 80 to 100ms (1920x1080, a few hundred vertices on 2 layers with no filter effects - only alpha compositing). This slow rendering speed makes the touchscreen zooming I recently implemented very jerky.

So, I tried multithreading SPCanvas::paintRectInternal() with OpenMP by splitting the rectangle into 2 halves and rendering them in parallel. I used mutual exclusion for some obviously thread-unsafe code like the call to markRect in SPCanvas::paintSingleBuffer(). The rendering would work for a few frames before freezing (the waiting threads time out and then exit).

Then, I put other calls that I suspected were thread-unsafe in mutually exclusive blocks until I discovered that _root->render() isn't thread safe. No point in going further.
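
For readers who have not tried this, the experiment described above can be sketched roughly as follows. This is a toy model, not the Inkscape code: ToyCanvas, paint_rect, paint_split and the painted list are made-up names, and the pixel fill stands in for the real rendering. The OpenMP pragmas take effect when compiled with -fopenmp; without it the code simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; none of these names come from Inkscape.
struct Rect { int x0, y0, x1, y1; };  // half-open: [x0,x1) x [y0,y1)

struct ToyCanvas {
    int width, height;
    std::vector<std::uint32_t> pixels;
    std::vector<Rect> painted;  // shared bookkeeping, plays the role of markRect

    ToyCanvas(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}

    void paint_rect(const Rect &r, std::uint32_t color) {
        assert(r.x0 >= 0 && r.y0 >= 0 && r.x1 <= width && r.y1 <= height);
        // The two halves are disjoint, so the pixel writes need no lock.
        for (int y = r.y0; y < r.y1; ++y)
            for (int x = r.x0; x < r.x1; ++x)
                pixels[static_cast<std::size_t>(y) * width + x] = color;
        // The shared vector is not thread safe, so serialize access to it.
        #pragma omp critical(toy_mark_rect)
        painted.push_back(r);
    }

    // Split the rectangle into a top and a bottom half and paint both,
    // one OpenMP section per half.
    void paint_split(const Rect &r, std::uint32_t color) {
        const int mid = r.y0 + (r.y1 - r.y0) / 2;
        const Rect top{r.x0, r.y0, r.x1, mid};
        const Rect bottom{r.x0, mid, r.x1, r.y1};
        #pragma omp parallel sections
        {
            #pragma omp section
            paint_rect(top, color);
            #pragma omp section
            paint_rect(bottom, color);
        }
    }
};
```

The toy works because the only shared state is one vector behind one critical section. The real renderer shares far more state than that, which is why guarding individual calls by hand did not scale.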

Excuse my naive attempt. Can anyone guess how feasible it is to multithread the rendering? For now, I don't care if it's pixel perfect. I just need something that's decent and doesn't crash/freeze.

I'm also wondering why the Cairo OpenGL backend isn't being used? GPU rendering on integrated GPUs should give a nice speedup since there should be no copying overhead.

The other place to optimize is pixman. I did some profiling (rapidly zooming in and out with the touchscreen) and >= 25% of the time is spent in pixman rendering. I already went ahead and ported a few functions to AVX2 and got a ~1.3x speedup (I'd expect more on other machines, since my laptop is bottlenecked by memory bandwidth owing to having only 1 memory channel).

Function                                                     Module                Samples
sse2_blt.part.0                                              libpixman-1-0.dll     4221
sse2_combine_in_u                                            libpixman-1-0.dll     2189
sse2_fill                                                    libpixman-1-0.dll     1693
cairo_tor_scan_converter_generate                            libcairo-2.dll        1494
sse2_composite_over_8888_8888                                libpixman-1-0.dll     1424
bits_image_fetch_separable_convolution_affine_none_a8r8g8b8  libpixman-1-0.dll     1104
feed_curve_to_cairo(_cairo*Geom::Curve const&                libinkscape_base.dll  611
fast_composite_scaled_bilinear_sse2_8888_8888_cover_SRC      libpixman-1-0.dll     475
fill_xrgb32_lerp_opaque_spans                                libcairo-2.dll        348
cairo_tor_scan_converter_add_polygon                         libcairo-2.dll        260
compute_face                                                 libcairo-2.dll        238
_dynamic_cast                                                libstdc++-6.dll       209
outer_join                                                   libcairo-2.dll        179
cairo_polygon_add_edge                                       libcairo-2.dll        178
g_hash_table_lookup                                          libglib-2.0-0.dll     169
cairo_spline_decompose_into                                  libcairo-2.dll        153
g_slice_alloc                                                libglib-2.0-0.dll     138
cairo_spline_intersects                                      libcairo-2.dll        131
feed_pathvector_to_cairo(_cairo*Geom::PathVector const&)     libinkscape_base.dll  127
line_to                                                      libcairo-2.dll        119
void std::vector<Geom::Pointstd::allocatorGeom::Point        libinkscape_base.dll  116
cairo_matrix_transform_point                                 libcairo-2.dll        110
cell_list_render_edge                                        libcairo-2.dll        106
g_type_check_instance_is_a                                   libgobject-2.0-0.dll  106

Bryce Harrington

1 Jan 2017

9:42 a.m.

On Sat, Dec 31, 2016 at 06:21:40PM -0500, Yale Zhang wrote:

> In my quest for a more fluid experience with fewer distractions, I've attempted to multithread the rendering. I see this has been a long standing discussion:
>
> https://bugs.launchpad.net/inkscape/+bug/200415
> https://bugs.launchpad.net/inkscape/+bug/330271
>
> I'm using a Lenovo P40 tablet and the total frame rendering time for a simple piece is 80 to 100ms (1920x1080, a few hundred vertices on 2 layers with no filter effects - only alpha compositing). This slow rendering speed makes the touchscreen zooming I recently implemented very jerky.
>
> So, I tried multithreading SPCanvas::paintRectInternal() with OpenMP by splitting the rectangle into 2 and rendering them in ||. I used mutual exclusion for some obviously thread unsafe code like the call to markRect in SPCanvas::paintSingleBuffer(). The rendering would work for a few frames before it freezes (waiting threads timeout and then exit).
>
> Then, I put other calls that I suspected were thread unsafe in mutually exclusive blocks until I discovered _root->render() isn't safe. No point in going further.
>
> Excuse my naive attempt. Can anyone guess how feasible it is to multithread the rendering? For now, I don't care if it's pixel perfect. I just need something that's decent and doesn't crash/freeze.

Yes, various people have looked at multithreading before, but have not found a feasible way to attack it.

> I'm also wondering why the Cairo OpenGL backend isn't being used? GPU rendering on integrated GPUs should give a nice speedup since there should be no copying overhead.

On Linux, the cairo library is typically shipped with its GL backend disabled, which presents a logistical roadblock that would need to be solved. Also, while you're right that in theory it should provide a performance boost, that's not guaranteed. OpenGL support has been experimental in Cairo and is not as thoroughly tested as the X and other backends, so there may well be corner cases where performance is poorer. But there's no way to know for certain except to hook it up and try it out. A number of us have had this task on our todo lists, but I don't think anyone has taken a solid shot at it yet.

> The other place to optimize is pixman. I did some profiling (rapidly zooming in and out with touchscreen) and >= 25% of the time is spent in pixman rendering. I already went ahead and ported a few to AVX2 and got ~1.3x speedup (should get more since my laptop is bottlenecked by memory bandwidth owing to having only 1 memory channel).

Since pixman is low level and widely used, optimizations would be very interesting. I don't know how widespread AVX2 is, though, or whether a 1.3x improvement is a large enough benefit to warrant considering it for pixman. Regardless, I'd be interested in learning more about your work along these paths. Perhaps you'll discover something worth including in the upstream codebases?

Thanks, Bryce

_______________________________________________
Inkscape-devel mailing list
Inkscape-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/inkscape-devel

Yale Zhang

9 Jan 2017

3:38 a.m.

Thanks for the encouragement. It's both encouraging and intimidating to hear that multithreaded rendering hasn't been seriously attempted before.

I have an ace up my sleeve: ThreadSanitizer, which ships with both GCC and LLVM. It works like Valgrind/AddressSanitizer but reports data races instead of buffer overflows. I figured out how to use it and eliminated all the fatal race conditions one by one. Patch is attached if anyone wants to try.
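
As a rough illustration of that workflow (not taken from the attached patch), here is a minimal sketch of the kind of race ThreadSanitizer flags and the usual fix. HitCounter and count_from_two_threads are hypothetical names. Building with something like g++ -g -O1 -fsanitize=thread -pthread enables the checker.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared state; stands in for a cache touched during rendering.
struct HitCounter {
    long hits = 0;
    std::mutex lock;

    void record() {
        // Remove this guard and ThreadSanitizer reports a data race on `hits`.
        std::lock_guard<std::mutex> guard(lock);
        ++hits;
    }
};

// Run the same work on two threads and return the final count.
long count_from_two_threads(int per_thread) {
    assert(per_thread >= 0);
    HitCounter counter;
    auto work = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i) counter.record();
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter.hits;
}
```

Each report names the two conflicting accesses and their stack traces, which is what makes it practical to fix the races one at a time.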

Here are the speedups I got on a complex scene with lots of filters (unexpected_visitor.svg from the discussion on my vectorized Gaussian blur):

1 thread: 4.2 s
8 threads (hyperthreaded): 1.1 s
1 thread (8 for filters): 2.6 s

CPU: Intel 4770 @ 3.4 GHz
Memory: 2 channels DDR3 @ 1866 MHz
OS: Windows 10

So that's a ~2.4x speedup over the current implementation (2.6 s down to 1.1 s). Not bad, but on another synthetic scene (3 heavily blurred boxes), multithreading is actually more than 2x slower than a single thread! This was very puzzling, but I finally figured it out: it's doing almost 4x more work. When rendering a filtered object, all objects behind it have to be rendered immediately, and that intermediate rendering can cover a larger area than the rendered region itself, since filters can access neighboring pixels. For the blur filter, the expanded region was way bigger than the rectangle each thread was rendering to!

This also explains why the current renderer, without multithreading, is very slow when zooming into a heavily blurred region: the rendering is done in blocks to improve responsiveness, but the block size (64K pixels) is too small. Please consider increasing this. Maybe make it 1/8 of the window height?
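
To put numbers on the extra work, here is a back-of-the-envelope sketch. It assumes the filter needs a fixed margin of pixels on every side of a tile and that a region is cut into equal horizontal strips. The function names and the margin model are illustrative assumptions, not how the renderer computes its filter area.

```cpp
#include <cassert>
#include <cstdint>

// Pixels that must be rendered behind a filtered object for one tile,
// if the filter reads `margin` pixels beyond every edge of the tile.
std::int64_t expanded_area(int w, int h, int margin) {
    assert(w > 0 && h > 0 && margin >= 0);
    return static_cast<std::int64_t>(w + 2 * margin) * (h + 2 * margin);
}

// Total intermediate pixels when a w x h region is cut into `strips`
// equal horizontal strips that are each rendered independently.
std::int64_t total_area_when_split(int w, int h, int margin, int strips) {
    assert(strips > 0 && h % strips == 0);
    return strips * expanded_area(w, h / strips, margin);
}
```

With made-up numbers, a 1920x1080 region and a 200 pixel margin, one block needs 2320 x 1480 = 3,433,600 intermediate pixels, while 8 strips need 8 x 2320 x 535 = 9,929,600, about 2.9x as many. The thinner the strips get relative to the margin, the worse the ratio.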

I hope this isn't a fundamental problem. Any thoughts on how this might be improved?

The other, safer approach is to multithread pixman, but that also has lots of challenges and probably won't be as fast for most scenes:
- lots of small functions to optimize
- some functions, like Cairo's scanline rendering (cairo_tor_scan_converter_generate()), are probably difficult or too small to be multithreaded
- needs more forks/joins (fine-grained parallelism). This will be bad on Windows, where the pthread wake-up latency is 7 times longer than on Linux.

"I don't know how widespread AVX2 is, or if the 1.3x improvement is a large enough benefit to warrant considering it for [pixman]"

AVX2 should be on all Intel processors since summer 2013 (Haswell). I measured the speedups again on my desktop, which has more than 2x the memory bandwidth of my laptop, and the gain is still quite weak. It must be that functions like blits (2D memcpy), fills, composite_in and composite_out are all bandwidth limited, so wider SIMD isn't much help. You might say this bottleneck contradicts the multithreading speedups reported above, but keep in mind that 1 core alone can't use up all the memory bandwidth.
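
A quick sanity check on the bandwidth argument, as a sketch only. It assumes "DDR3 @ 1866 MHz" means DDR3-1866 (1866 mega-transfers per second) on a standard 64-bit channel, and it computes theoretical peaks, which real code never reaches. The function names are made up for this example.

```cpp
#include <cassert>

// Theoretical peak DRAM bandwidth in bytes per second.
// Assumes a 64-bit (8-byte) bus per channel, the norm for DDR3 DIMMs.
double peak_bandwidth_bytes_per_s(int channels, double mega_transfers_per_s) {
    assert(channels > 0 && mega_transfers_per_s > 0.0);
    return channels * 8.0 * mega_transfers_per_s * 1e6;
}

// Lower bound on the time for one pass that writes every pixel of a
// width x height buffer at 4 bytes per pixel, ignoring caches.
double min_fill_seconds(int width, int height, double bandwidth_bytes_per_s) {
    assert(width > 0 && height > 0 && bandwidth_bytes_per_s > 0.0);
    const double bytes = 4.0 * width * height;
    return bytes / bandwidth_bytes_per_s;
}
```

Under those assumptions the desktop peaks at roughly 29.9 GB/s, so one full pass over a 1920x1080 ARGB32 buffer cannot take less than about 0.28 ms. A function that already runs near that floor gains nothing from wider vectors.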

I'll have a discussion with the pixman developers to see what they think. On a related note, I've also submitted a patch for Windows touchscreen support in GTK: https://bugzilla.gnome.org/show_bug.cgi?id=776568

-Yale

Tavmjong Bah

1:59 p.m.

Hi Yale,

This all looks quite interesting. I've CC'd our resident rendering expert, Krzysztof, who can probably give you the best feedback.

I've wondered for some time how useful splitting the screen up into tiles really is. Computers have become much faster since that code was written. And you're right, it certainly is a big slowdown when you zoom in close to an object that has a filter, as one still must calculate quite a large area to handle filters that use a large displacement or a large blur radius... and now you have to recalculate it multiple times.

I don't have a good feel for what 64k block size really means vs. 1/8 of a screen. What block size would that require (or conversely, how big of an area does a 64k block size correspond to)?

Tav

Yale Zhang

7:18 p.m.

Right, I know Krzysztof ported the renderer to Cairo. Also, thanks to the ThreadSanitizer developers for such a useful tool.

The small block problem is that in SPCanvas::paintRectInternal(), if the dirty rectangle is bigger than 64K pixels, it gets recursively split. From what I've seen, the split always seems to be in the vertical dimension. For a window width of 1500, the strip height would be at most 43 pixels (65536 / 1500 ≈ 43.7), which is very inefficient for large blurs.
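
To make the block size concrete, here is a small sketch of the arithmetic. It models the splitting as described above, halving the height until a piece fits in the pixel budget. That is an assumption about the behavior, not a copy of paintRectInternal(), and both function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Tallest strip of the given width that still fits within max_pixels.
int max_strip_height(int width, int max_pixels) {
    assert(width > 0 && max_pixels >= width);
    return max_pixels / width;
}

// Number of strips produced by halving the height until each piece
// holds at most max_pixels pixels.
int count_strips(int width, int height, int max_pixels) {
    assert(width > 0 && height > 0);
    if (static_cast<std::int64_t>(width) * height <= max_pixels || height == 1)
        return 1;
    const int top = height / 2;
    return count_strips(width, top, max_pixels) +
           count_strips(width, height - top, max_pixels);
}
```

With a 65536 pixel budget, a 1500 pixel wide strip can be at most 43 rows tall, and under this halving model a 1500x1000 region ends up as 32 strips of 31 or 32 rows each. For comparison, 1/8 of a 1080 row window is 135 rows.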

For my multithreaded testing, I increased the threshold so that it never splits.

    // use 256K as a compromise to not slow down gradients
    // 256K is the cached buffer and we need 4 channels
    setup.max_pixels = 65536; // 256K/4

Also, in my benchmarks, I disabled the rendering cache.
