Re: [Inkscape-devel] multithreaded rendering

9 Jan 2017

Thanks for the encouragement. It's both encouraging and intimidating
to hear that multithreaded rendering hasn't been seriously attempted
before.
I have an ace up my sleeve: GCC/LLVM's thread sanitizer. It works like
Valgrind/AddressSanitizer but reports race conditions instead of
buffer overflows. I figured out how to use it and iteratively
eliminated all the fatal race conditions one by one. Patch is attached
if anyone wants to try.
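For anyone who hasn't used the thread sanitizer, here is a minimal sketch of the kind of bug it reports. This is not code from the attached patch: the function names count_racy and count_locked are made up for illustration, and the build line in the comment is an assumed invocation. The first function has an unsynchronized read-modify-write on a shared counter, which a build with -fsanitize=thread flags as a data race at run time; the second shows the usual mutex fix.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Assumed build line (GCC or Clang):
//   g++ -std=c++17 -g -O1 -fsanitize=thread -pthread example.cpp

// Hypothetical racy version: every thread increments `counter` without
// synchronization. Running this under ThreadSanitizer produces a data
// race report; the returned value is not reliable.
long count_racy(int threads, int iters)
{
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i)
                ++counter;  // unsynchronized read-modify-write
        });
    for (auto& th : pool)
        th.join();
    return counter;
}

// Fixed version: the increment is protected by a mutex, so the sanitizer
// stays quiet and the result is always threads * iters.
long count_locked(int threads, int iters)
{
    assert(threads > 0 && iters >= 0);
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&counter, &m, iters] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++counter;
            }
        });
    for (auto& th : pool)
        th.join();
    return counter;
}
```

Calling count_locked(4, 1000) returns 4000 every time, while count_racy(4, 1000) may return less and is reported by the sanitizer.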
Here are the timings I got on a complex scene with lots of filters
(unexpected_visitor.svg from the discussion on my vectorized Gaussian
blur):
1 thread: 4.2 s
8 threads (hyperthreaded): 1.1 s
1 thread (8 for filters): 2.6 s
CPU: Intel 4770 @ 3.4 GHz
memory: 2 channels DDR3 @ 1866 MHz
OS: Windows 10
So a ~2.4x speedup over the current implementation (2.6 s, which
already uses 8 threads for filters). Not bad, but on another synthetic
scene (3 heavily blurred boxes), multithreading is actually > 2x slower
than a single thread! This was very puzzling, but I finally figured it
out. It's because it's doing almost 4x more work.
When rendering a filtered object, all objects behind that one have to
be rendered immediately and that intermediate rendering can have a
larger area than the rendered region itself since filters can access
neighboring pixels.  For the blur filter, the expanded region was way
bigger than the rectangle each thread was rendering to! It also
explains why the current renderer without multithreading is very slow
when zooming into a heavily blurred region. It's because the rendering
is done in blocks to improve responsiveness. But the block size (64k)
is too small. Please consider increasing this - make it 1/8 of the
window height?
I hope this isn't a fundamental problem. Any thoughts on how this
might be improved?
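To put rough numbers on the extra work, here is a small sketch of the arithmetic. It rests on simplifying assumptions that are not taken from the Inkscape source: the visible region is split into equal horizontal strips, each strip re-renders its background expanded by a fixed margin (standing in for the blur's reach) on every side, and nothing is clipped or cached. The function name overdraw_ratio and its parameters are hypothetical.

```cpp
#include <cassert>

// Ratio of pixels rendered when a width x height region is split into
// `strips` horizontal strips, versus rendering it in one piece.
// Assumption: every render must cover its target rectangle expanded by
// `margin` pixels on all four sides, because the filter reads neighbors.
double overdraw_ratio(int width, int height, int strips, int margin)
{
    assert(strips > 0 && height % strips == 0 && margin >= 0);
    // One render of the whole region, expanded by the margin.
    const double whole =
        double(width + 2 * margin) * (height + 2 * margin);
    // One render per strip, each expanded by the same margin.
    const double strip =
        double(width + 2 * margin) * (height / strips + 2 * margin);
    return strips * strip / whole;
}
```

With no margin the ratio is exactly 1, so splitting costs nothing. Once the margin is comparable to the strip height, each strip renders far more background than it displays and the ratio grows quickly, which matches the behavior described above for heavily blurred objects and small blocks.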
The other, safer approach is to multithread pixman, but that also has
lots of challenges and probably won't be as fast for most scenes:
- lots of small functions to optimize
- some functions like Cairo's scanline rendering
  (cairo_tor_scan_converter_generate()) are probably difficult or
  too small to be multithreaded
- needs more forks/joins (fine-grained parallelism). This will be bad on
  Windows, where the pthread wake-up latency is 7 times longer than on
  Linux.
"I don't know how widespread AVX2 is, or if the 1.3x improvement is a
large enough benefit to warrant considering it for [pixman]"
AVX2 should be on all Intel processors since summer 2013 (Haswell). I
measured the speedups again on my desktop with > 2x the memory
bandwidth of my laptop and it's still quite weak. It must be that
those functions like blits (2D memcpy), fills, composite_in,
composite_out, are all bandwidth limited, so wider SIMD isn't much
help. You might say this bottleneck would contradict the reported
speedups above, but keep in mind that 1 core alone can't fully use up
all the memory bandwidth.
I'll have a discussion with the pixman developers to see what they
think. On a related note, I've also submitted a patch for Windows
touchscreen support in GTK:
https://bugzilla.gnome.org/show_bug.cgi?id=776568
-Yale
On Sun, Jan 1, 2017 at 1:42 AM, Bryce Harrington
<bryce@...961...> wrote:
...
On Sat, Dec 31, 2016 at 06:21:40PM -0500, Yale Zhang wrote:
...
In my quest for a more fluid experience with fewer distractions, I've
attempted to multithread the rendering. I see this has been a long
standing discussion:
https://bugs.launchpad.net/inkscape/+bug/200415
https://bugs.launchpad.net/inkscape/+bug/330271
I'm using a Lenovo P40 tablet and the total frame rendering time for a
simple piece is 80 to 100ms (1920x1080, a few hundred vertices on 2
layers with no filter effects - only alpha compositing). This slow
rendering speed makes the touchscreen zooming I recently implemented
very jerky.
So, I tried multithreading SPCanvas::paintRectInternal() with OpenMP
by splitting the rectangle into 2 and rendering them in parallel.
I used mutual exclusion for some obviously thread unsafe code like the
call to markRect in SPCanvas::paintSingleBuffer(). The rendering would
work for a few frames before it freezes (waiting threads timeout and
then exit).
Then, I put other calls that I suspected were thread unsafe in
mutually exclusive blocks until I discovered _root->render() isn't
safe. No point in going further.
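The shape of that experiment can be sketched roughly as follows. This is not the Inkscape code: Rect, paint_rect and paint_split are hypothetical stand-ins, the "rendering" is a plain pixel fill, and the shared counter only plays the role of bookkeeping such as markRect that needs mutual exclusion. The two halves write to disjoint pixels, so only the shared counter is guarded. Without -fopenmp the pragmas are ignored and the halves simply run one after the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rectangle type (half-open: x0 <= x < x1, y0 <= y < y1).
struct Rect { int x0, y0, x1, y1; };

// Hypothetical stand-in for painting one rectangle: fills it with a color.
void paint_rect(std::vector<std::uint32_t>& fb, int stride, Rect r,
                std::uint32_t color)
{
    for (int y = r.y0; y < r.y1; ++y)
        for (int x = r.x0; x < r.x1; ++x)
            fb[std::size_t(y) * stride + x] = color;
}

// Splits the frame into a top and a bottom half and paints them in two
// OpenMP sections. Returns how many rectangles were marked as painted.
int paint_split(std::vector<std::uint32_t>& fb, int width, int height)
{
    assert(fb.size() == std::size_t(width) * height);
    const Rect top{0, 0, width, height / 2};
    const Rect bottom{0, height / 2, width, height};
    int painted = 0;  // shared state, must be updated under mutual exclusion

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            paint_rect(fb, width, top, 0xff0000ffu);
            #pragma omp critical
            ++painted;
        }
        #pragma omp section
        {
            paint_rect(fb, width, bottom, 0xff00ff00u);
            #pragma omp critical
            ++painted;
        }
    }
    return painted;
}
```

The sketch only works because paint_rect touches nothing shared besides its own pixels. In the real renderer the equivalent of paint_rect reaches into shared drawing state, which is where this approach broke down.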
Excuse my naive attempt. Can anyone guess how feasible it is to
multithread the rendering? For now, I don't care if it's pixel
perfect. I just need something that's decent and doesn't crash/freeze.
Yes, various people have looked at multi-threading before, but not
found a feasible way to attack it.
...
I'm also wondering why the Cairo OpenGL backend isn't being used? GPU
rendering on integrated GPUs should give a nice speedup since there
should be no copying overhead.
On Linux, the cairo library is typically shipped with its GL backend
disabled, so that presents sort of a logistical roadblock that'd need
solved.  Also, while theoretically you're right it should provide a
performance boost, it's not guaranteed.  OpenGL has been experimental in
Cairo and not as thoroughly tested as the X and other backends, so there
may well be corner cases where performance is poorer.  But no way to
know for certain except to hook it up and try it out.  A number of us
have had this task on our todo list but I don't think anyone's taken a
solid shot at it yet.
...
The other place to optimize is pixman. I did some profiling (rapidly
zooming in and out with touchscreen) and >= 25% of the time is spent
in pixman rendering. I already went ahead and ported a few to AVX2 and
got ~1.3x speedup (should get more since my laptop is bottlenecked by
memory bandwidth owing to having only 1 memory channel).
Since pixman is low level and widely used, optimizations would be very
interesting.  I don't know how widespread AVX2 is, or if the 1.3x
improvement is a large enough benefit to warrant considering it for
Pixman, though.  Regardless, I'd be interested in learning more of your
work along these paths.  Perhaps you'll discover something worth
inclusion in upstream codebases?
Thanks,
Bryce
...
Function                                                    Module                Samples
sse2_blt.part.0                                             libpixman-1-0.dll     4221
sse2_combine_in_u                                           libpixman-1-0.dll     2189
sse2_fill                                                   libpixman-1-0.dll     1693
cairo_tor_scan_converter_generate                           libcairo-2.dll        1494
sse2_composite_over_8888_8888                               libpixman-1-0.dll     1424
bits_image_fetch_separable_convolution_affine_none_a8r8g8b8 libpixman-1-0.dll     1104
feed_curve_to_cairo(_cairo*Geom::Curve const&               libinkscape_base.dll  611
fast_composite_scaled_bilinear_sse2_8888_8888_cover_SRC     libpixman-1-0.dll     475
fill_xrgb32_lerp_opaque_spans                               libcairo-2.dll        348
cairo_tor_scan_converter_add_polygon                        libcairo-2.dll        260
compute_face                                                libcairo-2.dll        238
_dynamic_cast                                               libstdc++-6.dll       209
outer_join                                                  libcairo-2.dll        179
cairo_polygon_add_edge                                      libcairo-2.dll        178
g_hash_table_lookup                                         libglib-2.0-0.dll     169
cairo_spline_decompose_into                                 libcairo-2.dll        153
g_slice_alloc                                               libglib-2.0-0.dll     138
cairo_spline_intersects                                     libcairo-2.dll        131
feed_pathvector_to_cairo(_cairo*Geom::PathVector const&)    libinkscape_base.dll  127
line_to                                                     libcairo-2.dll        119
void std::vector<Geom::Pointstd::allocatorGeom::Point       libinkscape_base.dll  116
cairo_matrix_transform_point                                libcairo-2.dll        110
cell_list_render_edge                                       libcairo-2.dll        106
g_type_check_instance_is_a                                  libgobject-2.0-0.dll  106
