Re: [Inkscape-devel] performance problems and possible remedies?

29 Sep 2016

OK, I've created a branch with the optimized Gaussian blur here:
https://code.launchpad.net/~simdgenius/inkscape/inkscape
Can anyone approve merging this into trunk?
-Yale
On Mon, Sep 26, 2016 at 5:47 AM, Yale Zhang <yzhang1985@...400...> wrote:
...
Is no one interested in faster blurs anymore? There have to be at
least some users who want faster filters, like this guy:
https://www.youtube.com/watch?v=NUqzYC_Ehtc  @20:30
I've attached one of my comic panels that's quite slow to render with
filters on, as an example.
Also, I've prepared a new version that supports dynamic dispatch
(SSE2, AVX, AVX2) and doesn't require any compile flag changes. This
was quite difficult to do - lots of trouble getting multiple versions
of a function to coexist. But now, this can be a drop-in replacement.
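
For readers who haven't seen the technique, here is a minimal sketch of runtime dispatch. It is not the code from the branch: it assumes GCC or Clang on x86, uses a made-up kernel (scaling a float buffer), and the names scale_scalar, scale_sse2, scale_avx2 and pick_scale are hypothetical. Only SSE2 and AVX2 variants are shown to keep it short.

```cpp
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

using ScaleFn = void (*)(float*, std::size_t, float);

// Portable fallback, also used when no x86 SIMD level is detected.
static void scale_scalar(float* p, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) p[i] *= k;
}

#ifdef HAVE_X86_DISPATCH
// The target attribute enables SSE2 for this one function only,
// so the file itself needs no extra compile flags.
__attribute__((target("sse2")))
static void scale_sse2(float* p, std::size_t n, float k) {
    const __m128 vk = _mm_set1_ps(k);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(p + i, _mm_mul_ps(_mm_loadu_ps(p + i), vk));
    for (; i < n; ++i) p[i] *= k;  // scalar remainder
}

__attribute__((target("avx2")))
static void scale_avx2(float* p, std::size_t n, float k) {
    const __m256 vk = _mm256_set1_ps(k);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vk));
    for (; i < n; ++i) p[i] *= k;  // scalar remainder
}
#endif

// Pick the widest variant the running CPU supports.
ScaleFn pick_scale() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return scale_avx2;
    if (__builtin_cpu_supports("sse2")) return scale_sse2;
#endif
    return scale_scalar;
}
```

A caller would typically resolve the pointer once (for example `static const ScaleFn scale = pick_scale();`) and then call `scale(buf, n, 2.0f)`, so the CPU check stays out of the hot path.
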
Are there any other functions that people want to see become faster?
For me, the 2nd biggest one is the turbulence filter.
Maybe Inkscape isn't the best place for optimized code. Lots of
projects use FIR filters, so maybe OpenCV?
There's also the Oil library, but I don't think it should go
there because filtering isn't that generic, and its Intel IPP-like
approach of providing lots of low-level functions to compose from
isn't likely to give as good a speedup as a fully custom one.
-Yale
On Tue, Sep 20, 2016 at 12:54 PM, Yale Zhang <yzhang1985@...400...> wrote:
...
Thanks for the data. I realized there have been further problems with the
benchmark on both our ends.

1. The speedups (Skylake i7-6700HQ vs Haswell 4770 in column I) aren't a
   valid comparison.
   Those numbers are almost certainly the multithreaded throughput for 4
   cores. I forgot to say that if you want to benchmark single-thread
   throughput, you need to uncomment the line // omp_set_num_threads(1)
   If I compare your numbers with mine from the 2nd sheet, the speedups
   range from 0.4x to 1.2x, average = 0.78x. This is believable since
   you're using a power-sipping 2.6 GHz CPU compared to a 3.4 GHz one for me.
   If I scale up your numbers by 3.4/2.6 (optimistic, since in a lot of
   cases, especially IIR, memory bandwidth is a bigger bottleneck than
   CPU), then the speedup becomes 1.02x.
   That's about right. Haswell and Skylake should have almost the same
   performance per clock cycle.

2. The width & height for the benchmark (but not the accuracy checking
   test) are swapped due to calling IterateCombinations() with the width &
   height swapped, but I made the same mistake, so no inconsistency.

3. My desktop memory is only dual channel. I must've gotten it mixed up
   with the desktop and servers I use at work, which are all quad
   channel.

4. Multithreaded throughput is unstable: > 10% run-to-run difference.
   Windows' anti-malware service?
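
The clock-frequency scaling above (0.78x times 3.4/2.6) can be written out as a small helper. This is only a sketch of the arithmetic; the function name clock_normalized_speedup is made up, and it assumes throughput scales linearly with clock speed, which the text already flags as optimistic for memory-bound cases.

```cpp
#include <cassert>

// Scale a measured speedup by the ratio of the two CPUs' clock speeds,
// to estimate the speedup at equal frequency (assumes linear scaling).
double clock_normalized_speedup(double measured, double fast_ghz, double slow_ghz) {
    assert(slow_ghz > 0.0);
    return measured * (fast_ghz / slow_ghz);
}
```

With the numbers from the thread, `clock_normalized_speedup(0.78, 3.4, 2.6)` comes out at about 1.02.
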

Also, I found the performance of the optimized loops where I use "goto
middle" to handle SIMD remainders while keeping code size to a minimum
(so it fits in cache or even better, the uOP cache) is fragile. I
found the Microsoft compiler reorders the basic blocks, resulting in a
loop that does 2 branches/iteration instead of 1, dropping the
performance by almost 2x. Luckily, GCC 6 doesn't do that, which is why
I used it for the benchmark. If anyone knows how to discourage the
compiler from reordering code like that, I'd like to know.
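
The loops themselves aren't shown in this thread, so the following is only one plausible shape of the "goto middle" idea, with plain arrays standing in for SIMD registers. The function name scale_goto_middle, the block width W and the scaling kernel are all hypothetical. The point is that the remainder does its own partial load and then jumps into the middle of the loop, so there is a single copy of the kernel and one loop branch per full block.

```cpp
#include <cstddef>
#include <cstring>

// Multiply n floats by k in blocks of W, reusing the loop body for the tail.
void scale_goto_middle(float* data, std::size_t n, float k) {
    constexpr std::size_t W = 4;             // stand-in for the SIMD width
    const std::size_t rem = n % W;
    float v[W];                              // stand-in for a vector register
    float tmp[W];                            // scratch the remainder is stored into
    float* p = data;
    float* end = data + (n - rem);           // end of the full blocks
    bool tailPending = (rem != 0);

    if (p == end) goto tail;                 // fewer than W elements
    for (;;) {
        std::memcpy(v, p, sizeof v);         // full "vector load"
middle:
        for (std::size_t j = 0; j < W; ++j)  // the only copy of the kernel
            v[j] *= k;
        std::memcpy(p, v, sizeof v);         // full "vector store"
        p += W;
        if (p != end) continue;              // the single loop branch
tail:
        if (!tailPending) break;
        tailPending = false;
        std::memset(v, 0, sizeof v);         // partial load, zero padded
        std::memcpy(v, data + (n - rem), rem * sizeof(float));
        p = tmp;                             // the store now goes to scratch
        end = tmp + W;                       // so the loop exits after one block
        goto middle;                         // skip the full load
    }
    if (rem != 0)                            // copy back only the valid lanes
        std::memcpy(data + (n - rem), tmp, rem * sizeof(float));
}
```

Whether the compiler keeps this layout is exactly the fragile part: if it moves the tail block or duplicates the kernel, the hot loop can end up with an extra branch per iteration.
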
I just tried gcc 6.1.0, but saw no noticeable difference.
"can I make it print also the reference speed?"
yes, set useRefCode in BenchmarkFunction() to true
-Yale
On Tue, Sep 20, 2016 at 4:00 AM, Alexander Brock
<a.brock@...2965...> wrote:
...
On 09/20/2016 03:33 AM, Yale Zhang wrote:
...
"'error loading'. I think I need the file"
Right, that's a panel from my comic. I'd rather not share it, but you
can just use any 8-bit color PNG. The resolution I used for the
benchmark is 1467x1373.
I made a file with this resolution and ran the test. The program output
is attached in the file log1.
real    0m7.513s
user    0m57.092s
sys     0m0.284s
...
May I ask how you're using blurs in your work?
I use only Gaussian blur and I use it very rarely.
I also ran the benchmark code, but it only prints the speed of the vectorized
version; can I make it print also the reference speed? I attached my
results and my memory configuration.
Best Regards,
Alexander
