What's SimpleImage.h for? At first glance it seems a little strange to have "yet another" image representation just for computing blurs. Great work though!
On 09/29/2016 03:25 PM, Yale Zhang wrote:
OK, I've created a branch with the optimized gaussian blur here:
https://code.launchpad.net/~simdgenius/inkscape/inkscape
Can anyone approve merging this into trunk? -Yale
On Mon, Sep 26, 2016 at 5:47 AM, Yale Zhang <yzhang1985@...400...> wrote:
Is no one interested in faster blurs any more? There have to be at least some users who want faster filters, like this guy:
https://www.youtube.com/watch?v=NUqzYC_Ehtc @20:30
I've attached one of my comic panels that's quite slow to render with filters on, as an example.
Also, I've prepared a new version that supports dynamic dispatch (SSE2, AVX, AVX2) and doesn't require any compile flag changes. This was quite difficult to do - lots of trouble getting multiple versions of a function to co-exist. But now, this can be a drop-in replacement.
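For readers unfamiliar with the idea, here is a minimal sketch of runtime dispatch between several builds of one function. It is not the code from the branch: the names blur_row, blur_row_body, select_blur_row and the 3-tap kernel are made up for illustration, and it assumes GCC or Clang on x86 (the target attribute and __builtin_cpu_supports), falling back to generic code elsewhere.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#define TARGET(isa) __attribute__((target(isa)))
#else
#define HAVE_X86_DISPATCH 0
#endif

using BlurRowFn = void (*)(const float* src, float* dst, std::size_t n);

// Hypothetical kernel: 3-tap blur (0.25, 0.5, 0.25) with clamped edges.
static inline void blur_row_body(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float left  = src[i == 0 ? 0 : i - 1];
        const float right = src[i + 1 == n ? i : i + 1];
        dst[i] = 0.25f * left + 0.5f * src[i] + 0.25f * right;
    }
}

static void blur_row_generic(const float* s, float* d, std::size_t n) { blur_row_body(s, d, n); }

#if HAVE_X86_DISPATCH
// One wrapper per instruction set. The attribute allows the compiler to use
// that ISA inside the wrapper; a real implementation would use intrinsics.
TARGET("sse2") static void blur_row_sse2(const float* s, float* d, std::size_t n) { blur_row_body(s, d, n); }
TARGET("avx")  static void blur_row_avx (const float* s, float* d, std::size_t n) { blur_row_body(s, d, n); }
TARGET("avx2") static void blur_row_avx2(const float* s, float* d, std::size_t n) { blur_row_body(s, d, n); }
#endif

// Pick the widest variant the running CPU supports.
static BlurRowFn select_blur_row() {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return blur_row_avx2;
    if (__builtin_cpu_supports("avx"))  return blur_row_avx;
    if (__builtin_cpu_supports("sse2")) return blur_row_sse2;
#endif
    return blur_row_generic;
}

// Public entry point: the choice is made once, on the first call.
void blur_row(const float* src, float* dst, std::size_t n) {
    assert(src != dst);  // the kernel reads neighbours, so in-place is not supported
    static const BlurRowFn impl = select_blur_row();
    impl(src, dst, n);
}
```

Callers only ever see blur_row(), so the build needs no -mavx2 style flags and the same binary still runs on CPUs without AVX.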
Are there any other functions that people want to see become faster? For me, the 2nd biggest one is the turbulence filter.
Maybe Inkscape isn't the best place for optimized code. Lots of projects use FIR filters, so maybe OpenCV? There's also the Oil library, but I don't think it should go there, because filtering isn't that generic, and its Intel IPP-like approach of providing lots of low-level functions to compose from isn't likely to give as good a speedup as a fully custom implementation.
-Yale
On Tue, Sep 20, 2016 at 12:54 PM, Yale Zhang <yzhang1985@...400...> wrote:
Thanks for the data. I realized there have been further problems with the benchmark on both our ends.
- the speedups (Skylake i6700HQ vs Haswell 4770 in column I) aren't a valid comparison
Those numbers are almost certainly the multithreaded throughput for 4 cores. I forgot to say that if you want to benchmark single-thread throughput, you need to uncomment the line, // omp_set_num_threads(1)
If I compare your numbers with mine from the 2nd sheet, the speedups range from 0.4x to 1.2x, average = 0.78x. This is believable since you're using a power-sipping 2.6 GHz CPU compared to 3.4 GHz for me. If I scale up your numbers by 3.4/2.6 (optimistic, since in a lot of cases, especially IIR, memory bandwidth is a bigger bottleneck than CPU), then the average speedup becomes 1.02x.
That's about right. Haswell and Skylake should have almost the same performance per clock cycle.
- the width & height for the benchmark (but not the accuracy checking test) are swapped due to calling IterateCombinations() with the width & height swapped, but I made the same mistake, so there's no inconsistency.
- my desktop memory is only dual channel. Must've gotten it mixed up with the desktop and servers I use at work, which are all quad channel.
multithreaded throughput unstable - > 10% run to run difference
Windows' anti-malware service?
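To make the first point concrete, here is a small sketch of the two adjustments: forcing a single OpenMP thread before timing, and scaling a measured speedup by the clock ratio. The helpers seconds_for and clock_normalized_speedup are hypothetical and not part of the benchmark program; only omp_set_num_threads() is a real OpenMP call, and it is compiled in only when OpenMP is enabled.

```cpp
#include <cassert>
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: time one call of `work`, optionally limited to one
// OpenMP thread. Note the thread limit stays in effect for later parallel regions.
template <class Work>
double seconds_for(Work&& work, bool single_thread) {
#ifdef _OPENMP
    if (single_thread) omp_set_num_threads(1);
#else
    (void)single_thread;  // no OpenMP: the code is single-threaded anyway
#endif
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: scale a measured speedup by the ratio of clock speeds.
// This is optimistic, since memory-bound code does not scale with the clock.
double clock_normalized_speedup(double measured, double their_ghz, double my_ghz) {
    assert(their_ghz > 0.0);
    return measured * my_ghz / their_ghz;
}
```

With the numbers from this thread, clock_normalized_speedup(0.78, 2.6, 3.4) gives about 1.02.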
Also, I found the performance of the optimized loops where I use "goto middle" to handle SIMD remainders while keeping code size to a minimum (so it fits in cache or even better, the uOP cache) is fragile. I found the Microsoft compiler reorders the basic blocks, resulting in a loop that does 2 branches/iteration instead of 1, dropping the performance by almost 2x. Luckily, GCC 6 doesn't do that, which is why I used it for the benchmark. If anyone knows how to discourage the compiler from reordering code like that, I'd like to know.
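For anyone who hasn't seen the pattern, below is one plausible shape of a loop that jumps into its own body to handle the remainder. This is a guess at the general idea, not the code from the benchmark: it uses a plain array of 4 floats as a stand-in for a SIMD register, and the function name scale_chunks is made up.

```cpp
#include <cassert>
#include <cstddef>

// Multiply n floats by k in chunks of W. The partial chunk is loaded outside
// the loop and then jumps to the shared compute/store code, so that code
// exists only once.
void scale_chunks(const float* src, float* dst, std::size_t n, float k) {
    constexpr std::size_t W = 4;  // stand-in for the SIMD width
    float v[W];                   // stand-in for a SIMD register
    std::size_t count = 0;        // number of valid lanes in v
    std::size_t i = 0;
    const std::size_t rem = n % W;

    assert(n == 0 || (src != nullptr && dst != nullptr));
    if (n == 0) return;

    if (rem != 0) {
        // Partial load for the first chunk; unused lanes are padded with 0.
        count = rem;
        for (std::size_t j = 0; j < W; ++j) v[j] = (j < rem) ? src[j] : 0.0f;
        goto middle;
    }
    do {
        // Full load.
        count = W;
        for (std::size_t j = 0; j < W; ++j) v[j] = src[i + j];
    middle:
        // Shared by the partial and the full path.
        for (std::size_t j = 0; j < W; ++j) v[j] *= k;
        for (std::size_t j = 0; j < count; ++j) dst[i + j] = v[j];
        i += count;
    } while (i < n);
}
```

As written, the loop has a single backward branch per iteration. If the compiler moves the blocks around the label, it can end up needing an extra jump per iteration, which matches the slowdown described above.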
I just tried gcc 6.1.0, but saw no noticeable difference.
"can I make it print also the reference speed?" yes, set useRefCode in BenchmarkFunction() to true
-Yale
On Tue, Sep 20, 2016 at 4:00 AM, Alexander Brock <a.brock@...2965...> wrote:
On 09/20/2016 03:33 AM, Yale Zhang wrote:
" "error loading". I think I need the file" Right, that's a panel from my comic. I'd rather not share it, but you can just use any 8bit color PNG. The resolution I used for the benchmark is 1467x1373.
I made a file with this resolution and ran the test. The program output is attached in the file log1.
real 0m7.513s
user 0m57.092s
sys 0m0.284s
May I ask how you're using blurs in your work?
I use only Gaussian blur and I use it very rarely.
I also ran the benchmark code, but it only prints the speed of the vectorized version; can I make it print also the reference speed? I attached my results and my memory configuration.
Best Regards, Alexander
Inkscape-devel mailing list Inkscape-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/inkscape-devel