Glad to finally hear from you all. I've submitted a merge proposal.
"What's SimpleImage.h for?"
It's to allow 2D indexing with image[y][x] instead of image[y * width + x].
The intention is to use a minimal representation (pointer and stride)
that's completely reusable. Functions that process images will only
use the minimal representation while functions that create images will
use a more specific class that handles memory allocation.
Since templates make things generic anyways, it could be changed to
directly use a specific image class if Inkscape already has one. My
concern is that the [] operator should use a pointer sized integer
instead of int or else the compiler will generate redundant sign or
zero extend instructions:
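
A minimal sketch of that split between a reusable view and an owning class. The names ImageView, OwnedImage and FillGradient are hypothetical and not taken from SimpleImage.h; the stride is assumed to be counted in elements rather than bytes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: just a pointer and a row stride (in elements).
template <typename T>
struct ImageView {
    T* data;
    std::ptrdiff_t stride;

    // A pointer-sized index can be used directly in the address computation,
    // so no sign or zero extension from a 32-bit int is needed.
    T* operator[](std::ptrdiff_t y) const { return data + y * stride; }
};

// Hypothetical owning class, used only where images are created.
template <typename T>
struct OwnedImage {
    std::ptrdiff_t width, height;
    std::vector<T> pixels;

    OwnedImage(std::ptrdiff_t w, std::ptrdiff_t h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w * h)) {}

    ImageView<T> view() { return ImageView<T>{pixels.data(), width}; }
};

// A processing function only sees the minimal representation.
template <typename T>
void FillGradient(ImageView<T> image, std::ptrdiff_t width, std::ptrdiff_t height) {
    for (std::ptrdiff_t y = 0; y < height; ++y)
        for (std::ptrdiff_t x = 0; x < width; ++x)
            image[y][x] = static_cast<T>(y * width + x);  // 2D indexing
}
```

With this layout, image[y][x] and image.data[y * stride + x] refer to the same element, and swapping in an existing image class would only mean changing what produces the view.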
On Thu, Sep 29, 2016 at 8:12 AM, Jasper van de Gronde
What's SimpleImage.h for? At first glance it seems a little
have "yet another" image representation just for computing blurs. Great
On 09/29/2016 03:25 PM, Yale Zhang wrote:
> OK, I've created a branch with the optimized gaussian blur here:
> Can anyone approve merging this into trunk?
> On Mon, Sep 26, 2016 at 5:47 AM, Yale Zhang <yzhang1985@...400...> wrote:
>> Is no one interested in faster blurs any more? There have to be at
>> least some users who want faster filters like this guy:
>> I've attached one of my comic panels that's quite slow to render with
>> filters on, as an example.
>> Also, I've prepared a new version that supports dynamic dispatch
>> (SSE2, AVX, AVX2) and doesn't require any compile flag changes. This
>> was quite difficult to do - lots of trouble getting multiple versions
>> of a function to co-exist. But now, this can be a drop in replacement.
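
For readers unfamiliar with dynamic dispatch, here is a rough sketch of the general idea, not the code from the branch. It assumes GCC or Clang on x86, where the __builtin_cpu_supports builtin is available. BlurRowFn, SelectBlurRow and the BlurRow* functions are hypothetical names, and the SSE2/AVX/AVX2 variants are scalar stand-ins for what would really be separately compiled SIMD kernels.

```cpp
#include <cstddef>

// Hypothetical kernel signature: filter one row of n floats.
using BlurRowFn = void (*)(const float* in, float* out, std::size_t n);

// Reference kernel: 3-tap [1 2 1]/4 filter with clamped edges.
static void BlurRowScalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float left  = in[i == 0 ? 0 : i - 1];
        float right = in[i + 1 == n ? i : i + 1];
        out[i] = (left + 2.0f * in[i] + right) * 0.25f;
    }
}

// Stand-ins only: real versions would be built with the matching
// instruction set enabled and would use intrinsics.
static void BlurRowSSE2(const float* in, float* out, std::size_t n) { BlurRowScalar(in, out, n); }
static void BlurRowAVX(const float* in, float* out, std::size_t n)  { BlurRowScalar(in, out, n); }
static void BlurRowAVX2(const float* in, float* out, std::size_t n) { BlurRowScalar(in, out, n); }

// Pick the best available variant once, at run time.
BlurRowFn SelectBlurRow() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return BlurRowAVX2;
    if (__builtin_cpu_supports("avx"))  return BlurRowAVX;
    if (__builtin_cpu_supports("sse2")) return BlurRowSSE2;
#endif
    return BlurRowScalar;  // portable fallback
}
```

A caller would store the result of SelectBlurRow() once and invoke it through the pointer, so no compile flags need to change for the program as a whole.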
>> Are there any other functions that people want to see become faster?
>> For me, the 2nd biggest one is the turbulence filter.
>> Maybe Inkscape isn't the best place for optimized code. Lots of
>> projects use FIR filters, so maybe OpenCV?
>> There's also the Oil library, but I don't think it should go
>> there because filtering isn't that generic and its Intel IPP like
>> approach of providing lots of low level functions to compose from
>> isn't likely to have as good speedup as a fully custom one.
>> On Tue, Sep 20, 2016 at 12:54 PM, Yale Zhang <yzhang1985@...400...> wrote:
>>> Thanks for the data. I realized there's been further problems with the
>>> benchmark on both our ends.
>>> 1. the speedups (Skylake i6700HQ vs Haswell 4770 in column I) aren't a
>>> valid comparison
>>> Those numbers are almost certainly the multithreaded throughput for 4
>>> I forgot to say if you want to benchmark single thread throughput,
>>> you need to uncomment the line, // omp_set_num_threads(1)
>>> If I compare your numbers with mine from the 2nd sheet, the speed
>>> ups range from 0.4x to 1.2x, average = 0.78x. This is believable since
>>> you're using a power sipping 2.6 GHz CPU compared to a 3.4 GHz for me.
>>> If I scale up your numbers by 3.4/2.6 (optimistic, since in a lot of
>>> cases, especially IIR, memory bandwidth is a bigger bottleneck than
>>> CPU), then the speedup becomes 1.02x.
>>> That's about right. Haswell and Skylake should have almost the same
>>> performance per clock cycle.
>>> 2. the width & height for the benchmark (but not the accuracy checking
>>> test) is swapped due to calling IterateCombinations() with the width &
>>> height swapped
>>> but I made the same mistake, so no inconsistency.
>>> 3. my desktop memory is only dual channel. Must've gotten it mixed up
>>> with the desktop and servers I use at work, which are all quad channel
>>> 4. multithreaded throughput is unstable: more than 10% run to run difference
>>> Windows' anti-malware service?
>>> Also, I found the performance of the optimized loops where I use "goto
>>> middle" to handle SIMD remainders while keeping code size to a minimum
>>> (so it fits in cache or even better, the uOP cache) is fragile. I
>>> found the Microsoft compiler reorders the basic blocks, resulting in a
>>> loop that does 2 branches/iteration instead of 1, dropping the
>>> performance by almost 2x. Luckily, GCC 6 doesn't do that, which is why
>>> I used it for the benchmark. If anyone knows how to discourage the
>>> compiler from reordering code like that, I'd like to know.
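
One possible reading of the "goto middle" pattern, sketched with plain scalar code standing in for SIMD steps. This is an assumption about the structure, not the code from the branch: the loop body holds two 4-wide steps, and when the number of steps is odd the loop is entered at the second one, so the remainder is handled without a second copy of the body. SumGotoMiddle is a hypothetical name.

```cpp
#include <cstddef>

// Sums n floats using a loop body of two 4-wide steps (stand-ins for SIMD).
float SumGotoMiddle(const float* data, std::size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t steps = n / 4;             // number of full 4-wide steps
    const float* p = data;
    const float* end = data + steps * 4;

    if (steps == 0) goto tail;
    if (steps & 1) goto middle;            // odd count: skip the first half once
    do {
        for (int k = 0; k < 4; ++k) acc[k] += p[k];
        p += 4;
    middle:
        for (int k = 0; k < 4; ++k) acc[k] += p[k];
        p += 4;
    } while (p != end);                    // ideally the only branch per iteration

tail:
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (std::size_t i = steps * 4; i < n; ++i) sum += data[i];  // scalar leftovers
    return sum;
}
```

If the compiler moves the block after the label out of the loop and jumps back into it, each iteration pays for an extra branch, which matches the kind of slowdown described above.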
>>> I just tried gcc 6.1.0, but saw no noticeable difference.
>>> "can I make it print also the reference speed?"
>>> yes, set useRefCode in BenchmarkFunction() to true
>>> On Tue, Sep 20, 2016 at 4:00 AM, Alexander Brock
>>> <a.brock@...2965...> wrote:
>>>> On 09/20/2016 03:33 AM, Yale Zhang wrote:
>>>>> " "error loading". I think I need the file"
>>>>> Right, that's a panel from my comic. I'd rather not share it, but you
>>>>> can just use any 8-bit color PNG. The resolution I used for the
>>>>> benchmark is 1467x1373.
>>>> I made a file with this resolution and ran the test. The program output
>>>> is attached in the file log1
>>>> real 0m7.513s
>>>> user 0m57.092s
>>>> sys 0m0.284s
>>>>> May I ask how you're using blurs in your work?
>>>> I use only Gaussian blur and I use it very rarely.
>>>> I also ran the benchmark code but it only prints speed of the vectorized
>>>> version, can I make it print also the reference speed? I attached my
>>>> results and my memory configuration.
>>>> Best Regards,
>>>> Inkscape-devel mailing list