Multithreaded blur using OpenMP
Using GCC's support of OpenMP (now also in MingW!) I had a stab at making the outer loop of blurring parallel (the loops are typical DSP loops and shouldn't even throw an exception, so it should be perfectly safe). On my dual-core system blurring (a large image) is approximately twice as fast, so it seems to scale really well.
As I don't have access to a Linux system at the moment I haven't modified the Makefiles to add support, but it shouldn't be too difficult (see the changes to build.xml). Also, I've made it write a log to be able to analyze the performance, but it uses Win32 specific functions for timing. So, if anyone would like to have a go at making it work on Linux and/or testing it that would be great :)
The number of threads defaults to whatever omp_get_num_procs() returns, but on my system this doesn't work... So to use it, set the number of threads in the preferences, using: <group id="threading" numthreads="1" /> in the "options" group (so it's /options/threading/numthreads).
Obviously this is not the most efficient way to do this, as new threads are created for each blur. But so far the overhead doesn't appear to be too bad, compared with the actual blurring operation. And using OpenMP the impact on the code is minimal, I only had to put two #pragma's before the blur loops, allocate the temporary data it needs for all threads and add some preference for the number of threads. So if this works out it might provide a relatively easy way to start making use of all these dual- and quad-core machines people have lying around doing almost nothing these days :)
Patch attached, if wanted I can also provide a binary for Windows. This version writes a log file (blurlog.txt) that lists how long each blur takes.
Note that inkscape also needs mingwm10.dll to run with this patch. As it isn't included with devlibs at the moment you'll have to copy it to Inkscape's directory manually (it's in the MingW bin directory).
Index: build.xml =================================================================== --- build.xml (revision 20219) +++ build.xml (working copy) @@ -375,6 +375,7 @@ -Wall -Wformat -Werror=format-security -W -Wpointer-arith -Wcast-align -Wsign-compare -Woverloaded-virtual -Wswitch -O2 -mms-bitfields + -fopenmp </flags> <defines> -DVERSION="${version}" @@ -476,6 +477,7 @@ objcopycommand="${archutil}objcopy"> <flags> -mwindows + -mthreads </flags> <fileset dir="${build}"> <include name="inkres.o"/> @@ -501,6 +503,7 @@ -lpng -ljpeg.dll -ltiff.dll -lpopt ${devlibs}/lib/zdll.lib -lgc -lws2_32 -lintl -lgdi32 -lcomdlg32 -lm + -lgomp -lpthreadGC2 </libs> </link> </target> @@ -530,6 +533,7 @@ objcopycommand="${archutil}objcopy"> <flags> -mwindows + -mthreads </flags> <fileset dir="${build}"> <include name="inkviewres.o"/> @@ -553,6 +557,7 @@ -lpng -ljpeg.dll -ltiff.dll -lpopt ${devlibs}/lib/zdll.lib -lgc -lws2_32 -lintl -lgdi32 -lcomdlg32 -lm + -lgomp -lpthreadGC2 </libs> </link> </target> @@ -572,6 +577,7 @@ stripcommand="${archutil}strip" objcopycommand="${archutil}objcopy"> <flags> + -mthreads </flags> <fileset dir="${build}"> <include name="obj/test-main.o"/> @@ -602,6 +608,7 @@ -lpng -ljpeg.dll -ltiff.dll -lpopt ${devlibs}/lib/zdll.lib -lgc -lws2_32 -lintl -lgdi32 -lcomdlg32 -lm + -lgomp -lpthreadGC2 </libs> </link> </target> @@ -667,6 +674,7 @@ <copy todir="${dist}" file="${devlibs}/bin/libpopt-0.dll"/> <copy todir="${dist}" file="${devlibs}/bin/liblcms-1.dll"/> <copy todir="${dist}" file="${devlibs}/bin/intl.dll"/> + <copy todir="${dist}" file="${devlibs}/bin/pthreadGC2.dll"/> <copy file="${devlibs}/bin/intl.dll" tofile="${dist}/libintl-2.dll"/>
<!-- MSGFMT files --> Index: src/display/nr-filter-gaussian.cpp =================================================================== --- src/display/nr-filter-gaussian.cpp (revision 20219) +++ src/display/nr-filter-gaussian.cpp (working copy) @@ -16,9 +16,11 @@ #include <algorithm> #include <cmath> #include <complex> +#include <cstdlib> #include <glib.h> -#include <cstdlib> #include <limits> +#include <omp.h> +#include <windows.h> // For performance logging
#include "2geom/isnan.h"
@@ -268,9 +270,11 @@ filter2D_IIR(PT *const dest, int const dstr1, int const dstr2, PT const *const src, int const sstr1, int const sstr2, int const n1, int const n2, IIRValue const b[N+1], double const M[N*N], - IIRValue *const tmpdata) + IIRValue *const tmpdata[], int const num_threads) { +#pragma omp parallel for num_threads(num_threads) for ( int c2 = 0 ; c2 < n2 ; c2++ ) { + unsigned int tid = omp_get_thread_num(); // corresponding line in the source and output buffer PT const * srcimg = src + c2*sstr2; PT * dstimg = dest + c2*dstr2 + n1*dstr1; @@ -288,7 +292,7 @@ for(unsigned int i=1; i<N+1; i++) { for(unsigned int c=0; c<PC; c++) u[0][c] += u[i][c]*b[i]; } - copy_n(u[0], PC, tmpdata+c1*PC); + copy_n(u[0], PC, tmpdata[tid]+c1*PC); } // Backward pass IIRValue v[N+1][PC]; @@ -303,7 +307,7 @@ int c1=n1-1; while(c1-->0) { for(unsigned int i=N; i>0; i--) copy_n(v[i-1], PC, v[i]); - copy_n(tmpdata+c1*PC, PC, v[0]); + copy_n(tmpdata[tid]+c1*PC, PC, v[0]); for(unsigned int c=0; c<PC; c++) v[0][c] *= b[0]; for(unsigned int i=1; i<N+1; i++) { for(unsigned int c=0; c<PC; c++) v[0][c] += v[i][c]*b[i]; @@ -326,11 +330,12 @@ static void filter2D_FIR(PT *const dst, int const dstr1, int const dstr2, PT const *const src, int const sstr1, int const sstr2, - int const n1, int const n2, FIRValue const *const kernel, int const scr_len) + int const n1, int const n2, FIRValue const *const kernel, int const scr_len, int const num_threads) { // Past pixels seen (to enable in-place operation) PT history[scr_len+1][PC];
+#pragma omp parallel for num_threads(num_threads) private(history) for ( int c2 = 0 ; c2 < n2 ; c2++ ) {
// corresponding line in the source buffer @@ -539,13 +544,14 @@ }
// Some common constants + Inkscape::Preferences *prefs = Inkscape::Preferences::get(); int const width_org = in->area.x1-in->area.x0, height_org = in->area.y1-in->area.y0; double const deviation_x_org = _deviation_x * NR::expansionX(trans); double const deviation_y_org = _deviation_y * NR::expansionY(trans); int const PC = NR_PIXBLOCK_BPP(in); + int const NTHREADS = std::max(1,std::min(8,prefs->getInt("/options/threading/numthreads",omp_get_num_procs())));
// Subsampling constants - Inkscape::Preferences *prefs = Inkscape::Preferences::get(); int const quality = prefs->getInt("/options/blurquality/value"); int const x_step_l2 = _effect_subsample_step_log2(deviation_x_org, quality); int const y_step_l2 = _effect_subsample_step_log2(deviation_y_org, quality); @@ -567,6 +573,11 @@ bool const use_IIR_x = deviation_x > 3; bool const use_IIR_y = deviation_y > 3;
+ // Temporary performance logging + LARGE_INTEGER startTime, endTime, timeFrequency; + QueryPerformanceFrequency(&timeFrequency); + QueryPerformanceCounter(&startTime); + // new buffer for the subsampled output NRPixBlock *out = new NRPixBlock; nr_pixblock_setup_fast(out, in->mode, in->area.x0/x_step, in->area.y0/y_step, @@ -577,13 +588,19 @@ } // Temporary storage for IIR filter // NOTE: This can be eliminated, but it reduces the precision a bit - IIRValue * tmpdata = 0; + IIRValue * tmpdata[NTHREADS]; + std::fill_n(tmpdata, NTHREADS, (IIRValue*)0); if ( use_IIR_x || use_IIR_y ) { - tmpdata = new IIRValue[std::max(width,height)*PC]; - if (tmpdata == NULL) { - nr_pixblock_release(out); - delete out; - return 0; + for(int i=0; i<NTHREADS; i++) { + tmpdata[i] = new IIRValue[std::max(width,height)*PC]; + if (tmpdata[i] == NULL) { + nr_pixblock_release(out); + while(i-->0) { + delete[] tmpdata[i]; + } + delete out; + return 0; + } } } NRPixBlock *ssin = in; @@ -629,16 +646,16 @@ // Filter (x) switch(in->mode) { case NR_PIXBLOCK_MODE_A8: ///< Grayscale - filter2D_IIR<unsigned char,1,false>(NR_PIXBLOCK_PX(out), 1, out->rs, NR_PIXBLOCK_PX(ssin), 1, ssin->rs, width, height, b, M, tmpdata); + filter2D_IIR<unsigned char,1,false>(NR_PIXBLOCK_PX(out), 1, out->rs, NR_PIXBLOCK_PX(ssin), 1, ssin->rs, width, height, b, M, tmpdata, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8: ///< 8 bit RGB - filter2D_IIR<unsigned char,3,false>(NR_PIXBLOCK_PX(out), 3, out->rs, NR_PIXBLOCK_PX(ssin), 3, ssin->rs, width, height, b, M, tmpdata); + filter2D_IIR<unsigned char,3,false>(NR_PIXBLOCK_PX(out), 3, out->rs, NR_PIXBLOCK_PX(ssin), 3, ssin->rs, width, height, b, M, tmpdata, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA - filter2D_IIR<unsigned char,4,false>(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, b, M, tmpdata); + filter2D_IIR<unsigned char,4,false>(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, b, M, tmpdata, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8P: ///< Premultiplied 8 bit RGBA - filter2D_IIR<unsigned char,4,true >(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, b, M, tmpdata); + filter2D_IIR<unsigned char,4,true >(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, b, M, tmpdata, NTHREADS); break; default: assert(false); @@ -651,16 +668,16 @@ // Filter (x) switch(in->mode) { case NR_PIXBLOCK_MODE_A8: ///< Grayscale - filter2D_FIR<unsigned char,1>(NR_PIXBLOCK_PX(out), 1, out->rs, NR_PIXBLOCK_PX(ssin), 1, ssin->rs, width, height, kernel, scr_len_x); + filter2D_FIR<unsigned char,1>(NR_PIXBLOCK_PX(out), 1, out->rs, NR_PIXBLOCK_PX(ssin), 1, ssin->rs, width, height, kernel, scr_len_x, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8: ///< 8 bit RGB - filter2D_FIR<unsigned char,3>(NR_PIXBLOCK_PX(out), 3, out->rs, NR_PIXBLOCK_PX(ssin), 3, ssin->rs, width, height, kernel, scr_len_x); + filter2D_FIR<unsigned char,3>(NR_PIXBLOCK_PX(out), 3, out->rs, NR_PIXBLOCK_PX(ssin), 3, ssin->rs, width, height, kernel, scr_len_x, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA - filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, kernel, scr_len_x); + filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, kernel, scr_len_x, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8P: ///< Premultiplied 8 bit RGBA - filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, kernel, scr_len_x); + filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), 4, out->rs, NR_PIXBLOCK_PX(ssin), 4, ssin->rs, width, height, kernel, scr_len_x, NTHREADS); break; default: assert(false); @@ -688,16 +705,16 @@ // Filter (y) switch(in->mode) { case NR_PIXBLOCK_MODE_A8: ///< Grayscale - filter2D_IIR<unsigned char,1,false>(NR_PIXBLOCK_PX(out), out->rs, 1, NR_PIXBLOCK_PX(out), out->rs, 1, height, width, b, M, tmpdata); + filter2D_IIR<unsigned char,1,false>(NR_PIXBLOCK_PX(out), out->rs, 1, NR_PIXBLOCK_PX(out), out->rs, 1, height, width, b, M, tmpdata, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8: ///< 8 bit RGB - filter2D_IIR<unsigned char,3,false>(NR_PIXBLOCK_PX(out), out->rs, 3, NR_PIXBLOCK_PX(out), out->rs, 3, height, width, b, M, tmpdata); + filter2D_IIR<unsigned char,3,false>(NR_PIXBLOCK_PX(out), out->rs, 3, NR_PIXBLOCK_PX(out), out->rs, 3, height, width, b, M, tmpdata, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA - filter2D_IIR<unsigned char,4,false>(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, b, M, tmpdata); + filter2D_IIR<unsigned char,4,false>(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, b, M, tmpdata, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8P: ///< Premultiplied 8 bit RGBA - filter2D_IIR<unsigned char,4,true >(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, b, M, tmpdata); + filter2D_IIR<unsigned char,4,true >(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, b, M, tmpdata, NTHREADS); break; default: assert(false); @@ -710,24 +727,32 @@ // Filter (y) switch(in->mode) { case NR_PIXBLOCK_MODE_A8: ///< Grayscale - filter2D_FIR<unsigned char,1>(NR_PIXBLOCK_PX(out), out->rs, 1, NR_PIXBLOCK_PX(out), out->rs, 1, height, width, kernel, scr_len_y); + filter2D_FIR<unsigned char,1>(NR_PIXBLOCK_PX(out), out->rs, 1, NR_PIXBLOCK_PX(out), out->rs, 1, height, width, kernel, scr_len_y, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8: ///< 8 bit RGB - filter2D_FIR<unsigned char,3>(NR_PIXBLOCK_PX(out), out->rs, 3, NR_PIXBLOCK_PX(out), out->rs, 3, height, width, kernel, scr_len_y); + filter2D_FIR<unsigned char,3>(NR_PIXBLOCK_PX(out), out->rs, 3, NR_PIXBLOCK_PX(out), out->rs, 3, height, width, kernel, scr_len_y, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8N: ///< Normal 8 bit RGBA - filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, kernel, scr_len_y); + filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, kernel, scr_len_y, NTHREADS); break; case NR_PIXBLOCK_MODE_R8G8B8A8P: ///< Premultiplied 8 bit RGBA - filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, kernel, scr_len_y); + filter2D_FIR<unsigned char,4>(NR_PIXBLOCK_PX(out), out->rs, 4, NR_PIXBLOCK_PX(out), out->rs, 4, height, width, kernel, scr_len_y, NTHREADS); break; default: assert(false); }; }
- delete[] tmpdata; // deleting a nullptr has no effect, so this is save + for(int i=0; i<NTHREADS; i++) { + delete[] tmpdata[i]; // deleting a nullptr has no effect, so this is safe + }
+ // Temporary performance logging + QueryPerformanceCounter(&endTime); + FILE* logfile = fopen("blurlog.txt", "at"); + fprintf(logfile, "image size: %dx%d, threads: %d, time: %.3gs\n", width, height, NTHREADS, static_cast<double>(endTime.QuadPart-startTime.QuadPart)/timeFrequency.QuadPart); + fclose(logfile); + if ( !resampling ) { // No upsampling needed out->empty = FALSE;
This looks reasonable for the most part. As long as you're sticking to manipulating data in the buffers and not manipulating the object model from the OMP threads, you shouldn't hit any of the threading problems that I would ordinarily be concerned about.
I'm not sure about having a minimum of 8 threads, though -- on a system with many less CPUs than that you're just losing by adding more threads. For a wholly CPU-bound task like this, the number of threads to use should be the number of CPUs.
-mental
MenTaLguY wrote:
This looks reasonable for the most part. As long as you're sticking to manipulating data in the buffers and not manipulating the object model from the OMP threads, you shouldn't hit any of the threading problems that I would ordinarily be concerned about.
Exactly, that's the idea.
I'm not sure about having a minimum of 8 threads, though
It's not a minimum, it's a maximum :) min(8,x) returns the minimum of 8 and x, so 8 is the largest value it will return.
-- on a system with many less CPUs than that you're just losing by adding more threads. For a wholly CPU-bound task like this, the number of threads to use should be the number of CPUs.
I agree completely, that's why I tried using the default of omp_get_num_procs(). As this doesn't work properly on my system (as far as I can tell) I recommend setting it through the preferences though. However, hopefully this will work better with future versions of GCC (perhaps it will even work on Linux, as GCC has had OpenMP support for a bit longer there).
I adapted configure.ac and properly compiled it. I am sure that my configure.ac is wrong (flags placed in wrong places and probably not proper behaviour when one does not have openmp lid-devel installed), so, somebedy please fix it. It at least compiled here in my machine.
I tested it and it worked ok, but I have no idea how to benchmark it. I removed the logging code that was windows-only.
here is my version of the patch, I hope it helps.
On Tue, Nov 18, 2008 at 2:45 PM, Jasper van de Gronde < th.v.d.gronde@...528...> wrote:
MenTaLguY wrote:
This looks reasonable for the most part. As long as you're sticking to manipulating data in the buffers and not manipulating the object model from the OMP threads, you shouldn't hit any of the threading problems that I would ordinarily be concerned about.
Exactly, that's the idea.
I'm not sure about having a minimum of 8 threads, though
It's not a minimum, it's a maximum :) min(8,x) returns the minimum of 8 and x, so 8 is the largest value it will return.
-- on a system with many less CPUs than that you're just losing by adding more threads. For a wholly CPU-bound task like this, the number of threads to use should be the number of CPUs.
I agree completely, that's why I tried using the default of omp_get_num_procs(). As this doesn't work properly on my system (as far as I can tell) I recommend setting it through the preferences though. However, hopefully this will work better with future versions of GCC (perhaps it will even work on Linux, as GCC has had OpenMP support for a bit longer there).
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ _______________________________________________ Inkscape-devel mailing list Inkscape-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/inkscape-devel
2008/11/18 Jasper van de Gronde <th.v.d.gronde@...528...>
Using GCC's support of OpenMP (now also in MingW!) I had a stab at making the outer loop of blurring parallel (the loops are typical DSP loops and shouldn't even throw an exception, so it should be perfectly safe). On my dual-core system blurring (a large image) is approximately twice as fast, so it seems to scale really well.
As I don't have access to a Linux system at the moment I haven't modified the Makefiles to add support, but it shouldn't be too difficult (see the changes to build.xml). Also, I've made it write a log to be able to analyze the performance, but it uses Win32 specific functions for timing. So, if anyone would like to have a go at making it work on Linux and/or testing it that would be great :)
The number of threads defaults to whatever omp_get_num_procs() returns, but on my system this doesn't work... So to use it, set the number of threads in the preferences, using: <group id="threading" numthreads="1" /> in the "options" group (so it's /options/threading/numthreads).
Obviously this is not the most efficient way to do this, as new threads are created for each blur. But so far the overhead doesn't appear to be too bad, compared with the actual blurring operation. And using OpenMP the impact on the code is minimal, I only had to put two #pragma's before the blur loops, allocate the temporary data it needs for all threads and add some preference for the number of threads. So if this works out it might provide a relatively easy way to start making use of all these dual- and quad-core machines people have lying around doing almost nothing these days :)
Patch attached, if wanted I can also provide a binary for Windows. This version writes a log file (blurlog.txt) that lists how long each blur takes.
Note that inkscape also needs mingwm10.dll to run with this patch. As it isn't included with devlibs at the moment you'll have to copy it to Inkscape's directory manually (it's in the MingW bin directory).
Very cool, am showing 40-50% speed up on my dual core on the bigger objects, total times quoted in the blurlog for 1vs 2 with 150 odd blurs in the drawing is 3.6sec vs 2.24 took a lot longer than that to render tho, so we're clearly eating a lot of time elsewhere too. Will stick this build on a USB drive and try it on my a dual processor machine at work with 2x quad xeons in it :D Will try get some better usertime estimates too. Definite step in the right direction tho...
Cheers
Sim
Felipe Sanches wrote:
I adapted configure.ac http://configure.ac and properly compiled it. I am sure that my configure.ac http://configure.ac is wrong (flags placed in wrong places and probably not proper behaviour when one does not have openmp lid-devel installed), so, somebedy please fix it. It at least compiled here in my machine.
Thanks a lot :)
I tested it and it worked ok, but I have no idea how to benchmark it. I removed the logging code that was windows-only.
Yeah, I'm afraid I'm just not familiar enough with Linux programming to know how to do accurate timing (let alone cross-platform timing). You could try profiling though. For me this doesn't work (for some reason I get really crazy profiling results on Vista), but it will probably work just fine for you, see: http://wiki.inkscape.org/wiki/index.php/Profiling
I don't know exactly how well it works with multi-threading though.
here is my version of the patch, I hope it helps.
If a few people test the code and no fundamental problems arise I'll probably commit it, so yes, then it would definitely help :) (To prevent the build from breaking.)
john cliff wrote:
Very cool, am showing 40-50% speed up on my dual core on the bigger objects, total times quoted in the blurlog for 1vs 2 with 150 odd blurs in the drawing is 3.6sec vs 2.24 took a lot longer than that to render tho, so we're clearly eating a lot of time elsewhere too.
Oh, definitely! With IIR blur and the resampled blur approximation code it's usually quite fast, but it's still one of the more (most?) expensive operations. I'm still thinking about some ways to make it even faster from an algorithmic point of view, but that will take some time. In the reasonably near future I may incorporate some code that takes advantage of the piecewise linear nature of, vector, images for example (but I still have to evaluate how it performs compared to the current options).
In addition, the times in the blurlog are only for the actual blurring code, so do not include any time spent on resampling (if used) or allocation of temporary arrays and such.
Will stick this build on a USB drive and try it on my a dual processor machine at work with 2x quad xeons in it :D
I though the maximum of 8 threads would be quite sufficient for most people and provide a reasonable upper bound to prevent people from doing stuff like selecting 200 threads, but I see I may have been a bit conservative :)
Will try get some better usertime estimates too. Definite step in the right direction tho...
Thanks, you may also want to have a look at profiling: http://wiki.inkscape.org/wiki/index.php/Profiling As I said in another mail this currently doesn't work for me on Vista, but YMMV (and it did work for me on XP).
Would it be possible to optimise the code with SSE vectorisation at all? That would require hand optimised assembly. Alternatively I wonder if liboil would have any loops that could be brought in?
On Wed, 2008-11-19 at 09:15 +0100, Jasper van de Gronde wrote:
john cliff wrote:
Very cool, am showing 40-50% speed up on my dual core on the bigger objects, total times quoted in the blurlog for 1vs 2 with 150 odd blurs in the drawing is 3.6sec vs 2.24 took a lot longer than that to render tho, so we're clearly eating a lot of time elsewhere too.
Oh, definitely! With IIR blur and the resampled blur approximation code it's usually quite fast, but it's still one of the more (most?) expensive operations. I'm still thinking about some ways to make it even faster from an algorithmic point of view, but that will take some time. In the reasonably near future I may incorporate some code that takes advantage of the piecewise linear nature of, vector, images for example (but I still have to evaluate how it performs compared to the current options).
In addition, the times in the blurlog are only for the actual blurring code, so do not include any time spent on resampling (if used) or allocation of temporary arrays and such.
Will stick this build on a USB drive and try it on my a dual processor machine at work with 2x quad xeons in it :D
I though the maximum of 8 threads would be quite sufficient for most people and provide a reasonable upper bound to prevent people from doing stuff like selecting 200 threads, but I see I may have been a bit conservative :)
Will try get some better usertime estimates too. Definite step in the right direction tho...
Thanks, you may also want to have a look at profiling: http://wiki.inkscape.org/wiki/index.php/Profiling As I said in another mail this currently doesn't work for me on Vista, but YMMV (and it did work for me on XP).
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ _______________________________________________ Inkscape-devel mailing list Inkscape-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/inkscape-devel
Joel Holdsworth wrote:
Would it be possible to optimise the code with SSE vectorisation at all?
Most likely. Note that this would be pretty much orthogonal to using multithreading (and/or algorithmic enhancements) though. (Which definitely doesn't make it less attractive of course.)
That would require hand optimised assembly.
More or less. There are intrinsics that can be used which make it a lot easier. In fact I have some code like this to do rendering, which I still haven't incorporated because I wanted to check out liboil at the time and I may have to double-check the code to see if it's still okay.
Alternatively I wonder if liboil would have any loops that could be brought in?
Liboil is a very nice initiative, but it's reasonably limited in its scope. In theory we could use some of their compositing code, but: - Inkscape uses quite a few more pixel formats (which I doubt are all actually used, so this may be a moot point). - I have serious doubts about the quality of the computations. Based on a quick look at the source they repeatedly round the results (this makes it possible to optimize a bit more), which Inkscape also used to do and actually leads to visible artifacts.
For the blurring code I don't think much of their current code applies and at the moment I'm not too keen on trying to get liboil to change/improve (work-wise). But if you (or anyone else) feels up to it, be my guest :)
On Wed, 2008-11-19 at 09:01 +0100, Jasper van de Gronde wrote:
Felipe Sanches wrote:
I tested it and it worked ok, but I have no idea how to benchmark it. I removed the logging code that was windows-only.
Yeah, I'm afraid I'm just not familiar enough with Linux programming to know how to do accurate timing (let alone cross-platform timing). You could try profiling though. For me this doesn't work (for some reason I get really crazy profiling results on Vista), but it will probably work just fine for you, see: http://wiki.inkscape.org/wiki/index.php/Profiling
I don't know exactly how well it works with multi-threading though.
Not the most accurate profiling but what I used is this:
time inkscape --verb=FileQuit myfile.svg
This will cause the file to open, render once and the Inkscape quits. Doing that on Bulia's gradient mesh example (which uses blur) it took 5.3s without OpenMP and 4.8s with. So there is about a 10% speed up. Not sure how that breaks down overall, but it seems like a nice improvement.
here is my version of the patch, I hope it helps.
If a few people test the code and no fundamental problems arise I'll probably commit it, so yes, then it would definitely help :) (To prevent the build from breaking.)
I changed the autotools stuff a bit to be more flexible with other compilers pulling an macro from the autotools macro archive. I'll commit it once SF starts talking to me again...
--Ted
On Fri, Nov 21, 2008 at 11:48 PM, Ted Gould <ted@...11...> wrote:
I changed the autotools stuff a bit to be more flexible with other compilers pulling an macro from the autotools macro archive. I'll commit it once SF starts talking to me again...
With my gcc 4.0.3, it fails as of rev 20254:
checking for OpenMP flag of C++ compiler... unknown configure: error: Inkscape requires OpenMP support to build
Can't please make this an optional configure flag? I don't have multiple cores and would like to avoid upgrading gcc on this machine, as it would mean recompiling a ton of libraries.
On 11/24/2008 10:20 AM, bulia byak wrote:
With my gcc 4.0.3, it fails as of rev 20254:
checking for OpenMP flag of C++ compiler... unknown configure: error: Inkscape requires OpenMP support to build
Can't please make this an optional configure flag? I don't have multiple cores and would like to avoid upgrading gcc on this machine, as it would mean recompiling a ton of libraries.
Hi, Bulia. ltns!
I was thinking the same thing. It would fit into our current scheme of
#ifdef HAVE_DEPENDENCY and #ifdef WITH_FEATURE
This idea: "you MUST upgrade," doesn't agree at all.
I would be surprised if our friends in the OSX world did not have this problem, too.
bob
On Mon, 2008-11-24 at 12:20 -0400, bulia byak wrote:
On Fri, Nov 21, 2008 at 11:48 PM, Ted Gould <ted@...11...> wrote:
I changed the autotools stuff a bit to be more flexible with other compilers pulling an macro from the autotools macro archive. I'll commit it once SF starts talking to me again...
With my gcc 4.0.3, it fails as of rev 20254:
checking for OpenMP flag of C++ compiler... unknown configure: error: Inkscape requires OpenMP support to build
Can't please make this an optional configure flag? I don't have multiple cores and would like to avoid upgrading gcc on this machine, as it would mean recompiling a ton of libraries.
We can of course, but I didn't because OpenMP was integrated into GCC March 9, 2006[1] which was then released May 13, 2007[2]. And the GTK+ we're depending on 2.12 was released Fri, 14 Sep 2007[3]. It would seem that if you had a new enough GTK+ you'd have a new enough GCC.
I guess my question here is that in this rather rare case is it worth the complexity of having this, and likely more in the future, ifdefs in the code?
--Ted
[1] http://gcc.gnu.org/projects/gomp/ [2] http://gcc.gnu.org/gcc-4.2/ [3] http://mail.gnome.org/archives/gtk-devel-list/2007-September/msg00052.html
On Nov 24, 2008, at 12:54 PM, Ted Gould wrote:
On Mon, 2008-11-24 at 12:20 -0400, bulia byak wrote:
On Fri, Nov 21, 2008 at 11:48 PM, Ted Gould <ted@...11...> wrote:
I changed the autotools stuff a bit to be more flexible with other compilers pulling an macro from the autotools macro archive. I'll commit it once SF starts talking to me again...
With my gcc 4.0.3, it fails as of rev 20254:
checking for OpenMP flag of C++ compiler... unknown configure: error: Inkscape requires OpenMP support to build
Can't please make this an optional configure flag? I don't have multiple cores and would like to avoid upgrading gcc on this machine, as it would mean recompiling a ton of libraries.
We can of course, but I didn't because OpenMP was integrated into GCC March 9, 2006[1] which was then released May 13, 2007[2]. And the GTK+ we're depending on 2.12 was released Fri, 14 Sep 2007[3]. It would seem that if you had a new enough GTK+ you'd have a new enough GCC.
I guess my question here is that in this rather rare case is it worth the complexity of having this, and likely more in the future, ifdefs in the code?
Not a rare case.
On OS X, Apple is in charge of what compilers get bumped when. So the OpenMP stuff is failing at the moment.
I'm trying to make the config do the right thing and allow it to keep building here in Mac-land.
On Mon, Nov 24, 2008 at 4:54 PM, Ted Gould <ted@...11...> wrote:
We can of course, but I didn't because OpenMP was integrated into GCC March 9, 2006[1] which was then released May 13, 2007[2]. And the GTK+ we're depending on 2.12 was released Fri, 14 Sep 2007[3]. It would seem that if you had a new enough GTK+ you'd have a new enough GCC.
Upgrading GTK is simpler than upgrading the compiler, and there is more motivation for GTK upgrade because of visible new features. So I wouldn't be surprised if I'm not alone with a new GTK and an old GCC.
On Mon, 2008-11-24 at 13:24 -0800, Jon A. Cruz wrote:
Not a rare case.
On OS X, Apple is in charge of what compilers get bumped when. So the OpenMP stuff is failing at the moment.
I'm trying to make the config do the right thing and allow it to keep building here in Mac-land.
GCC 4.2 appeared in XCode 3.1 which seems to have been released in June.
http://developer.apple.com/releasenotes/DeveloperTools/RN-Xcode/index.html
Seems that it only runs on 10.5 but will build for 10.3 and higher.
--Ted
On Mon, 2008-11-24 at 17:31 -0400, bulia byak wrote:
On Mon, Nov 24, 2008 at 4:54 PM, Ted Gould <ted@...11...> wrote:
We can of course, but I didn't because OpenMP was integrated into GCC March 9, 2006[1] which was then released May 13, 2007[2]. And the GTK+ we're depending on 2.12 was released Fri, 14 Sep 2007[3]. It would seem that if you had a new enough GTK+ you'd have a new enough GCC.
Upgrading GTK is simpler than upgrading the compiler, and there is more motivation for GTK upgrade because of visible new features. So I wouldn't be surprised if I'm not alone with a new GTK and an old GCC.
I'm just not sure what our dependency policy should be then. Case-by-case? I feel like adding in OpenMP support was pretty conservative WRT dependencies.
--Ted
On Nov 24, 2008, at 1:35 PM, Ted Gould wrote:
On Mon, 2008-11-24 at 13:24 -0800, Jon A. Cruz wrote:
Not a rare case.
On OS X, Apple is in charge of what compilers get bumped when. So the OpenMP stuff is failing at the moment.
I'm trying to make the config do the right thing and allow it to keep building here in Mac-land.
GCC 4.2 appeared in XCode 3.1 which seems to have been released in June.
http://developer.apple.com/releasenotes/DeveloperTools/RN-Xcode/ index.html
Seems that it only runs on 10.5 but will build for 10.3 and higher.
Yes, and many systems are limited to 10.4
On Mon, 2008-11-24 at 13:49 -0800, Jon A. Cruz wrote:
On Nov 24, 2008, at 1:35 PM, Ted Gould wrote:
On Mon, 2008-11-24 at 13:24 -0800, Jon A. Cruz wrote:
Not a rare case.
On OS X, Apple is in charge of what compilers get bumped when. So the OpenMP stuff is failing at the moment.
I'm trying to make the config do the right thing and allow it to keep building here in Mac-land.
GCC 4.2 appeared in XCode 3.1 which seems to have been released in June.
http://developer.apple.com/releasenotes/DeveloperTools/RN-Xcode/ index.html
Seems that it only runs on 10.5 but will build for 10.3 and higher.
Yes, and many systems are limited to 10.4
http://en.wikipedia.org/wiki/Mac_OS_X_v10.5#System_requirements
Machines without G4s? Those without DVD drives?
Seems the first G4 processors were in Summer of 1999:
http://en.wikipedia.org/wiki/PowerPC_G4#PowerPC_7400
I think the last shipping G3s were in the iBook with the iBook G4 shipping in Oct. 2003.
http://en.wikipedia.org/wiki/IBook#iBook_G4
I think that 5-years is a reasonable deprecation time for hardware running the latest version.
--Ted
On Nov 24, 2008, at 2:10 PM, Ted Gould wrote:
http://en.wikipedia.org/wiki/Mac_OS_X_v10.5#System_requirements
Machines without G4s? Those without DVD drives?
Seems the first G4 processors were in Summer of 1999:
http://en.wikipedia.org/wiki/PowerPC_G4#PowerPC_7400
I think the last shipping G3s were in the iBook with the iBook G4 shipping in Oct. 2003.
http://en.wikipedia.org/wiki/IBook#iBook_G4
I think that 5-years is a reasonable deprecation time for hardware running the latest version.
No, machines that customers did not shell out extra money to get the latest and greatest OS.
Going into this past summer, some reasonable estimates had 10.5 only hitting 30% adoption, with 10.4 still holding about 60%.
Remember, the upgrade cycle is a little different than Linux, since one actually has to pony up cash to move forward.
Also businesses are a large factor. I'd done some research for work and it turns out that a lot of businesses are not upgrading to 10.5, so for anything targeting the enterprise support for 10.4 was a must.
On Mon, 2008-11-24 at 14:18 -0800, Jon A. Cruz wrote:
No, machines that customers did not shell out extra money to get the latest and greatest OS.
Hmm, you're going to have a hard time generating sympathy there. I'm fine with people buying proprietary software, but realize you're on the treadmill that your vendor puts you on.
Going into this past summer, some reasonable estimates had 10.5 only hitting 30% adoption, with 10.4 still holding about 60%.
Remember, the upgrade cycle is a little different than Linux, since one actually has to pony up cash to move forward.
Also businesses are a large factor. I'd done some research for work and it turns out that a lot of businesses are not upgrading to 10.5, so for anything targeting the enterprise support for 10.4 was a must.
Remember that we're talking about "building" not running. XCode 3.1 can build binaries for 10.3. We're not limiting people from running, just building. I imagine all of those businesses have one machine running 10.5 to do a build on.
--Ted
2008/11/24 Ted Gould <ted@...11...>
On Mon, 2008-11-24 at 14:18 -0800, Jon A. Cruz wrote:
No, machines that customers did not shell out extra money to get the latest and greatest OS.
Hmm, you're going to have a hard time generating sympathy there. I'm fine with people buying proprietary software, but realize you're on the treadmill that your vendor puts you on.
no actually hes not. theres an awful lot of people out there who dont a choice but to be 'on a treadmill' and theres also a large portion who wouldnt even realise they've chosen a treadmill. Add to that the fact that the treadmill your discussing still offers a better customer interface for most things than the alternatives, even if you are running the 4 yr old version and I have no problem with us trying to maintain support for it.
Going into this past summer, some reasonable estimates had 10.5 only hitting 30% adoption, with 10.4 still holding about 60%.
Remember, the upgrade cycle is a little different than Linux, since one actually has to pony up cash to move forward.
Also businesses are a large factor. I'd done some research for work and it turns out that a lot of businesses are not upgrading to 10.5, so for anything targeting the enterprise support for 10.4 was a must.
Remember that we're talking about "building" not running. XCode 3.1 can build binaries for 10.3. We're not limiting people from running, just building. I imagine all of those businesses have one machine running 10.5 to do a build on.
On Mon, 2008-11-24 at 23:30 +0000, john cliff wrote:
2008/11/24 Ted Gould <ted@...11...> On Mon, 2008-11-24 at 14:18 -0800, Jon A. Cruz wrote:
> No, machines that customers did not shell out extra money to get the > latest and greatest OS. Hmm, you're going to have a hard time generating sympathy there. I'm fine with people buying proprietary software, but realize you're on the treadmill that your vendor puts you on.
I should have said "sympathy from me."
no actually hes not. theres an awful lot of people out there who dont a choice but to be 'on a treadmill'
People in a free state have a choice. They may not be willing to make the trade offs to do so; and I would encourage anyone to choose food and shelter first. But, I imagine that most of those people are not running OS X on a Mac. And, the "at work" argument doesn't work here because we're talking about developers not users, so unless their getting paid to work on Inkscape I'd encourage them to not use work resources anyway.
and theres also a large portion who wouldnt even realise they've chosen a treadmill.
An opportunity for education.
Add to that the fact that the treadmill your discussing still offers a better customer interface for most things
Off topic.
than the alternatives, even if you are running the 4 yr old version and I have no problem with us trying to maintain support for it.
No, to be precise, we are asking developers on a particular platform to use a less obsolete version of a compiler that is supplied by their vendor. They already, in order to build Inkscape, have to download a huge selection of libraries that aren't supplied by their vendor and link those into an application package in ways that their vendor does not encourage.
Just to be clear. There is no issue building binaries for users of OS X all the way back to 10.3. We are only talking about developers.
I'm not happy about making the OpenMP stuff compile out, but I'd live with it. I'm entirely unwilling to have the Inkscape dependency policy be held hostage by Apple.
--Ted
Obviously we need Bryce (Solomon) here to cut this baby in half.
Executive decision.
bob
Bob Jamison wrote:
Obviously we need Bryce (Solomon) here to cut this baby in half.
Well, if someone with knowledge of auto* knows how to modify the build files on Linux+Mac so that the OpenMP dependency is optional, then simply make it so. There is a pre-existing define that can be checked to see if the compiler has OpenMP support, so there is no added complexity, apart from a few #ifdef's.
For reference, the variable is _OPENMP and is described in section 2.2 of the following document: http://www.openmp.org/mp-documents/spec30.pdf (Note that I think GCC still supports a slightly older version of OpenMP, but so far I haven't found anything that didn't work as specified in this document.)
If I had expected OpenMP support to be a problem on Linux or Mac OS X I would probably have included ifdef's in the first place, as it has only been in GCC for a pretty short time.
On Windows the story is a bit different. As nice as buildtool is, it has no easy way to make dependencies optional, and the required compiler is made conveniently available from Inkscape's Wiki, it doesn't even need to be installed in any way beyond extracting the archive.
On Nov 24, 2008, at 11:49 PM, Jasper van de Gronde wrote:
Well, if someone with knowledge of auto* knows how to modify the build files on Linux+Mac so that the OpenMP dependency is optional, then simply make it so. There is a pre-existing define that can be checked to see if the compiler has OpenMP support, so there is no added complexity, apart from a few #ifdef's.
Should be already committed.
On Mon, Nov 24, 2008 at 10:32:25PM -0600, Bob Jamison wrote:
Obviously we need Bryce (Solomon) here to cut this baby in half.
Executive decision.
Looks like this already got sorted out.
Making dependencies optional seems like a good solution. Even if we only had to care about Linux, it would still be most wise to introduce new dependencies as optional, turned off with an autoconf flag, because if there are bugs in the new code it gives testers a way to isolate them.
Bryce
participants (10)
-
Bob Jamison
-
Bryce Harrington
-
bulia byak
-
Felipe Sanches
-
Jasper van de Gronde
-
Joel Holdsworth
-
john cliff
-
Jon A. Cruz
-
MenTaLguY
-
Ted Gould