
I've been working on some of the groundwork for the layers dialog; doing a routine test build, I discovered that Inkscape had started crashing.
Backtrace points to livarot, which I haven't touched at all...
The only part of my new code that is actually currently used is the addition of three sigc++ signals to SPReprDoc; so far as I can tell that should not cause Shape::~Shape() to crash.
Can you all review this patch (attached) and see if I'm missing something obvious? I'm at my wits' end here...
-mental

On Tue, 07 Sep 2004 02:28:34 -0400, MenTaLguY <mental@...3...> wrote:
I've been working on some of the groundwork for the layers dialog; doing a routine test build, I discovered that Inkscape had started crashing.
You mean, it's crashing with this patch but is OK without it?
I can't test because I get this when trying to compile with your patch:
make: *** No rule to make target `widgets/document-tree-model.cpp', needed by `widgets/document-tree-model.o'. Stop.
Anyway, all I can say is that this seems to me somewhat similar to what we had on Windows recently - misleading tracebacks, sigc connection crashes etc. Maybe some problem with boehm again. Just a guess.

I have noticed this, also, that the last few lines of the backtrace are often incorrect, and that the cause might be elsewhere. Maybe this is the real effect of a 'stack corruption', and we are seeing it. Several times recently gdb has reported a segfault in fill-style.cpp, when I -know- that it is not there at all.
Bob
bulia byak wrote:
On Tue, 07 Sep 2004 02:28:34 -0400, MenTaLguY <mental@...3...> wrote:
I've been working on some of the groundwork for the layers dialog; doing a routine test build, I discovered that Inkscape had started crashing.
You mean, it's crashing with this patch but is OK without it?
I can't test because I get this when trying to compile with your patch:
make: *** No rule to make target `widgets/document-tree-model.cpp', needed by `widgets/document-tree-model.o'. Stop.
Anyway, all I can say is that this seems to me somewhat similar to what we had on Windows recently - misleading tracebacks, sigc connection crashes etc. Maybe some problem with boehm again. Just a guess.

On Tue, 2004-09-07 at 09:01, Bob Jamison wrote:
I have noticed this, also, that the last few lines of the backtrace are often incorrect, and that the cause might be elsewhere. Maybe this is the real effect of a 'stack corruption', and we are seeing it. Several times recently gdb has reported a segfault in fill-style.cpp, when I -know- that it is not there at all.
I've noticed that a common thread among the various crashes is that they happen in a specifically malloc()-related context (at least the ones I've seen personally, and I think also the sigc++ crashes).
I'm wondering whether something is writing out of bounds and corrupting the malloc headers. I have tried testing for this using ElectricFence, but at least on my machine we allocate too many small blocks with EF to make it all the way through startup. (not a bug on our part, just an annoying limitation of EF)
Anyway, if the malloc data structures are getting corrupted, stack corruption might easily follow on from that.
As far as potential culprits? libgc itself should be immune to corruption as it does not try to write to memory it does not manage, but of course if its own header structures are corrupted, it could theoretically stomp on something else.
Otherwise, the only other (obvious) possibilities I can think of are (in decreasing likelihood):
1) I screwed up the definition of Inkscape::GC::Finalized, and it's not using the right value for 'this' when calling the destructor.
In this case, introducing sigc++ signals to SPReprDoc (derived from Inkscape::GC::Finalized) could well result in heap corruption when the object was destroyed, since the signal destructors try to free memory and their internal pointers would be wrong.
2) livarot is corrupting the malloc headers, and it's pure luck that we have not been caught by it before.
livarot fails to check for array indices of -1 in many places where they may occur. I've caught it reading from such locations in malloc()ed arrays; it's certain that if it ever writes to them it will corrupt the heap.
Such corruption might be hard to detect, and only result in dramatic failures with specific usage patterns (like the allocation sequence introduced by the new sigc++ signals).
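To make the -1 index concern concrete, here is a minimal sketch of a guarded accessor. It is not livarot code: `checked_at` and `checked_store` are hypothetical helpers, and the assumption is that -1 is used as a "no such element" sentinel that can reach an array subscript unchecked.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of livarot: returns a pointer to element
// `index`, or nullptr when the index is negative (e.g. a -1 sentinel) or
// past the end of the container.
template <typename T>
T *checked_at(std::vector<T> &items, int index)
{
    if (index < 0 || static_cast<std::size_t>(index) >= items.size()) {
        return nullptr;
    }
    return &items[static_cast<std::size_t>(index)];
}

// Stores `value` only when `index` is valid and reports whether it did,
// so a stray -1 is rejected instead of landing in front of the block.
template <typename T>
bool checked_store(std::vector<T> &items, int index, T const &value)
{
    T *slot = checked_at(items, index);
    if (slot == nullptr) {
        return false;
    }
    *slot = value;
    return true;
}
```

With a raw malloc()ed array, a write at index -1 lands in whatever precedes the block, which on many allocators is the block's own bookkeeping; routing writes through a check like this turns that into a detectable failure.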
-mental

On Tue, 2004-09-07 at 21:18, MenTaLguY wrote:
- I screwed up the definition of Inkscape::GC::Finalized, and it's not using the right value for 'this' when calling the destructor.
In this case, introducing sigc++ signals to SPReprDoc (derived from Inkscape::GC::Finalized) could well result in heap corruption when the object was destroyed, since the signal destructors try to free memory and their internal pointers would be wrong.
This one doesn't appear to be the problem; I set breakpoints on Inkscape::GC::Finalized::~Finalized() and SPReprDoc::~SPReprDoc(); the crash apparently occurs before either has been called.
(which I should have known; finalized garbage collected objects aren't destroyed until the idle loop)
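Although this hypothesis was ruled out, the failure mode it describes is easier to follow with a picture of the pointer adjustment involved. The sketch below is an illustration only, not the Inkscape::GC::Finalized implementation: `Finalized`, `Payload`, `Doc` and both functions are hypothetical stand-ins, and it assumes a collector callback that is handed the raw start address of the block.

```cpp
#include <cstddef>

// Stand-in for a finalizable base class (hypothetical, for illustration).
struct Finalized {
    virtual ~Finalized() {}
};

// Some other base that occupies the start of the object.
struct Payload {
    virtual ~Payload() {}
    int data[4];
};

// Finalized is not the first base, so its subobject typically does not
// start at the same address as the Doc object itself.
struct Doc : Payload, Finalized {};

// Byte distance from the start of the object to its Finalized subobject.
std::ptrdiff_t finalized_offset(Doc *doc)
{
    Finalized *base = doc; // the implicit conversion adjusts the pointer
    return reinterpret_cast<char *>(base) - reinterpret_cast<char *>(doc);
}

// What a finalizer callback must do when it only receives the block's
// start address: reapply the recorded offset before touching the object.
Finalized *finalized_from_block(void *block, std::ptrdiff_t offset)
{
    return reinterpret_cast<Finalized *>(static_cast<char *>(block) + offset);
}
```

If the offset were dropped or miscomputed, the destructor would run with a 'this' that points into the wrong part of the object, and any member that frees memory (such as a signal) would be working from garbage pointers.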
-mental

Could those of you on a glibc-using platform please try setting the environment variable MALLOC_CHECK_ to 1 or 2 and running Inkscape?
- a value of 1 will print an error message to stderr if malloc is abused
- a value of 2 will call abort() -- the latter is more useful in the debugger probably, so you can see the backtrace
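For anyone who would rather not set the variable by hand, a small launcher can do it. This is a sketch under stated assumptions: `set_malloc_check` and `exec_with_malloc_check` are hypothetical names, only the two levels described above are accepted, and it relies on the POSIX calls `setenv` and `execvp`.

```cpp
#include <cstdlib>
#include <string>
#include <unistd.h>

// Sets MALLOC_CHECK_ in this process's environment, so that any program
// exec()ed afterwards inherits it. Only levels 1 and 2 are accepted.
bool set_malloc_check(int level)
{
    if (level != 1 && level != 2) {
        return false;
    }
    return setenv("MALLOC_CHECK_", std::to_string(level).c_str(), 1) == 0;
}

// Replaces the current process with argv[0], run with the given level.
// Returns -1 only if the level is rejected or the exec fails.
int exec_with_malloc_check(int level, char *const argv[])
{
    if (!set_malloc_check(level)) {
        return -1;
    }
    execvp(argv[0], argv);
    return -1;
}
```

Calling `exec_with_malloc_check(2, argv)` with `argv[0]` set to the Inkscape binary would match the debugger-friendly setting. On newer glibc releases the checks may need the separate malloc debugging library to be preloaded; check the documentation for your version.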
-mental

On Tue, 7 Sep 2004, MenTaLguY wrote:
Could those of you on a glibc-using platform please try setting the environment variable MALLOC_CHECK_ to 1 or 2 and running Inkscape?
a value of 1 will print an error message to stderr if malloc is abused
a value of 2 will call abort() -- the latter is more useful in the debugger probably, so you can see the backtrace
Hi,
This shows up an overwrite of a malloc()ed array in FlowRes.cpp. I'm just working on a fix now.
Is there anything else that I should test? I haven't been able to find any other problems with random clicking, drawing shapes etc.
Carl

On Wed, 2004-09-08 at 07:23, Carl Hetherington wrote:
This shows up an overwrite of a malloc()ed array in FlowRes.cpp. I'm just working on a fix now.
Is there anything else that I should test? I haven't been able to find any other problems with random clicking, drawing shapes etc.
Different people seem to encounter different bugs. Unfortunately MALLOC_CHECK_ is not guaranteed to catch every bug.
Myself, I just fixed a handful of fairly serious bugs in my own code (missing initializations in SPRepr's copy constructor, primarily), but that doesn't appear to have solved all of the heap corruption problems.
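As an illustration of that class of bug (not the actual SPRepr code), here is a minimal sketch of a copy constructor that initializes every member. `Node` and its fields are hypothetical; the assumption is a tree node whose copy starts out detached from any parent or children.

```cpp
#include <string>
#include <utility>

// Hypothetical tree node, standing in for a repr-like structure.
struct Node {
    std::string name;
    Node *parent;
    Node *first_child;
    int child_count;

    explicit Node(std::string n)
        : name(std::move(n)), parent(nullptr), first_child(nullptr), child_count(0) {}

    // Every member appears in the initializer list. Leaving out parent,
    // first_child or child_count would leave them indeterminate, and later
    // code that walks or frees through them could corrupt the heap.
    Node(Node const &other)
        : name(other.name), parent(nullptr), first_child(nullptr), child_count(0) {}
};
```

The copy deliberately resets the links rather than copying them, since a detached copy that still pointed at the original's parent or children would be a different bug.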
-mental

On Tue, 7 Sep 2004, MenTaLguY wrote:
Could those of you on a glibc-using platform please try setting the environment variable MALLOC_CHECK_ to 1 or 2 and running Inkscape?
a value of 1 will print an error message to stderr if malloc is abused
a value of 2 will call abort() -- the latter is more useful in the debugger probably, so you can see the backtrace
OK, I've fixed two problems highlighted by this check; one in FlowRes.cpp (a buffer overrun whenever doing basically anything with a text object) and one in helper/stock-items.cpp (a bad free(), which appeared when you chose a marker).
I can't find any more, at the moment. Is there any other operation I should try?
Carl

MenTaLguY wrote:
I'm wondering whether something is writing out of bounds and corrupting the malloc headers. I have tried testing for this using ElectricFence, but at least on my machine we allocate too many small blocks with EF to make it all the way through startup. (not a bug on our part, just an annoying limitation of EF)
Can you try valgrind? In general I've had much better luck with that. (It also catches many things that EF doesn't)

I am looking around for what might be a cause for <text> nodes not being rendered on Win32.
In FontFactory.cpp and FontInstance.cpp, there are numerous #ifdef switches between XFT and Win32. And any time we have different sets of code for different architectures, we have problems maintaining consistent functionality. And bugs that get fixed on one, often do not get fixed on the other.
Wouldn't it be better for us to drop the XFT/Win32 code, and use Freetype2? This level of abstraction means that we can let someone -else- worry about the machine dependent implementation. Or has this already been tried? Just a thought.
Bob

On Wed, 8 Sep 2004, Jon A. Cruz wrote:
MenTaLguY wrote:
I'm wondering whether something is writing out of bounds and corrupting the malloc headers. I have tried testing for this using ElectricFence, but at least on my machine we allocate too many small blocks with EF to make it all the way through startup. (not a bug on our part, just an annoying limitation of EF)
Can you try valgrind? In general I've had much better luck with that. (It also catches many things that EF doesn't)
valgrind is really unhappy with livarot at the moment; it hits 50k errors before the main window even comes up. But I suppose you could make a suppression profile for the very common one (a use of uninitialized data) and then look for overruns etc. that way.
Cheers
Carl

On Wed, 8 Sep 2004, Carl Hetherington wrote:
On Wed, 8 Sep 2004, Jon A. Cruz wrote:
MenTaLguY wrote:
I'm wondering whether something is writing out of bounds and corrupting the malloc headers. I have tried testing for this using ElectricFence, but at least on my machine we allocate too many small blocks with EF to make it all the way through startup. (not a bug on our part, just an annoying limitation of EF)
Can you try valgrind? In general I've had much better luck with that. (It also catches many things that EF doesn't)
valgrind is really unhappy with livarot at the moment; it hits 50k errors before the main window even comes up. But I suppose you could make a suppression profile for the very common one (a use of uninitialized data) and then look for overruns etc. that way.
I just had a fiddle with this, and the errors that valgrind finds seem ONLY to be those relating to uninitialized memory locations; I couldn't find any heap overwrites. But I only had a quick go; valgrind is very slow on my machine.
Cheers
Carl

On Wed, 2004-09-08 at 11:57, Carl Hetherington wrote:
I just had a fiddle with this, and the errors that valgrind finds seem ONLY to be those relating to uninitialized memory locations; I couldn't find any heap overwrites. But I only had a quick go; valgrind is very slow on my machine.
Got one:
==2132== Invalid write of size 4
==2132==    at 0x8123898: prefs_get_recent_files() (prefs-utils.cpp:179)
==2132==    by 0x811EE5E: sp_menu_append_recent_documents(_GtkWidget*, SPView*) (interface.cpp:640)
==2132==    by 0x811E597: sp_ui_menu_append_submenu(_GtkMenu*, SPView*, void (*)(_GtkWidget*, SPView*), char const*, char const*, char const*) (interface.cpp:481)
==2132==    by 0x811F076: sp_ui_file_menu(_GtkMenu*, SPDocument*, SPView*) (interface.cpp:678)
==2132==  Address 0x1C69CC8C is 0 bytes after a block of size 4 alloc'd
==2132==    at 0x1B904EDD: malloc (vg_replace_malloc.c:131)
==2132==    by 0x1C3530D6: g_malloc (in /usr/lib/libglib-2.0.so.0.400.2)
==2132==    by 0x812382E: prefs_get_recent_files() (prefs-utils.cpp:169)
==2132==    by 0x811EE5E: sp_menu_append_recent_documents(_GtkWidget*, SPView*) (interface.cpp:640)
This one is my fault, as I broke sp_repr_n_children() :/
Should be fixed in CVS shortly.
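The trace has the shape of a buffer sized from a child count that came back too small, followed by a write one slot past the end. A minimal sketch of a more forgiving pattern follows; it is not the prefs-utils.cpp code, and `Child`, `count_children` and `collect_uris` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical singly linked child list, standing in for a node's children.
struct Child {
    std::string uri;
    Child const *next;
};

// Counts the children by walking the list to its end.
std::size_t count_children(Child const *first)
{
    std::size_t n = 0;
    for (Child const *c = first; c != nullptr; c = c->next) {
        ++n;
    }
    return n;
}

// Collects the URIs into a container that grows on demand. The count is
// only a capacity hint, so a wrong count costs a reallocation instead of
// a write past the end of the block.
std::vector<std::string> collect_uris(Child const *first)
{
    std::vector<std::string> uris;
    uris.reserve(count_children(first));
    for (Child const *c = first; c != nullptr; c = c->next) {
        uris.push_back(c->uri);
    }
    return uris;
}
```

The point of the design is that the fill loop never trusts the count for correctness, so a broken counting function degrades performance rather than the heap.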
-mental

On Thu, 2004-09-09 at 00:21, MenTaLguY wrote:
This one is my fault, as I broke sp_repr_n_children() :/
Should be fixed in CVS shortly.
Actually the bug turned out to have been a change local to my tree. Whoops ^^;
At least I know why the patch I posted earlier was breaking things (it was due to the sp_repr_n_children breakage).
Now that this is sorted (everything seems stable) I'm going to go ahead and commit the patch.
-mental

On Wed, 8 Sep 2004, Jon A. Cruz wrote:
Can you try valgrind? In general I've had much better luck with that. (It also catches many things that EF doesn't)
livarot sets valgrind screaming, but previous attempts to fix livarot had broken it (more) horribly and we had to revert the changes.
The livarot warnings are mostly harmless (uninitialized data which is copied, but never actually used), but they drown out anything else that might be going on.
When I attempted to fix that, I did eventually manage to get rid of the uninitialized value warnings, but livarot had also stopped working. I don't know if the remainder of the valgrind warnings (e.g. out of bounds array accesses) at that point were my fault or fred's.
I am trying to fix it again, taking a more gradual approach this time.
-mental

MenTaLguY wrote:
When I attempted to fix that, I did eventually manage to get rid of the uninitialized value warnings, but livarot had also stopped working. I don't know if the remainder of the valgrind warnings (e.g. out of bounds array accesses) at that point were my fault or fred's.
Ooooh. That's possibly bad.
(BTW, this is for the benefit of others, Mental is pretty up on these things)
The most likely thing is some accidental change that broke things.
However... in clearing up warnings of that type, it's possible that some buggy behavior was depended on, and cleaning things up properly "broke" some work-around that was elsewhere. Or just some sloppy code that just accidentally worked.
I am trying to fix it again, taking a more gradual approach this time.
Always good.
Of course, getting everyone possible to join in on "zero-warnings" is a good goal. Sometimes we have to be realistic about hitting it, but keeping it in mind helps get there slow and steady.

On Tue, 2004-09-07 at 04:20, bulia byak wrote:
On Tue, 07 Sep 2004 02:28:34 -0400, MenTaLguY <mental@...3...> wrote:
I've been working on some of the groundwork for the layers dialog; doing a routine test build, I discovered that Inkscape had started crashing.
You mean, it's crashing with this patch but is OK without it?
Yes. At least for me.
I can't test because I get this when trying to compile with your patch:
make: *** No rule to make target `widgets/document-tree-model.cpp', needed by `widgets/document-tree-model.o'. Stop.
I shouldn't have included the changes to src/widgets/Makefile_insert; you can safely omit them as the code in the missing files is not used by anything yet.
Anyway, all I can say is that this seems to me somewhat similar to what we had on Windows recently - misleading tracebacks, sigc connection crashes etc. Maybe some problem with boehm again. Just a guess.
I hope not. If it is a boehm issue we need to ask what we're doing differently than the other projects which have been using it without problems (gcc, for example).
-mental
participants (5)
- Bob Jamison
- bulia byak
- Carl Hetherington
- Jon A. Cruz
- MenTaLguY