Hello all
Currently we have some handling for hrefs (in the form of Inkscape::URIReference). It provides some signals and callbacks, but doesn't do everything that I would like it to. To be specific, there are several places where solving the following problems efficiently would be a great thing:
1. Given an object, iterate over all references to it.
2. Given an object, iterate over all objects it refers to.
Point 1 is useful for all sorts of things: marking objects for update, compensating clones, deletion, vacuum defs, etc. Point 2 would allow us to remove 80% of the clipboard code, and to implement robust ID changing and clash prevention (that is, a scheme that doesn't have to be updated whenever a new href-like attribute is added).
Here is my idea. We define a three-way smartpointer that looks like this:
class ObjectRefBase {
    ...
    SPObject *obj;
    ObjectRefBase *next;
    ObjectRefBase *next_ref;
};

template <typename T>
class ObjectRef : public ObjectRefBase { ... };

template <typename T>
ObjectRef<T> const &getRef(T *obj) {
    return static_cast<ObjectRef<T> const &>(*obj);
}
The 'next' and 'next_ref' pointers form singly-linked cyclic lists. Each SPObject contains such a smartpointer (by protected inheritance: 'class SPObject : protected ObjectRefBase, ... {') that can be accessed using the getRef function, which is a templated friend of SPObject and returns an ObjectRef of the correct type. The header of each list is the instance embedded in the relevant object; it has 'obj' set to itself. 'next' lets us iterate over references to this object, while 'next_ref' lets us iterate over references contained in this object. The constructor might look like this:
ObjectRefBase(SPObject *owner) {
    if (owner == this) {
        // embedded header instance: both lists start out empty
        obj = owner;
        next = this;
        next_ref = this;
    } else {
        // plain reference: link into the owner's list of contained refs
        next_ref = owner->next_ref;
        owner->next_ref = this;
        obj = NULL;
        next = NULL;
    }
}
If this is the embedded instance from protected inheritance, the first test will be true and both lists will be initialized as empty (an empty cyclic list is represented by the header node having a next pointer pointing to itself). The assignment operator is as follows:
ObjectRefBase &operator=(ObjectRefBase const &other) {
    if (obj != NULL) {
        // unlink from the previous object's list of references
        ObjectRefBase *cur = this;
        while (cur->next != this) cur = cur->next;
        cur->next = next;
    }
    // link into the new object's list, right after its header;
    // const_cast because getRef hands out const references
    ObjectRefBase &head = const_cast<ObjectRefBase &>(other);
    obj = head.obj;
    next = head.next;
    head.next = this;
    return *this;
}
If the reference pointed to some object, the function walks the list and removes the reference from the linked list of the previous object; it then inserts itself into the linked list of the new object. The destructor works similarly:
~ObjectRefBase() {
    if (obj != NULL) {
        // unlink from the list of references to the pointed-to object
        ObjectRefBase *cur = this;
        while (cur->next != this) cur = cur->next;
        cur->next = next;
    }
    // unlink from the owner's list of contained references
    ObjectRefBase *cur = next_ref;
    while (cur->next_ref != this) cur = cur->next_ref;
    cur->next_ref = next_ref;
}
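To make the usage concrete, here is a hypothetical owner (SPUse is just an example, and I'm assuming ObjectRef<T> forwards the owner pointer to the base constructor):

class SPUse : public SPObject {
    ObjectRef<SPItem> child_ref; // linked into this object's next_ref list on construction
public:
    SPUse() : child_ref(this) {}
    void setChild(SPItem *item) {
        // the assignment operator unlinks from the old target's list
        // and links into item's list - no attach()/release() calls needed
        child_ref = getRef(item);
    }
};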
Now we can iterate over the references to an object like this:
for (ObjectRefBase *cur = this->next; cur != this; cur = cur->next) {
    // do something with each reference to this object
}
And over references contained in an object like this:
for (ObjectRefBase *cur = this->next_ref; cur != this; cur = cur->next_ref) {
    // do something with each reference stored in this object
}
Testing whether anything refers to the object is dead simple: this == next means there are no references to it. Same for checking whether this object refers to anything: this == next_ref means it doesn't. The owner of a reference can be retrieved in linear time, by walking next_ref until we reach the node whose 'obj' points to itself - that node is the owner's embedded header. All of this can of course be encapsulated in a nice STL-like iterator interface which provides some extra type safety not present in ObjectRefBase (for example, we know that all ObjectRefs we can reach via 'next' point to the same type). Handling CSS properties with URI references that are not members of the object is easy as well: an ObjectRef doesn't need to be a member of its owner for all this to work. There are no boilerplate attach() / release() methods: the assignment operator of ObjectRef takes care of everything.
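For illustration, the iterator over references to an object could be sketched like this (names are made up, and the iterator would have to be a friend of ObjectRefBase):

template <typename T>
class RefIterator {
    ObjectRefBase *_cur;
public:
    explicit RefIterator(ObjectRefBase *node) : _cur(node) {}
    ObjectRef<T> &operator*() const {
        // safe: every node in a 'next' cycle refers to the same type
        return static_cast<ObjectRef<T> &>(*_cur);
    }
    RefIterator &operator++() { _cur = _cur->next; return *this; }
    bool operator!=(RefIterator const &other) const { return _cur != other._cur; }
};
// begin() would wrap the first node after the header; end() wraps the header itself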
If walking the lists during reassignment and destruction turns out to be a performance problem, we can make both lists doubly-linked rather than singly-linked, so that a node can be removed in constant time, at the price of two more pointers stored in each reference. Fast owner retrieval additionally requires storing the owner pointer. This gives 5 pointers of overhead on top of 'obj', so for 10000 references on a 64-bit machine we get about 390KB of overhead - should be manageable. I thought about a trick that can be used to implement constant-time erase on a cyclic singly-linked list (the iterator actually points one node before the current one, and we use cur->next->value to get the value), but it can't be applied here.
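In the doubly-linked variant, removal reduces to the usual constant-time unlink; a sketch, assuming 'prev' and 'prev_ref' back-pointers are added to ObjectRefBase:

void ObjectRefBase::unlink() {
    // remove from the cycle of references to the pointed-to object
    prev->next = next;
    next->prev = prev;
    // remove from the owner's cycle of contained references
    prev_ref->next_ref = next_ref;
    next_ref->prev_ref = prev_ref;
}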
Why is this better than keeping pointers to the references in std::list or std::vector?
- no memory allocations when adding a reference = better performance.
- removing a reference is much simpler, because finding it in the list is not needed.
- signals can be put in the ObjectRef class.
- simple and elegant implementation.
What do you think? Improvements and discussion welcome.
Regards, Krzysztof
On Aug 12, 2009, at 8:41 AM, Krzysztof Kosiński wrote:
Here is my idea. We define a three-way smartpointer that looks like this:
Well... first I'd avoid calling it a "smartpointer". That tends to wander into existing terminology for things that are very often poorly used.
Why is this better than keeping pointers to the references in std::list or std::vector?
- no memory allocations when adding a reference = better performance.
That may or may not be true. However the bottom line for performance items is to get solid metrics before and after. Also a key is to only optimize something that is shown to be a performance bottleneck.
Unfortunately I've often seen micro-optimizations lead to macro-degradation.
- removing a reference is much simpler, because finding it in the list is not needed.
- signals can be put in the ObjectRef class.
- simple and elegant implementation.
Some of the issues are nice. However the overall impression I get here is of a system that breaks encapsulation. We should look to use this more in a situation where "smaller" owned things do not know about things that own them, etc.
While it does strike me as good code, it also looks like a less-than-efficient algorithm/architecture.
The signals part also raises some major issues. At the moment we have two trees of data (SPObject *and* xml repr) that tend to get out of sync. Among other things there are several layer bugs related to that. What we probably should focus on is a higher-level solution that accounts for the two parallel trees, signals, event bubbling, event canceling, etc. I think that will bear more long-term fruit.
Also it is good that you've seen the id's as a problem. The code early on needed to put an id attribute on every single XML element. We have cleaned up much in the code and should be able to further attack things on that point. Why is this pertinent? Well, if one has id's on everything then ID changing and clash prevention is a huge problem. However, if instead of having thousands of IDs in a document, one only has a dozen, then the problem is a completely different one, with different parameters, scaling needs, etc.
I guess what I'm trying to point out is that instead of focusing on an optimized list structure, we need to look at an efficient overall solution for the whole use case of Inkscape, including parallel data trees, shadow trees, event propagation life-cycles, etc. Then as we drill-down on the larger problem, we might hit on tunings to use this data type. Or we might see a need for a few similar-yet-distinct types.
- no memory allocations when adding a reference = better performance.
That may or may not be true. However the bottom line for performance items is to get solid metrics before and after.
ObjectRefBase acts as an intrusive list. Look here:
http://www.boost.org/doc/libs/1_39_0/doc/html/intrusive/performance.html
In particular, look at the back insertion / deletion benchmark (what we're doing when adding or replacing a reference) - it shows at least 3x better performance for an intrusive list of small objects.
Some of the issues are nice. However the overall impression I get here is of a system that breaks encapsulation.
In what way? It doesn't try to refer to implementation details of any particular SPObject; it just stores some of the data passed to its methods. If this solution breaks encapsulation, then virtual functions do as well (we could have a virtual function that returns a list of references, but that would result in a LOT of error-prone boilerplate, which this solution avoids completely).
The signals part also raises some major issues. At the moment we have two trees of data (SPObject *and* xml repr) that tend to get out of sync.
This issue is not related to the problem I'm trying to solve at all...
Also it is good that you've seen the id's as a problem. The code early on needed to put an id attribute on every single XML element. We have cleaned up much in the code and should be able to further attack things on that point. Why is this pertinent? Well, if one has id's on everything then ID changing and clash prevention is a huge problem. However, if instead of having thousands of IDs in a document, one only has a dozen, then the problem is a completely different one, with different parameters, scaling needs, etc.
The fact that you don't have IDs on every object doesn't change anything: it's sufficient to open the same document twice, modify one copy, and paste from one to the other to get an ID conflict. So we still need those capabilities.
I guess what I'm trying to point out is that instead of focusing on an optimized list structure,
The list structure is only a means to an end, which I outlined at the beginning: to be able to iterate over references to an object, and to iterate over references in an object. Those very generic capabilities will make a lot of code simpler, and they're not really related to the structure of the document. The concept of reference between objects will not go away no matter how we change the SP tree.
Regards, Krzysztof
On Aug 13, 2009, at 4:27 PM, Krzysztof Kosiński wrote:
- no memory allocations when adding a reference = better performance.
That may or may not be true. However the bottom line for performance items is to get solid metrics before and after.
ObjectRefBase acts as an intrusive list. Look here:
http://www.boost.org/doc/libs/1_39_0/doc/html/intrusive/performance.html
In particular, look at the back insertion / deletion benchmark (what we're doing when adding or replacing a reference) - it shows at least 3x better performance for an intrusive list of small objects.
Yes, but you're missing the main point.
You are benchmarking a *structure*, not an *application*.
It's the use case that is important; what happens in the context of actual work is what really matters. Now I don't think you would ever fall into this trap, but the classic mistake here is spending time optimizing the system idle loop.
Much like in the early days of Pentium processors, many people would use floating point because the CPU finally had the coprocessor built-in. However, they'd miss that getting things in and out of the fp registers stole more time than was gained, and they would end up with a net loss.
Or conversely they would use fixed-point integer implementations "for performance", but could gain actual app speedups by using floating point variables and staying in fp.
A more recent example is using a GPU to process calculations. For many things, including some showcased at the most recent LGM, the main bottleneck is going in and out of the graphics card/memory. The chips themselves can do operations much faster than the overall system can feed them.
On Aug 13, 2009, at 4:27 PM, Krzysztof Kosiński wrote:
Also it is good that you've seen the id's as a problem. The code early on needed to put an id attribute on every single XML element. We have cleaned up much in the code and should be able to further attack things on that point. Why is this pertinent? Well, if one has id's on everything then ID changing and clash prevention is a huge problem. However, if instead of having thousands of IDs in a document, one only has a dozen, then the problem is a completely different one, with different parameters, scaling needs, etc.
The fact that you don't have IDs on every object doesn't change anything: it's sufficient to open the same document twice, modify one copy, and paste from one to the other to get an ID conflict. So we still need those capabilities.
Yes... but the *parameters* of the problem matter.
How often will it need to be done? How many items will need to be fixed? How long will it take? Is it in reaction to a user-initiated item? etc.
On Aug 13, 2009, at 4:27 PM, Krzysztof Kosiński wrote:
The signals part also raises some major issues. At the moment we have two trees of data (SPObject *and* xml repr) that tend to get out of sync.
This issue is not related to the problem I'm trying to solve at all...
Ah, but actually it is *very* related to the problem.
If you can't see the relation there, back up and try to ponder it from different angles. It seems you're a bit down in the middle of the "how" of doing things, and need to keep the "why" in mind also.
On Aug 13, 2009, at 4:27 PM, Krzysztof Kosiński wrote:
I guess what I'm trying to point out is that instead of focusing on an optimized list structure,
The list structure is only a means to an end, which I outlined at the beginning: to be able to iterate over references to an object, and to iterate over references in an object. Those very generic capabilities will make a lot of code simpler, and they're not really related to the structure of the document. The concept of reference between objects will not go away no matter how we change the SP tree.
OK. I think this is the crux of the matter.
What you've stated is not an "ends" at all. You've just stated one "means" in support of another "means".
===================
Q: *How* do you want to do that?
A: By iterating over references to an object.

Q: *How* are you going to iterate over them?
A: By using a nice triple-pointer list structure.
===================
So you've actually stated two "how" items and not any "why" items. The "how" items correspond to "means", and the "why" items correspond to "ends". Also, using the wording of my first example question there, you've not really stated what "that" is.
===================
Q: *Why* do you want to iterate over references to an object?
A: Umm.... I dunno. Just some things.
===================
Often different structures are appropriate for solving different problems. I've personally seen where an array of pointers one walks and compares vastly outperforms "smarter" structures (I can go into that separately if you're interested). If you're trying to solve several different problems with one single structure then that's a red-flag.
I was thinking that the best thing to move forward would be to do a simple wiki page to list out specific use cases where this structure would be helpful. At the end, we might see that only one structure is needed. However, it might be that several end up more useful. I'm thinking of things like the DocumentSubset base class.
Jon A. Cruz wrote:
===================
Q: *Why* do you want to iterate over references to an object?
A: Umm.... I dunno. Just some things.
===================
I don't understand you. I explicitly said why it's needed, at the very beginning. I'll repeat that list for you; a (1) means the task would be simplified by "find hrefs to an object", and a (2) by "find hrefs in an object".
1. Generic clipboard code that does not require updates whenever a new type of SPObject that can refer to something is added (2).
2. Robust ID collision resolution (1).
3. Deleting objects that are referred to - currently deleting a gradient in the XML editor causes junk to be displayed and usually a crash (1).
4. Better vacuum defs - something like mark-and-sweep could be used instead of the current simple refcounting scheme (2).
5. Editing a clipping path or mask that is applied to several objects - the outline should be shown over each item the mask is applied to (1).
6. Not having to redraw everything when a gradient, pattern, etc. is changed (1).
Often different structures are appropriate for solving different problems. I've personally seen where an array of pointers one walks and compares vastly outperforms "smarter" structures (I can go into that separately if you're interested). If you're trying to solve several different problems with one single structure then that's a red-flag.
There are only 2 problems to solve, and there are 2 separate but identical data structures that happen to have their data in the same object.
I was thinking that the best thing to move forward would be to do a simple wiki page to list out specific use cases where this structure would be helpful.
See above. I may put this on the wiki if you want, but I'm not sure it would get more exposure than here.
Regards, Krzysztof
On Aug 14, 2009, at 2:23 PM, Krzysztof Kosiński wrote:
Jon A. Cruz wrote:
===================
Q: *Why* do you want to iterate over references to an object?
A: Umm.... I dunno. Just some things.
===================
I don't understand you. I explicitly said why it's needed, at the very beginning. I'll repeat that list for you; a (1) means the task would be simplified by "find hrefs to an object", and a (2) by "find hrefs in an object".
No.
To clarify, you did not really explicitly say the "why" of the ends, just the "how" of the means you were looking at.
As you say, "the task" is an *implicit* reference to the "why", but you did not really address what that was.
And to be clear, "find hrefs to object" and "find hrefs in an object" are not high-level "why" items, but low-level "how" items.
1. Generic clipboard code that does not require updates whenever a new type of SPObject that can refer to something is added (2).
2. Robust ID collision resolution (1).
3. Deleting objects that are referred to - currently deleting a gradient in the XML editor causes junk to be displayed and usually a crash (1).
4. Better vacuum defs - something like mark-and-sweep could be used instead of the current simple refcounting scheme (2).
5. Editing a clipping path or mask that is applied to several objects - the outline should be shown over each item the mask is applied to (1).
6. Not having to redraw everything when a gradient, pattern, etc. is changed (1).
Ooh. Good list.
#1 is a good problem.
#2 is an "interesting" problem that might be much smaller than expected. This definitely does not require a high-performance structure.
#3 is a very large and complicated problem. For this I can see that some UI work and such would be required.
#4 is another large problem that is more than just "who is referred to". That would be part of the solution, but more work is needed. Things like "not referenced but still desired" are critical but hard to implement without some thought.
#5 is technically not solved by just that simple structure. It could be *part* of the solution, but there really is a lot more to the problem.
#6 is not nearly as simple as you might guess. Using the simple structure chain to try to determine that one will fall down pretty quickly. Among other things that involves bounding boxes, transformations, intersections of bounding boxes, etc.
Often different structures are appropriate for solving different problems. I've personally seen where an array of pointers one walks and compares vastly outperforms "smarter" structures (I can go into that separately if you're interested). If you're trying to solve several different problems with one single structure then that's a red-flag.
There are only 2 problems to solve, and there are 2 separate but identical data structures that happen to have their data in the same object.
Ah, and this is where you're missing things. There are more than two problems. There are many, many problems, but two proposed implementations to address those problems.
If you can't see more than 2 problems, then you really need to take a step back and think about it. From an architectural viewpoint those two are both "means" and not "ends". Thus they themselves are not the problems but only the means you are looking at to be able to solve some problems.
I was thinking that the best thing to move forward would be to do a simple wiki page to list out specific use cases where this structure would be helpful.
See above. I may put this on the wiki if you want, but I'm not sure it would get more exposure than here.
I'll get that going, and include a breakdown. The wiki and chat rooms are much better communication channels to hash out such details as these. (Emails tend to get fragmented and hard to follow, etc.)
Just for an example, I'll start discussion on one point (look for it in the wiki soon):

Problem: avoid ID collision
When does this occur?
* when a user pastes content from another SVG document
* when a user changes an object's ID via the XML editor or object properties
These are two different use cases. The second use case is quite simple to address. A *highly* efficient structure to address this would be a map of IDs used in the document. Upon attempting to commit the ID change in the UI, a simple check for presence in the map would result in an immediate pass or fail. If, on the other hand, the three-way pointer were relied upon, then in order to detect if another object had the same ID the code would have to walk the *entire* document tree, visiting each node and checking its ID.
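As a rough sketch of that check (the map and its upkeep are hypothetical, not current code):

#include <map>
#include <string>

class SPObject; // the document object type

// hypothetical per-document map, maintained as objects are added/removed
std::map<std::string, SPObject *> idMap;

bool idIsAvailable(std::string const &candidate) {
    // immediate pass or fail - no document tree walk required
    return idMap.find(candidate) == idMap.end();
}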
So the tri-pointer really is not a solution for the "avoid ID collision" problem in that second use case.
For the first use case I believe we need to define what the requirement actually is. Here is a first cut:
"When pasting content from the clipboard, modify any colliding ID's to be new unique ID's"
Sounds good. But then we still have the first half of the problem where we need to walk *both* the existing document tree and also the clipboard tree to find each ID. If it were implemented with no other structures that would lead to a quadratic slowdown, so we can't do that. Ok.
So we'll assume that we have a set that contains the existing IDs in a document. Now we can walk the pasted-in clipboard tree and check for any conflicts. So we'll just do a complete tree walk of that new subset (it will probably be smaller than that of the main document, so that will help), checking each node for an ID and, if it has a conflicting one, adding it to a set of IDs to be changed.
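In rough code that walk might look like this (a sketch only; getId(), firstChild() and getNext() stand in for whatever accessors we actually end up using):

#include <set>
#include <string>

void collectConflicts(SPObject *node, std::set<std::string> const &docIds,
                      std::set<std::string> &toChange) {
    char const *id = node->getId();          // assumed accessor
    if (id != NULL && docIds.count(id) > 0) {
        toChange.insert(id);                 // conflicting ID - remember it
    }
    for (SPObject *child = node->firstChild(); child != NULL;
         child = child->getNext()) {         // assumed child iteration
        collectConflicts(child, docIds, toChange);
    }
}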
Interesting. So now we have a set of IDs that are in the incoming tree that need to be changed. For each one we'll have to come up for a replacement that does not collide with the existing tree *or* the new replacements. We also will need to find each object with that id, etc. So perhaps instead of a set of IDs we should have built a set of SPObjects that need ID changes. Hmmm... and to properly avoid new collisions we'll need to create and update a list of *all* IDs in the incoming tree. So we can check any proposed replacement for presence in both the main tree and the incoming tree. Once we have that, we can visit the list of objects to be updated and change their IDs before merging into the main tree. Voila! then we're done.
Oop! no we're not. What if some of the things that were copied are dangling references! Hmmmm... what should we do about that? Oh, and what if we're copying from the same document? We might want a half-copy that will keep references to gradients, etc., and have those gradients also, just in case we're pasting into a different document. And what if we have a "just in case" gradient with the same ID as in the destination document... and that gradient has the same properties? Should we assume it is the same and not change the ID???
eeek
But... as I come to the end of this first pass analysis, I notice something quite interesting. The tri-pointer would only ever help with "ID changing" and not really with "clash prevention" itself.
Hmmm... and while we're at it, the proposed class does seem a bit heavyweight for the problem it wants to solve. Also I don't see anything addressing the problem of pinning. I would think we would need weak references in here. That's another thing to add to the consideration.
Bottom line... this is definitely a problem to hash out online in the chat room and/or on the wiki. Just looking at that one sub-section I typed out really lets me see email will not be the best medium for the discussion.
Jon A. Cruz wrote:
To clarify, you did not really explicitly say the "why" of the ends, just the "how" of the means you were looking at.
Let's keep the philosophy down to a minimum.
#2 is an "interesting" problem that might be much smaller than expected. This definitely does not require a high-performance structure.
There is nothing "high performance" about the structure I proposed - it's only a way to keep track of the needed information with minimal boilerplate. It is not a performance-oriented solution, it is a maintainability- and feature-oriented one. It might give better performance in some cases, but it's only a pleasant side effect.
#3 is a very large and complicated problem. For this I can see that some UI work and such would be required.
For deleting gradients and patterns, it is simple: clear the style or replace it with the default style. For masks, remove the mask. It might require some work for cases like deleting an LPE parameter.
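With the proposed structure that fixup could be sketched like this (unset() is a hypothetical ObjectRef method that clears the href on its owner, e.g. by rewriting the style property to the default):

void detachAllReferences(SPObject *obj) {
    // the embedded header of 'obj' heads the cycle of references to it
    ObjectRefBase *head = const_cast<ObjectRef<SPObject> *>(&getRef(obj));
    for (ObjectRefBase *cur = head->next; cur != head; ) {
        ObjectRefBase *after = cur->next; // save - unset() unlinks 'cur'
        static_cast<ObjectRef<SPObject> *>(cur)->unset();
        cur = after;
    }
}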
#4 is another large problem that is more than just "who is referred to". That would be part of the solution, but more work is needed. Things like "not referenced but still desired" are critical but hard to implement without some thought.
Vacuum defs is intended as a cleanup action that will remove all unused objects to minimize file size. If you want to keep unreferenced invisible objects, then don't invoke it - simple. The problem of what to keep and what to delete is already solved by collection policies, and using smart references just makes the implementation able to catch more unused items than now.
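A sketch of the mark phase (it would be started from the visible drawing content, not from <defs>; the child accessors are assumed):

#include <set>

void mark(SPObject *obj, std::set<SPObject *> &live) {
    if (!live.insert(obj).second) return; // already visited
    // follow every smart reference stored in this object
    ObjectRefBase const *head = &getRef(obj);
    for (ObjectRefBase const *cur = head->next_ref; cur != head;
         cur = cur->next_ref) {
        if (cur->obj != NULL) mark(cur->obj, live);
    }
    // and descend into document children
    for (SPObject *child = obj->firstChild(); child != NULL;
         child = child->getNext()) {
        mark(child, live);
    }
}
// sweep: anything in <defs> missing from 'live' can be removed,
// subject to the collection policy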
#5 is technically not solved by just that simple structure. It could be *part* of the solution, but there really is a lot more to the problem.
I have already devoted some time to this problem, and the only thing missing at the moment that prevents me from adding this feature to the node tool is the inability to reliably find all references to an object.
#6 is not nearly as simple as you might guess. Using the simple structure chain to try to determine that one will fall down pretty quickly. Among other things that involves bounding boxes, transformations, intersections of bounding boxes, etc.
The renderer does not know about transforms - they are the responsibility of the SP layer. The rest depends on being able to determine which objects' appearance changed. For this we need smart references in the SP layer.
If you can't see more than 2 problems, then you really need to take a step back and think about it. From an architectural viewpoint those two are both "means" and not "ends". Thus they themselves are not the problems but only the means you are looking at to be able to solve some problems.
The distinction is purely philosophical. Both "find things to copy into the clipboard when user presses Ctrl+C" and "find objects which refer to this one" are valid problems ("ends"), but with different degrees of sophistication and varying relevance to the end user and developers. The proposed capability is very generic and I can assure you that solving those basic problems efficiently will lead to better solutions to more complicated problems.
Problem: avoid ID collision
When does this occur?
* when a user pastes content from another SVG document
* when a user changes an object's ID via the XML editor or object properties
* when the user changes the ID from the Object Properties dialog
* when importing another document (should actually use the same code path as pasting, but doesn't at the moment)
Actually your point 2 is problematic. Personally I would expect the XML editor to remove all references to the changed ID from the document, and the object properties dialog to update references to the new ID.
These are two different use cases. The second use case is quite simple to address. A *highly* efficient structure to address this would be a map of IDs used in the document. Upon attempting to commit the ID change in the UI, a simple check for presence in the map would result in an immediate pass or fail. If, on the other hand, the three-way pointer were relied upon, then in order to detect if another object had the same ID the code would have to walk the *entire* document tree, visiting each node and checking its ID.
The hard part of the problem is not determining whether some ID exists in the document, because this is already handled by XML::Document and its getObjectById method. The hard part is the ID update. Smart hrefs in no way preclude using an ID map. They are not intended to solve the "does this ID exist in the document" problem, because they *don't even know what the ID string actually is* (though their subclasses might), and we have other, better methods to solve that problem (ID map).
The ID collision problem can actually be decomposed into several parts (means of solving each in parentheses):
1. Find all IDs used in the pasted document. (ID map of the pasted doc)
2. For each ID, check whether it exists in the current document. (ID maps of both documents)
3. If it does, find the object with that ID in the pasted document. (ID map)
4. Generate a new unique ID. (ID map, random number generator)
5. For each object that references the colliding object, change the ID it uses, then change the ID of the object. (smart hrefs + subclassing of ObjectRef)
Currently we can solve parts 1, 2, 3 and 4, but have big problems with part 5.
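In code, part 5 could be sketched like this (setId() and setHrefId() are hypothetical; each ObjectRef subclass would know how its owner spells the reference - xlink:href, a style property, an LPE parameter, and so on):

#include <string>

void renameColliding(SPObject *obj, std::string const &fresh) {
    // first rewrite every referrer, then the object itself
    ObjectRefBase const *head = &getRef(obj);
    for (ObjectRefBase const *cur = head->next; cur != head; cur = cur->next) {
        const_cast<ObjectRefBase *>(cur)->setHrefId(fresh); // virtual, per subclass
    }
    obj->setId(fresh.c_str()); // assumed setter
}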
Oop! no we're not. What if some of the things that were copied are dangling references! Hmmmm... what should we do about that? Oh, and what if we're copying from the same document we might want a half- copy that will keep references to gradients, etc, and have those gradients also just in case we're pasting into a different document.
The safe thing to do is to create new gradients even if pasting into the same document, and let vacuum defs clean up gradients that are identical.
But... as I come to the end of this first pass analysis, I notice something quite interesting. The tri-pointer would only ever help with "ID changing" and not really with "clash prevention" itself.
Yes, because it's designed to simplify that part, and incidentally this is the part that causes problems. The other parts are easy because we already have this information available in another structure.
Hmmm... and while we're at it, the proposed class does seem a bit heavyweight for the problem it wants to solve. Also I don't see anything addressing the problem of pinning. I would think we would need weak references in here. That's another thing to add to the consideration.
What is this 'pinning problem'? If you mean something like cyclic references keeping things in the document, then it's not an issue, because this should be handled by vacuum defs (that's why I mentioned a mark-and-sweep algorithm). Finally, I wouldn't consider 6 pointers "heavyweight". Maybe if ObjectRef were intended to be used in millions of copies it would make some difference, but realistically most documents will contain fewer than 100 references.
Regards, Krzysztof
On Fri, 2009-08-14 at 17:09 -0700, Krzysztof Kosiński wrote:
Jon A. Cruz wrote:
To clarify, you did not really explicitly say the "why" of the ends, just the "how" of the means you were looking at.
Let's keep the philosophy down to minimum
Actually, the philosophy is the *key* to solving engineering problems.
And the philosophy on this particular point is the one key thing Fred Brooks realized he got wrong in the original Mythical Man-Month and had to come back years later to correct.
However... I think these mails prove my point about the discussion. They have gotten so long that by the end of your mail you've stated the opposite about vacuum defs from what you asserted at the beginning.
So PLEASE, drop in to the chat room more so that things will be easier to discuss. Or at least wait for me to get that wiki page set.