
Quoting bulia byak <buliabyak@...400...>:
On 10/17/05, jiho <jo.irisson@...400...> wrote:
http://wiki.inkscape.org/cgi-bin/wiki.pl?XML_Repair I provided links to the already well known files and tried to summarize the discussion here.
Does anyone know about any software that can parse invalid XML and convert it to valid? I'd be interested to have a look. This is probably a complex task so we better not try to solve it all by ourselves, and chances are someone has already tried to solve it.
About 90% of my (paid) programming work has involved writing code to process partially malformed or corrupted data, so I guess this is kind of my field...
A rather key question is: invalid in what way?
It is very unlikely that there is a suitable pre-packaged solution out there for us.
In general you must find an appropriate set of heuristics that do best at inferring intent for _your particular application domain_, and the most common ways in which documents may end up malformed for that domain.
This is because there is a tradeoff: whatever sorts of corruption you choose your heuristics to correct, it will be at the expense of the ability to repair others.
As for us specifically...
Basically we will need to write a heuristic XML parser which can be used on invalid documents for which libxml's parser fails. It doesn't need to be perfect; it just needs to do well enough to get a reasonable tree in memory. Then we can perform further repairs at the SVG (rather than XML) level.
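
To make the shape of that concrete, here is a minimal sketch in Python rather than anything resembling real Inkscape code. It assumes the only corruptions we care about are a truncated trailing tag, unclosed elements and stray close tags; the names close_open_tags and parse_lenient are made up for illustration, and the standard library's strict parser stands in for libxml. It ignores CDATA sections and '>' inside attribute values, which a real heuristic parser could not.

```python
import re
import xml.etree.ElementTree as ET

# Matches start, end and self-closing tags. Declarations (<?xml ...?>) and
# comments (<!-- ... -->) do not match and pass through as plain text.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)([^<>]*?)(/?)>")


def close_open_tags(text):
    """Hypothetical heuristic: balance tags so a strict parser accepts the text."""
    # A '<' with no '>' after it means the file was cut off mid-tag: drop it.
    last_lt = text.rfind("<")
    if last_lt != -1 and text.find(">", last_lt) == -1:
        text = text[:last_lt]

    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        closing, name, _attrs, self_closing = m.groups()
        if closing:
            if name in stack:
                # Close anything left open inside this element first.
                while stack[-1] != name:
                    out.append("</%s>" % stack.pop())
                stack.pop()
                out.append(m.group(0))
            # Otherwise it is a stray close tag and is dropped.
        else:
            if not self_closing:
                stack.append(name)
            out.append(m.group(0))
    out.append(text[pos:])
    # Close whatever is still open at end of input, innermost first.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)


def parse_lenient(text):
    """Try the strict parser first; fall back to the heuristic repair."""
    try:
        return ET.fromstring(text), False
    except ET.ParseError:
        return ET.fromstring(close_open_tags(text)), True


root, repaired = parse_lenient('<svg><g><rect x="1"/><path d="M0 0"></g>')
```

Note that this already shows the tradeoff: by choosing to close unclosed elements at the nearest matching close tag, it guesses wrong for documents where the close tag itself was the corrupted part.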
Before that, though, an important first step is to build a corpus of representative malformed documents.
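
One cheap way to start answering "invalid in what way?" once there is a corpus is to run a strict parser over every file and tally the first error each one hits. A rough sketch, again in Python with the standard library's expat-based parser standing in for libxml; classify and tally are hypothetical names, and the samples are invented stand-ins for real corpus files:

```python
import collections
import xml.etree.ElementTree as ET
from xml.parsers import expat


def classify(text):
    """Return 'well-formed' or the strict parser's first error message."""
    try:
        ET.fromstring(text)
        return "well-formed"
    except ET.ParseError as err:
        # err.code is the expat error code; ErrorString turns it into text.
        return expat.ErrorString(err.code)


def tally(samples):
    """Count how many documents fall into each error category."""
    return collections.Counter(classify(t) for t in samples)


# Invented samples; a real run would read the corpus files instead.
counts = tally(["<a/>", "<a><b></a>", "<a>"])
for label, n in counts.most_common():
    print(n, label)
```

The first error only says where the strict parser gave up, not what actually went wrong with the file, so this is a starting point for sorting the corpus and not a diagnosis.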
-mental