URI is EVIL!!!

Jon A. Cruz

29 Sep 2009

6 a.m.

Hi,

Just wanted to point out that technically URI classes are not usable for our assets.

For example, it is tempting to use a URI to resolve an asset against the base document path. The problem is that a URI as defined by the RFCs is not usable for arbitrary file paths. GLib and friends implement URIs properly, therefore we can't use them for files.

In the future we can probably use IRIs or our own classes for references, assets, etc. However, URIs are not usable.

I'm pointing this out now because we are just about to release and I have tracked a critical bug down to this. In the near term we should search through the codebase for improper uses of URI classes. The tricky thing is that they will appear to function, since common path names are "safe." However, as soon as unexpected characters come in (like ampersands), all bets are off.

To summarize:

If we are linking something that is represented by a file (images, color profiles, etc.), it can't be processed with a URI class or URI functions. That will work for simple cases, but not all.
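
The "works for simple cases, but not all" behaviour can be sketched in a few lines. This is only an illustration, with Python's urllib.parse standing in for whatever URI class is in use; the helper names and paths are hypothetical and are not taken from the Inkscape codebase.

```python
from urllib.parse import quote, unquote, urlsplit

def naive_file_uri(path: str) -> str:
    # Hypothetical helper: pastes the raw path into the URI unescaped.
    return "file://" + path

def escaped_file_uri(path: str) -> str:
    # Percent-encodes every character that is not unreserved (keeps "/").
    return "file://" + quote(path)

def path_from_uri(uri: str) -> str:
    # Parse the URI, then undo percent-encoding on the path component.
    return unquote(urlsplit(uri).path)

safe_path = "/home/user/art/logo.png"      # made-up example paths
tricky_path = "/home/user/R&D/logo#1.png"

# A "safe" name survives even the naive conversion, which hides the bug.
safe_result = path_from_uri(naive_file_uri(safe_path))

# '#' starts a fragment, so the naive URI silently loses part of the name.
naive_result = path_from_uri(naive_file_uri(tricky_path))

# With escaping on the way in and unescaping on the way out, it round-trips.
escaped_result = path_from_uri(escaped_file_uri(tricky_path))

print(safe_result)
print(naive_result)
print(escaped_result)
```

The point of the sketch is only that the failure is silent: nothing raises, the path simply comes back different.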

Show replies by date

Bob Jamison

29 Sep

11:57 a.m.

On 9/29/2009 1:00 AM, Jon A. Cruz wrote:

...

Jon,

(for those uninterested in URI's, please ignore :-)

I suspect there is a problem either with the URI impl or how it is being used.

But I'm kinda pro URIs. Yes, the URI API as stated in the W3C doc does not have a getNativePath() method as far as I know. However, Java's runtime, XML2, and other libs -do- include such a call, and it is quite useful. I think it has become a de facto requirement.

In conjunction with resolve() and normalize(), you should be able to handle paths in a -predictable- manner. Maybe not what you expect, but always the same.
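
As a rough illustration of what resolving and normalizing a reference against a base document looks like, here is a small Python sketch. It uses urllib.parse.urljoin as a stand-in for the resolve()/normalize() calls mentioned above; the document URI and file names are invented for the example.

```python
from urllib.parse import quote, urljoin

# Hypothetical URI of the SVG document that owns the assets.
base = "file:///home/user/docs/drawing.svg"

# A relative reference is resolved against the document's URI; the "../"
# segment is normalized away as part of the resolution.
resolved = urljoin(base, "../images/logo.png")

# A raw file name has to be escaped before it is used as a reference,
# otherwise '#' would be read as the start of a fragment.
resolved_escaped = urljoin(base, quote("logo#1.png"))

print(resolved)
print(resolved_escaped)
```

The result is predictable in the sense that the same inputs always give the same output, provided the file name was escaped before it entered the URI world.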

As far as IRI's go, in the specs themselves IRI's are described as URI classes which have been hacked to be able to handle Unicode and international information in the full resource paths. We could take either of the two URI impl's in Inkscape and hack them to work nicely in just a few days' work. I just call them all URI's whether they can handle IRI stuff or not.

The case that you mention, an asset pointing back to the owner doc. Is that feasible, or even promised anywhere? I can easily imagine that resolving owner doc->resource can only be guaranteed 1-way. Especially considering symlinks.

Once again, the "sandbox" comes up. It would be a wonderful thing if we had docs operate in a JAR-like sandbox. If you import an image, it gets copied into the sandbox with the SVG doc (unless you request otherwise). HTML editors work this way. The links within the sandbox are simple and predictable.

And when you JAR the directory into an archive, the META-INF directory and manifest are already ready to go.

This is something we needed for collaboration, too.

bob

Jon A. Cruz

6:24 p.m.

On Sep 29, 2009, at 4:57 AM, Bob Jamison wrote:

...

I suspect there is a problem either with the URI impl or how it is being used.

But I'm kinda pro URIs. Yes, the URI API as stated in the W3C doc does not have a getNativePath() method as far as I know. However, Java's runtime, XML2, and other libs -do- include such a call, and it is quite useful. I think it has become a defacto requirement.

Actually Mental and I both have looked into this a fair bit. URI as defined by the spec is not really usable for arbitrary filenames. Java's implementation is interesting, but also falls down. When doing some Java work a while back I recall hitting some of those issues.

For the glib world, I seem to recall that these break down: g_filename_from_uri() and g_filename_to_uri().

One of the trickiest things to deal with is converting to and from file names when those names are *not* in UTF-8. We need to remember to keep those in our testing, as all data internal to Inkscape needs to be Unicode + UTF-8.
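
A minimal sketch of that boundary conversion, assuming the filename encoding is known (here KOI8-R, chosen only as an example). The two helper functions are hypothetical; in GLib terms the closest analogues would be g_filename_to_utf8() and g_filename_from_utf8().

```python
def filename_to_utf8(raw: bytes, filename_encoding: str) -> bytes:
    # Hypothetical boundary helper: decode with the encoding the file
    # system really uses, then re-encode as UTF-8 for internal use.
    return raw.decode(filename_encoding).encode("utf-8")

def filename_from_utf8(internal: bytes, filename_encoding: str) -> bytes:
    # Inverse conversion, applied just before calling file APIs.
    return internal.decode("utf-8").encode(filename_encoding)

# File name bytes as a KOI8-R system would store them (made-up name).
raw_name = "рисунок.svg".encode("koi8_r")

# Treating the native bytes as if they were already UTF-8 fails.
try:
    raw_name.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

internal = filename_to_utf8(raw_name, "koi8_r")
round_trip = filename_from_utf8(internal, "koi8_r")

print(decodes_as_utf8)
print(internal.decode("utf-8"))
print(round_trip == raw_name)
```

A test suite for this kind of code would want at least one such non-UTF-8 name in it, since UTF-8 names pass through both functions unchanged and prove nothing.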

Basically I recall that IRI's should work nicely for our needs, with an implicit assumption that they will work as URIs for most APIs we care about. But using IRIs will keep us from failing in some edge cases.

Krzysztof Kosiński

6:46 p.m.

2009/9/29 Jon A. Cruz <jon@...18...>:

...

For the glib world, I seem to recall that these break down: g_filename_from_uri() g_filename_to_uri()

On what cases? Why?

...

One of the trickiest things to deal with is converting to and from file names when those names are *not* in UTF-8. We need to remember to keep those in our testing, as all data internal to Inkscape needs to be Unicode + UTF-8.

On Linuxes the filename encoding is usually UTF-8 (e.g. anything newer than 5 years). On Windows, the filename encoding used by Glib is UTF-8 by definition. However, the arguments from the command line are in the locale encoding (different from the filename encoding), which is usually a legacy 8-bit codepage. This is the only place where the conversion is non-trivial, and the arguments of main() do not actually contain sufficient information to determine which files were meant. We should therefore never use the arguments of main() for *anything* on Windows.
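
Why the arguments of main() cannot be trusted can be sketched as follows. This only imitates the loss; it assumes a cp1252 locale codepage and uses Python's '?' substitution, whereas the exact substitution an operating system applies may differ.

```python
# File name as stored by the file system (made-up example).
original = "рисунок.svg"

# Squeezing the name through a legacy 8-bit codepage that lacks these
# characters replaces them, so the original name cannot be recovered.
as_argv = original.encode("cp1252", errors="replace")
recovered = as_argv.decode("cp1252")

print(as_argv)
print(recovered == original)
```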

The problem I see with keeping everything in UTF-8 is that it's a convention unsupported by Glib. Glib has functions to work with native filenames and URIs, but not with native filenames converted to UTF-8. We should just use URIs everywhere; this would add network transparency via GIO, as we wouldn't be limited to local files.

By the way, there is some code that deals with versions of Windows that do not have the wide versions of Win32 API functions (Windows 95, Windows 98 and Windows ME) - this is totally superfluous since the version of Glib we depend on does not work on such systems.

Regards, Krzysztof

Jon A. Cruz

6:48 p.m.

On Sep 29, 2009, at 11:46 AM, Krzysztof Kosiński wrote:

...

On Linuxes the filename encoding is usually UTF-8 (e.g. anything newer than 5 years).

Actually that is not quite accurate.

The filename encoding is UTF-8 on *new* systems.

When a user upgrades a system, it keeps the filename encoding. We have several core Inkscape developers who have this issue. It is also far more prevalent for corporate users.

bulia byak

7:05 p.m.

On 9/29/09, Jon A. Cruz <jon@...18...> wrote:

...

When a user upgrades a system, it keeps the filename encoding. We have several core Inkscape developers who have this issue.

If you refer to me, then I have just installed a shiny new UTF-8 Ubuntu on a new machine after my old one died. So, until a couple weeks ago, indeed I had KOI8-R as my filename encoding, but not anymore.

-- bulia byak Inkscape. Draw Freely. http://www.inkscape.org

Jon A. Cruz

7:19 p.m.

On Sep 29, 2009, at 12:05 PM, bulia byak wrote:

...

On 9/29/09, Jon A. Cruz <jon@...18...> wrote:

...
When a user upgrades a system, it keeps the filename encoding. We have several core Inkscape developers who have this issue.

If you refer to me, then I have just installed a shiny new UTF-8 Ubuntu on a new machine after my old one died. So, until a couple weeks ago, indeed I had KOI8-R as my filename encoding, but not anymore.

Yes, you were generally taken as the pathological worst-case, since your filesystem also had files encoded in a few encodings. Was CP-1251 in there at some point?

But I've seen others. At least some of my boxen have been upgraded, and one common place to see "interesting" file system encodings has been Japan. There were many encodings used over there, and I've personally hit support issues with people having other than UTF-8 filesystems on boxes running modern distros that list "UTF-8" as their default filesystem.

The good news, though, is that as long as we do the right thing it will work 100% for everyone, including those with UTF-8 filesystems. It's minor work to do the right thing, and actually more work to do it wrong.

Jon A. Cruz

7 p.m.

On Sep 29, 2009, at 11:46 AM, Krzysztof Kosiński wrote:

...

The problem I see with keeping everything in UTF-8 is that it's a convention unsupported by Glib. Glib has functions to work with native filenames and URIs, but not with native filenames converted to UTF-8. We should just use URIs everywhere; this would add network transparency via GIO, as we wouldn't be limited to local files.

No, you're missing the point.

The only sane way to craft an application is to have a consistent data approach internally. The alternative is to have each individual string also marked with its encoding and do multiple conversions on the fly as things are processed. (Look to both early Windows 95 APIs and older Perl versions for examples of this).

The simple way is to say "All data inside the program is Unicode". Voilà! Problem solved.

Now Unicode data can be encoded in many ways (here we hit the important difference of "encoding" vs. "character set"). Unicode can be UTF-8, UTF-7, UTF-16BE, UTF-16LE, UTF-32, etc. Some data can be UCS-2, but that is *not* the same as UTF-16 and can burn people who don't realize it.
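
The UCS-2 versus UTF-16 trap is easy to show with a character outside the Basic Multilingual Plane. The following Python sketch simply encodes two sample characters (chosen arbitrarily) in several Unicode encodings and compares the byte counts.

```python
bmp_char = "é"               # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

# Byte length of each character in each encoding: (BMP, non-BMP).
sizes = {
    enc: (len(bmp_char.encode(enc)), len(astral_char.encode(enc)))
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
}

# In UTF-16 the non-BMP character becomes a surrogate pair (two 16-bit
# units). UCS-2 has no surrogate pairs, so it cannot represent it at all.
surrogate_pair = astral_char.encode("utf-16-be").hex()

print(sizes)
print(surrogate_pair)
```

Code that assumes "one 16-bit unit is one character" is really assuming UCS-2, and breaks on the second character.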

On MS Windows, they went with UTF-16 as the standard. They also defined wchar_t as 16-bit when everyone else in the world followed the standard's recommendation and implemented it as 32-bit. Java also uses UTF-16 as far as the programmer can tell. IBM's ICU also uses UTF-16.

On Linux, however, and with GTK+ the base encoding is UTF-8. Everything we do on the UI *must* be in UTF-8. Therefore it makes sense for Inkscape to use UTF-8 as its Unicode encoding of choice.

You mention filename encoding, but miss some issues. First and foremost is that as far as Inkscape is concerned each and every filename *must* be presentable in the UI anyway (MRU lists, titlebars, media tracker list, etc.). Therefore we have to be able to handle UTF-8 for those. There are also safe round-trip conversions for *most* of the user scenarios. Therefore it vastly simplifies our code to just keep internal data in a single consistent format - Unicode.

And conceptually URIs don't actually even support Unicode. What the authors of those APIs you cite have done, though, is follow RFC 3987 and implement their URIs *as* IRIs. Thus we are 100% compatible with those APIs you care so much about as long as we properly support IRIs.

...

By the way, there is some code that deals with versions of Windows that do not have the wide versions of Win32 API functions (Windows 95, Windows 98 and Windows ME) - this is totally superfluous since the version of Glib we depend on does not work on such systems.

For years this code was not superfluous. It only became redundant later on. Thus it now can be safely dropped, but only as long as care is taken in cutting it out.

Also, early on Inkscape implemented code that did not exist in Glib, and that code was later adopted by the Glib maintainers because of us.

Krzysztof Kosiński

6:27 p.m.

2009/9/29 Bob Jamison <ishmalius@...400...>:

...

On 9/29/2009 1:00 AM, Jon A. Cruz wrote:

...
Hi,

Just wanted to point out that technically URI classes are not usable for our assets.

For example, it is tempting to use an URI to resolve an asset to the base document path. The problem is that a URI according to the RFC's is not usable for file paths. Glib and friends properly implement URI, therefore we can't use them for files.

We can, and in fact we must in order to support editing directly on network shares (GIO). The convention used by absolutely everyone is to use the file:/// scheme for local files.

...

...
In the future we can probably use IRI's or our own classes for references and assets, etc. However URI's are not usable.

The only difference between URIs and IRIs is the set of allowed characters. You can use Unicode characters in IRIs without escaping them, but you can use Unicode characters in URIs as well, you just need to escape their UTF-8 representation with %xx. http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier
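
A small sketch of that mapping, using Python's urllib.parse (the path is a made-up example): non-ASCII characters are encoded as UTF-8 and each byte is written as %xx, which gives the URI form of an IRI path.

```python
from urllib.parse import quote, unquote

# IRI-style path with an unescaped non-ASCII character (made-up example).
iri_path = "/home/user/Kosiński/rysunek.svg"

# URI form: UTF-8 bytes of non-ASCII characters become %xx escapes.
uri_path = quote(iri_path)

print(uri_path)
print(unquote(uri_path) == iri_path)
```

This only holds when both sides agree that the escaped bytes are UTF-8, which is exactly what is not guaranteed for native file names.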

...

But I'm kinda pro URIs. Yes, the URI API as stated in the W3C doc does not have a getNativePath() method as far as I know. However, Java's runtime, XML2, and other libs -do- include such a call, and it is quite useful. I think it has become a defacto requirement.

Seconded. URIs might not be originally intended for file paths but it is a de facto standard to use the file:/// URI scheme for local paths. It is even used in the clipboard in GTK.

...

The case that you mention, an asset pointing back to the owner doc. Is that feasible, or even promised anywhere?

SVG provides mechanisms to embed the necessary data (e.g. images) in the SVG itself, so I don't see why we need this.

...

And when you JAR the directory into an archive, the META-INF directory and manifest are already ready to go.

This is something we needed for collaboration, too.

Multipage as well. But we should also fix embedding data in SVG (the data URI). For example, pasted and imported images should be embedded in the document by default. There must be a separate option to embed a link to the image, and there must be a clear distinction between those two concepts.
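
For reference, embedding via a data URI amounts to base64-encoding the file contents behind a MIME-type header. The sketch below is a simplified illustration; the function names are hypothetical and it handles only the base64 form.

```python
import base64

def make_data_uri(payload: bytes, mime_type: str) -> str:
    # Build a base64 data URI suitable for an href attribute.
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:" + mime_type + ";base64," + encoded

def read_data_uri(uri: str) -> bytes:
    # Inverse of make_data_uri; accepts only the base64 form.
    header, _, encoded = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(encoded)

# The 8-byte PNG signature stands in for real image data.
png_signature = b"\x89PNG\r\n\x1a\n"
embedded = make_data_uri(png_signature, "image/png")

print(embedded)
print(read_data_uri(embedded) == png_signature)
```

An embedded image carries no file path at all, which is why the embed/link distinction matters for the path-handling problem discussed in this thread.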

Regards, Krzysztof

Jon A. Cruz

6:47 p.m.

On Sep 29, 2009, at 11:27 AM, Krzysztof Kosiński wrote:

...

We can, and in fact we must in order to support editing directly on network shares (GIO). The convention used by absolutely everyone is to use the file:/// scheme for local files.

No...

There are many things that are *close*.

We want 100%, not 95%.

So for our internals we must use something that is consistent and also covers 100% of the use cases we need. As we touch each API, we can either feed something in directly (as I mentioned, both Mental and I have researched this URI/IRI issue) or we can do a quick conversion per API.

Also... at runtime we might need to track some additional data. Bob mentioned some additions to URI classes that people do in order to make things work. That is what we need to do. However, instead of having redundant code all over the place doing potentially buggy conversions, we need to have a consistent class that we use internally.

To quote RFC 3987 "which means that IRIs can be used instead of URIs, where appropriate, to identify resources." Later on it also mentions "Every URI is by definition an IRI."

You also seem to not be grasping what I'm trying to convey. Let me try to clarify:

* URIs *can* be used for all sorts of things, but as is, the standard library calls for conversions *fail* in our use cases.
* There are many ways to tweak data in and out of URIs to make them more useful.
* RFC 3987 codifies the best practices of "usable URIs" and calls them "IRIs".
* We need to use IRIs internally.
* We need to use a class that is more than just a bare string.
* We should use them to reference files.
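
One possible shape for "a class that is more than just a bare string" is sketched below in Python. It is purely hypothetical: the class name, its fields, and the decision to keep the native file name bytes as the extra runtime data are all assumptions made for illustration, not a description of what Inkscape implemented.

```python
from dataclasses import dataclass
from urllib.parse import quote_from_bytes, unquote_to_bytes, urlsplit

@dataclass(frozen=True)
class AssetRef:
    """Hypothetical reference type that keeps the native file name bytes."""
    native_path: bytes

    def to_uri(self) -> str:
        # Escape the raw bytes, so no assumption about their encoding is made.
        return "file://" + quote_from_bytes(self.native_path)

    @classmethod
    def from_uri(cls, uri: str) -> "AssetRef":
        # Undo the escaping to get the exact native bytes back.
        return cls(unquote_to_bytes(urlsplit(uri).path))

    def display_name(self, filename_encoding: str = "utf-8") -> str:
        # Text for the UI; bytes that cannot be decoded are replaced.
        return self.native_path.decode(filename_encoding, errors="replace")

# A file name stored in KOI8-R (made-up example).
legacy = AssetRef("/home/user/рисунок.svg".encode("koi8_r"))
uri = legacy.to_uri()

print(uri)
print(AssetRef.from_uri(uri) == legacy)
print(legacy.display_name("koi8_r"))
```

Keeping the conversions inside one type means each API boundary calls a single, testable method instead of repeating ad hoc string conversions.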
