
On Sep 29, 2009, at 11:46 AM, Krzysztof Kosiński wrote:
> The problem I see with keeping everything in UTF-8 is that it's a convention unsupported by Glib. Glib has functions to work with native filenames and URIs, but not with native filenames converted to UTF-8. We should just use URIs everywhere; this would add network transparency via GIO, as we wouldn't be limited to local files.
No, you're missing the point.
The only sane way to craft an application is to have a consistent data approach internally. The alternative is to have each individual string also marked with its encoding and do multiple conversions on the fly as things are processed. (Look to both the early Windows 95 APIs and older Perl versions for examples of this.)
The simple way is to say "All data inside the program is Unicode". Voilà! Problem solved.
Now Unicode data can be encoded in many ways (here we hit the important difference of "encoding" vs. "character set"). Unicode can be UTF-8, UTF-7, UTF-16BE, UTF-16LE, UTF-32, etc. Some data can be UCS-2, but that is *not* the same as UTF-16 and can burn people who don't realize it.
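To make the UCS-2 vs. UTF-16 point concrete, here is a small Python sketch (Python is used only for illustration; it is not what Inkscape is written in). It encodes one character from outside the Basic Multilingual Plane in several Unicode encodings. The helper name utf16_units is made up for this example.

```python
# One code point, several Unicode encodings.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

utf8 = clef.encode("utf-8")
utf16 = clef.encode("utf-16-be")
utf32 = clef.encode("utf-32-be")


def utf16_units(data: bytes) -> list:
    """Split big-endian UTF-16 bytes into 16-bit code units (illustrative helper)."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_units(utf16)

# UTF-16 needs a surrogate pair (two 16-bit units) for this character.
# UCS-2 is fixed at one 16-bit unit per character, so it cannot represent it;
# code that assumes UCS-2 would treat each surrogate as a separate "character".
is_surrogate = [0xD800 <= u <= 0xDFFF for u in units]

print(utf8.hex(), utf16.hex(), utf32.hex())
print([hex(u) for u in units], is_surrogate)
```

The same code point comes out as different byte sequences in each encoding, and the UTF-16 form is two code units long, which is exactly where UCS-2 assumptions break.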
On MS Windows, they went with UTF-16 as the standard. They also defined wchar_t as 16-bit when everyone else in the world followed the standard's recommendation and implemented it as 32-bit. Java also uses UTF-16 as far as the programmer can tell. IBM's ICU also uses UTF-16.
On Linux, however, and with GTK+, the base encoding is UTF-8. Everything we do on the UI *must* be in UTF-8. Therefore it makes sense for Inkscape to use UTF-8 as its Unicode encoding of choice.
You mention filename encoding, but miss some issues. First and foremost is that as far as Inkscape is concerned each and every filename *must* be presentable in the UI anyway (MRU lists, titlebars, media tracker list, etc.). Therefore we have to be able to handle UTF-8 for those. There are also safe round-trip conversions for *most* of the user scenarios. Therefore it vastly simplifies our code to just keep internal data in a single consistent format - Unicode.
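As a rough sketch of what a safe round trip looks like, assume a filesystem whose filenames are stored in ISO-8859-1. The Python below uses the built-in codecs as a stand-in for the kind of conversion Glib offers with g_filename_to_utf8() and g_filename_from_utf8(); the function names to_internal and to_native are hypothetical, and the chosen encoding is an assumption of the example.

```python
def to_internal(native: bytes, fs_encoding: str) -> str:
    """Convert on-disk filename bytes to the program's internal Unicode form."""
    return native.decode(fs_encoding)


def to_native(text: str, fs_encoding: str) -> bytes:
    """Convert an internal Unicode filename back to on-disk bytes."""
    return text.encode(fs_encoding)


native = b"caf\xe9.svg"  # "café.svg" as stored on an ISO-8859-1 filesystem

text = to_internal(native, "iso-8859-1")
internal_utf8 = text.encode("utf-8")  # what the UI (titlebar, MRU list) would get

# The round trip is lossless when the filesystem encoding is known.
round_trip_ok = to_native(text, "iso-8859-1") == native

# The "most, not all" caveat: the same bytes are not valid UTF-8, so guessing
# the wrong encoding fails instead of round-tripping.
try:
    to_internal(native, "utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(internal_utf8, round_trip_ok, rejected)
```

Conversion happens only at the boundary (reading from or writing to the filesystem), and everything in between sees a single format.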
And conceptually URIs don't actually even support Unicode. What the authors of those APIs you cite have done, though, is follow RFC 3987 and implement their URIs *as* IRIs. Thus we are 100% compatible with those APIs you care so much about as long as we properly support IRIs.
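For illustration, the core of the IRI-to-URI mapping in RFC 3987 is to percent-encode the UTF-8 bytes of every non-ASCII character and leave the ASCII part alone. A minimal Python sketch follows, with a made-up path and a hypothetical function name iri_to_uri; it deliberately ignores the separate handling RFC 3987 allows for internationalized host names.

```python
from urllib.parse import quote, unquote


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of each non-ASCII character.

    Simplified sketch: ASCII characters, including existing %XX escapes,
    pass through untouched; host names get no special treatment.
    """
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in iri)


iri = "file:///home/user/rysunek-\u017c\u00f3\u0142w.svg"  # hypothetical path
uri = iri_to_uri(iri)

# Going back is just percent-decoding and interpreting the bytes as UTF-8.
# (This is only safe here because the example has no escaped ASCII characters.)
back = unquote(uri)

print(uri)
print(back == iri)
```

So a program that keeps UTF-8 internally can hand such APIs a plain ASCII URI whenever one is needed, without a second internal representation.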
> By the way, there is some code that deals with versions of Windows that do not have the wide versions of Win32 API functions (Windows 95, Windows 98 and Windows ME) - this is totally superfluous since the version of Glib we depend on does not work on such systems.
For years this code was not superfluous. It only became redundant later on. Thus it now can be safely dropped, but only as long as care is taken in cutting it out.
Also, early on Inkscape implemented code that did not exist in Glib, and which was adopted by the Glib maintainers because of us.