Currently Inkscape fails in numerous ways when you try to name your file with non-ASCII characters. For one thing, it records the filename in preferences.xml using the locale encoding, which is not necessarily valid UTF-8; the parser then rejects the file on the next load, and the preferences are simply lost.
Can anyone give any advice on how to deal with this problem? I know about g_locale_to/from_utf8() but the question is, where is the best place to put these functions. It looks like all standard file-related functions (gtk_file_selection_get_filename, file opening and saving, etc) require the filename in locale encoding, but all display functions (setting window title, display in XML editor, recent files menu) require UTF-8. And of course, serialized XML must get UTF-8 too. Looks like there are quite a lot of places where I'll need to translate between the encodings, and I'm afraid it will become pretty messy.
I'm not going to experiment with this but if anyone has any insights on this problem I would appreciate any help.
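To illustrate the preferences.xml failure described above, here is a minimal sketch of a UTF-8 well-formedness check in plain C. It is not Inkscape or glib code; the function name `is_valid_utf8` is made up for this example, and ISO-8859-1 is only assumed as a typical legacy locale encoding.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the NUL-terminated byte string
 * is well-formed UTF-8 (no overlong forms, no surrogates, max U+10FFFF). */
static bool is_valid_utf8(const char *s)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const unsigned int min_cp[] = { 0x0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned int cp;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return false;            /* stray continuation or invalid lead byte */
        p++;

        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)  /* also stops at the terminating NUL */
                return false;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < min_cp[extra])       /* overlong encoding */
            return false;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
    }
    return true;
}
```

With this check, the ISO-8859-1 bytes `"caf\xE9.svg"` are rejected, while the UTF-8 bytes `"caf\xC3\xA9.svg"` for the same name are accepted. An XML parser that expects UTF-8 makes the same distinction.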
On Dec 25, 2003, at 10:47 PM, bulia byak wrote:
> Can anyone give any advice on how to deal with this problem? I know about g_locale_to/from_utf8() but the question is, where is the best place to put these functions. It looks like all standard file-related functions (gtk_file_selection_get_filename, file opening and saving, etc) require the filename in locale encoding, but all display functions (setting window title, display in XML editor, recent files menu) require UTF-8. And of course, serialized XML must get UTF-8 too. Looks like there are quite a lot of places where I'll need to translate between the encodings, and I'm afraid it will become pretty messy.
I'll have to look into the codebase with charset issues in mind (I've worked with international software for around 15 years now).
In general, the simplest way is to keep all strings in a program in a canonical form, and only translate when required. Best is either UTF-16/UCS-2 or UTF-8 for the internal format (depending on the platform, etc). For GTK+ and Qt/KDE programs, it's generally good to go with UTF-8. However, keeping Win32 compatibility complicates things somewhat.
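As a rough sketch of "translate only when required", the function below converts a string arriving from the outside world into the canonical UTF-8 form at the boundary. It assumes the external encoding is ISO-8859-1, which is only one possible locale encoding; the name `latin1_to_utf8` is hypothetical, and a real program would more likely use a general conversion facility such as the g_locale_to_utf8() mentioned earlier in the thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical boundary helper: convert an ISO-8859-1 string to a newly
 * allocated UTF-8 string. The caller frees the result. Returns NULL only
 * if allocation fails, since every Latin-1 byte has a UTF-8 form. */
static char *latin1_to_utf8(const char *in)
{
    /* Each input byte becomes at most two output bytes. */
    char *out = malloc(2 * strlen(in) + 1);
    if (!out)
        return NULL;

    char *q = out;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            *q++ = (char)*p;                   /* ASCII passes through */
        } else {
            *q++ = (char)(0xC0 | (*p >> 6));   /* lead byte: 0xC2 or 0xC3 */
            *q++ = (char)(0x80 | (*p & 0x3F)); /* continuation byte */
        }
    }
    *q = '\0';
    return out;
}
```

Everything past this point in the program can then treat the string as UTF-8 for window titles, menus and serialized XML.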
On file versus display encodings, offhand I think you're correct. However... for the local filenames, things might get tricky. It might be that things need to work in the local encoding, or... it might be that the encoding of the filesystem needs to be used. Much depends on the OS and the means of accessing the file systems in question.
Hiding these issues for many people is the recent change to UTF-8 for the local encoding setting (such as with RedHat 8.0). Since the local encoding is the same as the encoding needed for internal/display calls, programmers don't notice the potential problems. Also... if users and programmers aren't actually testing with non-ASCII names (i.e. names with characters outside of $00-$7f) then they will also not see any problems even if they exist for their configurations.
I was planning on looking into this soon, but will try to get to it a little sooner. It would also help if some developers who are set up to do Win32 building and testing could give me a hand (I have MSVC 5.0 on Win98 dual-boot, but almost never leave Linux).
open()/readdir() etc. take as argument a nul-terminated string of bytes, not some notion of characters. preferences.xml should store filenames as such.
How to render a filename, and how to interpret filenames that are typed (as distinct from selected from a list) is of secondary importance to avoiding munging filenames during storage. I believe on Linux the preferred default interpretation is UTF-8.
If <gui toolkit xyz> does something different, then that toolkit is broken and should be fixed.
IMH-and-ignorant-O,
pjm.
On Fri, 2003-12-26 at 21:42, Peter Moulder wrote:
> open()/readdir() etc. take as argument a nul-terminated string of bytes, not some notion of characters. preferences.xml should store filenames as such.
Unless we were to use e.g. base64 encoding, there would be no XML-conformant way to store that in preferences XML.
In any case, we'd still need to store the UTF-8 filename, if only for display in the UI.
An even more interesting question is how things would work when preferences were shared between systems (e.g. a shared home directory).
On some platforms (e.g. Darwin/Mac OS X), open()/readdir() use a specific, fixed encoding which has certain validity rules; i.e. you _can't_ just give it an arbitrary byte stream (even if such a filename already exists in the filesystem).
To be honest, I really think this isn't our problem to fix. The best we can do is use UTF-8 internally, and call open() and friends with the string converted to LC_CTYPE.
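A minimal sketch of that approach: keep the path in UTF-8 and convert it only at the call that touches the filesystem. ISO-8859-1 stands in for the locale encoding here purely as an assumption, and both function names are hypothetical. The point it shows is that this direction of conversion can fail, because a UTF-8 name may contain characters the locale encoding cannot represent.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert UTF-8 to newly allocated ISO-8859-1.
 * Returns NULL if the input is malformed, contains a character above
 * U+00FF, or if allocation fails. The caller frees the result. */
static char *utf8_to_latin1(const char *in)
{
    char *out = malloc(strlen(in) + 1);  /* output is never longer than input */
    if (!out)
        return NULL;

    char *q = out;
    const unsigned char *p = (const unsigned char *)in;
    while (*p) {
        if (*p < 0x80) {
            *q++ = (char)*p++;           /* ASCII passes through */
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            *q++ = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            free(out);                   /* not representable in Latin-1 */
            return NULL;
        }
    }
    *q = '\0';
    return out;
}

/* Hypothetical wrapper: open a file whose path is held internally as UTF-8. */
static FILE *open_utf8_path(const char *utf8_path, const char *mode)
{
    char *native = utf8_to_latin1(utf8_path);
    if (!native)
        return NULL;                     /* name cannot be expressed natively */
    FILE *f = fopen(native, mode);
    free(native);
    return f;
}
```

A caller has to treat a NULL result as "this name cannot be opened under the current encoding", which is a different failure from "file not found".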
-mental
On Dec 26, 2003, at 11:07 PM, MenTaLguY wrote:
> In any case, we'd still need to store the UTF-8 filename, if only for display in the UI.
Exactly. Since a conversion would be required anyway, it makes sense to keep the canonical Unicode version around.
> On some platforms (e.g. Darwin/Mac OS X), open()/readdir() use a specific, fixed encoding which has certain validity rules; i.e. you _can't_ just give it an arbitrary byte stream (even if such a filename already exists in the filesystem).
> To be honest, I really think this isn't our problem to fix. The best we can do is use UTF-8 internally, and call open() and friends with the string converted to LC_CTYPE.
Or glib's g_filename_to_utf8() and g_filename_from_utf8() functions might be appropriate.
g_filename_from_uri() and g_filename_to_uri also look very promising.
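To show why a URI form is attractive for storage, here is a rough sketch of percent-encoding a filename's raw bytes into a file URI. This is not glib's implementation of g_filename_to_uri(), whose exact escaping rules and hostname handling differ; the name `path_to_file_uri` is hypothetical and an absolute POSIX-style path is assumed.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Bytes left as-is: ASCII letters, digits, '-', '.', '_', '~' and the
 * path separator '/'. Everything else is percent-encoded. */
static bool is_unescaped(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~' || c == '/';
}

/* Hypothetical helper: turn an absolute path (arbitrary bytes) into a
 * newly allocated "file://" URI. The caller frees the result. */
static char *path_to_file_uri(const char *path)
{
    static const char hex[] = "0123456789ABCDEF";
    static const char prefix[] = "file://";
    size_t plen = sizeof prefix - 1;

    /* Worst case: every byte expands to three characters. */
    char *out = malloc(plen + 3 * strlen(path) + 1);
    if (!out)
        return NULL;

    memcpy(out, prefix, plen);
    char *q = out + plen;
    for (const unsigned char *p = (const unsigned char *)path; *p; p++) {
        if (is_unescaped(*p)) {
            *q++ = (char)*p;
        } else {
            *q++ = '%';
            *q++ = hex[*p >> 4];
            *q++ = hex[*p & 0x0F];
        }
    }
    *q = '\0';
    return out;
}
```

The result is pure ASCII, so it can be written into an XML file regardless of what encoding the original filename bytes were in, and the bytes can be recovered exactly by reversing the escaping.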
Hmmm... speaking of URI's... Looking at some existing solutions might be handy:
http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html
http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html
http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html
(the thing to look at in File is how it can be used to work and manipulate paths instead of using bare string manipulation)
On Sat, 2003-12-27 at 05:28, Jon A.Cruz wrote:
> > To be honest, I really think this isn't our problem to fix. The best we can do is use UTF-8 internally, and call open() and friends with the string converted to LC_CTYPE.
> Or glib's g_filename_to_utf8() and g_filename_from_utf8() functions might be appropriate.
> g_filename_from_uri() and g_filename_to_uri also look very promising.
Ooh, those sound ideal. Seems like I never stop learning about cool new stuff in glib ^_^
> Hmmm... speaking of URI's... Looking at some existing solutions might be handy:
> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html
> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html
> http://java.sun.com/j2se/1.4.2/docs/api/java/io/File.html
> (the thing to look at in File is how it can be used to work and manipulate paths instead of using bare string manipulation)
Indeed, I've been looking at these already.
-mental
participants (4)
- bulia byak
- Jon A.Cruz
- MenTaLguY
- Peter Moulder