[jonadab@...581...: Re: [Clipart] character coding]
Jon, can you give some advice on this one?
----- Forwarded message from Jonadab the Unsightly One <jonadab@...581...> -----
Date: Mon, 07 Feb 2005 00:01:33 -0500
From: Jonadab the Unsightly One <jonadab@...581...>
To: clipart@...330...
Subject: Re: [Clipart] character coding
Nicu Buculei <nicu@...398...> writes:
> I remember we talked several months ago about this problem and IIRC it is generated by the upload tool.
The upload tool's approach to character encoding is completely naive: it just assumes that whatever form it _receives_ the data in is also suitable for _storing_ it. (At least, that's how it's supposed to work.) It doesn't care whether the data is ISO-8859-15 or anything else. For the filename it uses only 7-bit printable ASCII characters, representing anything else as an underscore, but it doesn't remove or change non-ASCII characters in the actual metadata.
It does need the data to be in some encoding that encodes certain characters the same way ASCII does: mainly the characters that have special significance to XML, such as < and > and / and so forth, plus the letters in certain tags (s, v, g, r, d, f, and so forth). But those are all 7-bit characters. It won't be able to handle EBCDIC data, for example, but any sane, ASCII-compatible encoding should work just fine. In theory.
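To make the filename rule concrete, here is a minimal sketch in Python (not whatever language the upload tool is actually written in). The function name and the exact set of characters allowed through are assumptions made for illustration; the message only says that non-ASCII characters show up as underscores in the filename while the metadata is left alone.

```python
import string

# Characters allowed to pass through unchanged. This whitelist is an
# assumption; the real tool's exact rule is not given in the message.
SAFE_CHARS = set(string.ascii_letters + string.digits + "-.")

def ascii_filename(title: str) -> str:
    """Return a copy of title in which every character outside SAFE_CHARS is '_'."""
    return "".join(ch if ch in SAFE_CHARS else "_" for ch in title)

title = "café au lait.svg"
print(ascii_filename(title))  # caf__au_lait.svg
print(title)                  # the metadata string itself is not modified
```

This sketch works on decoded text, so the accented letter becomes a single underscore. A tool working on raw bytes would emit one underscore per non-ASCII byte, which for UTF-8 input means two or more per character.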
I do not off the top of my head know what character set ISO-8859-15 is, other than that I think all the ISO-8859-anything charsets are fully ASCII-compatible in the bottom seven bits. And it was my understanding that UTF-8 has this property also. So in *theory* it should Just Work (in the sense of not making any undesired changes).
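The ASCII-compatibility claim is easy to check. The sketch below (Python, with a hypothetical helper name and an arbitrary sample of markup characters) compares how a few codecs encode the characters the tool cares about; cp037 stands in for EBCDIC.

```python
# Markup characters and tag letters that the tool needs to recognise.
MARKUP = "<>/&\"'= svgrdf"

def ascii_compatible(codec: str, text: str = MARKUP) -> bool:
    """True if codec encodes text to exactly the bytes ASCII would produce."""
    return text.encode(codec) == text.encode("ascii")

for codec in ("iso-8859-15", "utf-8", "cp037"):
    print(codec, ascii_compatible(codec))
# iso-8859-15 True
# utf-8 True
# cp037 False
```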
The one problem I can think of off the top of my head that could occur with this is if the data that the upload tool receives is not consistent in its encoding -- e.g., if the SVG it receives is in one encoding, and the metadata the user fills in on the form is sent by the browser in a different encoding. Is it possible that that is what happened here?
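A small Python sketch of that mixed-encoding scenario, with made-up sample data: the SVG body arrives as ISO-8859-15, the form field arrives as UTF-8, and the two byte strings are stored together unchanged.

```python
# Hypothetical inputs: a file saved by an editor and a form field from a browser.
svg_part = "<title>café</title>".encode("iso-8859-15")
form_part = "<dc:creator>José</dc:creator>".encode("utf-8")

# A naive tool stores whatever bytes it received, side by side.
stored = svg_part + form_part

def decodes_cleanly(data: bytes, codec: str) -> bool:
    """True if data is a valid byte sequence in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_cleanly(stored, "utf-8"))  # False: the lone 0xE9 byte is not valid UTF-8
print(stored.decode("iso-8859-15"))      # decodes, but the form field reads "JosÃ©"
```

Neither reading of the stored file is right for both halves, which would match the kind of breakage being reported if this is indeed what happened.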
> is saved as ISO-8859-15
I was unaware that the filesystem maintained character-set metadata. What does it mean for a file to be "saved as" ISO-8859-15? How can you tell what character set a file uses, apart from looking at the charset information in the XML declaration?
More to the point, how can the script detect which encoding the data it receives is in, short of asking the user?
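There is no fully reliable answer, but a heuristic is possible. The following Python sketch (the function name and the fallback value are assumptions, and byte-order marks and UTF-16 are ignored) trusts the XML declaration if there is one, then checks whether the bytes happen to be valid UTF-8, and otherwise falls back to a configured guess.

```python
import re

# Matches an encoding pseudo-attribute in an XML declaration at the start of the data.
DECL_RE = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def guess_encoding(data: bytes, fallback: str = "iso-8859-15") -> str:
    """Best-effort guess at the encoding of data; never a guarantee."""
    match = DECL_RE.match(data)
    if match:
        # The file declares its own encoding; take it at its word.
        return match.group(1).decode("ascii").lower()
    try:
        # Non-ASCII text in a legacy 8-bit charset is rarely valid UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback

print(guess_encoding(b'<?xml version="1.0" encoding="ISO-8859-15"?><svg/>'))  # iso-8859-15
print(guess_encoding("<svg>é</svg>".encode("utf-8")))                         # utf-8
```

Form fields carry no declaration at all. Browsers generally submit them in the encoding of the page containing the form, so serving the upload page with an explicit charset is probably the closest thing to control over that half of the input.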
Maybe we need a Unicode guru. I'm not one.
Alternatively: does RDF allow for non-ASCII characters in the metadata to be encoded as entities? Could we just use something along the lines of HTML::Entities to encode it (so that e.g. the problematic character in the file in question would become &eacute; or somesuch)? Wouldn't that render the character encoding basically irrelevant?
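As a sketch of the entity idea in Python (the helper name is hypothetical): XML predefines only five named entities (lt, gt, amp, apos, quot), so a name like eacute would need a DTD, whereas numeric character references are always valid. Python's built-in xmlcharrefreplace error handler produces exactly those.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric XML character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_char_refs("café"))   # caf&#233;
print(to_char_refs("<svg>"))  # <svg>  (ASCII markup passes through untouched)
```

The output is pure ASCII, so the encoding of the stored file stops mattering. The input still has to be decoded correctly first, though, so this does not remove the need to know what encoding the data arrived in.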
participants (1): Bryce Harrington