Broken string trimming

Jon A. Cruz

9 Jan 2005

5:12 p.m.

I just saw this in the jabber logs

(01:20:33) *bryce:* to delete the first N chars off the start of a string s, is it ok to do s += N?

The answer, for most of our stuff, is a big resounding "NO!"

Remember, when multibyte encodings come into play, doing something like that is liable to break things mid-sequence and corrupt data, or to get the wrong count and trim fewer characters than you'd expect.

In most places our code is using the 'standard' GTK+ encoding of UTF-8. That's multibyte, with from 1 to 4 bytes per character (older definitions of UTF-8 allowed up to 6). Of course, if you're at the point where raw data is entering the app via some means other than UI widgets, the bytes will be in whatever the local encoding is.

So just remember, characters no longer match bytes one-for-one. Sometimes they'll happen to align, but other times not.
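To make the byte/character mismatch concrete, here is a minimal sketch in plain C++, assuming the string holds valid UTF-8. The helper names (is_continuation_byte, utf8_char_count, splits_sequence) are made up for illustration and are not part of glib or Inkscape.

```cpp
#include <cstddef>
#include <string>

// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Count characters (code points) in a string assumed to hold valid UTF-8:
// every byte that is not a continuation byte starts a new character.
std::size_t utf8_char_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        if (!is_continuation_byte(b)) {
            ++count;
        }
    }
    return count;
}

// True if byte offset `n` lands in the middle of a multibyte sequence,
// which is where a naive `s += n` can leave the pointer.
bool splits_sequence(const std::string& s, std::size_t n) {
    return n < s.size() &&
           is_continuation_byte(static_cast<unsigned char>(s[n]));
}
```

For example, the five-character word "étude" occupies six bytes in UTF-8, and advancing by one byte lands inside the two-byte sequence for "é".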

Bryce Harrington

9 Jan

6:59 p.m.

I think you missed the context. This is for code that is not UTF-8.

However, if you want everything to be in UTF-8 just for the principle of it, give me the directions for how to do the below in UTF-8. (I'll also need pointers to the UTF-8 equiv functions for fgets and sscanf.)

Bryce


Jon A. Cruz

8:02 p.m.

Bryce Harrington wrote:

...

I think you missed the context. This is for code that is not UTF-8.

Are you *sure*???

I did scan for context and didn't see anything that seemed to guarantee non-UTF-8.

Given that RedHat switched the default locale to UTF-8 back in RH 8.0, and Solaris back with Solaris 7 or 8, many base assumptions are now dangerous.

In fact, I was having trouble just the other week regarding trying to use "C" and other locales to force stuff to not use UTF-8 on one box, all to no avail. :-(

MenTaLguY

8:34 p.m.

On Sun, 2005-01-09 at 15:02, Jon A. Cruz wrote:

...

Bryce Harrington wrote:

...
I think you missed the context. This is for code that is not UTF-8.

...

Are you *sure*???

I did scan for context and didn't see anything that seemed to guarantee non-UTF-8.

Given that RedHat switched the default locale to UTF-8 back in RH 8.0, and Solaris back with Solaris 7 or 8, many base assumptions are now dangerous.

Not dangerous, wrong. And they always were, UTF-8 or no, ever since the advent of multi-byte encodings decades ago. There are only two cases possible in the Inkscape codebase:

1. we're dealing with an internal string, which should always be UTF-8, in which case c+N to advance N chars is hopelessly broken

2. we're dealing with an external string in the current locale's encoding, which may be any single or multibyte encoding (not just UTF-8), in which case c+N to advance N chars is hopelessly broken

If you need to do nontrivial string manipulation in the first case, please use the appropriate glib functions (documented at http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html), or just use Glib::ustring, which does everything for you.

If you need to do nontrivial string manipulation in the second case, the easiest thing to do is to convert the string from the locale encoding to UTF-8 first. Glib::ustring also handles that automatically (though check its documentation at http://www.gtkmm.org/gtkmm2/docs/reference/html/classGlib_1_1ustring.html as the situations where it does conversion and the situations where it does not are not always obvious).

The easiest way is to always use Glib::ustring, and use Glib::locale_to_utf8() and Glib::utf8_to_locale() if you need to convert to/from it to strings in the current locale.
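As a rough illustration of the "convert to UTF-8 first" step, the sketch below assumes the external encoding is ISO-8859-1 (Latin-1). That is only an assumption for the example: in real code the locale encoding can be anything, and Glib::locale_to_utf8() is the call to use. The function latin1_to_utf8 is a hypothetical stand-in written with the standard library only.

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8. Assumption for this sketch only:
// the input really is Latin-1, where each byte is one character whose
// code point equals the byte value.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (unsigned char b : in) {
        if (b < 0x80) {
            // ASCII maps to itself.
            out.push_back(static_cast<char>(b));
        } else {
            // Code points 0x80..0xFF need a two-byte UTF-8 sequence.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

Once the data is UTF-8, the same character-aware routines can be used for both internal and external strings.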

-mental

Peter Moulder

9 p.m.

On Sun, Jan 09, 2005 at 03:34:35PM -0500, MenTaLguY wrote:

...

we're dealing with an internal string, which should always be UTF-8, in which case c+N to advance N chars is hopelessly broken

Qualification: for skipping past the leading N whitespace characters (g_ascii_isspace), c+N is fine, as g_ascii_isspace is true only for characters that are 1 byte in UTF-8.

(I don't know whether or not this qualification is relevant to what was originally being discussed, as I don't know what was originally being discussed.)
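A small sketch of that safe case, in plain C++: skipping leading ASCII whitespace byte by byte. The helper is_ascii_space is a local approximation of g_ascii_isspace written for this example (the exact set of bytes glib accepts may differ slightly), and skip_leading_ascii_space is likewise a made-up name.

```cpp
#include <string>

// Local approximation of an ASCII-only whitespace test. All of these
// bytes are below 0x80, so none can be part of a multibyte UTF-8 sequence.
bool is_ascii_space(unsigned char b) {
    return b == ' ' || b == '\t' || b == '\n' || b == '\f' || b == '\r';
}

// Advance past leading ASCII whitespace in a NUL-terminated UTF-8 string.
// Stepping one byte at a time is safe here because each skipped
// character is exactly one byte long.
const char* skip_leading_ascii_space(const char* s) {
    while (*s != '\0' && is_ascii_space(static_cast<unsigned char>(*s))) {
        ++s;
    }
    return s;
}
```

The loop stops at the first byte that is not ASCII whitespace, so it never steps into the middle of a multibyte character.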

One note of caution about Glib::ustring: The documentation claims that the internal representation is UTF-8, which suggests that s[i] has to scan from the start of the string and so runs in time proportional to i. However, I don't have the source handy to check this, and I have my doubts. Anyone?

If in doubt, use the iterators for passing over ustring contents rather than unsigned/int i and s[i].
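The cost concern can be pictured with a plain C++ sketch of a single forward pass over UTF-8 data, which is roughly what iterating buys you compared with re-scanning from the start for every index. This assumes valid UTF-8 input; utf8_sequence_length and utf8_split_chars are hypothetical helpers for illustration, not Glib::ustring internals.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence introduced by lead byte `b`.
// Assumes valid UTF-8; unexpected bytes are treated as length 1.
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

// Walk the string once, front to back, collecting each character as
// its own substring. Total work is proportional to the byte length.
std::vector<std::string> utf8_split_chars(const std::string& s) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len > s.size() - i) {
            len = s.size() - i;  // clamp a truncated trailing sequence
        }
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}
```

Looking up character i by scanning from the start each time would repeat that walk for every index, which is the quadratic behaviour the iterator advice avoids.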

pjrm.

Mike Hearn

8:24 p.m.

On Sun, 09 Jan 2005 10:59:06 -0800, Bryce Harrington wrote:

...

...
(01:20:33) *bryce:* to delete the first N chars off the start of a string s, is it ok to do s += N?

For a UTF-8 string you can put it in a Glib::ustring then use erase() to delete the first N characters.

To get character N use Glib::ustring's operator[]. If you have a UTF-8 char * but not a ustring and don't want to make one, the raw glib equivalents (for example g_utf8_offset_to_pointer()) work as well.
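For readers without glibmm at hand, the idea behind erasing the first N characters can be sketched with the standard library alone, assuming valid UTF-8 in a std::string. The names utf8_offset_of_char and utf8_erase_prefix are invented for this example; with Glib::ustring you would call erase() as described above.

```cpp
#include <cstddef>
#include <string>

// Byte offset at which character index `n` starts in `s`, assumed to
// hold valid UTF-8. Returns s.size() if `s` has fewer than n characters.
std::size_t utf8_offset_of_char(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        ++i;  // step over the lead byte
        // then over any continuation bytes (bit pattern 10xxxxxx)
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --n;
    }
    return i;
}

// Return a copy of `s` with its first `n` characters (not bytes) removed.
std::string utf8_erase_prefix(const std::string& s, std::size_t n) {
    return s.substr(utf8_offset_of_char(s, n));
}
```

With the eight-byte string "étéabc", removing three characters this way leaves "abc", whereas a plain byte offset of 3 would leave "éabc".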

I dunno about fgets and sscanf but if they don't make any assumptions about 1 character == 1 byte they're prolly safe to use.

thanks -mike

Jon A. Cruz

8:25 p.m.

Mike Hearn wrote:

...

I dunno about fgets and sscanf but if they don't make any assumptions about 1 character == 1 byte they're prolly safe to use.

Actually, they're supposed to operate in the current multibyte character encoding.

Which happens to be UTF-8 on RedHat 8.0, Fedora Core, etc. :-)

MenTaLguY

8:35 p.m.

On Sun, 2005-01-09 at 15:25, Jon A. Cruz wrote:

...

Actually, they're supposed to operate in the current multibyte character encoding.

Which happens to be UTF-8 on RedHat 8.0, Fedora Core, etc. :-)

But of course we still shouldn't rely on that. :-)

-mental
