Careful with C-strings!
Hi all, *Please take this to heart: be very careful when handling C-strings. And try hard to use std::string, even if it involves a bit of work / adding helper functions.*
Another C-string induced memleak was added today. Actually, the change fixed one memleak, but replaces it with the same one. Fixed in r13101.
How to go about things using std::string in this particular case, see for example: http://stackoverflow.com/questions/20446201/how-to-check-if-string-ends-with...
ciao, Johan
On Mon, 2014-03-03 at 22:09 +0100, Johan Engelen wrote:
Another C-string induced memleak was added today. Actually, the change fixed one memleak, but replaces it with the same one. Fixed in r13101.
I'm going to rip out c-strings from uri handling. It's dangerous code and needs to be dealt with.
And then I'm going to make the std::string implimentation of path normalize.
I'll get that into trunk and merge the dom/ removal (which ironically has a uri implementation that uses std::string properly)
So, I expect the code might cause some issues, but I'm going to try and test as much as I can before the push. But please don't revert it, just post to me any errors or critical issues.
Best Regards, Martin Owens
On 3-3-2014 22:47, Martin Owens wrote:
On Mon, 2014-03-03 at 22:09 +0100, Johan Engelen wrote:
Another C-string induced memleak was added today. Actually, the
change fixed one memleak, but replaces it with the same one. Fixed in r13101.
I'm going to rip out c-strings from uri handling. It's dangerous code and needs to be dealt with.
And then I'm going to make the std::string implimentation of path normalize.
I'll get that into trunk and merge the dom/ removal (which ironically has a uri implementation that uses std::string properly)
Good stuff. Perhaps you can create a helper/string.h file with all the string functions you need?
(just thought of it now, but might be complete nonsense:) shouldn't the path stuff be implemented with glib::ustring, for paths that contain, say, Japanese characters?
ciao, Johan
On Mon, 2014-03-03 at 22:56 +0100, Johan Engelen wrote:
Perhaps you can create a helper/string.h file with all the string functions you need?
If I get more than 2, I'll do that.
(just thought of it now, but might be complete nonsense:) shouldn't the path stuff be implemented with glib::ustring, for paths that contain, say, Japanese characters?
KK believes different:
On Fri, 2014-02-07 at 15:40 +0100, Krzysztof Kosiński wrote:
Another thing to note is that std::string should only be used for paths. For UTF-8 strings, we should use Glib::ustring, which has a character-based index operator instead of a byte-based one.
On Fri, 2014-01-24 at 22:16 +0100, Krzysztof Kosiński wrote:
Eventually, all paths should be stored in std::string and all XML content, user-facing strings, and so on in Glib::ustring, but these tricks should help you get by in the meantime.
Jon Cruz: What say you, std::string or std::ustring for paths?
Martin,
There aren't any disadvantages in using Glib::ustring, are there? These answers here http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-f... suggest that there might very well be paths containing utf-8 characters. The problem with std::string is that size() and length() do not return the number of characters but the number of bytes.
Regards, Markus
-----Ursprüngliche Nachricht----- Von: Martin Owens [mailto:doctormo@...400...] Gesendet: Montag, 3. März 2014 23:37 An: Johan Engelen Cc: Inkscape-Devel Betreff: Re: [Inkscape-devel] Careful with C-strings!
On Mon, 2014-03-03 at 22:56 +0100, Johan Engelen wrote:
Perhaps you can create a helper/string.h file with all the string functions you need?
If I get more than 2, I'll do that.
(just thought of it now, but might be complete nonsense:) shouldn't the path stuff be implemented with glib::ustring, for paths that contain, say, Japanese characters?
KK believes different:
On Fri, 2014-02-07 at 15:40 +0100, Krzysztof Kosiński wrote:
Another thing to note is that std::string should only be used for paths. For UTF-8 strings, we should use Glib::ustring, which has a character-based index operator instead of a byte-based one.
On Fri, 2014-01-24 at 22:16 +0100, Krzysztof Kosiński wrote:
Eventually, all paths should be stored in std::string and all XML content, user-facing strings, and so on in Glib::ustring, but these tricks should help you get by in the meantime.
Jon Cruz: What say you, std::string or std::ustring for paths?
Martin,
------------------------------------------------------------------------------ Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce. With Perforce, you get hassle-free workflows. Merge that actually works. Faster operations. Version large binaries. Built-in WAN optimization and the freedom to use Git, Perforce or both. Make the move to Perforce. http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.cl... _______________________________________________ Inkscape-devel mailing list Inkscape-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/inkscape-devel
On Mar 3, 2014, at 2:37 PM, Martin Owens wrote:
On Mon, 2014-03-03 at 22:56 +0100, Johan Engelen wrote:
Perhaps you can create a helper/string.h file with all the string functions you need?
If I get more than 2, I'll do that.
(just thought of it now, but might be complete nonsense:) shouldn't the path stuff be implemented with glib::ustring, for paths that contain, say, Japanese characters?
KK believes different:
On Fri, 2014-02-07 at 15:40 +0100, Krzysztof Kosiński wrote:
Another thing to note is that std::string should only be used for paths. For UTF-8 strings, we should use Glib::ustring, which has a character-based index operator instead of a byte-based one.
On Fri, 2014-01-24 at 22:16 +0100, Krzysztof Kosiński wrote:
Eventually, all paths should be stored in std::string and all XML content, user-facing strings, and so on in Glib::ustring, but these tricks should help you get by in the meantime.
Jon Cruz: What say you, std::string or std::ustring for paths?
Some of it depends on where in the program the strings are going to be used.
In general we want to keep the core of our software to purely Unicode. In our case with GTK+ and GLib that means UTF-8.
User data *may* come in and out of certain points in the locale encoding. For those cases we need to convert appropriately.
Then file IO will be in the filesystem encoding. There too we probably want to convert fairly soon to UTF-8 data, and as we go out to the file system we'll have to convert back. Certain systems may have paths that don't convert cleanly to UTF-8. However those files and/or directories won't show up properly in file browsers, GNOME desktop, etc. so it might be fine to punt on those.
For more advanced use, we might add a system that maps untranslatable file-system paths to a corresponding UI string. And/or we could create a complex data type that included the filesystem string and the 'user/UI' string. These might be extreme measures, though.
Another big gotcha is that URI's are tricky, and the standard is very limited in character set. This then leads to most of the URI support in GTK+/GLib being unusable for our needs. Technically I think we actually might need IRI's.
So... the next step is probably to make a decision on where your path-type operations will fit: are they low-level in filesystem encoding random bytes, or are they slightly higher in UTF-8 data?
If you do decide to operate at the lower-level, then you have to take care to only ever walk strings from the very beginning... never search for path separators but walk for them instead, etc. UTF-8 strings, such as in GLib::ustring, can be searched for '/', '.', etc. since UTF-8 ensures we will never have problems with lead-byte/trail-byte ASCII mismatches and such.
And finally... a completely different approach might be used by leveraging Boost's filesystem lib.
Glib::
2014-03-04 2:08 GMT+01:00 Jon Cruz <jon@...18...>:
Some of it depends on where in the program the strings are going to be used.
In general we want to keep the core of our software to purely Unicode. In our case with GTK+ and GLib that means UTF-8.
User data *may* come in and out of certain points in the locale encoding. For those cases we need to convert appropriately.
Then file IO will be in the filesystem encoding. There too we probably want to convert fairly soon to UTF-8 data, and as we go out to the file system we'll have to convert back. Certain systems may have paths that don't convert cleanly to UTF-8. However those files and/or directories won't show up properly in file browsers, GNOME desktop, etc. so it might be fine to punt on those.
Fortunately, paths untranslatable to UTF-8 can only happen on some old and/or strangely configured Linux systems which use a filename encoding different than UTF-8. On Windows, the filename encoding is always UTF-8 (Glib/GTK convert it transparently to/from UTF-16 used in Windows system calls). I'm not completely sure about OS X, but it also seems to use UTF-8.
To support those old systems, local filenames should be in std::string and almost everything else in Glib::ustring.
For more advanced use, we might add a system that maps untranslatable file-system paths to a corresponding UI string. And/or we could create a complex data type that included the filesystem string and the 'user/UI' string. These might be extreme measures, though.
I think this is not necessary. The "std::string for filenames, Glib::ustring for everything else" convention recommended in the Glib/GTK manuals takes care of all relevant cases.
Another big gotcha is that URI's are tricky, and the standard is very limited in character set. This then leads to most of the URI support in GTK+/GLib being unusable for our needs. Technically I think we actually might need IRI's.
URIs as used in Glib for file access should contain URL-escaped bytes in the filename encoding, so again there is little potential for error as long as we ignore old / strange Linux.
The functions Glib::filename_from_uri and Glib::filename_to_uri can be used to convert between local filenames and URIs.
https://developer.gnome.org/glibmm/2.35/group__CharsetConv.html
Regards, Krzysztof
On Tue, 2014-03-04 at 03:25 +0100, Krzysztof Kosiński wrote:
URIs as used in Glib for file access should contain URL-escaped bytes in the filename encoding, so again there is little potential for error as long as we ignore old / strange Linux.
The functions Glib::filename_from_uri and Glib::filename_to_uri can be used to convert between local filenames and URIs.
OK, I'm going to need at least a few developers to confirm this consensus before I convert everything to using Glib uri handling instead of libxml2. The handling needs to support a few things, not just filenames/paths but also all the rest of the fields libxml2 provides.
Please confirm. std::ustring/GLib::ustring instead of std::string and filename_from_uri instead of libxml2/uri.h
Martin OWens,
On Mar 3, 2014, at 6:51 PM, Martin Owens wrote:
On Tue, 2014-03-04 at 03:25 +0100, Krzysztof Kosiński wrote:
URIs as used in Glib for file access should contain URL-escaped bytes in the filename encoding, so again there is little potential for error as long as we ignore old / strange Linux.
The functions Glib::filename_from_uri and Glib::filename_to_uri can be used to convert between local filenames and URIs.
OK, I'm going to need at least a few developers to confirm this consensus before I convert everything to using Glib uri handling instead of libxml2. The handling needs to support a few things, not just filenames/paths but also all the rest of the fields libxml2 provides.
Please confirm. std::ustring/GLib::ustring instead of std::string and filename_from_uri instead of libxml2/uri.h
Actually...
In the past we found the uri functions, including those, to be unusable as they strictly adhered to the RFC on URI and would fail/throw (depending on the glib or glibmm versions) on many of the cases we wanted to support
We would need to run some tests on all appropriate non-ASCII filenames to make sure things operate correctly.
Another of the points of failure we was was in the recently opened menu MRU support is a bit touchy in different gtk releases.
Martin Owens <doctormo@...360...> writes:
Jon Cruz: What say you, std::string or std::ustring for paths?
Lurker here, but the soon-to-be-canonical solution is std::path, which is currently known as boost::path or std::tr2::path if your compiler is new enough.
The only surprise you need to be aware of, if your string bits are stored as utf8 rather than the native narrow type, is that you need to use:
filesystem::detail::utf8_codecvt_facet utf8; path(myUtf8String, utf);
in boost, or
u8path(myUtf8String);
in tr2.
Aside from that, it's quite nice to work with. You can start doing things like:
ifstream fin(settingsPath / "settings.txt");
and it has all sorts of functionality to handle the platform-dependent differences for you.
Might not be worth switching now, since it's not standard yet, but worth keeping in mind.
The C++ standard group is also working on std::experimental::uri, which is interoperable with std::tr2::path, but that one is based on cpp-netlib rather than Boost, and the standardization effort is only about a year old, so you probably don't want to adopt it yet.
Cheers,
-Rick-
Refernces: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3505.html http://www.boost.org/doc/libs/1_55_0/libs/filesystem/doc/index.htm http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3947.html http://cpp-netlib.org/
2014-03-14 2:04 GMT+01:00 Rick Yorgason <rick@...3093...>:
Martin Owens <doctormo@...360...> writes:
Jon Cruz: What say you, std::string or std::ustring for paths?
Lurker here, but the soon-to-be-canonical solution is std::path, which is currently known as boost::path or std::tr2::path if your compiler is new enough.
boost::path is in Boost.Filesystem, which would mean adding an extra runtime dependency (Boost.Filesystem is one of the few libraries that is not header-only).
Regards, Krzysztof
On Mar 13, 2014, at 6:38 PM, Krzysztof Kosiński wrote:
2014-03-14 2:04 GMT+01:00 Rick Yorgason <rick@...3093...>:
Martin Owens <doctormo@...360...> writes:
Jon Cruz: What say you, std::string or std::ustring for paths?
Lurker here, but the soon-to-be-canonical solution is std::path, which is currently known as boost::path or std::tr2::path if your compiler is new enough.
boost::path is in Boost.Filesystem, which would mean adding an extra runtime dependency (Boost.Filesystem is one of the few libraries that is not header-only).
Yes. Key point.
At some point we can evaluate using non-header libs to include, but we need to do that carefully.
On Mar 13, 2014, at 6:04 PM, Rick Yorgason wrote:
Martin Owens <doctormo@...360...> writes:
Jon Cruz: What say you, std::string or std::ustring for paths?
Lurker here, but the soon-to-be-canonical solution is std::path, which is currently known as boost::path or std::tr2::path if your compiler is new enough.
The only surprise you need to be aware of, if your string bits are stored as utf8 rather than the native narrow type, is that you need to use:
filesystem::detail::utf8_codecvt_facet utf8; path(myUtf8String, utf);
We do hit another issue.
On Windows, there is no safe narrow string encoding for paths. If one uses only ASCII characters to name files and folders, things might appear to work. However, as soon as you encounter something not in the current Windows ANSI Code Page, that falls apart.
One classic example is the Japanese student studying in Germany, where the default local is German with an ANSI Code Page of 1252, whereas the user's home directory is his name... which is composed of Kanji that don't map to CP-1252.
Jon Cruz <jon@...360...> writes:
The only surprise you need to be aware of, if your string bits are
stored as
utf8 rather than the native narrow type, is that you need to use:
filesystem::detail::utf8_codecvt_facet utf8; path(myUtf8String, utf);
We do hit another issue.
On Windows, there is no safe narrow string encoding for paths. If one uses
only ASCII characters to name
files and folders, things might appear to work. However, as soon as you
encounter something not in the
current Windows ANSI Code Page, that falls apart.
Yes, it's not a problem internally, since internally boost::path and std::path use whatever data type is most appropriate to the platform, but unfortunately C++ doesn't have a decent way to differentiate between narrow strings and utf8 strings, so you need to know what kind of string you're holding when you construct the path.
In my work, and I suspect yours, most strings are utf8, so you need to use the workarounds I posted when making paths out of strings.*
You'll notice the std::tr2 version is more concise, since they made a u8path factory, which is quite nice. They also added a u8string() method, which is also a nice addition, although in my work I don't often find myself turning paths into strings; most often it's the other way around.
Since you're wary of adding a new runtime dependency, it's probably best just to wait for std::tr2::filesystem to make its way into std:: before migrating. It might take a couple of years before std::filesystem is reliably included with most compilers, but in the grand timeline of Inkscape, that's not all that long.
-Rick-
* Technically, you could stick to std::u16string, which is assumed to be UTF-encoded, but personally I prefer to stick to utf8.
participants (6)
-
Johan Engelen
-
Jon Cruz
-
Krzysztof Kosiński
-
Markus Engel
-
Martin Owens
-
Rick Yorgason