Hyphenation in Inkscape

Hello,
This is regarding a wishlist bug reported here: https://bugs.launchpad.net/inkscape/+bug/171140 I am writing an extension for hyphenating text when it is justified. I have the first version ready to use for both English and Malayalam (ml_IN), tested in Inkscape 0.46 on Debian Sid. It is available for testing here: http://thottingal.in/projects/inkscape/inkscape-hyphenation.zip
It is built on top of the Python hyphenation code written by Wilbert Berendsen. The hyphenation rules, also called patterns, are those of TeX or OpenOffice itself. A few more changes need to be made: a) Making the extension language independent: should it load all the patterns from a directory while initializing? Or is it okay to ask the user to select the language? As of now I am doing a Unicode range check to differentiate between Malayalam and English, but that will be buggy for other languages. b) On GNU/Linux platforms, can we point to the default hyphenation patterns directory of OpenOffice?
Feedback is welcome.
Thanks, Santhosh Thottingal http://thottingal.in

On Mon, Sep 21, 2009 at 10:57:54AM +0530, Santhosh Thottingal wrote:
This is regarding a wishlist bug reported here: https://bugs.launchpad.net/inkscape/+bug/171140 I am writing an extension for hyphenating text when it is justified.
It is built on top of the Python hyphenation code written by Wilbert Berendsen. The hyphenation rules, also called patterns, are those of TeX or OpenOffice itself.
If we want it within Inkscape rather than as an extension, then we could use libhyphen, which is what OpenOffice itself uses.
A few more changes need to be made: a) Making the extension language independent: should it load all the patterns from a directory while initializing? Or is it okay to ask the user to select the language? As of now I am doing a Unicode range check to differentiate between Malayalam and English, but that will be buggy for other languages.
The right thing to do as far as SVG is concerned is to look at the xml:lang tag (see http://www.w3.org/TR/xml/#sec-lang-tag).
Of course, that then requires that Inkscape have a GUI for selecting bits of text and saying what language they are.
In absence of information from HTTP or MIME headers (not currently available to Inkscape; I suppose we should add a command-line option so that any web browsers or mail clients can pass us any language information they encounter), and in absence of the aforementioned command-line option, I suppose we'd consult the locale.
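The lookup order described above (the nearest xml:lang, then a fallback such as the locale) could be sketched as follows. This is a minimal sketch, not Inkscape code: the function name `effective_lang` and the `fallback` parameter are hypothetical, and the fallback value is left to the caller rather than read from the locale.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root, node, fallback=None):
    """Return the xml:lang that applies to node, else fallback.

    xml:lang is inherited, so we walk from the node up to the root.
    ElementTree has no parent pointers, hence the parent map.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    while node is not None:
        lang = node.get(XML_LANG)
        if lang:
            return lang
        node = parents.get(node)
    return fallback


SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" xml:lang="en">'
    '<text>hyphenation</text>'
    '<text xml:lang="ml">x</text>'
    '</svg>'
)
root = ET.fromstring(SVG)
first_text, second_text = list(root)
bare_root = ET.fromstring('<svg xmlns="http://www.w3.org/2000/svg"><text>x</text></svg>')
```

Here `first_text` inherits "en" from the root element, while `second_text` carries its own "ml".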
(As you say, unicode script range checking is also useful. In many cases, I'd guess that it should even take precedence over xml:lang specification, if the xml:lang language doesn't use this script at all. Though I don't know how to find what scripts a hyphenation dictionary provides for.)
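For illustration, a Unicode range check of the kind discussed here might look like the sketch below. The function name `guess_language` and the default value are assumptions, not taken from the extension; the Malayalam block range (U+0D00 to U+0D7F) is from the Unicode charts.

```python
# Malayalam block in Unicode: U+0D00..U+0D7F.
MALAYALAM_START = 0x0D00
MALAYALAM_END = 0x0D7F


def guess_language(text, default="en_US"):
    """Return 'ml_IN' if any character is in the Malayalam block.

    Everything else falls through to the default, which is exactly
    why this approach breaks down once a third language is involved.
    """
    for ch in text:
        if MALAYALAM_START <= ord(ch) <= MALAYALAM_END:
            return "ml_IN"
    return default


english_guess = guess_language("hyphenation")
malayalam_guess = guess_language("\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02")  # "Malayalam" in Malayalam script
german_guess = guess_language("Stra\u00dfe")  # German text is misreported as en_US
```

The last case shows the limitation raised in the thread: script alone cannot separate languages that share a script.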
Btw, I like the fact that it works by inserting soft hyphens into the text: that may help the result to be more reproducible across SVG renderers.
Does Inkscape's "Convert to text" command do the right thing?
pjrm.

On Mon, Sep 21, 2009 at 3:01 AM, Peter Moulder <Peter.Moulder@...38...> wrote:
On Mon, Sep 21, 2009 at 10:57:54AM +0530, Santhosh Thottingal wrote:
It is built on top of the Python hyphenation code written by Wilbert Berendsen. The hyphenation rules, also called patterns, are those of TeX or OpenOffice itself.
If we want it within Inkscape rather than as an extension, then we could use libhyphen, which is what OpenOffice itself uses.
I think that may be better inside Inkscape too, because we may need to edit the text or the text box after the hyphenation (like in Scribus).
But that makes me ask for LTE (Live Text Effects) support, based on the LPE. Hyphenation may be only the first text effect that needs to be (or works best when) re-applied after editing, so that the layout is rebuilt.
... is it okay to ask the user to select the language?
I think Yes! I frequently need to do/edit multi-language products.

On 9/21/09, Santhosh Thottingal <santhosh.thottingal@...400...> wrote:
This is regarding a wishlist bug reported here: https://bugs.launchpad.net/inkscape/+bug/171140 I am writing an extension for hyphenating text when it is justified. I have the first version ready to use for both English and Malayalam (ml_IN), tested in Inkscape 0.46 on Debian Sid. It is available for testing here: http://thottingal.in/projects/inkscape/inkscape-hyphenation.zip
Thank you very much! I think we really need this, and if we can get it in before 0.47 we must do it. I have tested it on Windows XP and got it working somehow, but here are the issues I found:
1. In

    def loadAllDicts(self, directory):
        for dirname, dirnames, filenames in os.walk('.'):

must really be

    def loadAllDicts(self, directory):
        for dirname, dirnames, filenames in os.walk(directory):

because otherwise it walks whichever directory happens to be current (on Windows).
2. This failed for me:
self.hd += Hyph_dict(filename_full)
complaining that += is not supported for dictionaries. Maybe a different python version is the reason; we ship with 2.5.2 on Windows.
3. Finally and most importantly, while it did add correct hyphen points to the text, Inkscape didn't treat them correctly: it just broke words at those points but didn't insert visible hyphens as it should (in English text). I know that some languages need to insert hyphens and some don't. What is the proper way to fix this? Should Inkscape determine this based on xml:lang?
I also cc: our text expert, Richard Hughes. Richard, what would it take to implement adding visible hyphens on breaks in flowed text?
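The first two issues above can be illustrated with a small sketch: walk the directory that was passed in, and collect results in a plain dict instead of using += on a dictionary object. The function name `find_pattern_files` and the `hyph_<lang>.dic` file naming are assumptions for this example; it does not reproduce the extension's `Hyph_dict` class.

```python
import os
import tempfile


def find_pattern_files(directory):
    """Map language code -> path for each hyph_<lang>.dic under directory."""
    found = {}
    # Walk the directory we were given, not '.', so the result does not
    # depend on the process's current working directory.
    for dirname, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            if filename.startswith("hyph_") and filename.endswith(".dic"):
                lang = filename[len("hyph_"):-len(".dic")]
                # Plain key assignment works on every Python version.
                found[lang] = os.path.join(dirname, filename)
    return found


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("hyph_en_US.dic", "hyph_ml_IN.dic", "README.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_langs = sorted(find_pattern_files(tmp))
```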

On Mon, Sep 21, 2009 at 9:10 PM, bulia byak <buliabyak@...400...> wrote:
Thank you very much! I think we really need this, and if we can get it in before 0.47 we must do it. I have tested it on Windows XP and got it working somehow, but here are the issues I found:
- In

    def loadAllDicts(self, directory):
        for dirname, dirnames, filenames in os.walk('.'):

must really be

    def loadAllDicts(self, directory):
        for dirname, dirnames, filenames in os.walk(directory):

because otherwise it walks whichever directory happens to be current (on Windows).
- This failed for me:
self.hd += Hyph_dict(filename_full)
complaining that += is not supported for dictionaries. Maybe a different python version is the reason; we ship with 2.5.2 on Windows.
Sorry, my mistake. I have corrected these two issues, and the corrected version is available here: http://thottingal.in/projects/inkscape/inkscape-hyphenation.zip
- Finally and most importantly, while it did add correct hyphen points to the text, Inkscape didn't treat them correctly: it just broke words at those points but didn't insert visible hyphens as it should (in English text). I know that some languages need to insert hyphens and some don't. What is the proper way to fix this? Should Inkscape determine this based on xml:lang?
For Malayalam, visible hyphens are not required. Not sure about other languages. The inserted hyphen is "Soft Hyphen - \u00AD"
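As a rough illustration of what inserting soft hyphens means, the sketch below places U+00AD at given break positions in a word. The function name and the idea of passing positions as a list are assumptions; in the extension the positions would come from the pattern-based hyphenator, which is not shown here.

```python
SOFT_HYPHEN = "\u00ad"


def insert_soft_hyphens(word, positions):
    """Insert U+00AD before each index in positions (indices into word)."""
    pieces = []
    previous = 0
    for position in sorted(positions):
        pieces.append(word[previous:position])
        previous = position
    pieces.append(word[previous:])
    return SOFT_HYPHEN.join(pieces)


# Break points chosen by hand for the example: hy-phen-ation.
hyphenated = insert_soft_hyphens("hyphenation", [2, 6])
```

The result has the same visible letters as the input; only the invisible break opportunities are added.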
I also cc: our text expert, Richard Hughes. Richard, what would it take to implement adding visible hyphens on breaks in flowed text?
Thanks. Santhosh

On 9/21/09, Santhosh Thottingal <santhosh.thottingal@...400...> wrote:
For Malayalam, visible hyphens are not required. Not sure about other languages. The inserted hyphen is "Soft Hyphen - \u00AD"
And in German, even word spelling may change at the hyphenated boundary. So it looks unlikely that a soft hyphen-based solution will solve all our hyphenation needs. What we really need is a dictionary of "word, pre-break, post-break" triads for each language. But still, soft hyphen is a legal Unicode character and we must support it anyway. I'm just wondering if it is supposed to be treated differently depending on language; from what I read, it looks like it is always replaced with a visible hyphen when a break occurs. So it may be a workable solution for English and other languages but not for Malayalam :( Please correct me if I'm wrong.
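A dictionary of "word, pre-break, post-break" triads could be as simple as the sketch below. The two entries follow traditional (pre-1996) German orthography and are only illustrative; the names `BREAK_TRIADS` and `break_word` are hypothetical.

```python
# word -> (text that ends the first line, text that starts the next line)
BREAK_TRIADS = {
    "Zucker": ("Zuk-", "ker"),           # "ck" becomes "k-k" when broken
    "Schiffahrt": ("Schiff-", "fahrt"),  # the dropped third "f" reappears
}


def break_word(word, triads=BREAK_TRIADS):
    """Return the (pre-break, post-break) pair for word, or None if unknown."""
    return triads.get(word)


zucker_parts = break_word("Zucker")
unknown_parts = break_word("Inkscape")
```

Neither respelling can be expressed by a soft hyphen alone, which is the point made above.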

On Monday, September 21, 2009, 6:19:38 PM, bulia wrote:
bb> On 9/21/09, Santhosh Thottingal <santhosh.thottingal@...400...> wrote:
For Malayalam, visible hyphens are not required. Not sure about other languages. The inserted hyphen is "Soft Hyphen - \u00AD"
bb> And in German, even word spelling may change at the hyphenated boundary. So it looks unlikely that a soft hyphen-based solution will solve all our hyphenation needs. What we really need is a dictionary
Yes, for the general case.
For some specific cases, text that has had soft hyphens inserted will break better than text which has not had them inserted.
For cases where soft hyphen doesn't help well, don't use it.
None of which precludes later adding dictionary based breaking (which is required for Thai, for example) as well.
bb> of "word, pre-break, post-break" triads for each language. But still, soft hyphen is a legal Unicode character and we must support it anyway. I'm just wondering if it is supposed to be treated differently depending on language; from what I read, it looks like it is always replaced with a visible hyphen when a break occurs. So it may be a workable solution for English and other languages but not for Malayalam :( Please correct me if I'm wrong.
No, you are right. A soft hyphen is inserted into text to help a layout system figure out where to break a word with a hyphen at the end of the line.
If Malayalam (and indeed other languages) don't use hyphens at the end of lines to indicate broken words, then clearly the users of that language will not normally be inserting the soft hyphens. But if by chance they do, well, the hyphen will be displayed if it falls at the end of a line.

On Thu, Sep 24, 2009 at 12:02:07PM +0200, Chris Lilley wrote:
On Monday, September 21, 2009, 6:19:38 PM, bulia wrote:
bb> But still, soft hyphen is a legal Unicode character and we must support it anyway. I'm just wondering if it is supposed to be treated differently depending on language; from what I read, it looks like it is always replaced with a visible hyphen when a break occurs. So it may be a workable solution for English and other languages but not for Malayalam :( Please correct me if I'm wrong.
No, you are right. A soft hyphen is inserted into text to help a layout system figure out where to break a word with a hyphen at the end of the line.
If Malayalam (and indeed other languages) don't use hyphens at the end of lines to indicate broken words, then clearly the users of that language will not normally be inserting the soft hyphens. But if by chance they do, well, the hyphen will be displayed if it falls at the end of a line.
If I correctly understand the above, then it conflicts with section 5.4 of Unicode Annex #14 (http://www.unicode.org/unicode/reports/tr14/#SoftHyphen):
“Depending on the language and the word, that may produce different visible results — for example:
* Simply inserting a hyphen glyph
* Inserting a hyphen glyph and changing spelling in the divided word parts
* Not showing any visible change and simply breaking at that point
* Inserting a hyphen glyph at the beginning of the new line”
Bulia, Chris, what is the source of your information? Is there a conflict of standards, or were you going by an informal source?
pjrm.

My interpretation of the Unicode standard (version 5.1) as it applies to U+00AD occurring within the content of a <text> or <tspan> element is that U+00AD is for indicating possible break points, but <text> and <tspan> in SVG never do line wrapping, so U+00AD should always be invisible and non-advancing when it occurs in the content of <text> or <tspan>.
I believe that the behaviour of rendering each U+00AD character as an (English) hyphen glyph is from Unicode versions prior to 4.0, when U+00AD was changed from a dash (Pd) to an ignorable formatting character (Cf).
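The current classification can be checked with Python's standard `unicodedata` module, which ships the character database of a recent Unicode version. This is only a quick check of the general category, not a statement about how any renderer behaves.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"
HYPHEN_MINUS = "-"

shy_name = unicodedata.name(SOFT_HYPHEN)
shy_category = unicodedata.category(SOFT_HYPHEN)      # format character
hyphen_category = unicodedata.category(HYPHEN_MINUS)  # dash punctuation

print(shy_name, shy_category, hyphen_category)
```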
In other words, I believe that Inkscape's existing behaviour is correct for <text> and <tspan>.
I should say that this is based primarily on UAX#14 §5.4 (http://www.unicode.org/unicode/reports/tr14/#SoftHyphen), perhaps supplemented by Unicode FAQ entry http://www.unicode.org/faq/unsup_char.html#3.
If <flowRoot> were part of SVG, or if we were to change our representation of flowed text to either <textArea> or <foreignObject> of xhtml, then I believe that the appropriate behaviour is as per the aforementioned http://www.unicode.org/unicode/reports/tr14/#SoftHyphen, i.e. that the appropriate rendering should depend on "the word and the language".
For the moment, I won't say much about what test to use to determine which hyphenation behaviour to use (Malayalam vs English vs Arabic etc.); though the approach I've used in other text-rendering software is to pretty much ignore xml:lang, instead looking at the script (g_unichar_get_script) of the character preceding the U+00AD (skipping any characters that give G_UNICODE_SCRIPT_INHERITED). If there is no such previous character on the same line, or if the script is one of the many for which I don't know the correct behaviour, then I don't add a glyph. (This actually seems to be the right behaviour for most scripts, and is also technically correct in terms of the above-referenced FAQ entry.)
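The approach described above could be approximated in Python as follows. This is a loose sketch under several assumptions: Python's standard library has no equivalent of g_unichar_get_script, so the script is guessed from the first word of the character name; combining marks (general category M*) stand in for G_UNICODE_SCRIPT_INHERITED; and the set of scripts that get a visible hyphen contains only Latin, purely for illustration.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

# Assumption for this sketch: only Latin is known to want a hyphen glyph.
SCRIPTS_WITH_VISIBLE_HYPHEN = {"LATIN"}


def rough_script(ch):
    """Crude script guess: first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def needs_hyphen_glyph(line, shy_index):
    """Decide whether a soft hyphen at shy_index should show a glyph.

    Looks at the character before the soft hyphen, skipping combining
    marks. With no preceding character, or an unknown script, no glyph
    is added.
    """
    i = shy_index - 1
    while i >= 0 and unicodedata.category(line[i]).startswith("M"):
        i -= 1
    if i < 0:
        return False
    return rough_script(line[i]) in SCRIPTS_WITH_VISIBLE_HYPHEN


latin_result = needs_hyphen_glyph("hy\u00adphen", 2)
malayalam_result = needs_hyphen_glyph("\u0d2e\u0d32\u00ad\u0d2f", 2)
```

Defaulting to "no glyph" for unknown scripts mirrors the conservative choice described in the message.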
pjrm.

- Finally and most importantly, while it did add correct hyphen points to the text, Inkscape didn't treat them correctly: it just broke words at those points but didn't insert visible hyphens as it should (in English text). I know that some languages need to insert hyphens and some don't. What is the proper way to fix this? Should Inkscape determine this based on xml:lang?
I also cc: our text expert, Richard Hughes. Richard, what would it take to implement adding visible hyphens on breaks in flowed text?
It's one of those things that is supposed to work, but since we've never had a normal UI that inserts them (Ctrl-U doesn't count) they've never been subjected to any real testing. I'll take a look.
I wasn't aware that some languages don't use visible hyphens. I could hope that Pango reports that the soft hyphen is a zero-width invisible character for those languages, but unfortunately hoping for something does not necessarily make it true.
R.

For selecting the correct language for the hyphenation(/spelling) we need to specify the document language in the document preferences. Also we should have a default document language in the Inkscape preferences.
My 2ct. Adib. ---
On Mon, Sep 21, 2009 at 7:50 PM, Richard Hughes <cyreve@...400...> wrote:
- Finally and most importantly, while it did add correct hyphen points to the text, Inkscape didn't treat them correctly: it just broke words at those points but didn't insert visible hyphens as it should (in English text). I know that some languages need to insert hyphens and some don't. What is the proper way to fix this? Should Inkscape determine this based on xml:lang?
I also cc: our text expert, Richard Hughes. Richard, what would it take to implement adding visible hyphens on breaks in flowed text?
It's one of those things that is supposed to work, but since we've never had a normal UI that inserts them (Ctrl-U doesn't count) they've never been subjected to any real testing. I'll take a look.
I wasn't aware that some languages don't use visible hyphens. I could hope that Pango reports that the soft hyphen is a zero-width invisible character for those languages, but unfortunately hoping for something does not necessarily make it true.
R.

I also cc: our text expert, Richard Hughes. Richard, what would it take to implement adding visible hyphens on breaks in flowed text?
It's one of those things that is supposed to work, but since we've never had a normal UI that inserts them (Ctrl-U doesn't count) they've never been subjected to any real testing. I'll take a look.
I wasn't aware that some languages don't use visible hyphens. I could hope that Pango reports that the soft hyphen is a zero-width invisible character for those languages, but unfortunately hoping for something does not necessarily make it true.
Hmm. Pango returns us a glyph index of 0xfffffff for the soft hyphen character, so we can never display it. This is a bit of a problem because the soft hyphen character does have a glyph and it is that glyph that we are supposed to draw, but Pango won't tell us what it is. We could attempt to poke around in the font and look for the normal hyphen glyph, but (a) that might look different and (b) it would subvert any script-specific handling that Pango did for the aforementioned languages without visible hyphens.
The post [1] is a proposal to change the behaviour to the current one, but I can't see a commit that implements the proposal so it may not have happened like that. The post [2] is much more recent and is basically asking the same question I am; no reply.
In any case, it seems very unlikely that any kind of fix is going to be suitable for 0.47.
Richard.
[1] http://mail.gnome.org/archives/gtk-i18n-list/2003-May/msg00016.html [2] http://mail.gnome.org/archives/gtk-i18n-list/2009-April/msg00008.html

On Mon, Sep 21, 2009 at 4:24 PM, Richard Hughes <cyreve@...400...> wrote:
In any case, it seems very unlikely that any kind of fix is going to be suitable for 0.47.
That's too bad... but in any case, why don't we push them on this a little harder...

On Mon, Sep 21, 2009 at 04:35:58PM -0400, bulia byak wrote:
On Mon, Sep 21, 2009 at 4:24 PM, Richard Hughes <cyreve@...400...> wrote:
In any case, it seems very unlikely that any kind of fix is going to be suitable for 0.47.
That's too bad... but in any case, why don't we push them on this a little harder...
We have enough trouble with other SVG renderers doing basic things like coloured text, let alone language-sensitive soft-hyphen handling.
Librsvg and a Webkit-based browser I have do the same as current inkscape, while firefox 3.0 and batik svn render all soft hyphen characters the same as hyphen characters, even when they occur in the middle of a line.
Accordingly, I don't think we should encourage use of soft hyphens in SVG, and I think we should implement basic features like underlines before trying to implement behaviour that hasn't even been codified in any standard yet.
More important would be to facilitate nesting SVG inside of xhtml documents as supported by Firefox (and Amaya, I'm told).
To a lesser extent, it may be good to support xhtml foreignObject's inside SVG; though this isn't usually as good as SVG in a textual document, which is better at moving graphics around in response to changing text length.
(If it's useful, I can provide either a library to render (simple) XHTML to a cairo surface, or a program to take XHTML on stdin and write SVG on stdout.)
By delegating hyphenation to software designed for text rendering, we have more chance of getting good hyphenation not just in SVG documents (which pretty much never use body text, let alone hyphenation) but in text documents where hyphenation is more valuable. As has already been noted, hyphenation is difficult: both in choosing where hyphens may be placed (which itself is an unsolved problem for some languages), and in choosing how to render those hyphenations (choice of hyphen glyph if any, respelling), and in choosing line breaks to avoid hyphenating when possible. Hyphenation is usually used in the context of justification, and that too is difficult to do well. (Think kashidas, alternate glyphs, choosing the trade off between word spacing, glyph stretching, letter spacing, and at what point one should simply have space at the end of the line.)
pjrm.

On 9/22/09, Peter Moulder <Peter.Moulder@...38...> wrote:
We have enough trouble with other SVG renderers doing basic things like coloured text, let alone language-sensitive soft-hyphen handling.
Frankly, I would be happy if they would just always render a hyphen on breaks. This would work for a majority of languages. Current behavior is only useful for a minority.
The level of implementation of various standards is no excuse, IMHO, to not implement a particular feature if it is doable in principle and often requested by the users.
Librsvg and a Webkit-based browser I have do the same as current inkscape, while firefox 3.0 and batik svn render all soft hyphen characters the same as hyphen characters, even when they occur in the middle of a line.
That's a sad picture, I agree.
Accordingly, I don't think we should encourage use of soft hyphens in SVG,
On the contrary, we should use our clout to encourage a better support for it. If the standard is deficient, we should select an interpretation that makes most sense and make it a de facto standard.
and I think we should implement basic features like underlines before trying to implement behaviour that hasn't even been codified in any standard yet.
I absolutely don't see why one should preclude the other. Both are things requested by users. It's just that for hyphenation, there's a seemingly easy way to achieve it without changing Inkscape itself, hence this extension. Too bad if it has to be turned down because of a poor standard and a lazy software community that doesn't push the standard to become more useful.
By delegating hyphenation to software designed for text rendering, we have more chance of getting good hyphenation not just in SVG documents (which pretty much never use body text, let alone hyphenation)
On the contrary, Inkscape is already used a lot for single-page leaflets, scientific posters, and other documents with a lot of flowed text. If we add hyphenation, it will be used even more.
hyphenation is more valuable. As has already been noted, hyphenation is difficult: both in choosing where hyphens may be placed (which itself is an unsolved problem for some languages), and in choosing how to render those hyphenations (choice of hyphen glyph if any, respelling), and in choosing line breaks to avoid hyphenating when possible.
That a problem is difficult to solve in the general case is no excuse to not try to solve it in a special case that will be useful in 95% of situations.

My take on hyphenation:
- It absolutely needs to be a core feature (because then we can add other nice things like fit text to frame), and it's a high priority one since it's very useful for scientific posters. This "market" is more important than it looks since it is often the first exposure to vector drawing apps for college students, and thus a valuable entry point.
- We should use libhyphen instead of creating our own library, and use OO.o hyphenation patterns that are installed on the system - in the same way we use Aspell. For Windows we can make them downloadable at install if they are big (though it will require some NSIS wizardry).
- It could be a vector effect in theory, but in practice it shouldn't be more complicated than a checkbox option in the text tool that defaults to "on".
- At the XML level, we should insert soft hyphens only at line ends, instead of everywhere a break is possible. We can recover the original text by removing all soft hyphens. (This might conflict with custom hyphenation though.) While rendering, we feed Pango with a string that has the soft hyphens replaced with normal ones for the time being.
- Language selection: there should be a context menu item that lets you select the language for the given text or a set of them, and a document-wide default for new texts, which in turn defaults to LANG. I think it's sufficient to provide object-level granularity using xml:lang - higher granularity would require XML and UI kludges. Asking email clients and browsers about language information is totally wrong, because hyphenation is supposed to be an edit-only feature: the SVG contains information about whether the text is hyphenated or not, a 'rendering' of hyphenated text, and maybe the original unhyphenated text (but we should try to avoid that extra copy of minimally altered text).
- XHTML foreign objects in SVG: bad idea. We wander into browser-land and it would destroy our compatibility with e.g. librsvg, which will never handle such things.
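The XML-level proposal in the list above (soft hyphens stored only at line ends, removed to recover the original text, replaced with normal hyphens before rendering) comes down to two string operations. The sketch below uses hypothetical function names and plain Python strings; it does not call Pango.

```python
SOFT_HYPHEN = "\u00ad"


def original_text(stored):
    """Recover the unhyphenated text by dropping every soft hyphen."""
    return stored.replace(SOFT_HYPHEN, "")


def text_for_rendering(stored):
    """Replace soft hyphens with ordinary hyphens before layout.

    This only makes sense if soft hyphens are stored at actual line
    ends, as proposed; otherwise every break opportunity would show.
    """
    return stored.replace(SOFT_HYPHEN, "-")


stored_line = "hyphen\u00ad"  # a line that ends in the middle of a word
recovered = original_text(stored_line + "ation")
rendered = text_for_rendering(stored_line)
```

As the proposal itself notes, this scheme cannot tell an automatically inserted soft hyphen from one the user typed by hand.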
Regards, Krzysztof

On 9/22/09, Krzysztof Kosiński <tweenk.pl@...400...> wrote:
My take on hyphenation:
- It absolutely needs to be a core feature (because then we can add other nice things like fit text to frame)
...we already have it :)
- We should use libhyphen instead of creating our own library, and use OO.o hyphenation patterns that are installed on the system - in the same way we use Aspell. For Windows we can make them downloadable at install if they are big (though it will require some NSIS wizardry).
if someone can code that, it would be fantastic
- It could be a vector effect in theory, but in practice it shouldn't be more complicated than a checkbox option in the text tool that defaults to "on"
not really a vector effect, but just an attribute of the flowRoot element which must go into the Inkscape-only branch of a switch entirely, as planned long ago, with an auto-updated svg:text equivalent in the other branch

If the question is "should we push other renderers to change their rendering of soft hyphen inside a flowRoot", then I believe the question reduces to "should we push Batik to change its rendering of soft hyphen inside a flowRoot". For other renderers, any push would be in the form of contributing to SVG 1.2 Full, and I believe the working group aren't looking at that until SVG 1.2 Mobile has progressed.
(As for Batik, I gather that there's very little active development, though we could probably get it in if we were to write the patch ourselves.)
If the question is "should we push other renderers to change their rendering of soft hyphen in <text>/<tspan>, and change in what way", then we might first ask how Inkscape should handle hyphenation when converting a flowRoot to a <text> of <tspan>'s. I suspect that the practical answer is that we would generate tspans that contain the actual text we want rendered, including any respelling and whatever hyphen glyphs we think are appropriate for the language/script.
If we want to recover the original unflowed text (to allow subsequent editing within Inkscape, for example), the easiest option for us is to retain the whole flowRoot (whether in an inkscape-namespace element, or inside a <switch> as a sibling to the <text>). This also allows recovering forced line breaks from the original text.
If instead we introduce a special attribute to recover line breaks and respellings, then we might as well use that same mechanism to recover original hyphenation.
pjrm.

On Tue, Sep 22, 2009 at 16:39, bulia byak <buliabyak@...400...> wrote:
By delegating hyphenation to software designed for text rendering, we have more chance of getting good hyphenation not just in SVG documents (which pretty much never use body text, let alone hyphenation)
On the contrary, Inkscape is already used a lot for single-page leaflets, scientific posters, and other documents with a lot of flowed text. If we add hyphenation, it will be used even more.
Indeed, Inkscape is used for scientific posters which often include quite a bit of text. Considering that the usual alternative is PowerPoint (yes, really, people are designing 1.5x1 m posters in PPT) using Inkscape as it works now is already much better. Having (good) hyphenation would definitely be an immense plus. I add "good" because, even if scientific researchers are usually not known for their sense of design (see above...), they are used to press-quality text layout in articles or latex-level hyphenation support. So hyphenation is an area where they might be able to tell the difference.
An alternative would be to have an easy way to edit SVGs with Inkscape from within a Scribus document. Scribus has support for hyphenation (not excellent as far as I can tell, but good enough) and produces very high quality PDFs that facilitate printing. Would it be even imaginable to have a right-click option "Edit in Inkscape" for SVGs in Scribus?
JiHO --- http://maururu.net

On Mon, Sep 21, 2009 at 11:40:16AM -0400, bulia byak wrote:
- Finally and most importantly, while it did add correct hyphen points to the text, Inkscape didn't treat them correctly: it just broke words at those points but didn't insert visible hyphens as it should (in English text). I know that some languages need to insert hyphens and some don't. What is the proper way to fix this? Should Inkscape determine this based on xml:lang?
I haven't checked whether the SVG standard makes specific mention of soft hyphen, but the following may be relevant:
http://www.unicode.org/unicode/reports/tr14/#SoftHyphen
Presumably unicode script range would also be relevant, and may even be the primary mechanism, using xml:lang only when script isn't enough to determine hyphen appearance.
Though the above document doesn't actually give details of what rules to use, unfortunately, i.e. it doesn't refer us to any details of how U+00ad should be rendered in different circumstances, just what things it should depend on, and gives some examples of possible renderings (typically without mentioning the circumstances).
I might also draw attention to the wikipedia page for hyphen (http://en.wikipedia.org/wiki/Hyphen), whose references section points to some controversy as to how U+00ad should be handled.
pjrm.

On Tuesday, September 22, 2009, 2:05:55 AM, Peter wrote:
PM> On Mon, Sep 21, 2009 at 11:40:16AM -0400, bulia byak wrote:
- Finally and most importantly, while it did add correct hyphen points to the text, Inkscape didn't treat them correctly: it just broke words at those points but didn't insert visible hyphens as it should (in English text). I know that some languages need to insert hyphens and some don't. What is the proper way to fix this? Should Inkscape determine this based on xml:lang?
PM> I haven't checked whether the SVG standard makes specific mention of soft hyphen, but the following may be relevant:
As you point out, the definition of the soft hyphen is in Unicode, not in SVG. SVG does not override the Unicode standard, but normatively references it.
PM> http://www.unicode.org/unicode/reports/tr14/#SoftHyphen
I agree that a soft hyphen is the correct way to indicate a line-break opportunity in Western languages which use hyphenation on word breaks. The soft hyphen should not render unless it is at the end of the line.
PM> I might also draw attention to the wikipedia page for hyphen (http://en.wikipedia.org/wiki/Hyphen), whose references section points to some controversy as to how U+00ad should be handled.
It seems fairly clear to me:
"the concept of a soft hyphen was introduced to allow manual specification of a place where a hyphenated break was allowed without forcing a line break in an inconvenient place if the text was later reflowed. In contrast, a hyphen that is always displayed and printed is called a hard hyphen"
"When flowing text, a system may consider the soft hyphen to be a point at which a word may be broken, and display a hyphen at the end of the broken line; if the line is not broken at that point the hyphen is not displayed"

On Thu, Sep 24, 2009 at 11:55:37AM +0200, Chris Lilley wrote:
PM> I might also draw attention to the wikipedia page for hyphen (http://en.wikipedia.org/wiki/Hyphen), whose references section points to some controversy as to how U+00ad should be handled.
It seems fairly clear to me: [Quotations from main text of said page.]
The text Chris quotes is from the main text of that Wikipedia page rather than from a standard as I'd initially thought on reading this message. I can't find this text in any authoritative source, and it seems to be discussing soft hyphens generally, without considering differences between different standards.
I think the reason that Chris quoted from the main text is because I wrongly indicated that it was the “references section” that pointed to controversy (I should have said ‘dissent’ or ‘disagreement’ or ‘discussion suggesting further change’ in how U+00AD should be handled); when in fact the page has both a ‘References’ section and an ‘External links’ section, and it's the latter that points to this discussion, specifically with the 3rd and 4th URIs:
* Jukka Korpela, Soft hyphen (SHY) - a hard problem? [http://www.cs.tut.fi/~jkorpela/shy.html]
* Markus Kuhn, Unicode interpretation of SOFT HYPHEN breaks ISO 8859-1 compatibility. Unicode Technical Committee document L2/03-155R, June 2003. [http://www.cl.cam.ac.uk/~mgk25/ucs/L2/03155r-kuhn-soft-hyphen.pdf]
I'm not particularly suggesting that people read the above, rather I'm just pointing out that it's still possible for a future revision of Unicode to revert to the behaviour of one of the earlier standards.
pjrm.

participants (9)
- Aurélio A. Heckert
- bulia byak
- Chris Lilley
- JiHO
- Krzysztof Kosiński
- Peter Moulder
- Richard Hughes
- Santhosh Thottingal
- the Adib