Hello David,
You're right, it involves using both split lines and remove manual kerns, in that order. If you have, for instance, two columns of text, the import will collect them into one <text> object with several <tspan> elements and place every single character using dx and dy offsets.
To keep the whole thing in place, you need to split the lines first, then remove the manual kerning. If you remove the manual kerns first, you end up with a single column of text with tspans. The benefit is that you can edit the text more easily, but it doesn't look remotely like the PDF document.
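To illustrate, here is a simplified sketch of the kind of markup involved (hypothetical, hand-written SVG, not actual Inkscape output; coordinates are made up):

    <!-- Roughly what the PDF import produces: one <text>, every
         glyph positioned individually via per-character dx lists. -->
    <text x="10" y="20">
      <tspan dx="0 8 7 6">this</tspan>
      <tspan dx="40 8">is</tspan>
    </text>

    <!-- After splitting the lines and removing the manual kerns,
         each run becomes ordinary flowing text at a fixed origin. -->
    <text x="10" y="20">this</text>
    <text x="71" y="20">is</text>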
A function that collected grouped text and placed it into a single <text> with tspans in a similar way would be extremely useful, even if the result weren't that editable. Text search and retrieval would become a lot easier with it.
I think, however, that a smarter algorithm that collected characters into the same string and placed them on a baseline, defined by the tspan element and its position relative to the <text> tag, would be a giant step towards more usable PDF text import. The same seems to go for PS text.
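A minimal sketch of what such a grouping pass could look like (the data model and tolerance are assumptions for illustration, not existing Inkscape code):

    from collections import defaultdict

    def group_by_baseline(glyphs, y_tolerance=0.5):
        """Collect per-character glyphs into strings sharing a baseline.

        `glyphs` is a list of (char, x, y) tuples, one per rendered
        character; characters whose y coordinates agree within roughly
        `y_tolerance` are treated as lying on the same baseline.
        """
        lines = defaultdict(list)
        for char, x, y in glyphs:
            # Quantize y so near-equal baselines fall into one bucket.
            key = round(y / y_tolerance)
            lines[key].append((x, char))
        # Within each baseline, order characters left to right.
        return ["".join(c for _, c in sorted(chars))
                for _, chars in sorted(lines.items())]

    # Two characters on one baseline, one on another:
    print(group_by_baseline([("h", 18.0, 20.1), ("t", 10.0, 20.0),
                             ("x", 10.0, 40.0)]))
    # -> ['th', 'x']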
BTW, if the PDF import comes across unknown characters, it treats them as escape codes and creates a new tspan rather than using an undefined glyph to fill the hole (I get that when "()" appear in text originally generated on a Chinese-language version of Windows, for instance; probably some GB2312-to-Unicode conversion bug).
Anyway, keep up the good work. Typesetting hasn't traditionally been Inkscape's strong point, so any work on it is greatly appreciated.
Cheers
Jelle
On Mon, 03 Dec 2012 10:44:31 +0800, inkscape-devel-request@lists.sourceforge.net wrote:
Split text goes the other way: it breaks strings into smaller pieces. Maybe you meant Text -> Remove Manual Kerns? AFAIK there is no function that merges two <text> elements, other than doing it manually: cutting one and pasting it into the end of the other, which will generally move the second <text>. The "split text" extension also moves the component pieces, and operates on the entire <text>, not just a selected substring, but I guess I could modify it to be better behaved.

There is another problem with some PS files: they drop all the spaces, so that "this is text" becomes the character sequence {t,h,i,s,i,s,t,e,x,t}. The code I'm working on will have an option to try to reinsert the spaces based on the letter spacing.

Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech
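One way such a space-reinsertion heuristic could work, sketched in Python (the data model and threshold are assumptions for illustration; this is not the code David describes):

    def reinsert_spaces(glyphs, gap_factor=0.3):
        """Rebuild a line of text from spaceless glyphs.

        `glyphs` is a list of (char, x, width) tuples, already sorted
        left to right. A space is inserted wherever the gap between
        one glyph's right edge and the next glyph's left edge exceeds
        `gap_factor` times the average glyph width.
        """
        if not glyphs:
            return ""
        avg_width = sum(w for _, _, w in glyphs) / len(glyphs)
        out = [glyphs[0][0]]
        for (_, x0, w0), (c1, x1, _) in zip(glyphs, glyphs[1:]):
            if x1 - (x0 + w0) > gap_factor * avg_width:
                out.append(" ")
            out.append(c1)
        return "".join(out)

    # "thisis" with a wide gap between the two words:
    glyphs = [("t", 0, 5), ("h", 5, 5), ("i", 10, 2), ("s", 12, 4),
              ("i", 22, 2), ("s", 24, 4)]
    print(reinsert_spaces(glyphs))  # -> "this is"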
On 02-Dec-2012 22:00, Jelle wrote:
BTW, if the PDF import comes across unknown characters, it treats them as escape codes and creates a new tspan rather than using an undefined glyph to fill the hole (I get that when "()" appear in text originally generated on a Chinese-language version of Windows, for instance; probably some GB2312-to-Unicode conversion bug).
PDFs have other problems, the biggest being that they need not encode the text at all (they can just embed images of the page), or they may not use a standard encoding like ASCII or Unicode. The encoding may not even be 1 glyph : 1 letter. For the latter issue, search for "PDF ligatures"; for a discussion of alternate encodings see, for instance:
http://stackoverflow.com/questions/1983561/why-a-pdf-document-could-be-not-s...
If you run into a PDF with a nonstandard encoding you will still be able to select "text" in a PDF reader, but if you cut and paste it into another program it will be garbage. Search will also not find anything unless you search for the garbage rather than what the words say on the screen. For instance, you might see "Atom" in Acrobat, but in the PDF it might actually be encoded as "v2ht" (or any other apparently random combination of letters).
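To make the mechanism concrete, here is a toy model in Python (an illustration only; real PDF font encodings are considerably more involved):

    # The PDF content stream stores arbitrary glyph codes; only the
    # embedded font knows which visible shapes they draw. Copy/paste
    # and search see the stored codes, not the rendered letters.
    private_encoding = {"v": "A", "2": "t", "h": "o", "t": "m"}

    stored_text = "v2ht"  # what extraction and search actually get
    rendered = "".join(private_encoding[c] for c in stored_text)
    print(rendered)  # -> "Atom", what the reader shows on screen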
Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech