Hello David,
You're right, it involves using both split lines and remove manual kerns, in that order. If you have, for instance, two columns of text, the import will collect them into one <text> object with several <tspan> elements and place every single character using dx and dy offsets.
To keep the whole thing in place, you need to split the lines first, then remove the manual kerning. If you remove the manual kerns first, you end up with a single column of text with tspans. The benefit is that you can edit the text more easily, but it doesn't look remotely like the PDF document.
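To illustrate, here is a simplified sketch of the kind of markup involved (hypothetical, hand-written SVG, not actual Inkscape output; coordinates are made up):

    <!-- Roughly what the PDF import produces: one <text>, every
         glyph positioned individually via per-character dx lists. -->
    <text x="10" y="20">
      <tspan dx="0 8 7 6">this</tspan>
      <tspan dx="40 8">is</tspan>
    </text>

    <!-- After splitting the lines and removing the manual kerns,
         each run becomes ordinary flowing text at a fixed origin. -->
    <text x="10" y="20">this</text>
    <text x="71" y="20">is</text>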
A function that collected grouped text and placed it into a single <text> with tspans in a similar way would be extremely useful, even if the result weren't that editable. Text search and retrieval would become a lot easier with it.
I think, however, that a smarter algorithm that collected characters into the same string and placed them on a baseline, defined by the tspan element and its position relative to the <text> tag, would be a giant step towards more usable PDF text import. The same seems to go for PS text.
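A minimal sketch of what such a grouping pass could look like (the data model and tolerance are assumptions for illustration, not existing Inkscape code):

    from collections import defaultdict

    def group_by_baseline(glyphs, y_tolerance=0.5):
        """Collect per-character glyphs into strings sharing a baseline.

        `glyphs` is a list of (char, x, y) tuples, one per rendered
        character; characters whose y coordinates agree within roughly
        `y_tolerance` are treated as lying on the same baseline.
        """
        lines = defaultdict(list)
        for char, x, y in glyphs:
            # Quantize y so near-equal baselines fall into one bucket.
            key = round(y / y_tolerance)
            lines[key].append((x, char))
        # Within each baseline, order characters left to right.
        return ["".join(c for _, c in sorted(chars))
                for _, chars in sorted(lines.items())]

    # Two characters on one baseline, one on another:
    print(group_by_baseline([("h", 18.0, 20.1), ("t", 10.0, 20.0),
                             ("x", 10.0, 40.0)]))
    # -> ['th', 'x']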
BTW, if the PDF import comes across unknown characters, it treats them as escape codes and creates a new tspan rather than using an undefined glyph to fill the hole (I get that when "()" appear in text originally generated on a Chinese-language version of Windows, for instance; probably some GB2312-to-Unicode conversion bug).
Anyway, keep up the good work. Typesetting hasn't traditionally been Inkscape's strong point, so any work on it is greatly appreciated.
Cheers
Jelle
On Mon, 03 Dec 2012 10:44:31 +0800, inkscape-devel-request@lists.sourceforge.net wrote:
Split text goes the other way: it breaks strings into smaller pieces. Maybe you meant Text -> Remove Manual Kerns? AFAIK there is no function that merges two <text> elements, other than doing it manually: cutting one and pasting it into the end of the other, which will generally move the second <text>. The "split text" extension also moves the component pieces, and operates on the entire <text>, not just a selected substring, but I guess I could modify it to be better behaved.

There is another problem with some PS files: they drop all the spaces, so that "this is text" becomes the character sequence {t,h,i,s,i,s,t,e,x,t}. The code I'm working on will have an option to try to reinsert the spaces based on the letter spacing.

Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech
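One way such a space-reinsertion heuristic could work, sketched in Python (the data model and threshold are assumptions for illustration; this is not the code David describes):

    def reinsert_spaces(glyphs, gap_factor=0.3):
        """Rebuild a line of text from spaceless glyphs.

        `glyphs` is a list of (char, x, width) tuples, already sorted
        left to right. A space is inserted wherever the gap between
        one glyph's right edge and the next glyph's left edge exceeds
        `gap_factor` times the average glyph width.
        """
        if not glyphs:
            return ""
        avg_width = sum(w for _, _, w in glyphs) / len(glyphs)
        out = [glyphs[0][0]]
        for (_, x0, w0), (c1, x1, _) in zip(glyphs, glyphs[1:]):
            if x1 - (x0 + w0) > gap_factor * avg_width:
                out.append(" ")
            out.append(c1)
        return "".join(out)

    # "thisis" with a wide gap between the two words:
    glyphs = [("t", 0, 5), ("h", 5, 5), ("i", 10, 2), ("s", 12, 4),
              ("i", 22, 2), ("s", 24, 4)]
    print(reinsert_spaces(glyphs))  # -> "this is"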
On 02-Dec-2012 22:00, Jelle wrote:
BTW, if the PDF import comes across unknown characters, it treats them as escape codes and creates a new tspan rather than using an undefined glyph to fill the hole (I get that when "()" appear in text originally generated on a Chinese-language version of Windows, for instance; probably some GB2312-to-Unicode conversion bug).
PDFs have other problems, the biggest being that they need not encode the text at all (they can just embed images of the page), or they may not use a standard encoding like ASCII or Unicode. The encoding may not even be 1 glyph : 1 letter. For the latter issue, search for "PDF ligatures"; for a discussion of alternate encodings see, for instance:
http://stackoverflow.com/questions/1983561/why-a-pdf-document-could-be-not-s...
If you run into a PDF with a nonstandard encoding you will still be able to select "text" in a PDF reader, but if you cut and paste it into another program it will be garbage. Search will also not find anything unless you search for the garbage rather than what the words say on the screen. For instance, you might see "Atom" in Acrobat, but in the PDF it might actually be encoded as "v2ht" (or any other apparently random combination of letters).
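To make the mechanism concrete, here is a toy model in Python (an illustration only; real PDF font encodings are considerably more involved):

    # The PDF content stream stores arbitrary glyph codes; only the
    # embedded font knows which visible shapes they draw. Copy/paste
    # and search see the stored codes, not the rendered letters.
    private_encoding = {"v": "A", "2": "t", "h": "o", "t": "m"}

    stored_text = "v2ht"  # what extraction and search actually get
    rendered = "".join(private_encoding[c] for c in stored_text)
    print(rendered)  # -> "Atom", what the reader shows on screen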
Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech