Re: [Inkscape-devel] Straigthening text after pdf import

3 Dec 2012


      On 02-Dec-2012 22:00, Jelle wrote:
...
BTW if the PDF import comes across unknown characters, it will treat
them as escape codes and create a new tspan rather than using an
undefined glyph to fill the hole (I get that when "()" are in the
original text generated on a Chinese version of Windows, for instance.
Probably some GB2312 to Unicode conversion bug.)

PDFs have other problems. The biggest is that they need not encode the
text at all (they can just contain images of the page), or they may not
use a standard encoding like ASCII or Unicode.  The encoding may not
even be 1 glyph : 1 letter.  For the latter issue search for "PDF
ligatures"; for a discussion of alternate encodings see, for instance:
http://stackoverflow.com/questions/1983561/why-a-pdf-document-could-be-not-s...
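
To make the "not 1 glyph : 1 letter" point concrete, here is a small
Python sketch. It assumes a text extractor that hands back the single
"fi" ligature code point (U+FB01) for the word "office"; whether a given
extractor does that, or expands the ligature itself, is an assumption
and varies from tool to tool.

```python
import unicodedata

# U+FB01 is one code point that stands for the two letters "f" and "i".
# Assumed extractor output for the word "office" set with an fi ligature:
extracted = "of\ufb01ce"

# A plain search for the word as typed misses it.
print("office" in extracted)

# NFKC normalization expands the ligature back into separate letters.
normalized = unicodedata.normalize("NFKC", extracted)
print(normalized == "office")
```

Normalizing helps only when the ligature arrives as its proper Unicode
code point. It does nothing for a font that uses its own private codes.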
If you run into a PDF with a nonstandard encoding you will still be
able to select "text" in a PDF reader, but if you cut and paste it into
another program it will be garbage.  Also, search will not find anything
unless you search for the garbage and not what the words say on the
screen.  For instance, you might see "Atom" in Acrobat, but in the PDF
it might actually be encoded as "v2ht" (or any other apparently random
combination of letters).
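
A toy Python sketch of the "Atom" / "v2ht" case follows. The table
FONT_GLYPHS and the function rendered() are made up for illustration
and are not how any particular PDF library works; the sketch only
assumes that each stored code selects one glyph in an embedded font and
that the file carries no usable mapping back to Unicode.

```python
# Hypothetical per-font table: code stored in the PDF -> glyph drawn on screen.
# The values are chosen so that the stored codes "v2ht" display as "Atom".
FONT_GLYPHS = {"v": "A", "2": "t", "h": "o", "t": "m"}

def rendered(stored: str) -> str:
    """Return what the viewer draws for the stored codes."""
    return "".join(FONT_GLYPHS[code] for code in stored)

stored = "v2ht"            # what copy/paste and search actually see
print(rendered(stored))    # what the reader sees on the page
print("Atom" in stored)    # searching for the visible word fails
print("v2ht" in stored)    # searching for the "garbage" succeeds
```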
Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech