Re: [Inkscape-devel] patch: merge pdf import via poppler-cairo into native importer

17 Jun 2014


      On 17-Jun-2014 10:47, Josh Andler wrote:
...
With the poppler-based import it is converting text to paths (or symbols
from what I saw), but it still leaves the "import text as text" active and
looking as if it will be how it functions. Is not importing the text as
text via Poppler a bug or what we should expect?

Please post a link to an example (any PDF somewhere on the web) and 
describe the region you are talking about.  PDFs have a broad spectrum 
of how text is encoded and without seeing the example we can't be sure 
where the issue is.  These PDFs all look the same on the screen, but...

The best case encodes each letter as its own glyph, and the encoding 
matches up with Unicode.  These are searchable with text strings in PDF 
viewers.

Not quite so good, the PDF merges pairs of letters like "fi" into one 
glyph (a ligature).  This looks good as a graphic but causes hiccups 
when it hits any text processing code.  These are mostly searchable, 
but (in this example) "file" would not be found, even when the word is 
staring the user in the face on the screen.  Compensating is possible 
in special cases if there is a table of the common instances of this, 
and the import code splits out the merged glyphs.
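
A minimal sketch of the table-driven splitting described above, in 
Python.  It assumes the importer has already mapped each merged glyph 
to one of the Unicode ligature code points (U+FB00 to U+FB04).  The 
names LIGATURES and expand_ligatures are made up for illustration and 
are not part of Inkscape or poppler.

```python
# Hypothetical table of common Latin ligature code points and the
# letters they stand for (Unicode "Alphabetic Presentation Forms").
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each known ligature character with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    imported = "\ufb01le"  # "file" as stored with a merged "fi" glyph
    # A plain substring search misses the word until it is expanded.
    print("file" in imported)
    print("file" in expand_ligatures(imported))
```

Ligatures that the PDF stores under some private code rather than a 
standard code point would not be caught by a table like this.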

Really bad, the PDF embeds glyphs and assigns arbitrary codes for each 
glyph, so that there is no correspondence whatsoever between the 
encoding and Unicode or ASCII.  These cannot be searched and will be 
imported as gibberish.  Miserable to straighten out, but in theory it 
can be done by first matching each embedded glyph from the PDF against 
the corresponding font (assuming that can be identified), to make a 
table that translates back to Unicode.
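
A rough sketch of that back-translation idea, in Python.  It assumes 
the matching font has been identified and that each glyph outline can 
be reduced to a hashable signature that compares equal for identical 
shapes; real outlines would need normalizing for scale and subsetting 
first.  The function names and data layout are hypothetical.

```python
def build_back_translation(embedded_glyphs, reference_font):
    """Map arbitrary PDF glyph codes to Unicode by matching outlines.

    embedded_glyphs: dict of PDF glyph code -> outline signature
    reference_font:  dict of Unicode character -> outline signature
    Both structures are assumptions made for this sketch.
    """
    # Invert the reference font so an outline looks up its character.
    outline_to_char = {outline: ch for ch, outline in reference_font.items()}
    table = {}
    for code, outline in embedded_glyphs.items():
        ch = outline_to_char.get(outline)
        if ch is not None:  # glyphs with no match stay untranslated
            table[code] = ch
    return table


def decode(codes, table, unknown="\ufffd"):
    """Translate a run of PDF glyph codes, marking unmatched ones."""
    return "".join(table.get(code, unknown) for code in codes)


if __name__ == "__main__":
    # Toy outlines: tuples of points standing in for real glyph paths.
    reference_font = {"f": ((0, 0), (1, 3)), "i": ((0, 0), (0, 2))}
    embedded_glyphs = {17: ((0, 0), (0, 2)), 42: ((0, 0), (1, 3))}
    table = build_back_translation(embedded_glyphs, reference_font)
    print(decode([42, 17], table))
```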

Impossible - the PDF converts text to graphic draws and there are no 
explicit glyphs whatsoever.  This normally only shows up if the 
exporting program has told the program generating the PDF to do this.  
It might happen without the user being aware of it if the export code 
passes through a stage where it is forced to bitmap text in order to 
satisfy some special effect the program uses that cannot be represented 
in PDF (or in some cases in PostScript, if the PDF driver goes through 
the usual Windows PostScript driver).  For instance, if the transparency 
of the text is 0.001 (instead of 0.0) many PostScript drivers are going 
to bitmap that text to satisfy the required transparency.  The end user 
may not be aware that transparency is not 0, since on screen the two 
cases are indistinguishable.

Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech