Re: [Inkscape-devel] patch: merge pdf import via poppler-cairo into native importer

17 Jun 2014

I'm primarily concerned about what the dialog presents option-wise vs.
what you get in the actual imported document. So, if we show "import text
as text" as active, it should do that.
The handful of random files I had around here as well as a couple I found
in the tracker are importing the glyphs as outlines/symbols... here are a
couple examples:
https://bugs.launchpad.net/inkscape/+bug/429709/+attachment/718312/+files/%5...
from https://bugs.launchpad.net/inkscape/+bug/429709
https://bugs.launchpad.net/inkscape/+bug/275655/+attachment/362795/+files/ma...
from https://bugs.launchpad.net/inkscape/+bug/275655
As for text importing issues, here is an unrelated but interesting one:
it seems poppler sees the text and imports the outlines, but our
regular importer does not:
https://bugs.launchpad.net/inkscape/+bug/275655/+attachment/3156351/+files/t...
Cheers,
Josh
On Tue, Jun 17, 2014 at 11:38 AM, mathog <mathog@...1176...> wrote:
...
On 17-Jun-2014 10:47, Josh Andler wrote:
...
With the poppler-based import it is converting text to paths (or symbols
from what I saw), but it still leaves the "import text as text" active and
looking as if it will be how it functions. Is not importing the text as
text via Poppler a bug or what we should expect?
Please post a link to an example (any PDF somewhere on the web) and
describe the region you are talking about.  PDFs have a broad spectrum
of how text is encoded and without seeing the example we can't be sure
where the issue is.  These PDFs all look the same on the screen, but...
The best case encodes each letter as its own glyph, and the encoding
matches up with unicode.  These are searchable with text strings in PDF
viewers.
Not quite so good, the PDF merges pairs of letters like "fi" into one
glyph.  This looks good as a graphic but causes hiccups when it hits any
text processing code.  These are mostly searchable, but (in this
example) "file" would not be found, even when the word is staring the
user in the face on the screen. Compensating is possible in special
cases if there is a table of the common instances of this, and the
import code splits out the merged glyphs.
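A minimal sketch of the table-based compensation described above, assuming the merged glyphs arrive as the Unicode ligature code points (U+FB00 to U+FB04). The table and the function name split_ligatures are illustrative only, not part of Inkscape or poppler:

```python
# Hypothetical table of common Latin ligature code points and the
# letter sequences they stand for. A real importer would likely need
# a larger table; this one covers only the f-ligatures.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def split_ligatures(text: str) -> str:
    """Replace each known ligature character with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


# Text as it might come out of a PDF that merged "fi" into one glyph.
extracted = "\ufb01le"

print("file" in extracted)                   # search misses the word
print("file" in split_ligatures(extracted))  # search finds it after splitting
```

This only helps when the merged glyph carries a recognizable code; ligatures stored under arbitrary codes fall into the harder case described next.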
Really bad, the PDF embeds glyphs and assigns arbitrary codes for each
glyph, so that there is no correspondence whatsoever between the
encoding and Unicode or ASCII.  These cannot be searched and will be
imported as gibberish.  Miserable to straighten out, but in theory it
can be done by first searching each embedded glyph from the PDF against
the corresponding font (assuming that can be identified), to make a
table that back translates to unicode.
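A rough sketch of that back-translation idea, under strong simplifying assumptions: each glyph outline is represented here as a plain string, and an embedded glyph is taken to match the reference font only when the outlines are exactly equal. Real embedded fonts are often subsetted or rescaled, so exact matching is unlikely to be enough in practice. All names and the toy outline data below are hypothetical:

```python
def build_code_to_unicode(embedded_glyphs, reference_font):
    """Map arbitrary PDF glyph codes to Unicode characters.

    embedded_glyphs: dict of glyph code -> outline (hypothetical string form)
    reference_font:  dict of Unicode character -> outline from the real font
    """
    # Invert the reference font so an outline can be looked up directly.
    outline_to_char = {outline: char for char, outline in reference_font.items()}
    return {
        code: outline_to_char[outline]
        for code, outline in embedded_glyphs.items()
        if outline in outline_to_char
    }


def decode(codes, table, unknown="\ufffd"):
    """Translate a sequence of glyph codes, marking unmatched codes."""
    return "".join(table.get(code, unknown) for code in codes)


# Toy outlines standing in for real path data.
reference_font = {
    "f": "M0 0 L1 3 L2 3",
    "i": "M0 0 L0 2",
    "l": "M0 0 L0 3",
    "e": "M0 1 L2 1 L1 2",
}

# The PDF assigned codes 1..4 with no relation to Unicode or ASCII.
embedded_glyphs = {
    1: "M0 0 L0 2",       # actually "i"
    2: "M0 0 L0 3",       # actually "l"
    3: "M0 0 L1 3 L2 3",  # actually "f"
    4: "M0 1 L2 1 L1 2",  # actually "e"
}

table = build_code_to_unicode(embedded_glyphs, reference_font)
print(decode([3, 1, 2, 4], table))
```

Codes with no matching outline are rendered as U+FFFD rather than guessed, so the gaps stay visible in the imported text.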
Impossible - PDF converts text to graphic draws and there are no
explicit glyphs whatsoever.  This normally only shows up if the
exporting program has told the program generating the PDF to do this.
It might happen without the user being aware of it if the export code
passes through a stage where it is forced to bitmap text in order to
satisfy some special effect the program uses that cannot be represented
in PDF (or in some cases in Postscript, if the PDF driver goes through
the usual Windows Postscript driver.)  For instance, if the transparency
of the text is 0.001 (instead of 0.0) many postscript drivers are going
to bitmap that text to satisfy the required transparency.  The end user
may not be aware that transparency is not 0, since on screen the two
cases are indistinguishable.
Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech
