Re: [Inkscape-devel] patch: merge pdf import via poppler-cairo into native importer
On 17-Jun-2014 10:47, Josh Andler wrote:
With the poppler-based import it is converting text to paths (or symbols from what I saw), but it still leaves the "import text as text" active and looking as if it will be how it functions. Is not importing the text as text via Poppler a bug or what we should expect?
Please post a link to an example (any PDF somewhere on the web) and describe the region you are talking about. PDFs have a broad spectrum of how text is encoded and without seeing the example we can't be sure where the issue is. These PDFs all look the same on the screen, but...
The best case encodes each letter as its own glyph, and the encoding matches up with Unicode. These are searchable with text strings in PDF viewers.
Not quite so good, the PDF merges pairs of letters like "fi" into one glyph. This looks good as a graphic but causes hiccups when it hits any text processing code. These are mostly searchable, but (in this example) "file" would not be found, even when the word is staring the user in the face on the screen. Compensating is possible in special cases if there is a table of the common instances of this, and the import code splits out the merged glyphs.
Really bad, the PDF embeds glyphs and assigns an arbitrary code to each glyph, so that there is no correspondence whatsoever between the encoding and Unicode or ASCII. These cannot be searched and will be imported as gibberish. Miserable to straighten out, but in theory it can be done by first searching each embedded glyph from the PDF against the corresponding font (assuming that can be identified), to make a table that back-translates to Unicode.
Impossible - the PDF converts text to graphic draws and there are no explicit glyphs whatsoever. This normally only shows up if the exporting program has told the program generating the PDF to do this. It might happen without the user being aware of it if the export code passes through a stage where it is forced to bitmap text in order to satisfy some special effect the program uses that cannot be represented in PDF (or in some cases in PostScript, if the PDF driver goes through the usual Windows PostScript driver). For instance, if the transparency of the text is 0.001 (instead of 0.0), many PostScript drivers are going to bitmap that text to satisfy the required transparency. The end user may not be aware that transparency is not 0, since on screen the two cases are indistinguishable.
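To make the second and third cases a bit more concrete, here is a minimal Python sketch of the two table lookups described above. It is only an illustration, not what poppler or Inkscape actually do: the names `LIGATURES`, `split_ligatures` and `decode_glyph_codes` are made up for this example, the ligature table covers only a few Latin presentation forms, and the code-to-Unicode table is assumed to have already been built by matching embedded glyphs against the font.

```python
# Hypothetical table: Unicode ligature code points -> their component letters.
# Only a handful of common Latin ligatures are listed here.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def split_ligatures(text):
    """Replace each merged glyph with its letters so plain-text search works."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


def decode_glyph_codes(codes, code_to_unicode):
    """Back-translate arbitrary glyph codes to Unicode via a lookup table.

    `code_to_unicode` is assumed to exist already (for example, built by
    comparing embedded glyphs with the original font). Codes missing from
    the table become U+FFFD, the Unicode replacement character.
    """
    return "".join(code_to_unicode.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    # The word is stored with an "fi" ligature, so a search for "file" fails
    # until the ligature is split out.
    imported = "\ufb01le"
    print("file" in imported)                   # False
    print("file" in split_ligatures(imported))  # True

    # Arbitrary glyph codes with a made-up back-translation table.
    table = {1: "i", 2: "l", 3: "f", 4: "e"}
    print(decode_glyph_codes([3, 1, 2, 4], table))  # file
```

The hard part in the third case is building the table in the first place; once it exists, the import step is just the lookup shown here.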
Regards,
David Mathog mathog@...1176... Manager, Sequence Analysis Facility, Biology Division, Caltech
I'm primarily concerned about what the dialog is presenting option-wise vs what you get in the actual imported document. So, if we show "import text as text" as active, it should do that.
The handful of random files I had around here as well as a couple I found in the tracker are importing the glyphs as outlines/symbols... here are a couple examples: https://bugs.launchpad.net/inkscape/+bug/429709/+attachment/718312/+files/%5... from https://bugs.launchpad.net/inkscape/+bug/429709
https://bugs.launchpad.net/inkscape/+bug/275655/+attachment/362795/+files/ma... from https://bugs.launchpad.net/inkscape/+bug/275655
As for text importing issues, here is an unrelated but interesting one: poppler seems to see the text and import the outlines, whereas our regular importer does not. The file is https://bugs.launchpad.net/inkscape/+bug/275655/+attachment/3156351/+files/t...
Cheers, Josh