On 17-Jun-2014 10:47, Josh Andler wrote:
With the poppler-based import it is converting text to paths (or symbols from what I saw), but it still leaves the "import text as text" active and looking as if it will be how it functions. Is not importing the text as text via Poppler a bug or what we should expect?
Please post a link to an example (any PDF somewhere on the web) and describe the region you are talking about. PDFs have a broad spectrum of how text is encoded and without seeing the example we can't be sure where the issue is. These PDFs all look the same on the screen, but...
In the best case the PDF encodes each letter as its own glyph, and the encoding matches up with Unicode. These are searchable with text strings in PDF viewers.
Not quite so good: the PDF merges pairs of letters like "fi" into one glyph. This looks good as a graphic but causes hiccups when it hits any text processing code. These are mostly searchable, but (in this example) "file" would not be found, even when the word is staring the user in the face on the screen. Compensating is possible in special cases if the import code has a table of the common merged glyphs and splits them back into their separate letters.
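To make the table idea concrete, here is a minimal Python sketch of splitting merged glyphs back into letters. It is not what the Poppler-based import does. It assumes the PDF maps each merged glyph to the matching Unicode ligature code point (U+FB00 through U+FB04), and the names LIGATURES and split_ligatures are made up for illustration.

```python
# Hypothetical table of the common Latin ligatures, keyed by the
# Unicode "Alphabetic Presentation Forms" code points.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def split_ligatures(text: str) -> str:
    """Replace each merged glyph with the letters it stands for."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


# Assumed import result for the word "file": one ligature plus "le".
imported = "\ufb01le"
print("file" in imported)                   # False, the search misses it
print("file" in split_ligatures(imported))  # True once the glyph is split
```

A table like this only covers the cases somebody thought to list, which is why it helps in special cases rather than in general.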
Really bad: the PDF embeds glyphs and assigns an arbitrary code to each one, so that there is no correspondence whatsoever between the encoding and Unicode or ASCII. These cannot be searched and will be imported as gibberish. Miserable to straighten out, but in theory it can be done by first matching each embedded glyph from the PDF against the corresponding font (assuming that can be identified), to make a table that back-translates to Unicode.
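The back-translation table could be built roughly as in the Python sketch below. Everything in it is hypothetical: the outlines are stand-in path strings, the glyph codes are invented, and it assumes an embedded glyph matches the font's outline exactly. With real subsetted fonts the comparison would almost certainly need some tolerance.

```python
def build_back_table(pdf_glyphs, font_glyphs):
    """Map each arbitrary PDF glyph code to a Unicode character.

    pdf_glyphs:  {code: outline} taken from the PDF (hypothetical format)
    font_glyphs: {character: outline} taken from the identified font
    An outline is any hashable description of the shape, here a string.
    """
    # Invert the font so a shape can be looked up to find its character.
    by_outline = {outline: ch for ch, outline in font_glyphs.items()}
    return {
        code: by_outline[outline]
        for code, outline in pdf_glyphs.items()
        if outline in by_outline
    }


def decode(codes, table, unknown="\ufffd"):
    """Translate glyph codes to text, marking glyphs that had no match."""
    return "".join(table.get(code, unknown) for code in codes)


# Made-up outlines for four letters of the identified font.
font_glyphs = {"f": "M0 0 L1 5", "i": "M0 0 L0 3",
               "l": "M0 0 L0 5", "e": "M0 2 C1 3"}
# The same shapes as embedded in the PDF, under arbitrary codes.
pdf_glyphs = {17: "M0 2 C1 3", 3: "M0 0 L1 5",
              42: "M0 0 L0 5", 8: "M0 0 L0 3"}

table = build_back_table(pdf_glyphs, font_glyphs)
print(decode([3, 8, 42, 17], table))  # file
```

Glyphs with no match in the font come out as U+FFFD, so the gibberish is at least marked rather than silently wrong.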
Impossible - the PDF converts text to graphic draws and there are no explicit glyphs whatsoever. This normally only shows up if the exporting program has told the program generating the PDF to do this. It might happen without the user being aware of it if the export code passes through a stage where it is forced to bitmap text in order to satisfy some special effect the program uses that cannot be represented in PDF (or in some cases in PostScript, if the PDF driver goes through the usual Windows PostScript driver). For instance, if the transparency of the text is 0.001 (instead of 0.0), many PostScript drivers are going to bitmap that text to satisfy the required transparency. The end user may not be aware that the transparency is not 0, since on screen the two cases are indistinguishable.
Regards,
David Mathog mathog@...1176... Manager, Sequence Analysis Facility, Biology Division, Caltech