On 17-Jun-2014 10:47, Josh Andler wrote:
> With the poppler-based import it is converting text to paths (or symbols
> from what I saw), but it still leaves the "import text as text" active and
> looking as if it will be how it functions. Is not importing the text as
> text via Poppler a bug or what we should expect?

Please post a link to an example (any PDF somewhere on the web) and
describe the region you are talking about. PDFs encode text in widely
varying ways, and without seeing the example we can't be sure where the
issue is. These PDFs all look the same on the screen, but...

Best case: the PDF encodes each letter as its own glyph, and the
encoding matches up with Unicode. These are searchable with text
strings in PDF viewers.

Not quite so good: the PDF merges pairs of letters like "fi" into one
glyph (a ligature). This looks good as a graphic but causes hiccups when
it hits any text processing code. These are mostly searchable, but (in
this example) "file" would not be found, even when the word is staring
the user in the face on the screen. Compensating is possible in special
cases if there is a table of the common instances of this, and the
import code splits out the merged glyphs.

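A minimal sketch of that table-driven compensation, in Python. It
assumes the extracted text carries the ligatures as the Unicode
presentation-form code points (U+FB00 to U+FB04); the names LIGATURES
and split_ligatures are made up for illustration and are not part of
Inkscape or Poppler.

```python
# Table of common Latin ligature code points (Unicode Alphabetic
# Presentation Forms) and the plain letters they stand for.
# Assumption: the PDF text extraction yields these code points.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def split_ligatures(text: str) -> str:
    """Replace each known ligature with its separate letters."""
    return text.translate(_LIGATURE_TABLE)


extracted = "\ufb01le"  # the word "file" with "fi" merged into one glyph
print("file" in extracted)                   # False
print("file" in split_ligatures(extracted))  # True
```

This only covers ligatures that appear in the table; a PDF that uses
some other code for the merged glyph falls into the next case.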
Really bad: the PDF embeds glyphs and assigns arbitrary codes to each
glyph, so that there is no correspondence whatsoever between the
encoding and Unicode or ASCII. These cannot be searched and will be
imported as gibberish. Miserable to straighten out, but in theory it
can be done by first searching each embedded glyph from the PDF against
the corresponding font (assuming that can be identified), to make a
table that back-translates to Unicode.

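The back-translation idea can be sketched roughly as follows, in
Python. Everything here is hypothetical: glyph outlines are stood in for
by plain strings, the matching is exact equality (real outlines would
need normalizing and probably a fuzzy comparison), and the names
build_backtranslation and decode are invented for illustration.

```python
def build_backtranslation(embedded, reference):
    """Map arbitrary PDF glyph codes to Unicode characters.

    embedded:  {pdf_code: outline} for glyphs embedded in the PDF
    reference: {unicode_char: outline} from the identified font
    Outlines are hashable stand-ins here (hypothetical data).
    """
    # Invert the reference font so an outline can be looked up directly.
    by_outline = {outline: char for char, outline in reference.items()}
    return {
        code: by_outline[outline]
        for code, outline in embedded.items()
        if outline in by_outline
    }


def decode(codes, table, unknown="\ufffd"):
    """Translate a run of PDF glyph codes, marking unmatched ones."""
    return "".join(table.get(code, unknown) for code in codes)


# Hypothetical reference font and embedded subset with arbitrary codes.
reference = {"e": "outline-e", "f": "outline-f",
             "i": "outline-i", "l": "outline-l"}
embedded = {1: "outline-i", 2: "outline-e", 3: "outline-f", 7: "outline-l"}

table = build_backtranslation(embedded, reference)
print(decode([3, 1, 7, 2], table))  # file
```

Codes with no matching outline come out as U+FFFD, which at least makes
the gaps visible instead of importing silent gibberish.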
Impossible: the PDF converts text to graphic draws and there are no
explicit glyphs whatsoever. This normally only shows up if the
exporting program has told the program generating the PDF to do this.
It might happen without the user being aware of it if the export code
passes through a stage where it is forced to bitmap text in order to
satisfy some special effect the program uses that cannot be represented
in PDF (or in some cases in PostScript, if the PDF driver goes through
the usual Windows PostScript driver). For instance, if the transparency
of the text is 0.001 (instead of 0.0), many PostScript drivers are going
to bitmap that text to satisfy the required transparency. The end user
may not be aware that transparency is not 0, since on screen the two
cases are indistinguishable.

Regards,
David Mathog
mathog@...1176...
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Inkscape-devel mailing list
Inkscape-devel@...1794...s.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/inkscape-devel