On 02-Dec-2012 22:00, Jelle wrote:
BTW, if the PDF import comes across unknown characters, it will treat them as escape codes and create a new tspan rather than using an undefined glyph to fill the hole (I get that when "()" are in the original text generated on a Chinese version of Windows, for instance. Probably some GB2312 to Unicode conversion bug.)
PDFs have other problems. The biggest is that they need not encode the text at all (they can just contain images of the page), or they may not use a standard encoding like ASCII or Unicode. The encoding may not even be 1 glyph : 1 letter. For the latter issue, search for "PDF ligatures"; for a discussion of alternate encodings see, for instance:
http://stackoverflow.com/questions/1983561/why-a-pdf-document-could-be-not-s...
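To make the "not 1 glyph : 1 letter" point concrete, here is a small Python sketch. It assumes the extractor hands back the Unicode ligature character U+FB03 for "ffi"; the sample string is made up, and a real PDF may store something else entirely for that glyph.

```python
import unicodedata

# Assumed extraction result: the word "office" with "ffi" stored as the
# single ligature character U+FB03 (one glyph standing for three letters).
extracted = "o\ufb03ce"

# A plain search for the word fails, because the stored text has no "ffi".
found_raw = "office" in extracted

# NFKC compatibility normalization expands the ligature back into letters.
normalized = unicodedata.normalize("NFKC", extracted)
found_normalized = "office" in normalized

print(len(extracted), found_raw)         # 4 characters, search fails
print(len(normalized), found_normalized) # 6 characters, search succeeds
```

This only helps when the ligature arrives as a proper Unicode ligature character; it does nothing for the arbitrary encodings discussed at the link above.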
If you run into a PDF with a nonstandard encoding you will still be able to select "text" in a PDF reader, but if you cut and paste it into another program it will be garbage. Also, search will not find anything unless you search for the garbage and not what the words say on the screen. For instance, you might see "Atom" in Acrobat, but in the PDF it might actually be encoded as "v2ht" (or any other apparently random combination of letters).
Regards,
David Mathog mathog@...1176... Manager, Sequence Analysis Facility, Biology Division, Caltech