8.01 losing searchable text selecting pdf pages
From: Karl Berry
Subject: 8.01 losing searchable text selecting pdf pages
Date: Sun, 21 Mar 2004 12:54:32 -0500
With GNU Ghostscript 8.01 under GNU/Linux (Red Hat 9), selecting pages
from a pdf file seems to lose any searchable text that might be there.
Here's what I mean:
- If you view the second page of the attached in.pdf with xpdf (among
other viewers), you can search for, say, "vac" and find the string;
the pdftotext program that comes with xpdf can also extract the main
text, and so on. (The file was created with an HP 6100 scanner.)
- Then I select the second page using gs:
gs -q -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf \
-dFirstPage=2 -dLastPage=2 in.pdf -c quit
- Now, viewing out.pdf (also attached), no text is searchable, and
pdftotext doesn't find any text.
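The steps above can be sketched as a small shell check. This is only a rough illustration, assuming gs and pdftotext are both on the PATH and that in.pdf is in the current directory; the helper names has_text, select_page and report are made up for this sketch, and "text found" here only means that pdftotext produced at least one non-whitespace character.

```shell
#!/bin/sh
# Sketch: does selecting a page with gs keep extractable text?
# in.pdf and out.pdf are the file names from the report above;
# has_text, select_page and report are hypothetical helper names.

# Succeed if the given text file has at least one non-whitespace character.
# (pdftotext ends each page with a form feed, which counts as whitespace.)
has_text() {
    grep -q '[^[:space:]]' "$1"
}

# Write page $2 of PDF $1 to PDF $3, with the same gs options as above.
select_page() {
    gs -q -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile="$3" \
       -dFirstPage="$2" -dLastPage="$2" "$1" -c quit
}

# Say whether pdftotext finds any text on pages $2..$3 of PDF $1.
report() {
    tmp=$(mktemp) || return 1
    pdftotext -f "$2" -l "$3" "$1" "$tmp"
    if has_text "$tmp"; then
        echo "$1: text found"
    else
        echo "$1: no text"
    fi
    rm -f "$tmp"
}

# Only run the comparison when the input and both tools are available.
if [ -f in.pdf ] && command -v gs >/dev/null 2>&1 \
        && command -v pdftotext >/dev/null 2>&1; then
    select_page in.pdf 2 out.pdf
    report in.pdf 2 2     # page 2 of the original
    report out.pdf 1 1    # the single page of the selection
fi
```

If the two report lines differ, the text was lost in the selection step rather than being absent from the original.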
This may be related to the warnings that gs emits when processing in.pdf:
**** Warning: Fonts with Subtype = /TrueType should be embedded.
But Times-Roman is not embedded.
**** Warning: Fonts with Subtype = /TrueType should be embedded.
But Times-Italic is not embedded.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Adobe PDF Library 5.0 <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
I do not know enough about pdf to know whether this is fixable, but I
wanted to report it, since the resulting pdfs can't be searched and so
are not great for posting on the web, for example.
BTW, I did find an alternative method of selecting pdf pages that
preserves searchable text, using the ConTeXt program texexec with
--pdfselect, but gs is a lot faster.
Thanks,
karl
P.S. Is anyone there? I reported a problem with ps2pdf misconverting
some figures back on March 8, but didn't get an acknowledgement.