Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract
From: Karsten Hilbert
Subject: Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date: Mon, 25 Jan 2010 23:41:03 +0100
> Some discussion of PDF indexing and scraping of PDFs makes me ask about
> GNUmed's ability to search for text across a patient record:
>
> 1) when a PDF was generated from source text (such as a word processor and
> "print to pdf") the text within the PDF remains recognizable to software,
> albeit not in human readable form.
AFAIK, that depends entirely on the mode in which it was generated.
It well behooves PDF generators to choose a mode that somehow preserves
text but AFAIK there are other modes in which no text is left at all.
> Is GNUmed presently only able to query
> information stored-as-human-readable text?
Even worse, it cannot query over *any* information in any
of the documents in the archive regardless of format.
> 2) there exists apparently a form of PDF called "searchable" in which a
> PDF can be created (or appended) to contain both an image layer (such as a
> scanned paper document) but to *also* hold, in a separate layer within the
> same document (file), ASCII or perhaps UTF-8 text, as may have been generated
> through OCR or perhaps when the PDF did already contain identifiable text
> (only non-human-readable within the PDF format), into a layer of
> human-readable text.
That sounds mighty useful to me.
> For GNUmed to be able to access such a layer in within-patient searches,
> would it be necessary for such PDFs to have been imported twice, and/or to
> use some additional tool to "split" the document into two parts (one an
> image part, and one the text part)?
It would be possible to implement the access to the text part inside
GNUmed. Actually using that in a search would, however, presently
require exporting each and every document and trying to search it.
That could, indeed, only be mitigated by splitting the text part
into a separate for-search table upon import.
Except that GNUmed already has that table: blobs.doc_desc, of which
there can be any number per document. In fact, we should probably
extend the per-patient and across-patients search to look at those!
Which would then enable practices to implement just what you wanted -
they'd have to import the text version themselves, but it'd be usable
for finding stuff.
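
To make the idea concrete, here is a rough sketch of storing text descriptions at import time and searching them afterwards. It is not GNUmed code: it uses an in-memory SQLite database as a stand-in for the PostgreSQL backend, and the table layout, the column names and the helpers make_db(), import_document() and search_documents() are all made up for illustration, only loosely modelled on blobs.doc_desc.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in database (hypothetical schema, not GNUmed's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE doc_med (
            pk INTEGER PRIMARY KEY,
            fk_patient INTEGER NOT NULL,
            title TEXT
        );
        -- any number of text descriptions per document
        CREATE TABLE doc_desc (
            pk INTEGER PRIMARY KEY,
            fk_doc INTEGER NOT NULL REFERENCES doc_med(pk),
            description TEXT NOT NULL
        );
    """)
    return conn


def import_document(conn, patient_id, title, descriptions):
    """Store a document row plus its text descriptions (e.g. OCR output)."""
    cur = conn.execute(
        "INSERT INTO doc_med (fk_patient, title) VALUES (?, ?)",
        (patient_id, title),
    )
    doc_pk = cur.lastrowid
    conn.executemany(
        "INSERT INTO doc_desc (fk_doc, description) VALUES (?, ?)",
        [(doc_pk, text) for text in descriptions],
    )
    return doc_pk


def search_documents(conn, term, patient_id=None):
    """Return pks of documents with a description containing term.

    With patient_id set the search is per-patient, otherwise across patients.
    Matching is a case-insensitive (ASCII) substring test.
    """
    sql = (
        "SELECT DISTINCT d.pk FROM doc_med AS d "
        "JOIN doc_desc AS t ON t.fk_doc = d.pk "
        "WHERE instr(lower(t.description), lower(?)) > 0"
    )
    params = [term]
    if patient_id is not None:
        sql += " AND d.fk_patient = ?"
        params.append(patient_id)
    sql += " ORDER BY d.pk"
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = make_db()
    import_document(conn, 1, "discharge letter",
                    ["OCR text: admitted with pneumonia", "antibiotics given"])
    import_document(conn, 2, "lab report", ["Pneumonia ruled out"])
    print(search_documents(conn, "pneumonia"))                # across patients
    print(search_documents(conn, "pneumonia", patient_id=1))  # within one patient
```

The point of the sketch is only that the search never has to touch the stored document itself: whoever imports the file supplies the text version, and the query runs over the description rows.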
:-)
Karsten