Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract
From: Jim Busser
Subject: Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date: Tue, 26 Jan 2010 07:48:40 -0800
On 2010-01-26, at 7:20 AM, Karsten Hilbert wrote:
>> That could, indeed, only be mitigated by splitting the text part
>> into a separate for-search table upon import.
>>
>> Except that GNUmed already has that table: blobs.doc_desc, of which
>> there can be any number per document. In fact, we should probably
>> extend the per-patient and across-patients search to look at those !
>
> Which we apparently already do, of course :-)
>
> One concept of the GNUmed document archive is that it tries
> hard to *not* concern itself with the particulars of the
> document part file types. It delegates that as much as at
> all possible. Hence splitting / appropriately importing PDF
> parts is up to the environment.
I am only wondering what constrains or otherwise defines the ability of GNUmed
(postgres) to "look inside" a part, no matter its type. Is it as simple as
GNUmed looking for ASCII or UTF-8 text strings? Suppose, in this case, the PDF
contains some combination of
- images + text that is encumbered by PDF formatting and so not readable, AND
- a layer of human-readable text
(the latter, if we are lucky, being the text layer of a "searchable PDF").
1) Should GNUmed then be able to find this document part?
2) Will this be incredibly slow, or does GNUmed (postgres) index all of the
text that is readable "in" the parts?
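
To make question 1 concrete, here is a minimal sketch of why a plain
byte-string match may or may not "see" a text layer. It does not show how
GNUmed or postgres actually search; the sample bytes, the helper name
naive_contains, and the search term are all made up for illustration. The one
assumption taken from the PDF format itself is that content streams are often
stored Flate (zlib) compressed.

```python
import zlib


def naive_contains(blob: bytes, needle: str) -> bool:
    """Return True if the UTF-8 bytes of needle occur verbatim in blob."""
    return needle.encode("utf-8") in blob


# Hypothetical stand-in for the text layer of a searchable PDF
# (not a complete or valid PDF file).
text_layer = b"BT /F1 12 Tf (chest x-ray: no acute findings) Tj ET\n" * 20

# The same text layer, once stored as-is and once Flate-compressed.
plain_part = text_layer
packed_part = zlib.compress(text_layer)

found_plain = naive_contains(plain_part, "x-ray")
found_packed = naive_contains(packed_part, "x-ray")
found_inflated = naive_contains(zlib.decompress(packed_part), "x-ray")

print("uncompressed part:", found_plain)
print("compressed part:", found_packed)
print("after decompressing:", found_inflated)
```

If the text layer sits in a compressed stream, a raw string search over the
stored bytes would miss it, and the text would have to be extracted first
(for example at import time, into something like blobs.doc_desc).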