Re: cvs (or something!) on very large scales with non-source code objects
Greg A. Woods
Re: cvs (or something!) on very large scales with non-source code objects
Fri, 1 Feb 2002 15:22:34 -0500 (EST)
[ On Friday, February 1, 2002 at 10:24:37 (-0500), Nigel Kerr wrote: ]
> Subject: cvs (or something!) on very large scales with non-source code
> i have several million objects ("very large scales"): roughly half of
> them are bitonal TIFF files, scanned page images of printed material;
> the other half are OCR'd text of those same TIFF files. there are a
> relatively small number of other kinds of files: metadata about chunks
> of these data, and auxilliary images of parts of some of the pages.
> right now the top level chunks of this corpus number about 3,000, with
> sub-chunks inside those top-level chunks.
> at any moment, it might be discovered that there is an error or
> problem with any of these objects, that will need to be fixed:
> the TIFF file might be bad/corrupt/unclear
> the ocr'd text might be bad/corrupt/unclear
> the metadata might be found to be wrong
> the auxilliary images might be bad/corrupt/unclear
As I said in my other message -- that's a very interesting problem you
have there! ;-)
(sounds a lot like what the guys at catalogues.google.com are doing,
though with more exacting detail than would be necessary for searching
scanned catalogue pages)
How important is tracking the actual changes made to the TIFFs or the
auxiliary images? Can you get by with simply replacing them (perhaps
with a remark made about this replacement in the metadata file(s))?
Are the OCR'ed text and metadata kept in ASCII (or other diff-able text)?
Are you able to deal with making changes only to individual top-level
chunks at any one time?
How important is it to allow concurrent editing of the text/metadata?
CVS is clearly not suitable for tracking changes to binary data,
especially not at the scale of your corpus. However, the other parts may
be maintained with CVS, depending on how well you can break the entire
corpus into manageable chunks, and perhaps depending on how much you can
afford to manipulate several copies of all these files.
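As a rough illustration of that split (nothing here comes from the
actual corpus: the directory names and file suffixes are made up, and
the function name is hypothetical), here is a minimal Python sketch
that groups files by top-level chunk and separates the diff-able text,
which might go into one CVS module per chunk, from the TIFFs, which
would simply be replaced in place:

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed suffixes; the original post does not say how the files are named.
TEXT_SUFFIXES = {".txt", ".xml"}      # OCR text and metadata (diff-able)
BINARY_SUFFIXES = {".tif", ".tiff"}   # page images and auxiliary images


def split_corpus(paths):
    """Group relative paths by top-level chunk, then by kind of file."""
    plan = defaultdict(lambda: {"text": [], "binary": [], "other": []})
    for raw in paths:
        p = PurePosixPath(raw)
        chunk = p.parts[0]            # first path component = top-level chunk
        suffix = p.suffix.lower()
        if suffix in TEXT_SUFFIXES:
            kind = "text"             # candidate for CVS
        elif suffix in BINARY_SUFFIXES:
            kind = "binary"           # replace in place, note it in metadata
        else:
            kind = "other"            # needs a human decision
        plan[chunk][kind].append(raw)
    return dict(plan)


# Hypothetical layout, for illustration only.
example = [
    "chunk0001/p0001.tif",
    "chunk0001/p0001.txt",
    "chunk0001/meta.xml",
    "chunk0002/sub01/p0001.TIF",
    "chunk0002/sub01/p0001.txt",
]
plan = split_corpus(example)
```

Each entry of `plan` then describes one candidate module: the "text"
list is what would be imported into CVS, and the "binary" list is what
would live outside it.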
Greg A. Woods
+1 416 218-0098; <address@hidden>; <address@hidden>; <address@hidden>
Planix, Inc. <address@hidden>; VE3TCP; Secrets of the Weird <address@hidden>