
Re: cvs (or something!) on very large scales with non-source code objects

From: Nigel Kerr
Subject: Re: cvs (or something!) on very large scales with non-source code objects
Date: Fri, 01 Feb 2002 15:46:29 -0500
User-agent: Gnus/5.090005 (Oort Gnus v0.05) Emacs/20.7 (sparc-sun-solaris2.8)

Greg seems to catch the main points made so far (about cvs and binary
files and scale) and ask some questions, so i'm choosing to continue
from Greg's message:

Quoth address@hidden (Greg A. Woods):

> How important is tracking the actual changes made to the TIFFs or the
> auxiliary images?  Can you get by with simply replacing them (perhaps
> with a remark made about this replacement in the metadata file(s))?

an interesting question.  in my tenure with this project, we've only
been interested in "the state of this TIFF or .txt on DATE" a handful
of times, and it was usually to check what was visible at the time.
i'm thinking that having all the historical versions of the TIFF files
may not be as important as having all the historical versions of the
metadata binding all this together.
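
to make the "replace the TIFF and leave a remark in the metadata" idea
concrete, here is a minimal sketch in python.  the remark format and
the function names are hypothetical, not anything we actually use; the
only point is that a checksum and a date in the (versioned, text)
metadata can stand in for keeping every old TIFF.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in blocks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def replacement_remark(old_tiff, new_tiff, when=None):
    """Build a one-line remark (hypothetical format) for the metadata file."""
    when = when or date.today().isoformat()
    return "replaced {name} on {when}: sha1 {old} -> {new}".format(
        name=Path(old_tiff).name,
        when=when,
        old=sha1_of(old_tiff),
        new=sha1_of(new_tiff),
    )
```

the returned line would be appended to whatever metadata file binds
that page, so the history of replacements lives with the diff-able
text rather than with the binaries.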

> Is the OCR'ed text and metadata kept in ASCII (or other diff-able text)
> form?

yes, the ocr'd text is plain old text, in theory containing characters
as ambitious as iso-8859-1.  the metadata is also text, and can
contain utf8.  each "page" of the ocr'd text is a separate file at
this time.

> Are you able to deal with making changes only to individual top-level
> chunks at any one time?

not quite sure i understand this question, but perhaps if i explain:
we can make a change to any part of the corpus at any time: it might
be a single text file, it could be a single TIFF file (and possibly a
new version of text to go with a vastly improved TIFF), the binding

> How important is it to allow concurrent editing of the text/metadata?

not very: there are a small number of people who operate on this data,
and they are assigned pieces and parcels to work on exclusively until
done.  we don't now have a system-based locking mechanism other than
how the assignments are made (a social process amongst staff).

> CVS is clearly not suitable for tracking changes to binary data,
> especially not in the scale of your corpus.  However the other parts may
> be maintained with CVS, depending on how well you can break the entire
> corpus into manageable chunks, and perhaps depending on how much you can
> afford to manipulate several copies of all these files.

can folks speculate on what makes for the largest manageable chunk for
cvs?  if i make each of my 3,000 top-level chunks into a cvs module,
what senses might we call that manageable or unmanageable?  is 10,000
objects in a cvs module too much, just fine?
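
to put numbers behind that question, a rough sketch like the following
could count the objects under each top-level chunk.  it assumes the
corpus is a plain directory tree with one directory per chunk; the
function names and the 10,000 threshold are only illustrations, not a
known cvs limit.

```python
import os


def objects_per_chunk(corpus_root):
    """Map each top-level directory name to the number of files beneath it."""
    counts = {}
    for entry in sorted(os.listdir(corpus_root)):
        chunk = os.path.join(corpus_root, entry)
        if not os.path.isdir(chunk):
            continue  # skip stray files sitting at the top level
        counts[entry] = sum(len(files) for _, _, files in os.walk(chunk))
    return counts


def chunks_over(counts, limit):
    """Return the names of chunks holding more than `limit` files."""
    return sorted(name for name, n in counts.items() if n > limit)
```

for example, chunks_over(objects_per_chunk("/path/to/corpus"), 10000)
would list the chunks that exceed the figure asked about above, where
/path/to/corpus is a placeholder.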


Donald Sharp suggests that i keep looking.  any ideas about where else
to look, or whom to ask?  are there known individuals or outfits that
might have expertise here?

i do really appreciate the comments made, and i get the sense that i'm
not crazy in thinking that this is a large and not-simple problem.
thanks much!


> (sounds a lot like what the guys at are doing,
> though with more exacting detail than would be necessary for searching
> scanned catalogue pages)

tangentially, i've recently been asked if we can have a "highlight the
term on the page image" feature like the google catalog service as
well.  the bar never stays in one place, and it never moves lower...
