cvs (or something!) on very large scales with non-source code objects

From: Nigel Kerr
Subject: cvs (or something!) on very large scales with non-source code objects
Date: Fri, 01 Feb 2002 10:24:37 -0500
User-agent: Gnus/5.090005 (Oort Gnus v0.05) Emacs/20.7 (sparc-sun-solaris2.8)

good folk,

i ask this forum because i'm not at all sure where to start looking
for ideas on how to address my problems.  cvs may not be the right
tool for what i have, but any ideas, suggestions, or redirections to
other fora are welcome and desired.

i have several million objects ("very large scales"): roughly half of
them are bitonal TIFF files, scanned page images of printed material;
the other half are OCR'd text of those same TIFF files.  there are a
relatively small number of other kinds of files: metadata about chunks
of these data, and auxiliary images of parts of some of the pages.
right now the top-level chunks of this corpus number about 3,000, with
sub-chunks inside those top-level chunks.

at any moment, it might be discovered that there is an error or
problem with any of these objects that will need to be fixed:

    the TIFF file might be bad/corrupt/unclear
    the OCR'd text might be bad/corrupt/unclear
    the metadata might be found to be wrong
    the auxiliary images might be bad/corrupt/unclear

we might make a change to a small number of things at a time, or a
batch change to thousands of things at a time.  back when we had fewer
than 500 top-level chunks, our life was relatively easy: we had a
home-grown edit-history-type system that basically:

    moved the old file FILE aside to a backup name

    moved the new version of FILE into place

    wrote in a date-stamped log file a message meaning "i changed
    this!", where the message was phrased differently depending on
    what got changed.

    used the doughty mirror perl script on our different machines to
    get the changed data from the master to the slave machines.
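the steps above might be sketched roughly as follows.  this is only an
illustration of the idea, not the real system (which was perl plus the
mirror script); every name and the log format here are invented:

```python
import os
import shutil
import time

def replace_and_log(new_path, live_path, log_path, message):
    """Swap a corrected file into place and record the change.

    Hypothetical sketch of the home-grown edit-history steps:
    move the old file aside, move the new version into place,
    and append a date-stamped "i changed this!" entry to a log.
    """
    if os.path.exists(live_path):
        # move the old file aside under a dated backup name
        backup = live_path + ".old." + time.strftime("%Y%m%d%H%M%S")
        shutil.move(live_path, backup)
    # move the new version of the file into place
    shutil.move(new_path, live_path)
    # append a date-stamped log entry describing the change
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("%s\t%s\t%s\n" % (stamp, live_path, message))
```

the mirror script would then be pointed at the changed paths (or the
log) to propagate the new versions to the slave machines.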

we're still using that system.  we get about 400,000 new items a
month, in 30-50 new top-level chunks (a top-level chunk varies
considerably in size).  the growth of our corpus will never slow
down.

our stated *goals* for using this system are two-fold:

    a method for communicating from the master to the slave machines
    about what has changed, and what they should try to update.

    a record of everything that has ever changed, so that if we had to
    start over from the original source media (the cd-roms the data
    arrive on), we could, updating only what needed updating.

i don't have much of a problem with the first goal: we need some
communication method from master to slaves.  i am increasingly
nervous about the second goal as we get larger and larger, and am
looking for other ways to address or think about that problem.

it might be that we:

    give up on "record of everything that has ever changed" in favor
    of "record of what has changed since the last complete checkpoint
    of our corpus", keep using our change system, and give up on the
    "restore from original media" idea.

    use a version control system that can handle millions of objects
    changing (which would be?!), and handle the master-to-slave
    transport of changes efficiently.

    keep going about things as we have, and just hope we never have to
    restore from scratch.

    something else?
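on the checkpoint option above: one common technique (sketched here
with invented function names; nothing below is from the real system)
is a checksum manifest of the corpus taken at each checkpoint.
diffing two manifests gives exactly the set of files the slaves need
to fetch, or the set to re-apply after a restore from the cd-roms:

```python
import hashlib
import os

def manifest(root):
    """Map each file (path relative to root) to an MD5 of its contents.

    A sketch only: for millions of objects you would hash in chunks
    and persist the manifest to disk rather than hold it in one dict.
    """
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            sums[os.path.relpath(path, root)] = digest
    return sums

def changed_since(checkpoint, current):
    """Files added or modified since the checkpoint manifest was taken."""
    return sorted(p for p, h in current.items() if checkpoint.get(p) != h)
```

the same diff would serve both goals: the changed-file list is what
gets shipped to the slaves, and keeping the latest manifest plus the
files it names replaces an ever-growing full history.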

anyone here approached this kind of problem, know someone who has, or
have any ideas about it?  people/places i can seek advice from?
anything is appreciated, thank you.

nigel kerr
