Re: CVS corrupts binary files ...

From: Paul Sander
Subject: Re: CVS corrupts binary files ...
Date: Tue, 8 Jun 2004 21:21:16 -0700

>--- Forwarded mail from address@hidden

>Greg writes:
>> CVS is designed _only_ for tracking changes in
>> human written text files.

>Paul writes:
>> Keep in mind also that there's a difference
>> between "binary files" and "mergeable files".
>> The two concepts are in fact orthogonal; there
>> are mergeable binary types (given a suitable
>> tool), and there are unmergeable text types. CVS
>> is bad at storing unmergeable files, no matter
>> whether or not they're binary files. CVS can be
>> easily modified to support mergeable binary
>> types, as I've demonstrated, without significant
>> impact to its design.

>In my view, CVS was designed to add a model of
>concurrent modification and automatic merges on
>top of the previously existing Revision Control
>System representation of files. The removal of
>exclusive locking for changes is the fundamental
>reason that CVS exists.

>It may be that the diff3 algorithm is not always
>the best one suited to do such merges.

Well said.

However, I'm not 100% convinced that "automatic merges"
is a prerequisite of the concurrency model.  I see no
reason why a command like "cvs update" could not spawn
a graphical merge tool (even for C source code, for
example).  However, such actions, should they stall,
must not stop others from doing their own merges or
from committing new changes.

>For example, using a UTF16 character set in a
>file may prove to be difficult to merge
>even if the text in the file is only a "simple"
>Chinese representation. Perhaps something like
>the xcin project will eventually provide a diff3
>for use in this case.

>It may be desirable to mark UTF8 or UTF16 files as
>being 'binary' in order to preserve the text more
>exactly across operating systems that are not
>(yet) friendly to such text.

>For this reason, I take Paul's side on the issue
>of the orthogonal nature of the discussion of
>files that may or may not be "merged" using
>automatic tooling of some sort.

Thanks!  :-)

>I also share Greg's bias that using CVS to save
>arbitrary binary data and/or derived objects is
>not something that is a core competence of CVS.

Saving derived objects is definitely not a best practice
in SCM, at least not in the source control system.  Whether
arbitrary (or opaque) binary data should be stored in CVS is
a stickier question, because such data may very well be source
code (i.e. data that can be created or modified only by human
intervention), in which case I believe it belongs in the
source control system.

For merges, opaque data must be handled appropriately.  One
way is to take Greg's approach and boot it out completely.
I believe a better way is to apply a simple selection tool
that takes the place of a merge tool.  (After all, any data
type is mergeable if you can swap out the entire contents of
a file in one chunk, right?  :-)
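To illustrate, the whole-file "selection merge" could be as small as
this sketch (the function name and the "choice" argument are mine,
not anything CVS provides):

```python
import shutil

def select_merge(local_path, other_path, dest_path, choice):
    """A trivial 'merge' for opaque files: rather than combining
    contents line by line, copy one contributor wholesale into the
    result.  'choice' would come from the user ("local" or "other")."""
    src = local_path if choice == "local" else other_path
    shutil.copyfile(src, dest_path)
    return dest_path
```

The point is that the tool never needs to understand the bytes it is
merging; it only needs the user to pick a winner.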

>For myself, I have no objection to a few small
>icons being checked into a repository that will
>also be holding sources that use them (of course,
>I would usually favor them being converted into a
>text representation such as xbm format or the
>like). I have seen where using very large binary
>objects can cause problems for both users and

It's important to note that xbm format is also an unmergeable
data type, at least with diff3, even if such files do not
contain non-printable ASCII characters.  The reason is that
it's hard to edit an image without seeing it as an image.

I agree about storing large binary files in CVS; it would be nice
if there were multiple storage managers to choose from, depending
on their suitability to the data at hand.  But given that RCS
works (though admittedly not necessarily well) in all cases, it's
good enough (for 95%+ of the files thrown at it) that I don't see
a reason to change at this moment.  ('Course, I'd be happy to
participate in a separate discussion about creating an abstraction
layer over RCS and plugging in other storage managers...  :-)

>I have also seen problems where folks checkin
>derived objects such as PostScript files that are
>pure text files, but normally are not merged
>effectively by a diff3 program during a normal
>'cvs update' of a file.

>I believe that adding flexibility to CVS as to
>what program should be used in place of diff3 for
>doing a merge operation is desirable.

>That said, I do not know the correct approach to
>take for allowing the cvs admin or user do such a
>merge with a non-diff3 tool. Some such tools are
>(by their nature) interactive and this does not
>seem to be a good fit with the CVS methodology.

I believe that the data type should be stored in a
newphrase in the admin section of the RCS file.
The bad thing about that is that if the RCS file is
recycled with a new data type, or if it contains
different data types on different branches, there
is no correct value for the newphrase.
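For illustration, a data-type newphrase in the admin section of an
RCS file might look like the sketch below.  The `datatype` keyword is
hypothetical; RCS defines no such phrase, but the file format tolerates
unrecognized newphrases after the standard admin keywords:

```
head     1.3;
access;
symbols;
locks; strict;
comment  @# @;
datatype @image/png@;
```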

Others have stated that the data type should be
stored with each version of the file.  That way
you can tell when a nonsensical merge is attempted.
But then the data type must be accurately maintained
with every commit.

Another way is to have the merge tool analyze the data
types of all of the contributors, and fail if they're
not all the same (or at least are not compatible given
the semantics of a content merge).

There are ways to address this in the general case,
but they involve very intrusive changes to the CVS
design.  (The bottom line here is to decouple
the data from its path in the workspace, which means
a new method of mapping RCS files to working copies
is needed.  Having done this, you can guarantee
that every revision stored in any RCS file contains
the same data type.)

>Some such programs may only be available on client
>machines while others would potentially be
>available on the server. I typically favor that
>such programs would be considered to be present on
>the server and NOT on the client.

Resources that maintain the integrity of the repository
and enforce process must necessarily be on the server.
The *info scripts, for example, fall under this category.
However, merge tools, like the tool used to edit commit
messages, should be configurable by the user on the client
side.  Allowing the user to choose his favorite tool can
do nothing but improve his productivity.
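As a sketch of what that client-side configuration could look like:
CVS already picks the commit-message editor from CVSEDITOR (falling
back to EDITOR), and a merge tool could be chosen the same way.  The
CVS_MERGE_TOOL variable here is hypothetical; CVS has no such knob
today:

```python
import os
import shlex

def choose_merge_tool(default="diff3 -m"):
    """Pick the merge command from the user's environment, the way
    CVS picks its commit-message editor.  CVS_MERGE_TOOL is a
    hypothetical variable; 'diff3 -m' is the status-quo fallback."""
    return shlex.split(os.environ.get("CVS_MERGE_TOOL", default))
```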

>The exact semantics and rules under which a
>substitution for a different program than diff3
>could be used for a merge operation need to be
>carefully considered before we jump into a change.

No doubt about that.

>I suspect that we would need to add a filetype
>recognizer into cvs as a preliminary step to help
>to classify the type of a file that is to be
>merged (or added or imported for that matter) in
>order to know which of the potentially large
>number of three-way merge programs and scripts
>should be used or considered for use during a
>given cvs merge operation.

There's also the question of _when_ to run the
recognizer.  Above I mentioned three distinct times
when such a mechanism might be used:  Add time, commit
time, and merge time.  Each has its advantages and
disadvantages.

I think one viable compromise given the current
design would be to record an initial data type at
add time and propagate it with every commit.  The
user would be allowed to override the datatype with
every commit.  If a dead file is resurrected, the
old data type is remembered as a default.  When a
merge is done, the recorded data types of all of
the contributors are compared and some suitable
action is taken.

Suitable actions might be a failure if the contributors
are of different types, or to ignore the common ancestor
(i.e. perform a 2-way merge rather than 3-way) if the
ancestor differs from the contributors.  Or perhaps a
conversion to a universal format could be done (e.g. if
the ancestor is Word and the contributors are RTF and HTML,
they could all be converted to a common format like XML)
before the merge, with the result then saved back in the
expected format.
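The compare-and-act step described above could be sketched as a small
dispatcher (the type labels and return values are purely illustrative,
not anything CVS defines):

```python
def plan_merge(ancestor_type, local_type, other_type):
    """Decide how to merge from the recorded data types of the three
    contributors.  Return values are illustrative action labels."""
    if local_type != other_type:
        return "fail"       # the two heads disagree: refuse the merge
    if ancestor_type != local_type:
        return "two-way"    # ignore the ancestor, merge the two heads
    return "three-way"      # all types agree: normal diff3-style merge
```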

>I do not consider filetypes driven by the name of
>a file to be useful in such deliberations.

Certainly not in the general case.  Naming conventions
might be adequate on a per-shop or per-project basis,
and for some data types naming conventions can be very
accurate.  But I agree that a better method is needed
because in the general case the success rate at guessing
data types based on naming conventions alone is pretty low.

If it weren't for the "cvs import" command, punting might
be a possible solution:  Just require the data type as
input to the "cvs add" command.  But if large numbers of
files are to be added at once, something better is needed.
Alternatives include a file(1)-like mechanism to analyze
a file's content in addition to naming conventions, or
requiring a list of path/datatype pairs as an argument.
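A minimal file(1)-style recognizer along those lines might combine a
few magic numbers with a printable-text heuristic.  The magic table
and the type labels below are illustrative, not a real classification
scheme:

```python
def guess_type(path):
    """file(1)-like sniff: check magic bytes first, then fall back
    to a does-it-decode-as-UTF-8 heuristic for text vs. binary."""
    magic = {
        b"\x89PNG": "png",
        b"%PDF": "pdf",
        b"\xff\xfe": "utf16-text",   # UTF-16 little-endian BOM
        b"\xfe\xff": "utf16-text",   # UTF-16 big-endian BOM
    }
    with open(path, "rb") as f:
        head = f.read(512)
    for sig, datatype in magic.items():
        if head.startswith(sig):
            return datatype
    try:
        head.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "binary"
```

A real implementation would want a much larger magic table (or would
simply shell out to file(1)), but even this much would beat guessing
from file names alone.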

>If anyone has any suggestions or other patches
>for this kind of feature, I would be interested
>in hearing about them.

I'm sure this discussion will be quite lively!  :-)

>--- End of forwarded message from address@hidden
