info-cvs

Re: Proposal to fix CVS binary file implementation


From: Paul Sander
Subject: Re: Proposal to fix CVS binary file implementation
Date: Thu, 28 Dec 2000 16:38:02 -0800

--- Forwarded mail from address@hidden (Greg Woods)

>[ On Thursday, December 21, 2000 at 00:22:41 (-0600), David L. Martin wrote: ]
>> Subject: Proposal to fix CVS binary file implementation
>>
>>  If the user wants to change the "binary-ness" of a file, this should
>> be performed using cvs admin, but never "on-the-fly" based on options
>> to checkout or update.

>Well, it's not quite that simple.  A file of the same name may have very
>different content on different branches, and if binary files were to be
>supported in some way then it would be necessary to think of the
>situation where a file might be a binary on one branch, but not on
>another, for example.  Indeed a "change" that might be checked in might
>convert the file from binary to non-binary, or vice versa.  I.e. in an
>ideal theoretical world, without constraints, the binariness of a file
>would be defined in the repository on a per-revision basis and could be
>instantly changed by "cvs checkin", which would imply that by default it
>must also be changed in the working directory by "cvs checkout" too.

>Unfortunately the current implementation using an RCS-format repository
>simply cannot possibly ever manage to represent this kind of
>per-revision file state.  RCS "keyword" flags apply to all revisions and
>all branches simultaneously.  The very notion of "per-file flags" is
>bogus and useless for any purpose in a revision control system, but
>that's what we've got to live with so long as we have RCS-format files
>for the repository.

Branches should never be used to separate files that should in actuality
have different version histories.  Using version-based attributes to
determine the type of the content perpetuates the problem and makes it
altogether unsolvable, at least in any practical sense.

The decision to use an unversioned database to map version histories to
working files (the filesystem) was simply wrong.  Adding a versioned one
will solve a great many problems, including this one, in a simple and
elegant way.

>> We need to decouple, in concept and in implementation, "binary" and
>> "keyword expansion mode".  The binary nature of a file (which mandates
>> no EOL translation and no keyword substitution) is an immutable
>> attribute of the file which must always take precedence.  It should
>> not be adjustable using checkout or update.  I cannot think of any
>> circumstance where a binary file would ever need to be transiently
>> defined as anything different.

>This is obvious.  What's not obvious is how to deal with all of the
>other issues of binary file handling.  Since CVS does not now, and
>cannot by design, properly handle binary files, the tricks of handling
>keyword expansion may as well take precedence over any concept of
>binariness.

We have discussed this in detail, too.  CVS' design CAN accommodate
binary files.  It's just a question of how far we want to go with it.
Type managers can handle keyword expansion issues as well as provide
suitable content-sensitive merge tools.

>> My CVS Christmas wish: Can we (CVS users/developers) come to a
>> consensus to devise a fix to allow keyword expansion, binary files,
>> and merging to work harmoniously "out of the box" (e.g. in a way that
>> will make it into the main CVS code line?)  I believe we have a lot of
>> developers and CVS administrators implementing a variety of
>> workarounds.  I know this has been brought up several times in the
>> past and has resulted in many a flame war.

>Well the only way to make CVS allow binary files and merging to work
>harmoniously will be to change the fundamental laws of the universe and
>introduce some real magic into the world!  :-)

>Seriously the only way this can ever work is to give up on having a
>strictly RCS-based repository format.  If you're willing to throw away
>at least part of RCS for the back-end repository, and if you're willing
>to re-implement CVS to use a much more sophisticated database design
>that takes into account the far more complex requirements of handling
>binary data, then sure, you could do this.  You might as well start
>completely from scratch and simply attempt to retain the same
>command-line interface and perhaps some rudimentary backward
>compatibility with the client/server protocol such that older clients
>can still do basic text-only operations against a new server.

RCS can still be used as the back-end versioning mechanism, even for
binary files.  Type managers add the potential for using alternatives
as well (e.g. compressing the containers).

>> A frequent argument against changing is that CVS was not designed to
>> handle binary files.  This may be true, but the introduction of the
>> -kb option tends to prove a willingness and desire in the CVS
>> development and user community to accommodate binary files.

>That's a completely invalid argument -- your conclusion is invalid.  The
>introduction of '-kb' comes from RCS, not CVS, and in no way implies
>that CVS can make any more than the most rudimentary use of it.  Just
>because RCS has a feature doesn't mean CVS uses it in the way you might
>logically conclude from its use in RCS (eg. branching, locking, keyword
>expansion, state fields, etc., etc., etc.).

One could say that CVS shouldn't interfere with the operation of the
tools it relies on, but that's yet another argument.

>>  Many (myself included) have implemented procedures or written scripts
>> to effectively exclude binary files from the merge operation, or to
>> perform some pre- or post-processing on the files or archives used in
>> the merge to correct the problems encountered using cvs update -kk.

>Either you have a different concept of merging, or you have not done
>exactly as you claim you have.

>> Others may construct their repositories so that binary files live in
>> their own directories in a sort of "binary prison", apart from the
>> ASCII source files, so that the binary files may be more easily
>> excluded from merge operations.  I don't think this is a good solution
>> because CVS then dictates repository structure, even when cohesive
>> functional grouping may dictate that ASCII and binary files should
>> coexist in the same directory.

>Well it's the only logical solution given the constraints of the tools
>at hand.....  The effect of this solution on repository structure is
>nowhere near as important as you seem to make it out to be.

"The tools at hand".  They can be modified; use the source!

>> I think it's time for us to close the loop and implement binary file
>> support in a manner which is more merge-friendly, one which
>> accommodates both ASCII and binary files in the same merge operation
>> (where merging of binary files results in *copies* being made and no
>> actual merging - with no binary file keyword expansion or EOL
>> translation).

>Who makes the choice of which "copy" survives?  How is this choice
>reversed if the original decision is incorrect?

The user does, just like with any conflict.  The choice can also be
specified on the command line if you want the operation to complete
without interaction (with the understanding that for some files the
choice made might be incorrect, but at least the user asked for it).
"cvs update" with the proper arguments reverses the decision if no
local changes were made or after the working copy is removed.
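
A minimal sketch of that copy-style resolution, in Python.  The names
Keep and merge_binary are hypothetical and nothing here is existing CVS
behaviour; it only illustrates the rule "never merge bytes, pick a copy,
and let the user or a command-line policy decide when both sides
changed".

```python
from enum import Enum
from typing import Optional, Tuple


class Keep(Enum):
    """Hypothetical non-interactive policy, e.g. given on the command line."""
    MINE = "mine"
    THEIRS = "theirs"


def merge_binary(base: bytes, mine: bytes, theirs: bytes,
                 policy: Optional[Keep] = None) -> Tuple[bytes, bool]:
    """Return (content, conflict).  No byte-level merging is attempted."""
    if mine == theirs or theirs == base:
        return mine, False      # the other side brought nothing new
    if mine == base:
        return theirs, False    # only the other side changed: take its copy
    if policy is None:
        return mine, True       # both changed: keep working copy, flag conflict
    return (mine if policy is Keep.MINE else theirs), False
```

With no policy the conflict is reported and the working copy is left
alone, which is where the user steps in; with a policy the operation
completes without interaction, right or wrong.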

>> Here's what I would propose (and I underscore *propose*):
>>
>> 1) Maintain the current keyword expansion modes, as persisted in the
>> archive or in the local working area in the Entries file as "kv, kvl,
>> k, o, b, or v";
>>
>> AND
>>
>> 2)  EITHER:
>>
>>      a) Provide a new command line keyword expansion option "-km" on
>>      cvs update and cvs checkout to support merging.  The effect would
>>      be that the working area local keyword substitution mode would
>>      be overridden to "k" for all but binary files, which would remain
>>      "b".
>>
>>      OR
>>
>>      b) Change the current behavior of update and checkout to never
>>      override the archive-stored default keyword substitution mode for
>>      binary files.
>> 
>> Any comments?  Wait, let me put on my Kevlar heat-resistant suit
>> first...
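
The two alternatives in point 2 can be sketched as a pair of small
Python functions.  This is only an illustration of the proposal as
worded above: "-km" is the proposed option, not an existing CVS flag,
and the function names are made up for the example.

```python
from typing import Optional


def merge_mode(stored: str) -> str:
    """Option 2a: the proposed -km forces "k", except for binary files."""
    return "b" if stored == "b" else "k"


def effective_mode(stored: str, requested: Optional[str]) -> str:
    """Option 2b: a -k override on update/checkout never touches binary files.

    `stored` is the archive default (kv, kvl, k, o, b or v); `requested`
    is the mode given on the command line, or None if there was none.
    """
    if stored == "b" or requested is None:
        return stored
    return requested
```

Either way the stored "b" wins over anything given on the command
line; the difference is only whether that rule is tied to one new
option or applied to every override.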

>I believe your proposal is somewhat naive in that it does not address any
>of the main issues of trying to manage binary files in a revision
>control system that's specifically designed to allow for concurrent
>editing.

It's a lot better than what you, Greg, propose....

>Think about it: You've got some changes you are about to commit that
>include changes to a file which you've tagged as un-mergable (i.e. it is
>a binary, opaque, file).  As you run "cvs commit" you discover that
>someone else has simultaneously made changes to that file.  Now what?
>You can't even use "cvs diff" to find out what the heck they did!  You
>can only guess by investigating their revision comments and/or by asking
>them out-of-band.  If the file has some structure that's visible in some
>other medium than a text editor (eg. it's a JPEG) then you can perhaps
>visually compare your revision, the ancestor revision, and the other
>person's new revision.

Okay, now suppose you have a type manager that can invoke the proper merge
tool for the file's content.  The merge proceeds and the user resolves
conflicts normally.  No big deal.

Oh yeah, there's that problem where different versions might contain
different types of data.  Again, files containing different types of
data should have different version histories.  Unfortunately, CVS in
its current form requires a unique mapping between version histories
and working files, so people use it improperly because they have no
alternative that meets more pressing requirements.

>So, OK, you're willing to work around these issues with CVS to try to
>maintain some semblance of concurrent editing support.  Perhaps you're
>even willing to use the "cvs edit" hack and some externally imposed
>procedures and processes to prevent your users from concurrently editing
>binary files.

>Now what about the scenario when you go to merge two branches together
>and there are conflicting changes in binary files, but where both
>changes must be retained?  Suddenly your difficulties are twice as large
>and twice as hard to fix.

Hence, you employ a type manager to supply the proper merge tool.  The
CVS algorithm remains the same.  The selection of the diff tool (consult
the type manager vs. use a hard-coded one) is the only change needed.
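
As a rough sketch of "consult the type manager vs. use a hard-coded
one", in Python: a registry maps a file type to a merge routine and
falls back to a default when nothing is registered.  TypeManager,
trivial_merge and keep_theirs are all hypothetical names, and the two
merge routines are stand-ins rather than real tools.

```python
from typing import Callable, Dict, Tuple

# (base, mine, theirs) -> (result, conflict)
Merger = Callable[[bytes, bytes, bytes], Tuple[bytes, bool]]


def trivial_merge(base: bytes, mine: bytes, theirs: bytes) -> Tuple[bytes, bool]:
    """Stand-in for the hard-coded tool: resolves only the trivial cases."""
    if mine == theirs or theirs == base:
        return mine, False
    if mine == base:
        return theirs, False
    return mine, True           # both sides changed: report a conflict


def keep_theirs(base: bytes, mine: bytes, theirs: bytes) -> Tuple[bytes, bool]:
    """Stand-in for a content-specific tool registered for one file type."""
    return theirs, False


class TypeManager:
    def __init__(self, fallback: Merger) -> None:
        self._mergers: Dict[str, Merger] = {}
        self._fallback = fallback

    def register(self, file_type: str, merger: Merger) -> None:
        self._mergers[file_type] = merger

    def merger_for(self, file_type: str) -> Merger:
        return self._mergers.get(file_type, self._fallback)


def merge_file(tm: TypeManager, file_type: str,
               base: bytes, mine: bytes, theirs: bytes) -> Tuple[bytes, bool]:
    # The surrounding algorithm is unchanged; only the tool lookup differs.
    return tm.merger_for(file_type)(base, mine, theirs)
```

The caller never needs to know which tool ran, so the lookup is the
only place where a file's type matters.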

That assumes, of course, that you can guarantee that all of the
contributors to the merge contain the same data type.  That's not an
unreasonable requirement, and it can be guaranteed with a versioned
mapping between working files and containers in the repository.

>What about the scenario where your repository has been around for a
>while and you find that users are beginning to want to re-use
>now-removed filenames, but with different attributes (eg. suddenly a
>file becomes a binary)?

Here is where the versioned mapping between working files and version
history is needed.  Files containing different types of data must have
unique version histories, even if they share a space in the filesystem
at different times or on different branches.
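
One way to picture such a versioned mapping, as a small Python sketch.
The class and its methods are hypothetical and do not describe any
existing CVS data structure; "container" stands for whatever holds one
version history in the repository.

```python
from typing import Dict, List, Optional, Tuple


class VersionedMap:
    """Per-branch history of path -> container bindings."""

    def __init__(self) -> None:
        # branch -> list of (version, path, container or None if removed)
        self._log: Dict[str, List[Tuple[int, str, Optional[str]]]] = {}

    def bind(self, branch: str, version: int, path: str,
             container: Optional[str]) -> None:
        """Record that `path` refers to `container` from `version` onward."""
        self._log.setdefault(branch, []).append((version, path, container))

    def lookup(self, branch: str, version: int, path: str) -> Optional[str]:
        """Return the container bound to `path` as of `version`, if any."""
        best: Optional[Tuple[int, Optional[str]]] = None
        for v, p, c in self._log.get(branch, []):
            if p == path and v <= version and (best is None or v >= best[0]):
                best = (v, c)
        return best[1] if best else None
```

A removed filename can then be reused for a different container, with
its own type and history, without disturbing the old one: the same
path simply resolves differently depending on branch and version.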

>The can of worms opened up by binary file support just gets deeper and
>wider the more you look at it.

Nope, they're the same problems we've discussed for years...

>While CVS as it stands has several features which make it generically
>attractive for general-purpose revision control, it cannot be stated too
>many times that CVS is *NOT* a general-purpose revision control system
>-- it is specifically a system *DESIGNED* to handle the special case of
>file formats which can easily be merged automatically with simple
>unix-style diff; and which as a result means it can specifically target
>the needs of those who must work in environments where concurrent
>editing must be allowed and encouraged.  This DESIGN implies that it has
>constraints on its operation which prevent it from being a truly general
>purpose tool.

Nevertheless, the algorithms that CVS implements are applicable generally,
and CVS can be extended to more general use.  There's nothing inherent in
its design that REQUIRES the use of a diff-based tool; that's just what
happens to be hard-coded in.

>Therefore if you do not like the DESIGN of CVS, and by definition the
>constraints it imposes on the resulting product, then DO NOT USE CVS,
>regardless of whatever other features it has which might make it
>attractive to you!

>I.e. if you want to handle binary files in a revision control
>environment then I strongly suggest that you'll be much further ahead if
>you simply take the ideas you like from CVS and start from scratch with
>a new design for a revision control system that is capable of handling
>the binary files you seem to need to handle.


>Of course, you could throw away the silly idea of trying to support binary
>files in CVS with an RCS-format repository, and instead focus on
>extending CVS and the RCS file format definition it uses such that a
>file type can be specified on a per-revision (or at least per-branch)
>basis.  Also devise an extension that allows deltas to be defined with
>byte or (multi-byte) character offsets instead of line offsets.  You can
>then design tools which can do logical difference comparisons of
>variants and merges of changes with specific knowledge of these file
>types.  THEN you'll have a more powerful revision control system that
>can simultaneously handle changes to many file types in an intelligent
>manner.  Such a system could even do intelligent comparisons of
>text-based source files such that changes would be recognized on a
>code-structure level instead of on a text-line level as it is today.
>This is obviously a more intensive redesign, but one which will be
>infinitely more productive than any attempt to handle binary files in
>any way whatsoever.
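
For illustration only, a delta expressed in byte offsets rather than
line offsets might look like the Python sketch below.  It uses the
standard difflib module to find the differing ranges; the Edit layout
and the function names are assumptions made for the example and have
nothing to do with the actual RCS file format.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Replace old[start:end] with data; offsets are bytes, not lines.
Edit = Tuple[int, int, bytes]


def make_delta(old: bytes, new: bytes) -> List[Edit]:
    """Compute byte-offset edits that turn `old` into `new`."""
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]


def apply_delta(old: bytes, delta: List[Edit]) -> bytes:
    """Rebuild the new content from `old` plus the edits, in offset order."""
    out = bytearray()
    pos = 0
    for start, end, data in delta:
        out += old[pos:start]   # unchanged bytes before this edit
        out += data             # replacement bytes
        pos = end
    out += old[pos:]            # unchanged tail
    return bytes(out)
```

Nothing in it depends on newlines, so it works on opaque data, though
it says nothing about whether the resulting differences are meaningful
for a given file type.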

This is completely the wrong approach.  There are way too many headaches
using version-specific typing, as you've observed.  It's better to constrain
the general case in a different dimension.

--- End of forwarded message from address@hidden



