Re: How to treat XML files checked into CVS


From: Paul Sander
Subject: Re: How to treat XML files checked into CVS
Date: Wed, 16 Apr 2008 01:07:21 -0700


On Apr 15, 2008, at 8:34 PM, Arthur Barrett wrote:

The thread began on September 14, 2001, with the subject "giving up
CVS".  A patch was posted with the subject "Demo of extensible merge
(was Re: giving up CVS)".



Hmmm, interesting... Not really sure if that covers all the possible
places a merge is initiated, but still interesting.

At the time I posted the patch, CVS had a wrapper function around the 3-way merge. That wrapper function was called from every place where a merge could be initiated. (At least, it applied to every merge that affected user-visible artifacts, i.e. files located in the user's sandbox.) So that was the proper place to insert the extensions.
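
To make the idea concrete, a dispatch wrapper of that sort might look roughly like the sketch below. It is Python rather than the C that CVS is written in, and the handler registry, tool names, and type labels are purely illustrative; only the diff3 fallback reflects what CVS actually does today.

    import subprocess

    # Hypothetical registry mapping a detected data type to an external
    # 3-way merge command; the tool names here are made up for illustration.
    MERGE_HANDLERS = {
        "text/xml": ["xmlmerge"],       # hypothetical XML-aware merge tool
        "image/jpeg": ["imagemerge"],   # hypothetical image merge tool
    }

    def merge_file(data_type, working, ancestor, contributor):
        """Dispatch to a type-specific merge tool, or fall back to diff3 -m."""
        handler = MERGE_HANDLERS.get(data_type)
        if handler is not None:
            # The external tool is handed all three versions and is expected
            # to leave its result in the working file.
            return subprocess.run(handler + [working, ancestor, contributor]).returncode
        # Default behavior: the ordinary textual 3-way merge.  diff3 -m writes
        # the merged text (with conflict markers) to stdout, so capture it and
        # overwrite the working copy.
        result = subprocess.run(["diff3", "-m", working, ancestor, contributor],
                                capture_output=True, text=True)
        with open(working, "w") as f:
            f.write(result.stdout)
        return result.returncode   # by diff3 convention, 1 means conflicts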

Keep in mind that the patch was done as a proof of concept, and it was not intended to be a model for production code. Relying on naming conventions to ascertain data types is not robust enough, in my opinion. Reading the file for magic numbers or other identifying traits is better, and it might even be the best way, given that CVS does not guarantee that every version stored in an RCS container has the same data type. (If it did, then something like a MIME type stored in the RCS file's admin phrase would be best. But making that guarantee in CVS would require a redesign from the ground up, and brings up other well-worn arguments.)
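
For what it's worth, content sniffing of that sort is straightforward. A minimal sketch might look like this; the magic numbers shown are the real PNG/JPEG/GIF signatures, but the function and its set of recognized types are just an illustration, not part of any patch:

    # Illustrative content sniffing by magic numbers; only a few well-known
    # signatures are shown, and the return values are arbitrary labels.
    def sniff_type(path):
        with open(path, "rb") as f:
            head = f.read(64)
        if head.startswith(b"\x89PNG\r\n\x1a\n"):
            return "image/png"
        if head.startswith(b"\xff\xd8\xff"):
            return "image/jpeg"
        if head.startswith(b"GIF87a") or head.startswith(b"GIF89a"):
            return "image/gif"
        if head.lstrip().startswith(b"<?xml"):
            return "text/xml"
        return "text/plain"   # fall back to ordinary text handling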

Do you still see this as a requirement?

I most certainly do. This is one of several missing features that I consider to be essential for anything other than small toy projects. Over time, members of this forum have raised the issue of merging many types of data, including: document formats like MS Word and FrameMaker; mark-up languages such as XML and HTML; image formats including GIF, JPEG, and PNG; motion picture formats such as MPEG; and composite data types like those used by NeXT Step and VLSI design tools. Even those who have opposed adding such extensibility have claimed to wish for better merge capability for their chosen programming languages than is possible with a diff3-based tool like the one supplied with CVS, by somehow bringing the tool "closer to" the language.

I will go so far as to claim that differencing and merging algorithms can be developed for every type of data, including those lost causes listed above. The degenerate case for merging is a simple selection, but even image files such as JPEG could have a meaningful merge if someone designed a proper user experience. I imagine a merge tool with four tiled images; three allow lasso-style selections and represent the contributor, ancestor, and working versions; the fourth has all of the editing capabilities of, say, Photoshop, and selections from the other three images can be pasted into it. That final image is the one that replaces the working version and is eventually committed at the completion of a larger merge across the project.

But no one has built tools for this purpose. Apparently there just hasn't been much demand for them. But if we add the hooks to our version control tools to enable this capability, the demand may slowly follow.

Do you have any response to the arguments I raised about how people use
merge tools?

I disagree that setting up an external application to perform a merge is a complex and messy proposition. I agree that the merge tools for many data types would probably rely on a GUI. I also believe that some adjustment of the existing CVS user interface may be desirable; most data types don't lend themselves to the kind of conflict mark-ups that we're used to in ASCII file formats. So the -kb style of handling might be necessary for most data types, along with an additional "cvs merge" command that invokes the proper merge tool to resolve conflicts detected during past updates and remembered in CVS's sandbox metadata, using the working copy and the fetched ancestor and contributor copies. In situations where the mark-ups are useful, the merge tool might simply be the user's favorite text editor.
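
As a sketch of what such a "cvs merge" step could look like: the sandbox-metadata lookup and the CVS_MERGE_TOOL setting below are assumptions for illustration only; fetching a revision's full text with "cvs update -p -r" is the one piece that exists today.

    import os
    import subprocess
    import tempfile

    def fetch_revision(filename, revision, destination):
        # Reconstruct a complete copy of one revision; "cvs update -p -r REV"
        # prints that revision to stdout without touching the working file.
        with open(destination, "w") as out:
            subprocess.run(["cvs", "update", "-p", "-r", revision, filename],
                           stdout=out, check=True)

    def resolve_conflict(filename, ancestor_rev, contributor_rev):
        # The revision numbers would come from conflict information remembered
        # in the sandbox metadata during a past update; that bookkeeping is
        # assumed here rather than implemented.
        tool = os.environ.get("CVS_MERGE_TOOL", "diff3")   # hypothetical setting
        with tempfile.TemporaryDirectory() as tmp:
            ancestor = os.path.join(tmp, "ancestor")
            contributor = os.path.join(tmp, "contributor")
            fetch_revision(filename, ancestor_rev, ancestor)
            fetch_revision(filename, contributor_rev, contributor)
            # The merge tool edits the working copy in place, with the two
            # reconstructed complete versions as its reference points.
            return subprocess.run([tool, filename, ancestor, contributor]).returncode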

For situations in which merges are initiated by wrapper tools (such as WinMerge), such tools should embrace the full capability of the underlying tool to the extent that is practical. If using WinMerge causes merge history to be lost, then there's something wrong with the integration: Either there aren't enough hooks in the lower-level tool to give access to that level of detail, or the higher-level tool lacks the ability to invoke the lower functions properly. In either case, at least one of the tools wasn't thought out well enough to permit the kind of tight integration that is really needed.

CVSNT certainly already has an alternative 'diff' mechanism used
(optionally) to create the deltas for binary files (-kB), and I can see
it as a relatively painless proposition to add '-kE' to use an
extensible method if this is still relevant.

There's a big difference between the diff algorithm used to compute the deltas stored in the version containers, versus the diff algorithm used to present differences to the user. I claim that these can, and indeed should, be different in the typical case.

RCS uses a longest common subsequence algorithm as a form of compression to minimize the size of the deltas between adjacent versions. This makes the storage of version history efficient, and small deltas make for fast reconstruction of versions. This is as it should be.
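
The effect is easy to demonstrate with any line-based diff. Python's difflib is not the algorithm RCS uses, and the file names below are placeholders, but the size comparison makes the point:

    import difflib

    # Two adjacent versions of the same file (placeholder names).
    old = open("foo.c.v1").read().splitlines(keepends=True)
    new = open("foo.c.v2").read().splitlines(keepends=True)

    delta = list(difflib.unified_diff(old, new, fromfile="1.1", tofile="1.2"))
    print("full size of new version:", sum(len(line) for line in new))
    print("size of stored delta:    ", sum(len(line) for line in delta))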

However, suppose a Java programmer wants to see differences between two versions of his source file. It turns out that Java is a fairly well-behaved hierarchical language (unlike C or C++, due to the macro preprocessor). This user would rather have his deltas presented to him in a way that reflects the structure of his program: the insertion, deletion, or modification of control structures and expressions, without regard to cosmetic formatting or the history of how those control structures and expressions came into being. Differencing algorithms such as the one published by Sudarshan S. Chawathe would be good for this, if they could be fitted with a good user interface. (See "Comparing Hierarchical Data in External Memory", Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999, for the published hierarchical diff algorithm. It might also be available for download from the University of Maryland.) This could be extended to implement a 3-way merge tool, too.

The bottom line here is that, given a working copy of the user's data and the identities of the ancestor and contributing versions, the RCS differencing and patching algorithms would efficiently construct complete copies of the ancestor and contributing versions. Then a 3-way merge tool that is specific to the type of data would be applied to the complete copies to give the user the view he wants of his data.

Getting back to the original topic, XML is also a well-behaved hierarchical data format. The type of modification that I propose for CVS would apply equally well to it.
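
As a toy illustration of what a structural comparison buys you here (this is nowhere near Chawathe's algorithm, just a positional element-by-element walk, but it already ignores formatting and reports changes in terms of the document's structure):

    # Toy structural comparison of two XML documents: reports elements that
    # were added, removed, or whose text changed, ignoring formatting.  Only
    # a sketch of the idea, not the published hierarchical diff algorithm.
    import xml.etree.ElementTree as ET

    def diff_elements(old, new, path=""):
        changes = []
        here = f"{path}/{new.tag if new is not None else old.tag}"
        if old is None:
            changes.append(("added", here))
            return changes
        if new is None:
            changes.append(("removed", here))
            return changes
        if (old.text or "").strip() != (new.text or "").strip():
            changes.append(("text changed", here))
        # Pair children by position; a real tool would match them by identity.
        old_kids, new_kids = list(old), list(new)
        for i in range(max(len(old_kids), len(new_kids))):
            o = old_kids[i] if i < len(old_kids) else None
            n = new_kids[i] if i < len(new_kids) else None
            changes.extend(diff_elements(o, n, here))
        return changes

    old_tree = ET.fromstring("<doc><title>A</title><body>hello</body></doc>")
    new_tree = ET.fromstring("<doc><title>A</title><body>hello world</body></doc>")
    for kind, where in diff_elements(old_tree, new_tree):
        print(kind, where)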



