Re: [bug-diffutils] New diff option to compare file/directory properties

bug-diffutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-diffutils] New diff option to compare file/directory properties

From:	Andreas Gruenbacher
Subject:	Re: [bug-diffutils] New diff option to compare file/directory properties
Date:	Fri, 15 Oct 2010 15:38:47 +0200
User-agent:	KMail/1.12.4 (Linux/2.6.36-rc7+; KDE/4.3.5; i686; ; )

On Thursday 14 October 2010 17:47:38 Duncan Moore wrote:
>   Hello Andreas
> 
> Thanks for the feedback.
> 
> On 11/10/2010 18:11, Andreas Gruenbacher wrote:
> > I think we should extend the diff file format so that additional 
> > information can be included.  Patches should still remain backwards
> > compatible as much as possible, though.
> 
> I'm not sure we can have existing implementations of 'patch' working 
> with the new 'diff --compare=...' output format, unless we have all the 
> differing properties on a single line,

We can relatively easily put additional information before the actual diff, 
like git does.  Existing patch implementations like GNU patch which ignore 
leading information which they don't understand will continue to apply diffs 
to regular files just fine.

> so I think it may be best to forget about backwards compatibility in this
> sense.

I strongly disagree.  People will continue to use existing implementations of 
patch, and it is important to ensure that those implementations will behave 
"reasonably" with extended patches.

> Of course, if the --compare=... option isn't used, then there is no
> change to the 'diff' output, and no backwards compatibility issues.

I think that some information like file permissions and symlinks should be 
included in patches by default.

> Firstly, I'll just state some of the points that the updates were 
> intended to address, taken from the 'diff' info manual:
> 
> Section 18.1

It is important to note that section 18.1 is about "some ideas for improving 
GNU diff and patch".  I agree with many of those ideas, but this section 
probably shouldn't be treated as if it was set in stone.



> - One should be able to use gnu 'diff' to generate a patch 
> from any pair of directory trees, and given the patch and a copy of one 
> such tree, use patch to generate a faithful copy of the other.
> 
> Section 18.1.2 Handling Changes to the Directory Structure.
> 'diff' and 'patch' do not handle some changes to directory structure. 
> For example, suppose one directory tree contains a directory named 'D' 
> with some subsidiary files, and another contains a file with the same 
> name 'D'. 'diff -r' does not output enough information for 'patch' to 
> transform the directory subtree into the file.
> There should be a way to tell 'patch' that a file's time stamp has 
> changed, even if its contents have not changed.
> These problems can be fixed by extending the 'diff' output format to 
> represent changes in directory structure, and extending 'patch' to 
> understand these extensions.
> 
> Section 18.1.3 Files that are Neither Directories Nor Regular Files
> Some files are neither directories nor regular files: they are unusual 
> files like symbolic links, device special files, named pipes, and 
> sockets. Currently, 'diff' treats symbolic links like regular files; it 
> treats other special files like regular files if they are specified at 
> the top level, but simply reports their presence when comparing 
> directories. This means that 'patch' cannot represent changes to such 
> files. For example, if you change which file a symbolic link points to, 
> 'diff' outputs the difference between the two files, instead of the 
> change to the symbolic link.
> 'diff' should optionally report changes to special files specially, and 
> 'patch' should be extended to understand these extensions.
> 


> Additionally, 'diff' is not a tool for just creating input for 'patch', 
> it's also for allowing the user to easily see what's different in file 
> contents.

Fair enough.  

> The aim was to carry this principal over for other file 
> properties, so that the user can selectively compare any file 
> properties, either with or without the traditional file contents 
> difference. This would be separate from it's use in creating patch 
> files, although the output format would be the same.

I don't see the point in excluding file content differences.  What would be a 
reasonably general use case for this?

> > We also need to define what diff files are supposed to be used for: for
> > example, they currently contain file modification times.  Those timestamps
> > are usually ignored when applying a diff (or left out in the first place).
> > It is not entirely clear which additional information to include by
> > default (on the diff side), and which information to ignore by default (on
> > the patch side).  I don't think file ownership should normally be stored
> > in diff files, for example.
> 
> My thinking was that their is no 'default' set of extra properties that 
> 'diff' outputs - the user has to specify them with keywords. With a 
> traditional patch file from 'diff' (i.e. --compare has not been used), 
> 'patch' would operate exactly as it does now. If --compare=all has been 
> used to supply extra information, then 'patch' would default to using 
> some subset (possibly null) of the available property information - 
> that's open to discussion, and more of a 'patch' matter than a 'diff' 
> one. And a 'patch' option like --use=time,mode etc could be used to 
> explicitly specify some subset of available properties to use.
> I have used the keyword 'all', to select all available properties. Maybe 
> we need to supplement this with another keyword, say 'basic', that only 
> selects basic properties likely to be of use in a patch file - e.g. 
> 'basic' would not include ownership. And 'patch' would default to using 
> the same 'basic' subset.

I don't think introducing all those funny keywords is the right way to go.  
Most people will not use any of them, and GNU diff should probably still 
include most or all additional information available as long as it doesn't 
cause backwards compatibility problems.

> > Probably the most widely used extended diff format today is that of git.  
The
> > "git diff" format always starts with a line of the form:
> >
> >     "diff --git old_name new_name"
> 
> I'm not familiar with the git 'diff' format, but I've had a look at the 
> link you've given. I'm not sure how desirable it is to be compatible 
> with this format. Only the mode information seems to be of  direct 
> relevance to 'diff'.

As is distinguishing between empty and non-existent files, filename quoting, 
symlink support.  Please have a closer look at how this format works to avoid 
re-inventing the wheel.  You can find some examples and "git diff" format 
parsing code in the GNU patch repository.

> Some of the changes are to do with files that have been renamed or copied.
> This is information that 'diff' couldn't be expected to generate,

Not trivially, but version control systems have shown that it is possible: 
some track renames and copies; others use heuristics to recognize similar 
files.

> although maybe the 'diff' output format should be able to accommodate this
> information even though it couldn't generate it itself.

Yes, the format should definitely be powerful enough to express copies and 
renames.

> Also, git diff uses 'old' and 'new' in the output, which are concepts that
> are not /directly/ relevant to 'diff' itself.

How are they not directly relevant?  Diff also basically compares two files; 
feel free to call them 'file1' and 'file2' or 'left' and 'right' or whatever 
instead of 'old' and 'new'.

> > The features that the "git diff" format adds include:
> >
> >   * Filename quoting using C string literals for filenames which require
> >     quoting.
> 
> Filename quoting is something I'm intending to include in 'diff' 
> separately. Currently I'm using shell quoting style, so only filenames 
> with unusual characters get quoted.

Only quoting unusual characters is a good idea because it means that the 
format will remain compatible with old versions of patch when "normal" 
filenames are used.

Also see here:

        http://marc.info/?l=git&m=112927316408690&w=2

> It would be easy to change this to some other style already in gnulib, but
> I'll leave it to others to decide what the best style should be.

One of the key differences between quoting in gnulib and in git is that gnulib 
treats filenames as sequences of characters which may be multi-byte, while git 
treats them as sequences of bytes.

I'm not sure about the reasoning behind ignoring multi-byte characters; maybe 
Paul or Junio can enlighten us.

> >   * Distinguishes between empty files and non-existing files.
> 
> I'm not sure what format git diff uses. In my updates, -N --compare=all 
> would do it. One file gets marked as "non-existent" (see example at 
> end), and there would be no contents difference output (so it's implicit 
> rather than explicit).

Please have a look at how this format works then.  There is some more 
information in this thread on the git mailing list:

        http://www.spinics.net/lists/git/msg142547.html

> >   * Support for symlinks.
> 
> Again, I'm not sure how git diff does this. See the example at the end 
> for one way we could incorporate it with --compare=...
>
> >   * Support for binary files and binary deltas is included.  (The
> >   "payload" is base85 encoded in this case.)
> 
> I wasn't intending to incorporate this, and am not sure if 'diff' 
> should. If it was incorporated in the future I think a separate option, 
> --binary-content say, might be preferable to adding it to the 
> --compare=... keyword list.

Fair enough for now.

> >   * The SHA-1 hash of the file before and after a change is included.
> 
> Again, I wasn't intending to incorporate this, and am not sure if 'diff' 
> should. A keyword 'hash' could be added to the --compare=.. list. It 
> would lead to some 'quirks' though e.g., diff -b reporting identical 
> contents, but the hashes being different.

Fine.

> >> --compare=[content,time,mode,size,owner,group,all,objects]
> >> This option provides a number of file properties to compare, for normal
> >> and both context output styles. Keyword 'content' means the file
> >> contents, [...]
> > I don't understand.  There are two things to distinguish here: what diff
> > looks at in order to decide if a file is considered "identical", and what
> > information diff includes in its output.  Which refers to which?
> 
> For most of the examples I gave, if any of the specified file properties 
> differ, then the whole set of specified file properties are printed for 
> that file object (all on one line). I then gave an alternative output 
> format, where only the selected properties that differ are printed (on 
> separate lines). I now think that we should be going with the second 
> format, or something like it.

I agree that one item per line is preferable.

> > Character and block special files don't make sense for me to include in a
> > diff: they are not portable by design, and make little to no sense to
> > apply in a different context (on a different machine or even after a
> > driver load or removal on the same machine).
> 
> Some special files may not be useful for 'patch', but 'diff' is not a 
> tool for just supplying input for 'patch'. As a tool for saying what's 
> different between files, I think 'diff' should be reporting it if these 
> types of files are present.

It does now already.  IMO it would be sufficient for GNU diff to report the 
file type and name either when a file is found only in one tree or when a file 
exists in both trees but refers to different file types or devices.  Device 
numbers should probably not end up in patches; this is not an archive format.

> [...]
> This means that the example above would become:
> 
>    % diff -u --compare=all aaa bbb
>    --- aaa
>    +++ bbb
>    --- time: 2010-09-23 20:51:06.984375000 +0100
>    +++ time: 2010-09-23 20:51:03.015625000 +0100
>    --- mode: rw-r--r--
>    +++ mode: rwxr--r--
>    @@ -1 +1 @@
>    -111
>    +222

No way: overloading the meaning of the "---" and "+++" headers would 
completely break every existing implementation of patch.

> Non-existent files would give this sort of output:
> [...]

Please have a look how GNU diff uses Epoch timestamps and how cvs / svn / git 
use /dev/null for creates and deletes.

> Symbolic links are normally dereferenced and treated like regular files,
> but a --no-dereference option could be used to treat symbolic links like 
> the other special files:

I believe this is a mistake; the way how GNU diff currently treats symlinks 
(as regular files) is likely to lead to undesired results.  At the very least, 
GNU diff should print a warning.

Symlinks never have any "file content" so the file the symlink points to could 
be treated as the "file content".

Thanks,
Andreas

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [bug-diffutils] New diff option to compare file/directory properties, Andreas Gruenbacher, 2010/10/11
- Re: [bug-diffutils] New diff option to compare file/directory properties, Duncan Moore, 2010/10/14
  - Re: [bug-diffutils] New diff option to compare file/directory properties, Andreas Gruenbacher <=
    - Re: [bug-diffutils] New diff option to compare file/directory properties, Paul Eggert, 2010/10/15
    - Re: [bug-diffutils] New diff option to compare file/directory properties, Junio C Hamano, 2010/10/17
    - Re: [bug-diffutils] New diff option to compare file/directory properties, Duncan Moore, 2010/10/19

Prev by Date: Re: [bug-diffutils] New diff option to compare file/directory properties
Next by Date: [bug-diffutils] Does patch ignore Linux ACLs?
Previous by thread: Re: [bug-diffutils] New diff option to compare file/directory properties
Next by thread: Re: [bug-diffutils] New diff option to compare file/directory properties
Index(es):
- Date
- Thread