

From: Scott McPeak
Subject: [bug-diffutils] bug#17492: diff handles moved text poorly; should not need --minimal
Date: Wed, 14 May 2014 02:08:02 -0700
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

GNU diff's default algorithm appears to handle moved text poorly. At the end of this message is a script that reproduces the problem. In essence: take a large file that contains many blank lines, then move a large-ish block of text (~10% of the file) from near the top to near the bottom. The default algorithm treats all the blank lines as unchanged, and therefore treats virtually all of the non-blank lines in the file as changed. This makes the diff useless for comprehending the change, and it basically destroys the history shown by a command like "git annotate".

After spending considerable time investigating this flaw, I discovered the --minimal flag. It gets the right answer, and git accepts it in many (all?) of the cases where it matters. But it is burdensome to add this flag to every command that uses diff, since I cannot know in advance whether a particular change will be affected by this bug. Few people know about the flag, so even if I use it, I still can't realistically move text in my files and expect others to be able to understand the resulting diff. Furthermore, the manual warns that --minimal can be very slow; "git annotate" is already quite slow (10s+) on realistically sized repos, so compounding that with --minimal on every invocation is not appealing.
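
One way to check whether a particular pair of files is affected is to compare the size of the default output with the size of the --minimal output. The following is only a sketch, assuming GNU diff is the diff on the PATH; diff_line_counts is a hypothetical helper name, not something diffutils provides.

```shell
#!/bin/sh
# Sketch only: diff_line_counts is a hypothetical helper, not part of diffutils.
# Prints "<default> <minimal>": the number of lines of unified diff output
# without and with --minimal for the two files given as $1 and $2.
diff_line_counts() {
  # diff exits with status 1 when the files differ; only the size of the
  # output matters here, so the exit status is ignored.
  default=$(diff -u "$1" "$2" | wc -l | tr -d ' ')
  minimal=$(diff -u --minimal "$1" "$2" | wc -l | tr -d ' ')
  echo "$default $minimal"
}
```

Running "diff_line_counts orig.txt new.txt" on the files from the script at the end prints both counts side by side. A default count far larger than the --minimal count suggests the change is affected.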

I think diff should, by default, handle the common case of moved text better.

Naively, it does not seem like detecting moved text should be very difficult or expensive. Since the blank lines are required to trigger the problem, a simple hack of not anchoring on blank lines might go a long way here. Better would be to avoid anchoring on any line content that occurs many times in the file.
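
The blank-line hypothesis can be probed from outside diff by making the blank lines distinct before comparing. The following is only a sketch of such an experiment, not a proposal for how diff itself should implement anything; tag_blank_lines and its marker format are hypothetical.

```shell
#!/bin/sh
# Sketch only: tag_blank_lines is a hypothetical helper for experimenting.
# Reads stdin and replaces every empty line with a marker naming the most
# recent non-empty line, so that empty lines no longer all compare equal.
tag_blank_lines() {
  awk '
    /^$/ { print "<blank after: " prev ">"; next }
         { prev = $0; print }
  '
}
```

To try it, run orig.txt and new.txt through tag_blank_lines into two temporary files and diff those. The output then shows markers instead of the real blank lines, so this is useful for investigation only. Consecutive blank lines all receive the same marker.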

I tested with diff 2.8.1 and diff 3.3 (latest as of writing).

-Scott

#!/bin/sh
# diff-bug-repro.sh: demonstrate diff bug with moving text

# Print the integers $1 through $2, one per line, each followed by a blank line.
#
# The blank lines are important to this bug, because they seem
# to be treated by diff as anchor points: lines that haven't
# changed and hence the diff should try to work around them.
# Without the blank lines, diff works correctly.
genlines() {
  n=$1
  while [ "$n" -le "$2" ]; do
    echo "$n"
    echo
    n=$((n + 1))
  done
}

# Simulates some original file that happens to have a lot of blank
# lines, as one might expect in source code, prose, HTML, etc.
echo "creating orig.txt: file with [1,5000], separated by blank lines..."
genlines 1 5000 > orig.txt

# Simulates a modification of orig.txt where a large block of text,
# represented by [11,499], is moved from near the beginning of the
# file to near the end of the file.
echo "creating new.txt: file with [1,10][500,4990][11,499][4991,5000], separated by blank lines..."
genlines 1 10 > new.txt
genlines 500 4990 >> new.txt
genlines 11 499 >> new.txt
genlines 4991 5000 >> new.txt

# The diff output contains 14948 lines, which is the bug.  It should
# need only about 2000 lines: roughly 1000 to describe deleting the
# block (978 lines) near the start and roughly 1000 to describe
# adding it near the end.
echo "diff -u orig.txt new.txt | wc -l"
diff -u orig.txt new.txt | wc -l

# EOF




