Re: msgmerge speedup: fstrcmp and diffseq improvements


From: Ralf Wildenhues
Subject: Re: msgmerge speedup: fstrcmp and diffseq improvements
Date: Mon, 15 Sep 2008 23:03:18 +0200
User-agent: Mutt/1.5.18 (2008-05-17)

Hi Bruno,

Thanks for the nice feedback, and for promptly addressing my patches!
Your changes to them all looked good to me.

* Bruno Haible wrote on Mon, Sep 15, 2008 at 01:50:52AM CEST:
> 
> > 3) When computing one string pair distance, avoid computing the path
> > (i.e., the edit script).
> 
> We are already doing this optimization: The NOTE_INSERT and NOTE_DELETE
> macros don't store any object in memory; they only increment a counter.

Yes.  But if you are not interested in the edit script, you can avoid
the divide-and-conquer algorithm described in section 4b of the Myers
paper and instead use the simpler one described earlier, in section 3.
I haven't checked whether this helps in practice, though, and it would
likely require splitting the code off from what is shared with diffutils.
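
To make that concrete, here is a minimal sketch (my own toy code, not
the gnulib/diffseq implementation) of the distance-only variant from
section 3: a single array of furthest-reaching x values per diagonal is
enough, and no edit script is ever recorded.

#include <stdlib.h>
#include <string.h>

/* Return the number of insertions plus deletions needed to turn A into B,
   following the greedy algorithm of section 3 of Myers' paper: only the
   furthest-reaching x per diagonal is kept, no edit script is recorded.
   (Assumes modest string lengths, so int indices suffice.)  */
static int
edit_distance (const char *a, const char *b)
{
  int n = strlen (a);
  int m = strlen (b);
  int max = n + m;
  int offset = max + 1;
  /* v[k + offset] is the furthest-reaching x on diagonal k = x - y.  */
  int *v = calloc (2 * max + 3, sizeof *v);
  int d, k, result = max;

  if (v == NULL)
    return max;   /* out of memory: fall back to the trivial bound */

  for (d = 0; d <= max; d++)
    for (k = -d; k <= d; k += 2)
      {
        int x, y;
        if (k == -d || (k != d && v[k - 1 + offset] < v[k + 1 + offset]))
          x = v[k + 1 + offset];          /* step down: insertion */
        else
          x = v[k - 1 + offset] + 1;      /* step right: deletion */
        y = x - k;
        while (x < n && y < m && a[x] == b[y])
          x++, y++;                       /* follow the diagonal (snake) */
        v[k + offset] = x;
        if (x >= n && y >= m)
          {
            result = d;
            goto done;
          }
      }
 done:
  free (v);
  return result;
}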

> > 4) Avoid quadratic scaling in the total number of msgids (i.e., number
> > of strings in .po file times number of strings in the .pot file).
[...]

> Reducing the closest-neighbor problem to a clustering problem is probably
> not promising: algorithms for clustering are known to be efficient when
> the domain provides some partitioning criteria, but if it doesn't - like
> here in the case of arbitrary strings - the clustering is hard.

This depends on the specific set of strings: lots of very similar strings
may be hard to separate with a given metric.  Using more than one metric
can help.  Or we could simply accept that there will be difficult cases
and only try to speed up the simple ones.

Anyway, the criterion for when to start looking into this is simple:
as soon as the bound computations and the quadratic pre-sorting take up
a sizeable share of the time compared to compareseq, while already
screening all but O(N) string pairs from the compareseq comparison,
those same bound computations can serve as the metric with which to
cluster.
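
As a purely illustrative example of such a criterion (nothing like this
exists in gettext), the cheapest bound, the string-length bound, already
partitions the msgids: bucket the old msgids by length, and for a new
msgid of length n hand only those buckets to compareseq whose length m
can still satisfy 2*min(n,m)/(n+m) >= threshold.

#include <math.h>
#include <stddef.h>

/* For a new string of length N and a similarity THRESHOLD in (0, 1],
   compute the range [*LO, *HI] of old string lengths M for which
   2*min(N,M)/(N+M) >= THRESHOLD can still hold; only buckets inside
   this range need to be considered at all.  */
static void
admissible_lengths (size_t n, double threshold, size_t *lo, size_t *hi)
{
  /* Solve 2*min(n,m)/(n+m) >= threshold for m below and above n.  */
  *lo = (size_t) ceil (threshold * n / (2.0 - threshold));
  *hi = (size_t) floor (n * (2.0 - threshold) / threshold);
}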

My gut feeling is that merging text files a couple of hundred KB in
size should not take much more than a second on modern hardware.
I'm confident we can get there; see below.  :-)

> But here is a different idea. What you have implemented is the "alpha"-
> pruning of a chess playing program: stop analyzing moves which are worse
> than the best move known so far. The second trick that chess programs use
> is to sort the order in which they consider the moves so that the good
> moves get likely analyzed first - with the consequence that more mediocre
> moves can be discarded quickly. When I apply this idea to msgmerge (attached
> draft patch), I get once more a 2x speedup:
>   "msgmerge af.po coreutils.pot"
>   16.5 sec. with all your patches,
>    7.6 sec. with this additional sorting.

Very nice.  Your draft patch doesn't apply directly to the current CVS
tree; I haven't tested it or looked at it in detail yet.
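
For what it's worth, here is how I picture that move-ordering trick on
the caller side (a rough sketch with invented names, not your patch):
score every candidate with the cheap length bound, sort best-first, then
let the best similarity found so far serve as the LOWER_BOUND cutoff of
fstrcmp_bounded, so that weak candidates are cut off as early as
possible.

#include <stdlib.h>
#include <string.h>
#include "fstrcmp.h"   /* gnulib's fstrcmp_bounded */

struct candidate
{
  const char *msgid;
  double cheap_score;   /* upper bound on the similarity to the key */
};

static int
by_score_desc (const void *a, const void *b)
{
  double sa = ((const struct candidate *) a)->cheap_score;
  double sb = ((const struct candidate *) b)->cheap_score;
  return (sa < sb) - (sa > sb);
}

/* Return the candidate most similar to KEY, or NULL if none beats
   MIN_SIMILARITY.  Candidates are visited best-first so that BEST_SIM
   rises quickly and prunes the rest via fstrcmp_bounded.  */
static const char *
best_match (const char *key, struct candidate *cands, size_t n_cands,
            double min_similarity)
{
  size_t key_len = strlen (key);
  const char *best = NULL;
  double best_sim = min_similarity;
  size_t i;

  for (i = 0; i < n_cands; i++)
    {
      size_t len = strlen (cands[i].msgid);
      size_t min_len = len < key_len ? len : key_len;
      cands[i].cheap_score =
        (len + key_len) ? 2.0 * min_len / (len + key_len) : 1.0;
    }
  qsort (cands, n_cands, sizeof *cands, by_score_desc);

  for (i = 0; i < n_cands; i++)
    {
      double sim;
      /* The cheap score is an upper bound: once it falls below the best
         similarity seen so far, no later candidate can win either.  */
      if (cands[i].cheap_score < best_sim)
        break;
      sim = fstrcmp_bounded (key, cands[i].msgid, best_sim);
      if (sim > best_sim)
        {
          best_sim = sim;
          best = cands[i].msgid;
        }
    }
  return best;
}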

See below for a character-frequency bound that is only a little more
expensive than the string-length comparison (sort of a 1-gram version of
msgl-fsearch).  Here it gives me almost another factor-of-two speedup;
I would be interested to see whether it still helps on top of your
patch.

With this, the time spent in compareseq/diag drops from more than 98%
to roughly 90%.  Averaged over coreutils/po/, the string-length bound
hits in 34% of the cases, and the frequency bound in 80% of the
remaining pairs.  Still, after these two screening tests, 98.5% of the
compareseq calls return 'true'.

> > 5) Use a different similarity measure.  Not considered here.
> 
> It is the different similarity measure (in this case: from a 4-gram index)
> which can help sorting the list of messages to be considered.

Ah, that's good.
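
(For the record, here is roughly how I picture such a 4-gram index; a
much-simplified toy sketch with invented names and sizes, not the
msgl-fsearch code: every 4-byte window of each old msgid is hashed into
a table of posting lists, and a new msgid is scored by counting how many
of its 4-grams hit each old message.)

#include <stdlib.h>
#include <string.h>

#define NGRAM 4
#define TABLE_SIZE 65536        /* number of hash buckets, chosen arbitrarily */

struct postings { size_t *msg; size_t count; size_t allocated; };

static size_t
hash_ngram (const char *p)
{
  /* FNV-1a over NGRAM bytes, folded into the table size.  */
  size_t h = 2166136261u;
  int i;
  for (i = 0; i < NGRAM; i++)
    h = (h ^ (unsigned char) p[i]) * 16777619u;
  return h % TABLE_SIZE;
}

/* Add message MSG_INDEX with text MSGID to the index.
   (Allocation error checking omitted for brevity.)  */
static void
index_message (struct postings table[TABLE_SIZE], size_t msg_index,
               const char *msgid)
{
  size_t len = strlen (msgid), i;
  if (len < NGRAM)
    return;
  for (i = 0; i + NGRAM <= len; i++)
    {
      struct postings *p = &table[hash_ngram (msgid + i)];
      if (p->count == p->allocated)
        {
          p->allocated = p->allocated ? 2 * p->allocated : 8;
          p->msg = realloc (p->msg, p->allocated * sizeof *p->msg);
        }
      p->msg[p->count++] = msg_index;
    }
}

/* Count, for every old message, how many 4-grams of KEY hit it; a caller
   would then sort the candidates by SCORE before running fstrcmp_bounded.
   SCORE has one slot per old message and must be zeroed beforehand.  */
static void
score_candidates (struct postings table[TABLE_SIZE], const char *key,
                  size_t *score)
{
  size_t len = strlen (key), i, j;
  if (len < NGRAM)
    return;
  for (i = 0; i + NGRAM <= len; i++)
    {
      struct postings *p = &table[hash_ngram (key + i)];
      for (j = 0; j < p->count; j++)
        score[p->msg[j]]++;
    }
}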

Cheers,
Ralf

2008-09-15  Ralf Wildenhues  <address@hidden>

        * lib/fstrcmp.c (frequency_bound): New function.
        (fstrcmp_bounded): Use it as a second, slightly less quick,
        upper bound on the edit count.

diff --git a/lib/fstrcmp.c b/lib/fstrcmp.c
index cd0d7b7..33fa561 100644
--- a/lib/fstrcmp.c
+++ b/lib/fstrcmp.c
@@ -99,6 +99,28 @@ keys_init (void)
 /* Ensure that keys_init is called once only.  */
 gl_once_define(static, keys_init_once)
 
+/* Compute a lower bound on the edit count of STRING1 and STRING2, summing the
+   difference of the occurrences for each character in the strings.  */
+static int
+frequency_bound (const char *string1, const char *string2)
+{
+#if CHAR_BIT != 8
+  "error: port this code to be efficient on systems with CHAR_BIT != 8"
+#endif
+  int f[1 << CHAR_BIT];                /* frequency difference field */
+  int i, diff;
+  const unsigned char *s;
+
+  memset (f, 0, sizeof f);
+  for (s = (const unsigned char *) string1; *s; s++)
+    f[*s]++;
+  for (s = (const unsigned char *) string2; *s; s++)
+    f[*s]--;
+  diff = 0;
+  for (i = 0; i < sizeof f / sizeof *f; ++i)
+    diff += abs (f[i]);
+  return diff;
+}
 
 double
 fstrcmp_bounded (const char *string1, const char *string2, double lower_bound)
@@ -141,6 +163,20 @@ fstrcmp_bounded (const char *string1, const char *string2, double lower_bound)
       if (upper_bound < lower_bound)
        /* Return an arbitrary value < LOWER_BOUND.  */
        return 0.0;
+
+      /* Compute a slightly less quick upper bound using character frequencies,
+        ignoring the position of the characters:
+        If f (c) is the difference between the number of times that character C
+        occurs in STRING1, and the number of times it occurs in STRING2, then
+          edit_count >= sum_c | f(c) |.
+       */
+      upper_bound =
+       (double) (xvec_length + yvec_length - frequency_bound (string1, string2))
+       / (xvec_length + yvec_length);
+
+      if (upper_bound < lower_bound)
+       /* Return an arbitrary value < LOWER_BOUND.  */
+       return 0.0;
     }
 
   /* set the info for each string.  */
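
P.S.: a tiny standalone check (not part of the patch) of what the new
bound yields on a concrete string pair; the counting mirrors
frequency_bound above.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum of per-character count differences: a lower bound on the number of
   insertions plus deletions between STRING1 and STRING2.  */
static int
frequency_difference (const char *string1, const char *string2)
{
  int f[1 << CHAR_BIT];
  int i, diff = 0;
  const unsigned char *s;

  memset (f, 0, sizeof f);
  for (s = (const unsigned char *) string1; *s; s++)
    f[*s]++;
  for (s = (const unsigned char *) string2; *s; s++)
    f[*s]--;
  for (i = 0; i < (int) (sizeof f / sizeof *f); i++)
    diff += abs (f[i]);
  return diff;
}

int
main (void)
{
  const char *a = "Read standard input.";
  const char *b = "Read standard output.";
  int bound = frequency_difference (a, b);
  size_t total = strlen (a) + strlen (b);

  /* At least BOUND insertions/deletions are needed, so the similarity
     (total - edit_count) / total can be at most (total - bound) / total.  */
  printf ("frequency bound: %d edits, similarity <= %.3f\n",
          bound, (double) (total - bound) / total);
  return 0;
}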



