
bug#58168: string-lessp glitches and inconsistencies


From: Eli Zaretskii
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Tue, 04 Oct 2022 08:55:34 +0300

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Mon, 3 Oct 2022 21:48:14 +0200
> Cc: 58168@debbugs.gnu.org
> 
> On 2 Oct 2022, at 07:36, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> >> Comparison between objects is not only useful when someone cares about 
> >> their order, as in presenting a sorted list to the user. Often what is 
> >> important is an ability to impose an order, preferably total, for use in 
> >> building and searching data structures. I came across this bug when 
> >> implementing a string set.
> > 
> > Always converting to multibyte handles this case, doesn't it?
> 
> I don't think it does -- string= treats raw bytes in unibyte and multibyte 
> strings as distinct; converting to multibyte does not preserve (in)equality.

If what bothers you is that string= says such strings are not equal
while string-lessp says they are, we could document that the results
of comparing unibyte and multibyte strings are unspecified, or
document explicitly that string= and string-lessp behave differently
in this case.
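
To make the mismatch concrete, here is a minimal C sketch
(illustrative, not Emacs source) of the representations involved: a
raw byte such as 0x80 occupies one byte in a unibyte string, but two
bytes, 0xC0 0x80, in a multibyte string, so no bytewise comparison can
see the two forms as equal:

#include <stdio.h>

int
main (void)
{
  /* Raw byte 0x80 as stored in a unibyte string: one byte.  */
  unsigned char unibyte[] = { 0x80 };
  /* The same raw byte in a multibyte string: raw bytes 0x80..0xFF
     are stored as the two-byte sequences 0xC0 0x80 .. 0xC1 0xBF.  */
  unsigned char multibyte[] = { 0xC0, 0x80 };

  printf ("byte counts: %zu vs %zu\n", sizeof unibyte, sizeof multibyte);
  printf ("lead bytes: 0x%02X vs 0x%02X\n", unibyte[0], multibyte[0]);
  return 0;
}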

IOW, I see no reason to worry about 100% consistency here: the order
is _really_ undefined in these cases, and trying to make it defined
will not produce any tangible gains, IMNSHO.  It could very well
produce bugs and regressions, OTOH.  So it sounds like a net loss to
me, in practical terms.

> >> Actually I was talking about multibyte-multibyte comparisons.
> > 
> > Then why did you mention raw bytes?  Their multibyte representation
> > presents no performance problems.
> 
> In a way they do -- the way raw bytes are represented (their encoding starts
> with 0xC0 or 0xC1) causes memcmp to sort them between U+007F and U+0080. If
> we accept that ordering, then comparisons are fast, since memcmp compares
> many characters per data-dependent branch. The current code requires several
> data-dependent branches for each character.
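
A small sketch of the ordering described here, under the internal
encoding just mentioned (U+007F is the byte 0x7F, raw byte 0x80 is
0xC0 0x80, and U+0080 is 0xC2 0x80, as in UTF-8; since
0x7F < 0xC0 < 0xC2, memcmp puts raw bytes in between):

#include <assert.h>
#include <string.h>

int
main (void)
{
  unsigned char u007f[] = { 0x7F };        /* U+007F */
  unsigned char raw80[] = { 0xC0, 0x80 };  /* raw byte 0x80 */
  unsigned char u0080[] = { 0xC2, 0x80 };  /* U+0080 */

  /* memcmp sorts the raw byte after U+007F and before U+0080.  */
  assert (memcmp (u007f, raw80, 1) < 0);
  assert (memcmp (raw80, u0080, 2) < 0);
  return 0;
}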

Once again, slowing down string-lessp when raw bytes are involved
shouldn't be a problem.  So, if memchr finds a 0xC0 or 0xC1 byte in a
string, fall back to a slower comparison.  memchr is fast enough not
to slow down the "usual" case.  Would that be a good solution?
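
Concretely, the idea might look like this rough sketch (function
names are illustrative, not actual Emacs primitives; it assumes both
strings are in the multibyte representation, where raw bytes appear
as 0xC0/0xC1 lead bytes):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder for the existing careful character-wise comparison;
   a real version would decode the internal encoding here.  */
extern int compare_charwise (const unsigned char *a, size_t na,
                             const unsigned char *b, size_t nb);

static bool
has_raw_byte (const unsigned char *s, size_t n)
{
  return memchr (s, 0xC0, n) || memchr (s, 0xC1, n);
}

int
string_compare (const unsigned char *a, size_t na,
                const unsigned char *b, size_t nb)
{
  if (has_raw_byte (a, na) || has_raw_byte (b, nb))
    return compare_charwise (a, na, b, nb);     /* rare slow path */

  int cmp = memcmp (a, b, na < nb ? na : nb);   /* common fast path */
  if (cmp != 0)
    return cmp;
  return na < nb ? -1 : na > nb ? 1 : 0;
}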

Alternatively, we could introduce a new primitive that assumes
multibyte or plain-ASCII unibyte strings without checking; code that
is sure raw bytes cannot occur, and needs to compare long strings,
could then use it for speed.
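
Under the same illustrative conventions, such a primitive would be
just the fast path, with the no-raw-bytes guarantee hoisted to the
caller (the name is hypothetical):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical primitive: the caller guarantees both strings are
   multibyte without raw bytes, or plain-ASCII unibyte, so bytewise
   order coincides with character order and no checking is needed.  */
bool
string_lessp_trusted (const unsigned char *a, size_t na,
                      const unsigned char *b, size_t nb)
{
  int cmp = memcmp (a, b, na < nb ? na : nb);
  return cmp < 0 || (cmp == 0 && na < nb);
}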

> While we could probably bring down the comparison cost slightly by clever
> hand-coding, it's unlikely to be anywhere near as fast as memcmp, and much
> messier. Since users are unlikely to care much about the ordering between raw
> bytes and anything else (as long as there is an order), it would be a cheap
> way to improve performance while at the same time fixing the string< /
> string= mismatch.

The assumption that "users are unlikely to care" is a pure conjecture,
and we have no way of validating it.  So I don't want us to act on
such an assumption.

What about one of the alternatives above instead?

> > You can compare under the assumption that a unibyte string is
> > pure-ASCII until you bump into the first non-ASCII one.  If that
> > happens, abandon the comparison, convert the unibyte string to its
> > multibyte representation, and compare again.
> 
> I don't quite see how that would improve performance, but I may be
> missing something.

Then maybe I didn't understand the performance problems that you had
in mind.  Suppose you describe them in more detail, preferably with
examples?  E.g., are you saying that unibyte strings that are
pure-ASCII also cause performance problems?
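
For reference, the strategy quoted above might look like the
following rough sketch (illustrative, not Emacs code;
compare_after_conversion stands in for a convert-to-multibyte-and-
recompare slow path):

#include <stddef.h>

/* Stand-in for the slow path: convert the rest of UNI to multibyte
   and compare character by character (not implemented here).  */
extern int compare_after_conversion (const unsigned char *uni, size_t nu,
                                     const unsigned char *mul, size_t nm);

int
compare_unibyte_multibyte (const unsigned char *uni, size_t nu,
                           const unsigned char *mul, size_t nm)
{
  size_t n = nu < nm ? nu : nm;
  for (size_t i = 0; i < n; i++)
    {
      /* Non-ASCII byte in the unibyte string: the representations
         diverge here, so abandon the fast walk and convert.  */
      if (uni[i] >= 0x80)
        return compare_after_conversion (uni + i, nu - i, mul + i, nm - i);
      /* ASCII bytes encode identically in both forms, and any
         multibyte lead byte (>= 0x80) sorts above them.  */
      if (uni[i] != mul[i])
        return uni[i] < mul[i] ? -1 : 1;
    }
  return nu < nm ? -1 : nu > nm ? 1 : 0;
}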




