bug#58168: string-lessp glitches and inconsistencies


From: Mattias Engdegård
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 6 Oct 2022 11:05:04 +0200

On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz@gnu.org> wrote:

> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.

(It's not just string= but `equal`, since they use the same comparison.)
But this is just one part of a set of related problems:

* string< / string= inconsistency (see the sketch below)
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance
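
To make the first two points concrete, here is a minimal Lisp sketch of the behaviour
described above (illustrative only; exact results depend on the Emacs version):

  (let ((u (unibyte-string #xe9))    ; unibyte: the single byte #xE9
        (m (string #xe9)))           ; multibyte: the character U+00E9 (é)
    (list (string= u m)              ; nil -- the byte representations differ
          (string< u m)              ; nil -- the unibyte byte is compared as Latin-1 é
          (string< m u)))            ; nil -- so neither string sorts before the other
  ;; => (nil nil nil): string= (and `equal`) say "different",
  ;;    while string< says neither is less than the other.

Both points show up at once: the ordering treats the unibyte byte as Latin-1, and the
resulting "equal for ordering purposes" verdict contradicts string=.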

Ideally we should be able to do something about all three at the same time 
since they are interrelated. At the very least it's worth a try.

Just documenting the annoying parts won't make them go away -- they still have 
to be coded around by the user, and it doesn't solve any performance problems 
either.

> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,

Yes, it would: better performance and wider applicability. Even when the order isn't 
defined, the user expects there to be some order between distinct strings.

> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison.  memchr is fast enough to not slow
> down the "usual" case.  Would that be a good solution?

There is no reason a comparison should need to look beyond the first mismatch; 
anything else is just algorithmically slow. Long strings are likely to differ 
early on. Any hack that has to special-case raw bytes will add costs.
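
A rough way to observe that from Lisp (illustrative only -- the absolute timings, the
string length and the repeat count are incidental): the cost of string< ought to be
governed by where the first mismatch occurs, not by the total length of the strings.

  (require 'benchmark)
  (let* ((n 1000000)
         (early (concat "a" (make-string n ?x)))   ; mismatch at index 0
         (late  (concat (make-string n ?x) "a"))   ; mismatch at the last index
         (base  (concat (make-string n ?x) "b")))
    ;; Only the second comparison has any reason to look at the whole string;
    ;; a prescan of the input would make the first one pay for it too.
    (list (benchmark-run 100 (string< early base))
          (benchmark-run 100 (string< late base))))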

The best we can hope for is hand-written vectorised code that does everything in one 
pass, but even that is slower than a plain memcmp.
Even then, our chosen semantics make it more difficult (and slower) than it needs to 
be: for example, we cannot assume that any byte with the high bit set indicates a 
mismatch when comparing a unibyte string with a multibyte one, since we equate unibyte 
chars with Latin-1. That is a decision we will keep paying for.
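
For example (again just an illustration of the semantics described above):

  (let ((u (unibyte-string #xe9))   ; one byte, #xE9
        (m (string #xea)))          ; one character, U+00EA (ê), internally #xC3 #xAA
    (string< u m))
  ;; => t: the unibyte byte is compared as the character é, which sorts before ê,
  ;; even though a byte-wise view sees #xE9 > #xC3 and would conclude the opposite.
  ;; So a byte with the high bit set is not in itself a mismatch, and cannot be
  ;; used as an early-out when one operand is unibyte and the other multibyte.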

> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.

That, or some variant of it, is indeed an alternative, but users would be forgiven for 
wondering why we don't just make what we already have fast instead.

> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?

They do, because we have no efficient way of ascertaining that they are pure ASCII. 
The long-term solution is to make multibyte strings the default in more cases, but I'm 
not proposing such a change right now.
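
For illustration, a helper that establishes pure-ASCII-ness today (my-string-ascii-p
is a hypothetical name, not an existing primitive) has to scan the whole string, so
the check costs as much as the comparison it is meant to speed up:

  (defun my-string-ascii-p (s)
    "Return non-nil if the string S contains only ASCII characters."
    ;; Hypothetical helper: there is no cached pure-ASCII flag to consult,
    ;; so every character has to be examined.
    (not (string-match-p "[^[:ascii:]]" s)))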

I'll see where further performance tweaking of the existing code can take us with 
reasonable effort, but there are hard limits to what can be done.
And thank you for your comments!





