[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#58168: string-lessp glitches and inconsistencies
From: |
Mattias Engdegård |
Subject: |
bug#58168: string-lessp glitches and inconsistencies |
Date: |
Thu, 6 Oct 2022 11:05:04 +0200 |
4 okt. 2022 kl. 07.55 skrev Eli Zaretskii <eliz@gnu.org>:
> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.
(It's not just string= but `equal` since they use the same comparison.)
But it's just a part of a set of related problems:
* string< / string= inconsistency
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance
Ideally we should be able to do something about all three at the same time
since they are interrelated. At the very least it's worth a try.
Just documenting the annoying parts won't make them go away -- they still have
to be coded around by the user, and it doesn't solve any performance problems
either.
> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,
Yes it would: better performance and wider applicability. Even when the order
isn't defined the user expects there to be some order between distinct strings.
> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem. So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison. memchr is fast enough to not slow
> down the "usual" case. Would that be a good solution?
There is no reason a comparison should need to look beyond the first mismatch;
anything else is just algorithmically slow. Long strings are likely to differ
early on. Any hack that has to special-case raw bytes will add costs.
The best we can hope for is hand-written vectorised code that does everything
in one pass but it's still slower than just a memcmp.
Even then our chosen semantics make that more difficult (and slower) than it
needs to be: for example, we cannot assume that any byte with the high bit set
indicates a mismatch when comparing unibyte strings with multibyte, since we
equate unibyte chars with Latin-1. It's a decision that we will keep paying for.
> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.
That or variants thereof are indeed alternatives but users would be forgiven to
wonder why we don't make what we have fast instead?
> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?
They do because we have no efficient way of ascertaining that they are
pure-ASCII. The long-term solution is to make multibyte strings the default in
more cases but I'm not proposing such a change right now.
I'll see to where further performance tweaking of the existing code can take us
with a reasonable efforts, but there are hard limits to what can be done.
And thank you for your comments!
- bug#58168: string-lessp glitches and inconsistencies, Eli Zaretskii, 2022/10/01
- bug#58168: string-lessp glitches and inconsistencies, Mattias Engdegård, 2022/10/01
- bug#58168: string-lessp glitches and inconsistencies, Eli Zaretskii, 2022/10/02
- bug#58168: string-lessp glitches and inconsistencies,
Mattias Engdegård <=
- bug#58168: string-lessp glitches and inconsistencies, Eli Zaretskii, 2022/10/06
- bug#58168: string-lessp glitches and inconsistencies, Mattias Engdegård, 2022/10/07
- bug#58168: string-lessp glitches and inconsistencies, Eli Zaretskii, 2022/10/08
- bug#58168: string-lessp glitches and inconsistencies, Mattias Engdegård, 2022/10/14
- bug#58168: string-lessp glitches and inconsistencies, Eli Zaretskii, 2022/10/14
- bug#58168: string-lessp glitches and inconsistencies, Mattias Engdegård, 2022/10/17