bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#58168: string-lessp glitches and inconsistencies


From: Eli Zaretskii
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 06 Oct 2022 14:06:26 +0300

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Thu, 6 Oct 2022 11:05:04 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 4 okt. 2022 kl. 07.55 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > If the fact that string= says strings are not equal, but string-lessp
> > says they are equal, is what bothers you, we could document that
> > results of comparing unibyte and multibyte strings are unspecified, or
> > document explicitly that string= and string-lessp behave differently
> > in this case.
> 
> (It's not just string= but `equal` since they use the same comparison.)
> But it's just a part of a set of related problems:
> 
> * string< / string= inconsistency
> * undesirable string< ordering (unibyte strings are treated as Latin-1)
> * bad string< performance 

That doesn't seem different (and the ordering part is not necessary,
IMO).

> Ideally we should be able to do something about all three at the same time 
> since they are interrelated. At the very least it's worth a try.

It depends on the costs and the risks.  All the rest being equal, yes,
solving those would be desirable.  But it isn't equal, and the costs
and the risks of your proposals outweigh the advantages in my book,
sorry.

> Just documenting the annoying parts won't make them go away -- they still 
> have to be coded around by the user, and it doesn't solve any performance 
> problems either.

That's not a catastrophe, because we are already there (sans the
documentation), and because these cases are rare in real life.

> > I see no reason to worry about 100% consistency here: the order
> > is _really_ undefined in these cases, and trying to make it defined
> > will not produce any tangible gains,
> 
> Yes it would: better performance and wider applicability.

These are not tangible enough IMO.

> Even when the order isn't defined the user expects there to be some order 
> between distinct strings.

No, if the order is undefined, the caller cannot expect any order.
Cf. NaN comparisons with numerical values.

> > Once again, slowing down string-lessp when raw-bytes are involved
> > shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> > fall back to a slower comparison.  memchr is fast enough to not slow
> > down the "usual" case.  Would that be a good solution?
> 
> There is no reason a comparison should need to look beyond the first 
> mismatch; anything else is just algorithmically slow. Long strings are likely 
> to differ early on. Any hack that has to special-case raw bytes will add 
> costs.

You missed me here.  Why are you suddenly talking about mismatches?
And if only mismatches matter here, why is it a problem to use memchr
in the first place?

> > Alternatively, we could introduce a new primitive which could assume
> > multibyte or plain-ASCII unibyte strings without checking, and then
> > code which is sure raw-bytes cannot happen, and needs to compare long
> > strings, could use that for speed.
> 
> That or variants thereof are indeed alternatives but users would be forgiven 
> to wonder why we don't make what we have fast instead?

Because the fast versions can break when the assumptions are false.
We already have similar stuff in encoding/decoding area: there are
fast optimized functions that require the caller to make sure some
assumptions hold.

> > E.g., are you saying that unibyte strings that are
> > pure-ASCII also cause performance problems?
> 
> They do because we have no efficient way of ascertaining that they are 
> pure-ASCII.

If we declare that comparing with unibyte non-ASCII produces
unspecified results, we don't have to worry about that: it becomes the
worry of the caller.

> The long-term solution is to make multibyte strings the default in more cases 
> but I'm not proposing such a change right now.

I don't think we will ever get there, FWIW.  Raw bytes in strings are
a fact of life, whether we like it or not.

> I'll see to where further performance tweaking of the existing code can take 
> us with a reasonable efforts, but there are hard limits to what can be done.
> And thank you for your comments!

Thanks.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]