bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#58168: string-lessp glitches and inconsistencies


From: Eli Zaretskii
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Sat, 01 Oct 2022 08:22:03 +0300

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 30 Sep 2022 22:04:47 +0200
> Cc: 58168@debbugs.gnu.org
> 
> 29 sep. 2022 kl. 19.11 skrev Eli Zaretskii <eliz@gnu.org>:
> 
> > Unibyte strings should never be compared with
> > multibyte, unless they are both pure-ASCII.
> 
> It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII 
> multibyte).

Not relevant: I meant unibyte non-ASCII strings.  The ASCII case is
easy and un-problematic, and is really just a straw-man here.

> If you mean that all strings (literals in particular) should be multibyte by 
> default then I agree and at some point we should take that step, but it would 
> be quite a breaking change. Perhaps less in practice than we fear, though...

That's not what I meant.  I think unibyte strings are with us for the
observable future.

> > Unibyte characters don't belong to this order.  They
> > should be converted to multibyte representation to be sensibly
> > comparable.
> 
> Oh I agree to some extent but we can't really raise an error if someone tries 
> so we might as well return something reasonable and coherent.

It depends on the use case, but in general I see no problem with
signaling errors when we cannot produce reasonably correct results.
For example, string-to-unibyte does signal an error in some cases.

> Besides, there are more good reasons for ordering strings (both multibyte and 
> unibyte) than might be apparent at first.

Examples, please.

> Working from the assumption that we can't change string= to equate raw bytes 
> in unibyte and multibyte strings, we need to invent an order between normally 
> incommensurate values

I don't agree with the conclusion.  It is not the only possible
conclusion.  Signaling an error is another one, and I'm sure we could
think of more.

> It's also a matter of performance -- string< has been improved recently but 
> currently we compare text in Latin and Swahili much faster than French and 
> Arabic; it would be nice to close that gap. UTF-8 is designed so that 
> comparing strings by scalar values can be done byte-wise, but the way we 
> encode raw bytes make them sort right between ASCII and Latin-1. Given that 
> the specific order doesn't matter much, we could just run with that.

I see no reason to make comparison of unibyte and multibyte strings
perform better.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]