bug#58168: string-lessp glitches and inconsistencies


From: Mattias Engdegård
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Fri, 30 Sep 2022 22:04:47 +0200

On 29 Sep 2022, at 19:11, Eli Zaretskii <eliz@gnu.org> wrote:

> Unibyte strings should never be compared with
> multibyte, unless they are both pure-ASCII.

It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII 
multibyte).
If you mean that all strings (literals in particular) should be multibyte by 
default then I agree and at some point we should take that step, but it would 
be quite a breaking change. Perhaps less in practice than we fear, though...

>> So, what can be done? The current string< implementation uses the character 
>> order
>> 
>> ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF
>> 
>> in conflict with string= which unifies unibyte and multibyte ASCII but not 
>> raw bytes and Latin-1.
> 
> It would be unimaginable to unify raw bytes with Latin-1.  Raw bytes
> are not Latin-1 characters, they can stand for any characters, or for
> no characters at all.

Completely agreed! Let's try to fix that, then.

> Unibyte characters don't belong to this order.  They
> should be converted to multibyte representation to be sensibly
> comparable.

Oh, I agree to some extent, but we can't really raise an error if someone tries, 
so we might as well return something reasonable and coherent. Besides, there 
are more good reasons for ordering strings (both multibyte and unibyte) than 
might be apparent at first.

Working from the assumption that we can't change string= to equate raw bytes in 
unibyte and multibyte strings, we need to invent an order between normally 
incommensurate values. That sounds odd but is actually fine; it is occasionally 
done and can be quite useful.
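
To illustrate what "inventing an order" between incommensurate values can look 
like, here is a small Python model (not Emacs code). The names char, raw, 
element_key and string_key are hypothetical, and the choice to place raw bytes 
between ASCII and the remaining characters is only one possible assumption. The 
point is that distinct elements always get distinct keys, so the order never 
contradicts equality.

```python
# Toy model: a string is a list of elements, each either a Unicode
# character or a raw byte. These helpers are hypothetical, not Emacs APIs.

def char(code):
    """Model a Unicode character by its scalar value."""
    return ("char", code)

def raw(byte):
    """Model a raw byte (0x80..0xFF) that stands for no character."""
    return ("raw", byte)

def element_key(elem):
    """Sort key for one element.

    Assumed order: ASCII < raw bytes < all other characters.
    Distinct elements always map to distinct keys.
    """
    kind, value = elem
    if kind == "raw":
        return (1, value)
    return (0, value) if value < 0x80 else (2, value)

def string_key(elems):
    """Sort key for a whole string: lexicographic over element keys."""
    return tuple(element_key(e) for e in elems)

def text(s):
    """Convert a Python str into the toy representation."""
    return [char(ord(c)) for c in s]

madrid = text("Madrid")
malaga = text("Málaga")
raw_e1 = [raw(0xE1)]       # a raw byte, not a character
latin_e1 = [char(0xE1)]    # U+00E1, a real character

print(string_key(madrid) < string_key(malaga))
print(raw_e1 == latin_e1, string_key(raw_e1) < string_key(latin_e1))
```

In this model the raw byte E1 and the character U+00E1 are unequal and also 
strictly ordered, so "less than" and "equal" agree with each other.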

It's also a matter of performance -- string< has been improved recently, but 
currently we compare text in Latin and Swahili much faster than French and 
Arabic; it would be nice to close that gap. UTF-8 is designed so that comparing 
strings by scalar values can be done byte-wise, but the way we encode raw bytes 
makes them sort right between ASCII and Latin-1. Given that the specific order 
doesn't matter much, we could just run with that.
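
The byte-wise property is easy to check outside Emacs. The Python sketch below 
sorts a few strings once by code point and once by their UTF-8 bytes and 
compares the results. The helper emacs_raw_byte is hypothetical: it assumes 
that the internal encoding stores a raw byte b as the two-byte sequence 
0xC0|((b>>6)&1), 0x80|(b&0x3F), which is my understanding of the current 
representation and should be checked against the sources.

```python
# Python compares str values by code point and bytes values byte-wise,
# so the two sort orders can be compared directly.
words = ["Zanzibar", "Málaga", "Madrid", "école", "مرحبا", "日本", "😀", "a"]

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
print(by_code_point == by_utf8_bytes)

def emacs_raw_byte(b):
    """Hypothetical helper: assumed internal encoding of raw byte b.

    ASSUMPTION: raw bytes 0x80..0xFF are stored as the two-byte
    sequences C0 80 .. C1 BF (not valid UTF-8, hence unambiguous).
    """
    assert 0x80 <= b <= 0xFF
    return bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])

last_ascii = "\x7f".encode("utf-8")       # 7F
first_latin1 = "\u0080".encode("utf-8")   # C2 80

# Under that assumption, every raw byte sorts after all of ASCII and
# before U+0080 when the encoded strings are compared byte-wise.
print(all(last_ascii < emacs_raw_byte(b) < first_latin1
          for b in range(0x80, 0x100)))
```

If the assumption about the raw-byte encoding holds, a plain byte-wise 
comparison yields ASCII < raw bytes < everything else without any special 
cases.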





