bug#58168: string-lessp glitches and inconsistencies
From: Eli Zaretskii
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Sat, 01 Oct 2022 08:22:03 +0300
> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 30 Sep 2022 22:04:47 +0200
> Cc: 58168@debbugs.gnu.org
>
> On 29 Sep 2022, at 19:11, Eli Zaretskii <eliz@gnu.org> wrote:
>
> > Unibyte strings should never be compared with
> > multibyte, unless they are both pure-ASCII.
>
> It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII
> multibyte).
Not relevant: I meant unibyte non-ASCII strings. The ASCII case is
easy and unproblematic, and is really just a straw man here.
> If you mean that all strings (literals in particular) should be multibyte by
> default then I agree and at some point we should take that step, but it would
> be quite a breaking change. Perhaps less in practice than we fear, though...
That's not what I meant. I think unibyte strings are with us for the
foreseeable future.
> > Unibyte characters don't belong to this order. They
> > should be converted to multibyte representation to be sensibly
> > comparable.
>
> Oh I agree to some extent but we can't really raise an error if someone tries
> so we might as well return something reasonable and coherent.
It depends on the use case, but in general I see no problem with
signaling errors when we cannot produce reasonably correct results.
For example, string-to-unibyte does signal an error in some cases.
> Besides, there are more good reasons for ordering strings (both multibyte and
> unibyte) than might be apparent at first.
Examples, please.
> Working from the assumption that we can't change string= to equate raw bytes
> in unibyte and multibyte strings, we need to invent an order between normally
> incommensurate values
I don't agree with the conclusion. It is not the only possible
conclusion. Signaling an error is another one, and I'm sure we could
think of more.
> It's also a matter of performance -- string< has been improved recently but
> currently we compare text in Latin and Swahili much faster than French and
> Arabic; it would be nice to close that gap. UTF-8 is designed so that
> comparing strings by scalar values can be done byte-wise, but the way we
> encode raw bytes makes them sort right between ASCII and Latin-1. Given that
> the specific order doesn't matter much, we could just run with that.
I see no reason to make comparison of unibyte and multibyte strings
perform better.
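
To make the quoted point about UTF-8 and raw bytes concrete, here is a
minimal Python sketch. It is an illustration only, not Emacs code. The
helper names `utf8_less` and `emacs_raw_byte` are hypothetical, and the
two-byte form used for raw bytes (lead byte 0xC0 or 0xC1) is an assumption
about Emacs's internal representation, chosen to match the claim that raw
bytes sort between ASCII and Latin-1.

```python
def utf8_less(a: str, b: str) -> bool:
    """Compare two strings byte-wise on their UTF-8 encodings."""
    return a.encode("utf-8") < b.encode("utf-8")


def emacs_raw_byte(b: int) -> bytes:
    """Assumed internal form of a raw byte 0x80..0xFF (hypothetical helper).

    Assumption: a raw byte is stored as two bytes, with lead byte 0xC0 or
    0xC1 and a trailing byte carrying the low six bits.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("raw byte must be in 0x80..0xFF")
    return bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])


# Sample words are arbitrary; Python compares str values by code point.
words = ["Madrid", "Málaga", "abc", "zebra", "Ärger", "日本", "😀"]

# Byte-wise order on UTF-8 agrees with code-point order for every pair.
for x in words:
    for y in words:
        assert utf8_less(x, y) == (x < y)

by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_code_points = sorted(words)

# Under the assumed encoding, raw bytes fall after ASCII (at most 0x7F)
# and before the first Latin-1 character U+0080 (encoded as C2 80).
lowest_raw = emacs_raw_byte(0x80)
highest_raw = emacs_raw_byte(0xFF)
```

Under these assumptions, a single byte-wise comparison would order ASCII
first, raw bytes next, and all other characters by code point after that,
which is the arrangement the quoted paragraph proposes to "run with".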