bug#58168: string-lessp glitches and inconsistencies

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#58168: string-lessp glitches and inconsistencies
Date:	Sat, 01 Oct 2022 08:22:03 +0300

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Fri, 30 Sep 2022 22:04:47 +0200
> Cc: 58168@debbugs.gnu.org
> 
> On 29 Sep 2022, at 19:11, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > Unibyte strings should never be compared with
> > multibyte, unless they are both pure-ASCII.
> 
> It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII 
> multibyte).

Not relevant: I meant unibyte non-ASCII strings.  The ASCII case is
easy and unproblematic, and is really just a straw man here.

> If you mean that all strings (literals in particular) should be multibyte by 
> default then I agree and at some point we should take that step, but it would 
> be quite a breaking change. Perhaps less in practice than we fear, though...

That's not what I meant.  I think unibyte strings will be with us for
the foreseeable future.

> > Unibyte characters don't belong to this order.  They
> > should be converted to multibyte representation to be sensibly
> > comparable.
> 
> Oh I agree to some extent but we can't really raise an error if someone tries 
> so we might as well return something reasonable and coherent.

It depends on the use case, but in general I see no problem with
signaling errors when we cannot produce reasonably correct results.
For example, string-to-unibyte does signal an error in some cases.

> Besides, there are more good reasons for ordering strings (both multibyte and 
> unibyte) than might be apparent at first.

Examples, please.

> Working from the assumption that we can't change string= to equate raw bytes 
> in unibyte and multibyte strings, we need to invent an order between normally 
> incommensurate values

I don't agree with the conclusion.  It is not the only possible
conclusion.  Signaling an error is another one, and I'm sure we could
think of more.

> It's also a matter of performance -- string< has been improved recently but 
> currently we compare text in Latin and Swahili much faster than French and 
> Arabic; it would be nice to close that gap. UTF-8 is designed so that 
> comparing strings by scalar values can be done byte-wise, but the way we 
> encode raw bytes makes them sort right between ASCII and Latin-1. Given that 
> the specific order doesn't matter much, we could just run with that.

I see no reason to make comparison of unibyte and multibyte strings
perform better.
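
To make the quoted claim about UTF-8 ordering concrete, here is a
minimal sketch in Python (not Emacs code).  It relies on one
assumption about the Emacs internal representation: that a raw byte
in the range 0x80..0xFF is stored in a multibyte string as a two-byte
sequence whose lead byte is 0xC0 or 0xC1.  The helper names
encode_item and encode_items are hypothetical and exist only for this
illustration.

```python
def encode_item(item):
    """Encode one element of a string.

    A str of length 1 is a Unicode character and is encoded as UTF-8.
    An int in 0x80..0xFF stands for a raw byte and is encoded, by
    assumption, as a two-byte sequence with lead byte 0xC0 or 0xC1.
    """
    if isinstance(item, int):
        if not 0x80 <= item <= 0xFF:
            raise ValueError("raw byte must be in 0x80..0xFF")
        return bytes([0xC0 | ((item >> 6) & 1), 0x80 | (item & 0x3F)])
    return item.encode("utf-8")


def encode_items(items):
    """Encode a sequence of characters and raw bytes."""
    return b"".join(encode_item(i) for i in items)


# 1. For ordinary text, comparing UTF-8 bytes gives the same order as
#    comparing code points (Python compares str by code point).
words = ["Madrid", "Málaga", "zebra", "Ärger", "مرحبا", "日本"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
print(by_code_point == by_utf8_bytes)

# 2. Under the assumed encoding, raw bytes sort after all of ASCII and
#    before U+0080, the first character of the Latin-1 supplement.
last_ascii = encode_items(["\x7f"])
first_raw = encode_items([0x80])
last_raw = encode_items([0xFF])
first_latin1 = encode_items(["\u0080"])
print(last_ascii < first_raw < last_raw < first_latin1)
```

Both comparisons are purely byte-wise, which is the property the
quoted paragraph wants to exploit.  The sketch says nothing about
whether such an order is a sensible one for mixed unibyte and
multibyte strings, which is the point in dispute above.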
