bug#58168: string-lessp glitches and inconsistencies

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#58168: string-lessp glitches and inconsistencies
Date:	Sun, 02 Oct 2022 08:36:46 +0300

> From: Mattias Engdegård <mattias.engdegard@gmail.com>
> Date: Sat, 1 Oct 2022 21:57:45 +0200
> Cc: 58168@debbugs.gnu.org
> 
> On 1 Oct 2022, at 07:22, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> > It depends on the use case, but in general I see no problem with
> > signaling errors when we cannot produce reasonably correct results.
> > For example, string-to-unibyte does signal an error in some cases.
> 
> That's fine because that function is documented to do so and always has, but 
> making previously possible comparisons raise errors shouldn't be done lightly.

I didn't say "lightly", nor do I think it should be done lightly.  We
need to discuss specific use cases.

An alternative is to always convert unibyte non-ASCII strings to their
multibyte representation before comparing.
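
As a rough illustration of that alternative, here is a minimal sketch
in Python (not Emacs code).  It models a unibyte string as `bytes` and
a multibyte string as `str`, and assumes the Emacs convention that a
raw byte in 0x80..0xFF is represented by the character code 0x3FFF00
plus the byte value.  The name `order_key` is hypothetical.

```python
# Assumption: raw byte b (0x80..0xFF) maps to character code 0x3FFF00 + b,
# i.e. the range 0x3FFF80..0x3FFFFF reserved for raw 8-bit bytes.
RAW_BYTE_BASE = 0x3FFF00


def order_key(s):
    """Map a unibyte (bytes) or multibyte (str) string to a tuple of
    character codes, so that both kinds compare in one common order."""
    if isinstance(s, bytes):
        # Unibyte: ASCII bytes keep their value, others become raw-byte chars.
        return tuple(b if b < 0x80 else RAW_BYTE_BASE + b for b in s)
    # Multibyte: compare by character code.
    return tuple(ord(c) for c in s)


# Mixed unibyte and multibyte strings sorted with a single key function.
mixed = [b"\xe9", "\u00e9", b"a", "b"]
ordered = sorted(mixed, key=order_key)
```

Every string maps to a tuple of integers, so any two strings are
comparable and the order on keys is total.  A unibyte ASCII string and
its multibyte counterpart get the same key.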

> Comparison between objects is not only useful when someone cares about their 
> order, as in presenting a sorted list to the user. Often what is important is 
> an ability to impose an order, preferably total, for use in building and 
> searching data structures. I came across this bug when implementing a string 
> set.

Always converting to multibyte handles this case, doesn't it?

> >> It's also a matter of performance -- string< has been improved recently 
> >> but currently we compare text in Latin and Swahili much faster than French 
> >> and Arabic; it would be nice to close that gap. UTF-8 is designed so that 
> >> comparing strings by scalar values can be done byte-wise, but the way we 
> >> encode raw bytes makes them sort right between ASCII and Latin-1. Given
> >> that the specific order doesn't matter much, we could just run with that.
> > 
> > I see no reason to make comparison of unibyte and multibyte strings
> > perform better.
> 
> Actually I was talking about multibyte-multibyte comparisons.

Then why did you mention raw bytes?  Their multibyte representation
presents no performance problems, AFAIU.
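
For reference, the UTF-8 property mentioned in the quoted text can be
checked with a small Python sketch.  It covers standard UTF-8 only and
does not model the Emacs internal encoding of raw bytes; the function
names are hypothetical.

```python
def utf8_bytewise_less(a: str, b: str) -> bool:
    """Compare two strings by the bytes of their UTF-8 encodings."""
    return a.encode("utf-8") < b.encode("utf-8")


def scalar_less(a: str, b: str) -> bool:
    """Compare two strings by their sequences of Unicode scalar values."""
    return [ord(c) for c in a] < [ord(c) for c in b]


# Samples spanning 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["", "a", "abc", "abd", "z", "\u00e9", "\u0639", "\u65e5\u672c", "\U0001F600"]

# True when both comparisons agree on every ordered pair of samples.
agree = all(
    utf8_bytewise_less(x, y) == scalar_less(x, y)
    for x in samples
    for y in samples
)
```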

> You were probably thinking about comparisons between unibyte strings that 
> contain raw bytes and multibyte strings, and those are indeed not very 
> performance-sensitive. However there is no way to detect whether a unibyte 
> string contains non-ASCII chars without looking at every byte, and comparing 
> unibyte ASCII with multibyte is definitely of interest. Strings are still 
> unibyte by default.

You can compare under the assumption that a unibyte string is
pure-ASCII until you bump into the first non-ASCII byte.  If that
happens, abandon the comparison, convert the unibyte string to its
multibyte representation, and compare again.
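
A minimal sketch of that approach, in Python rather than Emacs C.  It
assumes a unibyte string is modelled as `bytes`, a multibyte string as
a list of character codes, and that a raw byte in 0x80..0xFF converts
to the character code 0x3FFF00 plus the byte value.  Both function
names are hypothetical.

```python
from typing import List

# Assumption: raw byte b (0x80..0xFF) maps to character code 0x3FFF00 + b.
RAW_BYTE_BASE = 0x3FFF00


def to_multibyte(unibyte: bytes) -> List[int]:
    """Convert a unibyte string to a list of character codes."""
    return [b if b < 0x80 else RAW_BYTE_BASE + b for b in unibyte]


def unibyte_less_than_multibyte(uni: bytes, multi: List[int]) -> bool:
    """Return True if the unibyte string sorts strictly before the
    multibyte one, taking the ASCII fast path for as long as possible."""
    for i, b in enumerate(uni):
        if b >= 0x80:
            # First non-ASCII byte: abandon the fast path, convert the
            # whole unibyte string, and compare again from the start.
            return to_multibyte(uni) < multi
        if i >= len(multi):
            # The multibyte string is a proper prefix of the unibyte one.
            return False
        if b != multi[i]:
            return b < multi[i]
    # All of uni matched; it is smaller only if multi has more characters.
    return len(uni) < len(multi)
```

A pure-ASCII unibyte string never triggers the conversion, so the
common case costs a single pass with no allocation.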
