bug#58168: string-lessp glitches and inconsistencies
From: Mattias Engdegård
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Mon, 3 Oct 2022 21:48:14 +0200
On 2 Oct 2022, at 07:36, Eli Zaretskii <eliz@gnu.org> wrote:
>> Comparison between objects is not only useful when someone cares about their
>> order, as in presenting a sorted list to the user. Often what is important
>> is an ability to impose an order, preferably total, for use in building and
>> searching data structures. I came across this bug when implementing a string
>> set.
>
> Always converting to multibyte handles this case, doesn't it?
I don't think it does -- string= treats raw bytes in unibyte and multibyte
strings as distinct; converting to multibyte does not preserve (in)equality.
>> Actually I was talking about multibyte-multibyte comparisons.
>
> Then why did you mention raw bytes? their multibyte representation
> presents no performance problems
In a way they do -- the way raw bytes are represented (their encoding starts
with the byte C0 or C1) causes memcmp to sort them between U+007F and U+0080.
If we accept that order, then comparisons are fast since memcmp will compare
many characters per data-dependent branch. The current code requires several
data-dependent branches for each character.
While we could probably bring down the comparison cost slightly by clever
hand-coding, the result is unlikely to be anywhere near as fast as memcmp, and
it would be much messier. Since users are unlikely to care much about the
ordering between raw bytes and something else (as long as there is an order),
accepting the memcmp order would be a cheap way to improve performance while at
the same time fixing the string< / string= mismatch.
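
To make the ordering above concrete, here is a minimal C sketch. It assumes
that a raw byte 0x80..0xFF is stored as a two-byte sequence whose lead byte is
C0 or C1, as described above; the exact bit layout in encode_raw_byte is my
understanding of that encoding, and both function names are hypothetical, not
taken from the Emacs sources.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode raw byte B (0x80..0xFF) as two bytes,
   a lead byte of C0 or C1 followed by a continuation byte.
   The bit layout is an assumption made for this illustration.  */
static void encode_raw_byte(unsigned char b, unsigned char out[2])
{
    out[0] = 0xC0 | ((b >> 6) & 0x01);
    out[1] = 0x80 | (b & 0x3F);
}

/* Three-way comparison of two byte sequences: memcmp over the common
   prefix, then the shorter sequence sorts first.  Returns -1, 0 or 1.  */
static int bytewise_compare(const unsigned char *a, size_t alen,
                            const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int c = memcmp(a, b, n);
    if (c != 0)
        return c < 0 ? -1 : 1;
    return (alen > blen) - (alen < blen);
}
```

With these definitions, the single byte 7F (U+007F) compares below the
encoding of raw byte 0x80, and the encoding of raw byte 0xFF compares below
C2 80 (U+0080 in UTF-8), so every raw byte lands between the two.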
> You can compare under the assumption that a unibyte string is
> pure-ASCII until you bump into the first non-ASCII one. If that
> happens, abandon the comparison, convert the unibyte string to its
> multibyte representation, and compare again.
I don't quite see how that would improve performance, but I may be missing
something.
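
For reference, the quoted suggestion could be sketched in C roughly as follows.
This is only an illustration under stated assumptions: the function names are
hypothetical, the raw-byte encoding (lead byte C0 or C1) uses a bit layout I
am assuming, and the fallback compares in plain byte order rather than in
whatever order the real string-lessp uses.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: expand unibyte string U into a freshly allocated
   multibyte buffer, turning each byte >= 0x80 into a two-byte raw-byte
   sequence (assumed layout: lead byte C0 or C1, then a continuation).  */
static unsigned char *unibyte_to_multibyte(const unsigned char *u,
                                           size_t ulen, size_t *outlen)
{
    unsigned char *out = malloc(2 * ulen + 1);
    if (out == NULL)
        abort();
    size_t j = 0;
    for (size_t i = 0; i < ulen; i++) {
        if (u[i] < 0x80) {
            out[j++] = u[i];
        } else {
            out[j++] = 0xC0 | ((u[i] >> 6) & 0x01);
            out[j++] = 0x80 | (u[i] & 0x3F);
        }
    }
    *outlen = j;
    return out;
}

/* Compare unibyte U with multibyte M; returns -1, 0 or 1.  */
static int compare_unibyte_multibyte(const unsigned char *u, size_t ulen,
                                     const unsigned char *m, size_t mlen)
{
    size_t n = ulen < mlen ? ulen : mlen;

    /* Optimistic pass: treat U as pure ASCII until proven otherwise.  */
    for (size_t i = 0; i < n; i++) {
        if (u[i] >= 0x80) {
            /* Non-ASCII byte found: abandon, convert, compare again.  */
            size_t clen;
            unsigned char *c = unibyte_to_multibyte(u, ulen, &clen);
            size_t k = clen < mlen ? clen : mlen;
            int r = memcmp(c, m, k);
            free(c);
            if (r != 0)
                return r < 0 ? -1 : 1;
            return (clen > mlen) - (clen < mlen);
        }
        if (u[i] != m[i])
            return u[i] < m[i] ? -1 : 1;
    }
    return (ulen > mlen) - (ulen < mlen);
}
```

Note that in this sketch a unibyte raw byte and its multibyte form compare as
equal, which is exactly the case where string= disagrees, as mentioned earlier
in this message.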