Re: index sorting in texi2any in C issue with spaces
From: Gavin Smith
Subject: Re: index sorting in texi2any in C issue with spaces
Date: Wed, 31 Jan 2024 20:10:56 +0000
On Wed, Jan 31, 2024 at 10:15:08AM +0100, Patrice Dumas wrote:
> Hello,
>
> I implemented index sorting in C with XS interface in texi2any.
> When unicode collation is wanted, based on my understanding of
> Eli suggestions, a collation locale is set to "en_US.utf-8", by
> newlocale (LC_COLLATE_MASK, "en_US.utf-8", 0)
> and then strxfrm_l is used (which should be the same as using
> strcoll_l). With conversion in C/with XS set with environment variable
> TEXINFO_XS_CONVERT=1 and for now only for HTML, if TEST customization
> variable is not set.
This seems like a pretty obscure interface, and it is barely
documented: newlocale is in the Linux man-pages but not in the
glibc manual, and I could only find strxfrm_l in the POSIX standard
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/strxfrm.html).
I don't know of any other way of accessing the collation functionality.
Do you know how portable it is? The documentation for the corresponding
Gnulib module says the following:
Portability problems not fixed by Gnulib:
This function is missing on many platforms: FreeBSD 6.0, NetBSD 5.0,
OpenBSD 6.0, Minix 3.1.8, AIX 5.1, HP-UX 11, IRIX 6.5, Solaris 11.3,
Cygwin 1.7.x, mingw, MSVC 14, Android 4.4.
<https://www.gnu.org/software/gnulib/manual/html_node/strxfrm_005fl.html>
Would it be possible to have an option for "current locale" collation,
which could use more standard interfaces?
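As a rough sketch of what such a "current locale" option might look like,
using only ISO C functions (setlocale, strcoll, qsort). The helper names
use_environment_collation and sort_entries_current_locale are made up for
illustration and are not part of texi2any:

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Adopt the collation rules selected by the environment (LC_ALL,
   LC_COLLATE, LANG).  Returns 1 on success, 0 if that locale is not
   available, in which case the previous collation stays in effect. */
int
use_environment_collation (void)
{
  return setlocale (LC_COLLATE, "") != NULL;
}

/* qsort callback: each array element is a pointer to a string. */
static int
compare_entries (const void *a, const void *b)
{
  const char *const *sa = a;
  const char *const *sb = b;
  return strcoll (*sa, *sb);
}

/* Hypothetical helper: sort index entry strings in place according to
   the collation of the current locale. */
void
sort_entries_current_locale (const char **entries, size_t n)
{
  assert (entries != NULL || n == 0);
  qsort (entries, n, sizeof *entries, compare_entries);
}
```

The resulting order depends entirely on the locale in effect, so the same
input may sort differently from one system to another.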
Moreover, en_US.utf-8 will use collation appropriate for (US) English.
There may be language-specific "tailoring" for other languages (e.g.
Swedish) that the user may wish to use instead. Hence, it may be
a good idea to allow use of a user-specified locale for collation through
the C code.
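A minimal sketch of how a user-specified collation locale could be opened,
assuming a system that provides the POSIX 2008 functions newlocale and
strcoll_l (which, per the Gnulib notes above, many platforms do not; some
systems also declare them in a different header). The function names and
the final fallback to the "C" locale are assumptions for illustration only:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for newlocale, strcoll_l, freelocale */
#endif
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: open a locale object used for collation only.
   Tries the user's choice first, then en_US.utf-8, then "C".
   The caller releases the result with freelocale. */
locale_t
open_collation_locale (const char *user_locale)
{
  locale_t loc = (locale_t) 0;

  if (user_locale != NULL && *user_locale != '\0')
    loc = newlocale (LC_COLLATE_MASK, user_locale, (locale_t) 0);
  if (loc == (locale_t) 0)
    loc = newlocale (LC_COLLATE_MASK, "en_US.utf-8", (locale_t) 0);
  if (loc == (locale_t) 0)
    loc = newlocale (LC_COLLATE_MASK, "C", (locale_t) 0);
  return loc;
}

/* Compare two strings with the collation rules of LOC. */
int
compare_in_locale (const char *a, const char *b, locale_t loc)
{
  assert (a != NULL && b != NULL);
  return strcoll_l (a, b, loc);
}
```

A caller could pass a locale name such as "sv_SE.utf-8" taken from a
customization variable; whether that name is accepted depends on which
locales are installed on the system.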
> On my debian GNU/Linux, the result is good except for the treatment of
> spaces. Indeed, spaces (and non alphanumeric characters, but it is
> not really an issue) are ignored when sorting, which sticks to the Unicode
> collation standard, but leads to an awkward sorting for indices, for
> example 'H r' is sorted after 'Ha'. In perl, it is possible to
> customize the Unicode::Collate collation, we use 'variable' =>
> 'Non-Ignorable'.
I think either way is in accordance with the collation standard. The
standard gives four options and "Non-ignorable" is one of them:
http://www.unicode.org/reports/tr10/#Variable_Weighting
I doubt it is possible to customize the collation of a locale with
a function such as newlocale. I expect the collation order is fixed
when the locale is defined.
I found some locale definition files on my system under
/usr/share/i18n/locales (a location mentioned in the man page of the "locale"
command), and there is a file iso14651_t1_common which appears to be
based on the Unicode Collation tables. I have only skimmed this file
and don't understand the file format well (it's supposed to be documented
in the output of "man 5 locale"), but it is really part of glibc internals.
In that file, space has a line
<U0020> IGNORE;IGNORE;IGNORE;<U0020> % SPACE
which appears to define space as a fourth-level collation element,
corresponding to the Shifted option at the link above:
"Shifted: Variable collation elements are reset to zero at levels one
through three. In addition, a new fourth-level weight is appended..."
In the Default Unicode Collation Element Table (DUCET), space has the line
0020 ; [*0209.0020.0002] # SPACE
with the "*" character denoting it as a "variable" collation element.
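To illustrate how the two variable-weighting options affect the 'H r'
versus 'Ha' example, here is a toy comparison in C. It is not the Unicode
Collation Algorithm: the "primary weight" is just the lower-cased byte
value, space is the only variable character, and a plain strcmp stands in
for all lower levels. The name toy_compare is made up for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return the toy primary weight of the next character and advance *P.
   With SKIP_VARIABLE set, spaces carry no primary weight and are
   skipped.  Returns 0 at the end of the string. */
static int
next_primary (const char **p, int skip_variable)
{
  while (**p != '\0')
    {
      unsigned char c = (unsigned char) **p;
      (*p)++;
      if (skip_variable && c == ' ')
        continue;
      return tolower (c);
    }
  return 0;
}

/* Compare A and B.  SHIFTED != 0 ignores spaces at the primary level
   (like "Shifted"); SHIFTED == 0 gives space its own low primary weight
   (like "Non-ignorable").  Returns -1, 0 or 1. */
int
toy_compare (const char *a, const char *b, int shifted)
{
  const char *pa = a;
  const char *pb = b;
  int r;

  assert (a != NULL && b != NULL);
  for (;;)
    {
      int wa = next_primary (&pa, shifted);
      int wb = next_primary (&pb, shifted);
      if (wa != wb)
        return wa < wb ? -1 : 1;
      if (wa == 0)
        break;
    }
  /* Primary weights are equal: fall back to a byte-wise tie-break. */
  r = strcmp (a, b);
  return (r > 0) - (r < 0);
}
```

With this sketch, toy_compare ("H r", "Ha", 0) is negative, so 'H r' sorts
first in the non-ignorable style, while toy_compare ("H r", "Ha", 1) is
positive, so 'H r' sorts after 'Ha' in the shifted style.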
I expect it would require creating a glibc locale to change the collation
order, which is not something we can do.