use getSortKey in Unicode::Collate

bug-texinfo

From:	Gavin Smith
Subject:	use getSortKey in Unicode::Collate
Date:	Mon, 13 Feb 2023 08:47:26 +0000

> Other than that I do not have much other idea than disabling it, for
> instance if documentlanguage is en.  The result with Unicode::Collate is
> better for accented letters, but not so useful in english.  There could
> even be a customization variable to use Unicode::Collate even in
> english.

Another possibility is to use getSortKey:

       "$sortKey = $Collator->getSortKey($string)"
           -- see 4.3 Form Sort Key, UTS #10.

           Returns a sort key.

           You compare the sort keys using a binary comparison and get the
           result of the comparison of the strings using UCA.

              $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

                 is equivalent to

              $Collator->cmp($a, $b)

From the perlperf man page:

       Using a subroutine as part of your sort is a powerful way to get
       exactly what you want, but will usually be slower than the built-in
       alphabetic "cmp" and numeric "<=>" sort operators.  It is possible to
       make multiple passes over your data, building indices to make the
       upcoming sort more efficient, and to use what is known as the "OM"
       (Orcish Maneuver) to cache the sort keys in advance.  The cache lookup,
       while a good idea, can itself be a source of slowdown by enforcing a
       double pass over the data - once to setup the cache, and once to sort
       the data.  Using "pack()" to extract the required sort key into a
       consistent string can be an efficient way to build a single string to
       compare, instead of using multiple sort keys, which makes it possible
       to use the standard, written in "c" and fast, perl "sort()" function on
       the output, and is the basis of the "GRT" (Guttman Rossler Transform).
       Some string combinations can slow the "GRT" down, by just being too
       plain complex for its own good.
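
As a rough illustration of the "OM" mentioned in that excerpt, applied to
getSortKey: the sketch below computes each key the first time the comparator
sees a string and reuses it afterwards. The entries and the %cache name are
made up for the example; this is not code from texi2any.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Hypothetical entries, for illustration only.
my @entries = qw(pear fig apple banana fig);

# Orcish Maneuver: compute a sort key on first use inside the comparator,
# then reuse it from %cache on every later comparison.
my %cache;
my @sorted = sort {
    ($cache{$a} //= $collator->getSortKey($a))
        cmp
    ($cache{$b} //= $collator->getSortKey($b))
} @entries;

print "$_\n" for @sorted;
```

Each distinct string is passed to getSortKey at most once, however many
comparisons the sort performs.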

We could try caching sort keys and see whether that is fast enough.  If so,
we could keep using Unicode::Collate unconditionally, without adding a
setting for it.
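
A minimal sketch of what caching the keys up front might look like, assuming
the strings to sort are available as a plain list (the entries and variable
names here are hypothetical, not taken from texi2any):

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Hypothetical index entries, for illustration only.
my @index_entries = ('zebra', 'Émile', 'apple', 'eagle', 'Apple');

# Compute each sort key exactly once, before sorting.
my %sort_key;
$sort_key{$_} = $collator->getSortKey($_) for @index_entries;

# Sort with the built-in binary "cmp" on the precomputed keys.
my @by_key = sort { $sort_key{$a} cmp $sort_key{$b} } @index_entries;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @by_key;
```

The order should match what $collator->sort(@index_entries) gives, since
comparing sort keys with "cmp" is documented as equivalent to
$Collator->cmp.  Whether it is actually faster would need to be measured.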
