use getSortKey in Unicode::Collate
From: Gavin Smith
Subject: use getSortKey in Unicode::Collate
Date: Mon, 13 Feb 2023 08:47:26 +0000
> Other than that I do not have much other idea than disabling it, for
> instance if documentlanguage is en. The result with Unicode::Collate is
> better for accented letters, but not so useful in english. There could
> even be a customization variable to use Unicode::Collate even in
> english.
Another possibility is to use getSortKey:
    $sortKey = $Collator->getSortKey($string)

  -- see 4.3 Form Sort Key, UTS #10.

  Returns a sort key.

  You compare the sort keys using a binary comparison and get the
  result of the comparison of the strings using UCA.

    $Collator->getSortKey($a) cmp $Collator->getSortKey($b)

  is equivalent to

    $Collator->cmp($a, $b)
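
To make the documented equivalence concrete, here is a minimal sketch that
compares the two approaches on a few sample pairs. The sample strings are
made up for illustration, and the collator uses default options, which may
not match what texi2any passes.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Default collator; texi2any may construct its collator differently.
my $collator = Unicode::Collate->new();

# Illustrative string pairs (assumptions, not taken from any manual).
my @pairs = (
    ['apple', 'Apple'],
    ['cote',  "c\x{f4}t\x{e9}"],
    ['b',     'a'],
    ['same',  'same'],
);

my $mismatches = 0;
for my $pair (@pairs) {
    my ($x, $y) = @$pair;
    # Binary comparison of the two sort keys ...
    my $by_key = $collator->getSortKey($x) cmp $collator->getSortKey($y);
    # ... versus the collator's own comparison method.
    my $by_cmp = $collator->cmp($x, $y);
    $mismatches++ if $by_key != $by_cmp;
}
print "mismatches: $mismatches\n";
```

If the documentation quoted above holds, the mismatch count should be zero.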
From the perlperf man page:
Using a subroutine as part of your sort is a powerful way to get
exactly what you want, but will usually be slower than the built-in
alphabetic "cmp" and numeric "<=>" sort operators. It is possible to
make multiple passes over your data, building indices to make the
upcoming sort more efficient, and to use what is known as the "OM"
(Orcish Maneuver) to cache the sort keys in advance. The cache lookup,
while a good idea, can itself be a source of slowdown by enforcing a
double pass over the data - once to setup the cache, and once to sort
the data. Using "pack()" to extract the required sort key into a
consistent string can be an efficient way to build a single string to
compare, instead of using multiple sort keys, which makes it possible
to use the standard, written in "c" and fast, perl "sort()" function on
the output, and is the basis of the "GRT" (Guttman Rossler Transform).
Some string combinations can slow the "GRT" down, by just being too
plain complex for its own good.
We could try caching sort keys and see if it is fast enough. If so, we
could still use Unicode::Collate unconditionally, without adding any
setting for this.
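
A rough sketch of what caching the sort keys could look like: each key is
computed once, and the sort then only does binary comparisons on the cached
keys. The entry list and variable names are hypothetical and this is not
the texi2any code; whether it is fast enough would need to be measured.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# Hypothetical index entries, for illustration only.
my @entries = ("zebra", "\x{e9}clair", "apple", "Echo", "eclipse");

# Compute each sort key exactly once, before sorting.
my %sort_key = map { $_ => $collator->getSortKey($_) } @entries;

# The comparison is now a plain binary cmp on cached keys,
# with no call into the collator per comparison.
my @sorted = sort { $sort_key{$a} cmp $sort_key{$b} } @entries;

# Reference ordering from the collator's own sort method.
my @reference = $collator->sort(@entries);

binmode(STDOUT, ':encoding(UTF-8)');
print join(", ", @sorted), "\n";
```

The hash lookup in the sort block still has some cost. Sorting the keys
themselves with the bare built-in sort, as in the GRT described above,
would avoid that, at the price of mapping keys back to entries afterwards.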