Re: use getSortKey in Unicode::Collate
From: Patrice Dumas
Subject: Re: use getSortKey in Unicode::Collate
Date: Mon, 13 Feb 2023 12:42:54 +0100
On Mon, Feb 13, 2023 at 08:47:26AM +0000, Gavin Smith wrote:
> > Other than that, I have no idea other than disabling it, for
> > instance if documentlanguage is en. The result with Unicode::Collate is
> > better for accented letters, but not so useful in English. There could
> > even be a customization variable to use Unicode::Collate even in
> > English.
>
> Another possibility is to use getSortKey:
>
> "$sortKey = $Collator->getSortKey($string)"
> -- see 4.3 Form Sort Key, UTS #10.
>
> Returns a sort key.
>
> You compare the sort keys using a binary comparison and get the
> result of the comparison of the strings using UCA.
>
> $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
>
> is equivalent to
>
> $Collator->cmp($a, $b)
>
> From perlperf man page:
>
> Using a subroutine as part of your sort is a powerful way to get
> exactly what you want, but will usually be slower than the built-in
> alphabetic "cmp" and numeric "<=>" sort operators. It is possible to
> make multiple passes over your data, building indices to make the
> upcoming sort more efficient, and to use what is known as the "OM"
> (Orcish Maneuver) to cache the sort keys in advance. The cache lookup,
> while a good idea, can itself be a source of slowdown by enforcing a
> double pass over the data - once to setup the cache, and once to sort
> the data. Using "pack()" to extract the required sort key into a
> consistent string can be an efficient way to build a single string to
> compare, instead of using multiple sort keys, which makes it possible
> to use the standard, written in "c" and fast, perl "sort()" function on
> the output, and is the basis of the "GRT" (Guttman Rossler Transform).
> Some string combinations can slow the "GRT" down, by just being too
> plain complex for its own good.
>
> We could try caching sort keys and see if it is fast enough. If so, we
> could still use Unicode::Collate without any setting for this.
OK, I'll propose a change. It would be simpler to avoid any use of
Unicode::Collate by using getSortKey too.
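
As a rough sketch of the caching idea quoted above (this is not the actual
texi2any change; the entry list and the variable names are made up for
illustration), each sort key can be computed once with getSortKey and the
keys then compared with the built-in "cmp":

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my $collator = Unicode::Collate->new();

# Hypothetical index entries, only for illustration.
my @entries = ("zebra", "Éclair", "apple", "eclair", "Apple");

# Baseline: the collator's cmp method is called for every comparison.
my @sorted_cmp = sort { $collator->cmp($a, $b) } @entries;

# Cached variant: one getSortKey call per entry, then the built-in
# string "cmp" on the precomputed keys.
my %key_for = map { $_ => $collator->getSortKey($_) } @entries;
my @sorted_keys = sort { $key_for{$a} cmp $key_for{$b} } @entries;

print join(", ", @sorted_keys), "\n";
```

Both sorts should give the same order, since comparing sort keys is
documented as equivalent to $Collator->cmp. Whether the cached variant is
fast enough in practice would still need to be measured.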
--
Pat