Re: use getSortKey in Unicode::Collate

bug-texinfo

From:	Patrice Dumas
Subject:	Re: use getSortKey in Unicode::Collate
Date:	Mon, 13 Feb 2023 12:42:54 +0100

On Mon, Feb 13, 2023 at 08:47:26AM +0000, Gavin Smith wrote:
> > Other than that I do not have much other idea than disabling it, for
> > instance if documentlanguage is en.  The result with Unicode::Collate is
> > better for accented letters, but not so useful in english.  There could
> > even be a customization variable to use Unicode::Collate even in
> > english.
> 
> Another possibility is to use getSortKey:
> 
>        "$sortKey = $Collator->getSortKey($string)"
>            -- see 4.3 Form Sort Key, UTS #10.
> 
>            Returns a sort key.
> 
>            You compare the sort keys using a binary comparison and get the
>            result of the comparison of the strings using UCA.
> 
>               $Collator->getSortKey($a) cmp $Collator->getSortKey($b)
> 
>                  is equivalent to
> 
>               $Collator->cmp($a, $b)
> 
> From perlperf man page:
> 
>        Using a subroutine as part of your sort is a powerful way to get
>        exactly what you want, but will usually be slower than the built-in
>        alphabetic "cmp" and numeric "<=>" sort operators.  It is possible to
>        make multiple passes over your data, building indices to make the
>        upcoming sort more efficient, and to use what is known as the "OM"
>        (Orcish Maneuver) to cache the sort keys in advance.  The cache lookup,
>        while a good idea, can itself be a source of slowdown by enforcing a
>        double pass over the data - once to setup the cache, and once to sort
>        the data.  Using "pack()" to extract the required sort key into a
>        consistent string can be an efficient way to build a single string to
>        compare, instead of using multiple sort keys, which makes it possible
>        to use the standard, written in "c" and fast, perl "sort()" function on
>        the output, and is the basis of the "GRT" (Guttman Rossler Transform).
>        Some string combinations can slow the "GRT" down, by just being too
>        plain complex for its own good.
> 
> We could try caching sort keys and see if it is fast enough.  If so, we
> could still use Unicode::Collate without any setting for this.

OK, I'll propose a change.  It would be simpler to avoid any use of
Unicode::Collate by using getSortKey too.
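
For reference, a minimal sketch of what caching the sort keys could look
like.  This is not the texi2any code: the @entries list and the variable
names are made up for illustration, and the collator is created with
default options, which may not match what texi2any would need.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Hypothetical index entries; the real texi2any data structures differ.
my @entries = ('zebra', 'Éclair', 'apple', 'eclipse', 'Echo');

# Assumption: default collator options.
my $collator = Unicode::Collate->new();

# Compute each sort key once, then sort with the built-in binary cmp,
# so the collator is called once per entry, not once per comparison.
my %sort_key = map { $_ => $collator->getSortKey($_) } @entries;
my @sorted = sort { $sort_key{$a} cmp $sort_key{$b} } @entries;

# Reference result: call the collator's cmp method for every comparison.
my @reference = sort { $collator->cmp($a, $b) } @entries;

binmode(STDOUT, ':encoding(UTF-8)');
print join("\n", @sorted), "\n";
```

Both sorts should give the same order, since comparing sort keys with
cmp is documented as equivalent to $Collator->cmp.  Whether the cached
version is fast enough on a large index would still need to be measured.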

-- 
Pat
