Re: texi2any is too slow because of Unicode::Collate

bug-texinfo

From:	Gavin Smith
Subject:	Re: texi2any is too slow because of Unicode::Collate
Date:	Sat, 11 Feb 2023 20:30:07 +0000

On Sat, Feb 11, 2023 at 10:02:55PM +0200, Eli Zaretskii wrote:
> > From: Gavin Smith <gavinsmith0123@gmail.com>
> > Date: Sat, 11 Feb 2023 19:46:12 +0000
> > 
> > On Sat, Feb 11, 2023 at 08:04:15PM +0100, Patrice Dumas wrote:
> > > Other than that, I do not have many ideas besides disabling it, for
> > > instance if documentlanguage is en.  The result with Unicode::Collate is
> > > better for accented letters, but not so useful in English.  There could
> > > even be a customization variable to use Unicode::Collate even in
> > > English.
> > 
> > I think it's a good idea to disable it for "en" at least, along with
> > a customization variable.
> 
> How many manuals set documentlanguage?  With the proliferation of
> documentencoding set to UTF-8, I think disabling the collation for
> "en" will be next to futile.

If I understand correctly, until recently more standard Perl facilities
were used for sorting the indices, but this produced worse results for
non-English text, such as text containing many accented characters.
Unicode::Collate is used to sort the indices "properly".  Whether the
document uses UTF-8 may not be a relevant factor.
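
As a rough illustration of the difference described above, here is a small
sketch in Python (not the Perl that texi2any uses).  It compares a plain
code-point sort with a crude accent-folding sort.  The accent-folding key is
only a stand-in chosen for this example; it is not the Unicode Collation
Algorithm that Unicode::Collate implements, and the function names and sample
entries are made up.

```python
import unicodedata

def codepoint_key(entry):
    # Plain string comparison: entries are ordered by code point.
    return entry

def accent_folding_key(entry):
    # Assumption for this sketch: decompose the string, drop combining
    # marks, and case-fold.  The original string is kept as a tie-breaker.
    decomposed = unicodedata.normalize("NFD", entry)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), entry)

# Hypothetical index entries.  \u00e9 is e-acute, \u00c9 is E-acute.
entries = ["zebra", "\u00e9tude", "eagle", "\u00c9clair", "apple", "echo"]

by_codepoint = sorted(entries, key=codepoint_key)
by_folding = sorted(entries, key=accent_folding_key)

print(by_codepoint)  # accented entries end up after "zebra"
print(by_folding)    # accented entries are filed among the other "e" entries
```

With the code-point sort, every entry starting with an accented letter lands
after "zebra", which is the kind of result that is poor for non-English
indices.  For an index containing only ASCII entries, the two orders differ
only in how letter case is treated.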

Could we investigate further which languages the old method causes
problems for?  It might be okay for more languages than just English.

> How come format_printindex takes such a large proportion of the
> processing?  Isn't that strange?  Index entries are usually a small
> proportion of the overall manual's text, so processing the manual
> should take the lion's share.  The index in the manual you were timing
> has about 8K entries, but the entire manual is 100K lines, so the
> index is less than 10% of the total volume.  How come its processing
> is so expensive?

It's the sorting of the index entries into alphabetical order, I presume.
There isn't a similar sorting process for the rest of the manual.
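
To give a feel for why sorting could dominate even though the index is a
small part of the manual, here is a hedged Python sketch that counts how many
pairwise comparisons a comparison sort makes on 8000 items (the "about 8K
entries" figure quoted above).  It says nothing about how texi2any actually
sorts (for instance, whether it compares pairs or precomputes sort keys); the
function name and the random data are assumptions made for this example.

```python
import functools
import random

def count_comparisons(n, seed=0):
    # Sort n random values and count the calls to the comparison function.
    rng = random.Random(seed)
    entries = [rng.random() for _ in range(n)]
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    sorted(entries, key=functools.cmp_to_key(compare))
    return calls

comparisons = count_comparisons(8000)
print(comparisons, "comparisons for 8000 entries")
```

The count grows like n log2 n, which is about 104,000 for n = 8000, so each
entry takes part in many comparisons.  If a single collation-aware comparison
is costly, that cost is multiplied accordingly, whereas the rest of the manual
is processed without any such sorting step.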
