bug-texinfo

Re: texi2any is too slow because of Unicode::Collate


From: pertusus
Subject: Re: texi2any is too slow because of Unicode::Collate
Date: Sat, 11 Feb 2023 21:40:58 +0100

On Sat, Feb 11, 2023 at 08:30:07PM +0000, Gavin Smith wrote:
> > 
> > How many manuals set documentlanguage?  With the proliferation of
> > documentencoding set to UTF-8, I think disabling the collation for
> > "en" will be next to futile.
> 
> If I understand correctly, until recently more standard Perl facilities
> were used for sorting the indices, but this produced worse results for
> non-English text, such as that containing many accented characters.

Indeed, the default sort simply orders strings by Unicode code point,
I believe.

> Unicode::Collate is used to sort the indices "properly".  Use of UTF-8
> may not be a relevant factor.

No, it is not: we always convert to the Perl internal encoding,
irrespective of the manual encoding.

> Could we investigate further which languages it causes a problem for?
> The old method might be okay for more languages than just English.

For French, and I believe for all languages in which an accented letter
should sort next to its unaccented counterpart (for instance e and é),
the sort is much better with Unicode::Collate.
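
To make the difference concrete, here is a minimal sketch in Python
(not the texi2any code, and not Unicode::Collate). It compares a plain
code point sort with a sort on an accent-folded key. The helper names
and the sample entries are made up for illustration, and stripping
combining marks is only a rough stand-in for real collation.

```python
import unicodedata


def fold_accents(text):
    """Lowercase the text and drop combining marks (a rough primary key)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def collation_like_key(text):
    # Accent-folded form first, original string as a tie-breaker.
    # This only approximates what a real collation algorithm does.
    return (fold_accents(text), text)


# Hypothetical index entries; \u00e9 is the precomposed letter é.
entries = ["zone", "\u00e9cole", "eau", "\u00e9tude", "euro"]

# Default sort: compares code points, so é (U+00E9) lands after z (U+007A).
by_code_point = sorted(entries)

# Folded sort: é is treated like e for the primary comparison.
by_folded_key = sorted(entries, key=collation_like_key)

print(by_code_point)
print(by_folded_key)
```

With the code point sort, the entries starting with é end up after
"zone"; with the folded key they sit among the other entries in e.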

> > How come format_printindex takes such a large proportion of the
> > processing?  Isn't that strange?  Index entries are usually a small
> > proportion of the overall manual's text, so processing the manual
> > should take the lion's share.  The index in the manual you were timing
> > has about 8K entries, but the entire manual is 100K lines, so the
> > index is less than 10% of the total volume.  How come its processing
> > is so expensive?
> 
> It's the sorting of the index entries into alphabetical order, I presume.
> There isn't a similar sorting process for the rest of the manual.

Exactly.  Given the size of the index, this manual may show the most
extreme slowdown, if the cost is more than linear in the size of the
index.  As to why Unicode::Collate is slow, I do not think it is easy to
know.  It could depend on the Unicode::Collate version too.
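
As a rough illustration of "more than linear", here is a small Python
sketch (an assumption for illustration, unrelated to how texi2any or
Unicode::Collate sort internally). It counts how many times a
comparison function is called when sorting random data of two sizes.
If each comparison is expensive, that count is what dominates.

```python
import functools
import random


def count_comparisons(n, seed=0):
    """Sort n random numbers and return how often the comparator ran."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    items = [rng.random() for _ in range(n)]
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    # cmp_to_key routes every comparison made by sorted() through compare().
    sorted(items, key=functools.cmp_to_key(compare))
    return calls


small = count_comparisons(4000)
large = count_comparisons(8000)
print(small, large, large / small)
```

Doubling the number of entries more than doubles the number of
comparisons, so an index of several thousand entries pays for the
per-comparison cost many times per entry.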

-- 
Pat


