bug-gnulib

Re: bug in join: case comparisons don't work in multibyte locales


From: Bruno Haible
Subject: Re: bug in join: case comparisons don't work in multibyte locales
Date: Thu, 12 Mar 2009 12:39:16 +0100
User-agent: KMail/1.9.9

Pádraig Brady wrote:
> Note as well as folding case I think it might
> be useful to fold other forms like:
>   Enclosed:  \u24b6 -> A
>   Stylistic: \uff21-> A

Both of these transformations are already performed when you call
ulc_casecmp with the UNINORM_NFKD argument.
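
To illustrate the effect of compatibility decomposition combined with case
folding, here is a minimal sketch in Python. It uses the standard
unicodedata module, not libunistring; the helper names nfkd_casefold and
caseless_equal are made up for this example and are not part of any
library. It only approximates what an NFKD-normalizing caseless comparison
does.

```python
import unicodedata

def nfkd_casefold(s: str) -> str:
    """Return a key for caseless matching (hypothetical helper).

    NFKD maps compatibility characters (enclosed, fullwidth, ...) to
    their plain equivalents; casefold() removes case distinctions.
    Normalizing again afterwards handles characters whose case-folded
    form is itself decomposable.
    """
    decomposed = unicodedata.normalize("NFKD", s)
    return unicodedata.normalize("NFKD", decomposed.casefold())

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring case and compatibility variants."""
    return nfkd_casefold(a) == nfkd_casefold(b)
```

With this sketch, "\u24b6" and "\uff21" both compare equal to "a", while
"À" does not compare equal to "A", because the combining grave accent
survives the decomposition.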

>   Diacritics:  À -> A

Very good point. Case-insensitive comparisons are used in contexts
where different people enter the same word / name / term. But in these
contexts, additional transformations need to be done, depending on
culture. I think Google's front end to the search engine does these
transformations. They are:
  - for French, to remove accents and diacritics,
  - for German, to transform umlauts (ü -> ue),
  - for Danish, probably to transform å -> aa,
  - and certainly much more for other languages (what is it for Chinese?).

> I.E. have more general function like:
> ulc_coll(fold={Case|Diacritics|Stylistic}, ...);

_coll or _cmp? _coll is used when people want to put lists of names in
order. Ignoring diacritics is useful for lookups, not for sorting.

Also, as mentioned above, I think which parts should be folded is locale
dependent. For French, it is ok to ignore diacritics when doing caseless
matching; for German, it is not.
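
A language-dependent lookup key could be sketched as follows. This is a
Python illustration under stated assumptions, not an existing gnulib or
libunistring interface: the function lookup_key, its language argument,
and the small GERMAN_UMLAUTS table are all hypothetical, and the rules
cover only the three examples mentioned above.

```python
import unicodedata

# Hypothetical table: German umlauts are expanded, not stripped.
GERMAN_UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def lookup_key(s: str, language: str) -> str:
    """Build a key for caseless lookups (not for sorting).

    `language` is an ISO 639 code; only "fr", "de" and "da" get
    special treatment in this sketch.
    """
    # Compose first so that umlauts are single characters, then fold case.
    s = unicodedata.normalize("NFC", s).casefold()
    if language == "de":
        s = "".join(GERMAN_UMLAUTS.get(ch, ch) for ch in s)
    elif language == "da":
        s = s.replace("å", "aa")
    s = unicodedata.normalize("NFKD", s)
    if language == "fr":
        # Drop combining marks, i.e. ignore accents and diacritics.
        s = "".join(ch for ch in s if not unicodedata.combining(ch))
    return s
```

Under these rules "Müller" and "Mueller" get the same key for German but
"Muller" does not, while for French "Élève" and "eleve" get the same key.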

Bruno



