[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

strip accents and sorting [was: BibTeX issues]

From: Roland Winkler
Subject: strip accents and sorting [was: BibTeX issues]
Date: Wed, 28 Aug 2019 22:26:38 -0500

On Wed Aug 28 2019 Eli Zaretskii wrote:
> > From: Roland Winkler <address@hidden>
> > If there was a generic function strip-accents, then BibTeX mode could
> > certainly use it within its bibtex-generate-autokey machinery.
> I don't think we have such a function, but it shouldn't be hard to
> write one, using the facilities in ucs-normalize.el.

Interesting! What are the intended use cases for ucs-normalize.el
and the algorithms that it implements?

I had never much thought about this.  But there is obviously a
problem when one tries to sort a database where the keys may contain
more fancy utf characters. (This problem must be well-known in the
utf world).  Naivly one might hope that the following lines are
properly sorted according to string-lessp


But (string-lessp "ä-umlaut" "ö-combine") gives nil so that sort-lines gives


Of course, this is due to the fact that a German umlaut can be
represented with its own character or with a combining diaeresis.
These two ways of presenting an umlaut look the same, but they are
not the same for string-lessp.

This can be particularly annoying when a database (be it BibTeX,
BBDB, or whatever) is often enough populated by copying records from
different sources that may represent such fancy utf characters in
different ways.

Now, one solution would be to simply strip off the combining
characters by decomposing the characters.  Or is there a possibility
to teach a sorting algorithm that the first letter of ä-combine is
"the same" as the first letter of ä-umlaut and all this should
appear near a-plain instead of past o-plain?


reply via email to

[Prev in Thread] Current Thread [Next in Thread]