Re: iconv replacements

bug-gnulib

From:	Bruno Haible
Subject:	Re: iconv replacements
Date:	Thu, 30 Jul 2020 11:39:43 +0200
User-agent:	KMail/5.1.3 (Linux/4.4.0-186-generic; KDE/5.18.0; x86_64; ; )

[Dropping bug-bison from CC]

> > Yes and no. The code is not making assumptions about a particular iconv()
> > implementation. But it needs to distinguish two categories of replacements
> > done by iconv():
> >   - those that are harmless (for example when replacing a Unicode TAG
> >     character U+E00xx with an empty output),
> >   - those that are better not presented to the user, if the programmer has
> >     specified a fallback (for example, replacing all non-ASCII characters
> >     with NUL, '?', or '*').
> >
> > The standards don't help in making the distinction.
> >
> > Therefore whether you consider said glibc and libiconv behaviour as
> > "non-conforming" or not is irrelevant.
> 
> Could you sketch briefly what you need?  We have identified some issues
> with the existing iconv interface.  If we add an enhancement, it would
> make sense to cover these requirements.

POSIX [1] says:

  "If iconv() encounters a character in the input buffer that is valid, but for
   which an identical character does not exist in the target codeset, iconv()
   shall perform an implementation-defined conversion on this character."

  "The iconv() function shall ... return the number of non-identical
   conversions performed."

This is sufficient for detecting that iconv() did something that the
application might or might not like.
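
As an illustration of that return value, here is a minimal sketch. The
helper name convert_and_count is hypothetical, and it assumes that the
platform's iconv_open() accepts "UTF-8" as the source encoding and that
iconv() takes a 'char **' input argument (as in glibc and musl):

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any library.
   Converts the NUL-terminated UTF-8 string IN to the encoding TOCODE,
   storing at most OUTSIZE - 1 bytes plus a terminating NUL in OUT
   (OUTSIZE must be at least 1).
   Returns the number of non-identical conversions reported by iconv(),
   or (size_t) -1 if the conversion failed, with errno set.  */
static size_t
convert_and_count (const char *tocode, const char *in,
                   char *out, size_t outsize)
{
  iconv_t cd = iconv_open (tocode, "UTF-8");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* iconv() wants a 'char **' but does not modify the input bytes.  */
  char *inptr = (char *) in;
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft = outsize - 1;

  size_t count = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (count != (size_t) -1)
    {
      /* Flush a possible shift state of stateful encodings.  */
      size_t flushed = iconv (cd, NULL, NULL, &outptr, &outleft);
      if (flushed == (size_t) -1)
        count = (size_t) -1;
      else
        count += flushed;
    }

  /* Preserve the errno of iconv() across iconv_close().  */
  int saved_errno = errno;
  iconv_close (cd);
  errno = saved_errno;

  if (count != (size_t) -1)
    *outptr = '\0';
  return count;
}
```

A result of 0 means every character had an identical counterpart in the
target encoding. A result greater than 0 only says that iconv() replaced
something, not what it replaced it with.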

For decent application behaviour in UTF-8, legacy 8-bit, and ASCII locales
I wrote a module 'unicodeio' that accepts an ASCII fallback given by the
programmer. For example, for the string "François Pinard" a fallback
"Francois Pinard" can be given, and for the string "•" a fallback "." can
be given.

This code needs to analyze what iconv() actually did and distinguish between
replacements that are OK (no need to activate the ASCII fallback) and those
that are worse than the ASCII fallback. For example:
  - Replacing 'ç' with '?' (NetBSD, Solaris 11) or '*' (musl) or NUL (IRIX)
    is worse than the ASCII fallback.
  - Replacing a Unicode tag character with an empty string is OK.
  - Replacing GREEK SMALL LETTER MU with MICRO SIGN is OK.
  - Replacing FULLWIDTH COLON with ':' is OK (most likely equivalent to the
    ASCII fallback).
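
The distinction in this list could be approximated by a heuristic like
the following. This is only a sketch under my own assumptions, not
necessarily what 'unicodeio' does: the function name and parameters are
hypothetical, and it assumes the caller converted exactly one character
and kept the return value and the output bytes of iconv():

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical heuristic.
   CODE is the Unicode code point that was converted, NONIDENTICAL is
   the return value of iconv(), and OUT/OUTLEN are the bytes it produced.
   Returns true if the ASCII fallback should be used instead.  */
static bool
replacement_is_unacceptable (unsigned int code, size_t nonidentical,
                             const char *out, size_t outlen)
{
  if (nonidentical == (size_t) -1)
    return true;            /* conversion failed altogether */
  if (nonidentical == 0)
    return false;           /* identical conversion, nothing to judge */
  if (outlen != 1)
    return false;           /* empty output (e.g. a tag character) or
                               a longer transliteration */
  switch (out[0])
    {
    case '\0':              /* NUL as a placeholder */
      return code != 0;
    case '?':               /* '?' as a placeholder */
      return code != '?';
    case '*':               /* '*' as a placeholder */
      return code != '*';
    default:                /* e.g. MICRO SIGN for MU, ':' for U+FF1A */
      return false;
    }
}
```

The weakness of such a heuristic is that it has to guess the placeholder
bytes of every implementation; a placeholder not listed here would pass
as an acceptable replacement.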

That's my requirement from the application side. I don't know whether an
iconv() implementation can help here, given the limited interface of iconv.

Maybe there could be an alternative to //TRANSLIT in the iconv_open()
argument, that would specify e.g. that tag characters and <compat> and <wide>
replacements in UnicodeData.txt are OK but other replacements are not OK?
Where either
  - OK means a conversion that does not increment the return value,
  - "not OK" means a conversion that increments the return value,
or
  - OK means a conversion that increments the return value,
  - "not OK" means an error return (-1 / EILSEQ).
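
From the application's point of view, the two alternatives would lead to
checks roughly like the following. This is a sketch of a hypothetical
caller only; the enum and function names are made up for illustration,
and no existing iconv() implementation offers such a mode:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* The two hypothetical reporting conventions from above.  */
enum report_convention
{
  NOT_OK_INCREMENTS_COUNT,  /* OK: count unchanged; not OK: count + 1 */
  NOT_OK_FAILS_WITH_EILSEQ  /* OK: count + 1; not OK: -1 / EILSEQ */
};

/* RET is the return value of iconv(), ERR the errno value after the
   call.  Returns true if the application should switch to its ASCII
   fallback.  Other failures (E2BIG, EINVAL) concern buffer handling
   and are left to the caller.  */
static bool
needs_ascii_fallback (enum report_convention conv, size_t ret, int err)
{
  if (ret == (size_t) -1)
    return err == EILSEQ;   /* unconvertible under either convention */
  if (conv == NOT_OK_INCREMENTS_COUNT)
    return ret > 0;         /* only "not OK" replacements are counted */
  return false;             /* counted replacements are all OK */
}
```

With the first convention the application needs no knowledge of which
placeholder byte an implementation uses; with the second, it gets the
same information through the existing error path.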

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
