Re: Unicode

bug-gettext

From:	Bruno Haible
Subject:	Re: Unicode
Date:	Tue, 28 Nov 2023 08:10:14 +0100

Robert Clausecker wrote:
> > > Note that this depends on country.  E.g. CJK users are not entirely happy
> > > with Unicode as it butchers Han characters in various unpleasant ways.
> > 
> > Your information is outdated.
> > 1. It was only the Japanese community which was upset about Han unification.
> >    (Chinese and Korean people were happy with it.)
> > 2. Their issues were addressed in Unicode, through the addition of variation
> >    selectors (and probably also specialized fonts and/or tailoring in the
> >    rendering engines). The Japanese complaints have since quieted down.
> 
> The problems persist as typefaces often implement variation selectors
> incorrectly and/or feel obliged to select incorrect variants by default so
> as to preserve a distinction between default and compatibility encodings.
> 
> What seems to be happening is rather that CJK communities have largely given
> up on the fight and accepted that CJK will continue to suck in the future.

Do you have web references for this, from the last five years? I'd like to know.
In particular, I'd like to know if such complaints are specific to Linux or
apply to Microsoft and Apple operating systems as well.

The only complaint I've seen recently is that some emoji look different on
iOS than on Android; nothing regarding Japanese.

> I have even
> previously contributed high-performance Unicode transcoding functions to
> simdutf that are now used by mainstream software.

simdutf is a marvellous gem, indeed.

> However, I am of the strong
> opinion that the option to use different character sets and encodings must be
> preserved indefinitely.  Moving to Unicode-only is a terrible move.

The software industry decided differently, between 1995 and ca. 2013.
1995 was when Win32 and Java settled on UTF-16 as the internal representation
of strings. Qt soon followed suit. JSON and YAML use UTF-8. XML 1.1 has only
one ASCII-compatible encoding that is supported across all platforms, namely
UTF-8. And so on.

> > The fewer charset conversions need to be done, the more reliable the
> > programs become, and the more maintainable the code can become.
> 
> I agree.  This is why wide strings exist.

No, wide strings in C are not the solution. See
<https://www.gnu.org/software/gnulib/manual/html_node/The-wchar_005ft-type.html>

char32_t[] strings, assuming ISO C 23, are the solution.
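
To illustrate what working with char32_t[] looks like, here is a minimal
sketch in C that decodes UTF-8 bytes into a char32_t array, one element per
code point. The function name utf8_to_u32 and its error convention are made
up for this example; they are not part of ISO C, gnulib, or libunistring.
It assumes a <uchar.h> as provided since C11.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: decode the UTF-8 bytes src[0..n) into dst, which has
   room for cap elements.  Returns the number of code points written, or
   (size_t)-1 if the input is not valid UTF-8 or dst is too small.  */
size_t utf8_to_u32(const unsigned char *src, size_t n,
                   char32_t *dst, size_t cap)
{
    size_t i = 0, out = 0;

    while (i < n) {
        unsigned char b = src[i];
        char32_t cp;
        size_t len;

        /* The lead byte determines the sequence length.  */
        if (b < 0x80)                    { cp = b;        len = 1; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0)     { cp = b & 0x0F; len = 3; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; len = 4; }
        else
            return (size_t)-1;           /* stray continuation or bad lead */

        if (len > n - i)
            return (size_t)-1;           /* truncated sequence */

        /* Each continuation byte contributes six payload bits.  */
        for (size_t k = 1; k < len; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (src[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values above U+10FFFF.  */
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000)
            || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return (size_t)-1;

        if (out == cap)
            return (size_t)-1;           /* destination full */
        dst[out++] = cp;
        i += len;
    }
    return out;
}
```

After decoding, every array element is one code point, so indexing and
stepping through the string no longer depend on the multibyte encoding.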

> Unicode is also a complicated mess that is hard to implement correctly.
> Things as easy as "find the next character boundary" require complex tables
> and good knowledge of how it works.

Yes, and the programmer must first determine whether they want the next
character boundary, the next glyph boundary, or the next grapheme cluster
boundary...
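
As a rough illustration of the first of these, the sketch below finds code
point boundaries in UTF-8, which needs no tables at all. The helper names
are hypothetical, and the code assumes the input is already valid UTF-8.
Grapheme cluster boundaries are a different matter: they need the property
tables from UAX #29.

```c
#include <stddef.h>

/* Hypothetical helper: given valid UTF-8 in s[0..n) and the index i of the
   start of a code point, return the index where the next code point starts
   (or n at the end of the string).  */
size_t utf8_next_boundary(const unsigned char *s, size_t n, size_t i)
{
    if (i >= n)
        return n;
    i++;                                   /* step over the lead byte */
    while (i < n && (s[i] & 0xC0) == 0x80) /* skip continuation bytes */
        i++;
    return i;
}

/* Count code points by walking from boundary to boundary.  */
size_t utf8_count_code_points(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i = utf8_next_boundary(s, n, i))
        count++;
    return count;
}
```

For the bytes 65 CC 81 78 (e, U+0301 COMBINING ACUTE ACCENT, x) this counts
three code points, although a reader sees two characters.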

But that's the price that needs to be paid for an encoding system that can
be used for Japanese, Vietnamese, and Hindi. *Evidently* you can't expect
Unicode to be as simple as ISO-8859-n, which was suitable only for languages
with alphabetic scripts.

> And without the multi-megabyte ICU
> package (or a suitable replacement), you are basically lost trying to do any
> non-trivial operations on Unicode strings.

There are smaller, alternative packages. GNU libunistring is one of them.
<https://www.gnu.org/software/libunistring/>

> The attack surface is huge, too, with multiple high-profile Unicode
> rendering related crashes in the last years.

You mean the magic Unicode characters that crashed Apple iOS?
Yup, all code needs to be accompanied by tests (a.k.a. quality assurance).
Unicode rendering is no exception to this rule.

> Other encodings will keep being relevant for as long as people write in
> non-ASCII scripts and need to represent these on systems that cannot / need
> not deal with the full complexity of Unicode.

I disagree. In the area of general computing, 8-bit encodings are already
irrelevant. In the area of mobile communications, GSM with its 7-bit default
alphabet is also already on the way out, IIRC.

Bruno
