octave-maintainers

How should we treat invalid UTF-8?


From: Markus Mützel
Subject: How should we treat invalid UTF-8?
Date: Sat, 2 Nov 2019 13:24:22 +0100

Hi,

Some time ago, we decided to use UTF-8 as the default encoding in Octave.
In particular, a change to allow (and require!) UTF-8 in regular expressions 
[1] triggered a few bug reports and questions on the mailing lists that 
involved invalid UTF-8 (e.g. [2]).
Background: Some characters in UTF-8 are encoded with multiple bytes (e.g. the
German umlaut "ä" is encoded as the two bytes [195 164] in decimal). As a
consequence of how Unicode code points are encoded in UTF-8, some byte
sequences cannot be decoded to a Unicode code point (e.g. a byte with the
decimal value 228 on its own). Such byte sequences are called "invalid".
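For example, at the Octave prompt (assuming a UTF-8 terminal):

    >> double ("ä")        # the umlaut is stored as two bytes
    ans =

       195   164

    >> char ([195 164])    # and the two bytes decode back to one character
    ans = ä
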
At the moment, we don't have any logic for handling those invalid byte
sequences specially. This can lead to a whole lot of different errors and is
not limited to the regexp family of functions. E.g., entering "char (228)" at
the Octave prompt leads to a replacement character ("�") being displayed in the
command window on Linux (at least for me on Ubuntu 19.04), but it completely
breaks the command window on Windows (e.g. [3]).
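This is what I see on Ubuntu:

    >> char (228)
    ans = �
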
Similarly, there are issues when using invalid UTF-8 for strings in plots.

There are different approaches for handling invalid byte sequences in UTF-8
(some of them suggested by the standard). I can't find a direct reference
right now, but Wikipedia summarizes them [4].
They can mainly be assigned to these 3 groups:
1. Throw an error.
2. Replace each invalid byte with a replacement character (the same one for
   all bytes, or different ones).
3. Fall back to a different encoding for such bytes (e.g. ISO-8859-1 or
   CP1252); see the sketch after this list.
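
Roughly, a fallback along those lines could look like the following Octave
sketch. The function name and the simplified validation are mine, and this is
certainly not meant as the final implementation:

    function out = utf8_with_latin1_fallback (in)
      ## Hypothetical sketch: copy valid UTF-8 sequences through unchanged
      ## and re-encode every invalid byte as if it were ISO-8859-1, i.e.,
      ## treat the byte value as the Unicode code point and encode that.
      bytes = uint8 (in);
      out = "";
      i = 1;
      while (i <= numel (bytes))
        len = utf8_seq_len (bytes, i);
        if (len > 0)
          out = [out, char(bytes(i:i+len-1))];  # valid: pass through
          i += len;
        else
          ## invalid byte (always >= 128 here, since bytes < 128 are valid
          ## one-byte sequences): code points U+0080..U+00FF become two bytes
          b = double (bytes(i));
          out = [out, char([192 + floor(b / 64), 128 + mod(b, 64)])];
          i += 1;
        endif
      endwhile
    endfunction

    function len = utf8_seq_len (bytes, i)
      ## Length of a valid UTF-8 sequence starting at BYTES(I), or 0 if
      ## there is none.  (Simplified: overlong encodings and surrogate
      ## ranges are not rejected here.)
      b = double (bytes(i));
      if (b < 128)
        len = 1;
        return;
      elseif (b >= 194 && b <= 223)
        len = 2;
      elseif (b >= 224 && b <= 239)
        len = 3;
      elseif (b >= 240 && b <= 244)
        len = 4;
      else
        len = 0;
        return;
      endif
      if (i + len - 1 > numel (bytes))
        len = 0;
        return;
      endif
      ## all continuation bytes must be in the range 128..191
      cont = double (bytes(i+1:i+len-1));
      if (any (cont < 128) || any (cont > 191))
        len = 0;
      endif
    endfunction

With something like that in place, "char (181)" on its own would map to the
micro sign:

    >> utf8_with_latin1_fallback (char (181))
    ans = µ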

Judging from some error reports, (western) users seem to expect to get a
micro sign when entering "char(181)" (and similarly for other printable
characters at code points 128-255). If we implemented a fallback to
ISO-8859-1 or CP1252, we would follow the principle of least surprise in
that respect.

However, it is not clear to me at which level we should implement that
fallback conversion: For some users, it might feel "most natural" to see a "µ"
everywhere they use "char(181)" in their code. Others might be surprised if
the conversion from one type (double) to another type (char) and back leads to
a different result (even a different number of elements!).
If we don't do the validation when the char vector is created, there are
probably a lot of places where strings would have to be validated before we
use them.
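To spell out that last surprise: if we did the fallback conversion when the
char vector is created, a round trip would (hypothetically) look like this:

    >> x = double (char (181))  # char (181) would already hold [194 181]
    x =

       194   181
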

A similar question arises when reading strings from a file (fopen, fread,
fgets, fgetl, textscan, ...): Should we return the bytes as they are stored in
the file? Or should we rather ensure that the returned strings are valid?
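As an illustration, assume a file "latin1.txt" that contains the single byte
228 (an "ä" in ISO-8859-1); the file name is made up, but this is how the
functions behave today:

    fid = fopen ("latin1.txt", "r");
    raw = fread (fid, Inf, "*char")';  # currently: the raw byte, char (228)
    fclose (fid);
    ## Open question: keep returning the raw byte, or convert it to "ä"
    ## (i.e., the two bytes [195 164])?
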

Matlab doesn't have this problem (for western users) because it uses UTF-16
(or rather its subset UCS-2) instead of UTF-8. All characters from ISO-8859-1
have the same numeric value in UTF-16 (and equally in UCS-2).
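In other words, taking the micro sign as an example (the Matlab side is from
memory):

    % Matlab (UTF-16 code units):  double ('µ')  -->  181
    % Octave (UTF-8 bytes):        double ('µ')  -->  [194 181]
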

I am slightly leaning towards implementing some sort of fallback mechanism
(see e.g. bug #57107 [2], comment #17). But I'm open to any ideas on how to
implement that exactly.

Another "solution" would be to review our initial decision to use UTF-8. 
Instead, we could follow Matlab and use a "uint16_t" for our "char" class. But 
that would probably involve some major changes and a lot of conversions on 
interfaces to libraries we use.

Markus

[1]: http://hg.savannah.gnu.org/hgweb/octave/rev/94d490815aa8
[2]: https://savannah.gnu.org/bugs/index.php?57107
[3]: https://savannah.gnu.org/bugs/index.php?57133
[4]: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences


