octave-maintainers

Re: How should we treat invalid UTF-8?


From: Markus Mützel
Subject: Re: How should we treat invalid UTF-8?
Date: Mon, 4 Nov 2019 23:12:11 +0100

On 4 November 2019 at 21:48, "Andrew Janke" wrote:
> Hi all,
>
> I'm coming around to the idea that Octave should be conservative and
> strict about encodings at I/O and library boundaries, and lean toward
> erroring out or using replacement characters, and not doing any
> mixed-encoding fallback mechanisms. At least for our basic stuff like
> fopen/fread/csvread. I think it would support higher-quality code, and
> it would be easier for users to understand and diagnose, given a little
> explanation.
>
> I don't think we can fully protect users from having to know about
> character encodings, and having to know what encoding their input data
> is in. And trying to get fancy there could make it harder to do the
> "right" thing when program correctness is important.

I agree.

> > There are different approaches for how to handle invalid byte
> sequences in UTF-8 [...]
>
> One note: I don't think this is strictly about invalid byte sequences in
> UTF-8, but rather invalid byte sequences in text data in any encoding.

We should primarily focus on UTF-8.

> My inclination is to handle invalid encoded byte sequences by:
>   1. When doing file input or output, raise an error immediately

I kind of like that approach. However, would that mean that a user would need 
to clean up any encoding errors with other tools before they would be able to 
read such files?
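To make the trade-off concrete (Python used here purely for illustration; the byte values are just examples): strict decoding errors out at the first invalid sequence, so the user has to fix or re-declare the input before reading it at all, whereas a replacement scheme always succeeds but marks the damage.

```python
# 0xE9 is 'é' in ISO-8859-1 but an invalid byte sequence in UTF-8.
data = b"caf\xe9!"

# Strict mode: fail immediately, as proposed for file I/O.
try:
    data.decode("utf-8")  # strict is the default
except UnicodeDecodeError as err:
    print(err.reason)  # invalid continuation byte

# Replacement mode: never fails, substitutes U+FFFD for the bad byte.
print(data.decode("utf-8", errors="replace"))  # caf\ufffd!
```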

>     a) That probably (maybe?) goes for encoding-aware text-oriented
> network I/O, like urlread(), too.

What could a user do about encoding errors in sources that are beyond their 
influence?

>   2. When doing transcoding explicitly requested by the user (like a
> unicode2native() call), raise an error unless the user explicitly
> requested a character-replacement or fallback scheme. (This would be a
> change from current behavior.)

"unicode2native()" currently fails on invalid UTF-8. Imho, it would probably be 
better to have a separate function that provides a (configurable?) fallback 
conversion.
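A separate converter with a configurable fallback could look roughly like this sketch (Python for illustration; the function name and the mode names are hypothetical, not an existing Octave API):

```python
def native2unicode_fallback(raw: bytes, mode: str = "replace") -> str:
    """Hypothetical fallback converter: decode UTF-8 bytes with a
    user-selected policy instead of always raising an error."""
    if mode == "strict":
        return raw.decode("utf-8")                    # error out
    if mode == "replace":
        return raw.decode("utf-8", errors="replace")  # insert U+FFFD
    if mode == "latin1":
        # Fall back to ISO-8859-1, which maps every byte to some char.
        return raw.decode("latin-1")
    raise ValueError(f"unknown mode: {mode}")

print(native2unicode_fallback(b"caf\xe9!", "replace"))  # caf\ufffd!
print(native2unicode_fallback(b"caf\xe9!", "latin1"))   # café!
```

The existing strict function could then stay strict, and users who knowingly have dirty input could opt in to a fallback explicitly.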

>   3. When passing text to a UI presentation element that Octave controls
> (like a GUI widget, a plot element, or terminal output), use the
> "invalid character" replacement character
> Where validation probably happens whenever you're crossing an encoding
> boundary or library/system-call boundary.

That might be hard (especially thinking of the command window on Windows). But 
it might be achievable.
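For the display path specifically, the idea would be a small sanitizing step at the boundary, so a rendering widget never sees invalid bytes and never has to error mid-draw (again a Python sketch, with hypothetical names):

```python
def sanitize_for_display(raw: bytes) -> str:
    """Replace invalid UTF-8 sequences with U+FFFD before handing
    text to a UI element (widget, plot label, terminal)."""
    return raw.decode("utf-8", errors="replace")

# 0xFF can never appear in valid UTF-8; it renders as the
# replacement character instead of aborting the paint.
print(sanitize_for_display(b"plot title \xff"))
```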

> Doing "smart" fallback is a convenience for users who are using Octave
> interactively and looking at their data as its processed, so they can
> recognize garbage (if the data set is small enough). But for automated
> processes or stuff with long processing pipelines, it could end up
> silently passing incorrect data through, which isn't good. And I think
> it would be nice if Octave would support those scenarios. Raising an
> error at the point of the conversion failure makes sure that the
> user/maintainer notices the problem, and makes it easy to locate (and
> with a decent error message, hopefully easy to Google to figure out what
> went wrong).

I agree that a smart fallback mechanism (maybe even including some heuristics) 
is probably *not* what we want. But maybe we could use a more 
"straightforward" fallback mechanism, if one exists.

> > Matlab doesn't have the same problem (for western users) because they
> don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
> encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
> in UCS-2).
> >
> > Another "solution" would be to review our initial decision to use
> UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
> "char" class. But that would probably involve some major changes and a
> lot of conversions on interfaces to libraries we use.
>
> I don't think that's why Matlab has it "easy" here. I think it's because
> a) all their text I/O is encoding-aware, and b) on Windows, they use the
> system default legacy code page as the default encoding, which gives you
> ISO-8859-1 in the West. The fact that Matlab's internal encoding is
> UCS-2 and that's an easy transformation from ISO-8859-1 is just an
> internal implementation detail.

I was more thinking of "Matlab compatibility" bug reports to come. Like: "Why 
does my code using char(181) work in Matlab but fail in Octave?"
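The char(181) case illustrates the compatibility gap nicely (shown in Python for illustration): as a Unicode code point, 181 is U+00B5 (micro sign), and its ISO-8859-1 byte value matches its code point, so a UCS-2 char array round-trips cleanly; but the lone byte 0xB5 is not valid UTF-8.

```python
# Code point 181 is the micro sign, and in ISO-8859-1 its byte value
# equals its code point -- which is why UCS-2 Matlab handles it "for free".
assert chr(181) == "\u00b5"
assert "\u00b5".encode("latin-1") == b"\xb5"

# In UTF-8 the same character needs two bytes...
assert "\u00b5".encode("utf-8") == b"\xc2\xb5"

# ...and the single byte 0xB5 on its own is invalid UTF-8.
try:
    b"\xb5".decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # invalid start byte
```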

> Matlab does have the opposite problem: if your input data is actually
> UTF-8 (which I think is the more common case these days) or if you want
> your code to be portable across OSes or regions, you need to explicitly
> specify UTF-8 or some other known encoding whenever your code does an
> fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
> garble your data.

Is that on all platforms? Or only on Windows?
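The silent-garbling failure mode Andrew describes is the classic mojibake pattern; decoding UTF-8 bytes under a one-byte legacy encoding never errors, because every byte maps to some character (Python sketch for illustration):

```python
# UTF-8 bytes read under an ISO-8859-1 default: no error is raised,
# the multi-byte sequence just turns into two wrong characters.
utf8_bytes = "café".encode("utf-8")      # b"caf\xc3\xa9"
garbled = utf8_bytes.decode("latin-1")   # every byte maps to *some* char
print(garbled)                           # cafÃ©  -- silently wrong
```

Nothing in the pipeline fails, so an automated process would propagate the corrupted text without any diagnostic.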

> If we changed Octave char to be 16-bit UTF-16 code points, we'd still
> have the same problem of deciding what to use for a default encoding,
> and what to do when the input didn't match that encoding.

I agree. Those are two separate questions. One is: What should be the size of 
one char in Octave? The other is: What should be the default encoding for 
reading (and writing) 8-bit sources?
But any fallback mechanism (if we wanted to have one) would depend on the 
answer to the former question.

Markus
