octave-maintainers

Re: How should we treat invalid UTF-8?


From: Andrew Janke
Subject: Re: How should we treat invalid UTF-8?
Date: Mon, 4 Nov 2019 18:00:23 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0


On 11/4/19 5:12 PM, "Markus Mützel" wrote:
> Am 04. November 2019 um 21:48 Uhr schrieb "Andrew Janke":
> 
>>> There are different approaches for how to handle invalid byte
>> sequences in UTF-8 [...]
>>
>> One note: I don't think this is strictly about invalid byte sequences in
>> UTF-8, but rather invalid byte sequences in text data in any encoding.
> 
> We should primarily focus on UTF-8.

I don't think I agree: we're designing how Octave handles strings and
encodings in general, and we live in an international, multi-encoding
world. We should come up with a system that works for multiple encodings.


>> My inclination is to handle invalid encoded byte sequences by:
>>   1. When doing file input or output, raise an error immediately
> 
> I kind of like that approach. However, would that mean that a user would need 
> to clean up any encoding errors with other tools before they would be able to 
> read such files?

Yes and no. Yes, you would need to fix the read error somehow. In my
experience, this usually means that you're just using the wrong
encoding, and you just need to specify the right encoding instead of
modifying the input data. If there actually are encoding errors, then
your data is corrupt, and you should fix it up before slurping it into
Octave chars. You could do this with external tools. Or, if you wanted
to do it in Octave, you could read the file in binary mode, work with
the raw encoded bytes, and then transcode once it's cleaned up.
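A sketch of that binary-mode workflow (in Python rather than Octave,
since the idea is language-neutral; the bytes and encodings here are
made-up examples, not anything from the thread):

```python
# Read raw bytes, discover the strict decode fails, and recover by
# picking the encoding the data was actually written in.
data = b"caf\xe9 latte"  # a stray ISO-8859-1 byte in otherwise ASCII text

try:
    text = data.decode("utf-8")  # strict decode: refuses the invalid byte
except UnicodeDecodeError:
    # The "error" was really a wrong-encoding guess: re-decode as
    # ISO-8859-1 instead of mutating the input data.
    text = data.decode("iso-8859-1")

assert text == "café latte"
```

The point is that the fix happens on the byte side, before anything is
committed to the text-oriented char representation.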

>>     a) That probably (maybe?) goes for encoding-aware text-oriented
>> network I/O, like urlread(), too.
> What could a user do about encoding errors in sources that are beyond their 
> influence?

What can a user do about any data corruption in a data source beyond
their influence? Talk to the source to get it fixed, or write a tool to
correct it yourself. You could do this in Octave using byte-oriented
binary I/O. Or we could provide a "read with fallback-to-replacement
character" function or mode as a convenience; I just don't think that
should be the default, because we shouldn't silently lose data unless asked.

You can always fall back and look at the raw bytes using byte-oriented
I/O and munge them there. I just think we should be conservative about
what actually makes it into chars using the text-oriented I/O.
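That conservative-by-default, lossy-only-on-request split looks like
this in Python (again just an illustration of the policy, not Octave
API):

```python
corrupt = b"abc\xff\xfedef"  # bytes that are not valid UTF-8

# Conservative default: refuse to let corrupt data become chars.
try:
    corrupt.decode("utf-8")  # errors="strict" is Python's default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Opt-in convenience: each invalid byte becomes U+FFFD, visibly.
lossy = corrupt.decode("utf-8", errors="replace")

assert strict_ok is False
assert lossy == "abc\ufffd\ufffddef"
```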

>>   2. When doing transcoding explicitly requested by the user (like a
>> unicode2native() call), raise an error unless the user explicitly
>> requested a character-replacement or fallback scheme. (This would be a
>> change from current behavior.)
> 
> "unicode2native()" currently fails on invalid UTF-8. 

Ah! Okay, I just misread the helptext for it. I think that is a good
behavior.

> Imho, it would probably be better to have a separate function that provides a 
> (configurable?) fallback conversion.

I agree.
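Such a separate, configurable fallback-conversion function might look
roughly like this (a hypothetical helper sketched in Python; the name
and signature are made up for illustration):

```python
def to_bytes_with_fallback(text, encoding, fallback="replace"):
    """Hypothetical helper: explicit transcoding with a user-chosen
    fallback policy ("strict" raises, "replace" substitutes)."""
    return text.encode(encoding, errors=fallback)

# The strict policy raises on unencodable characters ...
try:
    to_bytes_with_fallback("snowman \u2603", "iso-8859-1", fallback="strict")
    raised = False
except UnicodeEncodeError:
    raised = True

# ... while the replacement policy substitutes instead of failing.
assert raised
assert to_bytes_with_fallback("snowman \u2603", "iso-8859-1") == b"snowman ?"
```

Keeping the fallback in a separate, explicitly-invoked function means
the strict behavior of unicode2native() stays the safe default.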

>>   3. When passing text to a UI presentation element that Octave controls
>> (like a GUI widget, a plot element, or terminal output), use the
>> "invalid character" replacement character
>> Where validation probably happens whenever you're crossing an encoding
>> boundary or library/system-call boundary.
> 
> That might be hard (especially thinking of the command window on Windows). 
> But it might be achievable.

Afraid I don't really know enough about GUI programming to be much help
here. But for Qt and Windows GUI widgets, they're not using C char *s
for their strings; they're using QString or Windows wchar_t/TCHAR/PWSTR
values, right? Which aren't UTF-8, so there's already gotta be a
translation point between Octave strings and the GUI toolkit's strings,
I'd think? That's where you'd slap your validation + fallback-char
replacement calls.

>> Doing "smart" fallback is a convenience for users who are using Octave
>> interactively and looking at their data as it's processed, so they can
>> recognize garbage (if the data set is small enough). But for automated
>> processes or stuff with long processing pipelines, it could end up
>> silently passing incorrect data through, which isn't good. And I think
>> it would be nice if Octave would support those scenarios. Raising an
>> error at the point of the conversion failure makes sure that the
>> user/maintainer notices the problem, and makes it easy to locate (and
>> with a decent error message, hopefully easy to Google to figure out what
>> went wrong).
> 
> I agree that a smart fallback mechanism (maybe even including some 
> heuristics) is probably *not* what we want. But maybe we could use a more 
> "straight forward" fallback mechanism. (If that exists.)
> 
>>> Matlab doesn't have the same problem (for western users) because they
>> don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
>> encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
>> in UCS-2).
>>>
>>> Another "solution" would be to review our initial decision to use
>> UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
>> "char" class. But that would probably involve some major changes and a
>> lot of conversions on interfaces to libraries we use.
>>
>> I don't think that's why Matlab has it "easy" here. I think it's because
>> a) all their text I/O is encoding-aware, and b) on Windows, they use the
>> system default legacy code page as the default encoding, which gives you
>> ISO-8859-1 in the West. The fact that Matlab's internal encoding is
>> UCS-2 and that's an easy transformation from ISO-8859-1 is just an
>> internal implementation detail.
> 
> I was more thinking of "Matlab compatibility" bug reports to come. Like: "Why 
> does my code using char(181) work in Matlab but fail in Octave?"

I think that's kind of a different issue than I/O encoding. And
achieving a degree of Matlab compatibility would be feasible.

Mathematically speaking, Matlab's char(double) takes the input doubles,
narrows them to 16 bits, and casts them to char. Equivalently, you can
view Matlab's char(double) as working like this: it treats the input
doubles as numeric Unicode code point values (not ISO-8859-1 or any
other encoding), and it converts those into the Matlab-native char
(UCS-2) values that represent those code points, squashing out-of-range
values to the 0xFFFF placeholder replacement character. We could get
roughly equivalent behavior by having Octave's char(double) also treat
its inputs as Unicode code points, and have it return the Octave-native
char (UTF-8) vector that represents that sequence of code points.

char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
1-long UCS-2 Matlab string. We could have char(181) in Octave also give
you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
char string.
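The byte-level claim is easy to check (shown in Python purely as a
calculator for the two encodings):

```python
# The micro sign is code point U+00B5 (decimal 181).
micro = chr(181)

# Matlab-native UCS-2/UTF-16: one 16-bit code unit, 0x00B5.
assert micro.encode("utf-16-be") == b"\x00\xb5"

# Octave-native UTF-8: two bytes, 0xC2 0xB5.
assert micro.encode("utf-8") == b"\xc2\xb5"
```

So the user-visible result of char(181) would be the same character in
both systems; only the internal byte representation differs.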

>> Matlab does have the opposite problem: if your input data is actually
>> UTF-8 (which I think is the more common case these days) or if you want
>> your code to be portable across OSes or regions, you need to explicitly
>> specify UTF-8 or some other known encoding whenever your code does an
>> fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
>> garble your data.
> 
> Is that on all platforms? Or only on Windows?

It is not on all platforms. Just Windows. On Linux, it uses your locale,
which will give you UTF-8 as the default encoding on most modern
systems. And on macOS, for some reason, it seems to default to
ISO-8859-1, even though that's not the system default encoding in any
sense that I'm aware of. So if you want cross-OS portable Matlab code,
you must always specify an encoding when calling fopen(), or write
OS-specific logic.

>> If we changed Octave char to be 16-bit UTF-16 code points, we'd still
>> have the same problem of deciding what to use for a default encoding,
>> and what to do when the input didn't match that encoding.
> 
> I agree. Those are two different pairs of shoes. One is: What should be the 
> size of one char in Octave? The other is: What should be the default encoding 
> for reading (and writing) 8bit sources?

Yep. And the first one is up for grabs. That depends on: What's a good
encoding in general? How much do we care about Matlab compatibility? Do
users need random access to characters within a string? Etc etc.

> But any fallback mechanism (if we wanted to have one) would depend on the 
> answer to the former question.

I don't think it does, really. Transcoding to your internal string
representation is a three-step process:

1. Parse the input byte sequence in the input encoding to get a sequence
of code points (characters) in the input character set.
2. Map those input character set code points to code points in your
internal character set.
3. Encode those internal character set code points into your internal
"char"/string type's encoding.

I think the fallback mechanism happens entirely in steps 1 and 2, as
long as your internal string representation uses a character set that
can represent whatever replacement characters you want. (And if you're
using Unicode, that's always true.) Whether Octave uses UTF-8, UTF-16,
or some other Unicode encoding only affects step 3.
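The three steps above can be sketched as follows (Python again as the
illustration language; with Unicode on both sides, step 2 is the
identity mapping):

```python
def transcode(raw, src_encoding, internal_encoding="utf-8"):
    """Sketch of the three-step pipeline. Any fallback policy would
    plug into steps 1 and 2; it is independent of step 3."""
    # Step 1: parse the input bytes into code points (characters).
    code_points = raw.decode(src_encoding)  # may raise, or apply a fallback
    # Step 2: map input-charset code points to the internal character
    # set -- identity here, since both sides are Unicode.
    # Step 3: encode into the internal "char" type's encoding.
    return code_points.encode(internal_encoding)

latin1 = b"caf\xe9"  # "café" in ISO-8859-1

# Only step 3 changes if the internal encoding changes:
assert transcode(latin1, "iso-8859-1") == b"caf\xc3\xa9"
assert transcode(latin1, "iso-8859-1", "utf-16-le") == b"c\x00a\x00f\x00\xe9\x00"
```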

Cheers,
Andrew


