octave-maintainers

Re: How should we treat invalid UTF-8?


From: Andrew Janke
Subject: Re: How should we treat invalid UTF-8?
Date: Mon, 4 Nov 2019 23:46:02 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

On 11/4/19 6:29 PM, "Markus Mützel" wrote:
> Am 05. November 2019 um 00:00 Uhr schrieb "Andrew Janke":
>> On 11/4/19 5:12 PM, "Markus Mützel" wrote:

Okay. I feel good about where this conversation is with respect to file
I/O and encodings.

>> Afraid I don't really know enough about GUI programming to be much help
>> here. But for Qt and Windows GUI widgets, they're not using C char *s
>> for their strings; they're using QString or Windows wchar_t/TCHAR/PWSTR
>> values, right? Which aren't UTF-8, so there's already gotta be a
>> translation point between Octave strings and the GUI toolkit's strings,
>> I'd think? That's where you'd slap your validation + fallback-char
>> replacement calls.
> 
> Unfortunately, the command window is not a Qt widget. The Windows 
> implementation in particular is a beast, because the Windows prompt that we 
> use has so many limitations, especially when it comes to variable-byte 
> encodings. (See e.g. the bug about output stopping completely after an 
> invalid byte.)

I'm out of my depth here. Guess I need to add "learn the Windows
terminal widget" to my long TODO list.

>> char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
>> 1-long UCS-2 Matlab string. We could have char(181) in Octave also give
>> you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
>> char string.
> 
> I prefer that approach of extending the logic to all Unicode code points to 
> my initial idea of only doing that for the first 256 (which seems odd now 
> thinking about it).
> But still: Do we really want this? That would lead to the same round trip 
> oddities.

We could totally round-trip it!

Define double(char) and char(double) to work along row vectors instead
of on individual elements. (You have to define char(double) this way for
the conversion I suggested above to make sense in a UTF-8 world anyway.)

Define double(char) as "takes a row vector of chars that contain UTF-8
(or whatever Octave's internal encoding is) and returns a row vector of
doubles that contain the sequence of Unicode code point values encoded
by those chars". That's the inverse of the char(double) I describe
above, and it should round-trip just fine.

Then you could say:

x = [ 121 117 109 hex2dec('1F34C') ];  % "yum🍌"
str = char (x);  % Get back 7-long char with values 0x79 0x75 0x6D 0xF0 0x9F 0x8D 0x8C
x2 = double (str); % Get back [121 117 109 127820]
isequal (x, x2);  % Returns true
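(Since this char(double)/double(char) behavior is only a proposal and won't run in today's Octave, here's the same round trip modeled in Python, whose strings are code-point sequences; the bytes object stands in for the proposed UTF-8 char row vector, and the function names are just illustrative.)

```python
# Model of the proposed char(double)/double(char) round trip.
# An "Octave char row vector" is modeled as a bytes object holding
# UTF-8; a "double row vector" is a list of code point values.

def char_from_double(code_points):
    """Proposed char(double): code points -> UTF-8 byte sequence."""
    return ''.join(chr(cp) for cp in code_points).encode('utf-8')

def double_from_char(utf8_bytes):
    """Proposed double(char): UTF-8 byte sequence -> code points."""
    return [ord(c) for c in utf8_bytes.decode('utf-8')]

x = [121, 117, 109, 0x1F34C]      # "yum" plus the banana emoji
s = char_from_double(x)           # 7 bytes: 79 75 6D F0 9F 8D 8C
x2 = double_from_char(s)
assert x2 == x                    # round-trips cleanly
assert len(s) == 7                # but length is in bytes, not chars
```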

Now, this wouldn't work for 2-D or higher-dimension arrays that contain
non-ASCII (>127) code points. But I think that's a small, acceptable
loss: IMHO, 2-D char arrays are terrible, and you should pretty much
never use them.

(For 2-D and higher arrays: still define it as operating
row-vector-wise, and if all the row vector operations result in outputs
that have compatible dimensions and can cat() cleanly, cat and return
that; else error with a "dimension mismatch" error.)
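The row-wise rule for higher dimensions could be sketched like this (again in Python as a stand-in, with illustrative names; the "cat cleanly" check reduces to all rows encoding to the same byte length):

```python
# Sketch of the row-wise 2-D rule: convert each row independently,
# then require compatible widths before "cat"-ing the results.

def char_from_double_2d(rows):
    encoded = [''.join(chr(cp) for cp in row).encode('utf-8')
               for row in rows]
    if len({len(r) for r in encoded}) > 1:
        raise ValueError('dimension mismatch')  # rows don't cat cleanly
    return encoded

# All-ASCII rows stay the same width and cat fine:
char_from_double_2d([[104, 105], [111, 107]])    # rows b'hi', b'ok'

# A non-ASCII code point makes its row encode wider, so this errors:
# char_from_double_2d([[104, 105], [181, 107]])  # raises ValueError
```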

This even gives you a win over Matlab: UCS-2 can't represent U+1F34C, so
Matlab squashes the banana into the 0xFFFF replacement character, and x2
does not equal x.

> Still, I am more focussed on the char(double) issue. The round trip oddities 
> would disappear (or become much less prominent) if we used a wider char 
> representation.

True. Having 16-bit chars would mean you could mostly define these
operations *elementwise* instead of vector-wise, and then get nice
round-trip results and support higher-dimensional arrays as long as you
stay in the Basic Multilingual Plane. And indexing becomes more
intuitive, especially for less-experienced users. (And maybe better
MAT-file compatibility?)
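To make the elementwise point concrete, here's a small Python model of a 16-bit char type, viewing a string as its UTF-16 code units (names illustrative):

```python
import struct

# Model a 16-bit char type: a string becomes a list of UTF-16 code
# units. For BMP-only text, one unit == one character, so elementwise
# double()/char() conversion round-trips with no vector-wise rule.

def to_units(s):
    data = s.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))

units = to_units('\u00b5m\u00b2')   # "µm²": one unit per character
assert units == [0xB5, 0x6D, 0xB2]

# Outside the BMP, one character costs two units (a surrogate pair):
assert to_units('\U0001F34C') == [0xD83C, 0xDF4C]
```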

Personally, I like being able to support non-BMP characters, because I
like working with emoji, and there are some mathematical symbols up
there that may be of interest to Octave users creating plots or
documents. Granted, unless you're doing sentiment analysis on Twitter
or Slack streams or something, the vast majority of your text is going
to be all BMP. But as
long as your 16-bit char type passes through surrogate pairs unmolested,
you can still use non-BMP characters; it's just a bit less convenient to
construct them in code. And all that could easily be wrapped up in
user-defined helper functions.
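The kind of helper I have in mind is tiny; a sketch (in Python, with an illustrative name, but the arithmetic is just the standard UTF-16 surrogate math):

```python
# Build the UTF-16 surrogate pair for a non-BMP code point by hand.

def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    assert cp > 0xFFFF
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

high, low = to_surrogate_pair(0x1F34C)   # banana emoji
assert (high, low) == (0xD83C, 0xDF4C)

# As long as the char type passes the pair through untouched, the two
# 16-bit units decode back to the single U+1F34C character:
s = (chr(high) + chr(low)).encode('utf-16', 'surrogatepass').decode('utf-16')
assert s == '\U0001F34C'
```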

My personal desire has always been to see Octave switch to a 16-bit
UCS-2 or UTF-16 char type, because maximal Matlab compatibility is my
highest hope. (I'm coming from an enterprise background where we'd like
to maybe some day use Octave to replace Matlab for some workloads, with
minimal porting effort for our existing M-code. And I think there are
maybe other people in that situation.) But maybe that's not what's best
for the Octave community as a whole, or feasible for the Octave developers.

Maybe the best way to approach this is to discuss what use cases or
coding techniques a wider char type enables. From what I can see, the
big thing you get with UCS-2 or UTF-32 (and UTF-16 if you're sloppy
about it) is random access for characters: for a char array str and
integer i, str(i) is a single character, which is also a scalar char.
That's very useful if you want to do character-wise manipulation of
strings. I can think of lots of toy examples like "reverse a string" or
"replace a couple characters in a string with something else", but
those mostly show up in tutorials and coding interviews. Is this
something people actually want to do in practice?

The one use case I can think of that I've actually done in recent years
on the job is parsing of fixed-width-field format records. Like you have
a weather station identifier in the format "TSSZZZZZ-nnnn" where "T" is
a one-letter code for the station type, "SS" is the 2-character state
abbreviation, "ZZZZZ" is the zip code, "nnnn" is the ID number, and so
on. With 16-bit chars, you can do this conveniently with direct
character indexing, and can vectorize the operation using a 2-D char
array. With 8-bit UTF-8 chars, you can still do it, but you need an
intermediate step: a function that maps character indexes to
(start_index, end_index) pairs of byte offsets into the char array.
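To illustrate both sides (in Python, with a made-up record value in the "TSSZZZZZ-nnnn" layout above):

```python
# Hypothetical station record in the "TSSZZZZZ-nnnn" layout.
rec = 'ANY10001-0042'

# Fixed-width (16-bit style) view: field slicing is direct.
station_type = rec[0:1]    # 'A'
state        = rec[1:3]    # 'NY'
zip_code     = rec[3:8]    # '10001'
id_number    = rec[9:13]   # '0042'

# UTF-8 byte view: character index != byte index once any field can
# hold non-ASCII, so first map character positions to byte offsets.
raw = 'A\u00d1Y10001-0042'.encode('utf-8')   # 'Ñ' occupies two bytes
offsets, pos = [], 0
for ch in raw.decode('utf-8'):
    n = len(ch.encode('utf-8'))
    offsets.append((pos, pos + n))
    pos += n
assert offsets[1] == (1, 3)                  # char 1 spans bytes 1..3
state_bytes = raw[offsets[1][0]:offsets[3][0]]   # bytes of chars 1-2
assert state_bytes.decode('utf-8') == '\u00d1Y'
```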

Cheers,
Andrew


