
Re: How should we treat invalid UTF-8?


From: Markus Mützel
Subject: Re: How should we treat invalid UTF-8?
Date: Wed, 6 Nov 2019 10:57:59 +0100

On 5 November 2019 at 05:46, "Andrew Janke" wrote:
> On 11/4/19 6:29 PM, "Markus Mützel" wrote:
> > On 5 November 2019 at 00:00, "Andrew Janke" wrote:
> >> On 11/4/19 5:12 PM, "Markus Mützel" wrote:
> 
> Okay. I feel good about where this conversation is with respect to file
> I/O and encodings.
> 
> >> char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
> >> 1-long UCS-2 Matlab string. We could have char(181) in Octave also give
> >> you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
> >> char string.
> > 
> > I prefer that approach of extending the logic to all Unicode code points to
> > my initial idea of only doing that for the first 256 (which seems odd now
> > that I think about it).
> > But still: Do we really want this? That would lead to the same round trip
> > oddities.
> 
> We could totally round-trip it!
> 
> Define double(char) and char(double) to work along row vectors instead
> of on individual elements. (You have to define char(double) this way for
> the conversion I suggested above to make sense in a UTF-8 world anyway.)
> 
> Define double(char) as "takes a row vector of chars that contain UTF-8
> (or whatever Octave's internal encoding is) and returns a row vector of
> doubles that contain the sequence of Unicode code point values encoded
> by those chars". That's the inverse of the char(double) I describe
> above, and it should round-trip just fine.
> 
> Then you could say:
> 
> x = [ 121 117 109 hex2dec('1F34C') ];  % "yum🍌"
> str = char (x);  % Get back 7-long char with values 0x79 0x75 0x6D 0xF0 0x9F 0x8D 0x8C
> x2 = double (str); % Get back [121 117 109 127820]
> isequal (x, x2);  % Returns true

You are right. I was still thinking that we wanted to implement this as a 
fallback mechanism.
But if we always interpret double input to char() as "Unicode code points" 
(resembling UTF-32), round trips would be safe.

Do we want char() on double (and vice versa) to do more than a simple 
cast-like operation? If we can answer that question with "yes", I think we 
could be close to a possible solution.

What about single and the integer classes as input to char()? It would probably 
be reasonable to do the same for them.
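
For illustration, the proposed behavior could look roughly like this (just a 
sketch of the idea, not what current Octave does; the exact rules for the 
integer classes would still need to be decided):

x = uint32 ([104 101 108 108 111 8364]);  % "hello" followed by U+20AC (€)
str = char (x);       % proposed: 8-long UTF-8 char array, i.e. "hello€"
x2 = double (str);    % proposed: [104 101 108 108 111 8364]
isequal (double (x), x2)   % would return true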

> Now, this wouldn't work for 2-D or higher-dimension arrays that contain
> non-ASCII (>127) code points. But I think that's a small, acceptable
> loss: IMHO, 2-D char arrays are terrible, and you should pretty much
> never use them.
> 
> (For 2-D and higher arrays: still define it as operating
> row-vector-wise, and if all the row vector operations result in outputs
> that have compatible dimensions and can cat() cleanly, cat and return
> that; else error with a "dimension mismatch" error.)
> 
> This even gives you a win over Matlab: UCS-2 can't represent U+1F34C, so
> Matlab squashes the banana into the 0xFFFF replacement character, and x2
> does not equal x.
> 
> > Still, I am more focused on the char(double) issue. The round trip oddities 
> > would disappear (or become much less prominent) if we used a wider char 
> > representation.
> 
> True. Having 16-bit chars would mean you could mostly define these
> operations *elementwise* instead of vector-wise, and then get nice
> round-trip results and support higher-dimensional arrays as long as you
> stay in the Basic Multilingual Plane. And indexing becomes more
> intuitive, especially for less-experienced users. (And maybe better
> MAT-file compatibility?)
> 
> Personally, I like being able to support non-BMP characters, because I
> like working with emoji, and there are some mathematical symbols there
> that may be of interest to Octave users creating plots or documents. So
> unless you're doing sentiment analysis on Twitter or Slack streams or
> something, the vast majority of your text is going to be all BMP. But as
> long as your 16-bit char type passes through surrogate pairs unmolested,
> you can still use non-BMP characters; it's just a bit less convenient to
> construct them in code. And all that could easily be wrapped up in
> user-defined helper functions.
> 
> My personal desire has always been to see Octave switch to a 16-bit
> UCS-2 or UTF-16 char type, because maximal Matlab compatibility is my
> highest hope. (I'm coming from an enterprise background where we'd like
> to maybe some day use Octave to replace Matlab for some workloads, with
> minimal porting effort for our existing M-code. And I think there are
> maybe other people in that situation.) But maybe that's not what's best
> for the Octave community as a whole, or feasible for the Octave developers.
> 
> Maybe the best way to approach this is to discuss what use cases or
> coding techniques a wider char type enables. From what I can see, the
> big thing you get with UCS-2 or UTF-32 (and UTF-16 if you're sloppy
> about it) is random access for characters: for a char array str and
> integer i, str(i) is a single character, which is also a scalar char.
> That's very useful if you want to do character-wise manipulation of
> strings. Do people want to do that in practice? I can think of lots of
> toy examples like "reverse a string" or "replace a couple characters in
> a string with something else". But these operations mostly show up in
> tutorials and coding interviews. Is this something people actually want
> to do in practice?

We have the Octave-specific unicode_idx() function that might help in these 
situations:
str = "aäbc";
str(unicode_idx (str) == 2)  % returns the second character, "ä"
But I agree that it adds complexity to have to use that function instead of 
simply indexing into the string.
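
For example, replacing the second character currently takes something like 
this (a sketch; with wider chars it would simply be str(2) = "o"):

str = "aäbc";
idx = unicode_idx (str);                  % [1 2 2 3 4]: maps each byte to its character
str = [str(idx < 2), "o", str(idx > 2)]   % replaces the "ä", giving "aobc"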

We could also add more functions to better support such use cases.
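
For instance, coming back to the surrogate-pair helpers that Andrew mentioned: 
wrapping a non-BMP code point into a UTF-16 surrogate pair only takes a few 
lines. A rough sketch (the function name is made up):

function units = codepoint_to_utf16 (cp)
  % Encode one Unicode code point as one or two UTF-16 code units.
  if (cp <= hex2dec ("FFFF"))
    units = cp;                      % BMP character: a single code unit
  else
    cp = cp - hex2dec ("10000");     % outside the BMP: build a surrogate pair
    units = [hex2dec ("D800") + floor (cp / 1024), ...
             hex2dec ("DC00") + mod (cp, 1024)];
  end
end

codepoint_to_utf16 (hex2dec ("1F34C"))   % [55356 57164], i.e. 0xD83C 0xDF4C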

> The one use case I can think of that I've actually done in recent years
> on the job is parsing of fixed-width-field format records. Like you have
> a weather station identifier in the format "TSSZZZZZ-nnnn" where "T" is
> a one-letter code for the station type, "SS" is the 2-character state
> abbreviation, "ZZZZZ" is the zip code, "nnnn" is the ID number, and so
> on. With 16-bit chars, you can do this conveniently with direct
> character indexing, and can vectorize the operation using a 2-D char
> array. With 8-bit UTF-8 chars, you can still do that, but you have to do
> an intermediate step where you call a function that maps character
> indexes to (start_index, end_index) pairs that index into the byte
> offsets of the char array.
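
With unicode_idx, that intermediate mapping could look roughly like this (a 
sketch; the station string and field positions are made up for illustration):

station = "TÖR10115-0042";             % type, 2-char state, zip code, ID
idx = unicode_idx (station);           % maps each byte to its character index
state = station(idx >= 2 & idx <= 3)   % characters 2-3, here "ÖR"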

The following use case also doesn't work in Octave (but does in Matlab with 
its wider chars). It's probably bad coding style anyway:
a = "a";
a(end+1) = "ä";
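
(The assignment fails with the current byte-oriented chars because "ä" is two 
bytes but only one element is indexed. Concatenation still works, assuming the 
script itself is UTF-8 encoded:)

a = "a";
a = [a, "ä"];   % appends both UTF-8 bytes of "ä"
double (a)      % [97 195 164]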

Markus



