Re: How should we treat invalid UTF-8?
From: Markus Mützel
Subject: Re: How should we treat invalid UTF-8?
Date: Wed, 6 Nov 2019 10:57:59 +0100
On 5 November 2019 at 05:46, "Andrew Janke" wrote:
> On 11/4/19 6:29 PM, "Markus Mützel" wrote:
> > On 5 November 2019 at 00:00, "Andrew Janke" wrote:
> >> On 11/4/19 5:12 PM, "Markus Mützel" wrote:
>
> Okay. I feel good about where this conversation is with respect to file
> I/O and encodings.
>
> >> char(181) in Matlab gives you the micro sign, which is 0x00B5 as a
> >> 1-long UCS-2 Matlab string. We could have char(181) in Octave also give
> >> you the micro sign, which would be 0xC2 0xB5 as a 2-long UTF-8 Octave
> >> char string.
> >
> > I prefer that approach of extending the logic to all Unicode code points to
> > my initial idea of only doing that for the first 256 (which seems odd now
> > thinking about it).
> > But still: Do we really want this? That would lead to the same round trip
> > oddities.
>
> We could totally round-trip it!
>
> Define double(char) and char(double) to work along row vectors instead
> of on individual elements. (You have to define char(double) this way for
> the conversion I suggested above to make sense in a UTF-8 world anyway.)
>
> Define double(char) as "takes a row vector of chars that contain UTF-8
> (or whatever Octave's internal encoding is) and returns a row vector of
> doubles that contain the sequence of Unicode code point values encoded
> by those chars". That's the inverse of the char(double) I describe
> above, and it should round-trip just fine.
>
> Then you could say:
>
> x = [ 121 117 109 hex2dec('1F34C') ]; % "yum🍌"
> str = char (x); % Get back 7-long char with values 0x79 0x75 0x6D 0xF0 0x9F 0x8D 0x8C
> x2 = double (str); % Get back [121 117 109 127820]
> isequal (x, x2); % Returns true

You are right. I was still thinking that we wanted to implement this as a
fallback mechanism.
But if we always interpret double input to char() as Unicode code points
(resembling UTF-32), round trips would be safe.

Do we want char() on double input (and double() on char input) to do more
than a simple cast-like operation? If we can answer this question with "Yes",
I think we could be close to a possible solution.

What about single and the integer classes as input to char()? It would probably
be reasonable to do the same for them.
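
To make the idea concrete, here is a minimal sketch of what "double input as
Unicode code points" could look like. The helper names codepoints2char and
char2codepoints are hypothetical (they are not existing Octave functions), the
sketch assumes valid code points and well-formed UTF-8, and it is only an
illustration of the round trip, not a proposal for how char() or double()
would actually be implemented.

```octave
1;  % script-file marker so the functions below can be defined in a script

% Hypothetical helper: row vector of code points -> UTF-8 encoded char row vector.
% Assumes every value is a valid Unicode code point (no range checks).
function str = codepoints2char (cp)
  bytes = [];
  for c = cp(:).'
    if (c < 128)            % 1 byte:  0xxxxxxx
      b = c;
    elseif (c < 2048)       % 2 bytes: 110xxxxx 10xxxxxx
      b = [192 + floor(c / 64), 128 + mod(c, 64)];
    elseif (c < 65536)      % 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      b = [224 + floor(c / 4096), 128 + mod(floor (c / 64), 64), ...
           128 + mod(c, 64)];
    else                    % 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      b = [240 + floor(c / 262144), 128 + mod(floor (c / 4096), 64), ...
           128 + mod(floor (c / 64), 64), 128 + mod(c, 64)];
    endif
    bytes = [bytes, b];
  endfor
  str = char (bytes);
endfunction

% Hypothetical helper: UTF-8 encoded char row vector -> row vector of code points.
% Assumes the input is well-formed UTF-8 (no validation).
function cp = char2codepoints (str)
  b = double (str);
  cp = [];
  i = 1;
  while (i <= numel (b))
    if (b(i) < 128)
      n = 0;  c = b(i);          % single-byte character
    elseif (b(i) < 224)
      n = 1;  c = b(i) - 192;    % lead byte of a 2-byte sequence
    elseif (b(i) < 240)
      n = 2;  c = b(i) - 224;    % lead byte of a 3-byte sequence
    else
      n = 3;  c = b(i) - 240;    % lead byte of a 4-byte sequence
    endif
    for k = 1:n                  % fold in the continuation bytes
      c = c * 64 + (b(i+k) - 128);
    endfor
    cp(end+1) = c;
    i = i + n + 1;
  endwhile
endfunction

x   = [121 117 109 hex2dec("1F34C")];  % "yum" followed by U+1F34C
str = codepoints2char (x);             % 7 bytes: 79 75 6D F0 9F 8D 8C
x2  = char2codepoints (str);
isequal (x, x2)                        % round trip preserves the code points
```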
> Now, this wouldn't work for 2-D or higher-dimension arrays that contain
> non-ASCII (>127) code points. But I think that's a small, acceptable
> loss: IMHO, 2-D char arrays are terrible, and you should pretty much
> never use them.
>
> (For 2-D and higher arrays: still define it as operating
> row-vector-wise, and if all the row vector operations result in outputs
> that have compatible dimensions and can cat() cleanly, cat and return
> that; else error with a "dimension mismatch" error.)
>
> This even gives you a win over Matlab: UCS-2 can't represent U+1F34C, so
> Matlab squashes the banana into the 0xFFFF replacement character, and x2
> does not equal x.
>
> > Still, I am more focused on the char(double) issue. The round trip oddities
> > would disappear (or become much less prominent) if we used a wider char
> > representation.
>
> True. Having 16-bit chars would mean you could mostly define these
> operations *elementwise* instead of vector-wise, and then get nice
> round-trip results and support higher-dimensional arrays as long as you
> stay in the Basic Multilingual Plane. And indexing becomes more
> intuitive, especially for less-experienced users. (And maybe better
> MAT-file compatibility?)
>
> Personally, I like being able to support non-BMP characters, because I
> like working with emoji, and there's some mathematical symbols there
> that may be of interest to Octave users creating plots or documents. So
> unless you're doing sentiment analysis on Twitter or Slack streams or
> something, the vast majority of your text is going to be all BMP. But as
> long as your 16-bit char type passes through surrogate pairs unmolested,
> you can still use non-BMP characters; it's just a bit less convenient to
> construct them in code. And all that could easily be wrapped up in
> user-defined helper functions.
>
> My personal desire has always been to see Octave switch to a 16-bit
> UCS-2 or UTF-16 char type, because maximal Matlab compatibility is my
> highest hope. (I'm coming from an enterprise background where we'd like
> to maybe some day use Octave to replace Matlab for some workloads, with
> minimal porting effort for our existing M-code. And I think there are
> maybe other people in that situation.) But maybe that's not what's best
> for the Octave community as a whole, or feasible for the Octave developers.
>
> Maybe the best way to approach this is to discuss what use cases or
> coding techniques a wider char type enables. From what I can see, the
> big thing you get with UCS-2 or UTF-32 (and UTF-16 if you're sloppy
> about it) is random access for characters: for a char array str and
> integer i, str(i) is a single character, which is also a scalar char.
> That's very useful if you want to do character-wise manipulation of
> strings. Do people want to do that in practice? I can think of lots of
> toy examples like "reverse a string" or "replace a couple characters in
> a string with something else". But these operations mostly show up in
> tutorials and coding interviews. Is this something people actually want
> to do in practice?
We have the Octave-specific unicode_idx() function that might help in these
situations:
str = "aäbc";
str(unicode_idx (str)==2) % is the second character
But I agree that it adds complexity to use that function instead of simply
indexing into the string.
We could also add more functions that could better support more use cases.
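
As a rough sketch of the kind of character-wise operation such helper
functions could wrap, the following counts and reverses the characters of a
string using only unicode_idx(). The string is built from explicit bytes so
that the example does not depend on the encoding of the source file; the
variable names are just for illustration.

```octave
str = char ([97 195 164 98 99]);  % "aäbc" as UTF-8 bytes
idx = unicode_idx (str);          % character number of each byte: [1 2 2 3 4]

n_bytes = numel (str);            % number of char elements (bytes)
n_chars = idx(end);               % number of characters

% Reverse the string character by character; all bytes that belong to one
% character are moved together, so multi-byte sequences stay intact.
rev = "";
for k = n_chars:-1:1
  rev = [rev, str(idx == k)];
endfor
```

A plain str(end:-1:1) would instead reverse the bytes and split the two-byte
sequence for "ä".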
> The one use case I can think of that I've actually done in recent years
> on the job is parsing of fixed-width-field format records. Like you have
> a weather station identifier in the format "TSSZZZZZ-nnnn" where "T" is
> a one-letter code for the station type, "SS" is the 2-character state
> abbreviation, "ZZZZZ" is the zip code, "nnnn" is the ID number, and so
> on. With 16-bit chars, you can do this conveniently with direct
> character indexing, and can vectorize the operation using a 2-D char
> array. With 8-bit UTF-8 chars, you can still do that, but you have to do
> an intermediate step where you call a function that maps character
> indexes to (start_index, end_index) pairs that index into the byte
> offsets of the char array.
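
For what it's worth, that intermediate step can be sketched with
unicode_idx(). The record below follows the "TSSZZZZZ-nnnn" layout described
above; using "Ä" as the station type is an assumption made purely to get a
multi-byte character into the record, and the variable names are made up for
this example.

```octave
% Build the record from bytes so the example does not depend on the
% encoding of the source file. char ([195 132]) is "Ä" (U+00C4) in UTF-8.
id = [char([195 132]), "NY12345-0042"];

idx   = unicode_idx (id);               % character number of each byte
first = find ([true, diff(idx) > 0]);   % byte offset where each character starts
last  = [first(2:end) - 1, numel(id)];  % byte offset where each character ends

% Extract characters a..b, given as character positions (not byte positions).
field = @(a, b) id(first(a):last(b));

station_type = field (1, 1);    % "T"     -> "Ä" (two bytes)
state        = field (2, 3);    % "SS"    -> "NY"
zip_code     = field (4, 8);    % "ZZZZZ" -> "12345"
id_number    = field (10, 13);  % "nnnn"  -> "0042"
```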

The following use case also doesn't work in Octave (though it does in Matlab
with its wider chars). But it's probably bad coding style anyway:
a = "a";
a(end+1) = "ä";
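
A small sketch of why the assignment fails and of one alternative that works
with UTF-8 chars: "ä" occupies two char elements, so it cannot be assigned to
a single index, whereas concatenation does not care about the element count.
The bytes are written out explicitly so the example does not depend on the
source file encoding.

```octave
ae = char ([195 164]);  % "ä" as UTF-8 bytes (two char elements)
a  = "a";

failed = false;
try
  a(end+1) = ae;        % one index on the left, two elements on the right
catch
  failed = true;        % the assignment raises an error
end_try_catch

a = [a, ae];            % concatenation appends both bytes
```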
Markus