From: HJW
Subject: [Octave-bug-tracker] [bug #49348] Treat multi-byte characters as one character for char array
Date: Thu, 29 Oct 2020 10:38:22 -0400 (EDT)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36

Follow-up Comment #15, bug #49348 (project octave):

Thank you for your comments (and your time, obviously).

I had forgotten that you actually need uint32 to store all Unicode code
points, since they run up to 0x10FFFF (although you could consider that an
encoding as well). Most information sources only talk about 0x0000-0xFFFD
when talking about Unicode.

Instead of trusting the Matlab documentation, I should have tested it. The
documentation claims the default encoding is UTF-8 and goes so far as to imply
that chars are stored internally as UTF-8. It turns out Matlab actually uses
UTF-16. That explains why I thought it uses uint16 internally: up to 0xFFFF
the two are equivalent. Probably because of this I had assumed char was just a
wrapper for the numeric Unicode code point, which made intuitive sense to me.
That is why I expected a 1-to-1 mapping between code points and elements of
char arrays.
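
To illustrate why that 1-to-1 expectation breaks down, here is a minimal
sketch in Python (not Octave or Matlab code; the helper name
utf16_code_units is made up for this example). It only shows how the
encodings themselves behave: a code point up to 0xFFFF is a single UTF-16
code unit, while anything above that needs a surrogate pair, and UTF-8 needs
several bytes for either.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers.

    Hypothetical helper for illustration: encodes to little-endian UTF-16
    (no byte order mark) and reads the result back two bytes at a time.
    """
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


euro = "\u20ac"          # U+20AC, inside the 0x0000-0xFFFF range
emoji = "\U0001F600"     # U+1F600, outside that range

# One code point, one 16-bit unit: the unit equals the code point.
print([hex(u) for u in utf16_code_units(euro)])

# One code point, two 16-bit units (a surrogate pair).
print([hex(u) for u in utf16_code_units(emoji)])

# UTF-8 uses a variable number of bytes per code point.
print(len(euro.encode("utf-8")), len(emoji.encode("utf-8")))
```

So a char array holding 16-bit UTF-16 units would presumably need two
elements for the second character, and an array of UTF-8 bytes would need
three and four elements respectively.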

So it seems I need to change gears for Matlab as well. Thank you for taking
the time to educate me.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?49348>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



