From: | HJW |
Subject: | [Octave-bug-tracker] [bug #49348] Treat multi-byte characters as one character for char array |
Date: | Thu, 29 Oct 2020 10:38:22 -0400 (EDT) |
User-agent: | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36 |
Follow-up Comment #15, bug #49348 (project octave):

Thank you for your comments (and your time, obviously).

I had forgotten that you actually need uint32 to properly store all Unicode code points (although you could consider that an encoding as well). Most information sources only talk about 0x0000-0xFFFD when talking about Unicode.

Instead of trusting the Matlab documentation (which claims the default encoding is UTF-8, going so far as to imply chars are stored internally as UTF-8), I should have tested it. It turns out Matlab actually uses UTF-16. That explains why I thought it uses uint16 internally: up to 0xFFFF they are equivalent.

Probably because of this, I had assumed char was just a wrapper for the numeric Unicode code point, which made intuitive sense to me. That is why I expected a 1-to-1 mapping between code points and elements of char arrays.

So it seems I need to change gears for Matlab as well. Thank you for taking the time to educate me.

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?49348>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
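
To illustrate the UTF-16 point made in the comment above, here is a minimal Python sketch (not Octave or Matlab code). It counts how many UTF-16 code units and UTF-8 bytes a single code point occupies. The helper names `utf16_units` and `utf8_bytes` and the sample characters are made up for this illustration; only the standard `str.encode` method is used.

```python
# Count the storage units one code point needs in UTF-16 and UTF-8.
# Helper names and sample characters are illustrative assumptions.

def utf16_units(s):
    # "utf-16-le" emits no byte order mark; each code unit is 2 bytes.
    return len(s.encode("utf-16-le")) // 2

def utf8_bytes(s):
    # UTF-8 uses 1 to 4 bytes per code point.
    return len(s.encode("utf-8"))

# One character from each UTF-8 length class; the last one lies
# above 0xFFFF and therefore needs a surrogate pair in UTF-16.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}: utf16_units={utf16_units(ch)} utf8_bytes={utf8_bytes(ch)}")
```

For the first three samples the UTF-16 count is 1, which is why a uint16 view and a code point view look identical up to 0xFFFF. The last sample takes two UTF-16 code units, so a 1-to-1 mapping between code points and array elements no longer holds there.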