Re: Handle encoding of Octave strings
From: mmuetzel
Subject: Re: Handle encoding of Octave strings
Date: Thu, 17 May 2018 03:52:42 -0700 (MST)
> What does Matlab do? If your choice is different, I am sure that we
> will see bug reports about it.
In Matlab:
>> str = 'aäbc'
str =
aäbc
>> str(1)
ans =
a
>> str(2)
ans =
ä
>> str(3)
ans =
b
>> str(4)
ans =
c
>> whos str
  Name      Size      Bytes  Class    Attributes
  str       1x4           8  char
The 1x4 array takes 8 bytes, so in Matlab one "char" has a size of 2 bytes.
In Octave, on the other hand, one "char" has 1 byte.
Do we want to change the way Octave stores its char class? Initially I was
in favor of keeping the relation of 1 byte = 1 char (hence using UTF-8). But
it would make indexing more straightforward if we changed to UTF-16 (1
"char" = 2 bytes), at least for the BMP, which encompasses
characters from most current scripts.
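To make the indexing difference concrete, here is a minimal C++ sketch that compares the two encodings for the same 'aäbc' string. The function names (utf8_code_points, compare_encodings) are made up for illustration and are not part of Octave. The string is written with explicit escapes so the result does not depend on the encoding of the source file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of code points in well-formed UTF-8: count every byte that is not
// a continuation byte (bit pattern 10xxxxxx).
std::size_t utf8_code_points (const std::string& s)
{
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80)
      ++n;
  return n;
}

// Illustrative self-check, not Octave code.
void compare_encodings ()
{
  // U+00E4 (ä) is the two bytes 0xC3 0xA4 in UTF-8.
  const std::string u8 = "a\xC3\xA4" "bc";
  const std::u16string u16 = u"a\u00E4bc";

  // UTF-8: 5 bytes for 4 characters, so byte index != character index.
  assert (u8.size () == 5);
  assert (utf8_code_points (u8) == 4);
  assert (static_cast<unsigned char> (u8[1]) == 0xC3);  // only half of ä

  // UTF-16: one code unit per BMP character, so indexing is direct.
  assert (u16.size () == 4);
  assert (u16[1] == 0x00E4);                            // the whole ä
  assert (u16.size () * sizeof (char16_t) == 8);        // 8 bytes, as in whos

  // Outside the BMP the 1:1 relation breaks again: U+1F600 needs a
  // surrogate pair, i.e. two code units.
  const std::u16string non_bmp = u"\U0001F600";
  assert (non_bmp.size () == 2);
}
```

The last assertion is the caveat behind "at least for the BMP": with UTF-16, characters beyond the BMP would still span two "char" elements.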
A first step towards this could be to add "from_u8", "to_u8", ("from_u16",
"to_u16") methods to our char class.
Then we would need to identify all places in the code where we construct
char arrays from external sources (.m files, terminal, reading from files,
...) and where we pass strings to external sources (library functions,
writing to files, ...).
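As a rough idea of what such conversion helpers might look like, here is a minimal sketch written as free functions. The names from_u8 and to_u8 follow the proposal above, but the signatures are assumptions and nothing like this exists in Octave's char class at this point. The sketch assumes external data is UTF-8 and the internal form is UTF-16. It rejects truncated sequences and unpaired surrogates, but it does not check for overlong encodings or UTF-8 encoded surrogates, which a real implementation would have to do.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode external UTF-8 into internal UTF-16.
std::u16string from_u8 (const std::string& in)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size ())
    {
      unsigned char b0 = in[i];
      char32_t cp;
      std::size_t len;
      if (b0 < 0x80)                { cp = b0;        len = 1; }
      else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
      else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
      else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
      else throw std::invalid_argument ("invalid UTF-8 lead byte");
      if (i + len > in.size ())
        throw std::invalid_argument ("truncated UTF-8 sequence");
      for (std::size_t k = 1; k < len; ++k)
        {
          unsigned char b = in[i + k];
          if ((b & 0xC0) != 0x80)
            throw std::invalid_argument ("invalid UTF-8 continuation byte");
          cp = (cp << 6) | (b & 0x3F);
        }
      i += len;
      if (cp > 0x10FFFF)
        throw std::invalid_argument ("code point out of range");
      if (cp < 0x10000)
        out.push_back (static_cast<char16_t> (cp));
      else
        {
          // Beyond the BMP: split into a surrogate pair.
          cp -= 0x10000;
          out.push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
          out.push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
        }
    }
  return out;
}

// Hypothetical helper: encode internal UTF-16 as UTF-8 for external use.
std::string to_u8 (const std::u16string& in)
{
  std::string out;
  for (std::size_t i = 0; i < in.size (); ++i)
    {
      char32_t cp = in[i];
      if (cp >= 0xD800 && cp <= 0xDBFF)
        {
          // A high surrogate must be followed by a low surrogate.
          if (i + 1 >= in.size () || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
            throw std::invalid_argument ("unpaired high surrogate");
          cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
          ++i;
        }
      else if (cp >= 0xDC00 && cp <= 0xDFFF)
        throw std::invalid_argument ("unpaired low surrogate");

      if (cp < 0x80)
        out.push_back (static_cast<char> (cp));
      else if (cp < 0x800)
        {
          out.push_back (static_cast<char> (0xC0 | (cp >> 6)));
          out.push_back (static_cast<char> (0x80 | (cp & 0x3F)));
        }
      else if (cp < 0x10000)
        {
          out.push_back (static_cast<char> (0xE0 | (cp >> 12)));
          out.push_back (static_cast<char> (0x80 | ((cp >> 6) & 0x3F)));
          out.push_back (static_cast<char> (0x80 | (cp & 0x3F)));
        }
      else
        {
          out.push_back (static_cast<char> (0xF0 | (cp >> 18)));
          out.push_back (static_cast<char> (0x80 | ((cp >> 12) & 0x3F)));
          out.push_back (static_cast<char> (0x80 | ((cp >> 6) & 0x3F)));
          out.push_back (static_cast<char> (0x80 | (cp & 0x3F)));
        }
    }
  return out;
}

// Illustrative round trip at an "external -> internal -> external" boundary.
void check_round_trip ()
{
  const std::string external = "a\xC3\xA4" "bc";      // UTF-8 bytes of 'aäbc'
  const std::u16string internal = from_u8 (external);
  assert (internal.size () == 4);                     // one element per char
  assert (to_u8 (internal) == external);
}
```

With helpers of this kind, every place that reads from or writes to the outside world would call from_u8 or to_u8 exactly once at the boundary, and the rest of the code would only ever see the internal form.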
When this is done we might be able to switch the internal representation
from C-"char" to "uint16_t" without breaking everything...
Do you think that this is feasible?
Markus