
[Octave-bug-tracker] [bug #63930] fprintf problem


From: Markus Mützel
Subject: [Octave-bug-tracker] [bug #63930] fprintf problem
Date: Tue, 28 Mar 2023 09:46:23 -0400 (EDT)

Follow-up Comment #32, bug #63930 (project octave):

Sorry for the long text. But this is kind of complicated, and we should try to
give it some structure if we want to fix this.

Re-encoding the .m files doesn't necessarily mean that the output encoding
changes. We should really separate this into "independent" steps in our
minds:

1. The parser reads the code from the file.
In this step, the *input encoding* matters. That is what `__mfile_encoding__`
was originally meant to affect. 
In older versions of Matlab and Octave this was always the encoding matching
the system locale. In newer versions of Matlab, it seems to use heuristics to
parse files either in the encoding matching the system locale or UTF-8. In
"new-ish" versions of Octave, the encoding for the parser can be set using the
`__mfile_encoding__` function.

2. Content is written to an output stream with the `f*` family of functions.
In this step, the *output encoding* matters.
In Matlab, this always happened (and still does) using the encoding matching
the system locale afaict. In older versions of Octave, `char` arrays were
actually treated as byte arrays. So, they were just written as is. But that
caused issues when we actually needed to know which string a given byte array
represented. So, for some time now, `char` arrays have been UTF-8 encoded
strings in Octave. Before Octave 8 (or 7?), the `f*` family of functions didn't
convert anything. So, they either wrote whatever bytes the `char` array
contained (in old Octave) or UTF-8 (in newer Octave).
However, the latter caused incompatibilities with Matlab (which writes in the
system locale by default). (Or at least, it used to do that at some point.)
To be closer to Matlab's behavior, Octave was changed to use the encoding
selected by `__mfile_encoding__` (i.e., the input encoding used by the parser)
also for writing to files by default.
Since Octave now also implements the option to specify the encoding when
opening a stream, it is also possible to specify an encoding differing from
the encoding used by the parser.
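
To make the separation of the two steps concrete, here is a minimal sketch in Python (not Octave code, and not how Octave implements this). The sample source line, the choice of ISO-8859-1 as the file encoding, and the helper name `write_bytes` are all assumptions made for illustration only.

```python
# Step 1: the parser reads bytes from a file using the *input* encoding.
# Assumption: the file was saved as ISO-8859-1.
source_bytes = 'x = "ä";'.encode("iso-8859-1")

# Correct input encoding: the text is recovered as intended.
text = source_bytes.decode("iso-8859-1")

# Wrong input encoding: the single byte 0xE4 is not valid UTF-8,
# so it ends up as a replacement character.
misread = source_bytes.decode("utf-8", errors="replace")

# Internally, the string is held as UTF-8 (as in newer Octave).
internal = text.encode("utf-8")


# Step 2: writing to a stream uses the *output* encoding, which is
# independent of the encoding used in step 1.
def write_bytes(internal_utf8, output_encoding):
    """Hypothetical helper: convert the internal UTF-8 bytes for output."""
    return internal_utf8.decode("utf-8").encode(output_encoding)


as_utf8 = write_bytes(internal, "utf-8")
as_latin1 = write_bytes(internal, "iso-8859-1")
```

The same internal string ends up as different bytes on disk depending only on the output encoding, regardless of how the source file itself was encoded.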

There is also a third case where encoding matters: the encoding the built-in
editor uses when saving files. Theoretically, that could be yet another
differing encoding. But to avoid surprises, we decided to keep the encoding
that the parser uses and the encoding the editor uses in sync.

Additionally, there is a difference between using the GUI and the CLI: The
interpreter itself doesn't have options that persist between sessions. The GUI
does. That means that if you change the encoding used by the built-in editor
(and hence the parser), that setting will persist when you re-start Octave. In
contrast, if you change the encoding used by the parser with
`__mfile_encoding__` in the CLI, that setting won't persist (unless you do
that in one of the startup files).

As a side note: The reason why this seems to be less of an issue for Matlab
might be that they are using UTF-16 encoded strings for the internal
representation of their char arrays. Most strings consist only (or mainly) of
characters from within the BMP. So, they are represented by a single code unit
in UTF-16 (compared to surrogate pairs for characters outside the BMP).
Octave on the other hand chose to use UTF-8 for which only ASCII characters
can be represented with a single code unit (a byte). All other characters need
2 to 4 bytes to be represented. So, for Octave it is "much more likely" that a
buffer might "happen to end" in the middle of a multi-byte sequence.
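
A small Python sketch (for illustration only, not Octave or Matlab code) of the code unit counts mentioned above, and of what happens when a buffer boundary falls inside a multi-byte sequence. The sample characters, the string "Grüße", and the buffer size of 3 bytes are arbitrary assumptions.

```python
samples = ["a", "ä", "€", "😀"]


def utf8_units(ch):
    # One UTF-8 code unit is one byte.
    return len(ch.encode("utf-8"))


def utf16_units(ch):
    # One UTF-16 code unit is two bytes.
    return len(ch.encode("utf-16-le")) // 2


# Maps each character to (UTF-8 code units, UTF-16 code units).
counts = {ch: (utf8_units(ch), utf16_units(ch)) for ch in samples}

# A buffer boundary that falls in the middle of "ü" (2 bytes in UTF-8).
data = "Grüße".encode("utf-8")
first, second = data[:3], data[3:]

try:
    first.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    # The first buffer ends with an incomplete sequence.
    split_ok = False
```

Only the character outside the BMP needs two code units in UTF-16, while every non-ASCII character needs at least two in UTF-8. That is why converting each buffer on its own goes wrong more easily with UTF-8.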


Skipping the encoding for byte streams is probably a step in the correct
direction. But it might still make sense to check what Matlab is doing. E.g.,
does writing "ä" (or any other multi-byte character) to a byte stream with
encoding set to "UTF-8" differ from a file that is written as a text stream
with the same encoding?

For the remaining issue of how to correctly convert the encoding on-the-fly
when writing to a stream, the following "solutions" come to mind:
1. Change the internal representation of character arrays in Octave to UTF-16
(from currently UTF-8). That would make it less likely for the internal buffer
to end "in the middle of a code point". And it would also make Octave more
similar to Matlab in that respect.
A couple of years back, we decided against that because we were trying to
avoid character conversions on too many interfaces. But now, the conversion
facilities in Octave are ready (and we are converting on many interfaces
anyway). Still, that isn't something we could even consider for a minor
release.

2. Try to improve our `codecvt_u8` that manages the conversion from or to
UTF-8. I haven't looked into that (and probably won't be able to do so in the
near future).
But the changes would probably need to happen close to this FIXME note:
https://hg.savannah.gnu.org/hgweb/octave/file/fcd97a68e5f7/liboctave/util/oct-string.cc#l635

3. As a short term "solution", we could change the default for output streams
back to UTF-8. If users need to write in a different encoding, calling `fflush`
on the stream frequently might be an (unreliable) workaround.
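
For the second idea, the general technique is a conversion that remembers incomplete trailing bytes between buffers. The following is a rough Python sketch of that technique using the standard library's incremental decoder. It is not Octave's `codecvt_u8` and says nothing about how the fix would look in C++; the function name `convert_chunks` and the chunk size of 3 bytes are made up for illustration.

```python
import codecs


def convert_chunks(chunks, output_encoding):
    """Convert UTF-8 chunks to output_encoding, carrying state across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = []
    for chunk in chunks:
        # Incomplete trailing bytes are held back inside the decoder
        # and completed by the next chunk.
        text = decoder.decode(chunk)
        out.append(text.encode(output_encoding))
    # Signal the end of input; raises if a sequence is still incomplete.
    out.append(decoder.decode(b"", final=True).encode(output_encoding))
    return b"".join(out)


data = "Grüße".encode("utf-8")
# Split into 3-byte buffers; the first boundary falls inside "ü".
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
result = convert_chunks(chunks, "iso-8859-1")
```

With the state carried over, the result is the same as converting the whole string at once, no matter where the buffer boundaries fall.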



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?63930>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/



