[Octave-bug-tracker] [bug #57596] Should the "len" argument of "fgetl" a
From: Andrew Janke
Subject: [Octave-bug-tracker] [bug #57596] Should the "len" argument of "fgetl" and "fgets" mean bytes or characters?
Date: Wed, 10 Jun 2020 09:58:44 -0400 (EDT)
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:73.0) Gecko/20100101 Firefox/73.0
Follow-up Comment #8, bug #57596 (project octave):
Pretty sure we need to sort out what's going to happen with UTF-8 and char
semantics before doing this one. Otherwise it's not that bad: the
character-wise behavior for composite characters and the like is pretty well
established by other languages and the Unicode standard:
- If you're doing characterwise UTF-8, then "one character" is one Unicode
code point, however many bytes that's encoded as.
- If you want to be Matlab-compatible and are doing UCS-2, then "one
character" is always one two-byte UCS-2 code unit/code point.
- If you're doing UTF-16, "one character" should probably be one two-byte
UTF-16 code unit, not one Unicode code point.
- A Unicode combining character is still technically just one character and
one Unicode code point; you don't have to treat them specially at the I/O
level. It's up to the application code to determine the semantics of sequences
of characters that involve combining characters.
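To make the different meanings of "one character" in the list above concrete, here is a minimal sketch. It is written in Python rather than Octave, purely as an illustration; the helper name `unit_counts` and the sample strings are made up for this example and are not part of any Octave API. It counts the same text as Unicode code points, UTF-8 bytes, and two-byte UTF-16 code units, and then reads "2 characters" from a UTF-8 byte stream, assuming a character means a code point.

```python
import io

def unit_counts(text):
    """Return the length of text under three candidate meanings of 'character'."""
    return {
        "code_points": len(text),                                 # Python str length is in code points
        "utf8_bytes": len(text.encode("utf-8")),                  # size of the UTF-8 encoding
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,   # two bytes per UTF-16 code unit
    }

composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
astral = "\U0001F600"    # a code point outside the Basic Multilingual Plane

for label, s in [("composed", composed), ("decomposed", decomposed), ("astral", astral)]:
    print(label, unit_counts(s))

# Character-wise read from a UTF-8 byte stream: asking for 2 characters
# consumes 3 bytes here, because the e-acute is encoded as two bytes.
stream = io.TextIOWrapper(io.BytesIO("h\u00e9llo\n".encode("utf-8")), encoding="utf-8")
first_two = stream.read(2)
print(repr(first_two))
```

The decomposed string counts as two code points even though it displays as one glyph, which is the "combining characters are not special at the I/O level" point. The astral code point is one code point but two UTF-16 code units, so the code-point and code-unit definitions genuinely disagree there.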
And if you encounter an invalid byte sequence, then I think you should, and
pretty much have to, either throw an error or convert it to the Unicode
"replacement character" (U+FFFD). This behavior should probably be
caller-configurable on a per-filehandle basis, with throwing an error as the
default.
UCS-2 has no invalid byte sequences.
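As a rough illustration of the two policies for invalid byte sequences, here is a sketch of how one existing language exposes the choice per file handle. It uses Python's standard `io.TextIOWrapper`, whose `errors` argument is set when the handle is created; the wrapper function `open_text` and its `on_invalid` parameter are hypothetical names for this example only, and nothing here describes how Octave does or should implement it.

```python
import io

def open_text(raw_bytes, on_invalid="strict"):
    """Wrap a byte buffer as a UTF-8 text stream with a per-handle error policy."""
    return io.TextIOWrapper(
        io.BytesIO(raw_bytes),
        encoding="utf-8",
        errors=on_invalid,   # "strict" raises, "replace" substitutes U+FFFD
        newline="",          # leave line endings untranslated
    )

data = b"ab\xffcd\n"   # the byte 0xFF never appears in well-formed UTF-8

# Default policy: the invalid byte makes the read fail.
try:
    open_text(data).readline()
    strict_result = "no error"
except UnicodeDecodeError:
    strict_result = "error"

# Opt-in policy: the invalid byte becomes the replacement character.
replaced = open_text(data, on_invalid="replace").readline()

print(strict_result)
print(repr(replaced))
```

Because the policy is fixed when the handle is created, every later read on that handle behaves the same way, which is the per-filehandle configurability described above.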
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?57596>