From: | Andrew Janke |
Subject: | Re: locale encoding and core functions |
Date: | Sat, 9 Mar 2019 12:07:46 -0500 |
User-agent: | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.5.3 |
On 3/9/19 10:10 AM, "Markus Mützel" wrote:
Your idea with .encoding files in each directory sounds promising. Maybe we should use ".mfile-encoding" or some other name more specific.
Yes, ".mfile-encoding" or similar is better; ".encoding" is too generic and there's no standard for it.
I'd rather not traverse up the directory tree to look for that file. When should we stop looking for that file? Should we traverse up until root? What should be done in case we reach a directory without read access? I would also prefer to not parse each source file for a magic comment. Both of these options also sound like they might impact first run performance.
Now that I think some more, sticking with an .mfile-encoding for each PATH entry is probably best. Octave projects tend to have few source dirs, so it's not a burden on users. Avoids your performance concerns, easier to code, and it won't interact in surprising ways with .mfile-encoding files that users stick elsewhere in their directory tree (which might not be included in source control! e.g. maybe a user thinks editing ~/.mfile-encoding is the way to use it; now this feature is just making things more complicated.).
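A minimal sketch of the per-PATH-entry lookup described above, written in Python only for illustration. The file format (a single line holding an encoding name), the UTF-8 fallback, and the helper names `encoding_for_path_entry` and `read_mfile` are assumptions, not anything Octave implements. The lookup checks only the path entry itself and never walks up to parent directories.

```python
from pathlib import Path

# File name proposed in this thread; its one-line format is an assumption.
ENCODING_FILE = ".mfile-encoding"


def encoding_for_path_entry(path_entry, default="utf-8"):
    """Return the encoding declared for one load-path directory.

    Only the directory itself is checked; parent directories are never
    searched, so a stray ~/.mfile-encoding has no effect.
    """
    marker = Path(path_entry) / ENCODING_FILE
    try:
        text = marker.read_text(encoding="ascii")
    except (OSError, UnicodeDecodeError):
        # Missing, unreadable, or non-ASCII marker file: use the default.
        return default
    name = text.strip()
    return name or default


def read_mfile(path_entry, filename, default="utf-8"):
    """Decode one source file using its path entry's declared encoding."""
    enc = encoding_for_path_entry(path_entry, default)
    return (Path(path_entry) / filename).read_bytes().decode(enc)
```

Because the marker is read once per path entry, the cost is one small file read per directory rather than a scan of every source file.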
Figuring out the Matlab compatibility situation is difficult.
I think anything we'd do in that respect would automatically beat Matlab, which is ignorant of the source file encoding.
It's not about beating Matlab; it's about being able to exchange source file collections with them unmodified.
Reading between the lines (and using memories from the dim past), I think Matlab always treats .m source files as being in the system default encoding.
That is what I gathered as well.
Here's another weird edge case: If different .m files are going to be interpreted as being in different encodings, how do strings with "\x" escape sequences in those files work? Are those byte sequences produced by the "\x" escapes interpreted as being in the same encoding as that source file? Or are they always considered to be in the internal encoding used by Octave's string objects? More generally, what transcoding is applied to string literals in M source, and does the "\x" escape interpretation happen before or after that transcoding? In either of these scenarios, is it actually possible for a developer to portably write a string literal that uses \x escapes to encode multibyte international characters?
Do we automatically escape \x sequences when parsing .m files? Or is this something the interpreter does when processing double quoted strings? In the latter case, I don't think that we have to worry about that.
I'm still unclear on whether Octave strings are internally always UTF-8, or are in the system default encoding. If they're UTF-8, this sounds fine; \x escapes are always UTF-8 bytes (code units). But if they're system default encoded, then the \x escape meaning will vary depending on the locale you're running Octave in.
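To make the concern concrete, here is a small Python illustration (not Octave code) of how the same two bytes, as a literal like "\xC3\xBC" would produce them, mean different things depending on which encoding the string is assumed to hold. The helper name `interpret_escape_bytes` is hypothetical, and Latin-1 stands in for "some non-UTF-8 system default".

```python
def interpret_escape_bytes(raw, assumed_encoding):
    """Decode the bytes produced by \\x escapes under an assumed encoding."""
    return raw.decode(assumed_encoding)


# The two bytes that the escapes \xC3 \xBC would place in a string.
raw = b"\xc3\xbc"

# If strings are always UTF-8, the bytes are one character: U+00FC.
as_utf8 = interpret_escape_bytes(raw, "utf-8")

# If strings follow a Latin-1 locale, the same bytes are two characters:
# U+00C3 followed by U+00BC.
as_latin1 = interpret_escape_bytes(raw, "latin-1")

print(len(as_utf8), len(as_latin1))
```

Under the UTF-8 assumption the literal is portable; under the locale-dependent assumption the same source text yields a different string on each system.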
Cheers, Andrew