octave-maintainers

Re: locale encoding and core functions


From: Andrew Janke
Subject: Re: locale encoding and core functions
Date: Sat, 9 Mar 2019 12:07:46 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.5.3


On 3/9/19 10:10 AM, "Markus Mützel" wrote:

Your idea with .encoding files in each directory sounds promising. Maybe we should use 
".mfile-encoding" or some other, more specific name.

Yes, ".mfile-encoding" or similar is better; ".encoding" is too generic and there's no standard for it.

I'd rather not traverse up the directory tree to look for that file. When 
should we stop looking for that file? Should we traverse up until root? What 
should be done in case we reach a directory without read access?
I would also prefer to not parse each source file for a magic comment.
Both of these options also sound like they might hurt first-run performance.

Now that I think about it some more, sticking with one .mfile-encoding file per PATH entry is probably best. Octave projects tend to have few source dirs, so it's not a burden on users. It avoids your performance concerns, is easier to code, and won't interact in surprising ways with .mfile-encoding files that users stick elsewhere in their directory tree (which might not be included in source control!). For example, a user might think editing ~/.mfile-encoding is the way to use it, and then this feature is just making things more complicated.
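
To make the per-PATH-entry idea concrete, here is a rough sketch in Python (not Octave's actual C++ code, and not a proposal for the real implementation). The marker name ".mfile-encoding" is just the one floated in this thread, and the UTF-8 fallback and the function names are assumptions for illustration. The point is that the lookup only ever checks the PATH entry itself, so there is no upward traversal and no per-file parsing.

```python
import os

MARKER_NAME = ".mfile-encoding"   # hypothetical file name from this thread
DEFAULT_ENCODING = "utf-8"        # assumed fallback when no marker exists


def encoding_for_path_entry(path_entry, default=DEFAULT_ENCODING):
    """Return the encoding declared for one PATH entry.

    Only the entry itself is checked; parent directories are never searched.
    """
    marker = os.path.join(path_entry, MARKER_NAME)
    try:
        with open(marker, "r", encoding="ascii") as f:
            name = f.readline().strip()
    except (OSError, UnicodeDecodeError):
        # Missing, unreadable, or malformed marker: use the default.
        return default
    return name or default


def read_mfile(path_entry, filename):
    """Read a source file using the encoding declared for its PATH entry."""
    enc = encoding_for_path_entry(path_entry)
    with open(os.path.join(path_entry, filename), "r", encoding=enc) as f:
        return f.read()
```

With this shape, the cost is one extra file check per PATH entry, and a stray ~/.mfile-encoding has no effect unless the home directory itself is on the PATH.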


Figuring out the Matlab compatibility situation is difficult.
I think anything we'd do in that respect would automatically beat Matlab, which 
is ignorant of the source file encoding.

It's not about beating Matlab; it's about being able to exchange source file collections with Matlab users unmodified.

Reading between the lines (and using memories from the dim past), I
think Matlab always treats .m source files as being in the system
default encoding.
That is what I gathered as well.

Here's another weird edge case: If different .m files are going to be
interpreted as being in different encodings, how do strings with "\x"
escape sequences in those files work? Are those byte sequences produced
by the "\x" escapes interpreted as being in the same encoding as that
source file? Or are they always considered to be in the internal
encoding used by Octave's string objects? More generally, what
transcoding is applied to string literals in M source, and does the "\x"
escape interpretation happen before or after that transcoding? In either
of these scenarios, is it actually possible for a developer to portably
write a string literal that uses \x escapes to encode multibyte
international characters?
Do we automatically escape \x sequences when parsing .m files? Or is this 
something the interpreter does when processing double quoted strings?
In the latter case, I don't think that we have to worry about that.

I'm still unclear on whether Octave strings are internally always UTF-8, or are in the system default encoding. If they're UTF-8, this sounds fine; \x escapes are always UTF-8 bytes (code units). But if they're system default encoded, then the \x escape meaning will vary depending on the locale you're running Octave in.
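
To illustrate why this matters, here is a small Python sketch (Python only as a neutral stand-in; it says nothing about what Octave actually does internally). It takes the two bytes that a literal like "\xc3\xa9" would produce and interprets them under two different encodings:

```python
# The two bytes a "\xc3\xa9" escape sequence would produce.
raw = b"\xc3\xa9"

# Interpreted as UTF-8, the bytes form one character: U+00E9 (e with acute).
as_utf8 = raw.decode("utf-8")

# Interpreted as ISO-8859-1, each byte is its own character: U+00C3, U+00A9.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```

Same bytes, one character versus two, so a \x-escaped literal is only portable if the encoding used to interpret it is fixed rather than taken from the locale.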

Cheers,
Andrew


