Re: Patch for unicode in varnames...

From: George
Subject: Re: Patch for unicode in varnames...
Date: Mon, 05 Jun 2017 18:39:33 -0400

On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
> On 05/06/2560 15:52, George wrote:
> > 
> > there's not a reliable mechanism in place to run a script in a locale
> > whose character encoding doesn't match that of the script
> From my experience running such scripts is no problem, but correct
> rendering it might depend on the client/editor.
It depends on the source and target encodings. For most source/target pairs, 
there is some case where reinterpreting a string from the source encoding as a 
string in the target encoding (without proper conversion) yields a string that 
is invalid in the target encoding.
For instance, if a script were written in ISO-8859-1, many possible sequences 
involving accented characters would actually be invalid in UTF-8. (In UTF-8, a 
multi-byte character must start with a byte from the 0xC0-0xFF range and be 
followed by the expected number of continuation bytes in the 0x80-0xBF range.) 
So if you had "Pokémon" as an identifier in a Latin-1-encoded script (byte 
value 0xE9 between the "k" and "m") and then tried running that script in a 
UTF-8 locale, that byte sequence (0xE9 0x6D) would actually be invalid in 
UTF-8, so Eduardo's patch would flag the identifier as invalid and refuse to 
run the script.
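This is easy to reproduce with iconv (a sketch, assuming an iconv binary is 
available): the Latin-1 byte sequence for "Pokémon" is rejected when treated 
as UTF-8, but converts cleanly once its real encoding is declared.

```shell
# Latin-1 "Pokémon" contains the raw byte 0xE9; treated as UTF-8 it is invalid.
printf 'Pok\xe9mon' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  || echo "not valid UTF-8"

# Declaring the real source encoding converts properly (0xE9 -> 0xC3 0xA9).
printf 'Pok\xe9mon' | iconv -f ISO-8859-1 -t UTF-8    # prints Pokémon
```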
UTF-8 is a bit exceptional as variable-width encodings go, in that it is 
self-synchronizing and many possible byte sequences are not valid UTF-8 byte 
sequences. So converting _from_ UTF-8 tends to be less problematic than 
converting _to_ UTF-8. But there are still corner cases where reinterpreting a 
UTF-8 byte sequence as another variable-width encoding could result in failure 
(rather than just a strange character sequence). For example:
- When reinterpreting UTF-8 as GB-18030 (the current Chinese national standard) 
or EUC-JP (a Japanese encoding common on Unix systems), the final byte of a 
UTF-8 multi-byte character could be misinterpreted as the first byte of a 
GB-18030 or EUC-JP multi-byte character. If that UTF-8 character is followed 
by a byte that's not a valid trailing byte in the target encoding (for 
instance, any of the single-byte punctuation or numeral characters in 
0x20-0x39), the string would be invalid in the target encoding, and converting 
it to a wide-character string would fail.
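One concrete failure mode can be shown with iconv (a sketch; this case trips 
on a mid-sequence byte rather than the boundary shift described above): the 
UTF-8 encoding of "あ" is 0xE3 0x81 0x82, and read as EUC-JP the 0x81 falls 
outside the valid trailing-byte range, so conversion fails outright.

```shell
# UTF-8 "あ" is 0xE3 0x81 0x82. Read as EUC-JP, 0xE3 looks like the first
# byte of a two-byte character, but 0x81 is outside the valid 0xA1-0xFE
# trailing-byte range, so iconv rejects the string.
printf '\xe3\x81\x82' | iconv -f EUC-JP -t UTF-8 2>/dev/null \
  || echo "invalid as EUC-JP"
```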
...So basically: while there are many cases where a valid UTF-8 string could 
be reinterpreted without conversion to produce a valid string in another 
encoding, there are also cases where that conversion would fail.
It seems like Korn Shell's behavior is similar to Eduardo's patch: the 
session's locale settings determine the behavior of functions like isalpha() 
and mbstowcs(), and thus what is considered a "valid" identifier in various 
contexts. (Interestingly enough, "フフ" is a valid parameter name in a UTF-8 
script, but in an EUC-JP script it needs a leading underscore to work.)
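The locale dependence is visible in the bytes themselves (a quick sketch using 
iconv and od): the same name "フフ" is a different byte sequence in each 
encoding, so locale-driven checks like mbstowcs() naturally give different 
answers about it.

```shell
# "フフ" in UTF-8 vs. EUC-JP: same name, different bytes on disk.
printf 'フフ' | od -An -tx1                              # e3 83 95 e3 83 95
printf 'フフ' | iconv -f UTF-8 -t EUC-JP | od -An -tx1   # a5 d5 a5 d5
```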
Bash's present behavior seems a bit more cavalier: for instance, if a function 
name contains a byte outside the ASCII range, it's apparently accepted 
regardless of the locale settings. (This works out for encodings like 
ISO-8859-15, UTF-8, and EUC-JP, where bytes in the ASCII range always 
represent the corresponding ASCII character, but it's problematic for 
encodings like GB-18030, where bytes in the ASCII range are sometimes part of 
multibyte characters.) And while it works, it's not a great rule for how these 
characters (especially whitespace) are treated in the syntax.
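This is easy to check (a sketch against current bash; behavior may vary by 
version): even under the C locale, where isalpha() would reject the bytes of 
"é", bash accepts them in a function name.

```shell
# Under LC_ALL=C, the bytes 0xC3 0xA9 ("é") are not alphabetic characters,
# yet bash still accepts them in a function name and runs the function.
LC_ALL=C bash -c 'Pokémon() { echo defined; }; Pokémon'
```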
If Bash did go the route of using the locale to set the character encoding of a 
script, I think it would be best to have a mechanism a script can use
to define the character encoding for the whole script file up front, rather 
than setting LC_CTYPE to procedurally change the behavior of the shell.
This is because, in principle at least, the meaning of shell code shouldn't 
change based on the state of the shell. (That's not entirely true today; there 
are compatibility options that enable or disable certain keywords, and some of 
those keywords have specific syntax associated with them...) The
character encoding used to interpret a script can fundamentally change how a 
script is parsed (especially for encodings like GB18030 where bytes that
look like ASCII characters may actually be part of multi-byte characters) - so 
it should be allowed just once, at the start of parsing a file, rather
than at any point in the script's execution. And for scripts loaded with 
"source", such a script should be able to communicate its own character
encoding without impacting the locale settings of the shell loading the script.
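Lacking such a mechanism, the closest approximation today is to confine the 
locale change to a subshell, so sourcing a differently-encoded file doesn't 
leak locale settings into the calling shell (a sketch; ./legacy-euc.sh and the 
ja_JP.eucJP locale are hypothetical and may not exist on a given system):

```shell
# Source a (hypothetical) EUC-JP-encoded script with LC_CTYPE set only for
# the duration of the subshell; the parent shell's locale is untouched.
(
  export LC_CTYPE=ja_JP.eucJP   # assumption: this locale is installed
  . ./legacy-euc.sh
)
```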
