bug-bash
Re: Patch for unicode in varnames...


From: George
Subject: Re: Patch for unicode in varnames...
Date: Tue, 06 Jun 2017 03:37:13 -0400

On Mon, 2017-06-05 at 16:16 -0700, L A Walsh wrote:
> George wrote:
> > On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
> > > On 05/06/2560 15:52, George wrote:
> > > > there's not a reliable mechanism in place to run a script in a locale
> > > > whose character encoding doesn't match that of the script
> > >
> > > From my experience running such scripts is no problem, but correct
> > > rendering it might depend on the client/editor.
> >
> > It depends on the source and target encodings. For most pairs of source and
> > target encoding there is some case where reinterpreting a string from the
> > source encoding as a string in the target encoding (without proper
> > conversion) will result in an invalid string in the target encoding.
> > For instance, if a script were written in ISO-8859-1, many possible
> > sequences involving accented characters would actually be invalid in UTF-8.
> ---
>     Um... I think you are answering a case that is different than
> what is stated (i.e. locale being same as used in script).  So no
> conversion should take place.

Eduardo's patch for allowing Unicode in parameter names uses information from
the locale setting to transform the string and test whether the characters in
it are valid characters (alphanumeric or underscore) for parameter names. But
that logic can only work correctly if the character set configured in the
locale is the same as the character set of the script. Otherwise, the byte
sequence is interpreted with the wrong character encoding, which can cause the
script to fail with an "invalid parameter name" error if:

1: The byte sequence in the script is not a valid string in the new encoding, or
2: The character sequence obtained by applying the new encoding is not a valid
   parameter name.
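
To make the two cases concrete, here is a rough Python sketch. It is not the
patch's code: is_valid_name() and classify() are hypothetical helpers, and
str.isalnum() is only standing in for whatever locale-aware test the patch
actually performs.

```python
def is_valid_name(name: str) -> bool:
    # Assumed rule (illustration only): letters, digits and underscore,
    # with no leading digit.
    if not name or name[0].isdigit():
        return False
    return all(ch == "_" or ch.isalnum() for ch in name)


def classify(raw: bytes, locale_encoding: str) -> str:
    # Interpret the script's bytes using the locale's encoding, as a
    # locale-driven check would have to.
    try:
        text = raw.decode(locale_encoding)
    except UnicodeDecodeError:
        return "undecodable"    # case 1: not a valid string in that encoding
    if not is_valid_name(text):
        return "invalid-name"   # case 2: decodes, but to the wrong characters
    return "ok"


name = "caf\u00e9"                        # "cafe" with an acute accent on the e
latin1_bytes = name.encode("iso-8859-1")  # as stored in an ISO-8859-1 script
utf8_bytes = name.encode("utf-8")         # as stored in a UTF-8 script

print(classify(latin1_bytes, "utf-8"))      # ISO-8859-1 script, UTF-8 locale
print(classify(utf8_bytes, "iso-8859-1"))   # UTF-8 script, ISO-8859-1 locale
print(classify(utf8_bytes, "utf-8"))        # script and locale agree
```

In the second call the two UTF-8 bytes of the accented letter are read as two
ISO-8859-1 characters, and the second of those is the copyright sign, which is
not alphanumeric.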

As it stands, it's possible in Bash to use bytes in the 0x80-0xFF range as part
of function names, for instance, because the Bash parser treats all of these
byte values as valid "word" characters. This makes the Bash parser fairly
"encoding neutral", which is why scripts using non-ASCII characters in command
names or function names work even if the script is run under a different locale
in current versions of Bash. Bash just ignores the issue, and that works for a
fair number of encodings. (Encodings like GB-18030 or Shift-JIS are exceptions,
because byte values in the 0x00-0x7F range can be part of a multi-byte
character, so Bash's parser may misinterpret part of a multi-byte character as
something else.)
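
A small Python sketch of the Shift-JIS exception, for illustration only. The
helper ascii_bytes_inside() is made up for this example, and U+30BD (a
katakana letter) is just one character whose Shift-JIS encoding ends in the
byte 0x5C, the ASCII backslash.

```python
def ascii_bytes_inside(text: str, encoding: str) -> bytes:
    # Collect bytes below 0x80 that occur inside multi-byte characters.
    # A byte-oriented parser would see these as ordinary ASCII characters.
    found = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) > 1:
            found.extend(b for b in raw if b < 0x80)
    return bytes(found)


katakana_so = "\u30bd"

# In Shift-JIS the second byte of this character is 0x5C (backslash).
print(ascii_bytes_inside(katakana_so, "shift_jis"))

# In UTF-8 every byte of a multi-byte character is 0x80 or above,
# so nothing is reported.
print(ascii_bytes_inside(katakana_so, "utf-8"))
```

A parser that works byte by byte would treat that 0x5C as an escape character,
which is the kind of misinterpretation described above.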

However, I wasn't talking about currently released versions of Bash; I was
talking about Eduardo's patch.

Broadly speaking, I think the approach taken in Eduardo's patch (interpreting
the byte sequence according to the rules of its character encoding) is better
than the approach taken in current versions of Bash (letting 0x80-0xFF slide
through the parser), but that approach only works if you know the correct
character encoding to use when processing the script. The information has to
be provided in the script somehow.


