bug-bash

Re: RFE: Please allow unicode ID chars in identifiers


From: Chet Ramey
Subject: Re: RFE: Please allow unicode ID chars in identifiers
Date: Tue, 13 Jun 2017 22:10:09 -0400
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.1.1

On 6/13/17 9:27 PM, George wrote:
> On Tue, 2017-06-13 at 20:14 -0400, Chet Ramey wrote:
>> On 6/13/17 5:19 PM, tetsujin@scope-eye.net wrote:
>>> In that case, the answer is simple: The shell swiftly rejects the
>>> script, and provides a clear reason why it cannot be run. ("bash: Script
>>> requires the en_US.utf8 locale which is not installed on this system.
>>> Sorry, dude.") 
>>
>>
>> The shell has no business doing this. If a script requires a certain
>> locale, and won't run correctly without it, the author can ensure that
>> an assignment to LC_CTYPE produces the desired results.
>>
> I already addressed this. Changing LC_CTYPE doesn't just impact how the
> shell interprets the script, it also changes how various other I/O
> operations occur, how filenames are processed, and (presumably, assuming
> locale is exported) the setting is inherited by commands run by the script
> as well.

LC_CTYPE affects case and collation, but not necessarily other I/O ops
or filename processing, except to the extent that they're sorted.  It
does affect regular expression character class processing.
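
A minimal sketch of the character-class effect, assuming bash. The helper name `is_alpha_in` is made up for this illustration and is not part of bash. It sets LC_ALL locally (which overrides LC_CTYPE) so the match runs under the named locale, and uses only the C locale so it does not depend on which locales happen to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $2 consists only of alphabetic characters
# when matched under the locale named by $1.
is_alpha_in() {
    local LC_ALL=$1          # scoped to this function; overrides LC_CTYPE
    [[ $2 =~ ^[[:alpha:]]+$ ]]
}

# ASCII letters are alphabetic in the C locale.
is_alpha_in C abc && echo "abc: alphabetic in C"

# The two bytes of a UTF-8 encoded e-acute are not alphabetic in the C locale.
is_alpha_in C $'\xc3\xa9' || echo "UTF-8 e-acute: not alphabetic in C"
```

Under an installed UTF-8 locale the second string would be expected to match instead. If the named locale is not installed, bash prints a warning and keeps the previous locale, so the result then reflects whatever locale was already in effect.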

And it's easy enough to avoid exporting LC_CTYPE. If you want to really
nail things down, you can export LC_ALL with its original value, unexport
LC_CTYPE, and make sure LANG is unset.
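
A rough sketch of the "avoid exporting LC_CTYPE" part, assuming bash (`export -n` is a bash builtin option). The function name `set_private_ctype` and the locale name `C` are placeholders chosen for this illustration. Note that a non-empty LC_ALL takes precedence over LC_CTYPE, so this only changes the shell's behavior when LC_ALL is unset or empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set LC_CTYPE for the running shell only, so that
# commands started by the script do not inherit it.
set_private_ctype() {
    LC_CTYPE=$1          # bash re-reads its locale when this is assigned
    export -n LC_CTYPE   # drop the export attribute if it was inherited
}

set_private_ctype C      # "C" stands in for whatever locale the script needs

# env is a child process; it should not see LC_CTYPE at all.
env | grep '^LC_CTYPE=' || echo "LC_CTYPE is not in the child environment"
```

Child processes then fall back to LANG (or LC_ALL, if set) for their own character handling. A single command can still be given a specific value with a temporary assignment such as `LC_CTYPE=some_locale command`.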

> If my system's locale were based on GB18030, and I run a shell script
> that's encoded in UTF-8, and the author of that script had the bright idea
> to set LC_CTYPE to en_US.utf8 to make the shell work in any locale - then I
> haven't succeeded in "running a script in a different locale than the one
> it was written in", because once LC_CTYPE has been reset I am no longer IN
> my system's locale for the duration of that script.

Quite true. And that's why the author would set LC_CTYPE -- if he embedded
something that required UTF-8 encoding and character classification. I
would argue that nobody interested in real portability would do that, but
we all know of plenty of non-portable scripts in wide distribution.

> The script doesn't necessarily require to run IN a particular locale, it
> needs to be INTERPRETED according to a certain locale, because locale
> settings influence how the parser works.

They really don't, not very much.  And if I add provisions to include the
alphanumerics of the current locale in the set of acceptable characters for
identifiers, that would affect the parser only to the extent that commands
are or are not interpreted as assignment statements. All of the shell's
syntactic elements are single-byte characters, or composed of single-byte
characters, so there are few ways for locale-specific character classes to
affect the parser.
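
To make the assignment-versus-command point concrete, here is a simplified sketch, assuming bash. The functions `is_identifier` and `classify_word` are invented for this illustration and do not mirror bash's internal code; they ignore forms such as `name+=value` and `name[subscript]=value`. The check is pinned to the C locale, which corresponds to the current ASCII-only identifier rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if $1 is an identifier under the ASCII-only rule
# (a letter or underscore, then letters, digits, or underscores).
is_identifier() {
    local LC_ALL=C           # keep the bracket ranges limited to ASCII
    [[ $1 =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]
}

# Hypothetical helper: report how a word in command position would be treated.
classify_word() {
    local word=$1
    # An assignment needs an "=" preceded by a valid identifier.
    if [[ $word == *=* ]] && is_identifier "${word%%=*}"; then
        echo assignment
    else
        echo command
    fi
}

classify_word 'foo=bar'            # prints "assignment"
classify_word $'\xc3\xa9=1'        # prints "command" (UTF-8 e-acute before "=")
```

Widening the character set accepted by `is_identifier` would move the second word from the "command" case to the "assignment" case, which is the only parser-visible change described above.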

(And for a shell script, running and being interpreted is a distinction
without a difference. You can't really "run" a script without interpreting
it.)

> 
>>> This is also why I think this should be an optional "encoding marker" at
>>> a fairly fixed location in the file, rather than an option setting that
>>> could occur anywhere in the script: It allows an incompatible script to
>>> be immediately identified and rejected before it does anything. 
>>
>>
>> This is relatively trivial to do with a shell function.
>>
> Sure, I just don't think that's the right answer.
> If this method of supporting cross-locale scripts were adopted (and
> honestly, that possibility seems pretty remote, but I'm enjoying the
> discussion anyway), this is a check we'd want in place for pretty much
> every script that uses the feature. Every time the script is run, we'd want
> to know first, "can the shell run this script?" - there's no point
> repeating that bit of boilerplate in every single script. And there's
> nothing out there that can better answer the question "Can the shell run
> this?" than the shell. And if the answer is "no" then it doesn't make sense
> for the shell to do anything BUT error out of processing the script. And we
> should get that answer without *running* the script.

A script that someone writes for his own use won't need it.  A script that
is distributed and expected to run on other systems won't need it unless it
chooses to use locale-specific identifiers. It's one of those portability
decisions -- if you want to do things a certain way, you need to make sure
things can be done that way. Locale verification isn't a feature that would
be used enough to make putting it into the shell worthwhile.
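
A minimal sketch of doing the check in a shell function, assuming bash 4 or later (for the `${var,,}` expansion) and a system that provides the POSIX `locale -a` command. The names `list_locales`, `normalize_locale`, `have_locale`, and `require_locale` are all invented for this illustration. The normalization is a rough guess at how systems spell codesets (`UTF-8` versus `utf8`), not a complete rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list installed locale names, one per line.
list_locales() {
    locale -a
}

# Hypothetical helper: lowercase a locale name and drop hyphens, so that
# "en_US.UTF-8" and "en_US.utf8" compare equal.
normalize_locale() {
    local name=${1,,}
    printf '%s\n' "${name//-/}"
}

# Succeed if the locale named by $1 appears in the installed list.
have_locale() {
    local wanted available
    wanted=$(normalize_locale "$1")
    while IFS= read -r available; do
        [[ $(normalize_locale "$available") == "$wanted" ]] && return 0
    done < <(list_locales)
    return 1
}

# Print a diagnostic and fail if the locale named by $1 is not installed.
require_locale() {
    if ! have_locale "$1"; then
        printf '%s: script requires the %s locale, which is not installed\n' \
            "${0##*/}" "$1" >&2
        return 1
    fi
}
```

A script that depends on a particular locale could call something like `require_locale en_US.UTF-8 || exit 1` near the top, before any locale-dependent identifiers are used.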

It would, however, be trivial to do with a loadable builtin.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/


