Re: RFE: Please allow unicode ID chars in identifiers

bug-bash


From:	Chet Ramey
Subject:	Re: RFE: Please allow unicode ID chars in identifiers
Date:	Tue, 13 Jun 2017 15:04:24 -0400
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.1.1

On 6/2/17 12:54 PM, tetsujin@scope-eye.net wrote:


> I agree that allowing Unicode in parameter names is problematic:
> - there are characters that should be equivalent in principle, but
> aren't (For instance, the Greek letter pi (π) and the mathematical
> symbol pi (𝛑) - in some fonts they may render the same, but they
> are distinct code points) Some of these characters will look like Bash
> syntax, but be encoded differently.

That problem boils down to how they are encoded in the current locale.

> - there are characters that are equivalent, but can be encoded
> multiple ways (For instance, 'é' may be encoded as $'\u00E9' or as
> $'e\u0309') - broadly speaking this falls under the scope of "Unicode
> Normalization" - it's a well-explored problem but not a trivial one.
> (And it gets much worse with Asian languages, for instance)

None of the locale functions tackle normalization.  It all comes down
to how characters are encoded in locale description files.
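
A minimal sketch of the normalization point, not taken from the thread: it
builds the precomposed and decomposed forms of 'é' from raw UTF-8 bytes, so
the result does not depend on the locale. The decomposed form here uses
U+0301 (combining acute accent), and the variable and function names are
illustrative.

```shell
#!/usr/bin/env bash
# Two encodings of the same visible character, written as raw UTF-8 bytes.
precomposed=$(printf '\303\251')    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed=$(printf 'e\314\201')    # "e" + U+0301 COMBINING ACUTE ACCENT

# The shell compares the stored bytes; no Unicode normalization is applied.
if [ "$precomposed" = "$decomposed" ]; then
    same=yes
else
    same=no
fi
echo "equal: $same"

# Hypothetical helper: with LC_ALL=C, ${#1} counts bytes rather than characters.
byte_len() ( LC_ALL=C; printf '%s' "${#1}" )
echo "bytes: $(byte_len "$precomposed") vs $(byte_len "$decomposed")"
```

Two identifiers that differ only in this way would therefore be two different
names unless the shell normalized them itself.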

> - Unicode introduces more whitespace characters - to allow Unicode
> glyphs in parameter names, one must also decide whether to interpret
> Unicode whitespace as whitespace (complicating parsing rules, word
> splitting rules, etc.) - or to treat only ASCII whitespace as
> whitespace (in which case Unicode whitespace can form part of a "word"
> without quoting - which could make such code visually confusing)

Sure, but if shell programmers are responsible, they'll avoid these
kinds of confusing constructs.
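
To illustrate the second alternative in the quoted text (only ASCII
whitespace separates words), here is a small sketch of how word splitting
treats a no-break space (U+00A0) under the default IFS. The string and
variable names are illustrative.

```shell
#!/usr/bin/env bash
# "a", U+00A0 NO-BREAK SPACE (raw UTF-8 bytes), "b", an ASCII space, "c".
s=$(printf 'a\302\240b c')

# With IFS unset, unquoted expansions are split on space, tab and newline only.
unset IFS
set -- $s
count=$#
echo "words: $count"
printf '<%s>\n' "$@"
```

The no-break space stays inside the first word, which is exactly the kind of
visually confusing result the quoted paragraph describes.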

> - As you pointed out, this requires the shell to somehow establish a
> convention governing the character set used to interpret shell scripts

It's actually the same one that is currently used: the current locale.


> I think the value is that it allows the programmer to express
> themselves in their native language. 

Sure. The question is whether the expressive power gained, and how widely
it's used, compensates for the cost of implementation, support, and
performance.  There are always tradeoffs.




> Greg raises a fair point: Some platforms will simply be unable to view
> scripts written with Unicode symbols...
> ...Likewise, some platforms will be unable to view scripts written
> with other features that are already present, such as Unicode string
> literals, or Unicode command names.

Introducing locale dependencies is dangerous: a variable name that contains
valid alphabetic characters in the writer's locale may not be valid in the
user's locale.  It's up to the writer to set the locale appropriately.
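
A rough sketch of that dependency, assuming the shell's [[:alpha:]] pattern
class follows the current locale: the same byte string is checked for
"alphabetic characters only" under two locales. The helper name
all_alpha_in and the locale en_US.UTF-8 are assumptions; substitute any
UTF-8 locale listed by `locale -a`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if every character of $1 is alphabetic
# according to the locale named in $2. Runs in a subshell so the caller's
# locale is left untouched.
all_alpha_in() (
    LC_ALL=$2
    case $1 in
        ''|*[![:alpha:]]*) exit 1 ;;
        *) exit 0 ;;
    esac
)

name=$(printf 'caf\303\251')    # "café" as UTF-8 bytes

if all_alpha_in "$name" C; then
    c_result="alphabetic"
else
    c_result="not alphabetic"
fi
echo "C: $c_result"

# The outcome here depends on whether the locale is installed.
if all_alpha_in "$name" en_US.UTF-8 2>/dev/null; then
    echo "en_US.UTF-8: alphabetic"
else
    echo "en_US.UTF-8: not alphabetic (or locale not available)"
fi
```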

> 
> But, on the other hand:
> - Even if your editor or terminal can't display the UTF-8 code, that
> doesn't mean the shell process can't RUN it.

As long as the locale is set appropriately.


> Frankly, making shell scripts broadly compatible (let alone
> "universally" compatible) is very, very difficult. Can you believe
> someone wrote a shell script that used the colon character as part of
> a "tar" archive name? They clearly didn't have GNU tar in mind when
> they wrote that...  This would be just another case of something
> useful-but-not-universally-compatible. (In other words, if the feature
> is added, it becomes one of those things you don't use if you want
> your script to be portable. Frankly if it were added to Bash now, most
> of us would have to wait 5-10 years for today's version of Bash to
> become "the oldest version our project must still support" before we
> could use the feature...  This is part of why I tend to be rather
> forward-looking when it comes to the question of what should be in the
> shell.)
> 
> To address your questions on related design decisions that would have
> to be made for this to work, here's how I'd approach it for a
> pre-existing language like Bash with a long legacy:
> 
> 1: For an interactive session, the character encoding of commands is
> taken from the locale.

It depends on what you mean by `commands'. If you mean shell reserved
words and builtins, no: those use the existing portable character set.
If you mean shell functions, those already use a superset of the current
locale: the permitted character set for function names is that for
filenames, except when the shell is in posix mode. If you mean commands
read from the file system, these inherit any restrictions placed on them
by the file system.
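
A short sketch of the difference between function names and variable names,
as I understand current behavior. Each snippet runs in a fresh bash under the
C locale so the result does not depend on the caller's environment; the name
größe is only an example, and the sketch assumes bash is not started in posix
mode.

```shell
#!/usr/bin/env bash
# A function whose name contains non-ASCII characters can be defined and called.
out=$(LC_ALL=C bash -c 'größe() { echo "function called"; }; größe')
echo "function: $out"

# The same word followed by "=1" is not recognized as an assignment, so bash
# tries to run it as a command and fails.
LC_ALL=C bash -c 'größe=1' 2>/dev/null
status=$?
echo "assignment attempt exit status: $status"
```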

But that's not really what we're talking about here.

> 2: For a script, the character encoding of commands must be explicitly
> specified, probably via a shell option. 

You can already do this by setting the various locale environment
variables.
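
For example, a sketch of selecting the locale per invocation through
LC_ALL: the same snippet reports a different character count depending on
the locale it is started under. The helper name count_chars and the locale
en_US.UTF-8 are assumptions; if that locale is not installed, bash falls
back and the second result will match the first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start bash under the locale named in $1 and print the
# length, in characters, of the two-byte UTF-8 sequence for "é".
count_chars() {
    LC_ALL=$1 bash -c 's=$(printf "\303\251"); echo "${#s}"' 2>/dev/null
}

echo "C: $(count_chars C)"
echo "en_US.UTF-8: $(count_chars en_US.UTF-8)"
```

A script can get the same effect by assigning LC_ALL (or a more specific
LC_* variable) near its top instead of relying on the caller.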

> (Ideally I think it should be
> specified per-file, but I don't know if Bash supports any kind of
> per-file shell options. This is so, for instance, a non-Unicode
> session that sources a Unicode shell script does not become a Unicode
> session, or a Unicode script that sources a non-Unicode script does
> not interpret that script as Unicode.)

It does if it sets the locale.

> 3: If a script does not specify its character encoding, then the
> behavior is like current versions of Bash: multi-byte characters are
> supported in some contexts (quoted strings, command names, words) but
> not others (parameter names)

The current behavior does depend on the character encoding defined by
the current locale.

> 4: Sub-shell invocation, command/process substitution, etc. inherit
> the character encoding of the parent shell.

As it does now.

> 5: Enabling Unicode parsing of the script doesn't add Unicode
> whitespace characters to IFS - it affects how the script is
> interpreted, not how the parsed code operates.

No, we're staying away from locale-specific reserved words and builtin
commands. That has real negative security implications.

> 6: If POSIX mode is active, I think all of this gets disabled and
> Unicode characters are disallowed as parameter names.

As I said in another message, you should be able to enable an option to
allow locale-specific alphabetic characters even when you're in Posix
mode. It's just disabled by default.

> I think another angle worth considering is that most of what I
> outlined above is only important if you want to take advantage of it
> for other features. For instance:
> - In order to support transcoding, to allow (for instance) a Latin-1
> script to source a Unicode script and for equivalently-named entities
> in the two to connect.

This is very hard to do without some hint from the user about the locale
to use.

> - In order to support Unicode whitespace as part of the language
> syntax

This is a really bad idea in general; you don't want the same script to
parse differently in different locales.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/
