bug-bash
Re: bash variable names do not comply w/POSIX character set rules


From: Linda Walsh
Subject: Re: bash variable names do not comply w/POSIX character set rules
Date: Sun, 06 Dec 2015 17:14:39 -0800
User-agent: Thunderbird



Eduardo A. Bustamante López wrote:
 This definition
 (http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_230)
 states:

  3.230 Name

  In the shell command language, a word consisting solely of underscores,
  digits, and alphabetics from the portable character set. The first
  character of a name is not a digit.
----
  (1) -- It appears you /accidentally/ left out part of the text under
section 3.230.  The full text:

 3.230 Name

 In the shell command language, a word consisting solely of
 underscores, digits, and alphabetics from the portable character
 set. The first character of a name is not a digit.

 Note: The Portable Character Set is defined in detail in
 Portable Character Set [1].

 [1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tag_06_01
 3.231 ...[next section]
----
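
Read literally, the 3.230 definition can be sketched as a shell test. This is only an illustration under one assumption: that "alphabetics from the portable character set" means the ASCII letters A-Z and a-z. The function name is_posix_name is made up for the example and is not something bash or POSIX provides.

```shell
# Hypothetical helper: succeeds only if $1 is a Name as read from XBD 3.230
# (underscores, digits, ASCII letters; first character not a digit).
# The body runs in a subshell so the locale change does not leak out.
is_posix_name() (
    LC_ALL=C    # byte-wise matching, so A-Z and a-z mean ASCII letters only
    case $1 in
        '' | [0-9]* | *[!A-Za-z0-9_]*) exit 1 ;;
    esac
    exit 0
)

is_posix_name foo_1 && echo "foo_1: accepted"
is_posix_name 1foo  || echo "1foo: rejected"
is_posix_name "$(printf 'a\303\230b')" || echo "a-O-slash-b (UTF-8 bytes): rejected"
```

Under this strict reading the string with U+00D8 in it is rejected in either encoding, because none of its high bytes is an ASCII letter, digit, or underscore.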

   Thank you.  This slightly clarifies matters, as it relies only on
the POSIX source.  At the location the "Portable Character Set"
hyperlink points to (section 6.1, sentences 2-4), it states:

 Each supported locale shall include the portable character set,
 which is the set of symbolic names for characters in Portable
 Character Set. This is used to describe characters within the text
 of IEEE Std 1003.1-2001. The first eight entries in Portable
 Character Set are defined in the ISO/IEC 6429:1992 standard and
 the rest of the characters are defined in the ISO/IEC 10646-1:2000
 standard.
----
   FWIW, in full disclosure: the last dotted paragraph before the
last sentence of section 6.1 requires that each alphabetic character
fit within 1 byte -- i.e. only characters in what is commonly called
an "Extended ASCII character set" (e.g. ISO-8859-1) seem to be
required.  Note that the character 'Ø' is 1 byte in ISO-8859-1.  The
section quoted above uses (basically) the Unicode table for its
"symbolic names": while the reference is to ISO-10646 (Unicode), it
does not prescribe a specific encoding.
   ISO-8859-1 encodes Unicode values 0-255 with 1 byte each, which
satisfies the 1-byte POSIX constraint, but it cannot encode Unicode
values >=256.  That makes POSIX's reference to ISO-10646 somewhat
specious, as only the first 256 values can be encoded in 1 byte
(that I am aware of).
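
To make the encoding point concrete, here is a small sketch that counts the bytes U+00D8 occupies in each encoding. The byte values are written as octal escapes so the example does not depend on the terminal's own encoding; byte_len is a name invented for the illustration.

```shell
# Hypothetical helper: number of bytes in $1.
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

o_latin1=$(printf '\330')       # U+00D8 encoded as ISO-8859-1 (0xd8)
o_utf8=$(printf '\303\230')     # U+00D8 encoded as UTF-8 (0xc3 0x98)

byte_len "$o_latin1"    # prints 1
byte_len "$o_utf8"      # prints 2
```

Where iconv is installed, piping the one-byte form through `iconv -f ISO-8859-1 -t UTF-8` should produce the two-byte form, though that depends on the local iconv supporting both encodings.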

   Nevertheless, the symbolic name "LATIN CAPITAL LETTER O WITH STROKE
(o slash)" or 'U+00D8' is classified as an alphabetic, which is a subset
of the "alphanumeric" requirement of POSIX.
   Note under section 9.3.5 "RE Bracket Expression", subsection 6:

 The following character class expressions shall be supported in
 all locales:

 [:alnum:]   [:cntrl:]   [:lower:]   [:space:]
 [:alpha:]   [:digit:]   [:print:]   [:upper:]
 [:blank:]   [:graph:]   [:punct:]   [:xdigit:]

 In addition, character class expressions of the form:

 [:name:]

 are recognized in those locales where the name keyword has been
 given a charclass definition in the LC_CTYPE category.

Note that "aØb" is classified as fully "alphabetic" by bash's
character-class matching facility -- whether in UTF-8 or ISO-8859-1:

 echo $LC_CTYPE
en_US.ISO-8859-1
LC_CTYPE=en_US.UTF-8
...
 declare -xg a=$(echo -n $'\x61\xd8\x62')
 declare -xg b=${a}c
 [[ $a =~ ^[[:alpha:]]+$ ]] && echo alpha
alpha
 [[ $a =~ ^[[:alnum:]]+$ ]] && echo alnum
alnum
 [[ $b =~ ^[[:alpha:]]+$ ]] && echo alpha
alpha
 [[ $b =~ ^[[:alnum:]]+$ ]] && echo alnum
alnum
----
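
The result above depends on the active locale, which the following sketch tries to show. It assumes bash (which switches locale when LC_ALL is assigned) and assumes an en_US.UTF-8 locale is installed; all_alpha_in is a made-up helper name. If the named locale is missing, bash prints a warning and the second check simply prints nothing.

```shell
# Hypothetical helper: succeeds if every character of $2 matches [:alpha:]
# when the match is done under locale $1. Runs in a subshell so the
# caller's locale is left alone.
all_alpha_in() (
    LC_ALL=$1
    case $2 in
        '' | *[![:alpha:]]*) exit 1 ;;
    esac
    exit 0
)

s=$(printf 'a\303\230b')    # "aØb" encoded as UTF-8

all_alpha_in C "$s"           || echo "C: not all alphabetic"
# Prints only if the locale exists and classifies U+00D8 as alphabetic:
all_alpha_in en_US.UTF-8 "$s" && echo "en_US.UTF-8: all alphabetic"
```

In the C (POSIX) locale only the ASCII letters are alphabetic, so the same bytes that pass in a UTF-8 or ISO-8859-1 locale fail there.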
Notice that bash classifies the string "aØb" as both alphanumeric AND
alphabetic.  I.e., bash itself claims that "aØb" is a valid
identifier.

Also note, it accepts "aØb" as a var and as an environment var
when used indirectly:

 declare -xg $a='a"slash-O"b'
 declare -xg $b='ab"slash-O"c'
 env|/usr/bin/grep -P '^[ab]...?'|hexdump -C
00000000  61 d8 62 63 3d 61 62 22  73 6c 61 73 68 2d 4f 22  |a.bc=ab"slash-O"|
00000010  63 0a 61 d8 62 3d 61 22  73 6c 61 73 68 2d 4f 22  |c.a.b=a"slash-O"|
00000020  62 0a 61 3d 61 d8 62 0a  62 3d 61 d8 62 63 0a     |b.a=a.b.b=a.bc.|
0000002f

===



...
 So no, it does not mandate arbitrary unicode alphabetics. Only the ones
 listed there.
----
   Thank you.  This better makes the case, as it only refers to
the POSIX reference pages.  But it seems that it boils down to the
allowed definition of environment variables:
(http://pubs.opengroup.org/onlinepubs/9699919799/)

 2.5.3 Shell Variables

 Variables shall be initialized from the environment (as defined by
 XBD Environment Variables and the exec function in the System
 Interfaces volume of POSIX.1-2008) and can be given new values
 with variable assignment commands.


The XBD interface is a description of API facilities for programs to use --
not an end-user interface.  In particular, under section 8.1, it says:

(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08)

 Environment variable names used by the utilities in the Shell and
 Utilities volume of POSIX.1-2008 consist solely of uppercase
 letters, digits, and the <underscore> ( '_' ) from the characters
 defined in Portable Character Set and do not begin with a digit.
 Other characters may be permitted by an implementation;
 *****emphasis:
 applications shall tolerate the presence of such names.
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I.e. bash, as an application, should tolerate (allow) the presence of
user-defined names that don't fit the Portable Character Set definition,
though bash itself shouldn't create such names if it is to stay
compatible with the XBD API.
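
One way to see which names in an environment fall outside the portable pattern is to filter NAME=VALUE lines, as in this sketch. It assumes one variable per line, so a value containing a newline can produce a spurious entry, and it uses the shell Name rule (letters of either case) rather than the uppercase-only convention quoted above. nonportable_names is a name invented for the example.

```shell
# Hypothetical helper: read NAME=VALUE lines on stdin (e.g. from `env`)
# and print each NAME that is not made only of ASCII letters, digits and
# underscores, or that starts with a digit.
nonportable_names() {
    LC_ALL=C awk -F= '$1 !~ /^[A-Za-z_][A-Za-z0-9_]*$/ { print $1 }'
}

printf 'PATH=/bin\na-b=1\n_x9=2\n' | nonportable_names    # prints a-b
```

Typical use would be `env | nonportable_names`, which lists whatever odd names the current environment happens to carry.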

There are also multiple discussions that point out that UTF-8 is a
valid encoding, since the portable chars are all encoded as 1 byte, and
bytes above that range are not "state" dependent, but are multibyte.
(State dependent was described as a situation where you need to know
what state a character decoder is in, when starting to decode a new
character, in order to decode it -- it doesn't refer to the fact that
individual character entities take 1-4 bytes in std. Unicode.)  It was
also pointed out that UTF-8 is 8-bit clean, in that all binary values
can be encoded in UTF-8 and then decoded to get the original, same
text.
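
The "not state dependent" property can be illustrated with a byte-wise filter: every byte of a multibyte UTF-8 sequence has its high bit set, so the portable (7-bit) characters can be picked out without decoding anything. A minimal sketch, with ascii_only as a made-up name:

```shell
# Hypothetical helper: drop every byte in the range 0x80-0xff from stdin,
# keeping only the 7-bit bytes. LC_ALL=C makes tr work on bytes.
ascii_only() {
    LC_ALL=C tr -d '\200-\377'
}

printf 'a\303\230b\n' | ascii_only    # prints ab
```

The same filter applied to ISO-8859-1 input would also drop the single 0xd8 byte, which is the sense in which both encodings leave the portable characters untouched.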

(p.s.  the above is about 6 hours of internet research, so please
excuse internal sequencing oddities...getting a bit brain-dead on
researching this...;-)...).




