bug-bash


From: Stephane Chazelas
Subject: [bug] [[ $'\Ux' = $'\Ux' ]] returns false for some values of x in some locales
Date: Fri, 4 Nov 2016 12:29:03 +0000
User-agent: Mutt/1.5.21 (2010-09-15)

Reproduced with bash 4.3 and 4.4 on Debian unstable and Ubuntu 16.04.

perl -le "printf q([[ $'\U%X' = $'\U%X' ]] || echo %06X: $'\U%X').\"\n\",
  \$_,\$_,\$_,\$_ for (1..0xd7FF, 0xE000..0x10FFFF)" |
  LC_ALL=zh_HK.big5hkscs bash | LC_ALL=C sed -n l

Where the perl command outputs:

[[ $'\U1' = $'\U1' ]] || echo 000001: $'\U1'
[[ $'\U2' = $'\U2' ]] || echo 000002: $'\U2'
[[ $'\U3' = $'\U3' ]] || echo 000003: $'\U3'
[[ $'\U4' = $'\U4' ]] || echo 000004: $'\U4'
....

for all valid (albeit not necessarily assigned, let alone available in any
charset) Unicode codepoints.

Gives:

0000CA: $
0000CB: \\u00CB$
0000EA: $
0000EB: \\u00EB$
00011A: \210\\$
0003B1: \243\\$
000436: \310\\$
003075: \307\\$
003618: \234\\$
003661: \215\\$
0044C0: \226\\$
004A35: \232\\$
004AA4: \207\\$
004E48: \244\\$
004F62: \312\\$
004FDE: \253\\$
005045: \324\\$
00509C: \330\\$
00515D: \242\\$
00529F: \245\\$
005412: \246\\$
00542D: \247\\$
0056ED: \373\\$
00577C: \251\\$
0057A5: \316\\$
00587F: \341\\$
0058A6: \274\\$
0058F0: \211\\$
005A09: \256\\$
005A16: \321\\$
005A2B: \230\\$
005AF9: \345\\$
005B1E: \351\\$
005B40: \304\\$
005C10: \311\\$
005CA4: \314\\$
005D24: \261\\$
005E4B: \335\\$
005EC4: \264\\$
0060DD: \325\\$
006127: \267\\$
0063CA: \331\\$
0064FA: \302\\$
00669D: \272\\$
0067AF: \254\\$
0067E6: \317\\$
0069D9: \342\\$
006A9D: \375\\$
006B7F: \252\\$
006C7B: \313\\$
006C94: \250\\$
006D82: \322\\$
006DDA: \262\\$
006EDC: \336\\$
006F7F: \346\\$
007019: \362\\$
007035: \364\\$
00712E: \332\\$
0071E1: \355\\$
00727E: \326\\$
0072D6: \315\\$
007366: \352\\$
0073E2: \227\\$
0073EE: \257\\$
007435: \265\\$
00749E: \277\\$
0075B1: \236\\$
007667: \240\\$
007912: \360\\$
007A1E: \270\\$
007A40: \275\\$
007B0B: \216\\$
007BA4: \343\\$
007CED: \231\\$
007D85: \337\\$
007E37: \301\\$
007F61: \323\\$
0080D0: \320\\$
0080EC: \213\\$
00812A: \223\\$
0082D2: \255\\$
00833B: \333\\$
00838D: \327\\$
0084CB: \273\\$
00850C: \347\\$
00855A: \217\\$
00878F: \353\\$
0087B0: \356\\$
008A31: \263\\$
008C79: \260\\$
008D15: \367\\$
008D68: \340\\$
008DDA: \266\\$
008E0A: \344\\$
008E7E: \212\\$
008EA1: \306\\$
009103: \334\\$
009140: \363\\$
009145: \366\\$
009186: \350\\$
00923E: \271\\$
0093AA: \361\\$
0095B1: \276\\$
0097B8: \233\\$
009910: \300\\$
009924: \354\\$
0099F9: \357\\$
009A31: \365\\$
009ACF: \305\\$
009AE2: \221\\$
009AFF: \237\\$
009C4B: \370\\$
009C6D: \371\\$
009EE0: \303\\$
00FE4F: \241\\$
0205EB: \224\\$
020C3A: \376\\$
023600: \372\\$
0265AD: \225\\$
026C21: \222\\$
0270F8: \374\\$
02870F: \214\\$
02913C: \235\\$
02A014: \220\\$

$ LC_ALL=zh_HK.big5hkscs locale charmap
BIG5-HKSCS

Most of the problematic characters are the ones whose encoding ends
in 0x5c (which happens to be backslash in ASCII, or in BIG5-HKSCS
when standing alone).
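
As a rough illustration (not part of the original test case), the
trailing byte of a string can be inspected independently of the
current locale with od. The helper name last_byte_hex is hypothetical;
the bytes \243\134 are the BIG5-HKSCS encoding of U+03B1 as shown in
the listing above.

```shell
# Hypothetical helper: print the last byte of "$1" as two hex digits.
# od and sed run under LC_ALL=C so the argument is treated as raw bytes.
last_byte_hex() {
  printf '%s' "$1" |
    LC_ALL=C od -An -tx1 |
    tr -d ' \t\n' |
    LC_ALL=C sed 's/.*\(..\)$/\1/'
}

# 0xA3 0x5C: U+03B1 in BIG5-HKSCS, per the listing above.
alpha=$(printf '\243\134')

if [ "$(last_byte_hex "$alpha")" = 5c ]; then
  echo "trailing byte is 0x5c (ASCII backslash)"
fi
```

Because everything is done on raw bytes, this works the same whether
or not the zh_HK.big5hkscs locale is installed.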

$ LC_ALL=zh_HK.big5hkscs bash -xc "[[ $'\u3b1' = $'\u3b1' ]]" 2>&1 | sed -n l
+ [[ \243\\ = \\\243\\ ]]$

Note that

bash -xc $'[[ \u3b1 = \u3b1 ]]'


also returns false in those locales.

There are similar problems for locales using BIG5, GB18030 or GBK charsets.

Same with "case" or

a=$'\u3b1'; [[ $a = $a ]]
or
[[ "$a" = "$a" ]]
or ${a#"$a"}

[ "$a" = "$a" ] is fine.
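
Given that observation, a script that needs a literal comparison could
route it through [ instead of [[. A minimal sketch; str_eq is a
hypothetical name, and this relies only on the behaviour reported
here, not on any documented guarantee:

```shell
# Hypothetical helper: literal string equality through the "[" builtin,
# which this report found unaffected; no pattern matching is involved.
str_eq() {
  [ "$1" = "$2" ]
}

a=$(printf '\243\134')   # BIG5-HKSCS bytes for U+03B1, from the listing above

if str_eq "$a" "$a"; then
  echo "equal"
else
  echo "not equal"
fi
```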

The CA and EA ones do look a lot like a bug in glibc's
locale definition or gconv module (and the CB, EB ones are a
consequence of it).

$ LC_ALL=zh_HK.big5hkscs bash -xc "[[ $'\uca' = $'\uca' ]]" 2>&1 | sed -n l
+ [[ '' = \\\210f ]]$

A $'\uanything' following a $'\uca' always yields 0x88 0x66
(which happens to be the BIG5-HKSCS encoding of U+00CA) in
bash, zsh and ksh93 (though only for anything >= 0x80 in bash).

Those locales are problematic and should be avoided in general.
The problem is that they are often *available*, so all those
corner cases caused by the fact that the encodings of some
characters contain ASCII bytes can be exploited (think of sudo or
many sshd deployments letting LC_* variables through, for instance).
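
One defensive measure a script could take is to check the active
charset against the ones named in this report before doing any
pattern matching. This is only a sketch: charset_is_risky is a
hypothetical helper, the list contains just the four charsets
mentioned here, and falling back to the C locale is an assumption
about what a given script can tolerate.

```shell
# Hypothetical helper: succeed if "$1" is one of the charsets this report
# names as affected (the list is not claimed to be exhaustive).
charset_is_risky() {
  case $1 in
    BIG5|BIG5-HKSCS|GB18030|GBK) return 0 ;;
    *) return 1 ;;
  esac
}

# "locale charmap" prints the charset of the current locale.
if charset_is_risky "$(locale charmap 2>/dev/null)"; then
  echo "warning: charset has ASCII trail bytes, falling back to C" >&2
  LC_ALL=C
  export LC_ALL
fi
```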

-- 
Stephane


