bug with case conversion of UTF-8 characters
From: Stephane Chazelas
Subject: bug with case conversion of UTF-8 characters
Date: Thu, 22 Jan 2015 14:43:00 +0000
User-agent: Mutt/1.5.21 (2010-09-15)
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu'
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL
-DHAVE_CONFIG_H -I. -I../. -I.././include -I.././lib -D_FORTIFY_SOURCE=2 -g
-O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall
uname output: Linux host 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1
(2014-12-08) x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu
Bash Version: 4.3
Patch Level: 30
Release Status: release
(Debian unstable amd64)
$ LC_ALL=tr_TR.UTF-8 bash -c 'typeset -l a; a=İ; echo $a' | hd
00000000 69 b0 0a |i..|
00000003
$ a=İ LC_ALL=tr_TR.UTF-8 bash -c 'echo ${a,,}' | hd
00000000 69 b0 0a |i..|
00000003
In Turkish locales, on a GNU system at least, the uppercase of i is İ,
not I, and the lowercase of I is ı, not i.
İ was properly converted to i, but there's a spurious 0xb0 byte after
it, which probably comes from the second byte of the original İ:
$ echo İ | hd
00000000 c4 b0 0a |...|
00000003
The reverse problem, where i and I are left unconverted:
$ a=i LC_ALL=tr_TR.UTF-8 bash -c 'echo ${a^^}'
i
$ a=I LC_ALL=tr_TR.UTF-8 bash -c 'echo ${a,,}'
I
$ LC_ALL=tr_TR.UTF-8 bash -c 'typeset -u a; a=ia;echo $a' | hd
00000000 69 41 0a |iA.|
00000003
This also affects other characters whose lowercase and uppercase
counterparts don't have the same number of bytes in their
UTF-8 encoding. Here, in an en_US.UTF-8 locale:
$ a=$'\u027D' bash -c 'echo $a ${a^^}' | hd
00000000 c9 bd 20 e2 bd a4 03 0a |.. .....|
00000008
$ a=$'\u027D' zsh -c 'echo $a ${(U)a}' | hd
00000000 c9 bd 20 e2 b1 a4 0a |.. ....|
00000007
(this time the converted character is *larger* than the original, and
there's still a spurious byte, 0x03, which this time does not come from
the original character; possibly it comes from the stack).
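
For reference, a minimal sketch that shows the byte-length mismatch behind
both cases. It only counts bytes and does not exercise bash's case
conversion. The helper name byte_len and the variable names are made up for
this illustration. The characters are written as octal escapes so the
script itself is ASCII-only and does not depend on the locale.

```shell
#!/bin/sh
# byte_len STRING: print the number of bytes in STRING.
# (hypothetical helper, not part of bash)
byte_len() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# UTF-8 encodings, spelled out byte by byte in octal.
I_DOT=$(printf '\304\260')          # U+0130 (İ), bytes c4 b0
R_TAIL=$(printf '\311\275')         # U+027D, bytes c9 bd
R_TAIL_UP=$(printf '\342\261\244')  # U+2C64, bytes e2 b1 a4

# Lowercasing U+0130 shrinks the encoding, uppercasing U+027D grows it.
echo "U+0130 -> i      : $(byte_len "$I_DOT") -> $(byte_len i) bytes"
echo "U+027D -> U+2C64 : $(byte_len "$R_TAIL") -> $(byte_len "$R_TAIL_UP") bytes"
```

In both directions the result has a different length from the input, which
is consistent with the stray trailing byte seen in the bash output above.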
--
Stephane