bash-4.3: casemod word expansions broken with UTF-8
From: Ulrich Mueller
Subject: bash-4.3: casemod word expansions broken with UTF-8
Date: Mon, 16 Nov 2015 16:12:15 +0100
[Resending, apparently my first message didn't make it to the list.]
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: x86_64-pc-linux-gnu-gcc
Compilation CFLAGS: -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu'
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL
-DHAVE_CONFIG_H -I. -I./include -I. -I./include -I./lib
-DDEFAULT_PATH_VALUE='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
-DSTANDARD_UTILS_PATH='/bin:/usr/bin:/sbin:/usr/sbin'
-DSYS_BASHRC='/etc/bash/bashrc' -DSYS_BASH_LOGOUT='/etc/bash/bash_logout'
-DNON_INTERACTIVE_LOGIN_SHELLS -DSSH_SOURCE_BASHRC -march=core2 -ggdb -O2 -pipe
uname output: Linux juno 3.18.24-gentoo #1 SMP Sun Nov 8 10:43:05 CET 2015
x86_64 Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10GHz GenuineIntel GNU/Linux
Machine Type: x86_64-pc-linux-gnu
Bash Version: 4.3
Patch Level: 42
Release Status: release
Description:
In a UTF-8 locale like en_US.UTF-8, the case-modifying
parameter expansions sometimes return invalid UTF-8 encodings.
This seems to happen when the UTF-8 encodings of the uppercase
and lowercase forms of a character have different lengths.
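To make the length mismatch concrete, here is a small sketch that counts the bytes in the UTF-8 encodings of the three case pairs used in this report. The helper name byte_len is made up for illustration; the byte strings are the standard UTF-8 encodings of the named code points.

```shell
# Hypothetical helper: print the number of bytes in a string.
# wc -c counts bytes regardless of the current locale.
byte_len() {
    local n
    n=$(printf '%s' "$1" | wc -c)
    echo "$((n))"    # arithmetic expansion strips any padding from wc
}

# Lowercase/uppercase pairs, given as raw UTF-8 bytes.
byte_len $'\xc4\xb1'        # U+0131 dotless i         -> 2
byte_len 'I'                # U+0049 I                 -> 1
byte_len $'\xc3\x9f'        # U+00DF sharp s           -> 2
byte_len $'\xe1\xba\x9e'    # U+1E9E capital sharp s   -> 3
byte_len $'\xc9\x90'        # U+0250 turned a          -> 2
byte_len $'\xe2\xb1\xaf'    # U+2C6F turned A          -> 3
```

In each pair the two encodings differ by one byte, which is the situation described above.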
Repeat-By:
$ LC_ALL=en_US.UTF-8
$ x=$'\xc4\xb1' # LATIN SMALL LETTER DOTLESS I
$ echo -n "${x^}" | od -t x1
0000000 49 b1
0000002
The output should have been "49" (for "I") only. The "b1" is the
second byte of the original two-byte sequence and is illegal as
the first byte of a UTF-8 sequence.
$ x=$'\xe1\xba\x9e' # LATIN CAPITAL LETTER SHARP S
$ echo -n "${x,}" | od -t x1
0000000 c3 9f 9e
0000003
The output should have been "c3 9f" (for "sharp s") only. The "9e"
is the third byte of the original three-byte sequence.
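The two results above can also be checked mechanically. The following is a minimal sketch, not a full validator: it only checks that lead bytes and continuation bytes appear in a consistent pattern, and does not reject overlong forms or surrogates. The function name utf8_wellformed is hypothetical.

```shell
# Hypothetical helper: succeed if the string has well-formed UTF-8 structure.
# Checks lead/continuation byte patterns only (no overlong or surrogate checks).
utf8_wellformed() {
    local need=0 b
    for b in $(printf '%s' "$1" | od -An -v -t x1); do
        case $b in
            [0-7]?)  [ "$need" -eq 0 ] || return 1 ;;           # ASCII byte
            [89ab]?) [ "$need" -gt 0 ] || return 1              # continuation byte
                     need=$((need - 1)) ;;
            [cd]?)   [ "$need" -eq 0 ] || return 1; need=1 ;;   # 2-byte lead
            e?)      [ "$need" -eq 0 ] || return 1; need=2 ;;   # 3-byte lead
            f[0-7])  [ "$need" -eq 0 ] || return 1; need=3 ;;   # 4-byte lead
            *)       return 1 ;;                                # f8..ff never appear
        esac
    done
    [ "$need" -eq 0 ]    # fail if a sequence was left unfinished
}

# Observed and expected byte strings from the two examples above.
for s in $'I\xb1' 'I' $'\xc3\x9f\x9e' $'\xc3\x9f'; do
    if utf8_wellformed "$s"; then echo valid; else echo invalid; fi
done
```

In bash this reports the two observed results ("49 b1" and "c3 9f 9e") as invalid and the two expected results as valid.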
Even more interesting effects happen if the string contains
a character whose UTF-8 encoding gets *longer* after case
conversion, because then the terminating null byte will be
overwritten.
For example, U+0250 "LATIN SMALL LETTER TURNED A" is
represented by a two byte sequence in UTF-8, while its
uppercase equivalent U+2C6F needs three bytes:
$ LC_ALL=en_US.UTF-8
$ x=$'aaaaa\xc9\x90'
$ y=${x^^}
$ echo -n "$y" | od -t x1
0000000 41 41 41 41 41 e2 90 af 6f 6d 65 2f 75 6c 6d
0000017
Variable y contains some trailing garbage (could be a part of
$HOME or $PWD).
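The dump above is 15 bytes long, while five "A" characters plus one three-byte sequence would account for only 8, so the last 7 bytes are the garbage. They are plain ASCII and can be decoded with a short sketch like the following; the helper name hex_to_text is made up for illustration, and the hex values are copied from the od output above.

```shell
# Hypothetical helper: turn space-separated hex bytes into the characters they encode.
hex_to_text() {
    local h out=''
    for h in $1; do
        # hex -> three-digit octal -> the byte itself
        out="$out$(printf "\\$(printf '%03o' "0x$h")")"
    done
    printf '%s\n' "$out"
}

hex_to_text '6f 6d 65 2f 75 6c 6d'    # prints: ome/ulm
```

The decoded text reads like the tail of a directory path, which fits the guess that the bytes came from $HOME or $PWD.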