Improper UTF-8 combining character handling

From: Sean Burke
Subject: Improper UTF-8 combining character handling
Date: Sat, 09 Jun 2007 14:06:27 -0600
User-agent: Thunderbird (X11/20070420)

Configuration Information [Automatically generated, do not change]:
Machine: i686
OS: linux-gnu
Compiler: i686-pc-linux-gnu-gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='i686'
x-gnu' -DCONF_MACHTYPE='i686-pc-linux-gnu' -DCONF_VENDOR='pc'
share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H   -I.  -I.
-I./include -I
./lib   -O2 -march=prescott -fomit-frame-pointer -pipe
uname output: Linux morrigan 2.6.20-gentoo-r8-mactel #4 SMP PREEMPT Sat May 12 10:35:03 MDT 2007 i686 Genuine Intel(R) CPU            1400  @ 1.83GHz
el GNU/Linux
Machine Type: i686-pc-linux-gnu

Bash Version: 3.2
Patch Level: 15
Release Status: release

        When using a UTF-8 combining character sequence, there is a
disparity between what is considered a character for display and for
editing. The entire sequence is treated as a single character for the
purpose of editing, but each glyph that is part of the sequence is
treated separately for display. This causes some glyphs not to be
removed when deleting characters, or leaves the cursor visually in the
wrong place.
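
To illustrate the disparity, here is a minimal Python sketch that counts the same sequence three ways: bytes, code points, and an estimate of terminal columns. The helper name display_columns is hypothetical, and the column estimate is a simplification that only treats characters with a nonzero canonical combining class as zero-width. It does not describe how readline itself measures width.

```python
import unicodedata

def display_columns(text):
    """Rough column estimate: combining marks take no column of their own.

    Simplification: ignores wide East Asian characters and zero-width
    characters whose canonical combining class is 0.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# "D" followed by COMBINING DOT ABOVE (U+0307)
seq = "D\u0307"

utf8_bytes = len(seq.encode("utf-8"))  # bytes in the UTF-8 encoding
code_points = len(seq)                 # Unicode code points
columns = display_columns(seq)         # estimated terminal columns

if __name__ == "__main__":
    print(utf8_bytes, code_points, columns)
```

An editor that moves or deletes by one of these units while drawing by another will show the kind of mismatch described above.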

        The Unicode normalization test data at
DATA/NormalizationTest.txt contains many sequences of this sort. The
first character sequence, LATIN CAPITAL LETTER D WITH DOT ABOVE, does
produce this problem. Paste it into the command line, then backspace
through it. The problem should be reproduced immediately.
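
For anyone without the test file at hand, the sequence can probably be regenerated with a short Python sketch. It assumes the entry in question pairs the precomposed U+1E0A with its canonical decomposition, U+0044 followed by U+0307. The helper describe is a hypothetical name used only for this illustration.

```python
import unicodedata

# Precomposed form: LATIN CAPITAL LETTER D WITH DOT ABOVE
composed = "\u1E0A"

# Canonical decomposition (NFD): base letter plus combining mark
decomposed = unicodedata.normalize("NFD", composed)

def describe(text):
    """List each code point in text with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

if __name__ == "__main__":
    print(describe(composed))
    print(describe(decomposed))
    # Paste the line printed below into a shell prompt and backspace through it.
    print(decomposed)
```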

        Glyphs and character sequences should be treated consistently.
With combining character sequences, it would most likely be preferable
to treat each character in the sequence separately to allow for more
precise editing, though there may be other issues I'm unaware of.
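
The two editing policies can be contrasted in a small Python sketch. Both function names are hypothetical, and neither is taken from bash or readline. The cluster rule here is an assumption that only groups a base character with trailing marks of nonzero combining class, which is narrower than full grapheme cluster segmentation.

```python
import unicodedata

def backspace_code_point(line):
    """Delete only the last code point (the per-character policy)."""
    return line[:-1]

def backspace_cluster(line):
    """Delete the last base character together with its combining marks."""
    i = len(line)
    # Skip backwards over trailing combining marks.
    while i > 0 and unicodedata.combining(line[i - 1]) != 0:
        i -= 1
    # Then drop the base character in front of them, if there is one.
    return line[:max(i - 1, 0)]

if __name__ == "__main__":
    line = "aD\u0307"
    print(repr(backspace_code_point(line)))  # only the dot is removed
    print(repr(backspace_cluster(line)))     # the D and its dot are removed
```

Either policy is workable on its own. The point is that the redisplay code has to count columns using the same unit that the editing code deletes.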
