Re: Improper UTF-8 combining character handling

From: Sean Burke
Subject: Re: Improper UTF-8 combining character handling
Date: Tue, 12 Jun 2007 13:15:56 -0600
User-agent: Thunderbird (X11/20070420)

I've retried with 3.2-17 with the same results. Notably, the issue isn't
(and has not been) that all multibyte characters are handled improperly.
Instead, sequences which contain combining characters seem to be handled
inconsistently. For example, the character that represents D
WITH DOT ABOVE, U+1E0A, is handled properly. However, the equivalent
sequence U+0044 + U+0307, consisting of D and COMBINING DOT ABOVE, is
not handled properly. Backspacing through the sequence removes both
characters with one backspace, but only the COMBINING DOT ABOVE glyph is
erased from the display.
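
To make the two encodings concrete, here is a minimal Python sketch. It is
not taken from bash or readline, and the names `precomposed`, `decomposed`
and `describe` are only illustrative. It shows that both forms are
canonically equivalent under normalization, yet differ in how many code
points they contain.

```python
import unicodedata

precomposed = "\u1e0a"       # LATIN CAPITAL LETTER D WITH DOT ABOVE
decomposed = "\u0044\u0307"  # D followed by COMBINING DOT ABOVE


def describe(text):
    """Return (code point, name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


# Normalization form C composes the pair into the single character,
# and form D decomposes the single character into the pair.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

# Code point counts differ, so "one character" is ambiguous here.
code_points = (len(precomposed), len(decomposed))
utf8_bytes = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))

for form in (precomposed, decomposed):
    print(describe(form))
```

A nonzero combining class marks a character that attaches to the base
character before it, which is the case an editing loop has to account for.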

Most likely, bash is treating the sequence as a single character, either
because of specific semantics saying that a combining sequence is a
single character, or because the sequence is handled as its
normalization form C equivalent, the single D WITH DOT ABOVE character.
However, either way, the glyphs are being treated separately and deleted
one at a time. The best resolution to this, if it can be reproduced,
seems to be to treat each character and glyph in the combining sequence
separately unless specifically told to normalize (such as when the
argument is a filename).
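
The two possible backspace policies can be sketched as follows. This is a
simplified Python illustration, not bash's or readline's actual code: the
helper names are hypothetical, and grouping by combining class is only an
approximation of full Unicode grapheme cluster rules. Either policy is
workable as long as the edit buffer and the display apply the same one.

```python
import unicodedata


def split_sequences(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch   # attach the mark to the preceding base
        else:
            groups.append(ch)  # start a new base character
    return groups


def backspace_code_point(text):
    """Policy A: each backspace removes exactly one code point."""
    return text[:-1]


def backspace_sequence(text):
    """Policy B: each backspace removes a whole combining sequence."""
    return "".join(split_sequences(text)[:-1])


line = "ls D\u0307"  # ends with D + COMBINING DOT ABOVE
after_code_point = backspace_code_point(line)  # the D remains
after_sequence = backspace_sequence(line)      # D and its mark are both gone
```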

Sean Burke

Benno Schulenberg wrote:
> Sean Burke wrote:
>> The Unicode normalization test data at
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
>> contains many sequences of this sort.
>> The first character sequence, LATIN CAPITAL LETTER D WITH DOT
>> ABOVE, does produce this problem.
>> Paste it into the command line, then backspace through it. The
>> problem should be reproduced immediately.
> Cannot reproduce it with bash-3.2-17.  Please retry with patch level 
> 17.  Patch 16 specifically addresses multibyte characters.
> Benno
