bug-bash

Re: accents


From: Greg Wooledge
Subject: Re: accents
Date: Tue, 10 May 2011 09:17:03 -0400
User-agent: Mutt/1.4.2.3i

On Tue, May 10, 2011 at 04:47:29AM +0200, Thomas De Contes wrote:
> tDeContes-fixe:~ thomas$ echo "$PS1"
> + echo '\h:\W \u\$ '
> \h:\W \u\$ 

> if i do not
> PS1="&# $PS1"
> then i don't have the problem described in 1

I am not able to reproduce this in my environment.  I'm using Debian 6.0
on i386, I built bash 4.2.10 from source, and I'm in the en_US.UTF-8
locale in a urxvt terminal.

> i don't use colors, at least i don't see them and i don't want them in my 
> terminal

Yes, your PS1 is definitely color-free.

> What do you think about my PS1 ?
> Is there something else important about colors ?

Not in your case.

> > Is the accented character
> > a single-byte character, or a multi-byte character, in your locale?
> 
> a multi-byte character, i think
> How to confirm that ?

> $ echo /Users/thomas/Downloads/réz | h
> + echo $'/Users/thomas/Downloads/re?\201z'
> + hexdump -C
> 00000000  2f 55 73 65 72 73 2f 74  68 6f 6d 61 73 2f 44 6f  |/Users/thomas/Do|
> 00000010  77 6e 6c 6f 61 64 73 2f  72 65 cc 81 7a 0a        |wnloads/re..z.|
> 0000001e

Oh... now this is interesting.  In my locale (not the one I'm writing this
email from, but the one I tested in), an é is 0xc3 0xa9 which is the UTF-8
encoding of the Unicode character U+00E9, LATIN SMALL LETTER E WITH ACUTE.

In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
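
To see the two encodings side by side, here is a small sketch.  It uses
printf with octal escapes (\303\251 is 0xc3 0xa9, and \314\201 is
0xcc 0x81) so that it does not depend on $'...' quoting.  The variable
names are only for illustration.

```shell
# Precomposed é: U+00E9, UTF-8 bytes c3 a9 (octal 303 251)
precomposed=$(printf '\303\251')
# Decomposed é: U+0065 then U+0301, UTF-8 bytes 65 cc 81 (octal 145 314 201)
decomposed=$(printf 'e\314\201')

# Dump the raw bytes of each form
printf '%s' "$precomposed" | od -An -tx1
printf '%s' "$decomposed" | od -An -tx1

# Byte counts; tr strips the padding that some wc implementations add
precomposed_bytes=$(printf '%s' "$precomposed" | wc -c | tr -d ' ')
decomposed_bytes=$(printf '%s' "$decomposed" | wc -c | tr -d ' ')
echo "precomposed: $precomposed_bytes bytes, decomposed: $decomposed_bytes bytes"
```

The two strings render the same in most terminals but compare as
different, which is why a typed name and a stored name can fail to match.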

Perhaps Bash does not know how to interpret COMBINING ACUTE ACCENT which
follows a letter...?

I'm not intimately familiar with this stuff myself, but it looks like
a real bastard to me... The point of UTF-8 is that you can read it a
byte at a time and know when you've hit the start of a multi-byte
character, and that still holds here: 0x65 is a complete character and
0xcc 0x81 is a second one.  But if I'm interpreting this COMBINING ACUTE
ACCENT thing properly, the only indicator that the 'e' is part of an
accented letter comes with the *following* character, so you have to
look ahead (or backtrack).  What idiot thought this up?
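
A quick way to look at how those three bytes are structured is to label
each one by its UTF-8 role.  This is only a sketch: classify_bytes is a
made-up helper, and it does no validation (it calls anything from 0xc0
up a lead byte).

```shell
# classify_bytes: label each byte on stdin as ascii, lead, or continuation.
classify_bytes() {
    od -An -v -tx1 | tr -s ' ' '\n' | while read -r hex; do
        [ -n "$hex" ] || continue          # skip blank lines from od padding
        val=$((0x$hex))
        if [ "$val" -lt 128 ]; then
            kind=ascii                     # 0x00-0x7f: complete one-byte character
        elif [ "$val" -lt 192 ]; then
            kind=continuation              # 0x80-0xbf: inside a multi-byte sequence
        else
            kind=lead                      # 0xc0 and up: starts a multi-byte sequence
        fi
        echo "$hex $kind"
    done
}

# The decomposed é from the hexdump above
printf 'e\314\201' | classify_bytes
```

The 'e' comes out as a plain one-byte character and the accent as its own
two-byte sequence, so the grouping into one visible letter happens above
the byte level, at the level of combining characters.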

With that in mind, let's see if I can reproduce some of this problem.
Please bear in mind that as I paste this from the test environment
terminal into the email-writing terminal, I have to make some manual
adjustments to preserve the observed output.

wooledg@wooledg:~$ touch $'re\xcc\x81z'
wooledg@wooledg:~$ echo r?z
r?z
wooledg@wooledg:~$ echo r*z
réz
wooledg@wooledg:~$ ls -b r*z
réz

The terminal, when presented with the string of bytes that is the filename,
renders it as réz.  However, Bash's globbing does NOT recognize this as
a three-character filename beginning with 'r' and ending with 'z', as
the r?z glob was not expanded.  ls -b also doesn't think there is anything
particularly noteworthy about this filename, which is slightly annoying.

(Bash's failure to glob this might be a second bug, or possibly another
manifestation of the same bug you're pursuing.)
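
The globbing observation can be probed in a scratch directory.  A sketch,
with count_matches as a hypothetical helper.  My assumption is that the
name holds four characters in a UTF-8 locale (r, e, U+0301, z) and five
bytes in the C locale, so r?z failing may just be ? matching one
character rather than one displayed glyph.  Which of r??z and r???z
matches depends on the locale and the shell.

```shell
dir=$(mktemp -d)
name=$(printf 're\314\201z')     # decomposed form: r, e, U+0301, z
: > "$dir/$name"

# count_matches PATTERN: print how many existing files in $dir match PATTERN.
count_matches() {
    n=0
    # $1 is left unquoted on purpose so the shell expands it as a glob
    for f in "$dir"/$1; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

one=$(count_matches 'r?z')
two=$(count_matches 'r??z')
three=$(count_matches 'r???z')
star=$(count_matches 'r*z')
echo "r?z: $one  r??z: $two  r???z: $three  r*z: $star"

rm -r "$dir"
```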

When I double-click and then middle-click to select and paste the filename
as rendered by the terminal back into the terminal, however, I do not
get re\xcc\x81z any more; rather, I get r\xc3\xa9z.  So my attempts
to reproduce your reported problem in this way fail.

The next obvious way to reproduce the problem would be to get bash to
produce the filename itself through tab completion, rather than pasting.
With that in mind, I'll try to move the file to a different name that
will be tab-completable.

The é in the filename is not the same as the é that I produce by typing
Multi_key ' e on my keyboard:

&# wooledg@wooledg:~$ mv réz zzréz
mv: cannot stat `réz': No such file or directory

That was done solely by typing.  As you can see, we've got two different
é characters flying around.  Oh joy.
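
When two names look identical on screen, checking for the combining
accent's bytes is one way to tell them apart.  A sketch only:
has_combining_acute is a hypothetical helper, and it looks for U+0301
and nothing else.

```shell
# has_combining_acute STRING: succeed if STRING contains U+0301 (bytes cc 81).
has_combining_acute() {
    acute=$(printf '\314\201')
    case $1 in
        *"$acute"*) return 0 ;;   # quoted, so the accent is matched literally
        *) return 1 ;;
    esac
}

for candidate in "$(printf 'zzre\314\201z')" "$(printf 'zzr\303\251z')"; do
    if has_combining_acute "$candidate"; then
        echo "decomposed (e + combining accent)"
    else
        echo "no combining acute accent found"
    fi
done
```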

&# wooledg@wooledg:~$ rm r*z
&# wooledg@wooledg:~$ touch $'zzre\xcc\x81z'
&# wooledg@wooledg:~$ ls zzréz 
zzréz

OK, this time, I did not type zzréz -- rather, I typed zzr and pressed
Tab, and bash supplied the rest of the filename.  This *looks* like the
same filename I would get by typing zzréz but it is not.

However, even with this I still am not able to reproduce the problem:

&# wooledg@wooledg:~$ set -o emacs
&# wooledg@wooledg:~$ PS1="&# $PS1"
&# &# wooledg@wooledg:~$ echo zzréz zzréz zzréz zzréz zzréz zzréz zzréz zzréz 
zzréz zzréz zzréz

All of the zzrézs there were produced by tab completion.  Going up and
down with the arrow keys does not cause any of the symptoms you described.
Not in Debian's bash 4.1.5, and not in the bash 4.2.10 that I compiled.

So, while I can't reproduce the problem, I think I might have uncovered
one of the issues that's triggering it (different encodings of the é
character).  There could be an additional dependency on your terminal,
or your operating system, or some other undiscovered factor.  Good luck.


