bug-bash

Re: \c escape within $'...' can produce mangled UTF-8


From: Dmitry Groshev
Subject: Re: \c escape within $'...' can produce mangled UTF-8
Date: Sun, 15 Aug 2010 14:02:05 +0400

On 15/08/2010, Dennis Williamson <dennistwilliamson@gmail.com> wrote:
> It only consumes two bytes on my system (or one if it's followed by
> another escape or a closing quote).

You are wrong. Try "echo $'\x{123456}AB'" and look at the result.
Or read the source code: lib/sh/strtrans.c
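
For anyone who wants to look at the raw bytes on their own system,
something along these lines will do (Cyrillic "и", U+0438, encoded as
0xd0 0xb8 in UTF-8, is only an example -- any multibyte character
shows it; the variable is just to keep the result out of word
splitting):

    x=$'\cи'; printf '%s' "$x" | od -An -tx1

On a bash that behaves as described in the subject line, the
continuation byte 0xb8 comes out detached from its lead byte 0xd0,
and that is exactly the mangled UTF-8.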

> "Backslash-escaped characters" refers to the "c" in "\c" not the
> characters that follow it.

Given that the documentation doesn't say anything like that anywhere,
and given that _every other escape_ operates on characters (accepting
only ASCII characters and leaving multibyte ones alone), inventing an
exception specifically for "\c" would look quite contrived.
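
(Easy to check with any other escape, e.g. in a UTF-8 locale:

    printf '%s' $'\aи' | od -An -tx1

shows 07 d0 b8 -- the BEL byte from "\a" plus both bytes of "и"
untouched. Only "\c" reaches ahead and grabs part of the character.)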

> It's the responsibility of your code to put an ASCII character after
> the \c.

My code is fine, thank you. ;-) I have never had any use for "\c"
when "\x" is available. I found this weirdness in the Bash source
code while writing my own function for interpreting (some of) shell
syntax.
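
(For completeness: by the "\x" route I mean spelling out the UTF-8
bytes directly, something like this for the same U+0438:

    printf '%s' $'\xd0\xb8' | od -An -tx1

which should dump the two bytes intact, d0 b8 -- a valid sequence.)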

> There's no way for Bash to guess that the 0xD0 is part of a
> Unicode character or the byte that it is.

Every byte between 0x80 and 0xFF is part of a (possibly invalid)
multibyte sequence in UTF-8. Read up on the UTF-8 encoding, and don't
make wrong guesses again.
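
A quick way to see the structure (again assuming a UTF-8 locale, with
"и" / U+0438 as the example):

    printf 'и' | od -An -tx1

prints d0 b8: 0xd0 is a lead byte announcing a two-byte sequence, and
0xb8 is a continuation byte. Continuation bytes always lie in the
0x80-0xBF range, so a lone 0xb8 (or a lone 0xd0) is never valid UTF-8
by itself, no matter what precedes it.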

-- 
-= With best regards, Dmitry Groshev =-


