bug-bash



From: Eric Blake
Subject: Re: Consume only up to 8 bit octal input for backslash-escaped chars (echo, printf)
Date: Tue, 07 Dec 2010 19:02:30 -0700
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.12) Gecko/20101103 Fedora/1.0-0.33.b2pre.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.6

[adding the Austin Group]

On 12/07/2010 06:19 PM, Chet Ramey wrote:
> On 12/7/10 11:12 AM, Roman Rakus wrote:
>> This one is already reported on coreutils:
>> http://debbugs.gnu.org/cgi/bugreport.cgi?msg=2;bug=7574
>>
>> The problem is with numbers higher than \0377; echo and printf consume all
>> three digits, but the result is not an 8-bit number. For example:
>> $ echo -e '\0610'; printf '\610 %b\n' '\610 \0610'
>> Should output:
>> 10
>> 10 10 10
>> instead of
>> �
>> � � �
> 
> No, it shouldn't.  This is a terrible idea.  All other shells I tested
> behave as bash does*, bash behaves as Posix specifies, and the bash
> behavior is how C character constants work.  Why would I change this?
> 
> (*That is, consume up to three octal digits and mask off all but the lower
> 8 bits of the result.)

POSIX states for echo:

"\0num Write an 8-bit value that is the zero, one, two, or three-digit
octal number num."

It does not explicitly say what happens if a three-digit octal number
does not fit in 8 bits, so it is debatable whether the standard requires
parsing to stop at an 8-bit value (two characters, \0061 followed by 0),
whether the overflow is silently discarded (one character, \0210), or
whether some other treatment is allowed.
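
As a rough illustration of the two readings, the arithmetic can be
modeled in Python. This is only a sketch: the function names
mask_to_byte and stop_at_byte are hypothetical, and it is not how bash
or any other shell implements echo. The argument is the digit string
num that follows \0.

```python
def mask_to_byte(digits: str) -> bytes:
    """Reading 1: consume all the digits, keep only the low 8 bits."""
    value = int(digits, 8)           # "610" -> 392 (0x188)
    return bytes([value & 0xFF])     # 392 & 0xFF -> 0x88 (octal 210)


def stop_at_byte(digits: str) -> bytes:
    """Reading 2: consume digits only while the value fits in 8 bits."""
    value = 0
    used = 0
    for ch in digits[:3]:
        candidate = value * 8 + int(ch, 8)
        if candidate > 0xFF:
            break                    # next digit would overflow a byte
        value = candidate
        used += 1
    # Digits that were not consumed are emitted as literal characters.
    return bytes([value]) + digits[used:].encode("ascii")


print(mask_to_byte("610"))   # b'\x88'
print(stop_at_byte("610"))   # b'10'
```

Both functions agree for every value up to 377 octal; they differ only
for three-digit numbers whose first digit is greater than 3.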

The C99 standard states (at least in 6.4.4.4 of the draft N1256 document):

"The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape sequence
that does not map to a single-byte execution character, is
implementation-defined."

leaving '\610' as an implementation-defined character constant.

The Java language specifically requires "\610" to parse as "\061"
followed by "0", and this can be a very useful property to rely on in
this day and age where 8-bit bytes are prevalent.
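
The Java-style rule can be approximated with a single regular
expression: three digits are accepted only when the first is 0 through
3, otherwise at most two. The Python below is a sketch under that
assumption; unescape_octal is a hypothetical helper, it handles only
\NNN escapes, and it is not taken from any Java or shell source.

```python
import re

# Three digits only if the first is 0-3, otherwise at most two digits,
# so the matched value can never exceed 0o377 (255).
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{0,2}|[4-7][0-7]?)")


def unescape_octal(data: bytes) -> bytes:
    """Replace each backslash-octal escape with the byte it names."""
    return _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), data)


print(unescape_octal(rb"\610"))   # b'10'   (\61 followed by a literal 0)
print(unescape_octal(rb"\377"))   # b'\xff'
```

With this rule every escape maps to exactly one byte, and no masking
step is needed.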

http://austingroupbugs.net/view.php?id=249 is standardizing $'' in the
shell, and also states:

"\XXX yields the byte whose value is the octal value XXX (one to three
octal digits)"

and while it is explicit that $'\xabc' is undefined (as to whether it
maps to $'\xab'c or to $'\u0abc' or to something else), it says nothing
about what happens when an octal escape does not fit in a byte.

Personally, I would love it if octal escapes were required to stop
parsing after two digits if the first digit is > 3, but given that C99
leaves it implementation defined, I think we need a POSIX interpretation
to resolve the issue.  Also, I think this report means that we need to
tweak the wording of bug 249 (adding $'') to deal with the case of an
octal escape where three octal digits do not fit in 8 bits (either by
explicitly declaring it unspecified, as is the case with \x escapes; or
by requiring implementation-defined behavior, as in C99; or by requiring
explicit end-of-escape after two digits, as in Java).

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
