Re: Some byte combinations affect UTF-8 string reading

From: Chet Ramey
Subject: Re: Some byte combinations affect UTF-8 string reading
Date: Tue, 26 Feb 2019 14:57:29 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.5.1

On 2/25/19 5:42 PM, Olga Ustuzhanina wrote:
> On Mon, 25 Feb 2019 12:59:38 -0800
> L A Walsh <address@hidden> wrote:
>> In this case, the decode of \xc2 doesn't swallow the following
>> character.
> I want to clarify that \xc2 (and other characters in the range
> mentioned above) can only swallow a \0. Other characters are
> unaffected.

The other characters wouldn't be treated as a delimiter either. The \0
is `swallowed' because it's the C string terminator.

The \0 gets added to the input string, but it's not treated as a delimiter,
since it's part of the invalid multibyte sequence. Then the next character
is read, that \0 is treated as a delimiter, and the input string is
assigned to the variable, including the \0. That gets treated as a normal C
string terminator, since variable values can't contain NULs.

(This is why read discards \0 unless it's a delimiter. It would terminate
the value assigned to the variable.)
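
To illustrate the truncation described above, here is a minimal sketch in Python that uses ctypes to view a byte buffer the way C string code would. It is not bash's implementation; the helper name as_c_string and the sample bytes are assumptions made up for illustration only.

```python
import ctypes


def as_c_string(raw: bytes) -> bytes:
    """Return the bytes a C string function would see in `raw`.

    Hypothetical helper: it copies `raw` into a C char buffer and reads
    it back as a NUL-terminated string, so everything from the first
    \\0 onward is lost.
    """
    buf = ctypes.create_string_buffer(raw)
    return buf.value  # .value stops at the first NUL byte


# Assumed sample input: an invalid multibyte lead byte (\xc2) followed by
# a \0 that was kept in the input string, then more data.
collected = b"a\xc2\x00b"

print(len(collected))          # 4 bytes were collected
print(as_c_string(collected))  # b'a\xc2'
```

The bytes after the embedded \0 are still in the buffer, but nothing that treats the value as a C string will ever reach them.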

Bash-4.4 returned different results because it didn't attempt to validate
reading multibyte characters at all unless it was reading a fixed number of

``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    address@hidden    http://tiswww.cwru.edu/~chet/
