Re: Some byte combinations affect UTF-8 string reading
From: Chet Ramey
Subject: Re: Some byte combinations affect UTF-8 string reading
Date: Tue, 26 Feb 2019 14:57:29 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.5.1
On 2/25/19 5:42 PM, Olga Ustuzhanina wrote:
> On Mon, 25 Feb 2019 12:59:38 -0800
> L A Walsh <bash@tlinx.org> wrote:
>
>> In this case, the decode of \xc2 doesn't swallow the following
>> character.
>
> I want to clarify that \xc2 (and other characters in the range
> mentioned above) can only swallow a \0. Other characters are
> unaffected.
The other characters wouldn't be treated as a delimiter either. The \0
is `swallowed' because it's the C string terminator.
The \0 gets added to the input string, but it's not treated as a delimiter,
since it's part of the invalid multibyte sequence. Then the next character
is read; that \0 is treated as a delimiter, and the input string is
assigned to the variable, including the embedded \0. The embedded \0 then
acts as a normal C string terminator, since variable values can't contain
NULs.
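To illustrate why \xc2 followed by \0 counts as an invalid multibyte
sequence: in UTF-8, \xc2 is the lead byte of a two-byte sequence and must be
followed by a continuation byte in the range \x80-\xbf. The sketch below
checks this with Python's strict UTF-8 decoder. The helper name
is_valid_utf8 is made up for this example, and it says nothing about how
bash itself performs the check.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# \xc2 followed by a continuation byte is a complete two-byte character.
print(is_valid_utf8(b"\xc2\xa9"))  # True
# \xc2 followed by \0 is not: \0 is not a continuation byte.
print(is_valid_utf8(b"\xc2\x00"))  # False
```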
(This is why read discards \0 unless it's a delimiter. It would terminate
the value assigned to the variable.)
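The truncation can be shown outside bash with any C-string view of a
buffer. The following is only a sketch using Python's ctypes module. The
function name as_c_string and the sample bytes are assumptions for
illustration, not anything taken from bash's source.

```python
import ctypes


def as_c_string(raw: bytes) -> bytes:
    """Return what C string functions would see in a buffer holding `raw`."""
    # create_string_buffer copies the bytes and appends a trailing NUL.
    buf = ctypes.create_string_buffer(raw)
    # .value reads up to the first NUL, the way strlen()/strcpy() would.
    return buf.value


# Everything after the embedded \0 is invisible to C string handling.
print(as_c_string(b"ab\xc2\x00cd"))  # b'ab\xc2'
```

The bytes after the first \0 are still in the buffer, but nothing that
treats the buffer as a C string will ever reach them.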
Bash-4.4 returned different results because it didn't attempt to validate
multibyte characters at all while reading unless it was reading a fixed
number of characters.
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU chet@case.edu http://tiswww.cwru.edu/~chet/