Re: Some byte combinations affect UTF-8 string reading
From: Chet Ramey
Subject: Re: Some byte combinations affect UTF-8 string reading
Date: Mon, 25 Feb 2019 14:32:32 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.5.1
On 2/25/19 11:17 AM, Olga Ustuzhanina wrote:
> Bash Version: 5.0
> Patch Level: 2
> Release Status: release
>
> Description:
> When using `IFS= read -r -d '' input` to read null-delimited
> strings on a system with bash 5.0+ and a UTF-8 locale, you can
> encounter a situation where one of the strings being read ends in
> a byte in the range \xC2-\xFD (inclusive) and the next string is
> empty. Something like this: "...\xC2\0\0...."
>
> We would expect `read` to read this string up until \0 right
> after \xC2 so that the next `read` will get an empty string
> (from first \0 to second \0) and third will read the rest of
> the string, past second \0.
>
> It turns out this isn't the case. In reality, the first `read`
> loads the expected part of the string, but the second one
> actually reads the rest of the string, not the expected
> empty substring.
>
> Repeat-By:
> # Reproduces bug on Bash 5.0+ with LANG set to a
> # UTF-8 locale (en_US.UTF-8)
>
> # First, let's make a function that translates a
> # null-delimited list into a semicolon-delimited list
>
> ntc() {
>     while IFS= read -r -d '' input; do
>         printf '%s;' "$input"
>     done
> }
>
> # It works in the general case:
>
> printf "a\0b\0c\0d\0" | ntc | xxd
>
> # But when some element of a list ends in a byte from 0xC2 to
> # 0xFD and the next element is empty, we end up with the empty
> # element being lost
>
> printf "\xc2\0\0\0\0" | ntc | xxd
The \xc2 followed by \0 is an invalid multibyte character. The \xc2 is
a valid first byte of a multibyte character, but the next byte read (the
\0 delimiter) makes the sequence invalid. The read builtin resynchronizes
on the following byte, so that delimiter is consumed as part of the
invalid sequence and the empty element is lost. There's currently no
facility to push back the invalid parts of a multibyte character. There
might be a way to do it if the read is buffered inside bash, but the
`-d' option makes it unbuffered.
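
One possible workaround, offered as a sketch and not as a fix in bash
itself: run each read in the C locale, so the builtin treats the input as
single bytes and never tries to assemble a multibyte character. The
function name ntc_bytes is hypothetical, and the sketch assumes that a
temporary LC_ALL=C assignment on the read command takes effect for that
one call.

```shell
#!/usr/bin/env bash
# Hypothetical variant of ntc (not from the original report): force the
# C locale for each read so the input is processed byte by byte.
ntc_bytes() {
    local input
    while IFS= LC_ALL=C read -r -d '' input; do
        # Pass the data as an argument, not as the printf format string.
        printf '%s;' "$input"
    done
}

# \302 is the octal spelling of \xc2; dump the result as hex bytes.
printf '\302\0\0\0\0' | ntc_bytes | od -An -tx1
```

Under that assumption each \0 should terminate exactly one element, so
the empty elements after the \xc2 byte are kept. The trade-off is that
the variable holds raw bytes, which matters only if later code relies on
character-aware operations such as ${#input}.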
Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU chet@case.edu http://tiswww.cwru.edu/~chet/