Re: Some byte combinations affect UTF-8 string reading

From: Chet Ramey
Subject: Re: Some byte combinations affect UTF-8 string reading
Date: Mon, 25 Feb 2019 14:32:32 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.5.1

On 2/25/19 11:17 AM, Olga Ustuzhanina wrote:

> Bash Version: 5.0
> Patch Level: 2
> Release Status: release
> Description:
>       When using `IFS= read -r -d '' input` to read null-delimited
>       strings on a system with bash 5.0+ and a UTF-8 locale, you can
>       encounter a situation where one of the strings being read ends
>       in a character in the range \xC2-\xFD (inclusive) and the next
>       string is empty. Something like this: "...\xC2\0\0...."
>       We would expect `read` to read this string up to the \0 right
>       after \xC2, so that the next `read` gets an empty string
>       (from the first \0 to the second \0) and the third reads the
>       rest of the string, past the second \0.
>       It turns out this isn't the case. In reality, the first `read`
>       loads the expected part of the string, but the second one
>       actually reads the rest of the string, not the expected
>       empty substring.
> Repeat-By:
>       # Reproduces bug on Bash 5.0+ with LANG set to a
>       # UTF-8 locale (en_US.UTF-8)
>       # First, let's make a function that translates a
>       # null-delimited list into a semicolon-delimited list
>       ntc() {
>               while IFS= read -r -d '' input; do
>                       printf '%s;' "$input"
>               done
>       }
>       # It works in the general case:
>       printf "a\0b\0c\0d\0" | ntc | xxd
>       # But when some element of a list ends in a character from 0xC2 to
>       # 0xFD and the next element is empty, we end up with the empty
>       # element being lost
>       printf "\xc2\0\0\0\0" | ntc | xxd

This is an invalid multibyte character. The \xc2 is the valid first byte
of a multibyte character, but the next byte read makes the sequence
invalid. The read builtin resynchronizes on the following byte. There's
currently no facility to push back the invalid parts of a multibyte
character. There might be a way to do it if the read is buffered inside
bash, but the `-d' option makes it unbuffered.
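
One possible workaround, offered as a sketch and not as a fix endorsed
anywhere in this thread: run the read loop in the C locale, so that
`read' treats its input as single bytes and never tries to assemble a
multibyte character. The function name `ntc_bytes' is hypothetical, and
the claim that the C locale sidesteps the resynchronization described
above is an assumption that should be checked on the bash version in use.

```bash
# Hypothetical variant of the reporter's ntc function.
# Assumption: in the C locale, read handles input one byte at a time,
# so a lead byte such as \xc2 cannot swallow the delimiter after it.
ntc_bytes() {
        local LC_ALL=C          # byte-oriented locale while reading
        local input
        while IFS= read -r -d '' input; do
                printf '%s;' "$input"
        done
}

# Usage, mirroring the failing case from the report:
#   printf '\xc2\0\0\0\0' | ntc_bytes | xxd
```

When the function is used in the middle of a pipeline, as in the usage
line, it runs in a subshell, so the locale change does not leak into the
calling shell. The trade-off is that the loop no longer knows about
characters at all, which is fine for passing byte strings through but
not for anything that depends on character boundaries.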


``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    address@hidden    http://tiswww.cwru.edu/~chet/
