Re: Some byte combinations affect UTF-8 string reading

From: L A Walsh
Subject: Re: Some byte combinations affect UTF-8 string reading
Date: Mon, 25 Feb 2019 12:59:38 -0800
User-agent: Thunderbird

On 2/25/2019 11:32 AM, Chet Ramey wrote:
> On 2/25/19 11:17 AM, Olga Ustuzhanina wrote:
> This is an invalid multibyte character. The \xc2 is the valid first byte
> of a multibyte character, but the next byte read makes the sequence
> invalid. The read builtin resynchronizes on the following byte. There's
> currently no facility to push back the invalid parts of a multibyte
> character. There might be a way to do it if the read is buffered inside
> bash, but the `-d' option makes it unbuffered.
    Note: this is in bash 4.4.12 -- is there supposed to be a behavior
difference in 5.0?

If I change the previous example to read newline-delimited input
(no -d, default IFS; otherwise the same function) and then print
the same string with LFs in place of the NULs:

ntc() { while read -r input; do printf "$input;" ; done ; }
printf $'\xc2\n\n\n\n'|ntc|hexdump -C        
00000000  c2 3b 3b 3b 3b                                    |.;;;;|

In this case, the decode of \xc2 doesn't swallow the following
newline: all four delimiters still show up as ';' in the output.

But in 4.4.12, using IFS='':

ntc() {  while IFS='' read -r input; do printf "$input;" ; done ; }

gives no output, whether or not the first character decodes
correctly.  That is,
printf $'\xc2\xa9\x00\x00\x00\x00'|ntc|hd
printf $'\xc2\00\00\00\00'|ntc|hexdump -C

both result in no output.  Is that what happens on 5.x?
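
A possible explanation for the empty output, sketched under some
assumptions: without -d '', read still uses newline as its delimiter,
and it returns non-zero when it hits EOF before seeing one, so the loop
body never runs on input that contains no newline. Separately, bash
keeps strings NUL-terminated, so a $'...' argument is cut at its first
\x00 before printf sees it. The sketch below uses plain ASCII data to
keep the locale out of the picture; count_lines and count_nul_records
are made-up helper names, not anything from this thread.

```shell
#!/usr/bin/env bash
# Sketch only; count_lines and count_nul_records are hypothetical helpers.

# Count records using read's default delimiter (newline).
# read returns non-zero at EOF, so data with no trailing newline
# never reaches the loop body.
count_lines() {
  local n=0 input
  while IFS= read -r input; do
    n=$((n + 1))
  done
  printf '%s\n' "$n"
}

# Count records delimited by NUL bytes (-d '' selects NUL).
count_nul_records() {
  local n=0 input
  while IFS= read -r -d '' input; do
    n=$((n + 1))
  done
  printf '%s\n' "$n"
}

# Let printf expand \0 itself so the NUL bytes really reach the pipe.
printf 'a\0b\0' | count_lines         # input has no newline at all
printf 'a\0b\0' | count_nul_records   # input has two NUL-terminated records
```

If that is what is going on here, rerunning the IFS='' test with
read -r -d '' and with the escapes inside printf's own format string
should show whether any 4.4.12 vs 5.x difference remains.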
