Re: Some byte combinations affect UTF-8 string reading

bug-bash

From:	L A Walsh
Subject:	Re: Some byte combinations affect UTF-8 string reading
Date:	Mon, 25 Feb 2019 12:59:38 -0800
User-agent:	Thunderbird

On 2/25/2019 11:32 AM, Chet Ramey wrote:
> On 2/25/19 11:17 AM, Olga Ustuzhanina wrote:
>
>   
>
> This is an invalid multibyte character. The \xc2 is the valid first byte
> of a multibyte character, but the next byte read makes the sequence
> invalid. The read builtin resynchronizes on the following byte. There's
> currently no facility to push back the invalid parts of a multibyte
> character. There might be a way to do it if the read is buffered inside
> bash, but the `-d' option makes it unbuffered.
>   
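As a rough illustration of the rule in the quoted text: in UTF-8, a lead byte in the range 0xC2-0xDF starts a two-byte character and must be followed by a continuation byte in the range 0x80-0xBF. So \xc2\xa9 is a complete character, while \xc2 followed by NUL or LF is not. The sketch below is not how bash implements the check, and the helper name `classify_pair` is made up for this example.

```shell
# classify_pair LEAD NEXT: both arguments are bytes written as two hex digits.
# Hypothetical helper for illustration only; it checks two-byte sequences
# and ignores the three- and four-byte forms.
classify_pair() {
    lead=$((0x$1))
    next=$((0x$2))
    if [ "$lead" -lt 194 ] || [ "$lead" -gt 223 ]; then    # outside 0xC2..0xDF
        echo "not a two-byte lead"
    elif [ "$next" -ge 128 ] && [ "$next" -le 191 ]; then  # inside 0x80..0xBF
        echo "valid"
    else
        echo "invalid"
    fi
}

classify_pair c2 a9   # prints "valid" (U+00A9)
classify_pair c2 00   # prints "invalid"
classify_pair c2 0a   # prints "invalid"
```
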
----
    Note: this is with bash 4.4.12. Is the behavior supposed to be
different in 5.0?

If I change the previous example to use the default IFS and the
default newline delimiter (otherwise the same function as before),
and print the same string with LFs instead of NULs:

ntc() { while read -r input; do printf '%s;' "$input" ; done ; }
printf $'\xc2\n\n\n\n'|ntc|hexdump -C
00000000  c2 3b 3b 3b 3b                                    |.;;;;|
00000005

In this case, the decode of \xc2 doesn't swallow the following
character.

But in 4.4.12, using IFS='':

ntc() { while IFS='' read -r input; do printf '%s;' "$input" ; done ; }

gives no output, regardless of whether the first character is decoded
correctly.  That is, both

printf $'\xc2\xa9\x00\x00\x00\x00'|ntc|hd

and

printf $'\xc2\00\00\00\00'|ntc|hexdump -C

produce no output.  Is that also what happens on 5.x?
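
One possible explanation that has nothing to do with multibyte decoding, offered as a sketch rather than a confirmed diagnosis: bash cannot store NUL bytes in a string, so `$'...'` is cut off at the first `\x00` and `printf` only receives the bytes before it. With no trailing newline in the input, `read` then reaches end-of-file before finding its delimiter and returns non-zero, so the `while` body never runs. The helper name `count_reads` is invented for this example.

```shell
# Bytes that actually reach the pipe: $'...' ends at the first NUL,
# so only the leading bytes survive (requires bash for $'...').
printf $'\xc2\xa9\x00\x00\x00\x00' | od -An -tx1

# count_reads (hypothetical helper) reports how often the loop body runs.
count_reads() {
    n=0
    while IFS='' read -r input; do
        n=$((n + 1))
    done
    echo "$n"
}

printf 'abc' | count_reads      # prints 0: no newline, read fails at EOF
printf 'abc\n' | count_reads    # prints 1
```

A common workaround for a missing final delimiter is `while IFS= read -r input || [ -n "$input" ]; do ...`. Sending real NUL bytes requires escapes in the printf format itself, such as `printf '\302\251\0\0\0\0'`.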

Current Thread

Some byte combinations affect UTF-8 string reading, Olga Ustuzhanina, 2019/02/25
- Re: Some byte combinations affect UTF-8 string reading, Chet Ramey, 2019/02/25
  - Re: Some byte combinations affect UTF-8 string reading, L A Walsh <=
    - Re: Some byte combinations affect UTF-8 string reading, Olga Ustuzhanina, 2019/02/25
    - Re: Some byte combinations affect UTF-8 string reading, Chet Ramey, 2019/02/26
    - Re: Some byte combinations affect UTF-8 string reading, Grisha Levit, 2019/02/25
