Some byte combinations affect UTF-8 string reading

bug-bash

From:	Olga Ustuzhanina
Subject:	Some byte combinations affect UTF-8 string reading
Date:	Mon, 25 Feb 2019 23:17:49 +0700

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: cc
Compilation CFLAGS: -fstack-clash-protection -D_FORTIFY_SOURCE=2
    -mtune=generic -O2 -pipe -DSYS_BASHRC='/etc/bash/bashrc' -g
    -Wno-parentheses -Wno-format-security
uname output: Linux laserbook 4.20.12_1 #1 SMP PREEMPT Sat Feb 23
    15:05:07 UTC 2019 x86_64 GNU/Linux
Machine Type: x86_64-unknown-linux-gnu

Bash Version: 5.0
Patch Level: 2
Release Status: release

Description:
        When using `IFS= read -r -d '' input` to read null-delimited
        strings on a system with bash 5.0+ and a UTF-8 locale, you can
        run into a situation where one of the strings being read ends
        in a byte in the range \xC2-\xFD (inclusive) and the next
        string is empty. Something like this: "...\xC2\0\0...."

        We would expect the first `read` to read this input up to the
        \0 right after \xC2, so that the second `read` gets an empty
        string (from the first \0 to the second \0) and the third
        reads the rest of the input, past the second \0.

        It turns out this isn't the case. In reality, the first `read`
        loads the expected part of the input, but the second one
        reads the rest of the input instead of the expected empty
        string.

Repeat-By:
        # Reproduces bug on Bash 5.0+ with LANG set to a
        # UTF-8 locale (en_US.UTF-8)

        # First, let's make a function that translates a
        # null-delimited list into a semicolon-delimited list

        ntc() {
                while IFS= read -r -d '' input; do
                        # Pass the data as an argument, not as the format
                        printf '%s;' "$input"
                done
        }

        # It works in the general case:

        printf "a\0b\0c\0d\0" | ntc | xxd

        # But when some element of the list ends in a byte from 0xC2 to
        # 0xFD and the next element is empty, the empty element is
        # lost

        printf "\xc2\0\0\0\0" | ntc | xxd

        # However, setting LANG='C' makes the issue go away

        printf "\xc2\0\0\0\0" | LANG='C' ntc | xxd

        # Also, bytes outside of the C2-FD range work fine

        printf "\xc1\0\0\0\0" | ntc | xxd
        printf "\xfe\0\0\0\0" | ntc | xxd
        printf "\x9f\0\0\0\0" | ntc | xxd
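
A more compact way to check for the lost element, sketched here as an
addition to the steps above: count how many records the `read` loop
returns and compare that with the number of NUL bytes actually present
in the input. The helper names `count_read_records` and
`count_nul_bytes` are hypothetical and not part of the original report.

```shell
# Requires bash (uses `read -d ''`).

# Count records the same way the ntc loop does: one per successful read.
count_read_records() {
    local n=0 rec
    while IFS= read -r -d '' rec; do
        n=$((n + 1))
    done
    printf '%d\n' "$n"
}

# Count NUL bytes directly, without going through `read`:
# delete everything that is not NUL, then count what is left.
count_nul_bytes() {
    tr -cd '\000' | wc -c | tr -d ' '
}
```

For NUL-terminated input the two counts should agree, so running
`printf "\xc2\0\0\0\0" | count_read_records` and
`printf "\xc2\0\0\0\0" | count_nul_bytes` in the same UTF-8 locale and
getting different numbers would indicate that a record was dropped.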
