Some byte combinations affect UTF-8 string reading
From: Olga Ustuzhanina
Subject: Some byte combinations affect UTF-8 string reading
Date: Mon, 25 Feb 2019 23:17:49 +0700
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: cc
Compilation CFLAGS: -fstack-clash-protection -D_FORTIFY_SOURCE=2
    -mtune=generic -O2 -pipe -DSYS_BASHRC='/etc/bash/bashrc' -g
    -Wno-parentheses -Wno-format-security
uname output: Linux laserbook 4.20.12_1 #1 SMP PREEMPT Sat Feb 23
    15:05:07 UTC 2019 x86_64 GNU/Linux
Machine Type: x86_64-unknown-linux-gnu
Bash Version: 5.0
Patch Level: 2
Release Status: release
Description:
When using `IFS= read -r -d '' input` to read null-delimited
strings on a system with bash 5.0+ and a UTF-8 locale, you can
run into a problem when one of the strings being read ends in a
byte in the range \xC2-\xFD (inclusive) and the next string is
empty. Something like this: "...\xC2\0\0...."
We would expect the first `read` to read up to the \0 right
after \xC2, the second `read` to return an empty string
(from the first \0 to the second \0), and the third to read the
rest of the input, past the second \0.
It turns out this isn't the case. In reality, the first `read`
returns the expected part of the string, but the second one
reads the rest of the string instead of the expected empty
substring.
Repeat-By:
# Reproduces bug on Bash 5.0+ with LANG set to a
# UTF-8 locale (en_US.UTF-8)
# First, let's make a function that translates a
# null-delimited list into a semicolon-delimited list
ntc() {
    while IFS= read -r -d '' input; do
        # Pass the data as an argument, not as the format string
        printf '%s;' "$input"
    done
}
# It works in the general case:
printf "a\0b\0c\0d\0" | ntc | xxd
# But when some element of the list ends in a byte from 0xC2 to
# 0xFD and the next element is empty, the empty element is lost
printf "\xc2\0\0\0\0" | ntc | xxd
# However, setting LANG='C' makes the issue go away
printf "\xc2\0\0\0\0" | LANG='C' ntc | xxd
# Also, bytes outside of the C2-FD range work fine
printf "\xc1\0\0\0\0" | ntc | xxd
printf "\xfe\0\0\0\0" | ntc | xxd
printf "\x9f\0\0\0\0" | ntc | xxd
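For a quicker check than reading the xxd dumps, the records can simply be counted. The sketch below is not part of the original reproduction: `count_records` is a hypothetical helper name, and it assumes bash, since it relies on `read -d ''` and `local`. It uses LC_ALL rather than LANG for the C-locale run because LC_ALL takes precedence when both are set.

```shell
# Hypothetical helper (not from the original report): count the
# NUL-delimited records arriving on stdin.
count_records() {
    local n=0 input
    while IFS= read -r -d '' input; do
        n=$((n + 1))
    done
    printf '%s\n' "$n"
}

# The input holds four records: "\xc2", "", "", "".
# In the C locale the count should be 4.
printf '\xc2\0\0\0\0' | LC_ALL=C count_records

# Under a UTF-8 locale, a count below 4 would indicate that an
# empty record was lost as described above.
printf '\xc2\0\0\0\0' | count_records
```

If the two printed counts differ, the build is probably affected by the behaviour described in this report.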