help-bash

Re: any plans for command substitution that preserves trailing newlines?


From: Christoph Anton Mitterer
Subject: Re: any plans for command substitution that preserves trailing newlines?
Date: Tue, 01 Jun 2021 23:03:11 +0200
User-agent: Evolution 3.38.3-1

Hey again.

First, perhaps: in the sense of a shell variable, what exactly is its
content, or rather, what is a string?

I tried to read this up in POSIX, but with no definite outcome.

3.375 String says:
>A contiguous sequence of bytes terminated by and including the first
>null byte.

but there are also definitions for wide strings (3.445 Wide-Character
String) where things get character-based instead of byte-based.

3.267 Parameter doesn't say whether the content is a string in the
sense of 3.375 or a 3.92 Character String.


Can a variable only hold a string that is valid in the current encoding
(and what happens then if the encoding is changed), or is it rather a
binary string (anything except NUL), and the actual interpretation only
happens when it is e.g. printed to the console?
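
One way to probe this is to compare the length the shell reports for the same variable under two locales. The following is only a minimal sketch: the locale name C.UTF-8 is an assumption (use whatever UTF-8 locale `locale -a` lists), and a shell without multibyte support will simply report the byte count both times.

```sh
# "é" in UTF-8 is the two bytes 0xc3 0xa9 (octal 303 251).
v=$(printf '\303\251')

# Byte-wise view: in the C locale every byte counts as one character.
LC_ALL=C
bytes=${#v}

# Character-wise view. C.UTF-8 is an assumed locale name; it has to be
# installed for this step to mean anything.
LC_ALL=C.UTF-8
chars=${#v}

unset LC_ALL
printf 'bytes=%s chars=%s\n' "$bytes" "$chars"
```

The stored value is never reassigned between the two measurements, so any difference between the two numbers comes from how the shell interprets the bytes at expansion time, not from what the variable holds.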


On Tue, 2021-06-01 at 11:55 +0900, Koichi Murase wrote:
> It seems the solution is also given there; set temporary LC_ALL=C

Which, even if it worked (and it doesn't seem to for me), would be
quite ugly.
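
To make concrete what that workaround amounts to, here is a minimal sketch of the sentinel trick combined with a temporary LC_ALL=C. The helper name capture_raw and the variables captured, capture_status, saved_lc_all and had_lc_all are made up for this illustration; whether the LC_ALL switch is needed, or sufficient, in every shell is exactly the open question.

```sh
# capture_raw is a hypothetical helper: it runs a command and stores its
# output, including trailing newlines, in the variable "captured".
capture_raw() {
    # Print a sentinel "x" after the command so that the command
    # substitution never ends in a newline. The command's exit status
    # is passed through.
    captured=$("$@"; rc=$?; printf x; exit "$rc")
    capture_status=$?

    # Remember whether LC_ALL was set, then force byte-wise matching.
    if [ "${LC_ALL+set}" = set ]; then
        saved_lc_all=$LC_ALL
        had_lc_all=1
    else
        had_lc_all=0
    fi
    LC_ALL=C
    captured=${captured%x}

    # Restore the previous state of LC_ALL.
    if [ "$had_lc_all" -eq 1 ]; then
        LC_ALL=$saved_lc_all
    else
        unset LC_ALL
    fi
    return "$capture_status"
}

capture_raw printf 'line\n\n'
printf '%s' "$captured" | od -An -c
```

The save/restore dance around LC_ALL is the ugly part: it needs three extra variables (global ones here, since `local` is not POSIX) just to strip one byte.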


> There is no problem in UTF-8 where "x" will never appear as a valid
> trailing byte in multibyte characters. The StackExchange answer you
> linked to mentions the character encoding BIG5, GB18030 and
> BIG5HKSCS.

I tried to reproduce this and actually once thought I did so, but now I
cannot reproduce it anymore (that is: now it always just works for me,
regardless of the encoding).

However, I do get other quite weird results (all bash 5.1.8(1)):

1) UTF-8
********
$ LANG=C.UTF-8
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
$ s=$'\xa9'

$ a=$'\x61\x20\x74\x65\x73\x74\x20\x73\x74\x72\x69\x6e\x67\x20\xc3'
$ printf '%s\n' "$a" ; printf '%s' "$a" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
===> OK, here we seem to have a valid variable, but its content is invalid UTF-8

$ b=$a$s
$ printf '%s\n' "$b" ; printf '%s' "$b" | hd
a test string é
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3 a9  |a test string ..|
00000010
===> now the UTF-8 has become valid, so the shell itself seems to operate on
     bytes rather than characters?

$ c=${b%$s}
$ printf '%s\n' "$c" ; printf '%s' "$c" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
$ printf '%s\n' "${b%$s}" ; printf '%s' "${b%$s}" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
===> same here, shell seems to operate on bytes rather than characters
     also the original value has been restored,
     so at least in this single case, the trailing sentinel value works


So far, as expected. But now things get weird:


2) zh_TW.BIG5
*************
Repeating the same as above several times with BIG5, I get different
results, which I cannot reproduce reliably.
a) Sometimes it works:
$ LANG=zh_TW.BIG5
$ locale
LANG=zh_TW.BIG5
LANGUAGE=
LC_CTYPE="zh_TW.BIG5"
LC_NUMERIC="zh_TW.BIG5"
LC_TIME="zh_TW.BIG5"
LC_COLLATE="zh_TW.BIG5"
LC_MONETARY="zh_TW.BIG5"
LC_MESSAGES="zh_TW.BIG5"
LC_PAPER="zh_TW.BIG5"
LC_NAME="zh_TW.BIG5"
LC_ADDRESS="zh_TW.BIG5"
LC_TELEPHONE="zh_TW.BIG5"
LC_MEASUREMENT="zh_TW.BIG5"
LC_IDENTIFICATION="zh_TW.BIG5"
LC_ALL=
$ s=$'\xa9'
$ a=$'\x61\x20\x74\x65\x73\x74\x20\x73\x74\x72\x69\x6e\x67\x20\xc3'
$ printf '%s\n' "$a" ; printf '%s' "$a" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
$ b=$a$s
$ printf '%s\n' "$b" ; printf '%s' "$b" | hd
a test string é
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3 a9  |a test string ..|
00000010
$ c=${b%$s}
$ printf '%s\n' "$c" ; printf '%s' "$c" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
$ printf '%s\n' "${b%$s}" ; printf '%s' "${b%$s}" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f

===> here, unlike claimed in the article, it *does* work even for
BIG5... wasn't this supposed to not work?


But when I repeat this several times, every once in a while I get:
calestyo@heisenberg:~$ LANG=zh_TW.BIG5
calestyo@heisenberg:~$ locale
LANG=zh_TW.BIG5
LANGUAGE=
LC_CTYPE="zh_TW.BIG5"
LC_NUMERIC="zh_TW.BIG5"
LC_TIME="zh_TW.BIG5"
LC_COLLATE="zh_TW.BIG5"
LC_MONETARY="zh_TW.BIG5"
LC_MESSAGES="zh_TW.BIG5"
LC_PAPER="zh_TW.BIG5"
LC_NAME="zh_TW.BIG5"
LC_ADDRESS="zh_TW.BIG5"
LC_TELEPHONE="zh_TW.BIG5"
LC_MEASUREMENT="zh_TW.BIG5"
LC_IDENTIFICATION="zh_TW.BIG5"
LC_ALL=
calestyo@heisenberg:~$ s=$'\xa9'
calestyo@heisenberg:~$ a=$'\x61\x20\x74\x65\x73\x74\x20\x73\x74\x72\x69\x6e\x67\x20\xc3'
calestyo@heisenberg:~$ printf '%s\n' "$a" ; printf '%s' "$a" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
calestyo@heisenberg:~$ b=$a$s
calestyo@heisenberg:~$ printf '%s\n' "$b" ; printf '%s' "$b" | hd
a test string é
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3 a9  |a test string ..|
00000010
calestyo@heisenberg:~$ c=${b%$s}
calestyo@heisenberg:~$ printf '%s\n' "$c" ; printf '%s' "$c" | hd
a test string �
00000000  61 20 74 65 73 74 20 73  74 72 69 6e 67 20 c3     |a test string .|
0000000f
calestyo@heisenberg:~$ printf '%s\n' "${b%$'\xa9'}"
bash: ���~�����N: �b "${b%�}" ���S�����X���u}�v


No clue what happens here or why the final printf fails (exit status is
1)... but sometimes it just does.



> > But I couldn't reproduce their problems and for me the sentinel value
> > just worked, though I only tried this in a UTF-8 locale.
>
> As I've written already, UTF-8 doesn't have a problem.

Hmm, but isn't it already strange that, once the character has become
an é, one can remove the \xa9 from it again?



> > Can someone (Chet?) confirm that the solution with adding *any*
> > character and removing it later on works (i.e. with any locale and any
> > valid variable content, which is, AFAIU, anything but NUL)?
>
> Do you count misencoded strings as "valid" variable content? As far as
> the data is correctly encoded in the current LC_CTYPE, it should
> always work as expected.

Well, that's basically my question at the beginning of this mail: what
is a variable intended to contain?

This is especially important when one deals with pathnames. AFAIU e.g.
Linux filesystems don't specify any encoding at all, and filenames are
just arbitrary bytes except NUL (and "/").
Whether these are then interpreted as UTF-8, or according to the current
locale, or as something else is up to the respective program.

So basically, any bytestring could occur.
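
As a small illustration of that, the sketch below creates a file whose name ends in a lone 0xc3 byte (invalid as UTF-8) and reads the name back through a glob. It assumes a filesystem that accepts arbitrary bytes in names, as typical Linux filesystems do, and uses `mktemp -d`, which is widespread but not specified by POSIX.

```sh
# Work in a scratch directory so nothing else is touched.
dir=$(mktemp -d) || exit 1

# "caf" followed by a lone 0xc3 byte: not valid UTF-8.
name=$(printf 'caf\303')
: > "$dir/$name"

# Read the name back through pathname expansion and dump its bytes.
for path in "$dir"/*; do
    base=${path##*/}
    printf '%s' "$base" | od -An -tx1
done

rm -rf "$dir"
```

If the name survives the trip through the variable byte for byte, the shell is at least able to store such content, whatever the current locale thinks of it.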



> > Or does this work with just some characters like claimed in some posts
> > on stackoverflow?
>
> Another StackExchange answer says that "x" is affected but "." isn't
> affected (as far as the answering person tried in Debian, FreeBSD, and
> Solaris), but this is not really a robust statement.

No, it isn't... which is also the reason why I brought this up here and
asked for a proper solution.


> > Does anyone know whether this is just a feature of bash or works in any
> > sh compatible shell?
>
> In the StackExchange answer you provided, it is mentioned that it
> fails with zsh (though it is also reported in the comment that zsh
> doesn't fail). It is also mentioned that the LC_ALL workaround doesn't
> work in yash.

But even the LC_ALL workaround doesn't work for me, in the sense that
even without it I don't see a problem ^^


Thanks,
Chris



