help-bash
Re: any plans for command substitution that preserves trailing newlines?


From: Koichi Murase
Subject: Re: any plans for command substitution that preserves trailing newlines?
Date: Wed, 2 Jun 2021 08:41:00 +0900

On Wed, 2 Jun 2021 at 6:03, Christoph Anton Mitterer <calestyo@scientia.net> wrote:
>
> Hey again.
>
> First perhaps, in the sense of a shell variable - what exactly is its
> content respectively a string?

I haven't checked how it is defined in POSIX, but I suspect it is
not explicitly defined. For example, zsh variables can contain NUL
characters, but Bash variables cannot. So it is at least
implementation-dependent.
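
A minimal sketch of the Bash side of this, assuming only a shell with a
printf that understands the \0 escape. The variable name v is
illustrative. Bash discards the NUL byte while reading the command
substitution, so only the two surrounding bytes end up in the variable:

```shell
# printf emits three bytes: 'a', NUL, 'b'. Bash drops the NUL while reading
# the command substitution; recent versions print a warning on stderr,
# which is silenced here by redirecting the whole group.
{ v=$(printf 'a\0b'); } 2>/dev/null

# Print how many characters were actually stored.
printf '%s\n' "${#v}"
```

In Bash this prints 2: the value is "ab" and the NUL is gone.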

> 3.375 String says:
> but there are also definitions for wide strings (3.445 Wide-Character
> String) where things get character-based instead of byte-based.

I think XBD 3.375 and 3.445 talk about strings in the C language.

> 3.267 Parameter doesn't say whether the content is a string in the
> sense of 3.375 or a 3.92 Character String.

If you are interested, you should check XCU 2 (Shell Command Language).
In particular, XCU 2.5.3 may be relevant. It says that shell variables
are initialized from the environment, so if we required every
environment variable to be reflected exactly in a shell variable, we
could argue that shells must support any value an environment variable
can hold (i.e., any null-terminated byte sequence). But the standard
doesn't say that the values of environment variables must be carried
over faithfully into the values of shell variables. Nor does it require
every environment variable to be translated. For example, it is clear
that an environment variable whose name does not have the form of a
NAME (i.e., XBD 3.235 Name) cannot be turned into a shell variable.

> Can a variable just hold a string that is valid in the current encoding
> (and what then if the encoding is changed) or is it rather a binary
> string (except NUL), and the actual interpretation only happens when
> e.g. printed to the console?

It may depend on the implementation. Bash variables can hold
misencoded strings. Zsh variables can hold any binary data,
including NUL. I don't know of an example, but I wouldn't be
surprised if there were a shell in which misencoded data cannot be
assigned to shell variables.
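
As a rough illustration of the Bash behavior, the following sketch
stores a single byte 0xE9, which is not a complete UTF-8 sequence, and
dumps what the variable holds. The names b and hex are illustrative,
and od and tr are assumed to be the usual POSIX utilities:

```shell
# \351 (octal) is byte 0xE9: "é" in ISO-8859-1, but an incomplete
# sequence when the locale's encoding is UTF-8.
b=$(printf '\351')

# Dump the stored bytes as hex to check that the value was kept unchanged.
hex=$(printf '%s' "$b" | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"
```

In Bash the dump shows the single byte e9, i.e. the misencoded value is
stored as-is.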

> On Tue, 2021-06-01 at 11:55 +0900, Koichi Murase wrote:
> > It seems the solution is also given there; set temporary LC_ALL=C
>
> Which, if it would work - and it doesn't seem to for me -

In what sense do you think it fails? There may be some shell
implementations in which this workaround doesn't work (e.g. yash), but
LC_ALL=C should work in most shells, as described on the original Stack
Exchange page. I think LC_ALL=C is the most straightforward and
natural solution, and it should work in any sensible implementation.
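
For reference, a minimal sketch of the "x" hack combined with the
temporary LC_ALL=C, written under the assumption of a POSIX-style
shell. The variable names (out, saved_lc_all, had_lc_all, len) are
illustrative, and the printf stands in for whatever command is being
captured:

```shell
# Append a sentinel "x" so the newlines are no longer trailing and
# therefore survive the command substitution.
out=$(printf 'line\n\n'; printf x)

# Remember whether LC_ALL was set, since "unset" and "set but empty"
# are different states and should be restored exactly.
if [ "${LC_ALL+set}" = set ]; then
    saved_lc_all=$LC_ALL
    had_lc_all=1
else
    had_lc_all=0
fi

# Strip the sentinel bytewise, so that "x" can never be taken as the
# trailing byte of a multibyte character.
LC_ALL=C
out=${out%x}

# Restore the previous locale setting.
if [ "$had_lc_all" = 1 ]; then
    LC_ALL=$saved_lc_all
else
    unset LC_ALL
fi

# Count the bytes to confirm that both newlines were kept.
len=$(( $(printf '%s' "$out" | wc -c) ))
printf '%s\n' "$len"
```

The count is 6: four bytes for "line" plus the two newlines. Note that
in this form the exit status of the captured command is replaced by
that of the final printf.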

> would be quite ugly.

How is it ugly? Whether it is ugly or not is a subjective matter, so
you need to explain in what sense it is "ugly" (or more correctly,
problematic). Anyway, workarounds are always "ugly" to some extent,
including the "x" hack.

> > There is no problem in UTF-8 where "x" will never appear as a valid
> > trailing byte in multibyte characters. The StackExchange answer you
> > linked to mentions the character encoding BIG5, GB18030 and
> > BIG5HKSCS.
>
> I tried to reproduce this and actually once thought I did so, but now I
> cannot reproduce it anymore (that is: now it always just works for me,
> regardless of the encoding).
>
> However, I do get other quite weird results (all bash 5.1.8(1)):

Are you talking about Bash, or about POSIX shells in general? The
original Stack Exchange page mainly discusses a portable approach that
works in various shells. I haven't tried these behaviors myself,
but have you tried other shells?

> 1) UTF-8
> ********
>
> [...]
>
> So far, as expected.
> but now things get weird:
>
>
> 2) zh_TW.BIG5
> *************
>
> [...]
>
> ===> here, unlike claimed in the article, it *does* work even for
> BIG5... wasn't this supposed to not work?

That is also already described on the Stack Exchange page. It says
the approach works in ash, bash, lksh, and mksh, but not in ksh and
zsh. When a solution fails in some cases, we say "the solution doesn't
work". The phrase "doesn't work" doesn't mean that it fails in every
case.

> But when I repeat this several times,.. every once in a while I get:
>
> [...]
>
> calestyo@heisenberg:~$ printf '%s\n' "${b%$'\xa9'}"
> bash: ���~�����N: �b "${b%�}" ���S�����X���u}�v
>
>
> No clue what happens here or why the final printf fails (exit status is
> 1)... but sometimes it just does.

I don't know, but I guess ${b%$'\xa9'} is parsed as ${b%<broken byte>}
in the first pass of processing, and then, when the parameter expansion
is evaluated, Bash cannot find the closing "}" because the decoder
pairs "<broken byte>}" into a single character. What happens when you
run the following commands under zh_TW.BIG5? (I'm reluctant to
install new locales on my machine.)

$ b=$'\xc3\xa9'
$ shopt -s extquote
$ func() { printf '%s\n' "${b%$'\xa9'}"; }
$ func
$ func2() { printf '%s\n' "${b%$'\xa9'}}"; }
$ func2

> As I've written already, UTF-8 doesn't have a problem.
>
> Hmm, but isn't it strange already, that once the character became an é
> one can remove an \xa9 from it again?

Once you have provided misencoded data (i.e. \xa9), I think the
behavior is undefined, i.e., one cannot complain no matter what
happens as a result. For this reason, one should set LC_ALL=C,
which makes the shell operate on byte sequences, where every byte
string is well-defined and accepted.
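
A small sketch of what operating on byte sequences means in practice.
It assumes the two-byte UTF-8 encoding of "é" (0xC3 0xA9) and uses
illustrative variable names. LC_ALL=C is set inside the command
substitutions, which run in subshells, so the outer locale is left
untouched:

```shell
# The UTF-8 encoding of "é": bytes 0xC3 0xA9 (octal 303 251).
s=$(printf '\303\251')

# Under LC_ALL=C the length is counted in bytes, not characters.
byte_len=$(LC_ALL=C; printf '%s' "${#s}")

# Under LC_ALL=C a single trailing byte can be removed, leaving 0xC3.
rest_hex=$(LC_ALL=C
           t=$(printf '\251')
           printf '%s' "${s%"$t"}" | od -An -tx1 | tr -d ' \n')

printf '%s %s\n' "$byte_len" "$rest_hex"
```

This prints "2 c3": the string is treated as two independent bytes, and
removing the last one is a well-defined operation.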

> Do you count misencoded strings as "valid" variable content? As far as
> the data is correctly encoded in the current LC_CTYPE, it should
> always work as expected.
>
> Well that's basically my question in the beginning of that mail: What
> is a variable intended to contain?

It depends on the shell implementation.

> This is especially important when one takes pathnames. AFAIU e.g. Linux
> filesystems don't specify any encoding at all and filenames are just
> any bytes except NUL.
> Whether these are then interpreted as UTF-8 or according to the current
> locale or something else is up to the respective program.
>
> So basically, any bytestring could occur.

Yes, so LC_ALL=C.

> > Does anyone know whether this is just a feature of bash or works in
> > any
> > sh compatible shell?
>
> In the StackExchange answer you provided, it is mentioned that it
> fails with zsh (though it is also reported in the comment that zsh
> doesn't fail). It is also mentioned that the LC_ALL workaround doesn't
> work in yash.
>
> But even the LC_ALL workaround doesn't work for me - in the sense that
> even without I don't see a problem ^^

We say "it works" to describe that situation. Also, the LC_ALL=C
workaround is for POSIX shells.

--
Koichi


