help-bash

Re: any plans for command substitution that preserves trailing newlines?


From: Christoph Anton Mitterer
Subject: Re: any plans for command substitution that preserves trailing newlines?
Date: Wed, 26 Jan 2022 00:31:39 +0100
User-agent: Evolution 3.42.2-1

Hey.

Coming back to that topic... mostly for the record (in case anyone else
ever stumbles over this).


The following was pointed out[0] on another mailing list, namely that:

Using . or / as sentinel value should be generally fine (even without
setting LC_ALL=C), as POSIX requires:

- “The encoded values associated with <period>, <slash>, <newline>, and
  <carriage-return> shall be invariant across all locales supported by
  the implementation.”
  => which means, AFAIU, that these will have the same binary
     representation in any locale/encoding.
- “Likewise, the byte values used to encode <period>, <slash>,
  <newline>, and <carriage-return> shall not occur as part of any
  other character in any locale.”
  => which means, AFAIU, that an invalidly encoded character plus the
     sentinel can never together form a valid character (which would
     prevent the sentinel from being stripped off), as no partial byte
     sequence can be completed into a valid character by these bytes
     in any locale/encoding.
(see 6.1 Portable Character Set [1])


So if that holds true... simply appending . or / as sentinel within the
command substitution, and removing that afterwards (without any need
for locale changes) should *always* work, regardless of the
locale/encoding.
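
For the record, a minimal sketch of what that would look like in plain
POSIX sh. The function name "capture" and the variables "captured" and
"rc" are made up for illustration; nothing here is taken from the
referenced threads, and it assumes the POSIX guarantees quoted above
actually hold for the shell and locale in use.

```shell
# Hypothetical helper: run a command and store its complete stdout,
# including any trailing newlines, in the variable "captured".
capture() {
    # Append "." inside the substitution so that the output no longer
    # ends in newlines; pass the command's exit status out of the subshell.
    captured=$("$@"; rc=$?; printf .; exit "$rc")
    rc=$?
    # Strip the sentinel again; no locale change is made for this step.
    captured=${captured%.}
    return "$rc"
}
```

E.g. after `capture printf 'a\n\n'` the variable holds "a" followed by
both newlines, whereas a plain `$(printf 'a\n\n')` would give just "a".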

Can anyone confirm this?



@Koichi, with respect to your replies back then (especially your
comments about ISO/IEC 2022):


On Tue, 2021-06-01 at 11:55 +0900, Koichi Murase wrote:
> It seems the solution is also given there; set temporary LC_ALL=C
> (though it is pointed out that this doesn't work with yash).

I found several more shells that seem not to support changing LC_ALL
at runtime (at least not with any effect on the shell itself): [2], [3]


> There is no problem in UTF-8 where "x" will never appear as a valid
> trailing byte in multibyte characters.

But AFAIU, command substitution is defined to capture any stdout (i.e.
also invalidly encoded stuff), except for NUL and trailing newlines.
So UTF-8 itself has no problem, but there is no guarantee that the
command generates only valid UTF-8.
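
A quick way to probe this case in a given shell and locale might look
like the following. It is only a sketch: the variable names are
arbitrary, and a passing run only says something about the shell and
locale it was run in, not about the general claim.

```shell
# Emit a lone UTF-8 lead byte (0xC3, octal 303), which is not a valid
# character on its own, followed by the "." sentinel.
out=$(printf '\303'; printf .)
# Remove the sentinel without touching the locale.
out=${out%.}
# Dump the remaining bytes as hex, with od's padding removed.
hex=$(printf %s "$out" | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"
```

If the sentinel was stripped cleanly and the stray byte was left alone,
this should print just c3.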



> but "." isn't
> affected (as far as the answering person tried in Debian, FreeBSD,
> and Solaris), but this is not really a robust statement.

It became more robust now with what Thorsten Glaser pointed out.


However, I have no idea how these POSIX requirements relate to what
you wrote back then:

>  In theory,
> ISO/IEC 2022 encoding allows to change the meaning of C0 (\x00-\x1F),
> GL (\x21-\x7E), C1 (\x80-\x9F), and GR (\xA0-\xAF) by locking shift
> escape sequences. In particular, all the bit combinations (i.e. bytes)
> in GL which contain ASCII "." and "x" can be used for trailing bytes
> of 94^n character sets (such as LC_CTYPE=ja_JP.ISO-2022-JP). The only
> two bit-combinations that are unaffected by the ISO/IEC 2022 shifts
> are SP (space \x20) and DEL (^? or \x7F). But actually, the encodings
> that are fully ISO/IEC 2022 have hardly used as user locales because
> most utilities have problems in dealing with such context-dependent
> encoding schemes.

Would that "shifting" simply not be allowed in a POSIX compliant
shell/locale/encoding?


Cheers,
Chris.



[0] https://lists.zytor.com/archives/klibc/2022-January/004659.html
[1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
[2] https://lists.zytor.com/archives/klibc/2022-January/004657.html
[3] https://lore.kernel.org/dash/e312d45e17b49c418c3a62a56da758977067b563.camel@scientia.org/T/#u


