Re: any plans for command substitution that preserves trailing newlines?


From: Christoph Anton Mitterer
Subject: Re: any plans for command substitution that preserves trailing newlines?
Date: Fri, 28 Jan 2022 01:29:36 +0100
User-agent: Evolution 3.42.2-1

(guess the actual question is already resolved :-) ... most of what I
write below is rather trying to understand more how it's done
internally)


On Thu, 2022-01-27 at 10:39 -0500, Chet Ramey wrote:
> > They're also listed in  2.5.3 Shell Variables[0]:
> > 
> > > The following variables shall affect the execution of the shell:
> > 
> > which I'd have interpreted as "during runtime"?
> 
> Sure, if you inherit them from the environment. There's no
> requirement that
> the shell update its idea of the locale based on assignments.

I agree insofar as it's not said explicitly, but doesn't it kinda
follow from:
a) all the other variables there, which AFAIU clearly have a 1:1
   binding between variable value and whatever the shell internally
   thinks is going on, especially:
   PWD  PATH  IFS
   Sure, the shell could internally set something different from what
   these contain, but it would kinda defeat their purpose.
b) the wording, e.g.:
   "LC_CTYPE Determine the interpretation..."
   If that variable is the "thing" that determines the property, it must
   be bound 1:1 to the actual state; otherwise it wouldn't determine it,
   or the shell would at least have to "act" and immediately set it
   again whenever the two differ.


> I think you missed the part below where I agreed with that, as long
> as you
> just do it around the expansion that strips the final byte.
> 
> If you want to modify it earlier, you have to check whether or not
> it's
> exported because you'll affect the execution environment of programs
> the
> shell invokes.

No, that part ("just around the expansion") I got... and it's also
clear that if I run any command in between, *then* I need to take
care of the exported status...
It's something else that confuses me ever since you wrote it, but more
on that below.



> 
> > AFAIU, there is a subtle difference between the LANG/LC_* shell
> > variables on the one side  and  setlocale() respectively the
> > process'
> > real "internal" locale state on the other side.
> 
> I think the difference is in what the system considers to be the
> "default
> locale."

Which AFAIU is implementation defined, right?

TBH, I didn't even fully understand from the manpage what e.g. glibc
does if *nothing* (no LANG/LC_*/etc) is set.
I.e. what if one calls setlocale(LC_ALL, ""); but no env vars are set?

I'd guess then these apply:
 > If its value is not a valid locale specification, the locale is
 > unchanged, and setlocale() returns NULL.
plus:
 > On  startup of the main program, the portable "C" locale is
 > selected as default.

From that I'd deduce that, for glibc, the default/"native" locale (if
no envvars are set) would be "C"?

But in principle an implementation would be free to say that the
default locale is set in /etc/defaultlocale or that it depends on the
passwd GECOS field of a user and where that user lives.


So effectively, *within* the shell (e.g. a script) there is no
guaranteed way to determine the original state of the locale.

Right?


> > - With the shell variables, both are stored, the default/overriding
> >    LANG/LC_ALL as well as the "real" categories LC_* (all but ALL).
> > 
> > - With setlocale() however, LC_ALL means basically just to go over
> > each
> >    "real" category and set them,... so only the "real" categories
> > are
> >    stored and internally LC_ALL isn't kept.
> > 
> > Right so far?
> 
> I see what you mean, for some value of "kept."

"kept" in the sense as to what the actual locale state of each non-
LC_ALL category is set to.

Whereas LC_ALL isn't a "real" category, but just some "do it for all"
special value (with setlocale() - unlike with the env vars).

Or at least that's what I'd have read from POSIX: "Setting all of the
categories of the global locale is similar to successively setting each
individual category of the global locale..."



> 
> > Now for any shell (that supports locales in a proper/sane way):
> > 
> > - When the shell starts it sets its default locale (for each category)
> >    in some implementation defined manner. E.g. by calling
> >    setlocale(LC_ALL, "").
> >    But it doesn't have to set the real values into any of the LANG/LC_*
> >    shell variables. If any of these is there, then because it was in
> >    the environment.
> 
> Correct.
> 
> >    At least glibc seems to only use LC_ALL, LC_* and LANG (in that
> >    order) for that, so most likely some combination of them *is*
> >    actually set in the environment and thus also as shell variable.
> 
> Not always.

Okay, but even if not (which would again also mean, that one cannot
save the original state in the script for later restoring):

If in the shell, any of LANG/LC_* is changed (set to a(nother) value or
unset)... e.g. when unsetting LC_ALL after stripping off the sentinel,
then it has to:

- call setlocale(category, "")
  with category being the same as the shell variable that was
  just changed (and when LANG was changed, I'd guess LC_ALL must be
  used)

  Using "" as value should, AFAIU, automatically consider all the
  LANG/LC_* and ultimately fall back to the implementation defined
  default (which might e.g. be C).
        
And that should already guarantee, that if the LANG/LC_* shell
variables are restored to the original state (in terms of set/unset and
value)... the locale state should be back to what it was before - even
without knowing what it originally was.


Does that seem right?


There's just one aspect which I don't understand yet:
Further below you wrote before that shells don't update their own
environment with the values of their own LANG/LC_* shell variables
("Nobody does that.").
If so, how can the setlocale(foo, "") call know what the current
LANG/LC_* is? AFAIU it takes these from the env vars?



> >   I couldn't find what happens when for a category, no value can be
> >   determined (e.g. LANG, LC_ALL and LC_CTYPE unset)... but I guess
> > it
> >    falls back to "C"?!
> 
> "the empty string "" (which denotes the native environment)"

Sure, I understood that... what I couldn't find for glibc was what it
considers as "native environment"... and I guess *that* would be "C"
(for glibc).


> Not necessarily. If you don't do anything -- no setlocale() call --
> the
> locale starts as "C" and stays there. But if you use "" as the locale
> argument to setlocale(), you get the "native environment" in the
> absence
> of any environment variables. You could, for instance, set that
> native
> environment, or at least a native preferred language, in some
> preference
> pane.

See the beginning above, where I've already tried to find out what
glibc(!) specifically considers as "native".

"native environment" is from the POSIX wording of setlocale().
For glibc the best I could read out of the manpage is that it would be
"C", though it doesn't directly say that.

But:
$ unset LANG
$ set | grep -E '^(LANG|LC_)'
$ export LC_ALL="en_US.UTF-8"
$ locale
LANG=
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
$ unset LC_ALL 
$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

that seems to fit.


> > - Whenever one sets/unsets any LANG/LC_* (shell) variable, a shell
> >    has to call setlocale(category, localevalue).
> > 
> >    And for localevalue it has to use the right value from the
> > *shell*
> >    variables:
> >    1st from LC_ALL
> >    2nd from LC_<the one that was set>
> >    3rd from LANG
> > 
> >    (*) When LC_ALL was unset, it has to do it for every "real"
> > category,
> >    with 2nd and 3rd.
> 
> Right so far.

(**)

And I'd assume that if the shell has none of LANG/LC_* in its
shell variables, then it simply takes "" ... and then it's up to the
implementation to set the default/"native" locale?
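
The precedence discussed here (LC_ALL first, then the category's own
variable, then LANG, and finally "") can be written down as a small
sketch. effective_locale is a hypothetical helper name, not something
bash or POSIX provides. It only inspects *shell* variables and never
calls setlocale() itself, so it says nothing about what a given shell
really does internally. Following POSIX, a variable that is set but
empty is treated like an unset one.

```shell
# Hypothetical helper: print the locale name that the precedence rules
# select for one category, judging only by the current shell variables.
# An empty line means "none set", i.e. the implementation-defined default.
effective_locale() {
    # Only accept the POSIX category names, so the eval below is safe.
    case $1 in
        LC_CTYPE|LC_COLLATE|LC_MONETARY|LC_NUMERIC|LC_TIME|LC_MESSAGES) ;;
        *) return 2 ;;
    esac

    # 1st: LC_ALL, if set and non-empty.
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
        return 0
    fi

    # 2nd: the category's own variable, if set and non-empty.
    eval "el_value=\${$1-}"
    if [ -n "$el_value" ]; then
        printf '%s\n' "$el_value"
    else
        # 3rd: LANG; empty output if that is unset or empty as well.
        printf '%s\n' "${LANG-}"
    fi
    unset -v el_value
}
```

For example, `effective_locale LC_CTYPE` prints the name that would be
handed to setlocale(LC_CTYPE, ...) under these rules, or an empty line
if the "native" default applies.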


> >    (unless it updated its own environment before then it could
> > simply
> >    use "")?
> 
> Nobody does that.

But... if the above (**) is right... wouldn't it have to at least clear
the LANG/LC_* from its env, so that when they had all been unset as
shell variables and "" is used with setlocale, the latter wouldn't take
them from the env?


> As I said: "The only thing you really need to do is to set and reset
> LC_ALL
> around the single assignment statement that removes the last byte
> from the
> string."

I assume you mean this only with respect to bash, but I'd expect that
with any other shell that handles locales in a "proper" way... it would
work the same way, right?


From that reply of yours in the last mail I already suspected that we
just had some misunderstanding.

I was confused by:

> So if you want to temporarily control the locale a command
> substitution, or any program the shell runs, gets, you have to save,
> set, export, and then optionally reset all the variables you care
> about.

Which made me think, that even in my use case, I need to reset *all* of
the variables (i.e. even those that I don't touch).


> You don't need to mess with setting LC_ALL to anything earlier in the
> script, and you don't need to worry about hypotheticals like the
> shell
> doing some character conversion on assignment. Nor do you need to
> worry
> about the effect of adding a byte to some incomplete multibyte
> character.

So long story short:


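# printf (not the non-portable print) emits the sentinel; exit with the
# saved status of the command, not with that of printf
result="$(command ; e=$?; printf '.' ; exit "$e")"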

#optionally error out if OLD_LC_ALL is already set
unset -v OLD_LC_ALL ; [ "${LC_ALL+is_set}" ] && OLD_LC_ALL="${LC_ALL}"

LC_ALL=C
result="${result%.}"

[ "${OLD_LC_ALL+is_set}" ] && LC_ALL="${OLD_LC_ALL}" || unset -v LC_ALL


Should be *the* solution which works in any shell (and there in any
scope, global or function) that handles locales in a sane way... with
any locale.
And it should in principle also work with other sentinels than '.',
too, but sticking to either '.' or '/' seems still pretty reasonable to
me.

Anyone disagreeing? ;-)
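
For reuse, the same steps can be wrapped in a function. This is only a
sketch under a few assumptions: capture_exact, CAPTURE_RESULT and the
ce_* variables are hypothetical names chosen for illustration, the
result goes into a fixed global variable to avoid eval, and the shell
is assumed to handle LC_ALL assignments as discussed in this thread.

```shell
# Hypothetical helper: run "$@", store its complete stdout (including
# any trailing newlines) in CAPTURE_RESULT, and return its exit status.
capture_exact() {
    # The '.' sentinel leaves command substitution no trailing
    # newlines to strip; the command's own status is passed through.
    CAPTURE_RESULT=$("$@"; ce_status=$?; printf '.'; exit "$ce_status")
    ce_status=$?

    # Remember whether LC_ALL was set at all, and to which value.
    unset -v ce_old_lc_all
    if [ "${LC_ALL+is_set}" ]; then
        ce_old_lc_all=$LC_ALL
    fi

    # Strip the sentinel byte-wise, only around this one expansion.
    LC_ALL=C
    CAPTURE_RESULT=${CAPTURE_RESULT%.}

    # Put LC_ALL back into its previous set/unset state.
    if [ "${ce_old_lc_all+is_set}" ]; then
        LC_ALL=$ce_old_lc_all
    else
        unset -v LC_ALL
    fi
    unset -v ce_old_lc_all

    return "$ce_status"
}
```

Usage would look like `capture_exact printf 'a\n\n'`, after which
"$CAPTURE_RESULT" still carries both newlines.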



> I skimmed your message to the austin-group mailing list, and I don't
> really see any of these concerns as making a difference.

Saw it, and thanks for your replies there as well.

I'd also say they make no big difference. Question (1) there was mainly
just trying to understand whether one could bail out of the whole
locale stuff (which would make the solution a bit easier),...

But I already suspected that, as Koichi said, one could *not* just
trust in the stripping to work "properly" (or rather "as one would
wish") if the string contains some encoding that is invalid in the
current locale... even if the rightmost character would be invariant
and not allowed to be part of any other encoding.

Question (3) was mostly taking the opportunity and asking for something
semi-related, which I always wanted to know.
And as you saw... even amongst the experts, it doesn't seem fully
clear, what POSIX actually mandates there.


> > Well, I mean from inside.
> 
> Sure. Interrogate the state of the relevant shell variables and apply
> the
> appropriate precedence rules. If none are set, run `locale' and parse
> its
> output, for example
> 
> locale | sed -n 's/^LANG="\(.*\)"/\1/p'
> 
> That will give you a pretty good idea of the native environment.

But that would depend on env (which is AFAIK not a special built-in)
using the same libc implementation as the shell, right?

Otherwise, if neither of LANG/LC_* is set, the shell's call of
setlocale(foo, "") could result in one native locale ... while env's
implementation of setlocale() with another libc might use something
different?

Okay... I'm pedantic, sorry ;)


> > Uhm, I found no portable way to get the export state.
> 
> Parse the output of `export'.

I looked at that, but:
a)
e.g. bash uses a format like this:
declare -x DESKTOP_SESSION="cinnamon"
declare -x DISPLAY=":0"

whereas POSIX would mandate:
export DESKTOP_SESSION='cinnamon'
export DISPLAY=':0'

Also,... what if a variable like:
export FOO=$'\nexport LC_ALL=bar'
would be in it? I couldn't differentiate between what's really a
variable and what's just a value.
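
One possible way around parsing `export` at all is to ask a child
shell whether it sees the variable. This is a different technique from
the one suggested above, and only a sketch: is_exported is a
hypothetical name, `sh` is assumed to be a POSIX shell found in PATH,
and the check only makes sense for variables the child shell doesn't
set on its own (which, as far as I know, holds for LANG/LC_*, but not
for e.g. PWD). It also costs one extra process per check.

```shell
# Hypothetical helper: succeed if the named variable is exported,
# i.e. if a child process finds it in its environment.
is_exported() {
    # Reject anything that is not a plain variable name, because the
    # name gets embedded into the command string below.
    case $1 in
        ''|[0-9]*|*[!A-Za-z0-9_]*) return 2 ;;
    esac

    # The child only sees exported variables; ${NAME+x} expands to "x"
    # if NAME is set there, regardless of what the value contains.
    sh -c "[ \"\${$1+x}\" ]"
}
```

Since only set/unset is tested, a value such as
$'\nexport LC_ALL=bar' cannot be mistaken for a variable of its own.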



Well,... once again,... so many thanks for your help :-)

Cheers,
Chris.


