bug-bash

Re: quote removal issues within character class


From: Oğuz
Subject: Re: quote removal issues within character class
Date: Sat, 9 Nov 2019 16:45:31 +0200

You've already answered it, thank you. I didn't know that [:, [., [= were
special *sequences*, I guess I overlooked that part. Thanks again for
taking the time to explain it in detail, I'm grateful.


On Saturday, 9 November 2019, Robert Elz <kre@munnari.oz.au> wrote:

>     Date:        Sat, 9 Nov 2019 07:35:16 +0300
>     From:        Oğuz <oguzismailuysal@gmail.com>
>     Message-ID:  <CAH7i3Lr68CiVXLR9_HoOgQa7Vd-zyVZ+fck-0K3uQPTNSirU2Q@mail.gmail.com>
>
>   | > is correct, as "foo" does not contain a ']' which would be required
>   | > to match there (quoting the ':' means there is no character class,
>   | > hence we have instead (the negation of) a char class containing '[' ':'
>   | > 'l' 'o' 'w' 'e' 'r' (and ':' again), preceded by anything, and
>   | > followed by ']' and anything.   foo does not match. f]oo would.
>   | >
>   |
>   | where exactly is this documented in the standard?
>
> I'm not sure which part exactly you're looking for, but char sets in sh
> are specified to be the same as in REs, except that ! replaces ^ as the
> negation character (that's in XCU 2.13.1).  Char sets (bracket expressions)
> in RE's are documented in XBD 9.3.5 wherein it states
>
>         A bracket expression is either a matching list expression or a
>         non-matching list expression. It consists of one or more
>         expressions: ordinary characters, collating elements, collating
>         symbols, equivalence classes, character classes, or range
>         expressions. The <right-square-bracket> (']') shall lose its
>         special meaning and represent itself in a bracket expression if
>         it occurs first in the list (after an initial <circumflex> ('^'),
>         if any).
>
>         Otherwise, it shall terminate the bracket expression,
>
> That is, a ']' that occurs anywhere else terminates the bracket expression
> except:
>
>         unless it appears in a collating symbol (such as "[.].]")
>
> (not relevant in the given example)
>
>         or is the ending <right-square-bracket> for a collating symbol,
>         equivalence class, or character class.
>
> So the ']' that immediately follows the second ':' would not terminate the
> bracket expression if it is the ending ']' for a character class
> (collating symbols and equiv classes not being relevant to the example).
> Of course, that can only happen if there is a character class to end.
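
The rule quoted above, that a ']' placed first in the list is literal, can be
tried in a shell pattern. This is only a minimal sketch, assuming a POSIX-style
sh; the function names bracket_first and not_bracket_first are made up for
illustration, and '!' is used for negation as in XCU 2.13.1.

```shell
# Hypothetical helper: returns 0 when $1 is one of the two characters
# ']' or 'a'. The ']' comes first in the list, so it is literal.
bracket_first() {
    case $1 in
        []a]) return 0 ;;
        *)    return 1 ;;
    esac
}

# Hypothetical helper: the negated form. The ']' directly after '!' is
# still literal, so this matches any single character except ']' and 'a'.
not_bracket_first() {
    case $1 in
        [!]a]) return 0 ;;
        *)     return 1 ;;
    esac
}

for c in ']' a b; do
    if bracket_first "$c"; then
        echo "$c: in list"
    else
        echo "$c: not in list"
    fi
done
```

In both patterns the bracket expression only ends at the second ']', since the
first one has lost its special meaning.
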
>
> There's also
>
>         The special characters '.', '*', '[', and '\\'
>         (<period>, <asterisk>, <left-square-bracket>, and <backslash>,
>         respectively) shall lose their special meaning within a bracket
>         expression.
>
> whereupon if the [": sequence does not start a char class, the '[' there
> is simply a literal char inside the bracket expression.
>
> Similarly if the bracket expression ends at the first ']' (the one
> immediately after the second ':') the following ']' is simply a literal
> character, as ']' chars are special only when following a '['.
>
> So, all that's left to determine is whether the [": sequence can be
> considered as beginning a char class.
>
> In a RE it certainly cannot - quote chars (' and ") are not special in
> REs at all, and [": is no different syntactically than [x: which no-one
> would treat as being the introduction to a char class.
>
> This is also, I believe (Chet can confirm, or refute, if he desires) where
> bash gets the interpretation that "lower" (including the quotes) is the
> name of the char class in [:"lower":] except that it cannot be, as char
> class names cannot contain quote characters (which should lead to the
> whole sub-expression not being treated as a char class at all, instead
> bash treats it, I think, as if it were an unknown but valid class name).
>
> But when it comes from sh, quote chars are "different" and instead of
> just being characters, they instead affect the interpretation of the
> characters that are quoted.  See XCU 2.2:
>
>         Quoting is used to remove the special meaning of certain characters
>         or words to the shell.
>
>         Quoting can be used to preserve the literal meaning of the special
>         characters in the next paragraph [...]
>
>         and the following may need to be quoted under certain circumstances.
>         That is, these characters may be special depending on conditions
>         described elsewhere in this volume of POSIX.1-2017:
>
>                 * ? [ # ~ = %
>
> to which more chars have been added (as I recall) recently by some
> Austin Group correction (which I think includes ! : - and ]), that is
> to make it clear, that in sh
>
>                 [a'-'z]
>
> is a bracket expression containing 3 chars 'a' '-' and 'z' (which form
> of quoting is used to remove the specialness of the '-' is irrelevant),
> and that "[a-z]" isn't a bracket expression at all (neither of which
> is true in an RE - though the role of \ in RE's is being altered slightly
> so if it had been [a\-z] in a RE things are less clear.)
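
The two sh cases just described can be checked with case patterns. This is a
rough sketch, assuming a POSIX-style sh; the helper names in_set and
is_literal are invented for this example and are not from the standard.

```shell
# Hypothetical helper: returns 0 when $1 matches [a'-'z]. The quoted '-'
# is a literal hyphen, so the list holds exactly 'a', '-' and 'z'.
in_set() {
    case $1 in
        [a'-'z]) return 0 ;;
        *)       return 1 ;;
    esac
}

# Hypothetical helper: returns 0 only for the literal five-character
# string "[a-z]", because a fully quoted pattern has no bracket expression.
is_literal() {
    case $1 in
        "[a-z]") return 0 ;;
        *)       return 1 ;;
    esac
}

for c in a - z m; do
    if in_set "$c"; then
        echo "$c: match"
    else
        echo "$c: no match"
    fi
done
```

Here 'm' is rejected because no range is formed, while is_literal accepts only
the string [a-z] itself and not, say, the single character b.
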
>
> The effect of this is that in sh, in an expression like
>
>         [![":lower":]]
>
> the first ':' is not "special" and hence cannot form part of the
> magic opening '[:' sequence for a character class.   Hence this
> expression contains no character class, and consequently the
> ':]' chars are simply a ':' in the bracket expression, and then
> the terminating ']' - which leaves the second ']' being just a
> literal character.
>
>
> While here (these following parts are not relevant to your question I
> believe) when used in sh
>
>         [[:"lower":]]
>
> should be treated just the same as
>
>         [[:lower:]]
>
> for the same reason that
>
>         ["abc"]
>
> is treated the same as
>
>         [abc]
>
> That is, quoted characters that are not special are no different
> than the same character unquoted.    That's universal in sh, quoting
> removes special meaning (of lots of things) but where there was none
> the quoting changes nothing at all, eg:
>
>         "ls" \-'l'
>
> is exactly the same as
>
>         ls -l
>
> and
>         x="foo" y=''
> is identical to
>         x=foo y=
> (though not all empty quoted strings are irrelevant that way).
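
The equivalences listed above can be compared directly in a script. This is a
small sketch, assuming a POSIX-style sh; the variable names and the same_class
helper are made up for illustration.

```shell
# "ls" \-'l' and ls -l produce the same two words after quote removal.
set -- "ls" \-'l'
quoted_words="$1 $2"
set -- ls -l
plain_words="$1 $2"

# Quoted and unquoted assignments store the same values.
x="foo" y=''
quoted_vars="$x|$y"
x=foo y=
plain_vars="$x|$y"

# Hypothetical helper: checks that ["abc"] and [abc] agree on $1.
# Leaves the two results in q (quoted form) and p (plain form).
same_class() {
    case $1 in ["abc"]) q=yes ;; *) q=no ;; esac
    case $1 in [abc])   p=yes ;; *) p=no ;; esac
    [ "$q" = "$p" ]
}

if [ "$quoted_words" = "$plain_words" ]; then echo "words: same"; fi
if [ "$quoted_vars" = "$plain_vars" ]; then echo "assignments: same"; fi
```

Calling same_class with a character inside or outside the list (for example a
or d) should succeed either way, since the quotes around abc remove no special
meaning.
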
>
> There are other cases where it is less clear what should happen. If your
> example had been
>
>         [![:"lower:"]]
>
> then we get into very murky water indeed.   XBD 9.3.5 says:
>
>         The character sequences "[.", "[=", and "[:" (<left-square-bracket>
>         followed by a <period>, <equals-sign>, or <colon>) shall be special
>         inside a bracket expression
>
> [aside: not related to my current point, the "shall be special" is what
> enables sh quoting to stop that from happening, since quoting in the shell
> prevents specialness from happening]
>
>         and are used to delimit collating symbols, equivalence class
>         expressions, and character class expressions.
>
> That part (so far) is clear and non-controversial.
>
>         These symbols shall be followed by a valid expression and the
>         matching terminating sequence ".]", "=]", or ":]", as described
>         in the following items.
>
> That's the part that is less clear.   When a valid expression and the
> terminating sequence appear, there is no issue, and all is fine - what
> is less clear is what happens when one of those requirements is not met.
>
> Some read this as purely a requirement on the application - what the
> script writer is required to do - and when they don't the implementation
> (sh or RE library, or whatever) is free to interpret things (which means
> the whole pattern) however it likes (often as not being a pattern at all).
>
> Personally I disagree - I believe it is a requirement on the application
> if it desires the relevant sequence to be interpreted as a char class (etc)
> and if the application does not include a valid expression or terminating
> sequence the implementation should be required to treat the opening
> char sequence as if it did not begin a char class (etc) and the [: were
> simply 2 chars contained in the bracket expression (they must be in
> a bracket expression or the issue doesn't arise at all).
>
> Unfortunately (for the world in general, in that more and more of this
> is becoming unspecified, which makes it harder and harder to know what
> any particular sequence of characters will do) it seems like the former
> interpretation is the more likely to be adopted.
>
> If I have not understood the "this" in your
>
>         where exactly is this documented
>
> please be more precise, and I will try to answer.
>
> kre
>
>

