Re: quote removal issues within character class

From: Robert Elz
Subject: Re: quote removal issues within character class
Date: Sat, 09 Nov 2019 19:49:03 +0700

    Date:        Sat, 9 Nov 2019 07:35:16 +0300
    From:        Oğuz <address@hidden>
    Message-ID:  <address@hidden>

  | is correct, as "foo" does not contain a ']' which would be required
  | > to match there (quoting the ':' means there is no character class,
  | > hence we have instead (the negation of) a char class containing '[' ':'
  | > 'l' 'o' 'w' 'e' 'r' (and ':' again), preceded by anything, and
  | > followed by ']' and anything.   foo does not match. f]oo would.
  | >
  | where exactly is this documented in the standard?

I'm not sure which part exactly you're looking for, but char sets in sh
are specified to be the same as in REs, except that ! replaces ^ as the
negation character (that's in XCU 2.13.1).  Char sets (bracket expressions)
in RE's are documented in XBD 9.3.5 wherein it states

        A bracket expression is either a matching list expression or a
        non-matching list expression. It consists of one or more expressions:
        ordinary characters, collating elements, collating symbols,
        equivalence classes, character classes, or range expressions.
        The <right-square-bracket> (']') shall lose its special meaning and
        represent itself in a bracket expression if it occurs first in the list
        (after an initial <circumflex> ('^'), if any).

        Otherwise, it shall terminate the bracket expression,

That is, a ']' that occurs anywhere else terminates the bracket expression

        unless it appears in a collating symbol (such as "[.].]")

(not relevant in the given example)

        or is the ending <right-square-bracket> for a collating symbol,
        equivalence class, or character class.

So the ']' that immediately follows the second ':' would not terminate the
bracket expression if it is the ending ']' for a character class
(collating symbols and equiv classes not being relevant to the example).
Of course, that can only happen if there is a character class to end.
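
As a small illustration of the "first in the list" rule quoted above, here
is a minimal sketch, assuming a POSIX-conforming sh. The function names
is_rbracket and not_rbracket are made up for this example, and the second
one uses the shell's '!' negation rather than the RE '^'.

```shell
# ']' placed first in the list is an ordinary member of the set,
# so the bracket expression below contains exactly one character.
is_rbracket() {
    case $1 in
        []]) echo yes ;;
        *)   echo no ;;
    esac
}

# After the shell's '!' negation, ']' still counts as "first in the list".
not_rbracket() {
    case $1 in
        [!]]) echo yes ;;
        *)    echo no ;;
    esac
}

is_rbracket ']'
is_rbracket x
not_rbracket x
not_rbracket ']'
```

In both patterns it is the second ']' that terminates the bracket
expression.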

There's also

        The special characters '.', '*', '[', and '\\'
        (<period>, <asterisk>, <left-square-bracket>, and <backslash>,
        respectively) shall lose their special meaning within a bracket

whereupon if the [": sequence does not start a char class, the '[' there
is simply a literal char inside the bracket expression.

Similarly if the bracket expression ends at the first ']' (the one immediately
after the second ':') the following ']' is simply a literal character, as
']' chars are special only when following a '['.

So, all that's left to determine is whether the [": sequence can be
considered as beginning a char class.

In a RE it certainly cannot - quote chars (' and ") are not special in
REs at all, and [": is no different syntactically than [x: which no-one
would treat as being the introduction to a char class.

This is also, I believe (Chet can confirm, or refute, if he desires) where
bash gets the interpretation that "lower" (including the quotes) is the
name of the char class in [:"lower":] except that it cannot be, as char
class names cannot contain quote characters (which should lead to the
whole sub-expression not being treated as a char class at all, instead
bash treats it, I think, as if it were an unknown but valid class name).

But when it comes from sh, quote chars are "different" and instead of
just being characters, they instead affect the interpretation of the
characters that are quoted.  See XCU 2.2:

        Quoting is used to remove the special meaning of certain characters
        or words to the shell.

        Quoting can be used to preserve the literal meaning of the special
        characters in the next paragraph [...]

        and the following may need to be quoted under certain circumstances.
        That is, these characters may be special depending on conditions
        described elsewhere in this volume of POSIX.1-2017:

                * ? [ # ~ = %

to which more chars have been added (as I recall) recently by some
Austin Group correction (which I think includes ! : - and ]), that is
to make it clear, that in sh


is a bracket expression containing 3 chars 'a' '-' and 'z' (which form
of quoting is used to remove the specialness of the '-' is irrelevant),
and that "[a-z]" isn't a bracket expression at all (neither of which
is true in an RE - though the role of \ in REs is being altered slightly,
so if it had been [a\-z] in an RE things are less clear).
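
To illustrate the point about a quoted '-' in sh, a minimal sketch follows,
assuming a POSIX-conforming sh. The pattern [a\-z] and the function names
in_set and is_literal are chosen for this example only and are not taken
from the original message.

```shell
# The '-' is quoted, so it does not form a range: the set holds
# only the three characters 'a', '-' and 'z'.
in_set() {
    case $1 in
        [a\-z]) echo yes ;;
        *)      echo no ;;
    esac
}

# Quoting the whole word removes the specialness of '[' as well,
# so this pattern matches only the five-character string [a-z].
is_literal() {
    case $1 in
        "[a-z]") echo yes ;;
        *)       echo no ;;
    esac
}

in_set a
in_set -
in_set m
is_literal m
is_literal '[a-z]'
```

Here 'm' is rejected by in_set because no range is formed, and is_literal
accepts only the literal string itself.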

The effect of this is that in sh, in an expression like


the first ':' is not "special" and hence cannot form part of the
magic opening '[:' sequence for a character class.   Hence this
expression contains no character class, and consequently the
':]' chars are simply a ':' in the bracket expression, and then
the terminating ']' - which leaves the second ']' being just a
literal character.
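
Since shells are known to differ here, a probe can show what a given shell
actually does. The following is only a sketch: the pattern [[":"lower:]] is
a hypothetical stand-in (not the expression from the original message), and
the function name probe is made up. Under the reading argued above, the
bracket expression would contain '[' ':' 'l' 'o' 'w' 'e' 'r' and ':', and
the final ']' would be literal, so 'l]' would match and 'f' would not, but
no particular result is promised for any given shell.

```shell
# Hypothetical probe: reports how the running shell treats a bracket
# expression whose would-be "[:" opener has a quoted ':'.
probe() {
    case $1 in
        [[":"lower:]]) echo "match" ;;
        *)             echo "no match" ;;
    esac
}

# Print the subject string next to the shell's verdict.
for subject in 'l]' 'f' ':]'; do
    printf '%s: %s\n' "$subject" "$(probe "$subject")"
done
```

Running the same probe under several shells is a quick way to see where
they agree.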

While here (these following parts are not relevant to your question I believe)
when used in sh


should be treated just the same as


for the same reason that


is treated the same as


That is, quoted characters that are not special are no different
than the same character unquoted.    That's universal in sh, quoting
removes special meaning (of lots of things) but where there was none
the quoting changes nothing at all, eg:

        "ls" \-'l'

is exactly the same as

        ls -l

        x="foo" y=''
is identical to
        x=foo y=
(though not all empty quoted strings are irrelevant that way).
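
The same equivalences can be checked mechanically. This is a minimal
sketch, assuming a POSIX-conforming sh; the variable names first, second,
a and b and the function count are made up for the example.

```shell
# Quoting characters that were never special changes nothing:
# both forms produce the same two words.
set -- "ls" \-'l'
first="$1 $2"
set -- ls -l
second="$1 $2"
[ "$first" = "$second" ] && echo "same words"

# The same holds for assignments.
x="foo" y=''
a=foo b=
[ "$x" = "$a" ] && [ "$y" = "$b" ] && echo "same values"

# An empty quoted string standing alone is still a word, which is
# one case where the quotes do matter.
count() { echo $#; }
count ''
count
```

The last two calls differ in their argument count: the quoted empty string
is passed as one (empty) argument, while the bare call passes none.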

There are other cases where it is less clear what should happen: if your
example had been


then we get into very murky water indeed.   XBD 9.3.5 says:

        The character sequences "[.", "[=", and "[:" (<left-square-bracket>
        followed by a <period>, <equals-sign>, or <colon>) shall be special
        inside a bracket expression

[aside: not related to my current point, the "shall be special" is what
enables sh quoting to stop that from happening, since quoting in the shell
prevents specialness from happening]

        and are used to delimit collating symbols, equivalence class
        expressions, and character class expressions.

That part (so far) is clear and non-controversial.

        These symbols shall be followed by a valid expression and the
        matching terminating sequence ".]", "=]", or ":]", as described
        in the following items.

That's the part that is less clear.   When a valid expression and the
terminating sequence appear, there is no issue, and all is fine - what
is less clear is what happens when one of those requirements is not met.

Some read this as purely a requirement on the application - what the
script writer is required to do - and when they don't the implementation
(sh or RE library, or whatever) is free to interpret things (which means
the whole pattern) however it likes (often as not being a pattern at all).

Personally I disagree - I believe it is a requirement on the application
if it desires the relevant sequence to be interpreted as a char class (etc)
and if the application does not include a valid expression or terminating
sequence the implementation should be required to treat the opening
char sequence as if it did not begin a char class (etc) and the [: were
simply 2 chars contained in the bracket expression (they must be in
a bracket expression or the issue doesn't arise at all).

Unfortunately (for the world in general, in that more and more of this
is becoming unspecified, which makes it harder and harder to know what
any particular sequence of characters will do) it seems like the former
interpretation is the more likely to be adopted.

If I have not understood the "this" in your

        where exactly is this documented

please be more precise, and I will try to answer.

