From: Eduardo A . Bustamante López
Subject: Re: DEL character treated specially when preceded by a backslash when used in the RHS of the regex operator ([[ $'\177' =~ $'\\\177' ]])
Date: Fri, 17 Jan 2014 11:30:37 -0800
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Jan 17, 2014 at 08:43:46AM -0500, Chet Ramey wrote:
> On 1/16/14 6:46 PM, Eduardo A. Bustamante López wrote:
> > The DEL ($'\177') character does not behave like the other control
> > characters when used with the regex operator inside the test keyword.
> 
> This has to do with the expansion of $r and that $r includes a backslash.
> When combined with the internal quoting bash does, and the fact that the
> backslash is special to pattern matching, we end up with this problem.
> I've only thought about it a little so far, but I don't know if there's a
> quick or simple fix.  This may have to wait until after bash-4.3 is
> released.
> 
> Chet

I understand that a backslash preceding a character *could* prevent it
from matching, but $'\177' is the *only* non-graphic character that
behaves this way.

This should make it clearer:

ubuntu@ubuntu:~$ for c in $'\001' $'\r' $'\177' $'\277' $'\377'; do
> r="\\$c"; [[ $c =~ $r ]]; printf 'c=%q r=%q %d\n' "$c" "$r" "$?"
> done
c=$'\001' r=$'\\\001' 0
c=$'\r' r=$'\\\r' 0
c=$'\177' r=$'\\\177' 1
c=$'\277' r=$'\\\277' 0
c=$'\377' r=$'\\\377' 0

My concern is less whether the preceding backslash should make the
character match, and more why $'\177' behaves differently from the
other characters.

That is, I would expect either of these two outputs:

O1:
c=$'\001' r=$'\\\001' 1
c=$'\r' r=$'\\\r' 1
c=$'\177' r=$'\\\177' 1
c=$'\277' r=$'\\\277' 1
c=$'\377' r=$'\\\377' 1

O2:
c=$'\001' r=$'\\\001' 0
c=$'\r' r=$'\\\r' 0
c=$'\177' r=$'\\\177' 0
c=$'\277' r=$'\\\277' 0
c=$'\377' r=$'\\\377' 0

But the real output shows that the case c=$'\177' is treated
specially:

c=$'\001' r=$'\\\001' 0
c=$'\r' r=$'\\\r' 0
c=$'\177' r=$'\\\177' 1 <-- this one behaves differently.
c=$'\277' r=$'\\\277' 0
c=$'\377' r=$'\\\377' 0
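
The same check can be extended to every single-byte character. This is
only a minimal sketch: the names probe and scan are made up for
illustration, LC_ALL=C is assumed so that each byte counts as one
character, and the lines it prints depend on the bash version and on the
C library's regex implementation.

```shell
#!/usr/bin/env bash
# Sketch: list every single-byte character c for which
# [[ $c =~ \c ]] returns a non-zero status.
export LC_ALL=C   # assumption: one byte == one character

# probe CHAR: print the exit status of [[ CHAR =~ \CHAR ]]
probe() {
    local c=$1 r status=0
    r="\\$c"                          # backslash followed by the character
    [[ $c =~ $r ]] || status=$?
    printf '%d\n' "$status"
}

# scan: run probe on bytes 1..255 and report the non-zero results
scan() {
    local i c status
    for ((i = 1; i < 256; i++)); do
        # %b expands the \xHH escape into the raw byte
        printf -v c '%b' "\\x$(printf '%x' "$i")"
        status=$(probe "$c")
        [[ $status == 0 ]] || printf 'c=%q status=%d\n' "$c" "$status"
    done
}

scan
```

Characters that are meaningful to the regex engine after a backslash
(digits as back references, for example) may also show up in the list,
so the interesting entries are the non-graphic ones.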


---
Now, regarding the issue of whether the backslash should be treated
in a special way, or treated literally, the only thing I can
contribute is the behavior of GNU sed, which handles non-graphic
characters preceded by a backslash the same as the individual
character:

ubuntu@ubuntu:~$ cat sed
mapfile -t chars < <(
    printf '\\x%x\n' {1..255} | while read -r c; do printf "$c"'\n'; done
);

for sed in sed 'sed -r'; do
    printf -- '--- sed: %s ---\n' "$sed"
    for c in "${chars[@]}"; do
        printf '%q > %q\n' "$c" "$(printf %s\\n "$c" | $sed "s/\\$c//" 2>&1)"
    done | grep -v "''\$"
done
ubuntu@ubuntu:~$ bash sed
--- sed: sed ---
'' > sed:\ -e\ expression\ #1\,\ char\ 5:\ unterminated\ \`s\'\ command
'' > sed:\ -e\ expression\ #1\,\ char\ 5:\ unterminated\ \`s\'\ command
\' > \'
\( > sed:\ -e\ expression\ #1\,\ char\ 6:\ Unmatched\ \(\ or\ \\\(
\) > sed:\ -e\ expression\ #1\,\ char\ 6:\ Unmatched\ \)\ or\ \\\)
1 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
2 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
3 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
4 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
5 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
6 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
7 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
8 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
9 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
\< > \<
\> > \>
B > B
W > W
\` > \`
a > a
b > b
c > sed:\ -e\ expression\ #1\,\ char\ 6:\ Trailing\ backslash
f > f
n > n
r > r
s > s
t > t
v > v
\{ > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ preceding\ regular\ expression
\| > \|
--- sed: sed -r ---
'' > sed:\ -e\ expression\ #1\,\ char\ 5:\ unterminated\ \`s\'\ command
'' > sed:\ -e\ expression\ #1\,\ char\ 5:\ unterminated\ \`s\'\ command
\' > \'
1 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
2 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
3 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
4 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
5 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
6 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
7 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
8 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
9 > sed:\ -e\ expression\ #1\,\ char\ 6:\ Invalid\ back\ reference
\< > \<
\> > \>
B > B
W > W
\` > \`
a > a
b > b

As you can see, every non-graphic character preceded by a backslash is
treated the same as the character on its own (the script only prints the
cases where the substitution did not produce an empty result). I do not
know how other regex engines treat this case.
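
For a single character, the check the script performs boils down to the
following. This is a sketch only: strip_escaped is a hypothetical helper
name, and the behavior for control characters is the one observed with
GNU sed above, which other sed implementations may not share.

```shell
#!/usr/bin/env bash
# strip_escaped CHAR: feed CHAR to sed and delete the first match of \CHAR.
# An empty result means "\CHAR" matched the bare CHAR.
strip_escaped() {
    local c=$1
    printf '%s\n' "$c" | sed "s/\\$c//"
}

# A regex metacharacter: \. matches a literal dot, so nothing is left.
printf '%q\n' "$(strip_escaped .)"

# A control character: with GNU sed this should be empty as well,
# according to the output shown above.
printf '%q\n' "$(strip_escaped $'\001')"
```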



---
In case you're interested in why I care about this issue ($'\177'),
see the special case I had to add to the ''requote2'' function to make
it work:

https://github.com/lhunath/scripts/issues/3#issuecomment-32551132



---
Regarding the Cygwin issue:

$ for c in $'\177' $'\200' $'\277' $'\376' $'\377'; do
> r=$c; [[ $c =~ $r ]]; printf 'c=%q r=%q %d\n' "$c" "$r" "$?";
> done;
c=$'\177' r=$'\177' 0
c=$'\200' r=$'\200' 2
c=$'\277' r=$'\277' 2
c=$'\376' r=$'\376' 2
c=$'\377' r=$'\377' 2
$ echo "$BASH_VERSION $OS"
4.1.10(4)-release Windows_NT

Notice that even [[ $x =~ $x ]] fails for the characters outside the
ASCII range, and with exit status 2 rather than 1.
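
The distinction between the two non-zero statuses matters here: bash
documents 1 as "no match" and 2 as "the regular expression is
syntactically incorrect". A minimal sketch that makes the status
readable (classify is a hypothetical helper, not part of bash):

```shell
#!/usr/bin/env bash
# classify STRING REGEX: name the exit status of [[ STRING =~ REGEX ]]
classify() {
    local s=$1 r=$2 status=0
    [[ $s =~ $r ]] || status=$?
    case $status in
        0) echo match ;;
        1) echo no-match ;;
        2) echo invalid-regex ;;
        *) echo "unexpected status $status" ;;
    esac
}

classify a a      # match
classify a b      # no-match
classify a '('    # invalid-regex (unbalanced parenthesis)
```

Running classify "$c" "$c" on both systems for the characters above
would show whether Cygwin rejects the pattern itself, as status 2
suggests, instead of merely failing to match.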


---
So, in short, there are three issues here:

1) Why is $'\177' handled differently from the other non-graphic
characters?

2) Why does bash behave differently on Ubuntu and on Cygwin (i.e. the
[[ keyword returning 2 for characters outside the ASCII range when
trying to match them with =~)?

3) How should bash treat the case of a character preceded by a
backslash in regular expressions (and globs, as Dan reported in a
previous issue)?


I personally care more about 1 & 2, because these two prevent me from
writing a function that works on both Linux and Cygwin, and at the same
time, the special case for $'\177' makes me feel dirty. However bash
handles 3, I can deal with it as long as it's consistent.

-- 
Eduardo Alan Bustamante López


