bug-bash

Re: How to match regex in bash? (any character)


From: Stephane CHAZELAS
Subject: Re: How to match regex in bash? (any character)
Date: Wed, 28 Mar 2012 18:26:38 -0000
User-agent: slrn/pre1.0.0-18 (Linux)

2011-10-1, 14:39(-08), rogerx.oss@gmail.com:
[...]
> I took some time to examine the three regex references:
>
> 1) 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04
>     Written more like a technical specification of regex.  Great if you're
>     going to be modifying the regex code.  Difficult to follow if you're new,
>     looking for info.

One thing to bear in mind is that bash calls a system library to
perform the regexp matching (except for [*]), so it can't
really document how it's going to work because it just can't know:
it may differ from system to system. The only thing that is more
or less guaranteed is that all those various implementations
should comply with that specification.

Above is the specification of POSIX extended regular
expressions, so a bash script writer should refer to that
document if they want to write a script for all the systems where
bash might be used.
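
As a minimal sketch of what "sticking to the specification" can look
like in practice, the pattern below uses only constructs defined for
POSIX EREs (anchors, groups, bracket expressions with character
classes, and an interval). The function name parse_assignment and the
name=number input format are made up for illustration; they are not
from the original thread.

```shell
#!/usr/bin/env bash
# Pattern restricted to POSIX ERE constructs: ^ and $ anchors, groups,
# bracket expressions with character classes, and an {m,n} interval.
re='^([[:alpha:]_][[:alnum:]_]*)=([[:digit:]]{1,3})$'

# Hypothetical helper: split "name=number" into its two parts.
parse_assignment() {
    # $re is left unquoted so that bash treats it as a regex.
    if [[ $1 =~ $re ]]; then
        printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}
```

Calling parse_assignment 'count=42' prints "count 42", while inputs
such as '9lives=1' or 'count=4242' are rejected with a non-zero status.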

> 2) regex(7)
>     Although it looks good, upon further examination, I start to see run-on
>     sentences.  It's more like a reference, which is what a man file should
>     be.
>     At the bottom, "AUTHOR - This page was taken from Henry Spencer's regex
>     package"

On the few systems where that man page is available, it may or
may not document the extended regular expressions that are
used when calling the regex(3) API (on my system, it doesn't).
Those regular expressions may or may not have extensions over
the POSIX API, and that document may or may not point out which
ones are extensions and which ones are not, so a script writer may
be able to refer to that document if they want their script to work
on that particular system (except for [*]).

> 3) grep(1)
>     Section "REGULAR EXPRESSIONS".  At about half the size of regex(7), the
>     section clearly explains regex and seems to be easily understandable for a
>     person new to regex.

That's another utility that may or may not use the same API, in
the same way as bash or not. You get no warranty whatsoever that
the regexps covered there will be the same as bash's.

[*] actually, bash does some (undocumented) preprocessing on the
regexps, so even the regex(3) reference is misleading here.

For instance, on my system the regex(3) extended REs support \1
for backreferences and \b for word boundaries, but when calling
[[ aa =~ (.)\1 ]], bash changes it to [[ aa =~ (.)1 ]], so bash
won't behave as regex(3) documents on my system. (Note that
(.)\1 is not a portable regex anyway, as its behavior is
unspecified by POSIX.)
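
The effect can be observed with a short sketch. It assumes a bash 4 or
bash 5 release, where a backslash written inline in the pattern is
treated as shell quoting and dropped before a non-special character;
other versions may differ. The helper name inline_backref is
hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the pattern is written inline, so bash processes
# the backslash as shell quoting before the regex library sees it.
# Under the assumption above, the library receives ^(.)1$ rather
# than ^(.)\1$.
inline_backref() {
    [[ $1 =~ ^(.)\1$ ]]
}

# Report which of the two candidate strings the pattern accepts.
for s in aa a1; do
    if inline_backref "$s"; then
        printf '%s: match\n' "$s"
    else
        printf '%s: no match\n' "$s"
    fi
done
```

Under that assumption "a1" matches (any character followed by a literal
1) and "aa" does not, which is the opposite of what a backreference
would give. Storing the pattern in a variable and expanding it unquoted
is commonly suggested as a way around the shell quoting, but whether \1
then works still depends on the system's regex library.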

Also (and that could be considered a bug), "[\a]" is meant to
match either "\" or "a", but in bash, because of that
preprocessing, it doesn't:

$ bash -c '[[ "\\" =~ [\a] ]]' || echo no
no
$ bash -c '[[ "\\" =~ [\^] ]]' && echo yes
yes

Once that bug is fixed, bash should probably refer to POSIX EREs
(since its preprocessing would disable any extension introduced
by system libraries) rather than regex(3), as that would be more
accurate.

The situation with zsh:
  - it uses the same API as bash (unless the RE_MATCH_PCRE
    option is set in which case it uses PCRE regexps)
  - it doesn't do the same preprocessing as bash because...
  - it doesn't implement that confusing business inherited from
    ksh whereby quoted RE characters are taken literally.

  So, in zsh
  - [[ aa =~ '(.)\1' ]] works as documented in regex(3) on my
    system (but may work differently on other systems as the
    behavior is unspecified as per POSIX).
  - [[ '\' =~ '[\a]' ]] works as POSIX specifies
  - after "setopt RE_MATCH_PCRE", one gets a more portable
    behavior as there is only one PCRE library (though in different
    versions).
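
For comparison, the ksh-inherited quoting rule as bash applies it
(bash 3.2 and later, without the compat31 option) can be sketched as
follows. The function names are hypothetical and exist only to label
the cases.

```shell
#!/usr/bin/env bash
# Quoted "." is taken literally: only an actual dot matches.
quoted_dot()   { [[ $1 =~ ^a"."c$ ]]; }

# Unquoted . keeps its ERE meaning: any single character matches.
unquoted_dot() { [[ $1 =~ ^a.c$ ]]; }

# The same rule applies to expansions: "$re" is matched as a literal
# string, while $re is matched as a regular expression.
re='^a.c$'
quoted_var()   { [[ $1 =~ "$re" ]]; }
unquoted_var() { [[ $1 =~ $re ]]; }
```

Here quoted_dot accepts 'a.c' but not 'abc', unquoted_dot and
unquoted_var accept both, and quoted_var only accepts a string that
contains the five characters ^a.c$ themselves.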

The situation with ksh93:
  - Not POSIX either but a bit more consistent:
    $ ksh -c '[[ "\\" =~ [\a] ]]' || echo no
    no
    $ ksh -c '[[ "\\" =~ [\^] ]]' || echo no
    no
  - it implements its own regexps with its own many extensions,
    which therefore can be and are documented in its man page
    but are not common to any other regex implementation (though
    they are mostly a superset of POSIX EREs).

-- 
Stephane

