bug-gnulib

From: James Youngman
Subject: Enumerating possibly-used regex constructs in order to offer warnings about differences of interpretation
Date: Tue, 25 Mar 2008 10:49:00 +0000

Unix users often have difficulty with tools that accept regexes,
because it is hard to remember which tool uses which regex variant.
A given tool usually cannot change its regex dialect, because doing
so would be very likely to break users' existing scripts.

However, a small subset of REs are interpreted identically by a
number of pairs of dialects.  The gnulib regex implementation already
has a bitmask of enabled/disabled regex features, allowing the caller
to specify precisely which regex variant they want to use.  Is there
any existing code that could be used to enumerate which of those
features are definitely not used by a specific regex offered for
compilation?  If so, it might be possible for a tool to say:

Warning: the regex <...> is interpreted differently for
--regextype=emacs and --regextype=posix-extended, but you didn't
specify which you wanted.  Assuming --regextype=emacs for now, though
<some warning about potential future changes>
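
A minimal sketch of the decision behind such a warning, assuming the
set of syntax bits that affect a given regex is already known by some
means.  The function warn_if_ambiguous and its parameters are
hypothetical and not part of gnulib; the syntax values are treated as
plain unsigned long bitmasks, and the message is softened to "may be"
because the check is only as good as the sensitive-bit set it is given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, not a gnulib function.  SYNTAX_A and SYNTAX_B
   are the syntax bitmasks of two candidate dialects.  SENSITIVE_BITS
   is the set of syntax bits whose value affects how REGEX is
   interpreted.  Warn and return true if the two dialects differ in at
   least one of those bits; otherwise stay silent and return false.  */
bool
warn_if_ambiguous (const char *regex,
                   const char *name_a, unsigned long syntax_a,
                   const char *name_b, unsigned long syntax_b,
                   unsigned long sensitive_bits)
{
  assert (regex != NULL && name_a != NULL && name_b != NULL);

  /* XOR leaves only the bits on which the dialects disagree; the AND
     keeps only those disagreements that matter for this regex.  */
  if (((syntax_a ^ syntax_b) & sensitive_bits) == 0)
    return false;

  fprintf (stderr,
           "Warning: the regex %s may be interpreted differently for "
           "--regextype=%s and --regextype=%s\n",
           regex, name_a, name_b);
  return true;
}
```

A caller would pass the real reg_syntax_t values for the two dialects;
nothing here depends on which bits those values actually contain.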

For example, if I compile the regex "(foo)\1", I would in principle be
able to compute the bitmask RE_NO_BK_PARENS|RE_NO_BK_REFS.  This value
is computed by taking ~0 and then turning off every bit whose value
would make no difference to the result of compiling the regex.  The
interpretation of the regex "(foo)\1" would change if I inverted
either of the bits RE_NO_BK_PARENS and RE_NO_BK_REFS in the syntax
value and compiled again.  The hypothetical computation I'm referring
to would not also return RE_LIMITED_OPS, for example, because the
regex doesn't include any of the constructs that bit controls (+, ?
and |).  I believe the bitmask generated by my hypothetical
computation would be independent of the actual reg_syntax_t value
passed when compiling the regex.
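
To make the "(foo)\1" example concrete, here is a rough sketch of such
a computation as a purely textual scan.  It is an approximation under
stated assumptions, not code from gnulib: the function name is
hypothetical, the SKETCH_* constants are stand-ins for the gnulib bits
of the same names (with made-up values), only three bits are handled,
and bracket expressions are skipped with a simplified rule.  A real
version would also have to account for related bits such as
RE_BK_PLUS_QM and RE_NO_BK_VBAR.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the gnulib syntax bits of the same names.  The numeric
   values are illustrative, not the ones defined in regex.h.  */
enum
{
  SKETCH_NO_BK_PARENS = 1 << 0,
  SKETCH_NO_BK_REFS   = 1 << 1,
  SKETCH_LIMITED_OPS  = 1 << 2
};

/* Hypothetical helper: return the stand-in syntax bits whose setting
   could change how PATTERN is interpreted.  The result does not depend
   on any particular syntax value.  */
unsigned long
syntax_bits_that_matter (const char *pattern)
{
  unsigned long bits = 0;
  const char *p;

  assert (pattern != NULL);
  p = pattern;
  while (*p != '\0')
    {
      char c = *p++;

      if (c == '[')
        {
          /* Skip a bracket expression.  A ']' right after the opening
             '[' (or after "[^") is taken as a literal member.  */
          if (*p == '^')
            p++;
          if (*p == ']')
            p++;
          while (*p != '\0' && *p != ']')
            p++;
          if (*p != '\0')
            p++;
          continue;
        }
      if (c == '\\' && *p != '\0')
        {
          /* Look at the escaped character instead of the backslash.  */
          c = *p++;
          if (c >= '1' && c <= '9')
            {
              /* \1 .. \9: back-reference or literal digit.  */
              bits |= SKETCH_NO_BK_REFS;
              continue;
            }
        }
      /* Whether escaped or not, these characters are grouping or
         operators in one dialect and literals in another.  */
      if (c == '(' || c == ')')
        bits |= SKETCH_NO_BK_PARENS;
      else if (c == '+' || c == '?' || c == '|')
        bits |= SKETCH_LIMITED_OPS;
    }
  return bits;
}
```

For "(foo)\1" this scan reports the two stand-in bits for parentheses
and back-references and leaves the one for RE_LIMITED_OPS clear, which
matches the result described above.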

In this example it's not possible to guess whether the user would
really have wanted RE_NO_BK_REFS or not, but guessing that wouldn't be
a goal.  In theory you could guess the correct value for
RE_CONTEXT_INVALID_DUP, for example, but that's not the kind of use
case I have in mind.

Thanks,
James.



