bug-gnulib

Re: regex-quote.c syntax support


From: Reuben Thomas
Subject: Re: regex-quote.c syntax support
Date: Sat, 5 Mar 2011 16:31:02 +0000

On 5 March 2011 14:51, Bruno Haible <address@hidden> wrote:
> Hello Reuben,
>
>> regex-quote seems only to support two syntaxes at the moment
>
> Yes. POSIX specifies two syntaxes.

regex.h suggests that in practice there are a couple more:

RE_SYNTAX_POSIX_EGREP
RE_SYNTAX_POSIX_AWK

each of which is different from the other and from POSIX basic and extended.

> Rather it's an 'int' with the same meaning as the cflags argument that you
> pass to regcomp().

Any non-zero value counts as selecting extended syntax in
regex_quote*, whereas in regcomp only one bit does that. (I point this
out only as a potential source of ABI breakage.)
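
A minimal sketch of the mismatch, assuming only what is said above: regcomp() looks at the REG_EXTENDED bit, while regex_quote* is modelled here as treating any non-zero value as "extended". Both function names are hypothetical and exist only for illustration; they are not part of gnulib.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>

/* regcomp() convention: only the REG_EXTENDED bit selects extended syntax. */
bool extended_per_regcomp(int cflags)
{
    return (cflags & REG_EXTENDED) != 0;
}

/* Hypothetical model of the regex_quote* behaviour described above:
   any non-zero value selects extended syntax. */
bool extended_per_nonzero(int cflags)
{
    return cflags != 0;
}

/* Self-check: the two conventions agree on 0 and REG_EXTENDED,
   but disagree when another cflags bit such as REG_ICASE is passed alone. */
void check_conventions(void)
{
    assert(!extended_per_regcomp(0) && !extended_per_nonzero(0));
    assert(extended_per_regcomp(REG_EXTENDED) && extended_per_nonzero(REG_EXTENDED));
    assert(!extended_per_regcomp(REG_ICASE) && extended_per_nonzero(REG_ICASE));
}
```

So a caller who passes the same cflags to both (say REG_ICASE alone) would get basic syntax from regcomp() but extended-style quoting.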

> True, but on the other hand if the caller is supposed to determine the
> characters to be escaped ad-hoc, the risk of mistake is pretty high.

> On the other hand, 'grep' supports basic, extended, and PCRE syntaxes,
> but not the Emacs syntax.

Presumably it supports not RE_SYNTAX_POSIX_EXTENDED but rather
RE_SYNTAX_POSIX_EGREP? Or both?

> Before we can decide on this, IMO some analysis is needed:
>
>  - What are the possible effects of reg_syntax_t on the string of
>    characters to be escaped? I can see
>      RE_BK_PLUS_QM                   ->    +?
>      RE_INTERVALS, RE_NO_BK_BRACES   ->    {}
>    What other relations are there?

RE_NO_BK_PARENS -> ()
RE_NO_BK_VBAR -> |
RE_NO_BK_REFS -> [:digit:]
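
As a rough sketch of how those relations could be turned into code, the function below derives the set of characters to escape from a reg_syntax_t. It assumes glibc's (or gnulib's) regex.h, where the RE_* bits are visible only with _GNU_SOURCE. The function name and buffer size are made up for illustration, and RE_LIMITED_OPS is my addition, on the understanding that it disables +, ? and | as operators. It is not complete: RE_NEWLINE_ALT, for one, is ignored.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes reg_syntax_t and the RE_* bits only with this */
#endif
#include <assert.h>
#include <regex.h>
#include <string.h>

enum { SPECIALS_MAX = 16 };   /* 14 characters at most, plus the terminating NUL */

/* Hypothetical helper: fill OUT with the characters that act as operators
   when written unescaped under SYNTAX, and so need a backslash to be literal. */
void special_chars_for_syntax(reg_syntax_t syntax, char out[SPECIALS_MAX])
{
    assert(out != NULL);

    /* Special regardless of the syntax bits considered here. */
    strcpy(out, "$^.*[]\\");

    /* Bare + and ? are operators unless the syntax wants \+ and \?,
       or disables them altogether. */
    if (!(syntax & RE_BK_PLUS_QM) && !(syntax & RE_LIMITED_OPS))
        strcat(out, "+?");

    /* Bare { } delimit intervals only if intervals exist and need no backslash. */
    if ((syntax & RE_INTERVALS) && (syntax & RE_NO_BK_BRACES))
        strcat(out, "{}");

    /* Bare ( ) group only if grouping needs no backslash. */
    if (syntax & RE_NO_BK_PARENS)
        strcat(out, "()");

    /* Bare | alternates only if it needs no backslash and is not disabled. */
    if ((syntax & RE_NO_BK_VBAR) && !(syntax & RE_LIMITED_OPS))
        strcat(out, "|");
}
```

Note that a character must be left out of the set when its backslashed form is the operator (e.g. + under RE_BK_PLUS_QM), since escaping it would then turn a literal into an operator.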

>  - What characters need to be escaped in Emacs syntax?

Emacs syntax is simply the syntax with all the bits switched off, so:

$^.*[]\+?

>  - What characters need to be escaped in PCRE syntax?

According to pcrepattern(3):

^$.[|()?*+{

(Which makes me wonder why we treat ] as special in regex-quote.c.)
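
To illustrate what a quoting routine parameterised by such a character list might look like, here is a byte-wise sketch. The names are hypothetical and this is not the regex-quote.c code. The PCRE set is the list above plus backslash, which pcrepattern(3) also lists as a metacharacter. Working on bytes is assumed to be safe for UTF-8, since all the special characters are ASCII and never occur inside a multibyte sequence; other multibyte encodings would need more care.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* PCRE metacharacters outside brackets: the list above, plus backslash. */
const char pcre_specials[] = "\\^$.[|()?*+{";

/* Length of S after quoting, not counting the terminating NUL. */
size_t quoted_length(const char *s, const char *specials)
{
    assert(s != NULL && specials != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        n += (strchr(specials, *s) != NULL) ? 2 : 1;
    return n;
}

/* Copy S to OUT, putting a backslash before every byte found in SPECIALS.
   OUT must have room for quoted_length(s, specials) + 1 bytes.
   Returns a pointer to the terminating NUL written to OUT. */
char *quote_copy(char *out, const char *s, const char *specials)
{
    assert(out != NULL && s != NULL && specials != NULL);
    for (; *s != '\0'; s++) {
        if (strchr(specials, *s) != NULL)
            *out++ = '\\';
        *out++ = *s;
    }
    *out = '\0';
    return out;
}
```

With this set, a ] in the input is copied through unescaped, while [ and + each gain a backslash.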

>  - Do Emacs and PCRE view a regex as a sequence of bytes or as a sequence
>    of multibyte characters in the locale encoding (given by LC_CTYPE)?

PCRE doesn't do locales; it treats strings as either bytes or, given a
specific flag, UTF-8.

I don't really understand the question about Emacs: someone using
regex-quote in their own programs is worried about Emacs syntax, not
Emacs encodings, because Emacs doesn't have a C API. My understanding
of Emacs is that it has its own universal internal encoding, which
differs from the encoding of a particular buffer being edited; the
latter can be bytes, 7-bit or 8-bit characters, or multibyte
characters, according to the file being edited and the user's selected
encoding.

HTH!

-- 
http://rrt.sc3d.org


