emacs-devel
Re: Scan of regexps in Emacs (March 17)


From: Mattias Engdegård
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 16:15:13 +0200

On 2 Apr 2019, at 09:33, Paul Eggert <address@hidden> wrote:
> 
>> don't we also need a precise description of exactly how they are interpreted 
>> by the engine?
> 
> In other parts of Emacs, we are typically OK with specs that don't completely 
> specify behavior. This gives us more freedom to make changes in the 
> undocumented behavior later. I think it makes sense to do that here too, for 
> regular expressions like "[z-a-m]" that most readers would find confusing.

Then where does a user go to understand extant regexps? (Do we have any 
latitude at all for changing even obscure corners of regexp syntax and 
semantics today?) That's why I favour expounding on the details in a separate 
section.

>> The terminology is a bit confusing. Is 'raw 8-bit byte' included in 
>> 'unibyte'? Is \x7f ever a raw 8-bit byte?
>> I agree that [å-\xff], say, should be invalid but I've never seen such 
>> constructs.
> 
> After looking into it I realized that I don't really know the semantics here 
> (the text I recently added there seems to be wrong, in some cases), and I 
> have my doubts that anyone else knows the semantics either. The attached 
> patch simply gets rid of that section, leaving the area undocumented. User 
> beware!

Apparently I don't really know it either -- I just discovered that:

(string-match "\xff"     "\xff")  => 0
(string-match "[\xff]"   "\xff")  => 0
(string-match "\xffé?"   "\xff")  => nil
(string-match "[\xff]é?" "\xff")  => 0
(string-match "\xff"     "\xffé") => 0
(string-match "[\xff]"   "\xffé") => nil
(string-match "\xffé?"   "\xffé") => 0
(string-match "[\xff]é?" "\xffé") => nil

> OK, then we should document z-a as the preferred syntax (best go with the 
> flow...). Done in the attached patch.

Actually, the only place where I saw z-a was in auctex (in negated form, 
[^z-a]).

>> As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr 
>> and found a handful in both Emacs and GNU ELPA, but none of them carried a 
>> freeload of bugs. Keeping that check didn't seem worthwhile; the regexps may 
>> be a bit odd-looking, but aren't wrong.
> 
> It depends on what one means by "wrong". If one wants to use the ranges in 
> both Emacs and grep they are "wrong", so it's reasonable for the manual to 
> recommend against them.

Definitely agree that it should be discouraged. I've attached the ones found by 
a modified relint/xr, in case you are interested.

> It might also help for the trawler to warn about [X-Z] where Z = X+2. [XYZ] 
> is clearer and less error-prone than [X-Z]. I shoehorned that into the 
> attached patch too.

These seem to be rare; I found exactly one occurrence 
(lisp/gnus/message.el:1291):

 "[ \t]\\|[][!\"#$%&'()*+,-./0-9;<=>address@hidden|}~]+:"

which uses the punny range ,-. (possibly by benign accident).
Similarly, singleton ranges, X-X, are non-existent save for --- which I presume 
is an XEmacs workaround.
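
To make it concrete which characters a range like ,-. covers, here is a minimal sketch that lists the ASCII characters matched by a single character alternative. The function name my-ascii-members is hypothetical and is not part of xr or relint; it only uses string-match-p and seq-filter.

```elisp
(require 'seq)

(defun my-ascii-members (char-alt)
  "Return a string of the ASCII characters matched by CHAR-ALT.
CHAR-ALT should be a regexp consisting of one character alternative.
Hypothetical helper for illustration; not part of xr or relint."
  (let ((case-fold-search nil))
    (apply #'string
           (seq-filter (lambda (c) (string-match-p char-alt (string c)))
                       (number-sequence 0 127)))))

;; The range from comma (44) to full stop (46) covers three characters.
(my-ascii-members "[,-.]")   ; => ",-."
```

So in the message.el regexp the range adds nothing beyond the three characters as written, which is why the accident is benign.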

The latest xr version warns about 2-character ranges, except within digits, 
because [0-1] etc. were found to be common and harmless.

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
...
+A character alternative can include duplicates.  For example,
address@hidden is less clear than @samp{[XYa-z]}.

Certainly, but does this need to be mentioned? Overlapping ranges are rarely 
written on purpose. Besides, duplication isn't confined to ranges.

More useful, I think, would be to recommend that ranges stay within natural 
sequences (letters, digits, etc.) so that a reader needn't consult a table to 
see what is included. Thus [0-9.:/] is good and [.-:] is bad, even though they 
denote the same set.
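
The equivalence can be checked with a small sketch that compares two character alternatives over the ASCII range. The function name my-char-alt-difference is made up for this example, and the check deliberately ignores non-ASCII characters.

```elisp
(defun my-char-alt-difference (re1 re2)
  "Return the ASCII characters matched by exactly one of RE1 and RE2.
Hypothetical helper for illustration; only codes 0..127 are tested."
  (let ((case-fold-search nil)
        (diff nil))
    (dotimes (c 128)
      (let ((s (string c)))
        ;; Record C when exactly one of the two regexps matches it.
        (unless (eq (null (string-match-p re1 s))
                    (null (string-match-p re2 s)))
          (push c diff))))
    (nreverse diff)))

(my-char-alt-difference "[0-9.:/]" "[.-:]")    ; => nil
(my-char-alt-difference "[0-9.:/]" "[0-9.:]")  ; => (47)
```

An empty result means the two forms match the same ASCII characters; the second call shows that dropping the slash from the explicit form is detected.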

address@hidden
+A @samp{-} also appear at the beginning of a character alternative, or

'appears'

Attachment: chained-ranges.log
Description: Binary data
