[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#53680: Endless loop in peculiar case of string-match and string-matc
From: |
Christian Johansson |
Subject: |
bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50 |
Date: |
Tue, 1 Feb 2022 14:12:19 +0100 |
Alright, thanks for helping me find the correct regexp, this issue only
appeared on some peculiar strings
> 1 feb. 2022 kl. 13:56 skrev Mattias Engdegård <mattiase@acm.org>:
>
>
>>
>> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
>>
>
> The diagnostics by Lars and Andreas is correct. Let's look at it more
> closely, first translating the regexp to rx for ease of reasoning, and see if
> we can make it work:
>
> (rx (* (in "\t\r "))
> "implements"
> (+ (in "\t\r "))
> (+ (group
> (* (in "\t\r "))
> (+ (in "0-9A-Za-z" "\\_"))
> (? ",")))
> (* (in "\t\r "))
> "{" eol)
>
> The first line is meaningless since it can match the empty string, but you
> probably want to anchor the start of "implements" so that it doesn't match
> "house_implements". Let's also drop the capture group, and we get:
>
> (rx symbol-start "implements"
> (+ (in "\t\r "))
> (+ (* (in "\t\r "))
> (+ (in "0-9A-Za-z" "\\_"))
> (? ","))
> (* (in "\t\r "))
> "{" eol)
>
> You clearly want to match a non-empty sequence of 'words' separated with
> whitespace and/or commas, but the pattern is ambiguous -- all inter-word
> separators are optional. Let's make them mandatory:
>
> (rx symbol-start "implements"
> ;; mandatory whitespace
> (+ (in "\t\r "))
> ;; then a word
> (+ (in "0-9A-Za-z" "\\_"))
> ;; then maybe more words, each prefixed by spaces or comma
> (* (+ (in "\t\r ,")) ; fast and loose
> (+ (in "0-9A-Za-z" "\\_")))
> ;; finally whitespace before the curly bracket
> (* (in "\t\r "))
> "{" eol)
>
> which is reasonably efficient, since all ambiguity is now gone: the regexp
> can (almost) only match in one way.
>
> Note the "fast and loose" pattern where we accept any number of spaces or
> commas. Here it depends on your grammar but if you want exactly one comma
> separating each word, that subexpression would be something like
>
> (* (in "\t\r "))
> ","
> (* (in "\t\r "))
>
> instead.
>