[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Improving regexp-opt
From: |
Stefan Monnier |
Subject: |
Re: Improving regexp-opt |
Date: |
Thu, 07 Feb 2019 22:48:01 -0500 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux) |
> (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr"))
> -> "\\(?:c\\(?:\\(?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)"
> First of all, there is an (apparently) unnecessary group around the result.
FWIW, I think this is not an error: we want (concat (regexp-opt STRS) "*")
to have a well-defined behavior (i.e. allow any number of repetitions of
STRS).
> (regexp-opt '("master" "monster" "mister"))
> -> "\\(?:m\\(?:\\(?:on\\|[ai]\\)ster\\)\\)"
> It would be better (imo) eliminate unnecessary groups (those without "\\|")
> that the result was
> "m\\(?:on\\|[ai]\\)ster"
Here, OTOH, the second (shy) subgroup is indeed unnecessary.
Regarding improving regexp-opt, in the general case you're looking at
minimizing finite state automatons. When regexp-opt was written, the
main purpose was to try and reduce backtracking and for that it's
perfectly sufficient to turn ("ack" "attack") into
"a\\(?:ck\\|ttack\\)". I later added "tail sharing" so that ("ack"
"attack") turns into "a\\(?:tta\\)?ck" but that's not really much use in
practice. We could try and get fancier, but it will tend to slow down
regexp-opt even more for rather small benefits (except in corner cases).
A much better approach is to go for a real "regexp to NFA/DFA
conversion". The `lex.el` package is one such example, but it's very
inefficient (in terms of building the FA and in the size of the FA, not
in terms of running the FA).
Stefan