|
From: | Syan Tan |
Subject: | [Gnumed-devel] weekly journal reviews |
Date: | Fri, 30 Jun 2006 15:50:24 +0800 |
in the style of journal review sessions, where people intermittently
present what they've been reading in order to facilitate the transmission of
knowledge, I'd like to add another episode to the "how to write a regular _expression_ recognizer program",
which fits in as a useful thing to do in the lexer part of compiler composition,
if compiler composition is divided into the tasks of lexer, parser, syntax translation , code generation, code optimization,
Interestingly , the lexer needs a mini-parser, which is the regular _expression_ parser,
and it converts a regular _expression_ into a binary tree representation of the regular _expression_.
The tree can then be traversed in pre-order, to build a state transition table representing a deterministic
finite automaton, or a non-deterministic finite automaton (NFA).
The NFA build procedure is quite straightforward, and is called Thompson's algorithm.
It basically says what state transitions to do for a given terminal or non-terminal node in the parse tree.
Terminal nodes have literal characters that need to be matched, e.g. (a|b)*abb the characters are a and b.
1. For these terminal symbols, the state transition diagram is
initial-state --- symbol ---> final-state
2. For the operation union or alternate ( | ) ,
the state transition diagram is
initial-state --> e-transition--> node1-state-diagram --> e-transition--> final-state
|__________> e-transition -> node2 -state-diagram--> e-transition ____|
this shows that the union state diagram has 2 possible transitions out of the initial state,
and for the algorithm to work, the symbol for transition is the e symbol ( or empty symbol),
this makes the diagram non-deterministic , because for the one symbol (e) , there is more than one transition.
3. for the concatenate operation, which is one symbol followed by another e.g. abb is a concat b concat b
the algorithm suggests getting the Si ---symbol --> Sf diagrams of each symbol, but have only one state
between transitions , so that the final state of a preceding state is merged with the initial state of the next state.
Actually, this may not be necessary for the algorithm to work, and maybe just inserting an e transition between
symbol state transitions would work just as well, although I haven't tested it.
4. for the kleene-star operation, which is zero or more repititions of the single child node, the
transition diagram is
---------------------e------------------------>
| |
Si --->S1---> child diagram --> S2 ---> Sf
| |
<--------- e --------------<
the outer e-transition is the zero repetition, and the inner e -transition which forms a loop, is the 1 or more repetition.
5. the bracket operation is not in the algorithm, and is implemented by avoiding a pop off the stack, when
building a parse tree.
In order to build the state transition table, the initial and final states of a particular node in the parse tree
can be stored as attributes of the node ; this way, operator nodes, can access the initial and final states
of child nodes to do the thompson algorithm state transitions.
Once the tree has been traversed pre-order, the initial state of the tree's root should be
the initial state of the NFA recognizer for the particular regular _expression_.
In order to see whether the NFA "recognizes" a string,
the nfa algorithm depends on finding the e-closure of the current set of states.
The e-closure set is the set of states that can be reached by doing no transition, or 1 or more
e transitions from a state.
The algorithm starts with the initial set as being the e-closure of the root nodes initial state.
Then the current character of the string being checked is used to move from the states
of the current set of states, to another set of states, using the state transition table from
computing thompson's algorithm.
Then from this set of states,
the e-closure set is found ( using the transition table again),
and this set is used in the processing the next character.
At the end of the string, if the e-closure set includes the final state of the root node,
then the NFA accepts the string. Because the e-closure operation includes a no transition
from the current set of states, this repeated e-closure only makes the set grow, and
it needs to grow to include the final accepting state.
Attached is the regular _expression_ tree parsing program modified to accept the . wildcard character,
and a nfa building program using thompson's algorithm.
The book says that the nfa accepting algorithm can be implemented using two stacks and
a boolean vector indexed by states that have been seen. One stack has the input set of states,
which are individually popped, and a symbol move lookup done on the popped state, and
if the next state isn't marked seen in the boolean vector, is added to the other stack.
The other stack is then popped to examine for the reachable states by e-transitions for each
state on the stack, and then put on the other stack. When the string processing has finished
the bit vector can be examined to see if the root node's final state has been seen.
nfa.h
Description: Binary data
parsenode.h
Description: Binary data
regex_parse1.c
Description: Binary data
nfa.c
Description: Binary data
[Prev in Thread] | Current Thread | [Next in Thread] |