Modularized parser files covering similar grammars
From: Frans Englich
Subject: Modularized parser files covering similar grammars
Date: Mon, 13 Jun 2005 13:05:24 +0000
User-agent: KMail/1.8.50
Hello,
I have a design dilemma that will become real some time in the future, and
considering how large it is, I thought it would be a good idea to take a quick
look ahead.
I am building a Bison parser for a language, or, to be precise, multiple
languages which are all very similar. I have a "main" language, followed by
four other languages which are all subsets of the main language.
To be precise, I'm building a parser for the XPath language, and the different
flavours I need to be able to distinguish are:
* XPath 2.0. This is as broad as it gets.
* XPath 1.0. A subset of XPath 2.0 (XPath 2.0 is an extension of XPath 1.0).
* XSL-T 2.0 Patterns. A small subset of XPath 2.0.
* XSL-T 1.0 Patterns. A small subset of XPath 1.0.
* W3C XML Schema Selectors. An even smaller subset of XPath 1.0.
My question is how I should practically modularize the code in order to
efficiently support these different languages.
First of all, my thought is that the scanner (flex) is the same in every
case (i.e., it supports all tokens in XPath 2.0), and that distinguishing the
various "languages" is done at a higher level (the parser).
Distinguishing XPath 1.0 from 2.0 is, from what I can tell, the easiest.
Since XPath 2.0 is an extension of 1.0, one can pass the parser an argument
signifying whether it is 1.0 that is being parsed, and in the actions for
2.0-only expressions error out if so.
In other words, conditional checks on a per-action basis.
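A minimal sketch of such a per-action check, assuming a hypothetical
`%parse-param` flag `xpathVersion` and using the 2.0-only range expression
("A to B") as the example (the token and helper names are made up):

/* Hypothetical fragment: the caller passes the version via %parse-param. */
%parse-param { int xpathVersion }   /* 10 for XPath 1.0, 20 for XPath 2.0 */

%%
/* Range expressions exist only in XPath 2.0, so reject them
   when the caller asked for a 1.0 parse. */
RangeExpr: AdditiveExpr TO AdditiveExpr
             {
               if (xpathVersion < 20)
                 {
                   yyerror ("range expressions require XPath 2.0");
                   YYERROR;
                 }
               $$ = makeRangeExpr ($1, $3);
             }
         ;

This keeps a single grammar file, at the price of scattering version checks
through the actions.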
This approach, however, easily becomes complex when the other grammars are
taken into account, because one needs to be "context" aware. For example,
XSL-T Patterns is a subset, but the disallowed constructs are only disallowed
in certain scenarios. Hence, continuing with conditional tests ("What language
am I parsing?") inside actions would require implementing "non-terminal
awareness".
Another approach, which seems attractive to me if it's possible, is to
modularize the grammar at the API/file level. For example, the tokens are
declared in one file, the non-terminals are grouped into files, and a separate
parser is constructed for each language. It would be preferable if it were
also modularized at the object level, but I guess the disadvantage wouldn't be
that big if it weren't. In other words, if one could "select the start symbol
depending on the language", it would solve my problems, it seems. I don't know
how this "Bison modularization" would be done in practice, though.
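One common workaround for "selecting the start symbol" is to keep a single
grammar with one artificial start alternative per language, and have the
scanner emit a language-selecting sentinel token before anything else. A rough
sketch under that assumption (all token and non-terminal names here are made
up):

/* Hypothetical: the driver records which language is wanted before calling
   yyparse(), and yylex() returns the matching sentinel as its first token. */
%token START_XPATH20 START_XPATH10 START_PATTERN20 START_PATTERN10 START_SELECTOR

%%
TopLevel: START_XPATH20   XPath20Expr
        | START_XPATH10   XPath10Expr
        | START_PATTERN20 Pattern20
        | START_PATTERN10 Pattern10
        | START_SELECTOR  SchemaSelector
        ;

Each sub-language then gets its own non-terminal subtree, so the "is this
construct allowed here?" question is answered by grammar structure rather than
by run-time checks in actions.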
What are people's experiences with these kinds of problems? What approaches
are there for solving them?
Cheers,
Frans
PS.
For those interested, here are the EBNF productions for what I'm talking
about:
XPath 2.0 (1.0 is merely a subset):
http://www.w3.org/TR/xpath20/#nt-bnf
XSL-T Patterns:
http://www.w3.org/TR/xslt20/#pattern-syntax
W3C XML Schema Selectors:
http://www.w3.org/TR/xmlschema-1/#coss-identity-constraint
By the way, there's also an interesting document regarding parser/scanner
construction and XPath, "Building a Tokenizer for XPath or XQuery":
http://www.w3.org/TR/2005/WD-xquery-xpath-parsing-20050404/