Re: Initial %include implementation
From: David M. Warme
Subject: Re: Initial %include implementation
Date: Sat, 17 Nov 2018 18:52:45 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
The %include feature (as coded in this patch) is not terribly useful.
Here is why:
A proper bison source file is partitioned into various sections, and
these sections appear in a certain order (as quoted from the latest
Bison manual):
%{
Prologue
%}
Bison declarations
%%
Grammar rules
%%
Epilogue
The canonical use-case for %include is to be able to share
commonly used grammatical sub-phrases among several top-level grammars.
For example, a non-terminal "distribution" that recognizes any of
several random distributions (constant, uniform, normal, exponential,
gamma, beta, Weibull, etc.), each having its own particular syntax.
This %include file (Akim called it a "module") could easily need to
deposit various bits of information into all four of the sections
(prologue, declarations, rules and epilogue) to properly encapsulate
all of the functionality associated with this grammatical sub-phrase.
(Actually, there is a 5th pertinent section -- the "%union items".)
The suggested patch is just a classic hierarchical "character stream"
that would force all %included text to appear immediately at the point
in the particular section in which the %include directive is placed.
Think about how bollocksed up Bison would get if such an include file
contained a %% delimiter!
In reality, you probably want each include file to have structure
similar to the top-level grammar file, with 5 individual sections of
its own.
I actually implemented a facility to do exactly this more than 20 years
ago -- but I implemented it as a set of GNU M4 macros, not internally
to Bison. These macros process a ".g" top-level grammar file,
possibly "including" several ".gh" grammar header files, and produce
a single ".y" Bison source file. They provide the following facilities:
Include_Grammar(distribution.gh)  /* Include "grammar header" file distribution.gh */
C_Declarations        /* Text belonging in Prologue section */
Union_Items           /* Text belonging inside %union { ... } */
Grammar_Declarations  /* Text belonging in Bison declarations section */
Grammar_Rules         /* Text belonging in the Grammar rules section */
Additional_Code       /* Text belonging in the Epilogue section */
These macros use M4's "m4_divert()" facility (a separate diversion
for each of the 5 sections defined above), and upon detecting EOF,
they concatenate all 5 of these diversions into a single Bison
source file having the proper delimiters between the sections.
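A notional sketch of the mechanism (this is NOT our actual
implementation; it assumes GNU M4 run with --prefix-builtins, with
the quote characters changed to square brackets):

```m4
m4_changequote([,])m4_dnl
m4_dnl  Each section macro switches output to its own diversion.
m4_define([C_Declarations],       [m4_divert(1)])m4_dnl
m4_define([Union_Items],          [m4_divert(2)])m4_dnl
m4_define([Grammar_Declarations], [m4_divert(3)])m4_dnl
m4_define([Grammar_Rules],        [m4_divert(4)])m4_dnl
m4_define([Additional_Code],      [m4_divert(5)])m4_dnl
m4_define([Include_Grammar],      [m4_include([$1])])m4_dnl
m4_dnl  At EOF, emit the five diversions in order, inserting the
m4_dnl  proper Bison delimiters between the sections.
m4_m4wrap([m4_divert(0)%{
m4_undivert(1)%}
%union {
m4_undivert(2)}
m4_undivert(3)%%
m4_undivert(4)%%
m4_undivert(5)])m4_dnl
```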
This is just the "tip of the iceberg," however. The next thing you
encounter when partitioning grammars into modules like this is token
numbers. Each top-level grammar "sees" a distinct set of terminal
symbols -- and Bison assigns a unique token number to each of these
terminals. But you can pretty much guarantee that if Bison picks
token number 163 for the LEFT_PAREN token when processing top-level
grammar "foo.y", it will pick something *other* than token number
163 for LEFT_PAREN when processing top-level grammar "bar.y".
This means that you
are going to have to do one of the following things in order to
get properly working lexical analyzers for these grammars:
1. Write a separate, custom lexical analyzer (floor to ceiling)
for each top-level grammar that you process. (Yikes!)
2. Write some sort of "templated" lexical analyzer that is
shared by all top-level grammars, but that you must re-compile
separately for each top-level grammar. For example, it could
#include the "y.tab.h" file that is associated with the
particular top-level grammar this lexical analyzer will be
used with, and do #ifdef LEFT_PAREN ... #endif type stuff to
determine whether this grammar needs to recognize "(" tokens
or not. Still pretty gnarly, and it makes the lexical analyzer
a bottleneck for everyone who is writing new grammar (e.g., adding
new keywords).
3. Write a single "chameleon" lexical analyzer that can dynamically
configure itself to the particular grammar it is being asked to
lex. Using a custom Bison parser skeleton, for example, one can arrange
to have the following code executed early in the initialization
phases of the generated yyparse() function:
d_lexer -> initSpecialTokens (YYNTOKENS, yytname);
This function slogs through the yytname[] array, deducing the
token number of each terminal symbol, and what "lexical syntax"
is required for each such token number. Here is a notional
yytname[] that I extracted (and cleansed) from one of our grammars:
#define YYNTOKENS 43
static const char *const yytname[] =
{
  "$end", "error", "$undefined", "IDENT", "STRING", "CCONST", "ICONST",
  "FCONST", "\",\"", "\"or\"", "\"and\"", "\"eq\"", "\"ne\"", "\"<\"",
  "\"<=\"", "\">\"", "\">=\"", "\"+\"", "\"-\"", "\"*\"", "\"/\"", "\"%\"",
  "\"(\"", "\")\"", "\"not\"", "\"true\"", "\"false\"", "\"::\"", "\"[\"",
  "\"]\"", "\":\"", "\"{\"", "\"}\"", "\"keyword1\"", "\"keyword2\"",
  "\"keyword3\"", "\"begin!\"", "\"end!\"", "\";\"", "\"=\"",
  "\"keyword4\"", "\"keyword5\"", "\"keyword6\"", "$accept",
  /* Other entries for non-terminal symbols omitted. */
};
Our "chameleon" lexer internally knows about whitespace, C and C++
style comments, and the following "primitive" token types:
IDENT  - C or C++ style identifiers.
STRING - Double-quoted strings.
CCONST - Single-quoted strings (character constants).
ICONST - Integer constants.
FCONST - Floating-point constants.
All other tokens in yytname[] must be double-quoted strings that split
cleanly into one of the following two buckets:
a. Keywords. These are first recognized as an IDENT, but then looked up
in a hash table of "exceptions". (Our implementation allows you to make
a keyword "reserved" if its yytname[] entry ends with a "!", e.g.,
"\"begin!\"" and "\"end!\"" in the above example. In this case the
keyword's token number is always returned. If the keyword is NOT
reserved, our implementation checks whether the keyword's token
number is "shiftable" in the current parsing context: if so, it
returns the keyword's token number; otherwise it returns IDENT's
token number.)
b. Punctuators. These consist of one or more entirely non-alphabetic,
non-digit characters. (Things containing /*, */ and // are illegal.)
After analyzing each token entry of yytname[], our chameleon lexer sets
up all of the hash tables (for keywords) and tries (for punctuators)
necessary to quickly recognize each of these tokens as they appear in
the input file, and return the proper token number that the grammar
being parsed expects.
This gives us the luxury of a single lexical analyzer (fairly complex, but
write it once and be done with it), and then the ability to write grammar
fragments such as the following:
-----------------------------------------------------
/* In the "distribution.gh" grammar header file... */
#include "distribution_pnode.h" /* Defines struct distribution_pnode */
Include_Grammar(fnum.gh)
Union_Items
struct distribution_pnode * y_distribution;
Grammar_Declarations
%type <y_distribution> constant_distribution
%type <y_distribution> distribution
%type <y_distribution> exponential_distribution
%type <y_distribution> normal_distribution
%type <y_distribution> uniform_distribution
Grammar_Rules
distribution:
constant_distribution
| uniform_distribution
| normal_distribution
| exponential_distribution
;
constant_distribution: "constant" "(" fnum ")"
{ $$ = constant_distribution_create (@2, $3); };
uniform_distribution: "uniform" "(" fnum "," fnum ")"
{ $$ = uniform_distribution_create (@2, $3, $5); };
normal_distribution: "normal" "(" fnum "," fnum ")"
{ $$ = normal_distribution_create (@2, $3, $5); };
exponential_distribution: "exponential" "(" fnum "," fnum ")"
{ $$ = exponential_distribution_create (@2, $3, $5); };
----------------------------------------------
/* In the "fnum.gh" grammar header file... */
Include_Grammar(inum.gh)
Grammar_Declarations
%type <y_double> fnum
Grammar_Rules
fnum:
FCONST { $$ = $1; }
| "+" FCONST { $$ = + $2; }
| "-" FCONST { $$ = - $2; }
| ICONST { $$ = $1; }
| inum { $$ = $1; }
;
---------------------------------------------
/* In the "inum.gh" grammar header file... */
Grammar_Declarations
%type <y_long> inum
Grammar_Rules
inum:
ICONST { $$ = $1; }
| "+" ICONST { $$ = + $2; }
| "-" ICONST { $$ = - $2; }
;
There are some other things that you discover when you go down this road.
Bison starts generating lots of warnings. For example, when your
last grammatical reference to "distribution" gets deleted from your top-
level grammar, Bison complains about distribution, constant_distribution,
uniform_distribution, normal_distribution, exponential_distribution --
and perhaps even fnum and inum -- as rules/non-terminals that are defined
but never used by the grammar. We implemented the following:
%begin-library "fnum.gh"
...
%end-library
All of the rules and declarations occurring between these two directives
are considered to be part of a library (or module) named "fnum.gh".
It is acceptable to use only a portion of the rules/symbols appearing in
a library, and (our modified version of) Bison does not report any of
them as being "useless". However, if *all* of them are unused, then it
instead reports that the corresponding "library" is useless.
I regret that I cannot share our actual implementation of all this with
you because the code is "owned" by the US Government, and I do not have
the authority to release any of it. None of it is rocket science,
however. Once you understand the issues that need to be addressed, it
is pretty clear how to do so.
We implemented the basic infrastructure more than 20 years ago, and did
some significant additional renovations about 5 years ago:
- The %begin-library and %end-library directives.
- Make Include_Grammar automatically implement "include guard"
functionality that prevents multiple inclusion.
- Modify M4 (that still implements our "grammar file inclusion"
capability) with the ability to automatically create "make"
dependencies. (The M4 maintainers have *still* not included
these enhancements into any released version of M4.)
These allowed us to update our "make" system to automatically generate
dependencies for grammar files, and we finally obtained completely "clean"
builds of our software (no more useless rule/symbol warnings from Bison
because of partially used include files).
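The include-guard behavior mentioned above can be sketched in a few
lines of M4 (again a notional version, not our code; it assumes
--prefix-builtins mode with square-bracket quoting):

```m4
m4_dnl  Expand each ".gh" file at most once: the first call defines
m4_dnl  a guard macro named after the file; subsequent calls see the
m4_dnl  guard and expand to nothing.
m4_define([Include_Grammar],
  [m4_ifdef([INCLUDED_$1], [],
    [m4_define([INCLUDED_$1], [1])m4_include([$1])])])m4_dnl
```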
Although I cannot contribute this code, I hope that the knowledge and
experience that I have gained making all of this machinery work (well)
can help inform the Bison developer community about what works, what
does not, and why.
David Warme
On 11/11/2018 12:05 PM, Akim Demaille wrote:
> On 10 Nov 2018, at 15:08, Egor Pugin <address@hidden> wrote:
>> Let '#include' die, welcome 'import'.
> I see.
>> But couldn't include and import be orthogonal in bison?
> I don't want to have two features to maintain.
>> It's nice to have common prologues/epilogues, also parts of grammar,
>> as you said.
>> Maybe %include is a nice-to-have feature, before future possible imports?
> I confess that I'm kind of worried about %import: the infrastructure
> of Bison is far from being ready. We don't really have an AST, for
> instance, so parsing several different files will not be convenient.
> And I'm not particularly impatient to write an AST in C. I wish
> Bison were in C++...