DOM parsing in bison (was: Parsing a language with optional spaces)
From: Adrian Vogelsgesang
Subject: DOM parsing in bison (was: Parsing a language with optional spaces)
Date: Sat, 18 Jul 2020 08:26:22 +0000
User-agent: Microsoft-MacOutlook/10.10.18.200713
Hi Akim, hi Christian,
Really interesting thread and discussion!
I almost feel guilty for only picking up one particular point ;)
Akim's comment
> No, it's also about the targeted model. Most other GLR parser generators
> address a somewhat simpler issue: their parsers generate the final AST.
> They are DOM-parsers. Bison is more like SAX-parsers: we run user actions.
> Memory management is really a problem here.
touches on an interesting point for me.
It made me realize: in Hyper, we are actually using bison only to generate a
DOM tree. We built our own abstraction on top of bison, a "bison preprocessor"
which translates a kind of ad-hoc "DOM-description language" used in the parser
actions into actual C++ code for our actions, which can then be fed into bison.
Would it be valuable to have something similar in bison?
I.e., teach bison some kind of short-hand syntax which can be used in actions
to create a DOM tree?
Would others also be interested in such a DOM-building capability?
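To make the idea concrete, here is a toy sketch (in Python purely for
brevity; our actual preprocessor emits C++, and all names here are
hypothetical) of what a DOM-building shorthand in an action might expand to:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical DOM node type; a real implementation would be generated C++.
@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)

# A shorthand action such as `$$ = BinaryExpr($1, $3)` could expand to a
# plain constructor call like this, so the action only *describes* the tree
# instead of running arbitrary side-effecting code.
def make_node(kind, *children):
    return Node(kind, list(children))

# e.g. for a rule `expr: expr '+' expr`, the generated action would build:
tree = make_node("BinaryExpr", make_node("Int"), make_node("Int"))
```

The point is that memory management and node construction are factored out
of the hand-written actions entirely.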
Cheers,
Adrian
On 18/07/2020, 08:47, "help-bison on behalf of Akim Demaille"
<help-bison-bounces+avogelsgesang=tableau.com@gnu.org on behalf of
akim@lrde.epita.fr> wrote:
> On 14 July 2020 at 13:31, Christian Schoenebeck
<schoenebeck@crudebyte.com> wrote:
>
> On Monday, 13 July 2020 at 07:56:57 CEST, Akim Demaille wrote:
>
> For the 'married' parser, in the form imagined by me, there would be no
> lookahead token, as a lookahead token implies a context-unaware scanner.
>
> Instead, when the parser got to a state where it had to decide between
> reducing or shifting (depending on a next token), it would not decide but
> instead fork the parser state, in order to allow context-specific
> tokenization.
That's indeed a possible choice in the implementation of GLR parser
generators: to sit GLR on top of LR(0). I've never seen performance
reports that compare both approaches, though.
However, most implementations use SLR(1), because it is super simple to
implement (especially compared to LALR(1)), yet makes a significant
difference.
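As a toy illustration (not Bison's actual implementation; the table and
states below are made up), GLR forking can be pictured like this: whenever
the table offers several actions for the current state and token, the
parser duplicates the stack and pursues every alternative, and forks with
no applicable action simply die off:

```python
# Toy sketch of GLR forking, not a real parser: the "table" maps
# (state, token) to a list of successor states; more than one entry means
# a shift/reduce or reduce/reduce conflict, and the parser forks.
def glr_step(stacks, token, table):
    forked = []
    for stack in stacks:
        for successor in table.get((stack[-1], token), []):
            forked.append(stack + [successor])  # one fork per action
    return forked  # stacks with no applicable action die off

# A conflicted entry: in state 0, token 'a' allows two actions.
table = {(0, "a"): [1, 2], (1, "b"): [3]}
stacks = glr_step([[0]], "a", table)   # forks into [0, 1] and [0, 2]
stacks = glr_step(stacks, "b", table)  # the [0, 2] fork dies
```

With an LR(0) base table every conflict forks, which is exactly why the
choice of base automaton (LR(0) vs. SLR(1) vs. LALR(1)) matters so much
for performance.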
>> Yes, indeed. That's something which Joel detailed in his PhD. Note
>> though that PSLR targets deterministic parsing. Bison would (first)
>> target that, not GLR. But contributions are always welcome :)
>
> The longer I think about it, the more I wonder whether it wouldn't make
> sense to make a clear and radical cut with the past when developing a
> 'married' parser, to allow unfolding its full potential.
>
> I mean, I look at what you are working on, Akim, and AFAICS the majority
> of the time you have to spend on maintaining (i.e. not breaking) backward
> compatibility for ancient use cases. Most of Bison's conceptual design
> and internals are still based on requirements and opinions from around
> 1980 which simply no longer exist or have proved wrong.
I don't subscribe to this opinion :) The core of Bison is well established,
and I have very little maintenance to do on it. As a matter of fact, if you
look at the recent history, there's hardly any work in the automaton
construction. Most of my recent efforts are about diagnostics. Diagnostics
about the grammar (which is also what the current work on counterexamples is
about), and in the generated parsers (to provide better error messages).
Many, many people *want* a *deterministic* and *fast* parser.
They want determinism because with GLR (with conflicts) you just don't know
if some day your parser will fail because of an unforeseen ambiguity. I
know people who build critical software with Bison, and "possible failure"
on valid input is not an option. Bison is strong in this regard. IELR is
arguably the most powerful form of LR(1) parsing you can find on the
market, with the size of LR(0). And you can have canonical LR(1) if you
want.
GLR is slow compared to LR. People rely on our generated LR parsers being
fast. That's a true feature of Bison's. Not only does determinism provide
people with guarantees ("your grammar is unambiguous"), it also provides
them with efficiency.
Although I don't have figures about "blind GLR" (i.e., seated on LR(0)),
I'm pretty sure it would be disastrous for anything serious. So you'd have
to go at least to SLR(1). But then, what would be the point of throwing
away all the good technology we have for LALR(1), IELR(1) and canonical
LR(1)? If I were to build a GLR parser from scratch today, I would
certainly aim at SLR(1). You are right, LALR(1) is a nightmare, but Robert
Corbett fought that fight some 40 years ago; and IELR(1) was certainly
super tricky, but Joel E. Denny won that battle more than 10 years ago.
Why would I discard such great achievements?
> Besides what we already discussed, I think it would make sense to get rid
> of a bunch of other things that Bison still retains:
>
> 1. No longer multiple parser types, only one: a somewhat GLR-like parser
>    as discussed, a real superset of LALR, but without lookahead token (so
>    not really LALR-based under the hood). Theoretically there is no
>    disadvantage in doing so, because if a developer feeds it an
>    LALR(1)-capable grammar, it would still end up having the same parsing
>    complexity/efficiency as an old-school LALR parser would, probably
>    just with a slightly higher memory footprint.
>
>    Advantages though: much more user friendly and powerful, and a lot
>    less code to maintain.
I bet we are talking about making the parsers at least one order of
magnitude slower here (blind GLR). Of course this is not an option.
Besides, if people want to look for brand new trendy techniques, why would
they come to Bison? There are plenty of alternatives out there, and if
Bison still exists today, it's because there's already tons of software
that rely on its current behavior. I will *not* break that. *Never*.
That's our "raison d'être" (some English native speaker should check my use
of French expressions, I'm not sure I'm using them properly :-).
I can add *more* options, provide better alternatives, but Bison cannot
afford to drop features just like that.
> 2. No longer raw C tables as pregenerated parser and scanner tables;
>    instead of such C tables: an intermediate, dynamic in-memory model
>    with a high-level API allowing one to access and modify the grammar
>    and patterns on the fly.
>
>    Advantage: it would allow self-extensible languages (e.g. an input
>    that adds keywords or patterns, drops or adds grammar rules while
>    being parsed). And implementation-specific things would finally be
>    clearly isolated by introducing such an API -> easier to maintain.
>
>    Disadvantages: probably none, except consuming slightly more memory
>    for meta information.
This is interesting, but that's not Bison you are talking about.
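For what it's worth, the kind of dynamic in-memory grammar model the
proposal describes might look roughly like this (a hypothetical sketch;
nothing like this exists in Bison, and all names are invented):

```python
# Hypothetical mutable grammar model, as point 2 above suggests: rules can
# be added and dropped at runtime, e.g. by the input being parsed.
class DynamicGrammar:
    def __init__(self):
        self.rules = {}  # nonterminal -> list of right-hand sides

    def add_rule(self, lhs, rhs):
        self.rules.setdefault(lhs, []).append(list(rhs))

    def drop_rule(self, lhs, rhs):
        self.rules[lhs].remove(list(rhs))

# An input could extend (or shrink) the language mid-parse:
g = DynamicGrammar()
g.add_rule("stmt", ["IF", "expr", "stmt"])
g.add_rule("stmt", ["ID", "=", "expr"])
g.drop_rule("stmt", ["IF", "expr", "stmt"])
```

The hard part, of course, is not this data structure but recomputing (or
incrementally patching) the parse tables whenever it changes.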
> 3. No longer aggressive compression of parser states (which is somewhat
>    already implied by [2.]), because such compressions lead to important
>    information loss when it comes to grammar debugging, auto-completion
>    features, and other user-relevant purposes. The motivation that led
>    to such compressions (very low RAM) no longer exists in >99% of use
>    cases today.
Come on! We all know that today the very same constraints ("be small")
still apply, but indeed for completely different reasons. Compressing used
to be about fitting in RAM; today it's about fitting in the cache. Not to
mention lower-end devices.
I don't believe in the Unique True Omnipotent Parser Yielder (let's call it
UTOPY for short). There are far too many different use cases. Some want to
change the language dynamically, some want determinism, some want something
DOM-oriented rather than SAX-like, some have strong CPU/time constraints,
some don't care, some experiment while others need to be production-ready,
etc. Bison fits into a niche, and it would be an error to depart from it.
> So these concepts are probably too radical for trying to still fit them
> into Bison's current (3.x) design.
Don't expect that in Bison 4 either.
>> Parsers such as SGLR have stacks which are graphs, while Bison's is
>> always a tree. All GLR parsers manage to share the past of concurrent
>> stacks; they don't have a set of separate stacks. Bison's GLR stacks,
>> of course, share their common past: it's stored only once. I believe
>> SGLR is also able to share the common "future", so their stacks are
>> no longer trees. This allows SGLR to keep decent performance when
>> parsing ambiguous grammars and you want to fetch the resulting parse
>> forest.
>
> Ok, I read it as Bison's current GLR parser probably not having seen much
> attention in the past.
No, it's also about the targeted model. Most other GLR parser generators
address a somewhat simpler issue: their parsers generate the final AST.
They are DOM-parsers. Bison is more like SAX-parsers: we run user actions.
Memory management is really a problem here.
If you wish, Bison's GLR is tailored for "mostly deterministic grammars".
Its point is more to grant us infinite lookahead than to be a means to
cover ambiguous grammars. At least that's the mental picture I have of
Paul Hilfinger's work on GLR. And AFAICT, there were never any complaints
about that.
> BTW, from my perspective a set of parser stacks (with shared past) is
> IMO still a tree, however a tree with 3 dimensions instead of 2. A
> 'graph' to me is something much more generic, as a graph allows much
> more complex structures (with fewer restrictions), like traversing to
> any other state, which is not possible in a tree.
We agree here. You seem to be answering something I did not mean, so I
venture I was unclear.
Bison's GLR stack is a tree. SGLR's is a DAG. Bison's GLR stacks
only have forks, SGLR's also have joins.
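A minimal way to picture the difference (illustrative Python; the node
type is made up): if each stack node records its parents, forking gives
two tops that share one past, so the structure stays a tree; joining gives
one top with two parents, which makes it a DAG:

```python
# Toy stack nodes: each node records its state and its parent nodes.
class StackNode:
    def __init__(self, state, parents=()):
        self.state = state
        self.parents = list(parents)

root = StackNode(0)
# Fork (Bison's GLR): two tops, each with one parent -> still a tree.
left = StackNode(1, [root])
right = StackNode(2, [root])
# Join (SGLR): the two branches reconverge into one top -> a DAG.
joined = StackNode(3, [left, right])
```

In tree-shaped stacks the `joined` node could not exist: the two branches
would have to be carried forward separately, duplicating all future work.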
Cheers!