help-bison

Re: Parsing a language with optional spaces


From: Christian Schoenebeck
Subject: Re: Parsing a language with optional spaces
Date: Sun, 09 Aug 2020 17:34:21 +0200

On Sunday, 26 July 2020 10:56:32 CEST, Akim Demaille wrote:
> >> That's indeed a possible choice in the implementation of GLR parser
> >> generators: to sit GLR on top of LR(0).  I've never seen performance
> >> reports that compare both approaches though.
> >> 
> >> However, most implementations use SLR(1), because it is super simple to
> >> implement (especially compared to LALR(1)), yet makes a significant
> >> difference.
> > 
> > Forking at this point would merely be an implementation trick to buy the
> > powerful feature of context-aware scanning. Those forks would be very
> > short-lived though, as they would usually already be collapsed in the
> > next step.
> That's not the point.  LR(0) is *really* limited.  It cannot even grok
> binary operators.  So you would spend a significant amount of your
> time getting things wrong.

Right, LR(0) is not an option.

> > So the theoretical overhead of this would probably be like O(c n^3)
> > instead of O(n^3), and assuming that c would be constant, that overhead
> > would probably be neglectable.
> 
> This is where we disagree :)  Sure, "neglectable" is subjective, but then,
> as the maintainer, I have to have stricter requirements.  I have to aim
> at the strictest requirements of Bison users, not the weakest, or something
> in the middle.

Well, it is common anyway to reduce complexity notation to its dominant term.
But the constant factor in O(c n^3) is somewhat misleading, as it only looks
at the parser side. The other side of the story is that context-aware
scanning could significantly simplify the scanner's regex state machine and
gain efficiency in individual parser states.
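To make that concrete, here is a minimal Python sketch of the idea (all state
and token names are invented for illustration): a context-aware scanner only
tries the token patterns that the current parser state can actually shift, so
each per-state matcher is smaller than the global union of all token regexes.

```python
import re

# Illustrative sketch of context-aware scanning. Instead of one global
# regex union, each parser state exposes only the tokens it can shift.
# All names here (the states "expr", "after_expr", ...) are made up.

GLOBAL_TOKENS = [
    ("NUMBER", r"\d+"),
    ("TYPE_ID", r"[A-Z][A-Za-z0-9_]*"),
    ("ID", r"[a-z][A-Za-z0-9_]*"),
    ("LBRACK", r"\["),
    ("RBRACK", r"\]"),
]

# Per-state token subsets: the effective state machine per parser state
# is smaller than the union of all token patterns.
STATE_TOKENS = {
    "expr": ["NUMBER", "ID", "LBRACK"],
    "after_expr": ["RBRACK", "LBRACK"],
}

def scan_one(text, pos, state):
    """Try only the token patterns valid in the current parser state."""
    allowed = set(STATE_TOKENS[state])
    for name, pattern in GLOBAL_TOKENS:
        if name not in allowed:
            continue
        m = re.match(pattern, text[pos:])
        if m:
            return name, m.group(0)
    raise SyntaxError(f"no token valid in state {state!r} at {pos}")

print(scan_one("foo[3]", 0, "expr"))   # ('ID', 'foo')
print(scan_one("]", 0, "after_expr"))  # ('RBRACK', ']')
```

Note how `TYPE_ID` never even competes with `ID` in the "expr" state; in a
global scanner, disambiguating the two is exactly the kind of hack people end
up maintaining by hand.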

Which side would weigh more, who knows ...

> You know, PostgreSQL people consider their Bison-generated parsers are slow,
> and they are looking for ways to speed it up.  I'm sure they are not the
> only ones.
> https://www.postgresql.org/message-id/CACPNZCvzAjiK-CP-qcwmqTrfwM039AGDw2Bq3_aboRot0m1NGA@mail.gmail.com

Right, Bison still always generates a generalized parser. It neither
generates optimizations for individual parts of a language, nor generates its
tables with optimization in mind.

But OTOH, when it comes to speeding up SQL queries, it would probably make
more sense to switch to precompiled SQL queries first, because in this case
that would gain more performance than dealing with Bison optimizations.

> > What I am wondering, independent of implementation aspects, would there be
> > an automated way (from theoretical POV) to identify the minimum required
> > amount of k (a.k.a. amount of lookahead tokens) for _individual_ parts of
> > the language to allow a variable k for one language, in order to find a
> > good compromise between runtime efficiency and flexibility, and yet being
> > easy to use?
> 
> In my experience, this point is of little practical value.  It's interesting
> from the theoretical point of view, but little to no value in practice.
> For at least two reasons.
> 
> First one is that even when this k exists, the implementation is a
> nightmare. Using a global k, full LR(k) parsing, is simple to build, it's
> quite a straightforward generalization of LR(1), but then you very quickly
> get huge automata, and you have to stick to a small k.  Local `k`, well, I
> don't know of any successful algorithm to do that for LR (it's much easier
> in LL), and I'm certainly not willing to have to maintain such a beast. Not
> just generating the automaton, but also adjusting the skeletons to run it.
> 
> Note that any LR(k) grammar can be turned into an LR(1) grammar.  In
> practice, people do that by hand (Adrian already explained his trick, and
> I described how Bison deals with its LR(2) grammar to parse it anyway).
> In that sense, people are already doing adaptive LR(k) parsing.
> When they can.

Yeah, that's the potential other side of dealing with this: instead of
automated adjustments under the hood, like finding an appropriate value for
k, providing an automated way to transform the grammar appropriately.
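For illustration, the token-merging trick Akim mentioned (Bison parsing its
own LR(2) grammar with an LR(1) automaton by letting the scanner glue an
identifier and the following ":" into one token) could be sketched like this
in Python; the token names are made up, not Bison's actual internals:

```python
# Sketch of the token-merging trick: the scanner post-processes its
# stream so (ID, x) followed by (COLON,) becomes a single ID_COLON
# token, removing the need for a second token of parser lookahead.
# Token names are illustrative only.

def merge_tokens(tokens):
    """Glue (ID, x), (COLON,) pairs into a single (ID_COLON, x) token."""
    out, i = [], 0
    while i < len(tokens):
        if (tokens[i][0] == "ID" and i + 1 < len(tokens)
                and tokens[i + 1][0] == "COLON"):
            out.append(("ID_COLON", tokens[i][1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# "expr : expr PLUS expr" -- spotting that "expr" starts a new rule
# needs 2 tokens of lookahead, unless the scanner already merged them:
stream = [("ID", "expr"), ("COLON",), ("ID", "expr"), ("ID", "PLUS")]
print(merge_tokens(stream))
# [('ID_COLON', 'expr'), ('ID', 'expr'), ('ID', 'PLUS')]
```

An automated grammar transformation would essentially have to discover and
apply this kind of rewrite for the user.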

For instance, with a modern IDE, trivial issues like wrong printf format
specifiers, type-cast issues, typos, etc. are highlighted with a warning and
are nowadays just a matter of clicking a "Fix" button that pops up in the
IDE. That's typically a feature wired via the clang API.

I could imagine the same while designing languages: e.g. ambiguities popping 
up as errors/warnings in IDEs with buttons allowing the user to select e.g. 
the precedence, and the grammar would then be transformed accordingly for the 
user.

> And second reason, a fixed k is not appropriate in many situations. 
> Consider the case of the Tiger programming language in which you
> instantiate array types this way:
> 
> array_of_int[42] of 51
> 
> where 42 is the size of the array, and 51 the initial value of these 42
> values.
> 
> In Tiger you also have the usual syntax to read inside an array:
> 
> my_array[42]
> 
> The "natural" rules are
> 
> exp
>   : typeid "[" exp "]" "of" exp
>   | lvalue
>   ;
>
> lvalue
>   : "id"
>   | lvalue "[" exp "]"
>   ;
>
> typeid
>   : "id"
>   ;
> You can easily see that this is not LR(1): you need to read the "of"
> to know whether you're in case 1 or case 2, and LR(1) only sees the "[".
> You might think it's LR(4) then, but no it is not, because of course
> the size of the array/the index are *expressions*, of unbounded length,
> 
> array_of_int[0 + 1 + 2 + .. + 65535] of 0
> my_array[0 + 1 + 2 + .. + 65535]
> 
> C++ is also well known for having such a need for unbounded lookahead.
> The section about GLR in Bison's manual has a few other examples.
> https://www.gnu.org/software/bison/manual/bison.html#GLR-Parsers
> 
> So, as far as I'm concerned, anything between LR(1) and GLR (or something
> else that supports ambiguities and unbounded lookahead) is a waste
> of energy.

Agreed.
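Akim's Tiger example can even be made concrete with a small Python sketch:
the token that decides between array instantiation and array indexing is the
"of" behind the matching "]", so the required lookahead grows with the size
of the index expression and no fixed k can suffice.

```python
# Illustration of the unbounded-lookahead point: measure how far the
# disambiguating token (the one right after the matching "]") sits
# from the "[" where the LR(1) parser must already commit.

def lookahead_needed(tokens):
    """Distance from "[" to the token that decides between the two rules."""
    start = tokens.index("[")
    depth, i = 0, start
    while True:
        if tokens[i] == "[":
            depth += 1
        elif tokens[i] == "]":
            depth -= 1
            if depth == 0:
                break
        i += 1
    return i + 1 - start  # position of the token after "]" relative to "["

for n in (1, 100):
    expr = ["0"] + ["+", "1"] * n  # 0 + 1 + ... -- length grows with n
    toks = ["array_of_int", "["] + expr + ["]", "of", "0"]
    print(lookahead_needed(toks))  # prints 5, then 203
```

Whatever fixed k one picks, a longer index expression defeats it.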

> > But the reality is also a 'bit' different, as many languages are not truly
> > LALR(1) with context-free tokens to the full extent. So what happens is
> > that people usually make hacks to fit their language into that concept
> > (e.g. pushing/popping states on the scanner side, bloating tokenization
> > rules, hand-crafted procedural tokenization code e.g. in C, etc.). And
> > that's where determinism often proved to break, and is often hard to fix
> > (manually) and to maintain as well, as there is always a danger of other
> > edge cases still being present.
> 
> You are very right.
> 
> But some of these tricks _can_ be addressed while still being deterministic.
> And that's exactly what Joel's dissertation demonstrated.  There _is_
> value in "context-sensitive lexical analysis" in LR parsing (I dislike
> using "context-sensitive" here, because that's not what context-sensitive
> grammars are about, we are still in the realm of context-free grammars,
> but it conveys the right idea of what expressive power the scanner
> gains).

Yes, the term "context-sensitive" is usually only used with grammars,
probably because historically, language-theory literature never had the
concept of tokens being recognized globally, decoupled from the grammar.
That's more a common implementation pattern that evolved because developers
meant to strictly decouple tokenizers from parsers in a unidirectional
relation.

To avoid ambiguity of terminology I used the term "context-aware" 
tokenization.

> > I still prefer a machine handling complexities over relying on humans to
> > understand and handle a large number of possibilities reliably. Because
> > the larger the number of possibilities, the higher the chance the human
> > loses that battle against a machine.
> 
> Agreed.  But with LR(1) Knuth made us win a significant battle in this war,
> and I don't want to lose what we gained.

The win was a common understanding of runtime features and their complexity
classes. But I would not say that it is a win to require people's grammars to
always be LR(1) compliant. The lower the usability barriers of a tool, the
higher the chance people will use it. And so far, the potential growth of new
parser-generator users is still very high.

No doubt, LR(1) or even LL(1) is the goal for achieving high performance.
But you would not start writing e.g. a C program with its final optimization
goal in mind. You start your C code with simplicity in mind first, and
optimize (if required at all) later on. If you are lucky, the C compiler's
backend already does the job with -O3 automatically for you.

So IMO the preferred way would be if language designers could start
prototyping a new language without necessarily knowing anything about LR(1).
The generated parser might then start off at O(n^3), for the sake of having
something working quickly. Subsequently, the designer could evolve the
language to achieve O(n^2), ... O(n).

> >> Besides, if people want to look for brand new trendy techniques, why
> >> would they come to Bison?  There are plenty of alternatives out there,
> >> and if
> > 
> > For instance?
> 
> ANTLR comes first to my mind, 

IMO not interesting. It's a top-down parser, so from a theoretical POV it is
already less powerful than LR(1). I mean, ANTLR allows you to choose any
(fixed) k, but AFAIK even LL(k) [with an arbitrarily high k] still yields a
proper subset of the LR(1) languages.

> and generators for PEG have flourished
> everywhere these past years.  The wikipedia page is endless,
> https://en.wikipedia.org/wiki/Comparison_of_parser_generators.

Those PEGs indeed look somewhat interesting, especially as they already come
coupled with context-aware tokenization AFAICS. But somehow I dislike that
the precedence of ambiguous rules is defined by the order in which they
appear in the grammar source. Plus, the memory footprint seems to be huge due
to the memoization being used.
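A minimal packrat sketch in Python (purely illustrative, not modeled on any
particular PEG tool) shows both points at once: ordered choice makes source
order decide precedence, and the (rule, position) memo table is where the
memory goes.

```python
from functools import lru_cache

# Toy PEG/packrat sketch for the grammar  S <- "ab" / "a".
# The lru_cache on (pos,) plays the role of the packrat memo table:
# it buys linear time at the cost of memory per (rule, position) pair.

TEXT = "ab"

@lru_cache(maxsize=None)  # the packrat memo table
def parse_S(pos):
    """Return the new position on success, None on failure."""
    if TEXT.startswith("ab", pos):  # first alternative: "ab"
        return pos + 2
    if TEXT.startswith("a", pos):   # second alternative: "a"
        return pos + 1
    return None

@lru_cache(maxsize=None)
def parse_S_swapped(pos):
    """Same grammar with reordered alternatives: S <- "a" / "ab"."""
    if TEXT.startswith("a", pos):
        return pos + 1
    if TEXT.startswith("ab", pos):
        return pos + 2
    return None

print(parse_S(0))          # 2: "ab" consumed entirely
print(parse_S_swapped(0))  # 1: "a" shadows "ab" -- source order decides
```

The second variant silently changes the language recognized, which is exactly
the source-order precedence behavior I dislike.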

What also looks interesting to me are parsers based on the Earley algorithm,
as they could theoretically deal gracefully with all context-free languages:
ambiguous context-free languages would run in O(n^3), with a memory footprint
linear in the input; unambiguous grammars would be O(n^2); and finally LR(1)
grammars would be O(n). There are not many implementations though, which
makes me wonder whether I am missing some major disadvantage from a
theoretical POV.
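For what it's worth, a minimal Earley recognizer fits in a few dozen lines of
Python. This sketch (no Leo optimization, no parse forest, and an invented
toy grammar) happily accepts an ambiguous expression grammar that no LR(k)
table could handle deterministically:

```python
# Minimal Earley recognizer sketch: one algorithm for any context-free
# grammar, ambiguous or not. Grammar and symbol names are made up.

GRAMMAR = {  # ambiguous: "a + a + a" has two parse trees
    "S": [["S", "+", "S"], ["a"]],
}

def recognize(words, start="S"):
    # chart[i] holds Earley items (lhs, rhs, dot, origin)
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(words) + 1):
        changed = True
        while changed:  # closure: predict + complete until fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in GRAMMAR:  # predict
                    for alt in GRAMMAR[rhs[dot]]:
                        new = (rhs[dot], tuple(alt), 0, i)
                        changed |= new not in chart[i]
                        chart[i].add(new)
                elif dot == len(rhs):                       # complete
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            changed |= new not in chart[i]
                            chart[i].add(new)
        if i < len(words):                                  # scan
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == words[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[-1])

print(recognize(["a", "+", "a", "+", "a"]))  # True
print(recognize(["a", "+"]))                 # False
```

The simplicity is the appeal; the per-item bookkeeping is presumably also why
carefully tuned implementations are rare.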

> >> Bison still exists today, it's because there's already tons of software
> >> that rely on its current behavior.  I will *not* break that.  *Never*.
> >> That's our "raison d'ĂȘtre" (some English native should check my use of
> >> French expressions, I'm not sure I'm using it properly :-).
> >> 
> >> I can add *more* options, provide better alternatives, but Bison cannot
> >> afford to drop features just like that.
> > 
> > Please do, I would appreciate if you share ideas!
> 
> You mean, ideas of new features?  I have cited a bunch of them already,
> and others are in the TODO file, but currently those that are close to my
> heart include:
> 
> - support for multiple start symbols

Hmm... I don't see a big advantage of that in practice. It just saves you one 
non-terminal.

> - integration of the scanner

Definitely, and that context-aware! ;-)

> - scoped precedence/associativity directives

Like in which way did you imagine that?

> - grammar modules

Probably nice to have, but I think not a big game changer in practice either.

> - rewritten glr.cc

Ok, I don't know the implementation details there that made you think that.

> - elimination of chain rules

Elimination by transformation to what?
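In case "chain rules" here means unit productions (A -> B), the classic
textbook elimination replaces each such rule by the alternatives of B. A
rough Python sketch, on a toy grammar of my own (not Bison's actual planned
optimization, which might instead target unit reductions in the automaton):

```python
# Sketch of classic chain-rule (unit-production) elimination:
# every rule A -> B with B a nonterminal is replaced by A -> alt
# for each alternative alt of B, until no unit productions remain.

def eliminate_chain_rules(grammar):
    changed = True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            for alt in list(alts):
                if len(alt) == 1 and alt[0] in grammar:  # unit production
                    alts.remove(alt)
                    for repl in grammar[alt[0]]:
                        if repl not in alts and repl != [lhs]:
                            alts.append(repl)
                    changed = True
    return grammar

g = {"exp": [["term"]], "term": [["NUM"], ["(", "exp", ")"]]}
print(eliminate_chain_rules(g))
# {'exp': [['NUM'], ['(', 'exp', ')']], 'term': [['NUM'], ['(', 'exp', ')']]}
```

The transformation trades one fewer reduction per chain at parse time against
duplicated alternatives in the tables.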

Best regards,
Christian Schoenebeck




