Re: [PATCH 00/17] RFC: multiple start symbols

bison-patches

From:	Rici Lake
Subject:	Re: [PATCH 00/17] RFC: multiple start symbols
Date:	Sun, 27 Sep 2020 13:46:41 -0500

Hi, Akim.

Sorry for not responding earlier. I'm in the middle of a move, and your
proposal needed more concentration than I was capable of giving.

I think this is a valuable feature, but I'm not entirely convinced by the
calling interface. On the other hand, I know that's the heart of your
proposal. (This is another reason my response is taking so long.)

There are two issues here: one is the use of a compound return object
instead of splitting the return between a simple status value and one or
more indirect return arguments; the other is the decision to use a specific
type rather than the YYSTYPE union.

With respect to the return mechanism, I feel like the use of a status
return and indirect out arguments is much more normal C style than a
compound return object. Many interfaces work in exactly this fashion
(particularly given the criticism of the alternative of returning a
possibly NULL pointer, raising issues of ownership and thread-safety). So I
think that few C programmers would raise their eyebrows at an interface
which fills in a structure pointed to by a pointer argument, whereas many
would be taken aback by the compound return.

Also, the compound return value does not seem to me to simplify typical use
cases. It's reasonably rare that you will want to bundle a status value
other than success with a returned value, because in most cases the
returned value will only be valid if the status indicates success. (This
doesn't apply to yynerrs, of course; but see below.) So the caller will
generally end up deconstructing the return value anyway, in a pattern such
as:

     SomeStructType parseCompoundResult = yyparse_expression(...);
     if (parseCompoundResult.status != YYSUCCESS) {
         /* signal error and don't try to process value */
     }
     /* In C++ this might be a reference */
     SomeValueType result = parseCompoundResult.value;

That doesn't seem to me to offer any value over the more common:

     SomeValueType result;
     if (yyparse_expression(&result) != YYSUCCESS) {
        /* signal error and don't try to process value */
     }

That doesn't address the question of the yynerrs member, which probably
needs more thought. But in general, I don't think that yynerrs is a very
general solution. It seems necessary, but it will often not be sufficient.
In many ways, it's an unhappy proxy for the absence of a way to influence
the status return. In a parser with error recovery, some (possibly all)
errors will render the value result invalid, but there is no interface
which tells yyparse that it should return "PARSED_WITH_ERRORS" instead of
YYSUCCESS. But using yynerrs for this purpose is not ideal either;
integrating that test into the above code samples reveals how annoying it
is. Furthermore, many parsers will want to have a much more articulated
datum for reporting: severe errors; errors; just warnings; no diagnostics
at all. And at least some parsers might prefer to have diagnostics
accumulated in some kind of diagnostic container type, which is produced as
part of the final result. (See libclang, for example.)
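
To make the "more articulated datum" concrete, here is a minimal sketch of a
diagnostic container from which a graded status is derived. Every name in it
(parse_status, diag_list, diag_add, diag_status, and the enumerators) is
hypothetical and invented for illustration; Bison generates none of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical graded status; Bison defines no such codes. */
typedef enum {
    PARSE_OK,
    PARSE_OK_WITH_WARNINGS,
    PARSED_WITH_ERRORS,     /* recovered, but the value may be invalid */
    PARSE_FAILED
} parse_status;

typedef enum { DIAG_WARNING, DIAG_ERROR, DIAG_SEVERE, DIAG_NSEVERITIES } diag_severity;

typedef struct {
    diag_severity severity;
    int line;
    const char *message;    /* not owned by the container */
} diagnostic;

enum { DIAG_CAPACITY = 32 };

typedef struct {
    diagnostic items[DIAG_CAPACITY];
    size_t count;                     /* diagnostics actually stored */
    size_t totals[DIAG_NSEVERITIES];  /* all reported, stored or not */
} diag_list;

/* Record one diagnostic; totals stay accurate even when storage is full. */
void diag_add(diag_list *dl, diag_severity sev, int line, const char *msg)
{
    assert(dl != NULL && sev < DIAG_NSEVERITIES);
    dl->totals[sev]++;
    if (dl->count < DIAG_CAPACITY) {
        diagnostic d = { sev, line, msg };
        dl->items[dl->count++] = d;
    }
}

/* Derive the status from the worst severity reported so far. */
parse_status diag_status(const diag_list *dl)
{
    assert(dl != NULL);
    if (dl->totals[DIAG_SEVERE] > 0)  return PARSE_FAILED;
    if (dl->totals[DIAG_ERROR] > 0)   return PARSED_WITH_ERRORS;
    if (dl->totals[DIAG_WARNING] > 0) return PARSE_OK_WITH_WARNINGS;
    return PARSE_OK;
}
```

A caller would zero-initialise a diag_list, let the actions call diag_add, and
test diag_status once at the end instead of comparing an error count to zero.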

I'm going to leave yynerrs there for now, as something which needs to be
thought about, and return to the question of specific type versus YYSTYPE.

Many parser generators do have the option to parse from various roots. One
interesting case is ANTLR, which provides methods for parsing from *every*
non-terminal (with names generated from the non-terminal). Although the
vast majority of these interfaces will never be used, it turns out to be
extremely convenient for debugging grammars (and for didactic purposes,
such as drawing small parse trees). In ANTLR, these interfaces have little
or no cost, since it fundamentally produces a recursive descent parser
anyway, but it might still be reasonable to allow "%start *" for parser
debugging.

Of course, in a C code generator, you most certainly wouldn't want to
generate dozens (or hundreds) of unused interfaces, so this kind of feature
would be better implemented by a general call which took a non-terminal
enumerator as an argument. But that would require that the returned value
type be the same regardless of non-terminal, which effectively reduces to
the YYSTYPE union (or whatever it happens to be).
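
The shape of such a general call can be sketched with a hand-written toy. This
is not Bison output: toy_value stands in for YYSTYPE, and toy_symbol and
toy_parse_from are hypothetical names. It follows the yyparse convention of
returning 0 on success and 1 on invalid input.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for YYSTYPE: one union shared by every start symbol. */
typedef union {
    long number;
    char name[32];
} toy_value;

/* Hypothetical enumeration of the non-terminals usable as roots. */
typedef enum { TOY_SYM_NUMBER, TOY_SYM_NAME } toy_symbol;

/* Parse `input` starting from `start`; on success fill *out and return 0. */
int toy_parse_from(toy_symbol start, const char *input, toy_value *out)
{
    assert(input != NULL && out != NULL);
    switch (start) {
    case TOY_SYM_NUMBER: {
        char *end;
        errno = 0;
        long v = strtol(input, &end, 10);
        if (end == input || *end != '\0' || errno == ERANGE)
            return 1;
        out->number = v;
        return 0;
    }
    case TOY_SYM_NAME: {
        size_t n = strlen(input);
        if (n == 0 || n >= sizeof out->name
            || !isalpha((unsigned char)input[0]))
            return 1;
        for (size_t i = 1; i < n; i++)
            if (!isalnum((unsigned char)input[i]))
                return 1;
        memcpy(out->name, input, n + 1);   /* copy including the NUL */
        return 0;
    }
    }
    return 1;   /* unknown start symbol */
}
```

The caller has to know which union member corresponds to the symbol it asked
for, which is exactly the extraction step a per-symbol return type would hide.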

OK, it's not necessarily a great idea to design a production interface
around a feature only used for debugging. But I still want to focus on the
advantages of using YYSTYPE to represent the returned value.

In particular, it's highly flexible. You don't need to restrict YYSTYPE in
any way. I say this partly because I still prefer tagged types to
`%define api.value.type union`. Using actual C types has a certain appeal but in
practice when I try this style, I inevitably end up using typedefs to
create type aliases, even for primitive types (so that I don't need to edit
every %type declaration when I decide that long should be uint64_t). Maybe
that's just old-fogeyism :-) But there are a number of parsing applications
in which YYSTYPE is not a C union at all (for example, when it's some kind
of discriminated union), and these should also be usable with the multiple
start symbol interface.

Also, once you decide to use a compound return type (whether it's a direct
return or indirect via an out parameter) which includes a specific value
type alternative, you're committed to creating several different structure
types, each used only for a particular call. While there are not likely to
be many such structures, it feels a bit ugly, at least in C. Using YYSTYPE
would require only one compound type (and only one type name), and the
client is going to have to extract the value from the compound anyway. (I
understand that if you `%define api.value.type union` then you don't have a
convenient tagname to extract the value, which means resorting to an ugly
cast. Since your proposal only applies for this particular case, it makes
some sense to do the cast automatically. But as I said above, I think it
would be nicer if multiple start symbols were more general.)

Finally, let me note that (aside from didactic issues, like "how easy is
this to explain to SO questioners?"), this change isn't going to affect me
personally because I almost always use the push interface. With the push
interface, implementing start-symbol sentinels is extremely easy (which is
not to say that a bit of assistance wouldn't be appreciated).  All I need
to do is to define the sentinel, adding a single production (unfortunately,
with a boilerplate action) to the start alternatives. I can then create my
push-parser context object and initialise it by feeding it the sentinel
token, which is basically how one might want the context object to be.

With that architecture, the final parse call -- that is, the one which
doesn't return YYPUSH_MORE -- can't really return a customised return type, so
the use of YYSTYPE is mandatory. (And perhaps that's the root of my
discomfort with the specific type return values.) But, in general, I'd
prefer to avoid adding this YYSTYPE object to the call to yyparse. I
usually do that by recycling the YYSTYPE object which holds the incoming
semantic value; this object is not used for sending a YYEOF token so it's
freely available to accept a return value. That's less than ideal, though.

Most of the time, what I'd really like is a way of extending the push-parse
context object to hold some extra members, one of which would be the return
value from the final parse call. (Another might be the container of errors
and warnings.) It is possible to add an extra argument to yypush_parse with
my own context object in addition to the bison context object, but since
the push interface is called for every input token, extra arguments seem
like additional overhead.

Now, let me just plant that issue there for push parsers, and go back to
reentrant pull parsers. (Reentrant, because one would really want to start
promoting reentrant parsers a bit more at this point, since reliance on
globals is a technique from a different era.) Bison's reentrant parsers
don't require a context object, which is cool in its own way but also a
limitation. (For one thing, it makes aspects of the parser state
inaccessible outside of action code.) Suppose, instead, that the multiple
start symbol feature required a reentrant parser with a context object.
With that change, yynerrs and the symbol's return value (i.e. the top of
the stack when the root symbol is reduced) could just be kept in the
context object, in a way which would be consistent with push parsers and
reasonably easy for the client code. I'm not making a formal proposal,
here: it's just an alternative to consider.
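
As a rough sketch of that alternative, the toy below keeps both the error
count and the result in a caller-owned context object. It is hand-written, not
Bison output; toy_context and toy_parse_sum are hypothetical names, and the
"grammar" is just a comma-separated list of integers with a malformed item
treated as a recoverable error.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  nerrs;   /* plays the role of yynerrs */
    long value;   /* semantic value of the start symbol */
} toy_context;

/* Parse "n,n,n" and sum the items. A malformed item is counted in
   ctx->nerrs and skipped (error recovery), so the call can return 0
   while ctx->value is incomplete. Returns 1 only for empty input. */
int toy_parse_sum(toy_context *ctx, const char *input)
{
    assert(ctx != NULL && input != NULL);
    ctx->nerrs = 0;
    ctx->value = 0;
    if (*input == '\0')
        return 1;

    const char *p = input;
    for (;;) {
        char *end;
        long v = strtol(p, &end, 10);
        size_t rest = strcspn(end, ",");   /* junk before the next comma */
        if (end == p || rest != 0)
            ctx->nerrs++;                  /* recover at the next comma */
        else
            ctx->value += v;
        p = end + rest;
        if (*p == '\0')
            break;
        p++;                               /* skip the comma */
    }
    return 0;
}
```

After the call the client reads ctx.nerrs and ctx.value directly, so the
function itself only needs to return a plain status.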

Again, sorry for the delay in responding. I'll try to be attentive to my
email, but as a result of the move my internet access will be very
intermittent until Thursday or so.

Rici

On Sun, 27 Sept 2020 at 3:20, Akim Demaille (<akim.demaille@gmail.com>)
wrote:

> Hi Paul, Hi Adrian,
>
> Thanks a lot for your feedback.  I was hoping to get more opinions though.
>
> I am going to install these changes, and work on them from master.  There
> remains quite a lot of work, to address the other skeletons, and other
> issues such as push parsing.  And the documentation, of course.  And more
> tests.
>
> The version I will push has a few minor changes.  In particular,
> yyparse_yyimpl and yyparse_yyimpl_t are now yy_parse_impl and
> yy_parse_impl_t to emphasize more clearly that they are implementation
> details.
>
> I think 3.8 is now defined: official support for D, clean C++
> reimplementation of glr.cc, and multiple start symbols.  That's way enough
> work for a release.
>
> Cheers!
>
