bison-patches

Re: RFC: custom error messages


From: Christian Schoenebeck
Subject: Re: RFC: custom error messages
Date: Thu, 09 Jan 2020 14:50:21 +0100

On Sunday, 5 January 2020 17:52:43 CET, Akim Demaille wrote:
> Hi Christian,
> 
> Sorry I missed your message.  For some reason the title of the thread
> was broken in the other answers.

Np, I currently have fairly high latencies as well. :)

I see you already collected a bunch of feedback in the meantime; to keep the
noise low, I'll just comment on it all in a single message here.

> > Why not making that a general-purpose function instead that users could
> > call at any time with the current parser state:
> > 
> > // returns NULL terminated list
> > const enum yysymbolid* yynextsymbols(const yystate* currentParserState);
> 
> I don't want to have to deal with allocating space.  Your proposal
> needs to allocate space.  Hence the clumsy interface I provided :)

Well, allocation is just a minor API detail that could easily be addressed. 

But if your suggested function is limited to error cases only (i.e. not
working at arbitrary parser states), that would IMO indeed be a show stopper
for features like auto completion.

> > For that purpose, and to continue the idea about a general purpose push
> > API, it would be very useful to have a function for duplicating the
> > current parser state:
> > 
> > yystate* yydupstate(const yystate* parserState);
> 
> Wow, you're talking about massive surgery in yacc.c.  Roughly,
> stop using local variables for the stacks.  Which is what the
> push-interface does (I'm talking about api.push here).
> 
> Or are you referring to push-parsers when you say "push API"?
> 
> > and one function to push parse on a specific parser state:
> > 
> > bool yypushparse(yystate* parserState, char nextchar);
> > 
> > The latter returning false on parser errors. That way people would have a
> > very flexible and powerful API for all kinds of use cases. Because by
> > being able to duplicate states, you can have "throw away" parser states,
> > where you can try out things without touching the "official" parser
> > state. For instance I am using that to auto correct user typos in some
> > parsers (that is guessing what user had in mind on syntax errors by some
> > limited brute force attempts by parser on throw-away parser states).
> 
> That might be doable with api.push.  I don't see that coming for
> the pull interface.

Right, I forgot that the push parser option exists. I remember looking at it,
but I didn't use it because of features that would still have been missing
(getting the "next" symbols at any parser state, getting both a human-readable
symbol description and the raw token string at any parser state, and
duplicating a parser state) for the use cases already mentioned: auto
completion, auto correction and custom error handling. So eventually I decided
to write and inject a massive amount of code that resembles an LALR(1) parser
on top of the raw generated Bison tables, which closed these gaps for me.

But frankly, another argument at that point was that this was also more
convenient for other developers, since they could simply use any Bison
version. For instance, Xcode still ships with Bison 2.3 from 2006, and as I've
heard from Apple, that's not going to change due to their GPLv3 objections.
That was just a convenience aspect though, of course, not a show stopper.

> > I would suggest both. It would make sense to auto generate an enum list
> > for
> > all symbols like:
> > 
> > enum yysymbolid {
> > 
> >    IDENTIFIER,
> >    SWITCH,
> >    IF,
> >    CONST,
> >    ...
> > 
> > };
> > and use that numeric type probably for most Bison APIs for performance
> > reasons. That type could also be condensed to a smaller type if requested
> > (i.e. for embedded systems):
> > 
> > enum yysymbolid : uint8_t {
> > 
> >    IDENTIFIER,
> >    SWITCH,
> >    IF,
> >    CONST,
> >    ...
> > 
> > };
> > 
> > But there should still be a way for people being able to convert that
> > conveniently to its original string representation from source.y:
> > 
> > const char* yysymbolname(enum yysymbolid);
> 
> Yes, of course.  That's not "both", that's just what I refer
> to by "exposing the numbers".  "yysymbolname(x)" is currently
> just "yytname[x]".

Sure, it is just not clear to me what your actual future plans for yytname[x]
are; I see that you are constantly struggling with numerous issues because
people use what were supposed to be skeleton-internal-only data structures,
due to the lack of official public APIs. That was my reason for suggesting to
consider adding official APIs for common specific use cases. E.g.:

/**
 * Name of the symbol (i.e. token name or LHS grammar rule name)
 * from input.y
 */
const char* yysymbolname(enum yysymbolid);

/** Human readable error description for symbol. */
const char* yysymbolerrdesc(enum yysymbolid);

By introducing official, well-defined, long-term guaranteed public APIs you
could make a clean cut between internal-only skeleton data and the public API,
and e.g. change the semantics of yytname[] (and others) to anything you want
in the future, or even rename or drop them.

On Monday, 6 January 2020 19:23:27 CET, Rici Lake wrote:
> So I think that there is still time to consider the wider question of how a
> bison grammar might be able to show both literal keywords and
> human-readable token descriptions in a way that is useful for both
> applications. As a side-benefit, this might also make grammars easier to
> read by humans, because the current mechanism does not make it clear to a
> human reader in all cases whether a quoted string is intended to be
> inserted literally into the parsed text, or whether it is necessary to hunt
> through the grammar's token declarations to find the named token for which
> the quoted string is an alias. (I've been fooled by this several times
> while trying to read grammars, particularly when only an extract of the
> grammar is presented.)

++vote;

I have already read several people saying that it was not possible to address
both use cases. What am I missing here?

- It would require auto-generating a second table.

- On the grammar input side, it would make sense to handle this issue with
  more specific declarations which reflect their intended semantics more
  appropriately, e.g.

        %token LE raw="<=" human-err="operator '<='"

If localization (i.e. translation) is desired:

        %token LE raw="<=" human-err=_("operator '<='")

where e.g. gettext / Qt / Xcode, etc. would handle the actual human-readable 
translation.

I know, the long-term solution would be scanner features on the Bison side.
But in the meantime these declarations would be useful (and more readable),
and it wouldn't hurt to leave them in, even in a future Bison-included-scanner
era.

On Sunday, 5 January 2020 17:30:18 CET, Akim Demaille wrote:
> > One could tackle this particular use case also from a different angle:
> > We could introduce the concept of "opaque" rules, i.e. rules which are not
> > expanded when reporting syntax errors.
> > 
> > E.g., if I could define "unreserved_keyword" as
> > 
> > > unreserved_keyword [opaque]: ABORT_P | ABSOLUTE_P | <...>
> > 
> > bison should then create the error message
> > 
> > > expected: Identifier, unreserved_keyword
> > 
> > instead of
> > 
> > > expected: Identifier, <long list containing all unreserved keywords>
> 
> Too complex, and fitting just one special case.  With EDSLs, the
> sheer concept of "keyword" is more complex than this.

Actually, I was thinking about the exact same feature that Adrian suggested
as the "opaque" attribute here, though for a different use case: parsers
without an external scanner:

CREATE  :  'c''r''e''a''t''e'
        ;

In that case you would want, e.g., a syntax error message like

        "Expecting 'create', got 'foo'"

instead of

        "Expecting 'c', got 'f'"

I inject code to handle that ATM. I could imagine this being controlled by
doxygen-style comments, something like:


/**
 * @symbol-visibility opaque
 */
CREATE  :  'c''r''e''a''t''e'
        ;

That would prevent backward compatibility issues and would handle this
"detail" feature in a graceful, non-invasive way.

BTW, I could imagine that as an alternative to my %token change suggestions
above, e.g.:

/**
 * @raw "<="
 * @human-err "operator '<='"
 */
%token LE

Best regards,
Christian Schoenebeck




