Re: Tokenizing

emacs-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Tokenizing

From:	Daniel Colascione
Subject:	Re: Tokenizing
Date:	Mon, 22 Sep 2014 07:14:30 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 09/22/2014 07:02 AM, Stephen Leake wrote:
> Daniel Colascione <address@hidden> writes:
> 
>> Wagner's thesis contains a description of a few alternative incremental
>> GLR algorithms that look very promising.
> 
> Thanks for the pointer; I really need better error handling in the Ada
> mode parser. It currently defaults to a _really_ dumb indentation
> algorithm (same as previous line) when the parse fails.

Yep. falling back to counting parens seems reasonable enough.

>> We can start parsing at some region we quickly determine is semantically
>> equivalent to the start of the buffer using something like
>> font-lock-beginning-of-syntax-function.  
> 
> For Ada, that's only start of buffer; the standard convention is only
> one compilation-unit (the single grammar start symbol) per file.
> 
>> Here, we can reliably partially parse the buffer by parsing
>> forward (remember, GLR parsing is an online algorithm) until the
>> parsing limit is after the visible portion of the buffer; we can then
>> use a "shortest-to-completion" production (which we can pre-compute
>> from the grammar) to insert virtual tokens that complete the language.
> 
> I had the same idea; I was hoping it was new and I could make money :).
> Is that in Wagner's thesis (I didn't see it on a quick scan), or written
> up somewhere else?

Not AFAIK.

>> One really nice
>> property of GLR parsing (and LR parsers generally) is that we can
>> trivially pause and restart them, so we can parse in the background with
>> very little impact on user activity, using `while-no-input' to back out
>> to the event loop when we detect we have something more important to
>> do.
> 
> That's what I need! Although a really fast GLR parser implemented in
> compiled Ada that accesses the buffer via an Emacs FFI would be better.

Another nice thing about lexing with an automaton and parsing with a
generic algorithm is that the parsing machine can be pushed (if needed)
to C core with no loss of customizability or generality. I'd prefer that
individual modes not rely on the FFI for basic fontification and
indentation services.

>> Error recovery is a challenge --- we want to do something useful on
>> buffers that are temporarily invalided by user edits.  One of the
>> biggest challenges is dealing with mismatched brace constructs, since
>> one missing "}" in C-mode can cause the entire rest of the buffer to no
>> longer conform to the grammar.
> 
> I'm hoping to use the same "shortest-to-completion" algorithm for this.
> It may distort the rest of the buffer, but it will be better than what
> it does now.

Maybe it's better than nothing, but bridge parsing is a bit more
promising, I think. :-)

>> Bridge parsing is particularly interesting here: the basic idea is that
>> we perform one pass over the buffer using a simplified grammar that
>> looks only at nesting constructs, insert virtual braces to make
>> everything balance, and then apply the usual parsing algorithm (with
>> ad-hoc error recovery) to the result. I haven't tried it, but it
>> apparently works pretty well.
> 
> That could work for Ada as well.
> 
>> I like what SMIE does, but I don't think it's powerful enough to parse
>> C++. 
> 
> I wrote a SMIE parser for Ada, but it ended up always parsing the whole
> buffer. So I switched to LALR. Then to generalized so I would not have
> to mangle the grammar in the Ada standard as much.
> 
>> GLR parsers also leave us with a parse forest instead of a parse tree,
>> and we really need a parse *tree* if we're going to apply fontification
>> rules specified in terms of AST nodes.  We need to prune the parse
>> forest down to a parse tree before using it for fontification.
> 
> In Ada mode, I just assume the results of parallel parses that end in
> the same state are equivalent for the purposes of Ada mode
> (fontification and navigation). So I just throw away all but one result.
> I haven't really investigated whether this is true.

You can still end up in the same state having shifted different
subtrees, so if you do it that way, you can flop between different
parsing alternatives as you edit the buffer for no reason at all.

>> Consider the most vexing parse:
>>
>>   type variable(function())
> 
> One reason I like Ada; the equivalent declarations are not ambiguous.

Writing unambiguous languages seems to have fallen out of fashion.

signature.asc
Description: OpenPGP digital signature

[Prev in Thread]

Current Thread

[Next in Thread]

Re: Tokenizing, (continued)

Prev by Date: Re: Tokenizing
Next by Date: Re: [Emacs-diffs] trunk r117912: Minor improvements to new stack-allocated Lisp objects.
Previous by thread: Re: Tokenizing
Next by thread: Re: Tokenizing
Index(es):
- Date
- Thread