Re: Tokenizing

emacs-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Tokenizing

From:	Daniel Colascione
Subject:	Re: Tokenizing
Date:	Sun, 21 Sep 2014 15:01:52 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
On 09/21/2014 11:55 AM, Vladimir Kazanov wrote:
>> I don't normally edit 7000 line files, so the Ada mode parsing delay is
>> not noticeable to me, so I prefer the current Ada mode approach of not
>> using the idle timer to trigger a parse. But it could be a user option.
>>
> 
> I will look into that. Although the main idea is to *keep the token
> list consistent* most of the time. There will definitely be
> customization possibilities.

I've been working (very, very, very slowly) on similar functionality.
The basic idea is based on the incremental lexing algorithm that Tim A.
Wagner sets out in chapter 5 of his thesis [1].  The key is dynamically
tracking lookahead used while we generate each token.  Wagner's
algorithm allows us to incorporate arbitrary lookahead into the
invalidation state, so supporting something like flex's unlimited
trailing context is no problem.

The nice thing about this algorithm is that like the parser, it's an
online algorithm and arbitrarily restartable.

We can't use the built-in Emacs regular expression engine for this
facility because there's no way to tell regex.c to record how much
lookahead it used, and even if there were, I'd rather not rely on its
backtracking matcher. I've written a DFA-based regex matcher that should
help in implementing this algorithm; of course, a DFA matcher has a
lookahead of zero.

Mine supports zero-width predicates though, so we can actually use
achieve nonzero lookahead (and lookbehind!) if needed.

Where my thing departs from flex is that I want to use a regular
expression (in the rx sense) to describe the higher-level parsing
automaton instead of making mode authors fiddle with start states.  This
way, it's easy to incorporate support for things like JavaScript's
regular expression syntax, in which "/" can mean one of two tokens
depending on the previous token.

(Another way of dealing with lexical ambiguity is to let the lexer
return an arbitrary number of tokens for a given position and let the
GLR parser sort it out, but I'm not as happy with that solution.)

>> Ada mode uses text properties to store parse results; the tokenizer
>> results are part of that, but are not stored separately. I don't see
>> much point in separating the tokenizer from the parser; the tokenizer
>> results are not useful by themselves (at least, not in Ada mode).
>>
> 
> First, this not quite right. Tokenization results can be used, for
> example, for granular syntax highlighting. Font Lock basically just
> uses regexps to catch something that looks like
> comments/keywords/whatever. Tokenizer already *knows* for sure what it
> found. And you don't have to build a full parser to use the results.

There are two stages here: you want in *some* cases for fontification to
use the results of tokenization directly; in other cases, you want to
apply fontification rules to the result of parsing that token stream.
Splitting the fontification rules between terminals and non-terminals
this way helps us maintain rudimentary fontification even for invalid
buffer contents --- that is, if the user types gibberish in a C-mode
buffer, we want constructs that look like keywords and strings in that
gibberish stream to be highlighted.

> Second, it not a tokenizer I want to build, there is a
> misunderstanding of sorts. It is a helper mode (similar to Font Lock,
> in a way) for keeping token lists up to date all the time, easy and
> fast. User code - the tokenizer itself - will just have to provide an
> interface to the mode (be restartable and supply required restart
> information in resulting tokens). The mode will use the information to
> avoid extra tokenizing.

Indeed. I don't imagine replacing font-lock per se, but instead just
implementing a font-lock matcher that *lazily* brings the token list
up-to-date and pulls keywords out of it for fontification. It's
important to keep font-lock around for backward compatibility, but
there's no reason that font-lock's data source has to be a bunch of
simple regex matches.

cc-mode, for example, uses font-lock to apply highlights that come from
a parser that looks oddly like a CYK parser.  (We start parsing at all
points in the buffer that *might* start a declaration; there's also some
limited, ad-hoc backtracking.)

>> I have not noticed any problems with the text properties interface; in
>> particular, storing and retrieving text properties is fast compared to
>> parsing. Ada mode stores about two parse result text properties per
>> source line on average.
> 
> I did not know about your mode - and parsers are sort of my hobby :-)

Mine too. :-)

> I will definitely check it out, especially because it uses GLR(it
> really does?!), which can non-trivial to implement.

Wagner's thesis contains a description of a few alternative incremental
GLR algorithms that look very promising.

I have a few extensions in mind too.  It's important to be able to
quickly fontify a particular region of the buffer --- e.g., while scrolling.

If we've already built a parse tree and damage part of the buffer, we
can repair the tree and re-fontify fairly quickly. But what if we
haven't parsed the whole buffer yet?

We can start parsing at some region we quickly determine is semantically
equivalent to the start of the buffer using something like
font-lock-beginning-of-syntax-function.  But what about completing the
parse? Well, it's possible to determine whether any particular buffer
state is being parsed "dynamically" (the GSS is split) or "statically"
(the GSS has one head). In practice, ambiguous regions are short, so we
only have to worry about the latter case.  Here, we can reliably
partially parse the buffer by parsing forward (remember, GLR parsing is
an online algorithm) until the parsing limit is after the visible
portion of the buffer; we can then use a "shortest-to-completion"
production (which we can pre-compute from the grammar) to insert virtual
tokens that complete the language.  If we were parsing C++, we'd be able
to automatically insert virtual "}" tokens and complete the parse tree,
giving us a partial parse tree we can then use for fontification.

Of course, it's better to have a parse tree that covers the whole
buffer, but we can compute that in the background.  One really nice
property of GLR parsing (and LR parsers generally) is that we can
trivially pause and restart them, so we can parse in the background with
very little impact on user activity, using `while-no-input' to back out
to the event loop when we detect we have something more important to do.

Error recovery is a challenge --- we want to do something useful on
buffers that are temporarily invalided by user edits.  One of the
biggest challenges is dealing with mismatched brace constructs, since
one missing "}" in C-mode can cause the entire rest of the buffer to no
longer conform to the grammar.

Bridge parsing is particularly interesting here: the basic idea is that
we perform one pass over the buffer using a simplified grammar that
looks only at nesting constructs, insert virtual braces to make
everything balance, and then apply the usual parsing algorithm (with
ad-hoc error recovery) to the result. I haven't tried it, but it
apparently works pretty well.

I like what SMIE does, but I don't think it's powerful enough to parse
C++. GLR parsers give us all the power we need.

GLR parsers also leave us with a parse forest instead of a parse tree,
and we really need a parse *tree* if we're going to apply fontification
rules specified in terms of AST nodes.  We need to prune the parse
forest down to a parse tree before using it for fontification.

I actually have a pretty neat idea for doing that: we can attach
assertions to parts of the parse forest and prune the tree by deleting
alternatives that contradict what we know.  cc-mode already does a very
limited form of this analysis by recording types it sees during parsing.

Consider the most vexing parse:

  type variable(function())

We don't know whether that's a function declaration or a variable
declaration. In GLR-land, we generate a parse forest with branches for
both options. If later in the buffer we see something that looks like this:

  variable->foo();

That we know that "variable" cannot be a function, so we can prune the
part of the parse forest that assumes that "variable" is a function,
leaving us with an unambiguous parse tree.

This way, we should be able to come up with reasonable parses for C++
buffers without having an IDE-style global view of types declarations,
instead assuming that programmers, most of the time, don't contradict
themselves.

[1] harmonia.cs.berkeley.edu/papers/twagner-thesis.pdf
[2] fileadmin.cs.lth.se/cs/Personal/Emma_Soderberg/docs/SLE08pres.pdf
signature.asc
Description: OpenPGP digital signature
[Prev in Thread]
Current Thread
[Next in Thread]
Re: Tokenizing, (continued)
Prev by Date: Re: Overlay mechanic improvements
Next by Date: Apology for spamming the list
Previous by thread: Re: Tokenizing
Next by thread: Re: Tokenizing
Index(es):
- Date
- Thread