
Re: Tokenizing

From: Daniel Colascione
Subject: Re: Tokenizing
Date: Mon, 22 Sep 2014 06:55:00 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

On 09/22/2014 03:21 AM, Vladimir Kazanov wrote:
> On Mon, Sep 22, 2014 at 1:01 AM, Daniel Colascione <address@hidden> wrote:
>> I've been working (very, very, very slowly) on similar functionality.
>> The basic idea is based on the incremental lexing algorithm that Tim A.
>> Wagner sets out in chapter 5 of his thesis [1].  The key is dynamically
>> tracking lookahead used while we generate each token.  Wagner's
>> algorithm allows us to incorporate arbitrary lookahead into the
>> invalidation state, so supporting something like flex's unlimited
>> trailing context is no problem.
>> The nice thing about this algorithm is that like the parser, it's an
>> online algorithm and arbitrarily restartable.
> I have already mentioned Wagner's paper in previous messages.
> Actually, it is the main source of inspiration :-) But I think it is a
> bit over-complicated, and the only implementation I have seen
> (NetBeans' Lexer API) does not even try to implement it completely.
> Which is okay; academic papers tend to idealize things.

That Lexer is a dumb state matcher, last time I checked. So is
Eclipse's. Neither is adequate, at least not if you want to support
lexing *arbitrary* languages (e.g., Python and JavaScript) with
guaranteed correctness in the face of arbitrary buffer modification.

> You do realize that this is a problem for the client code? We can only
> recommend using this or that regex engine, or even set the lookahead
> value for various token types by hand; the latter would probably work
> for most real-life cases.
> I am not even sure that it is possible to do it Wagner's way (with a
> real next_char() function) in Emacs. I would check the Lexer API
> solution as a starting point.

Of course it's possible to implement in Emacs. Buffers are strictly more
powerful than character streams.
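To make the lookahead-tracking idea concrete, here is a minimal sketch in Python (not Emacs code; the toy token rules and names are made up for illustration). Each token records how many characters past its own end the lexer examined while matching it, which is exactly the invalidation state Wagner's algorithm needs:

```python
import re

class Token:
    """A token plus the lookahead it consumed while being matched."""
    def __init__(self, start, text, kind, lookahead):
        self.start = start
        self.text = text
        self.kind = kind
        self.lookahead = lookahead  # chars inspected beyond the token's end

    def invalidated_by(self, edit_start, edit_end):
        # The token depends on buffer text in [start, start+len+lookahead),
        # so any edit overlapping that range forces relexing here.
        return (edit_start < self.start + len(self.text) + self.lookahead
                and edit_end > self.start)

def lex(buf):
    """Toy lexer: identifiers, '==' vs '=', whitespace skipped."""
    tokens, i = [], 0
    while i < len(buf):
        if buf[i].isspace():
            i += 1
            continue
        m = re.match(r"[A-Za-z_]\w*", buf[i:])
        if m:
            # Matching a maximal identifier peeks one char past its end.
            tokens.append(Token(i, m.group(), "ident", 1))
            i += len(m.group())
        elif buf[i] == "=":
            if buf[i:i + 2] == "==":
                tokens.append(Token(i, "==", "eq", 0))
                i += 2
            else:
                # Deciding '=' vs '==' required one char of lookahead.
                tokens.append(Token(i, "=", "assign", 1))
                i += 1
        else:
            tokens.append(Token(i, buf[i], "other", 0))
            i += 1
    return tokens
```

On a buffer edit, only tokens whose recorded extent overlaps the edited range need relexing; the lexer restarts at the first invalidated token and stops as soon as its output resynchronizes with the old token stream. Because the lookahead is measured, not assumed, this stays correct even for rules with long trailing context.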

>> Where my thing departs from flex is that I want to use a regular
>> expression (in the rx sense) to describe the higher-level parsing
>> automaton instead of making mode authors fiddle with start states.  This
>> way, it's easy to incorporate support for things like JavaScript's
>> regular expression syntax, in which "/" can mean one of two tokens
>> depending on the previous token.
>> (Another way of dealing with lexical ambiguity is to let the lexer
>> return an arbitrary number of tokens for a given position and let the
>> GLR parser sort it out, but I'm not as happy with that solution.)
> I do not want to solve any concrete lexing problems. The whole point
> is about supplying a way to do it incrementally. I do not want to know
> anything about the code above or below, be it GLR/LR/flex/etc.
>> There are two stages here: you want in *some* cases for fontification to
>> use the results of tokenization directly; in other cases, you want to
>> apply fontification rules to the result of parsing that token stream.
>> Splitting the fontification rules between terminals and non-terminals
>> this way helps us maintain rudimentary fontification even for invalid
>> buffer contents --- that is, if the user types gibberish in a C-mode
>> buffer, we want constructs that look like keywords and strings in that
>> gibberish stream to be highlighted.
> Yes, and it is the client code that has to decide those things,
> whether it uses only the token list to do fontification or lets a
> higher-level parser do it.

Unless the parser itself is incremental, you're going to have
interactivity problems.

>>> I will definitely check it out, especially because it uses GLR (it
>>> really does?!), which can be non-trivial to implement.
>> Wagner's thesis contains a description of a few alternative incremental
>> GLR algorithms that look very promising.
> Yes, and a lot more :-) I want to concentrate on a smaller problem -
> don't feel like implementing the whole thesis right now.
>> I have a few extensions in mind too.  It's important to be able to
>> quickly fontify a particular region of the buffer --- e.g., while scrolling.
>> If we've already built a parse tree and damage part of the buffer, we
>> can repair the tree and re-fontify fairly quickly. But what if we
>> haven't parsed the whole buffer yet?
> Nice. And I will definitely need to discuss all the optimization
> possibilities later. First, the core logic has to be implemented.
> Bottom line: I want to take this particular narrow problem, a few user
> code examples (for me it is a port of CPython's LL(1) parser) and see
> if I can solve it in an optimal way. A working prototype will take
> some time, a month or more - I am not in a hurry.
> As I understand it, you want to cooperate on it, right..?

*sigh* It sounds like you want to create something simple. You'll run
into the same problems I did, or you'll produce something less than
fully general. I don't have enough time to work on something that isn't
fully general. I'm sick of writing language-specific text parsing code.

Have fun.

