Lynn Winebarger <owinebar@gmail.com> writes:
> On Thu, Dec 29, 2022 at 10:28 AM Gregory Heytings <gregory@heytings.org> wrote:
>> > Do you know how strong the dependency on node is? As I said before, it
>> > seems that it is possible to evaluate the grammar files that use the DSL
>> > using something like quickjs as well
>> >
>>
>> That's not possible, no, at least not without a lot of complications that
>> do not seem worth the price, compared to installing Node.js. And note
>> that even if that were feasible, it would only solve the first half of the
>> problem: to transform a grammar.js file into its corresponding parser.c
>> file, you also need the tree-sitter command line program.
>
> Maybe a better question is - is it possible to adapt the semantic
> parser generators (or others in emacs) to create the ".c" files for
> use with libtreesitter?
This is possible in principle; I've thought about doing it with the
wisitoken parser generator.
However, the format/struture/details of the output is not documented,
and may change in future tree-sitter releases.
True, *but* each parser has as an "ABI version" constant encoded into it, and the source states the library is supposed to be backwards compatible (currently 14, the the grammar.c files I saw are at 13). So if we get something that complies with version 13, it should be fine.
I've looked over a couple of the parser.c files, and they appear to be pretty standard implementations of.LR stack automata. The CLI tool supports ambiguous grammars, but if you start from one an existing LR-type of parser generator (bison or wisent, say) can handle, then that piece should be relatively straightforward.
The novel feature appears to be the automatic derivation of a "flattened" AST data structure, which the library uses to construct ASTs for a given action.
I did come up with my own calculation of a flattened AST structure a year or two ago, though I wasn't thrilled with some of the results I got - automatically picking a symbol to use to "break" recursive inclusion by reverting back to a pointer was tricky. I believe I went with a heuristic of converting the symbol that had the most incoming edges to a reference in each place it appeared as a field, then repeated until there were no data structures that were required to physically contain themselves. I used that rule because it seemed like it would minimize the number of symbols not allowed to appear as fields, thus maximizing the flattening effect (too well, in fact - I had to introduce some constraints to keep some of the nonlinear structure).
So it might be interesting to go through tree-sitters' algorithm to see what they came up with.
Lynn