guile-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Re: Re parse-result


From: Zelphir Kaltstahl
Subject: Re: Re: Re parse-result
Date: Fri, 18 Jan 2019 23:27:27 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1

On 1/18/19 6:00 PM, address@hidden wrote:
> For example, what is the difference
>> between PEG parsing and parser combinators?
>>
> Both of them do top-down parsing (try to match the top-level grammar rule,
> which is
> done by trying to match the lower-level rules which make it up, and so on
> until we
> get down to single tokens.  They differ in the amount of lookahead that's
> provided.
>
> Parser combinators are a facade over recursive-descent parsing, where you
> write
> a procedure corresponding to each rule in the parser, and you can only look
> ahead
> one token to guide parsing.  This is known as LL(1) parsing. It's very easy
> to write
> these procedures, so you don't really need an engine like yacc to do the
> job.
> One character of lookahead isn't usually enough, so there is a lower-level
> lexer
> that uses (typically) regular expressions to aggregate things like numbers
> and
> identifiers into single parsing tokens.
>
> PEG parsing allows *infinite* lookahead, all the way to the end of the
> input.
> However, it memoizes the results of parsing to avoid reparsing repeatedly.
> To make this possible, alternatives are ordered: instead of rules like A |
> B,
> where the first token has to be enough to tell you if you have an A or a B,
> you have rules like A / B, which means "assume it's an A, and only if it
> fails, assume it's a B".  If A and B overlap, non-PEG parsing has to treat
> A | B as a grammar ambiguity and recover, but A / B is never ambiguous.

Hi swedebugia,

Thanks for that explanation.

I understand lookahead, but wait, somehow this description seems to go
against what I have seen with parser combinators so far:

I could have "or-connected" procedures where each of them requires more
than one character or else the parser fails for the string. For example
one expects "aa0" the other "aa1", but if neither is present, then the
parsing fails. This seems like a lookahead of more than one character is
possible with parser combinators. In code using Dave Thompson's
parser-combinators library:

~~~
(load "parser-combinators.scm")

(use-modules (parser-combinators)
             (srfi srfi-41))

(define string->stream (compose list->stream string->list))

((parse-any (parse-string "aa0") (parse-string "aa1")) (string->stream "aa2"))  
; fails

((parse-any (parse-string "aa0") (parse-string "aa1")) (string->stream "aa1"))  
; succeeds

((parse-any (parse-each (parse-string "aa") (parse-string "0"))
            (parse-each (parse-string "aa") (parse-string "1")))
 (string->stream "aa1"))  ; succeeds
~~~

Instead of parsing a string, I could also have a complex parser there,
which parses all kinds of numbers and the second thing could be
something else, which itself contains an "or-connected" thing inside.

It seems that the abstract examples you wrote are already possible with
the parser combinators then. I am probably misunderstanding it
somewhere. In the examples A and B overlap, so maybe it already has to
"recover" (make a decision which branch to go next)? But then again the
part after "aa" is not ambiguous, so it does not seem like a problem in
the example.

Maybe if I had an example like the following:

The string "aaa" could become the tokens "aa" and "a" or could become
"a" and "aa", basically A is "aa" and B is "a" and so the parsing
process has to decide between the two. Would that be such a case, where
in PEG parsing there is no ambiguity, but in parser-combinators approach
there is? I think the recovery here is, that the first mentioned parser
in the code is the first one that is tried and if it fails the next one
is tried. First one who succeeds "wins". This seems to be a backtracking
approach. The wiki
(https://en.wikipedia.org/wiki/Recursive_descent_parser) article says:

"Recursive descent with backtracking is a technique that determines
which production to use by trying each production in turn. Recursive
descent with backtracking is not limited to LL(k) grammars, but is not
guaranteed to terminate unless the grammar is LL(k). Even when they
terminate, parsers that use recursive descent with backtracking may
require exponential time."

So I guess "Yes, I can handle these cases using parser-combinators,
however, in the general case it may take exponential time." -- Which is
usually bad.

I wonder how common ambiguity in languages we use are. For example if I
wanted to parse markdown, would there be difficult to handle
ambiguities? I think there might be some about indentation, for example
a block quote inside a list inside a list or things like that, but I
guess those could be easily handled by putting in an assumption, that
the indentation is always foremost for the list and only after that for
anythind indented inside the list, like a blockquote. One could argue,
that the specification has a different than or no assumption like that
(I did not actually go and read it now) and that the language being
parsed is slightly different. Maybe if one wanted 100% compliance PEG
parsing might be necessary? Would that be such a case?

Regards,

Zelphir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]