guile-user

Re: Permissive html parser for guile


From: Panicz Maciej Godek
Subject: Re: Permissive html parser for guile
Date: Wed, 23 Jan 2019 22:04:23 +0100

I believe that the canonical way of working with XML documents in Guile is
through the (sxml simple) module and its sibling SXML modules:
https://www.gnu.org/software/guile/manual/html_node/SXML.html

It provides the xml->sxml function, which converts XML strings to a
more familiar s-expression-based format.
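
A minimal sketch of what that looks like, assuming a Guile version whose
xml->sxml accepts a string (2.2 does). The helper try-parse is a made-up
name for illustration, not part of any module. Note that xml->sxml is an
XML parser, so markup with mismatched tags is expected to signal an error
rather than be repaired:

```scheme
(use-modules (sxml simple))

;; Well-formed markup becomes a nested list rooted at *TOP*.
(define tree (xml->sxml "<p>Hello <b>world</b></p>"))
;; tree => (*TOP* (p "Hello " (b "world")))

;; Hypothetical helper: return the parsed tree, or the symbol
;; parse-failed if the parser signals any error.
(define (try-parse str)
  (catch #t
    (lambda () (xml->sxml str))
    (lambda (key . args) 'parse-failed)))

;; Mismatched tags, as often found in real-world HTML.
(define broken (try-parse "<p><b>unclosed</p>"))
```

Here tree can be walked with ordinary list operations, while broken shows
what happens when the input is not well-formed XML.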

On Wed, 23 Jan 2019 at 17:41, swedebugia <address@hidden> wrote:

> I just found this LGPL3 parser by Neil Van Dyke (see attachment)
>
> Do we have something similar in Guile?
>
> If not, is anybody interested in porting it? (I have no idea how much
> work it would be, but Racket seems quite close to Guile.)
>
> Here is the introduction:
> "The html-parsing library provides a permissive HTML parser. The parser
> is useful for software agent extraction of information from Web pages,
> for programmatically transforming HTML files, and for implementing
> interactive Web browsers. html-parsing emits SXML/xexp, so that
> conventional HTML may be processed with XML tools such as SXPath. Like
> Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a
> permissive tokenizer, but html-parsing extends this by attempting to
> recover syntactic structure.
> The html-parsing parsing behavior is permissive in that it accepts
> erroneous HTML, handling several classes of HTML syntax errors
> gracefully, without yielding a parse error. This is crucial for parsing
> arbitrary real-world Web pages, since many pages actually contain syntax
> errors that would defeat a strict or validating parser. html-parsing’s
> handling of errors is intended to generally emulate popular Web
> browsers’ interpretation of the structure of erroneous HTML."
> https://docs.racket-lang.org/html-parsing/index.html
>
> --
> Cheers Swedebugia
>

