guile-user

Permissive html parser for guile


From: swedebugia
Subject: Permissive html parser for guile
Date: Wed, 23 Jan 2019 17:47:56 +0100

I just found this LGPL3 parser by Neil Van Dyke (see attachment).

Do we have something similar in Guile?

If not, is anybody interested in porting it? (I have no idea how much work it would be, but Racket seems quite close to Guile.)

Here is the introduction:
"The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing extends this by attempting to recover syntactic structure. The html-parsing parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. html-parsing’s handling of errors is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML."
https://docs.racket-lang.org/html-parsing/index.html

--
Cheers Swedebugia

Attachment: main.rkt
Description: Text document

