From: Kenan Toker
Subject: Re: Using shtml with htmlprag - output of shtml->html is different to some given HTML
Date: Fri, 6 Sep 2019 14:09:53 +1000

Hi Neil,

Thanks heaps, I'll give this fix a go and let you know how it works ASAP.

That all makes sense re: avoiding breaking changes in guile-lib. If this
fix works and is all that's needed, I'll use it instead of the version
currently available in guile-lib.
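
For reference, this is roughly how I plan to check it. It is only a
sketch: it assumes Guile with guile-lib installed and the patched
htmlprag.scm ahead of the stock one on the load path. The sample
strings are taken from the new test cases in the patch; the name
`samples` is just mine.

```scheme
;; Assumption: guile-lib is installed and the patched htmlprag.scm is
;; the copy that (htmlprag) resolves to.
(use-modules (htmlprag)
             (ice-9 pretty-print))

;; Inputs copied from the test cases added by the patch.
(define samples
  '("<div><p>x</p></div>"
    "<blockquote><p>x</p></blockquote>"
    "<ul><li><p>x</p></li></ul>"))

(for-each
 (lambda (html)
   (let ((shtml (html->shtml html)))
     ;; Show the input, the parsed SHTML, and the round trip back to
     ;; HTML, so any difference is easy to spot by eye.
     (display html) (newline)
     (pretty-print shtml)
     (display (shtml->html shtml)) (newline)
     (newline)))
 samples)
```

If the fix does what the new tests describe, each `p` should show up
nested inside its `div`, `blockquote` or `li` rather than beside it.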

With that in mind, if I were to choose one of the 'distributions' of
htmlprag, is there one you yourself would pick? Or are the versions
available in e.g. guile-lib and standalone for all intents and purposes
the same?


On 6/9/19 8:33 am, Neil Van Dyke wrote:
> Kenan, could you please try the below "one-line" change, and let me
> know what you think?
>
> (It's an attempt at a minimal fix for the problem you were seeing, and
> for some related problems with modern HTML.  However, it breaks
> backward-compatibility relative to the htmlprag currently in
> guile-lib.  For example, consider someone doing Web scraping of modern
> HTML, and their scraping code only works with the previous, invalid
> parse.  I'm not yet familiar with guile-lib and how the htmlprag in it
> is being used, so I don't want to be too quick to suggest breaking
> changes to it.)
>
> (Historical note: htmlprag was mostly written 18 years ago, when HTML
> was different in both standards and practice.  Today, I'd write the
> parser very differently, though I think there's a good chance that
> htmlprag will still work for one's purpose, with this change.)
>
> Neil
>
> --- htmlprag.scm.ORIG    2019-09-05 18:21:40.850220789 -0400
> +++ htmlprag.scm    2019-09-05 18:21:40.850220789 -0400
> @@ -1099,7 +1099,7 @@
>                (meta     . (head))
>                (noframes . (frameset))
>                (option   . (select))
> -              (p        . (body td th))
> +              (p        . (div blockquote body footer header li td th))
>                (param    . (applet))
>                (tbody    . (table))
>                (td       . (tr))
> @@ -1989,6 +1989,13 @@
>      (t1 "<script>xxx"  '((script "xxx")))
>      (t1 "<script/>xxx" '((script) "xxx"))
> +    (t1 "<div><p>x</p></div>" '((div        (p "x"))))
> +    (t1 "<header><p>x</p></>" '((header     (p "x"))))
> +    (t1 "<footer><p>x</p></>" '((footer     (p "x"))))
> +    (t1 "<blockquote><p>x</p></blockquote>" '((blockquote (p "x"))))
> +    (t1 "<ul><li><p>x</p></li></ul>" '((ul (li     (p "x")))))
> +    (t1 "<ol><li><p>x</p></li></ol>" '((ol (li     (p "x")))))
> +
>      ;; TODO: Add verbatim-pair cases with attributes in the end tag.
>      (t2 '(p)            "<p></p>")
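
My reading of the first hunk, which I have not checked against the
parser code itself, is that each entry in that table lists the parents
an element is allowed to sit under. Below is a rough model of just that
lookup. The names `parent-constraints` and `allowed-parent?` are made
up for illustration, and treating unlisted elements as unconstrained is
also my assumption.

```scheme
;; Hypothetical, cut-down model of the table touched by the patch.
;; Each entry maps a child element to its permitted parent elements.
(define parent-constraints
  '((option . (select))
    (p      . (div blockquote body footer header li td th))
    (td     . (tr))))

;; Return #t if CHILD may appear directly under PARENT.
;; Elements without an entry are treated as unconstrained (assumption).
(define (allowed-parent? child parent)
  (let ((entry (assq child parent-constraints)))
    (or (not entry)
        (and (memq parent (cdr entry)) #t))))

(display (allowed-parent? 'p 'div))  (newline)
(display (allowed-parent? 'p 'span)) (newline)
```

With the old entry `(p . (body td th))` the first check would fail in
this model, which fits the `div` problem I was seeing.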

